Remembering Python and Kaggle

I think that every developer should periodically use more than one programming language and more than one programming paradigm. It keeps us knowledgeable and prevents the “tunnel vision” that makes us believe some solutions are impossible just because our current paradigm does not support them.

For the last year or so, I have been using mainly one programming language (Clojure). Do not get me wrong, I believe Clojure is the future and I love it as a language, but I do think that being a polyglot developer is something we should all aspire to. Therefore, I have decided to refresh my Python skills by going back to the basics and using them to solve one Kaggle competition.

I have selected the Digit Recognizer problem, one of the most popular competitions on Kaggle. The challenge is to automatically detect which number has been hand-drawn, given an image of 28×28 pixels where each pixel has a single value associated with it, indicating the lightness or darkness of that pixel. A set of 42,000 labelled images and their pixel-by-pixel representation is given to us. The training file starts with a header line, and every subsequent line represents one training example,

label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,...
1,0,188,255,94,0,...
9,0,18,125,44,0,...
3,0,138,76,0,1,...
1,0,134,25,4,89,...
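To make the format concrete, each line splits into a label followed by the pixel values; a minimal sketch using a shortened, made-up line (real lines carry 784 pixel columns after the label):

```python
# A shortened example line: a real training line has 784 pixel values.
line = "1,0,188,255,94,0"

values = line.strip().split(",")
label = values[0]                      # the hand-drawn digit, as a string
pixels = [int(x) for x in values[1:]]  # the grey-scale values of the pixels
```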

In addition, the challenge provides another set of 28,000 unlabelled instances to be used for prediction and submitted to the Kaggle platform for evaluation. The results that the system expects from competitors have the following format,

ImageId,Label
1,2
2,0
3,9
4,4
...

This blogpost will focus on reading the training and test data provided by Kaggle, using the training set to build a classifier, and then predicting the label (i.e., which digit it represents) for each test element. Although these are the minimum steps required to send a submission to the challenge, there are multiple possible improvements, such as optimising the parameters of the classifier, comparing (or even combining) multiple classifiers, and estimating the expected quality of the model. All these improvements will be addressed in future blogposts.

The first step is to read the CSV files with the training data,

def read_csv(file_path, has_header=True):
    """Read a numeric CSV file into a list of rows of floats."""
    with open(file_path) as f:
        if has_header:
            f.readline()  # skip the header line
        data = []
        for line in f:
            values = line.strip().split(",")
            data.append([float(x) for x in values])
    return data

After this, we use the scikit-learn library to train a Random Forest classifier on the training data we have processed with the previous code,

from sklearn.ensemble import RandomForestClassifier

train_data = read_csv("data/train.csv")

# The first column is the label; the remaining columns are the pixel values.
judgements = [str(int(x[0])) for x in train_data]
train_instances = [x[1:] for x in train_data]

# Train the model
classifier = RandomForestClassifier(n_estimators=100)
classifier.fit(train_instances, judgements)
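For readers without the Kaggle files at hand, the same fit/predict pattern can be tried on a tiny made-up dataset (the values below are purely illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

# Made-up "images" of 3 pixels each: light rows labelled "0", dark rows "1".
toy_instances = [[0, 0, 0], [10, 5, 0], [200, 220, 210], [255, 250, 245]]
toy_labels = ["0", "0", "1", "1"]

toy_classifier = RandomForestClassifier(n_estimators=10, random_state=0)
toy_classifier.fit(toy_instances, toy_labels)
```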

At this point we have a trained model that is capable of predicting handwritten digits from the grey-scale values of a 28×28 pixel image. We can then predict each one of the test instances with the following code,

test_data = read_csv("data/test.csv")
decisions = classifier.predict(test_data)

The only thing left to do for this toy example is to generate a CSV file with each one of the predictions. Writing the CSV file is handled by the function write_csv,

def write_csv(file_path, data):
    with open(file_path, "w") as f:
        for line in data:
            f.write(",".join(line) + "\n")

However, we need the results in the exact format the challenge specifies. This piece of code takes care of that,

# The header row and 1-based ImageIds are required by the submission format.
formatted_decisions = [["ImageId", "Label"]]
count = 1
for decision in decisions:
    formatted_decisions.append([str(count), decision])
    count += 1
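The same formatting can be expressed more idiomatically with enumerate; a minimal sketch (format_submission is a hypothetical helper, not part of the code above):

```python
def format_submission(decisions):
    """Build submission rows: a header followed by 1-based (ImageId, Label) pairs."""
    rows = [["ImageId", "Label"]]
    for image_id, label in enumerate(decisions, start=1):
        rows.append([str(image_id), label])
    return rows
```

The resulting rows can then be handed to write_csv to produce the file sent to Kaggle.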

Although these are all the individual components needed to send our first submission, there is no organisation in the code. The first and most obvious refactor is to move write_csv and read_csv into a new file, csv_io. I have done this small refactor, and both files required to solve the challenge are accessible on GitHub (csv_io, kaggle_digits). The program assumes that the train and test files are in a data folder, and that the scikit-learn library is accessible.

This very compact program achieves a score (accuracy) of 0.968 in the challenge. As a result, we have a very decent digit recognition system, which placed us at position 308 of the ranking (at the moment I sent the results). More improvements to come in future blogposts.

Other interesting resources about Python and Kaggle:
Getting Started With Python For Data Science
Up and running with python my first kaggle entry

