In the last blogpost, I focused on a basic piece of functionality that provided a solution to one of the Kaggle challenges using Python. This blogpost covers some improvements to the code itself, as well as to the classification process:

  1. Removing functionality that is already available in public libraries (pandas)
  2. Adding logging capabilities
  3. Including quality evaluation and cross-validation

The first improvement to the code is to remove the helper functions that read and write csv files. Although coding this functionality was a good exercise to refresh some basic Python, it is already supported by the pandas library, and reinventing the wheel is always a bad idea. For full disclosure, using pandas for this was recommended by @ignacio_elola. The following piece of code shows how to read and write csv files, where output is a list with the following format: [['ImageId', 'Label'], ['1', '2'], ['2', '0'], ..., ['28000', '2']]

import pandas as pd

# Read csv files where the first line contains the headers
train_data = pd.read_csv('data/train.csv', header=0).values
test_data = pd.read_csv('data/test.csv', header=0).values

# Write the output results without printing the rows and
# columns index position
pd.DataFrame(output).to_csv('data/results.csv',
                            header=False, index=False)

This change makes our code base smaller, which is great. Now, let's add some new features. The first addition is logging capabilities. Printing results to the console is not the best way of monitoring our program. To address this problem, the logging library can be used. Moreover, time and datetime are also used to generate a timestamp for each event,

import logging
import time
import datetime

def log_info(message):
    # Append a human-readable timestamp to every logged event
    ts = time.time()
    logging.info(message + " " + datetime.datetime
                 .fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S'))

def init_logging(filename):
    # Configure the log file and the minimum level of events recorded
    logging.basicConfig(format='%(message)s',
                        level=logging.INFO, filename=filename)

The way to use this functionality is to call init_logging at the beginning of the program to specify the log file and the type of events being recorded, and to use log_info every time we need to log an event. For instance,

init_logging('./history.log')
log_info('============== \nClassification started... ')
...
log_info('Other task... ')

This will generate a historical record similar to the one below,

==============
Classification started...  2015-02-09 16:48:21
...
Other task... 2015-02-09 17:51:19

The next piece of functionality will allow us to answer one of the most critical questions when solving any predictive problem: how well does this solution solve the problem? To answer it, we can split the labelled data into two subsets, one used for training and one used for testing, where the predicted classes are compared against the labels to calculate the quality of the classifier. However, the results will depend largely on the (usually random) split of the data, and some of the documents might never be evaluated or trained on. Cross-validation, a technique to observe how well a model will perform with unseen data, is commonly applied to solve these problems. In particular, the classification community tends to use k-fold cross-validation, where all the available labelled data is divided into k subsets. An iterative process then starts where all but one of the subsets are used for training, while the remaining one is used for testing. This process is repeated k times, so that every document will have been used for testing (once) and for training (k - 1 times). Each one of the runs provides a quality estimation, and the average of all of them is used as the quality estimation for the model.

For instance, imagine that our data is split into three subsets [s1, s2, s3]; then, applying 3-fold cross-validation,

  • s1, s2 used for training; s3 used for evaluation. This generates a quality estimation q1.
  • s1, s3 used for training; s2 used for evaluation. This generates a quality estimation q2.
  • s2, s3 used for training; s1 used for evaluation. This generates a quality estimation q3.
  • The final quality estimation would be obtained by averaging q1, q2 and q3 (see the toy sketch below).
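The sketch below works through this procedure on a small made-up dataset; the data and the decision tree classifier are purely illustrative and not part of the Kaggle solution,

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Twelve made-up instances with a single feature and two classes
X = np.array([[i] for i in range(12)])
y = np.array([0, 0, 1, 1] * 3)

# Split the indexes by hand into the three subsets s1, s2 and s3
s1, s2, s3 = np.split(np.arange(12), 3)
splits = [((s1, s2), s3), ((s1, s3), s2), ((s2, s3), s1)]

qualities = []
for (a, b), test in splits:
    train = np.concatenate([a, b])
    model = DecisionTreeClassifier().fit(X[train], y[train])
    qualities.append(accuracy_score(y[test], model.predict(X[test])))

# Final estimation: the average of q1, q2 and q3
print(sum(qualities) / len(qualities))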

Different versions of cross-validation are very common in machine learning pipelines, and sklearn provides very good coverage for them. In particular, our code uses the function StratifiedKFold, which returns multiple collections containing the indexes of the documents to be used for training and testing in each one of the runs. "Stratified" refers to the fact that the distribution of the different categories in the test set is as close as possible to the one in the training set. For example, given a set of 900 women and 100 men, a stratified sample of 100 people would contain 90 women and 10 men.
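As a quick illustration of what StratifiedKFold returns (the labels below are made up), each iteration yields a pair of index arrays, and each test fold keeps roughly the same class proportions as the full set,

import numpy as np
from sklearn import cross_validation

# Six documents of class 0 and three of class 1 (a 2:1 ratio)
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
cv = cross_validation.StratifiedKFold(labels, n_folds=3)
for train_index, test_index in cv:
    # Each test fold should contain two documents of class 0 and one of class 1
    print(test_index, labels[test_index])

The xval function below applies the same idea to the real training data,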

import numpy as np
from sklearn import cross_validation
from sklearn.metrics import accuracy_score

def xval(classifier, train_instances, judgements):
    log_info('Crossvalidation started... ')
    cv = cross_validation.StratifiedKFold(np.array(judgements),
                                          n_folds=5)

    avg_quality = 0.0
    for train_index, test_index in cv:
        # Select the training and testing documents for this split
        train_cv = train_instances[train_index]
        test_cv = train_instances[test_index]
        train_judgements_cv = judgements[train_index]
        test_judgements_cv = judgements[test_index]

        classifier.fit(train_cv, train_judgements_cv)
        decisions_cv = classifier.predict(test_cv)
        quality = accuracy_score(test_judgements_cv, decisions_cv)
        avg_quality += quality
        log_info('Quality of split... ' + str(quality))

    # Average the quality estimations of the individual splits
    quality = avg_quality / len(cv)
    log_info('Estimated quality of model... ' + str(quality))

This code applies 5-fold stratified cross-validation, after which a quality estimation for the accuracy of the classifier is logged. Another method from sklearn, accuracy_score, is used to compute the accuracy of the predictions.
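As a quick note on accuracy_score itself, it simply computes the fraction of predictions that match the true labels; for example, with made-up values,

from sklearn.metrics import accuracy_score

# Three out of four predictions match the true labels
print(accuracy_score([0, 1, 2, 2], [0, 2, 2, 2]))  # prints 0.75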

This week, we have focused on improving our code and tooling, and evaluating our solutions instead of trying to improve the quality of the results. This might seem counterintuitive, but I strongly believe that having a good infrastructure and experimental framework is always a good call in the long run. I promise the quality will be improved in the next blogpost about Kaggle…

This is the final program that illustrates all the points expressed in the blogpost.

4 thoughts

  1. Hi, nice tutorial, but I am getting the following error:
    File "run.py", line 91, in run
    train_instances = np.array([x[1:] for x in train_data])
    MemoryError

  2. Hi Ris,

    Thanks for the comment. The problem you are showing might appear because you don't have enough RAM to run the program.

    Can you try to run the system with sample train and test files? The way to do this would be to generate two new files, train-sample.csv and test-sample.csv, with 1000 lines instead of the full collection. Then you can change the path of the data:

    train_data = pd.read_csv('data/train-sample.csv', header=0).values

    Then just run the program again and see if you get results.

    Tip: You can do this in the shell by using "head -1000 train.csv > train-sample.csv" (and the same for the test set).

    Regards,

  3. Finally solved the problem. I saved the test and train instances in .npy format on disk
    using np.save() and then loaded the data using np.load('file.npy', mmap_mode='r'). This
    considerably reduced my RAM usage and I finally got it running 🙂

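For anyone else hitting the same MemoryError, here is a minimal sketch of the memory-mapping approach described in the comment above; the file names are illustrative,

import numpy as np
import pandas as pd

# One-off conversion: store the training instances as a .npy file on disk
train_data = pd.read_csv('data/train.csv', header=0).values
np.save('data/train.npy', train_data)

# Later runs: memory-map the array instead of loading it all into RAM
train_data = np.load('data/train.npy', mmap_mode='r')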
