Python and Kaggle: Feature selection, multiple models and Grid Search.

I have spoken before about the Kaggle ecosystem and the Digit Recognizer challenge, and I have also shown how to improve the original version of the code. However, those improvements targeted the code itself; no quality improvement over the initial solution was attempted. This blogpost focuses exactly on that: what can we do to improve the quality of our results?

First, let's recall where the program's capabilities stood at the end of the last blogpost on this topic:

  1. Compute the expected quality of a model (e.g., Random Forest) using cross-validation.
  2. Train the model with all the training data from the challenge and classify the test instances.
  3. Log all the events into a log file to keep track of the changes.

The three main techniques that this post focuses on in order to improve the quality of our results are:

  1. Feature selection.
  2. Grid search to tune the hyper-parameters of a model.
  3. Candidates from multiple classifier families (e.g., Random Forest, SVM, kNN, …).

1. Feature Selection

Feature selection is a very important part of Machine Learning whose main goal is to filter out the features that do not contain useful information for the classification problem at hand. Feature selection can improve both efficiency (fewer features mean quicker programs) and, in some cases, even effectiveness, by decreasing overfitting.

The first and most trivial approach is to remove the features that have exactly the same value in all the training examples: they carry no information whatsoever for the classification process. This can be done in Python with sklearn's VarianceThreshold. This selector removes every feature whose variance does not exceed a given threshold. By default, the threshold is zero, so it drops the features whose value is identical across all the training examples. This can be seen in the following piece of code, where we also log the number of features used and ignored:

# Fitting a feature selector
from collections import Counter
from sklearn.feature_selection import VarianceThreshold

def feature_selection(train_instances):
    log_info('Feature selection started... ')
    selector = VarianceThreshold()
    selector.fit(train_instances)
    log_info('Number of features used... ' +
             str(Counter(selector.get_support())[True]))
    log_info('Number of features ignored... ' +
             str(Counter(selector.get_support())[False]))
    return selector

# Learn the features to filter from the train set
fs = feature_selection(train_instances)

# Transform train and test subsets
train_instances = fs.transform(train_instances)
test_instances = fs.transform(test_instances)

After this, both our training and testing instances are represented only by the features that take at least two distinct values in the training set.
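As a quick illustration of this behaviour (on made-up data, not the actual digit images), a constant column is dropped while the informative ones survive:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: the second column is constant, so it carries no information
X = np.array([[0, 7, 1],
              [2, 7, 0],
              [4, 7, 1]])

selector = VarianceThreshold()
X_reduced = selector.fit_transform(X)

print(selector.get_support())  # [ True False  True]
print(X_reduced.shape)         # (3, 2)
```

The boolean mask returned by get_support() is exactly what the logging code above counts with Counter.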

2. Grid Search

Machine Learning models usually have a set of parameters that should be tuned for a given collection in order to achieve the maximum possible quality. One of the most common techniques for doing this is grid search. In grid search, the possible values for each of the parameters are specified and, based on those, all the potential combinations are generated and tested. For instance, if we have the parameters and values a = [1, 2, 3], b = [0.1, 0.2], grid search will generate 6 possible configurations for the classifier, one per combination: [{a: 1, b: 0.1}, {a: 1, b: 0.2}, …, {a: 3, b: 0.2}]. Furthermore, cross-validation is applied to evaluate and select the best setting. This whole process is very well supported in Python by sklearn:

# Example code for a model and a set of grid-search parameters
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

model = RandomForestClassifier()
parameters = [{"n_estimators": [250, 500, 1000]}]

# Returns the best configuration for a model using cross-validation
# and grid search
def best_config(model, parameters, train_instances, judgements):
    log_info('Grid search for... ' + type(model).__name__)
    clf = GridSearchCV(model, parameters, cv=5,
                       scoring="accuracy", verbose=5, n_jobs=4)
    clf.fit(train_instances, judgements)
    best_estimator = clf.best_estimator_
    log_info('Best hyperparameters: ' + str(clf.best_params_))

    return [str(clf.best_params_), clf.best_score_,
            best_estimator]

This code takes a classifier and its set of grid-search parameters, as well as the training data and judgements. The result is a triple containing the best configuration, its quality score (measured using accuracy) and the classifier object with that configuration. The parameters we have used in the GridSearchCV call are 5-fold cross-validation, model selection based on accuracy, verbose output and 4 jobs running in parallel while tuning the parameters.
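To make the grid enumeration from the a/b example above concrete, sklearn's ParameterGrid (the same expansion GridSearchCV performs internally; a and b are the hypothetical parameter names from the text) lists the six configurations explicitly:

```python
from sklearn.model_selection import ParameterGrid

# Expand the example grid a = [1, 2, 3], b = [0.1, 0.2]
grid = ParameterGrid({'a': [1, 2, 3], 'b': [0.1, 0.2]})
configs = list(grid)

print(len(configs))  # 6
for config in configs:
    print(config)
```

GridSearchCV evaluates each of these dictionaries with cross-validation and keeps the one with the best score.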

3. Multiple Models

One of the main principles in machine learning is that different models perform better in different situations (or with different input data). For this reason, it is of critical importance to test and evaluate multiple models. Thanks to the functionality shown in the previous section, we only have to apply grid search to each candidate model and then pick the best of the best model configurations. The following code illustrates this concept, and also shows the candidates and configurations that our program will explore:

# Returns the best model from a set of model families given
# training data using cross-validation.
def best_model(classifier_families, train_instances, judgements):
    best_quality = 0.0
    best_classifier = None
    classifiers = []
    for name, model, parameters in classifier_families:
        config = best_config(model, parameters,
                             train_instances, judgements)
        classifiers.append([name, config[1], config[2]])

    for name, quality, classifier in classifiers:
        log_info('Considering classifier... ' + name)
        if quality > best_quality:
            best_quality = quality
            best_classifier = [name, classifier]

    log_info('Best classifier... ' + best_classifier[0])
    return best_classifier[1]

# List of candidate classifier families with parameters for grid
# search: [name, classifier object, parameters].
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def candidate_families():
    candidates = []
    svm_tuned_parameters = [{'kernel': ['poly'],
                             'degree': [1, 2, 3, 4]}]
    candidates.append(["SVM", SVC(C=1), svm_tuned_parameters])

    rf_tuned_parameters = [{"n_estimators": [250, 500, 1000]}]
    candidates.append(["RandomForest",
                       RandomForestClassifier(n_jobs=-1),
                       rf_tuned_parameters])

    knn_tuned_parameters = [{"n_neighbors": [1, 3, 5, 10, 20]}]
    candidates.append(["kNN", KNeighborsClassifier(),
                       knn_tuned_parameters])

    return candidates

This code will select the best configuration among the candidate classifiers specified.
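To see the whole pipeline end to end, here is a self-contained, scaled-down sketch. It uses sklearn's built-in digits dataset as a stand-in for the Kaggle data, trims the parameter grids so it runs quickly, and inlines the logic of best_config/best_model rather than reproducing them verbatim:

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Stand-in for the Kaggle train/test split
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Feature selection: drop zero-variance pixels learned on the train set
fs = VarianceThreshold().fit(X_train)
X_train, X_test = fs.transform(X_train), fs.transform(X_test)

# Grid search per family, then keep the overall best estimator
families = [("SVM", SVC(C=1), [{'kernel': ['poly'], 'degree': [1, 2]}]),
            ("kNN", KNeighborsClassifier(), [{'n_neighbors': [1, 5]}])]

best_score, best_clf = 0.0, None
for name, model, parameters in families:
    clf = GridSearchCV(model, parameters, cv=3, scoring="accuracy")
    clf.fit(X_train, y_train)
    if clf.best_score_ > best_score:
        best_score, best_clf = clf.best_score_, clf.best_estimator_

# Held-out accuracy of the winning configuration
print(best_clf.score(X_test, y_test))
```

The actual blog program does the same thing at full scale, with logging and the larger parameter grids shown above.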

Summary

This blogpost has shown how to apply grid search over multiple families of classifiers and select the best of them based on cross-validation. In addition, the program now removes the features that provide no information, based on their variance. After running the full program, the best classifier selected was the 2nd-degree polynomial SVM. This configuration achieved an accuracy of 0.97871 and position 121 in the challenge ranking, a considerable improvement over our previous results (0.968 accuracy and 308th position).
