Classifying Reuters-21578 collection with Python

A long time ago I published a blog post explaining how to represent the Reuters-21578 collection (and, more generally, any textual collection for text classification). However, that post never explained how to perform the classification step itself. This post introduces some of the basic concepts of classification, quickly shows the representation we came up with in the prior post and, finally, focuses on how to perform and evaluate the classification.

Core Concepts

Text classification (a.k.a. text categorisation) is the task of assigning pre-defined categories to textual documents. It helps solve problems ranging from spam detection to language identification. Classification problems can be divided into different types according to the number of labels each document can have:

  • Binary: Only two categories exist and they are mutually exclusive. A document can either be in the category or not (e.g., Spam detection).
  • Multi-class: Multiple categories which are mutually exclusive (e.g., language detection, if we assume documents can only have one language).
  • Multi-label: Multiple categories with the possibility of multiple (or none) assignments (e.g., News categorisation, where a document could be about “Sports” and “Corruption” at the same time).

This list is not exhaustive (e.g., hierarchical classification, single class classification, …), but the majority of the problems fit one of these three traditional types of problems.
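
As a small illustration of how the three types differ, the sketch below writes down toy labels for each one. The label values and the helper name `label_cardinalities` are made up for this example and do not come from any dataset.

```python
# Hypothetical toy labels, one entry per document.

# Binary: a single yes/no decision per document (e.g., spam or not).
binary_labels = [True, False, True]

# Multi-class: exactly one category per document.
multiclass_labels = ["en", "es", "en"]

# Multi-label: a (possibly empty) set of categories per document.
multilabel_labels = [{"sports", "corruption"}, set(), {"sports"}]


def label_cardinalities(label_sets):
    """Return how many categories each document has been assigned."""
    return [len(labels) for labels in label_sets]


print(label_cardinalities(multilabel_labels))  # [2, 0, 1]
```

Only the multi-label case allows a document to have zero or several categories, which is the situation we face with Reuters.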

Representing Reuters

A detailed explanation of the representation step, as well as a description of the collection, can be found in my previous post. However, the code required for representing the dataset and a very brief introduction to its nature are also shown below:

from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import re

cachedStopWords = stopwords.words("english")

def tokenize(text):
  min_length = 3
  # Lowercase every token produced by the NLTK tokenizer
  words = map(lambda word: word.lower(), word_tokenize(text))
  # Drop English stop words
  words = [word for word in words if word not in cachedStopWords]
  # Stem the remaining words
  tokens = list(map(lambda token: PorterStemmer().stem(token), words))
  # Keep tokens that start with a letter and are long enough
  p = re.compile('[a-zA-Z]+')
  filtered_tokens = list(filter(lambda token: p.match(token) and
                                len(token) >= min_length, tokens))
  return filtered_tokens

Reuters-21578 is arguably the most commonly used collection for text classification during the last two decades, and it has been used in some of the most influential papers in the field. It contains structured information about newswire articles that can be assigned to several classes, making it a multi-label problem. It has a highly skewed distribution of documents over categories, where a large proportion of documents belong to a few topics. The collection originally consisted of 21,578 documents, but a subset and split is traditionally used. The most common split is ModApte, which only considers categories that have at least one document in both the training set and the test set. The ModApte split has 90 categories, with a training set of 7,769 documents and a test set of 3,019 documents.

Classifying Reuters

In order to classify the collection, we have to apply a number of steps which are standard for the majority of classification problems:

  1. Define our training and testing subsets to make sure that we do not evaluate with documents that the system has learnt from. In our case, this is trivial as the original dataset is already split (for replicability purposes).
  2. Represent all the documents in each subset. Remember that any optimisation (e.g., IDF calculations) should be done in the training set only.
  3. Train a classifier on the represented training data.
  4. Predict the labels for each one of the represented testing documents.
  5. Compare the real and predicted document labels to evaluate our solution.
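
Step 2 is the one that is easiest to get wrong, so here is a minimal sketch of what "IDF on the training set only" means. The functions `fit_idf` and `tfidf` and the toy documents are hypothetical. The smoothed IDF formula is an assumption chosen for illustration, and the sketch omits the length normalisation that a real vectoriser would normally apply.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn smoothed IDF weights from tokenised training documents only."""
    n_docs = len(train_docs)
    # Document frequency: in how many training documents each term occurs.
    doc_freq = Counter(term for doc in train_docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}


def tfidf(doc, idf):
    """Weight a tokenised document; terms unseen in training are dropped."""
    counts = Counter(term for term in doc if term in idf)
    return {term: tf * idf[term] for term, tf in counts.items()}


train = [["oil", "price", "rise"], ["oil", "export"], ["wheat", "export"]]
test = [["oil", "wheat", "tariff"]]

idf = fit_idf(train)                              # learnt from training data
test_vectors = [tfidf(doc, idf) for doc in test]  # reused, never refitted
```

The term "tariff" never appears in the training documents, so it gets no weight in the test vector. Refitting the IDF on the test documents would leak information about them into the representation.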

We have chosen to use only one model (linear SVM) to simplify our solution. This model has traditionally produced good results in text classification problems. Nonetheless, you should try multiple other models, as well as other representations.

The problem we are solving has a multi-label nature, and because of this, there are two changes that we have to make in the code that are not needed for binary classification. Firstly, the data representation for the category assignment to the different documents is slightly different: each document is viewed as a list of bits, one per category, indicating whether the document belongs to that category or not. This change is done by using the MultiLabelBinarizer, as the code shows. Secondly, we have to train our model (which is binary by nature) N times, once per category, where the negative cases are the documents that are not assigned to that category. This allows our model to make a binary decision per category and produce multi-label results. This can be done with the OneVsRestClassifier object in Scikit-learn. This step might change depending on the estimator you have chosen. For instance, some models (e.g., kNN) are multi-label by nature. You can find more info in the documentation.
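
To make these two changes concrete, the following pure-Python sketch mimics the idea behind them on three toy documents. It is not how Scikit-learn implements either object; `binarise` and `one_vs_rest_targets` are hypothetical helper names, and the category lists are invented for the example.

```python
def binarise(label_lists, classes=None):
    """Turn per-document category lists into rows of 0/1 indicators."""
    if classes is None:
        # One column per category, in alphabetical order.
        classes = sorted({c for labels in label_lists for c in labels})
    rows = [[1 if c in labels else 0 for c in classes]
            for labels in label_lists]
    return classes, rows


def one_vs_rest_targets(classes, rows):
    """Build one binary target vector per category (the matrix columns)."""
    return {c: [row[i] for row in rows] for i, c in enumerate(classes)}


doc_labels = [["grain", "wheat"], ["crude"], ["grain"]]
classes, matrix = binarise(doc_labels)
targets = one_vs_rest_targets(classes, matrix)

print(classes)           # ['crude', 'grain', 'wheat']
print(matrix)            # [[0, 1, 1], [1, 0, 0], [0, 1, 0]]
print(targets["grain"])  # [1, 0, 1]
```

Each entry of `targets` is the label vector that one binary model would be trained on: 1 for the documents in the category and 0 for every other document.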

from nltk.corpus import stopwords, reuters
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
stop_words = stopwords.words("english")

# List of document ids
documents = reuters.fileids()

train_docs_id = list(filter(lambda doc: doc.startswith("train"),
                            documents))
test_docs_id = list(filter(lambda doc: doc.startswith("test"),
                           documents))

train_docs = [reuters.raw(doc_id) for doc_id in train_docs_id]
test_docs = [reuters.raw(doc_id) for doc_id in test_docs_id]

# Tokenisation
vectorizer = TfidfVectorizer(stop_words=stop_words,
                             tokenizer=tokenize)

# Learn and transform train documents
vectorised_train_documents = vectorizer.fit_transform(train_docs)
vectorised_test_documents = vectorizer.transform(test_docs)

# Transform multilabel labels
mlb = MultiLabelBinarizer()
train_labels = mlb.fit_transform([reuters.categories(doc_id)
                                  for doc_id in train_docs_id])
test_labels = mlb.transform([reuters.categories(doc_id)
                             for doc_id in test_docs_id])

# Classifier
classifier = OneVsRestClassifier(LinearSVC(random_state=42))
classifier.fit(vectorised_train_documents, train_labels)

predictions = classifier.predict(vectorised_test_documents)


Measuring the quality of a classifier is a necessary step in order to potentially improve it. The main metrics for Text Classification are:

  • Precision: Number of documents correctly assigned to a category out of the total number of documents predicted to be in that category.
  • Recall: Number of documents correctly assigned to a category out of the total number of documents that actually belong to that category.
  • F1: Metric that combines precision and recall using the harmonic mean.
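
As a worked example of these three definitions, the sketch below evaluates a single category from two sets of document ids. The function name `evaluate_category` and the ids are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here for simplicity.

```python
def evaluate_category(relevant, predicted):
    """Precision, recall and F1 for one category, given sets of doc ids."""
    correct = len(relevant & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    total = precision + recall
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


relevant = {"d1", "d2", "d3", "d4"}  # documents truly in the category
predicted = {"d1", "d2", "d5"}       # documents the classifier put in it

precision, recall, f1 = evaluate_category(relevant, predicted)
print("{:.4f} {:.4f} {:.4f}".format(precision, recall, f1))
# 0.6667 0.5000 0.5714
```

Two of the three predicted documents are correct (precision 2/3), but only two of the four relevant documents were found (recall 1/2).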

If the evaluation is being done in multi-class or multi-label environments, the method becomes slightly more complicated because the quality metrics have to be either shown per category, or globally aggregated. There are two main aggregation approaches:

  • Micro-average: Every assignment (document, label) has the same importance. Common categories have more effect over the aggregate quality than smaller ones.
  • Macro-average: The quality for each category is calculated independently and their average is reported. All the categories are equally important.
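
The difference between the two aggregations is easiest to see with numbers. The sketch below uses invented counts of true positives, false positives and false negatives for one common and one rare category; the category names and the helper `f1_from_counts` are assumptions made for illustration only.

```python
def f1_from_counts(tp, fp, fn):
    """F1 written directly in terms of counts: 2TP / (2TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical (tp, fp, fn) counts per category.
counts = {"common": (90, 5, 10), "rare": (1, 0, 9)}

# Macro: compute F1 per category, then take the plain average.
macro_f1 = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)

# Micro: pool the counts of all categories, then compute F1 once.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1_from_counts(tp, fp, fn)

print("Micro F1: {:.4f}, Macro F1: {:.4f}".format(micro_f1, macro_f1))
# Micro F1: 0.8835, Macro F1: 0.5524
```

The poorly classified rare category barely moves the micro-average, but it pulls the macro-average down sharply because it counts as much as the common one.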

Scikit-learn has functionality that will help us during this step as we can see below:

from sklearn.metrics import f1_score, precision_score, recall_score

precision = precision_score(test_labels, predictions,
                            average='micro')
recall = recall_score(test_labels, predictions,
                      average='micro')
f1 = f1_score(test_labels, predictions, average='micro')

print("Micro-average quality numbers")
print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}"
        .format(precision, recall, f1))

precision = precision_score(test_labels, predictions,
                            average='macro')
recall = recall_score(test_labels, predictions,
                      average='macro')
f1 = f1_score(test_labels, predictions, average='macro')

print("Macro-average quality numbers")
print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}"
        .format(precision, recall, f1))

Some of you might see a warning message saying that a specific quality metric is ill-defined (e.g., “Precision is ill-defined”). The reason is that the calculation can be mathematically indeterminate when, for instance, the classifier does not assign any article to a specific category, which implies a precision of 0/0. Scikit-learn does what I consider to be the best solution: it assumes a quality value of 0.0 and shows a warning message. This message is to be expected when dealing with very skewed collections, where some classes are very difficult to learn and it is common for no documents to be predicted as belonging to them.

This code shows that this baseline, with the first model we tested and no optimisation whatsoever, already produces reasonable quality levels, with a micro-average F1 of about 0.87 and a macro-average F1 of about 0.47.

Micro-average quality numbers
Precision: 0.9455, Recall: 0.8013, F1-measure: 0.8674
Macro-average quality numbers
Precision: 0.6493, Recall: 0.3948, F1-measure: 0.4665
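
The ill-defined case mentioned above can be reproduced in a few lines. This is only a sketch of the convention (return 0.0 and warn), with a hypothetical helper `safe_precision`; it is not Scikit-learn's own code. As far as I know, recent Scikit-learn releases also accept a `zero_division` argument on these metric functions to control this behaviour.

```python
import warnings


def safe_precision(true_positives, predicted_count):
    """Precision that falls back to 0.0 when nothing was predicted."""
    if predicted_count == 0:
        warnings.warn("Precision is ill-defined: no predicted documents; "
                      "using 0.0")
        return 0.0
    return true_positives / predicted_count


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    empty_category = safe_precision(0, 0)  # 0/0: nothing was predicted

print(empty_category)        # 0.0
print(len(caught))           # 1
print(safe_precision(3, 4))  # 0.75
```

A category that falls back to 0.0 counts as a full category in the macro-average, which is one reason the macro numbers above are so much lower than the micro ones.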

A full (slightly changed) version of this code can be found in this notebook. The next iterations would allow multiple estimators and representation functions to improve our quality.

