Reuters-21578 is arguably the most commonly used collection for text classification during the last two decades, and it has been used in some of the most influential papers on the field. For instance, Text Categorization with Support Vector Machines: Learning with Many Relevant Features by Thorsten Joachims. This dataset contains structured information about newswire articles that can be assigned to several classes, therefore making this a multi-label problem. It has a highly skewed distribution of documents over categories, where a large proportion of documents belong to few topics.
The collection consists of 21,578 documents, including documents without topics and typographical errors. For this reason, a subset and split of the collection, refered to as “ModApte“, is traditionally used. This dataset includes 9,603 documents for training and 3,299 for testing. This split assigns documents from April 7, 1987 and before to the training set, and documents from April 8, 1987 and after to the test set. More information about the collection can be found in the collection README file. Furthermore, an additional step is to only focus on the categories that have at least one document in the training set and the test set. After this, the dataset has 90 categories with a training set of 7769 documents and a test set of 3019 documents.
The main reason why we will focus on this collection is that it is one of the most classic collection from text classification and it will allow us to compare our results with a large set of previously published results for several algorithms while being able to run it in our laptops.
This blogpost is intended to be the first one of a series and for the moment, we will only focus on how to represent the data. Further blogposts will improve this code and also predict the category for future documents.
1. Get the collection
Traditionally, we would have to download the collection and parse the multiple SGML files in order to recreate the original dataset. Fortunately, this step is much easier thanks to the NLTK library which has the reuters corpus already available. The following code shows some of its capabilities by printing some basic information for the collection.
from nltk.corpus import reuters def collection_stats(): # List of documents documents = reuters.fileids() print(str(len(documents)) + " documents"); train_docs = list(filter(lambda doc: doc.startswith("train"), documents)); print(str(len(train_docs)) + " total train documents"); test_docs = list(filter(lambda doc: doc.startswith("test"), documents)); print(str(len(test_docs)) + " total test documents"); # List of categories categories = reuters.categories(); print(str(len(categories)) + " categories"); # Documents in a category category_docs = reuters.fileids("acq"); # Words for a document document_id = category_docs document_words = reuters.words(category_docs); print(document_words); # Raw document print(reuters.raw(document_id));
After running this code, we can confirm that the collection has a split of 7769/3019 documents for train and test respectively, and a total of 90 different categories. Also, we can access the words for each one of the documents in the collection. Accidentally, this code shows the power of lambda functions and filters (both functional programming concepts) in Python.
It is important to remember that this collection is a multi-label collection, where a document might have more than one category assigned to it.
2. Document Representation and Weighting
Before applying traditional machine learning techniques, we have to represent and weight every document with respect to the set of textual features (e.g., the words that appear in the document). We are going to apply the following transformations:
- Lowercase the original content
- Tokenize the text
- Stem each one of the tokens
- Weight each of these features, for each document
- Normalise the representation
The first three steps involve lowercasing the content of the document, partitioning it into tokens, and obtaining the stem (i.e., the root of a word without any of its prefixes nor suffixes):
from nltk import word_tokenize from nltk.stem.porter import PorterStemmer import re from nltk.corpus import stopwords cachedStopWords = stopwords.words("english") def tokenize(text): min_length = 3 words = map(lambda word: word.lower(), word_tokenize(text)); words = [word for word in words if word not in cachedStopWords] tokens =(list(map(lambda token: PorterStemmer().stem(token), words))); p = re.compile('[a-zA-Z]+'); filtered_tokens = list(filter(lambda token: p.match(token) and len(token)>=min_length, tokens)); return filtered_tokens
Tokenize returns a list of stems that appear in the text that was passed as an argument. Stop-words are filtered out, as well as words that are too short. Furthermore, any string that contains other than letters is removed (e.g., numbers).
At this point, we want to weight each of the features according to their “importance” for the document. We are going to use tf-idf where the terms weight is higher the more common in the document, and the more uncommon in the collection they are (this is a very simplistic description of tf-idf, please follow the link for more detail information.
For the implementation, we are using TfidfVectorizer (from sklearn), which allows a great degree of flexibility to select a specific variation of the tf-idf algorithm. It also supports custom-built tokenisation functions, as well as other features such as stop-word removal (although only english is built-in). Most of the steps shown in our tokenisation function can be directly done using this function; however, I haven’t yet found a way to use the stop-words removal correctly (before the stemming phase) only by calling this function. If the stemming is performed before the stop-word removal, some incorrect words will be used to represent our documents. For instance, “this” would be stemmed to “thi” which would not be filtered using the stop-words. We have chosen to use the top 3000 features (according to their Document Frequency), as long as they appear in at least 3 documents and not on more than 90% of the collection. Also, we use IDF and the logarithmic version of TF and L2 normalisation.
# Return the representer, without transforming def tf_idf(docs): tfidf = TfidfVectorizer(tokenizer=tokenize, min_df=3, max_df=0.90, max_features=3000, use_idf=True, sublinear_tf=True, norm='l2'); tfidf.fit(docs); return tfidf;
The output of this method is a representer object that will learn from a train collection (i.e., to compute the IDF values) using the method fit, and can be applied to any subsequent set with transform. The output of the transform method is a matrix that shows the representation of each the documents in a collection. However, this is not the most human-readable format. The following code shows the weight of each feature (that is non-zero) as well as its name for readability purposes:
def feature_values(doc, representer): doc_representation = representer.transform([doc]) features = representer.get_feature_names() return [(features[index], doc_representation[0, index]) for index in doc_representation.nonzero()]
Now we have all the components required to represent our documents. We are going to fit the representer using the train collection. The last step is to show the represented test documents in our dataset:
def main(): train_docs =  test_docs =  for doc_id in reuters.fileids(): if doc_id.startswith("train"): train_docs.append(reuters.raw(doc_id)) else: test_docs.append(reuters.raw(doc_id)) representer = tf_idf(train_docs); for doc in test_docs: print(feature_values(doc, representer))
This blogpost has just scratched the surface of document representation and the capabilities of some of the python libraries to do so, and my intention is to continue improving this over time (all the code is available in GitHub). Some potential improvements would be better presentation of the data or different feature selection algorithms.
Why the max_feature value is 3000 here? How can we define the max_feature value?
Based on previous experiments with that collection and models, 3,000 is a reasonable number for the feature selection. You can test any other number of features by changing the max_features parameter in the TFIDFVectoriser.