The term “fake news” was almost non-existent in general usage and in the media before October 2016, but times have changed: I would not be surprised if you have heard it today on the news, on the radio or just in the street.
Fake news is a term that has been used to describe very different things, from satirical articles to completely fabricated news and plain government propaganda in some outlets. Fake news, information bubbles, news manipulation and the lack of trust in the media are growing problems with huge ramifications for our society. However, to start addressing this problem, we need to understand what fake news is. Only then can we look into the different techniques and fields of machine learning (ML), natural language processing (NLP) and artificial intelligence (AI) that could help us fight it.
A long time ago I published a blog post explaining how to represent the Reuters-21578 collection (and, more generally, any textual collection for text classification). However, that post never explained how to perform the classification step itself. This post will introduce some of the basic concepts of classification, quickly show the representation we came up with in the prior post and, finally, focus on how to perform and evaluate the classification.
Reuters-21578 is arguably the most commonly used collection for text classification during the last two decades, and it has been used in some of the most influential papers in the field, for instance, Text Categorization with Support Vector Machines: Learning with Many Relevant Features by Thorsten Joachims. This dataset contains structured information about newswire articles that can be assigned to several classes, which makes this a multi-label problem. It has a highly skewed distribution of documents over categories: a large proportion of documents belong to just a few topics.
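To make "multi-label" and "skewed" a bit more concrete, here is a minimal sketch in plain Python. The documents are toy records I made up for illustration, not real Reuters-21578 entries, and the helper names (`label_counts`, `to_indicator_matrix`) are hypothetical. It shows one common way to encode multi-label targets, as one 0/1 column per topic, and how to count documents per topic to see the imbalance.

```python
from collections import Counter

# Toy, hand-made examples (NOT real Reuters-21578 records). Each document
# carries a list of topic labels, and a document may have more than one.
toy_docs = [
    {"id": 1, "topics": ["earn"]},
    {"id": 2, "topics": ["earn"]},
    {"id": 3, "topics": ["acq"]},
    {"id": 4, "topics": ["earn"]},
    {"id": 5, "topics": ["grain", "wheat"]},  # a multi-label document
    {"id": 6, "topics": ["acq"]},
]


def label_counts(docs):
    """Count how many documents are assigned to each topic."""
    return Counter(topic for doc in docs for topic in doc["topics"])


def to_indicator_matrix(docs, labels):
    """Encode each document as a 0/1 row with one column per label."""
    return [[1 if label in doc["topics"] else 0 for label in labels]
            for doc in docs]


counts = label_counts(toy_docs)
labels = sorted(counts)  # fixed column order for the matrix
matrix = to_indicator_matrix(toy_docs, labels)

print(labels)
for doc, row in zip(toy_docs, matrix):
    print(doc["id"], row)
print(counts.most_common())
```

Even in this tiny sample, most documents fall under one or two topics while the rest appear only once. That is the kind of skew to keep in mind when choosing evaluation measures later on.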