Last weekend I attended PyData London and, as expected, it was a fantastic event. I reconnected with a lot of colleagues and met some amazing new people. For those of you who were not able to attend, the videos of the presentations should appear shortly on the PyData YouTube channel.
One of the benefits of PyData is that it brings together people from diverse disciplines and areas of expertise, under the umbrella topic of Data Science. Because of this, the attendees include academics, industry practitioners, data science managers or consultants, among others, all with different perspectives and knowledge to learn from. The community attracts people from many different industries and backgrounds, from Computer Science to Arts, all applying Data Science to solve challenges from discovering new stars to automatically detecting the species of a bird by listening to the noise it makes.
In addition to meeting new colleagues and understanding more about what people are doing with Data Science, I come away from PyData with two major outcomes every time I attend:
- A list of new ideas both for myself and for Signal
- A list of interesting libraries and projects that I want to explore and play with
For this blogpost, I want to share some of the libraries and frameworks that were mentioned at the conference for two specific topics: Word Embeddings and Model Explainability. Obviously, this is not a complete list and many other interesting topics were covered at the conference. I encourage you to watch the talks on YouTube once they are uploaded.
Text Analytics has been close to my work, in one way or another, for most of my research and industry career, and one of the major changes in the area in recent years is the semantic representation of documents (and other textual units such as sentences) using embeddings. As a result, many libraries, projects and pre-trained embeddings have appeared, and many of them were mentioned in several talks, including the one by Lev Konstantinovskiy and, especially, the one by Kostas Perifanos. My goal is to provide a brief description of each method and point to the main paper and library related to it.
I would also recommend Lev's talk at PyData Berlin 2017 to anyone interested in how word embeddings differ from each other. One of the most insightful parts of the talk is how different embeddings capture different aspects of similarity. For instance, given the word ‘king’, WordRank ranks its attributes highest (e.g., monarch, crowned, throne, …), while in Word2Vec interchangeable words have higher similarity (e.g., Capet, Mormaer, Canute, …). FastText was shown to combine both aspects.
If you are not familiar with this area, the goal of a word embedding is to represent words in a low-dimensional space (usually around 300 dimensions) such that words that appear in similar contexts are close to each other in that space. This idea has proven beneficial for many NLP tasks because it makes computing semantic similarity both effective and efficient, and it is now widely used.
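
To make "closer in that space" concrete, here is a minimal sketch that ranks words by cosine similarity. The 4-dimensional vectors are made up for illustration (real embeddings have hundreds of dimensions and are learned from text), and the helper names are mine, not from any library.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, invented for this example only.
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1, 0.0],
    "queen": [0.8, 0.9, 0.2, 0.0],
    "apple": [0.0, 0.1, 0.9, 0.8],
}

def most_similar(word, embeddings):
    """Rank every other word by cosine similarity to `word`, best first."""
    query = embeddings[word]
    scores = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = most_similar("king", toy_embeddings)
print(ranking)  # "queen" is ranked above "apple" for these toy vectors
```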
Word2Vec was the method that started the new wave of word embeddings. It is a family of models that produce word embeddings using a shallow, two-layer neural network trained to reconstruct the context of a word. There are two main variants: the Continuous Bag of Words (CBOW) model, which predicts the current word from its surrounding context, and the SkipGram model, which predicts the surrounding context from the current word.
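
The difference between the two variants is easiest to see in the training examples they consume. The sketch below only builds those examples from a tokenised sentence; it does not train a network, and the function names and the `window` parameter are my own choices for illustration.

```python
def context_window(tokens, index, window):
    """Words within `window` positions of tokens[index], excluding the word itself."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """(context words, target word): CBOW predicts the target from its context."""
    return [
        (context_window(tokens, i, window), tokens[i])
        for i in range(len(tokens))
    ]

def skipgram_examples(tokens, window=2):
    """(target word, context word): SkipGram predicts each context word from the target."""
    return [
        (tokens[i], context_word)
        for i in range(len(tokens))
        for context_word in context_window(tokens, i, window)
    ]

sentence = "the king wears a crown".split()
print(cbow_examples(sentence, window=1))
print(skipgram_examples(sentence, window=1))
```

With `window=1`, CBOW gets one example per position (for "king" the input is `["the", "wears"]`), while SkipGram gets one example per (word, neighbour) pair.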
Global Vectors (GloVe) is an unsupervised model that obtains word vectors by factorising the word-to-word co-occurrence matrix computed over a given collection.
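
As a rough sketch of the two ingredients, the code below builds distance-weighted co-occurrence counts and evaluates the weighted least-squares loss from the GloVe paper for a single word pair. The function names are mine, the defaults `x_max=100` and `alpha=0.75` follow the values reported in the paper, and the optimisation loop that actually learns the vectors is left out.

```python
import math
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Symmetric co-occurrence counts; a pair d words apart contributes 1/d."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), i):
                increment = 1.0 / (i - j)
                counts[(word, tokens[j])] += increment
                counts[(tokens[j], word)] += increment
    return dict(counts)

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Down-weights rare pairs and caps the influence of very frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Loss for one pair: f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return glove_weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2

corpus = [["the", "king", "wears", "a", "crown"]]
counts = cooccurrence_counts(corpus, window=2)
print(counts[("king", "the")], counts[("wears", "the")])
```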
FastText uses a modification of the SkipGram model in which vector representations are learned for character n-grams, and a word is represented by combining the vectors of its n-grams. The authors argue that this captures the morphology of words better and makes it possible to represent words that did not appear in the original corpus, reducing the out-of-vocabulary problem.
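
A small sketch of the n-gram idea: words are wrapped in `<` and `>` boundary markers, as in the FastText paper, and then sliced into character n-grams. The `oov_vector` helper is a simplification of mine (a plain dictionary lookup and an average), not the library's actual implementation, which hashes n-grams into a fixed number of buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

def oov_vector(word, ngram_vectors, dim, min_n=3, max_n=6):
    """Average the vectors of the word's n-grams that have a known vector."""
    known = [
        ngram_vectors[gram]
        for gram in char_ngrams(word, min_n, max_n)
        if gram in ngram_vectors
    ]
    if not known:
        return [0.0] * dim  # nothing to build on
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]

print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because an unseen word usually shares some n-grams with words seen in training, it still gets a non-trivial vector.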
StarSpace takes a very interesting approach that is not limited to words. The authors have shown how it can represent different kinds of entities, such as users or items for recommendations, among others. The main idea is that an entity can be represented as a bag of words (e.g., a user can be described by the documents they have clicked on in a news recommendation platform). Because all entities live in the same space, different types of entities can be compared with each other, which makes this a very flexible method for problems ranging from people matching to graph similarity. Kostas mentioned that this has worked really well for some of the problems he is working on at Argos.
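
The sketch below illustrates only the representation step: an entity becomes the sum of the vectors of the words in its bag, so a user and an article end up in the same space and can be compared directly. The vectors and names are invented for the example, and the training procedure that StarSpace uses to learn the vectors is not shown.

```python
import math

# Hypothetical 2-dimensional word vectors, made up for illustration.
toy_word_vectors = {
    "football": [1.0, 0.0],
    "goal":     [0.9, 0.1],
    "stocks":   [0.0, 1.0],
    "market":   [0.1, 0.9],
}

def embed_entity(bag_of_words, word_vectors):
    """Represent an entity as the sum of the vectors of its known words."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in bag_of_words:
        vector = word_vectors.get(word)
        if vector is None:
            continue  # skip words without a vector
        total = [t + v for t, v in zip(total, vector)]
    return total

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

# A user described by words from clicked documents, and two candidate articles.
user = embed_entity(["football", "goal", "goal"], toy_word_vectors)
sports_article = embed_entity(["goal", "football"], toy_word_vectors)
finance_article = embed_entity(["stocks", "market"], toy_word_vectors)

print(cosine(user, sports_article), cosine(user, finance_article))
```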
The next approach is a generative model for text based on a dynamic topic model. The original paper also provides a theoretical justification for nonlinear models like PMI, word2vec and GloVe, and explains why low-dimensional semantic embeddings contain linear algebraic structure that captures word analogies.
WordRank is another model for word embeddings. It was not mentioned at this year's PyData, but Lev covered it in his 2016 and 2017 talks in Berlin and London. The authors of WordRank argue that learning a word embedding can be framed as a ranking problem.
Debugging and improving models has always been at the core of machine learning, and many tools and methods have long been available for it. Recently, we have seen a clear trend toward model explainability, which has led to the creation of many new libraries for understanding the decisions made by different models. The topic came up in many talks, and some of the libraries below were recommended by several speakers, including Gael.
LIME (short for Local Interpretable Model-agnostic Explanations) aims to explain the predictions of any classifier in an interpretable and faithful manner. It works on individual predictions: for each one, it learns a simple, interpretable model that approximates the classifier locally around that prediction.
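
To illustrate the "learn an interpretable model locally" idea, here is a simplified sketch for numeric features: perturb the instance, weight each perturbed sample by its proximity to the instance, and fit a weighted linear regression whose coefficients serve as the explanation. This is not the `lime` library's API. The function names, the Gaussian perturbation, the kernel and all default values are assumptions of mine, and the real method additionally works on an interpretable representation and fits a sparse model.

```python
import math
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def explain_locally(predict, instance, n_samples=500, scale=0.1,
                    kernel_width=0.25, seed=0):
    """Per-feature coefficients of a weighted linear fit around `instance`."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        sample = [x + rng.gauss(0.0, scale) for x in instance]
        sq_dist = sum((s - x) ** 2 for s, x in zip(sample, instance))
        rows.append([1.0] + sample)  # leading 1.0 is the intercept column
        targets.append(predict(sample))
        weights.append(math.exp(-sq_dist / kernel_width ** 2))
    size = len(instance) + 1
    # Normal equations of weighted least squares: (X^T W X) beta = X^T W y
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
             for j in range(size)] for i in range(size)]
    xtwy = [sum(w * r[i] * t for r, w, t in zip(rows, weights, targets))
            for i in range(size)]
    return solve_linear_system(xtwx, xtwy)[1:]  # drop the intercept

def black_box(x):
    """Stand-in for an opaque model; nonlinear in the first feature."""
    return x[0] ** 2 + 3.0 * x[1]

coefficients = explain_locally(black_box, [1.0, 0.0])
print(coefficients)  # close to the local slopes of black_box at (1, 0)
```

Around the point (1, 0) the stand-in model behaves roughly like 2·x0 + 3·x1, and the fitted coefficients should land near those local slopes even though the model itself is not linear.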
ELI5 helps debug machine learning classifiers and explain their predictions. It can be applied to classifiers built with some of the most popular frameworks such as scikit-learn or xgboost.
SHAP (SHapley Additive exPlanations) is a unified approach to explaining the output of any machine learning model. SHAP connects game theory with local explanations. It unifies several previous methods (including LIME) and, according to its authors, represents the only possible consistent and locally accurate additive feature attribution method based on expectations.
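
The game theory part refers to Shapley values. As a sketch, the code below computes them exactly for a tiny model by enumerating every subset of features. It is not the `shap` library's API, and it makes a simplifying assumption: features outside a subset are replaced by a single baseline value, whereas SHAP works with expectations over background data. Exact enumeration is exponential in the number of features, so this is only practical for toy cases.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n_features):
    """Exact Shapley values; `value` maps a frozenset of feature indices to a number."""
    players = list(range(n_features))
    attributions = []
    for i in players:
        others = [p for p in players if p != i]
        phi = 0.0
        for size in range(len(others) + 1):
            # Weight of a coalition of this size: |S|! (n - |S| - 1)! / n!
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                phi += weight * (value(coalition | {i}) - value(coalition))
        attributions.append(phi)
    return attributions

def model(x):
    """Toy model with one additive term and one interaction term."""
    return 2.0 * x[0] + x[1] * x[2]

instance = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]  # assumed reference point for "missing" features

def value(coalition):
    """Model output when only the features in `coalition` take the instance's values."""
    masked = [instance[k] if k in coalition else baseline[k]
              for k in range(len(instance))]
    return model(masked)

phi = shapley_values(value, len(instance))
print(phi)  # approximately [2.0, 0.5, 0.5]
```

The attributions add up to the difference between the prediction for the instance and the prediction for the baseline, which is the local accuracy property, and the interaction term is split evenly between the two features involved.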
This was a short blogpost summarising some of the most interesting libraries for embeddings and model explanation. I hope it is helpful, especially for people who are starting to explore this space.
Note: The image at the header was taken from Nick Radcliffe's Twitter.