Word2Vec is a technique that produces vector representations of words in which meaning and the relationships between words are encoded spatially: words that are related to each other end up closer in the resulting feature space. Word2Vec is gaining huge traction in the machine learning community and it is definitely worth knowing more about it. This blogpost illustrates the main characteristics of the method and provides a proof of concept using Clojure libraries.

1. Document Representation

Every machine learning model that requires training data (i.e., data previously labelled, usually by a human) relies on a specific representation function: the translation between an instance over which a prediction has to be made and the internal representation of that instance. This representation commonly takes the form of a vector of weights, one for each of the features chosen to describe the instance. For instance, imagine that we want to guess the nationality of a person we have just met; someone might focus on some of the person's visual characteristics:

  • Colour of skin, eyes and hair
  • Height
  • General appearance and clothes
  • Other

In my case, although I believe those features are important, I think that the accent (while speaking English) and the name of the person are probably much better signals in most cases. This is a very crude example of two different representations of the same person: one focusing on visual features, the other on the accent and the name. In addition to choosing which features to focus on, the system must specify how to convert that data into numeric values so that it can be easily plugged into a machine learning pipeline.
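
To make this idea concrete, here is a toy Clojure sketch of the second representation; the fields, categories and scores are purely illustrative and not part of any real pipeline:

;; Toy example: encode a person's accent and a name-based score as numbers.
;; The categories and scores here are purely illustrative.
(def accent->index {:british 0 :american 1 :spanish 2 :other 3})

(defn one-hot
  "Return a vector of `size` elements with a 1 at position `index`."
  [index size]
  (mapv #(if (= % index) 1 0) (range size)))

(defn person->features
  "Convert a person map into a flat numeric feature vector."
  [{:keys [accent name-origin-score]}]
  (conj (one-hot (accent->index accent) (count accent->index))
        name-origin-score))

(person->features {:accent :spanish :name-origin-score 0.9})
;; => [0 0 1 0 0.9]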

2. Word2Vec Representation: How does it work?

Word2Vec was developed by Mikolov, Sutskever, Chen, Corrado and Dean at Google Research in 2013. It is based on a neural network that processes text and produces a mapping between words and vector representations, also known as word embeddings, built from the context of individual words. This context is generated from multi-word windows. Given enough data, this approach has proven to generate very accurate and meaningful representations of words that can be used to calculate semantic distances between vectors. Google has also released a trained model based on Google News data, which contains 300-dimensional vectors for 3 million words and phrases and can be downloaded here.
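
To give an intuition of what those multi-word context windows look like, below is a small illustrative sketch (not taken from any library) that enumerates (target, context) pairs from a tokenised sentence using a symmetric window:

;; Illustrative only: enumerate (target, context) pairs using a
;; symmetric window of +/- `window` positions around each word.
(defn context-pairs [tokens window]
  (let [indexed (map-indexed vector tokens)]
    (for [[i target] indexed
          [j ctx]    indexed
          :when (and (not= i j)
                     (<= (Math/abs (- i j)) window))]
      [target ctx])))

(context-pairs ["the" "cat" "sat" "on" "the" "mat"] 2)
;; => (["the" "cat"] ["the" "sat"] ["cat" "the"] ["cat" "sat"] ...)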

One of the best uses of this technology was made by Instagram to represent and compare emoticons (or emojis). They discovered underlying patterns and similarities between these increasingly popular elements of online communication, and they also visualised those relationships.

3. Clojure libraries and POC

I have looked for Clojure libraries and the best one I have found is this wrapper around the Java version of word2vec. I was very surprised that the amount of code required to get a proof of concept running was less than 20 lines. The first step is to add the library to our project.clj:

(defproject word2vec "SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [org.bridgei2i/word2vec "0.1.0"]])

After this, the code below is all we need to read a text file, train the word2vec model and then generate the semantic representation of the main (i.e., most common) features:

(ns word2vec
  (:require [clojure-word2vec.core :refer
             [create-input-format get-matches]]
            [clojure.java.io :as io]))

(def data-file "textualData-100K.txt")

;; Format input data
(def data (create-input-format data-file))

;; Train the model
(def model (clojure-word2vec.core/word2vec data))

;; Select the main features
(def main-features (take 200 (.getVocab model)))

;; Show the most important features and their embedded words
(doseq [feature main-features]
  (println feature ":" (get-matches model feature)))

The library also allows us to parameterise the model in great detail when we call the training function. This can be seen in the following snippet, taken from the original Bridgei2i implementation, which shows all the options and their default values:

{:keys [min-vocab-frequency window-size type layer-size
        use-negative-samples downsampling-rate
        num-iterations num-threads]
 :or {min-vocab-frequency 5
      window-size 8
      type NeuralNetworkType/CBOW
      layer-size 5
      use-negative-samples 25
      downsampling-rate 1e-5
      num-iterations 100
      num-threads (.availableProcessors
                   (Runtime/getRuntime))}}
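
Assuming these options map to keyword arguments of the training function (which is what the destructuring form above suggests, although the exact signature may differ between library versions), a more tailored training call might look like this:

;; Hypothetical call: the keyword arguments are assumed from the
;; destructuring form above; check the signature in your library version.
(def tuned-model
  (clojure-word2vec.core/word2vec data
                                  :window-size 5
                                  :layer-size 100
                                  :min-vocab-frequency 10
                                  :num-iterations 50))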

In my small experiments, I have run that code on a set of 100,000 news articles and the results are interesting. While some representations make a lot of sense (e.g., “health”), others like “team” are clearly biased by specific spurious correlations:

development : (business operations energy strategy sales gas
production opportunities exploration leadership)
march : (due level change record current loss months
significant country half)
products : (process based data solution own system product
international provide support)
cash : (rate total cost debt lower value increase market
period increased)
help : (support own getting hard play world create little
means process)
team : (time taking comes held country life world lead
plan people)
home : (city john summer told series times event house air david)
companies : (management technology based customers process
international industry provide markets include)
health : (medical care patients women national education life
training patient country)

Obviously, I see the benefit of this tool, but you need a lot of data to make it work well. This is probably why the Google team has shared the output of their trained model with 3 million words. All the code shown on this page can be accessed in my github repository.

On a slightly off-topic note, I know that a lot of people like Python; this blogpost is a fantastic tutorial on how to use word2vec in Python with the gensim library.

4. Summary

I hope you agree with me that Word2Vec is a very cool technology with a lot of potential, especially because it focuses on a step that is pivotal to most Text Analytics and Machine Learning tasks: data representation. Having a semantic representation of a document can help in a multitude of different tasks, such as clustering or recommendation systems.
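
As a rough illustration of that last point, a simplistic way to obtain a document vector is to average the vectors of its words and then compare documents with cosine similarity. The sketch below assumes you have some `lookup` function from word to vector; how to obtain it depends on the word2vec library you use:

;; Sketch only: a crude document vector built by averaging word vectors.
;; `lookup` is any function from word to numeric vector (library-dependent).
(defn doc-vector [lookup words]
  (let [vecs (keep lookup words)
        n    (count vecs)]
    (when (pos? n)
      (mapv #(/ % n) (apply map + vecs)))))

(defn cosine-similarity [a b]
  (let [dot  (reduce + (map * a b))
        norm #(Math/sqrt (reduce + (map * % %)))]
    (/ dot (* (norm a) (norm b)))))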
