In this project, we will go through the process of solving a text classification problem using pre-trained word embeddings and a convolutional neural network.
“Word embeddings” are a family of natural language processing techniques that aim to map semantic meaning into a geometric space. This is done by associating a numeric vector with every word in a dictionary, such that the distance (e.g. L2 distance or, more commonly, cosine distance) between any two vectors captures part of the semantic relationship between the two associated words. The geometric space formed by these vectors is called an embedding space. For instance, “coconut” and “polar bear” are words that are semantically quite different, so a reasonable embedding space would represent them as vectors that are very far apart. But “kitchen” and “dinner” are related words, so they should be embedded close to each other.
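To make “distance in embedding space” concrete, here is a minimal sketch using made-up vectors (real embedding vectors would come from a trained model such as GloVe; only the cosine-distance computation itself is the point here):

```python
import numpy as np

def cosine_distance(u, v):
    # Cosine distance = 1 - cosine similarity.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 4-dimensional "embeddings", purely for illustration.
embeddings = {
    "kitchen":    np.array([0.9, 0.1, 0.4, 0.0]),
    "dinner":     np.array([0.8, 0.2, 0.5, 0.1]),
    "polar bear": np.array([0.0, 0.9, 0.1, 0.8]),
}

# Related words should be close, unrelated words far apart.
print(cosine_distance(embeddings["kitchen"], embeddings["dinner"]))      # small distance
print(cosine_distance(embeddings["kitchen"], embeddings["polar bear"]))  # large distance
```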
Ideally, in a good embedding space, the “path” (a vector) to go from “dinner” to “kitchen” would capture precisely the semantic relationship between these two concepts. In this case the relationship is “where x occurs”, so you would expect the vector kitchen - dinner (the difference of the two embedding vectors, i.e. the path to go from dinner to kitchen) to capture this “where x occurs” relationship. Basically, we should have the vectorial identity: dinner + (where x occurs) = kitchen (at least approximately). If that’s indeed the case, then we can use such a relationship vector to answer questions. For instance, starting from a new vector, e.g. “work”, and applying this relationship vector, we should get something meaningful, e.g. work + (where x occurs) = office, answering “where does work occur?”.
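Here is what that vector arithmetic looks like in code, again with made-up vectors; in a real embedding space you would look up trained vectors and search the entire vocabulary for the nearest neighbour of the query:

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy embeddings, only to illustrate the arithmetic.
embeddings = {
    "kitchen": np.array([1.0, 0.2, 0.1]),
    "dinner":  np.array([0.2, 0.2, 0.1]),
    "work":    np.array([0.3, 0.9, 0.0]),
    "office":  np.array([1.1, 0.9, 0.0]),
}

# The "where x occurs" relationship, as a vector: kitchen - dinner.
relation = embeddings["kitchen"] - embeddings["dinner"]

# Applying the relationship to "work" should land near "office".
query = embeddings["work"] + relation
nearest = max(embeddings, key=lambda w: cosine_similarity(query, embeddings[w]))
print(nearest)  # ideally "office"
```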
Word embeddings are computed by applying dimensionality reduction techniques to datasets of co-occurrence statistics between words in a corpus of text. This can be done via neural networks (the “word2vec” technique), or via matrix factorization.
We will be using GloVe embeddings, which you can read about here. GloVe stands for “Global Vectors for Word Representation”. It’s a somewhat popular embedding technique based on factorizing a matrix of word co-occurrence statistics.
Specifically, we will use the 100-dimensional GloVe embeddings of 400k words computed on a 2014 dump of English Wikipedia. You can download them here.
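As a preview of how such a file is typically used, the sketch below builds a word-to-vector lookup from it (assuming the standard glove.6B.100d.txt file, where each line is a word followed by 100 floating-point values):

```python
import numpy as np

embeddings_index = {}
# Each line looks like: "the 0.418 0.24968 -0.41242 ..." (a word, then 100 floats).
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype="float32")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))
```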
The task we will try to solve will be to classify posts coming from 20 different newsgroups into their original 20 categories: the infamous “20 Newsgroup dataset”. You can read about the dataset and download the raw text data here.
Categories are fairly semantically distinct and thus will have quite different words associated with them.
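If you want to peek at the raw data before we start, a minimal loading sketch might look like the following (assuming the archive unpacks into a 20_newsgroup/ directory with one sub-directory per category, each containing one file per post):

```python
import os

texts = []         # list of text samples
labels = []        # list of label ids, one per sample
labels_index = {}  # map: category name -> numeric label id

data_dir = "20_newsgroup"  # wherever the raw dataset was unpacked
for label_id, category in enumerate(sorted(os.listdir(data_dir))):
    labels_index[category] = label_id
    category_dir = os.path.join(data_dir, category)
    for fname in sorted(os.listdir(category_dir)):
        with open(os.path.join(category_dir, fname), encoding="latin-1") as f:
            texts.append(f.read())
        labels.append(label_id)

print("Found %s texts in %s categories." % (len(texts), len(labels_index)))
```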
Here’s how we will solve the classification problem: