In this project, we will go through the process of solving a text classification problem using pre-trained word embeddings and a convolutional neural network.
“Word embeddings” are a family of natural language processing techniques that aim to map semantic meaning into a geometric space. This is done by associating a numeric vector with every word in a dictionary, such that the distance (e.g. L2 distance or, more commonly, cosine distance) between any two vectors captures part of the semantic relationship between the two associated words. The geometric space formed by these vectors is called an embedding space. For instance, “coconut” and “polar bear” are words that are semantically quite different, so a reasonable embedding space would represent them as vectors that are very far apart. But “kitchen” and “dinner” are related words, so they should be embedded close to each other.
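To make “distance in embedding space” concrete, here is a minimal sketch using made-up vectors (real embedding vectors would come from a trained model such as GloVe; only the cosine-distance computation itself is the point here):

```python
import numpy as np

def cosine_distance(u, v):
    # Cosine distance = 1 - cosine similarity.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 4-dimensional "embeddings", purely for illustration.
embeddings = {
    "kitchen":    np.array([0.9, 0.1, 0.4, 0.0]),
    "dinner":     np.array([0.8, 0.2, 0.5, 0.1]),
    "polar bear": np.array([0.0, 0.9, 0.1, 0.8]),
}

# Related words should be close, unrelated words far apart.
print(cosine_distance(embeddings["kitchen"], embeddings["dinner"]))      # small distance
print(cosine_distance(embeddings["kitchen"], embeddings["polar bear"]))  # large distance
```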
Ideally, in a good embedding space, the “path” (a vector) to go from “dinner” to “kitchen” would capture precisely the semantic relationship between these two concepts. In this case the relationship is “where x occurs”, so you would expect the vector kitchen - dinner (the difference of the two embedding vectors, i.e. the path to go from dinner to kitchen) to capture this “where x occurs” relationship. Basically, we should have the vectorial identity: dinner + (where x occurs) = kitchen (at least approximately). If that’s indeed the case, then we can use such a relationship vector to answer questions. For instance, starting from a new vector, e.g. “work”, and applying this relationship vector, we should get something meaningful, e.g. work + (where x occurs) = office, answering “where does work occur?”.
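Here is what that vector arithmetic looks like in code, again with made-up vectors; in a real embedding space you would look up trained vectors and search the entire vocabulary for the nearest neighbour of the query:

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy embeddings, only to illustrate the arithmetic.
embeddings = {
    "kitchen": np.array([1.0, 0.2, 0.1]),
    "dinner":  np.array([0.2, 0.2, 0.1]),
    "work":    np.array([0.3, 0.9, 0.0]),
    "office":  np.array([1.1, 0.9, 0.0]),
}

# The "where x occurs" relationship, as a vector: kitchen - dinner.
relation = embeddings["kitchen"] - embeddings["dinner"]

# Applying the relationship to "work" should land near "office".
query = embeddings["work"] + relation
nearest = max(embeddings, key=lambda w: cosine_similarity(query, embeddings[w]))
print(nearest)  # ideally "office"
```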
Word embeddings are computed by applying dimensionality reduction techniques to datasets of co-occurrence statistics between words in a corpus of text. This can be done via neural networks (the “word2vec” technique), or via matrix factorization.
We will be using GloVe embeddings, which you can read about here. GloVe stands for “Global Vectors for Word Representation”. It’s a somewhat popular embedding technique based on factorizing a matrix of word co-occurrence statistics.
Specifically, we will use the 100-dimensional GloVe embeddings of 400k words computed on a 2014 dump of English Wikipedia. You can download them here.
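As a preview of how such a file is typically used, the sketch below builds a word-to-vector lookup from it (assuming the standard glove.6B.100d.txt file, where each line is a word followed by 100 floating-point values):

```python
import numpy as np

embeddings_index = {}
# Each line looks like: "the 0.418 0.24968 -0.41242 ..." (a word, then 100 floats).
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype="float32")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))
```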
The task we will try to solve will be to classify posts coming from 20 different newsgroups into their original 20 categories: the infamous “20 Newsgroup dataset”. You can read about the dataset and download the raw text data here.
Categories are fairly semantically distinct and thus will have quite different words associated with them.
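If you want to peek at the raw data before we start, a minimal loading sketch might look like the following (assuming the archive unpacks into a 20_newsgroup/ directory with one sub-directory per category, each containing one file per post):

```python
import os

texts = []         # list of text samples
labels = []        # list of label ids, one per sample
labels_index = {}  # map: category name -> numeric label id

data_dir = "20_newsgroup"  # wherever the raw dataset was unpacked
for label_id, category in enumerate(sorted(os.listdir(data_dir))):
    labels_index[category] = label_id
    category_dir = os.path.join(data_dir, category)
    for fname in sorted(os.listdir(category_dir)):
        with open(os.path.join(category_dir, fname), encoding="latin-1") as f:
            texts.append(f.read())
        labels.append(label_id)

print("Found %s texts in %s categories." % (len(texts), len(labels_index)))
```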
Here’s how we will solve the classification problem: