Bag-of-Words Models vs. Deep Sequence Models

The rise of Machine Learning, Deep Learning, and Artificial Intelligence more generally has been undeniable, and it has already had a massive impact on the field of computer science. By now, you might have heard how deep learning has achieved super-human performance in a number of tasks, ranging from image recognition to the game of Go.

The deep learning community is now eyeing natural language processing (NLP) as the next frontier of research and application.

One beauty of deep learning is that advances tend to be very generic. For example, techniques that make deep learning work for one domain can often be transferred to other domains with little to no modification. More specifically, the approach of building massive, computationally expensive deep learning models for image and speech recognition has spilled into NLP. One can see this in the case of the most recent state-of-the-art translation system, which outperformed all previous results but required an exorbitant amount of computation. Such demanding systems can capture the very complex patterns occasionally found in real-world data, but this has led many to apply these massive models to all tasks. This raises the question:

Do all tasks always have the complexity that requires such models?

Let's look at the innards of a two-layer MLP trained on bag-of-words embeddings for sentiment analysis.


The innards of a simple deep learning system, known as the bag-of-words, classifying sentences as positive or negative. The visualization is a t-SNE of the last hidden layer of a two-layer MLP on top of a bag-of-words representation. Each data point corresponds to a sentence and is coloured according to the deep learning system's prediction and the true target. The bounding boxes are drawn according to the linguistic content of the sentences. Later you will get to inspect them for yourself with an interactive plot!

The bounding boxes in the plot above offer some important insights. Real-world data comes in different degrees of difficulty: some sentences are easily classified, while others contain complex semantic structures. For the easily classified sentences, running a high-capacity system might be unnecessary; a much simpler model could potentially do an equivalent job. This blog post will explore whether that is the case, and it will show that we can often get by with simple models.
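For readers curious how such a visualization can be produced, here is a minimal sketch using scikit-learn's t-SNE. It assumes three placeholder NumPy arrays, `hidden_states` (one row of last-hidden-layer activations per sentence), `predictions`, and `targets`; these names are illustrative and not from the original post.

```python
# Minimal sketch: project last-hidden-layer activations to 2D with t-SNE
# and colour points by whether the model's prediction matches the label.
# hidden_states, predictions and targets are assumed to come from your
# own trained model; the names are placeholders.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_hidden_states(hidden_states, predictions, targets):
    # Reduce the hidden layer to two dimensions for plotting.
    coords = TSNE(n_components=2, random_state=0).fit_transform(hidden_states)
    correct = predictions == targets
    plt.scatter(coords[correct, 0], coords[correct, 1], c="tab:blue", label="correct")
    plt.scatter(coords[~correct, 0], coords[~correct, 1], c="tab:red", label="wrong")
    plt.legend()
    plt.show()
```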

Deep learning with text

Most deep learning methods require floating-point numbers as input and, unless you have worked with text before, you might wonder:

How do I go from a piece of text to deep learning?

A core issue with text is how to represent arbitrarily large amounts of information, given that the material can be of any length. A popular solution is tokenizing text into words, sub-words, or even characters. Each word is then transformed into a floating-point vector using well-studied methods such as word2vec or GloVe. These methods provide meaningful representations of a word through the implicit relationships between different words.


Take a word, turn it into a high-dimensional embedding (e.g. 300 dimensions), and use PCA or t-SNE (popular tools to reduce dimensionality, to two dimensions in this case), and you will find interesting relationships between words. As one can see above, the distance between uncle and aunt is similar to the distance between man and woman. (Source: Mikolov et al., 2013)
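To make the "relationship as a vector offset" idea concrete, here is a tiny sketch with made-up 3-dimensional vectors (real word2vec or GloVe embeddings are learned and typically have hundreds of dimensions). The point is only that the difference vectors for related word pairs end up pointing in similar directions.

```python
# Toy illustration of the "relationship as vector offset" idea.
# The 3-d vectors below are invented for demonstration only.
import numpy as np

emb = {
    "man":   np.array([0.9, 0.1, 0.2]),
    "woman": np.array([0.9, 0.8, 0.2]),
    "uncle": np.array([0.3, 0.1, 0.7]),
    "aunt":  np.array([0.3, 0.8, 0.7]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The man->woman offset and the uncle->aunt offset are nearly parallel.
offset_1 = emb["woman"] - emb["man"]
offset_2 = emb["aunt"] - emb["uncle"]
print(cosine(offset_1, offset_2))  # close to 1.0
```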

By combining tokenization with methods such as word2vec, we can turn a piece of text into a sequence of floating-point representations, one for each word.
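Putting the two steps together, a sentence becomes a matrix with one embedding per token. The sketch below assumes a pre-trained embedding lookup table (here a plain dictionary named `embeddings`, a stand-in for word2vec/GloVe vectors); out-of-vocabulary handling and proper tokenization are omitted.

```python
# Sketch: tokenize a sentence and stack the per-word vectors into a matrix.
# `embeddings` stands in for a pre-trained word2vec/GloVe lookup table.
import numpy as np

def sentence_to_matrix(sentence, embeddings):
    tokens = sentence.lower().split()                        # naive whitespace tokenization
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.stack(vectors)                                 # shape: (num_tokens, embed_dim)
```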

Now, what can we use a sequence of word representations for?

Bag-of-words

Now let's talk about the bag-of-words (BoW), perhaps one of the simplest machine learning algorithms you will ever learn!


Take a number of word representations (the bottom gray boxes) and either sum or average them into a common representation (blue box) that should then contain some information from each word. In this post, the common representation is used to predict whether the sentence is positive or negative (red box).

Simply take the mean of the word embeddings along each feature dimension. It turns out that averaging word embeddings, even though it completely ignores the order of the sentence, works well on many simple practical examples and will often give a strong baseline when combined with deep neural networks (shown later). Furthermore, taking the mean is a cheap operation and reduces the dimensionality of the sentence to a fixed-size vector.
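A minimal sketch of this baseline: average the word vectors into one fixed-size vector per sentence and feed the result to any off-the-shelf classifier (scikit-learn's logistic regression is used here purely as an example; the post does not prescribe a specific classifier). `X_sentences` and `y` are placeholder names for your own data.

```python
# Bag-of-words baseline: average word vectors, then classify the fixed-size result.
# X_sentences is assumed to be a list of (num_tokens, embed_dim) matrices and
# y the 0/1 sentiment labels; both are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

def bag_of_words(word_matrix):
    # Mean over the token axis ignores word order but keeps a fixed dimensionality.
    return word_matrix.mean(axis=0)

def train_bow_classifier(X_sentences, y):
    X = np.stack([bag_of_words(m) for m in X_sentences])
    clf = LogisticRegression(max_iter=1000)
    return clf.fit(X, y)
```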

Recurrent Neural Networks

Some sentences require high precision or rely on sentence structure. Using a bag-of-words for these problems might not cut it. Instead, you might want to consider the amazing recurrent neural network!


At each timestep (going from left to right) an input (e.g. a word) is fed to the RNN (grey box) together with the previous internal memory (blue box). The RNN then performs some computation that results in a new internal memory (blue box) representing all previously seen units (e.g. all previous words). The RNN should now contain information at the sentence level that allows it to better predict whether the sentence is positive or negative (red box).

Each word embedding is fed, in order, to a recurrent neural network, which stores previously seen information and combines it with each new word. When powered by famous memory cells such as the long short-term memory (LSTM) cell or the gated recurrent unit (GRU), the RNN is capable of remembering what has happened in sentences spanning many words! (Because of the LSTM's success, an RNN with LSTM memory cells is often simply referred to as an LSTM.) The biggest of these models stack eight such layers on top of one another.
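For comparison with the bag-of-words baseline, here is what a small LSTM sentiment classifier might look like in PyTorch (the framework choice is an assumption, not something the post prescribes). The final hidden state summarizes the whole sentence and is fed to a linear layer that predicts positive or negative.

```python
# Sketch of an LSTM sentiment classifier in PyTorch (framework choice is ours,
# not the post's). The final hidden state summarizes the sentence.
import torch.nn as nn

class LSTMSentiment(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 2)    # positive / negative

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        vectors = self.embed(token_ids)        # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(vectors)       # h_n: (1, batch, hidden_dim)
        return self.out(h_n[-1])               # logits: (batch, 2)
```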


Welcome to probably the most advanced deep learning model ever created, which uses RNNs with LSTM cells to translate language pairs. The pink, orange, and green boxes are recurrent neural networks with LSTM cells. The model also applies tricks of the trade such as skip connections between the LSTM layers and a method known as attention. Also notice that the green LSTM is heading in the opposite direction; when combined with a normal LSTM, this forms a bidirectional LSTM, which gains information from the sequence of data in both directions. For more information check out this blog post by Stephen Merity. (Source: Wu et al., 2016)
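Stacking layers and reading the sequence in both directions are a one-line change in most frameworks. The PyTorch sketch below shows the idea only; the layer count and sizes are illustrative and do not reproduce the actual translation system's configuration.

```python
# Sketch: a deep bidirectional LSTM encoder, loosely in the spirit of the
# figure above. Layer count and sizes are illustrative, not the real system's.
import torch.nn as nn

encoder = nn.LSTM(
    input_size=300,      # embedding dimension
    hidden_size=512,
    num_layers=8,        # stacked LSTM layers
    bidirectional=True,  # read the sequence in both directions
    batch_first=True,
)
# The output of encoder(x) has feature size 2 * hidden_size (forward + backward).
```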

However, the LSTM is much, much more expensive than the cheap bag-of-words model and will often require an experienced deep learning engineer to implement and run efficiently on high-performance computing hardware.
