This chapter uses neural networks to learn a vector representation of individual semantic units like a word or a paragraph. These vectors are dense rather than sparse as in the bag-of-words model, and have a few hundred real-valued entries rather than tens of thousands of binary or discrete entries. They are called embeddings because they assign each semantic unit a location in a continuous vector space.
Embeddings result from training a model to relate tokens to their context with the benefit that similar usage implies a similar vector. As a result, the embeddings encode semantic aspects like relationships among words by means of their relative location. They are powerful features for use in the deep learning models that we will introduce in the following chapters.
Word embeddings represent tokens as lower-dimensional vectors so that their relative location reflects their relationship in terms of how they are used in context. They embody the distributional hypothesis from linguistics that claims words are best defined by the company they keep.
Word vectors are capable of capturing numerous semantic aspects; not only are synonyms close to each other, but words can have multiple degrees of similarity, e.g. the word ‘driver’ could be similar to ‘motorist’ or to ‘factor’. Furthermore, embeddings reflect relationships among pairs of words like analogies (Tokyo is to Japan what Paris is to France, or went is to go what saw is to see).
Word embeddings result from training a shallow neural network to predict a word given its context. Whereas traditional language models define context as the words preceding the target, word embedding models use the words contained in a symmetric window surrounding the target.
In contrast, the bag-of-words model uses entire documents as context and relies on (weighted) counts to capture the co-occurrence of words rather than predictive vectors.
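The symmetric context window can be sketched in a few lines. This is a minimal illustration, not the word2vec implementation: the function name `context_windows`, the window size of 2, and the sample sentence are all assumptions made for the example.

```python
def context_windows(tokens, window=2):
    """Yield (target, context) pairs using a symmetric window around each target."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after the target
        yield target, left + right


# Hypothetical sample sentence; the window is truncated at the sentence edges.
tokens = "the quick brown fox jumps".split()
pairs = list(context_windows(tokens, window=2))
for target, context in pairs:
    print(target, "<-", context)
```

A traditional left-to-right language model would keep only the `left` part of each window.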
A word2vec model is a two-layer neural net that takes a text corpus as input and outputs a set of embedding vectors for words in that corpus. There are two different architectures to efficiently learn word vectors using shallow neural networks:
- The continuous-bag-of-words (CBOW) model predicts the target word using the average of the context word vectors as input so that their order does not matter. CBOW trains faster and tends to be slightly more accurate for frequent terms, but pays less attention to infrequent words.
- The skip-gram (SG) model, in contrast, uses the target word to predict words sampled from the context. It works well with small datasets and finds good representations even for rare words or phrases.
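
To make the difference between the two architectures concrete, the following sketch builds the training examples each one would see for a single target word. The three-dimensional vectors are made-up values and the helper `average` is hypothetical; real models learn the vectors during training.

```python
# Toy 3-dimensional embeddings; the values are made up for illustration.
embeddings = {
    "the":   [0.1, 0.0, 0.2],
    "quick": [0.4, 0.3, 0.0],
    "fox":   [0.2, 0.5, 0.1],
    "jumps": [0.3, 0.0, 0.6],
}

target = "brown"
context = ["the", "quick", "fox", "jumps"]


def average(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]


# CBOW: one example per target; the input is the averaged context,
# so shuffling the context words yields the same input.
cbow_input = average([embeddings[w] for w in context])
cbow_example = (cbow_input, target)

# Skip-gram: one example per (target, context word) pair.
skipgram_examples = [(target, w) for w in context]

print(cbow_example)
print(skipgram_examples)
```

Skip-gram produces several examples per target, which is one intuition for why it can learn more from each occurrence of a rare word.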
The dimensions of the word and phrase vectors do not have an explicit meaning. However, the embeddings encode similar usage as proximity in the latent space in a way that carries over to semantic relationships. This results in the interesting property that analogies can be expressed by adding and subtracting word vectors.
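The vector arithmetic behind analogies can be sketched as follows. The four-dimensional vectors are hand-crafted so that the arithmetic works out exactly, which is an assumption for illustration only; learned embeddings have no such interpretable dimensions and only approximate the relationship. The function names `cosine` and `analogy` are hypothetical.

```python
import math

# Hand-crafted toy vectors (assumption): real embeddings are learned and dense.
vectors = {
    "Tokyo":   [1.0, 1.0, 0.0, 0.0],
    "Japan":   [0.0, 1.0, 0.0, 0.0],
    "Paris":   [1.0, 0.0, 1.0, 0.0],
    "France":  [0.0, 0.0, 1.0, 0.0],
    "Berlin":  [1.0, 0.0, 0.0, 1.0],
    "Germany": [0.0, 0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word d that best completes 'a is to b what c is to d'."""
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three input words, as is common when evaluating analogies.
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


result = analogy("Japan", "Tokyo", "France", vectors)
print(result)
```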
Just as words can be used in different contexts, they can be related to other words in different ways, and these relationships correspond to different directions in the latent space. Accordingly, there are several types of analogies that the embeddings should reflect if the training data permits.
The word2vec authors provide a list of several thousand relationships spanning aspects of geography, grammar and syntax, and family relationships to evaluate the quality of embedding vectors (see directory analogies).
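A minimal sketch of how such a list could be read, assuming the file follows the layout of the original word2vec analogy file: a line starting with `:` names a category, and each following line holds four whitespace-separated words. The function name `parse_analogies` and the sample rows are illustrative.

```python
from collections import defaultdict


def parse_analogies(lines):
    """Group 'a b c d' analogy rows under the ': category' header preceding them."""
    categories = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()        # start a new category
        elif current is not None:
            categories[current].append(tuple(line.split()))
    return dict(categories)


# Small in-memory sample in the assumed format.
sample = """\
: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
: family
boy girl brother sister
""".splitlines()

analogies = parse_analogies(sample)
for category, rows in analogies.items():
    print(category, len(rows))
```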
Similar to other unsupervised learning techniques, the goal of learning embedding vectors is to generate features for other tasks like text classification or sentiment analysis. There are several options to obtain embedding vectors for a given corpus of documents:
- Use embeddings learned from a generic large corpus like Wikipedia or Google News
- Train your own model using documents that reflect a domain of interest
GloVe is an unsupervised algorithm developed at the Stanford NLP lab that learns vector representations for words from aggregated global word-word co-occurrence statistics (see references). Vectors pre-trained on the following web-scale sources are available:
- Common Crawl with 42B or 840B tokens and a vocabulary of 1.9M or 2.2M tokens
- Wikipedia 2014 + Gigaword 5 with 6B tokens and a vocabulary of 400K tokens
- Twitter using 2B tweets, 27B tokens and a vocabulary of 1.2M tokens
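
Pre-trained GloVe vectors are distributed as plain text, with one token per line followed by its vector components separated by spaces. The sketch below parses that layout from an in-memory sample; the function name `load_glove` and the three-dimensional sample values are assumptions for illustration, and a real file would be opened from disk instead.

```python
import io


def load_glove(lines):
    """Parse lines of the form '<token> <v1> <v2> ...' into a dict of vectors."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue                          # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Illustrative sample; actual files have 50 to 300 components per token.
sample = io.StringIO(
    "the 0.418 0.24968 -0.41242\n"
    ", 0.013441 0.23682 -0.16899\n"
)
glove = load_glove(sample)
print(len(glove), glove["the"])
```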
The following table shows the accuracy on the word2vec analogy test achieved by the GloVe vectors trained on Wikipedia:
| Category                 | Samples | Accuracy | Category              | Samples | Accuracy |
|--------------------------|---------|----------|-----------------------|---------|----------|
| capital-common-countries | 506     | 94.86%   | comparative           | 1332    | 88.21%   |
| capital-world            | 8372    | 96.46%   | superlative           | 1056    | 74.62%   |
| city-in-state            | 4242    | 60.00%   | present-participle    | 1056    | 69.98%   |
| currency                 | 752     | 17.42%   | nationality-adjective | 1640    | 92.50%   |
| family                   | 506     | 88.14%   | past-tense            | 1560    | 61.15%   |
| adjective-to-adverb      | 992     | 22.58%   | plural                | 1332    | 78.08%   |
| opposite                 | 756     | 28.57%   | plural-verbs          | 870     | 58.51%   |
There are several sources for pre-trained word embeddings. Popular options include Stanford’s GloVe and spaCy’s built-in vectors. The notebook using_trained_vectors illustrates how to work with pre-trained vectors.
The notebook evaluating_embeddings demonstrates how to test the quality of word vectors using analogies and other semantic relationships among words.
Many tasks require embeddings of domain-specific vocabulary that models pre-trained on a generic corpus may not be able to capture. Standard word2vec models are not able to assign vectors to out-of-vocabulary words and instead use a default vector that reduces their predictive value.
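The effect of out-of-vocabulary words can be sketched with a plain dictionary lookup. The two-dimensional vectors, the zero vector as the default, and the sample tokens are all assumptions for illustration; libraries differ in how they handle unknown tokens.

```python
# Toy pre-trained vocabulary (made-up values).
vectors = {"revenue": [0.2, 0.7], "growth": [0.5, 0.1]}
dim = 2
unk = [0.0] * dim  # assumed default vector for unknown tokens

tokens = ["revenue", "growth", "blockchain"]

# Every unknown token maps to the same default vector and thus carries no signal.
embedded = [vectors.get(t, unk) for t in tokens]

# Share of tokens the pre-trained vocabulary does not cover.
oov_rate = sum(t not in vectors for t in tokens) / len(tokens)

print(embedded)
print(f"OOV rate: {oov_rate:.2f}")
```

A high out-of-vocabulary rate on a domain corpus is one practical signal that training custom embeddings may be worthwhile.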
For example, when working with industry-specific documents, the vocabulary or its usage may change over time as new technologies or products emerge. As a result, the embeddings need to evolve as well. In addition, documents like corporate earnings releases use nuanced language that GloVe vectors pre-trained on Wikipedia articles are unlikely to properly reflect.
See the data directory for instructions on sourcing the financial news dataset.
The notebook financial_news_preprocessing demonstrates how to prepare the source data for our model.
The notebook financal_news_word2vec_tensorflow illustrates how to build a word2vec model using the Keras interface of TensorFlow 2 that we will introduce in much more detail in the next chapter.
The TensorFlow implementation is very transparent in terms of its architecture, but it is not particularly fast. The natural language processing (NLP) library gensim, which we also used for topic modeling in the last chapter, offers better performance and more closely resembles the C-based word2vec implementation provided by the original authors.
The notebook inancial_news_word2vec_gensim shows how to learn word vectors more efficiently.
In this section, we will learn word and phrase vectors from annual SEC filings using gensim to illustrate the potential value of word embeddings for algorithmic trading. In the following sections, we will combine these vectors as features with price returns to train neural networks to predict equity prices from the content of security filings.
In particular, we use a dataset containing over 22,000 10-K annual reports from the period 2013-2016 that are filed by listed companies and contain both financial information and management commentary (see Chapter 3 on Alternative Data). For about half of these filings (roughly 11K), we have stock prices for the filing companies, which lets us label the data for predictive modeling (see the references for the data source and the notebooks in the folder sec-filings for details).
Text classification requires combining multiple word embeddings. A common approach is to average the embedding vectors for each word in the document. This uses information from all embeddings and effectively uses vector addition to arrive at a different location in the embedding space. However, relevant information about the order of words is lost.
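A minimal sketch of the averaging approach, using made-up two-dimensional vectors and a hypothetical helper `doc_vector`. It also shows the loss of word order: two sentences with opposite meanings receive the same document vector.

```python
def doc_vector(tokens, vectors):
    """Average the vectors of all known tokens; return None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(values) / len(known) for values in zip(*known)]


# Toy embeddings (made-up values chosen to be exactly representable as floats).
vectors = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [0.5, 0.5]}

a = doc_vector("dog bites man".split(), vectors)
b = doc_vector("man bites dog".split(), vectors)
print(a, b, a == b)
```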
In contrast, the state-of-the-art approach to generating embeddings for pieces of text like a paragraph or a product review is the document embedding model doc2vec. This model was developed by the word2vec authors shortly after publishing their original contribution. Similar to word2vec, there are two flavors of doc2vec:
- The distributed memory (DM) model corresponds to the word2vec CBOW model. The document vectors result from training a network on the synthetic task of predicting a target word based on both the context word vectors and the document's doc vector.
- The distributed bag of words (DBOW) model corresponds to the word2vec skip-gram architecture. The doc vectors result from training a neural net to predict words from the document using only the document’s doc vector.
The notebook doc2vec_yelp_sentiment applies doc2vec to a random sample of 1 million Yelp reviews with their associated star ratings.
Word2vec and GloVe embeddings capture more semantic information than the bag-of-words approach, but only allow for a single fixed-length representation of each token that does not differentiate between context-specific usages. To address unsolved problems like multiple meanings for the same word, called polysemy, several new models have emerged that build on the attention mechanism designed to learn more contextualized word embeddings (Vaswani et al. 2017). Key characteristics of these models are:
- the use of bidirectional language models that process text both left-to-right and right-to-left for a richer context representation, and
- the use of semi-supervised pretraining on a large generic corpus to learn universal language aspects in the form of embeddings and network weights that can be used and fine-tuned for specific tasks
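
The following sketch illustrates how attention can turn a single static vector into context-specific ones. It is a deliberately stripped-down version of scaled dot-product self-attention that uses the input vectors directly as queries, keys, and values; actual transformer models add learned projections and multiple heads. The two-dimensional vectors are made up, and the names `softmax` and `self_attention` are chosen for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned weights)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # Each output is a weighted mix of all input vectors in the sequence.
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out


# One static vector per word (made-up values), as in word2vec or GloVe.
static = {"bank": [1.0, 1.0], "river": [0.0, 2.0], "loan": [2.0, 0.0]}

# The same word 'bank' receives different vectors in different contexts.
bank_near_river = self_attention([static["river"], static["bank"]])[1]
bank_near_loan = self_attention([static["loan"], static["bank"]])[1]
print(bank_near_river, bank_near_loan)
```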
In 2018, Google released the BERT model, which stands for Bidirectional Encoder Representations from Transformers (Devlin et al. 2019). In a major breakthrough for NLP research, it achieved state-of-the-art results on eleven natural language understanding tasks ranging from question answering and named entity recognition to paraphrasing and sentiment analysis as measured by the General Language Understanding Evaluation (GLUE) benchmark.
The BERT model builds on two key ideas, namely the transformer architecture described in the previous section and unsupervised pre-training so that it doesn’t need to be trained from scratch for each new task; rather, its weights are fine-tuned.
- BERT takes the attention mechanism to a new (deeper) level by using 12 or 24 layers depending on the architecture, each with 12 or 16 attention heads, resulting in up to 24 x 16 = 384 attention mechanisms to learn context-specific embeddings.
- BERT uses unsupervised, bidirectional pre-training to learn its weights in advance on two tasks: masked language modeling (predicting a missing word given the left and right context) and next sentence prediction (predicting whether one sentence follows another).
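
The masked language modeling task can be sketched as a data preparation step: hide some tokens and keep the originals as prediction targets. This is a simplified illustration rather than BERT's actual procedure, which operates on subword tokens and treats selected positions in several different ways. The function name `mask_tokens`, the default masking probability, and the fixed seed are assumptions for the example.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace tokens with a mask at random; return the masked input and labels.

    labels[i] holds the original token where a mask was applied, else None.
    """
    rng = random.Random(seed)   # fixed seed so the example is reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)      # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)     # no prediction needed at this position
    return masked, labels


tokens = "the company reported strong quarterly earnings growth".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(labels)
```

Because both the left and the right neighbors of a masked position remain visible, the model can draw on context from both directions when predicting the hidden token.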