# Encoding financial texts into dense representations

#### 11/09/2019

The market is driven by two emotions: greed and fear.

Have you ever heard that quote? It is quite popular in financial circles, and there may just be some truth behind it. After all, when people with short-term investments think they are going to lose a lot of money, many of them sell as fast as they can. The same happens when they think they can make money: they tend to buy as fast as they can. People are inclined to overreact to what they hear or read, especially if it’s on the news. All of these are facts, and they support a new way of making market estimations: paying attention to what people (investors) hear about the markets.

Using market-related data from social media and news feeds is not a recent idea. It has been applied for some years and the improvement in market estimations can be substantial. In this post, we will explore some techniques that allow us to analyze text data in order to predict market movements. They are part of a very exciting field of Machine Learning: Natural Language Processing. In particular, we will focus on dense text representation, that is: encoding text into relatively small vectors that retain all the useful information. Then, we can use such vectors to feed algorithms and teach them to find patterns hidden in the relationship between financial text and market movements.

First, we will describe some fundamental and well-tested techniques for text representation; later, we will apply them to news articles about the American car company Tesla. We have chosen Tesla because they build electric cars able to (kind of) drive themselves! But the choice was also driven (no pun intended) by the fact that Tesla has been in news headlines quite a lot, so we will have more data to play with.

## Bag of words

This traditional NLP method takes advantage of word frequencies inside a document and across all documents in a corpus in order to find the most important words for each one. It does so by computing the Term Frequency – Inverse Document Frequency (TF-IDF) value for each word. There are several variations on the TF-IDF implementation, but the simplest formula to get the weight of a word i in a document j is as follows:

$$w_{ij} = tf_{ij} \cdot \log{\frac{N}{df_i}}$$

Where $$w_{ij}$$ is the TF-IDF of word i in document j; $$tf_{ij}$$ is the number of occurrences of word i in document j; N is the number of documents; and $$df_i$$ is the number of documents containing word i at least once. As we can see, if a word appears many times in a document, its importance increases, but if it also appears in many other documents, its importance decreases. This favors words that appear many times but only in a few documents; such words are supposed to carry more information. As an example, we will show 4 documents, each consisting of a single sentence. We will also compute the frequency matrix, which contains the term frequencies of each word in each document. Finally, we will show the TF-IDF of each word in each document. Note that a word will have a TF-IDF value for each document it appears in.
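As a sanity check, the formula can be computed directly. The tiny helper below is our own (not from any library) and shows how the weight vanishes for a word present in every document:

```python
import math

def tfidf(tf_ij, df_i, n_docs):
    # w_ij = tf_ij * log(N / df_i), exactly as in the formula above.
    return tf_ij * math.log(n_docs / df_i)

# A word appearing in all 4 documents gets weight 0, however frequent it is...
common = tfidf(tf_ij=2, df_i=4, n_docs=4)  # log(4/4) = 0
# ...while a word unique to one document keeps a high weight.
rare = tfidf(tf_ij=1, df_i=1, n_docs=4)    # log(4/1) ≈ 1.386
```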

Figure 1: A bunch of docs.

Figure 2a (left): Frequency matrix (left) and Figure 2b (right) TF-IDF matrix.

Document #4 shows how TF-IDF differs from a simple word count. In Figure 2a we see that the word “a” appears twice in the document, the highest frequency in that document. One may think the most repeated word is the most important one, but the TF-IDF for that word (Figure 2b) is lower than that of the word “window”, for example. This happens because the word “a” appears in every single document, while the word “window” only appears in document #4; therefore “window” is more relevant than “a”, which is, indeed, true.
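You can reproduce this behavior with scikit-learn's `TfidfVectorizer`. Two caveats: it uses a smoothed variant of the formula above and L2-normalizes each row, and its default tokenizer drops single-character tokens like “a”, so the toy documents below (our own, not the ones from the figures) use “the” as the ubiquitous word instead. The qualitative ranking is the same:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Four single-sentence documents, mirroring the setup of Figure 1.
docs = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the cat chased the dog",
    "the bird flew out through a window",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs).toarray()
col = vectorizer.vocabulary_  # maps each word to its column index

# "window" appears only in document #4 while "the" appears everywhere,
# so "window" gets the larger TF-IDF weight in that document.
w_window = tfidf[3, col["window"]]
w_the = tfidf[3, col["the"]]
```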

However, the bag of words approach presents some deficiencies. For instance, it doesn’t take word order or context into account. There is often a lot of meaning embedded in the context, and a bag of words couldn’t care less about it. As an example, consider “I have to eat food” vs “I have food to eat”. To a bag of words, both sentences are one and the same. There is also a technical problem regarding data sparsity. Bag of words treats each word as a single dimension, so the dimensionality of the problem is as big as the vocabulary size, which is usually massive. With that many dimensions, the curse of dimensionality becomes a serious issue. In particular, when we aggregate the TF-IDF values for a document, we form a vector that we can use to compare different documents. Those vectors lie in a high-dimensional space and will be quite far apart: we are dealing with a sparse vector space.

While bags of words are very popular, they are not the only NLP technique. There is a wide array of methods, and some of them address the issues that bags of words have. In this article, we will talk about word and sentence embeddings, a family of techniques that allow us to encode text information in small, dense vector spaces. These vectors can also retain information about context and word order.

## Word embeddings

In their seminal paper, Mikolov et al. (Efficient Estimation of Word Representations in Vector Space, 2013) proposed a technique to map words to dense vector representations, reducing the sparsity issue that plagues most NLP tasks. These word embeddings are also able to encode semantic and context information, so words with different surface forms but similar meaning end up close together in the vector space. It is even possible to perform sensible vector arithmetic. A typical example of word vector arithmetic goes as follows: subtract the vector representing “man” from the one representing “king” and add the vector representing “woman”; you will get the female version of a king, also known as “queen”.
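The mechanics of that arithmetic can be illustrated with toy vectors. The 3-dimensional “embeddings” below are hand-crafted by us so that the analogy holds; real embeddings would of course come from a trained model:

```python
import numpy as np

# Toy 3-d "embeddings", constructed so that king - man + woman ≈ queen.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "car":   np.array([0.5, 0.5, 0.0]),
    "road":  np.array([0.2, 0.7, 0.3]),
}

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman, then rank the remaining words by cosine similarity.
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max(
    (w for w in vecs if w not in {"king", "man", "woman"}),
    key=lambda w: cosine(target, vecs[w]),
)
```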

Mikolov presented two different techniques to achieve effective word embeddings:

### Skipgram model

This technique tries to maximize the following equation:

$$\frac{1}{T} \sum_{t=1}^{T}\sum_{-c \leq j \leq c,\, j \neq 0} \log p(w_{t+j} \mid w_t)$$

That is: the average log probability of the context words given the central word. As a graphical representation:

Figure 3: Skipgram architecture (left) and examples of using context words around central words (right).

As we can see in the figure, starting from the word “sat”, the model tries to predict the most probable context words, which in this example are “the cat on the mat”. Now, if our training set has many sentences sharing the same context, such as “the cat laid on the mat”, the model will learn the similarity between these central words as well as the relationship between each central word and the context words. Moreover, the task of predicting context words takes order into account, so time dependencies will be encoded in the embeddings as well.

The model has a single hidden layer with linear neurons, whose activations form the word embedding itself and contain enough information to predict the context words for the central word they belong to.
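The windowing that produces Skipgram training examples is easy to sketch in plain Python. This is a simplified version (our own helper, ignoring the subsampling and negative sampling used in practice):

```python
def skipgram_pairs(tokens, window=2):
    # For each position t, emit a (center, context) pair for every word
    # within `window` positions of the center, matching the sum over j
    # in the objective above.
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

pairs = skipgram_pairs("the cat sat on the mat".split())
```

With a window of 2, the center word “sat” is paired with “the”, “cat”, “on” and “the”, but not with “mat”, which lies three positions away.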

### Continuous bag of words (CBOW)

Figure 4: CBOW architecture.

This technique is the inverse of Skipgram. Indeed, this time the inputs are the context words and the target is the central word. This means the hidden layer will contain enough information to predict the word for which it forms an embedding, starting only from its context words.

In practice, embeddings from both methods perform similarly. They outperform bag of words methods for text representation tasks, as shown in the cited paper.

Note that both approaches receive words in a 1-of-n encoding (or one-hot encoding: each word is a vector of zeros as long as the vocabulary size, with a value of 1 only on the entry corresponding to that word), and the output uses 1-of-n encoding as well. Those are sparse representations; the useful part is the hidden layer, which is a dense vector containing the word embedding.
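The 1-of-n input encoding is trivial to build. A minimal helper (our own naming):

```python
def one_hot(word, vocab):
    # A vector of zeros as long as the vocabulary, with a single 1
    # at the entry corresponding to `word`.
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

vocab = ["the", "cat", "sat", "on", "mat"]
encoded = one_hot("cat", vocab)
```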

## Doc2vec

The last technique we will review is an extension of word embeddings. In fact, the main author involved is, once again, the great Mikolov (Distributed Representations of Sentences and Documents, 2014). Doc2vec takes word embeddings one step further and creates document embeddings: vectors that contain encoded information about a whole document. Of course, this technique can be used to encode just sentences, a very popular and effective setting.

Doc2vec has two different approaches, similarly to word embeddings:

### Distributed Memory Model of Paragraph Vectors (PV-DM)

Figure 4: PV-DM architecture

This approach is similar to CBOW in the sense that we are using context words as inputs and a central word as a target. The addition is a new vector input: the document embedding (or paragraph vector). While the context word inputs are sparse (one-hot), the paragraph vector is dense and shared across all data points from the same document. In this way, for each document, the algorithm learns an embedding for each central word and learns the paragraph vector at the same time.

### Distributed Bag of Words version of Paragraph Vector (PV-DBOW)

Figure 5: PV-DBOW architecture.

For this approach, the input is just the document id. From the id alone, the algorithm must predict words sampled from that document. This approach works better when the document consists of a single sentence, since there won’t be ambiguity about which words the model must output.

In practice, PV-DM works better, although a combination of both methods works even better, and such a model is what the paper recommends.

## Experiments on financial news

In the following section, we apply both word and document embeddings to Tesla financial news; we won’t consider bags of words this time around.

We have collected news including the keyword “tesla” from a single source, Reuters. The earliest article collected was published on 11-11-2011, and the total number of articles is 3770. After collecting all the data, we discarded the parts of each article not mentioning the keyword. This decision was made because the amount of data available is not enough to train a general-purpose language model, so we must train a specialized model that focuses only on information about the company. The total number of sentences is 13631, or if you spell it backward: 13631. You gotta love palindromic numbers.

### Data Analysis

Exploring the data, we discover it has some flaws. For example, one of the lowest returns (-6.17 normalized return) happened on 13-01-2012, and there wasn’t any news about Tesla on the previous days, that day, or even the following day. A possible explanation is that in 2012 Tesla was not as big a company as it is today, and news feeds didn’t pay much attention. This data flaw could be mitigated by aggregating data from several sources.

We have decided to compare news from a given day with stock returns from the following day (stock returns from Tesla Inc). This approach is not always valid; we will illustrate this with examples:

On 5-11-2013 there was a very low return (-4.64 normalized return), and the day before, Tesla appeared in the news with sentences such as ‘Tesla sales and profit forecasts disappoint’; text like that has good predictive power. Another example, this time for a positive return, took place on 8-5-2013, when Tesla had a big upward swing (7.68 normalized return) and the day before the news read “Tesla recent fortunes and growth prospects have surprised analysts suppliers and even Tesla executives”. On the other hand, sometimes news that could predict a big swing is published the same day as the swing itself. As an example, on 16-7-2013: “… shares of us electric carmaker Tesla Motors Inc tumbled overnight following a target price downgraded by Goldman Sachs Group…”. It seems like Reuters didn’t know this was going to happen, as they only published the news after the market swing had occurred. Once again, including several data sources could help with this issue.

Having imperfect data is not a rare phenomenon in Machine Learning. We will do our best to fit our models.

### Applying embedding models

First of all, we will train a Skipgram model on our corpus. The resulting vectors will have a length of 64.

Now we show words with similar vectors according to the model. The similarity will be measured using the Cosine Distance metric.

• ‘growth’: ‘grow’, ‘shortages’, ‘grown’, ‘sustainability’, ‘shortage’
• ‘surprised’: ‘surprises’, ‘convinced’, ‘surprise’, ‘wed’, ‘surprising’
• ‘removed’: ‘removing’, ‘removes’, ‘moved’, ‘loved’, ‘misrepresented’
• ‘crash’: ‘accident’, ‘crashed’, ‘crashing’, ‘fatal’

The model does a good job at finding words with a similar lexical form, but it has a hard time finding words with similar meaning but a different form. However, it manages to do so for the word “crash”. We can also find some signs of overfitting. For example, it seems like “growth shortages” are a common theme in the corpus; consequently, the word vector for “shortages” is very similar to the word vector for “grow”. As an additional remark: weddings are a surprising event to the model!

We can do better than just looking at single vectors. We can perform vector addition and then look for vectors close to the resulting sum. Combined vectors tend to describe a news article much better. We will now combine words that make up highlights of some articles:

• ‘growth’ + ‘prospects’ + ‘surprised’: ‘sustained’, ‘surprises’, ‘bemoaned’, ‘concerned’, ‘delighted’
• ‘commission’ + ‘removed’ + ‘chief’: ‘removing’, ‘directed’, ‘collected’, ‘thundered’

Once again, we see overfitting signs. Indeed, our dataset is not that big and the combinations we tried are likely to appear in just one article, not enough frequency by any means for the model to learn the patterns without overfitting.

We also tried to fit a sentence embedding model, that is, applying the document embedding technique we described earlier and treating each sentence as a document. Vectors will have a length of 128, twice the length we used for word embeddings. Data variance is higher now that words combine into sentences, so the model must learn more dependencies.

Let’s look for the sentences most similar to each target sentence:

• Target: ‘tesla automotive gross margins dropped in the quarter to <mediumNum> % from <mediumNum> %’
  ◦ ‘tesla recommend to a friend rating fell to <mediumNum> % in the first quarter from a high of <mediumNum> % two years prior the glassdoor data showed’
  ◦ ‘tesla leapt to fifth place from the <mediumNum> th spot during the first quarter’
  ◦ ‘tesla incinerated more than <bigNum> million of greenbacks in the quarter to june’
• Target: ‘the securities and exchange commission is pushing to have elon musk removed as both chief executive and board member at the <mediumNum> billion electric’
  ◦ ‘chief executive elon musk contemplated <mediumNum> billion take’
  ◦ ‘chief executive elon musk told employees on tuesday that the <mediumNum> billion electric’
  ◦ ‘shares jumped almost <mediumNum> percent after chief executive elon musk said on twitter he is considering taking the electric car maker private at <bigNum> per share’
• Target: ‘tesla sales and profit forecasts disappoint’
  ◦ ‘electric car maker tesla motors inc forecast profit and sales below wall street estimates and reported third’
  ◦ ‘tesla reports q <smallNum> sales and production missed targets’

The model seems to find somewhat related sentences, but the similarity comes from very localized parts of the sentences. For instance, in the first example, all sentences involve quarterly statistics with numbers quantifying the changes, but the retrieved sentences include upward as well as downward movements, so the model didn’t capture that decisive difference. In the second example, all sentences talk about “chief Musk” but, apart from that, are very different from the target. The third example includes sentences about Tesla sales forecasts; this time around, the retrieved sentences are indeed quite similar.

All these examples suggest we have a model that achieves decent text representations but misses some key differences relevant to the financial markets. It also overfits to some degree.

### Using embeddings for return prediction

Just to get a general metric of the model, we will try to predict next-day return movements from the current day’s news alone. We will perform the task using our whole dataset, training on 80% of the data and testing on the remaining 20%. It must be said that predicting returns exclusively from the previous day’s news is quite the task, not easy by any means. For the experiment we will apply both embedding techniques:

• Word embeddings: An LSTM will receive all word vectors from all articles published on the previous day and it will output a prediction for the current day return.
• Sentence embeddings: An LSTM will receive all sentence vectors from all articles published on the previous day and it will output a prediction for the current day return.

The LSTM will have 16 hidden units and dropout of 0.5 to avoid overfitting. Connected to the LSTM, there will be a Dense layer with 8 units and a dropout of 0.4. Other models were tried but they resulted in overfitting or underfitting.
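The architecture just described can be sketched in Keras. This is only a sketch, not the exact code used for the experiments; the input dimension of 64 corresponds to the word embedding case (128 for sentence embeddings):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# A variable-length sequence of 64-d embedding vectors for one day goes in;
# a single next-day return estimate comes out.
model = keras.Sequential([
    layers.Input(shape=(None, 64)),
    layers.LSTM(16, dropout=0.5),
    layers.Dense(8, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# One dummy day with 10 word vectors -> one predicted return.
pred = model.predict(np.zeros((1, 10, 64)), verbose=0)
```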

The following figures plot the predictions for the training and validation datasets using either word2vec or doc2vec. As we can see, the model does a good job on the training set but fails on the validation set. When we applied stronger regularization (dropout > 0.5, l1_l2 regularization), the results on the training set got worse while the validation results didn’t improve. This was expected to some extent due to the deficiencies shown by the word representation models. If the vector representations are not encoding the information properly, then we cannot use them to predict returns.

Figures 6a, 6b, 6c and 6d (top to bottom): Predicted returns vs real returns for training and validations sets and using both methods: Doc2Vec and Word2Vec

For completeness, we show another metric, the signed accuracy, which is nothing more than the success ratio when trying to guess whether the returns will be positive or negative. This is an easier task; unfortunately, the models struggle with it as well.
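Signed accuracy is straightforward to compute. A minimal helper of our own (treating zero returns as non-positive):

```python
def sign_accuracy(y_true, y_pred):
    # Fraction of days on which the predicted return has the same sign
    # as the realized return.
    hits = sum((t > 0) == (p > 0) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Three of four sign calls are correct here.
score = sign_accuracy([1.2, -0.5, 0.3, -0.1], [0.4, -0.2, -0.6, -0.9])
```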

Figure 7: Sign accuracy for Word2vec and Doc2vec. Train and test scores displayed.

### Final notes

To wrap up, the author kindly sums up some recommendations for brave data scientists willing to give NLP a try:

• Learning text representations first and then using them for market prediction is perhaps too much to ask, due to the high variance found in texts. Your model may learn representations that just so happen to be not very useful from a financial perspective. The solution is to train the embeddings on labeled data: instead of just giving raw text to the algorithm, you can label each sentence with financial information, for example the returns themselves. Just make sure you split your train and test sets beforehand.
• Use many different data sources to avoid bias or general weakness coming from a single source.
• Be willing to predict over periods shorter than a day. News information often comes hours before the market swing happens. Investors are usually fast at adapting to recent events, and you are competing against them.

If you find the post useful or entertaining, kindly let me know so I can celebrate. If, on the other hand, you have some complaints, I will be happy to receive your feedback and hopefully get a valuable lesson out of it. Finally, if you have performed similar experiments with better results, tell me about it so I can be properly jealous of you.

Hope you have a good day.