Markets are said to be driven by randomness, but this does not mean that they are 100% random and therefore completely unpredictable. In the end, there are always people behind investments, and many of them make decisions based on what they read in the news. We will try to estimate the returns of a time series, namely Bitcoin, using only text data from relevant articles. BERT, a deep learning network for NLP, will be used to do sentiment analysis on the text.
I’ve chosen Bitcoin for this experiment because its value is enormously volatile and swings with sudden waves of hype and fear, which are usually reflected in the news. Although Bitcoin is a cryptocurrency and not a stock, strictly speaking, it can be bought and sold in the same fashion. This makes it perfectly suitable for our needs.
Text regression: What are we up to?
The idea in this post is to use NLP to do text regression. This technique consists of encoding input text as numerical vectors and then using those vectors in a regression analysis to estimate an output value. In our case, the input data will be text from articles related to Bitcoin, encoded and transformed with BERT, and the target value will be the returns of Bitcoin’s close values on the publication date of those articles.
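To make the target side of this regression concrete, here is a minimal sketch that turns a list of closing prices into daily returns. The prices and the function name `simple_returns` are made up for the example, and I am assuming simple (not logarithmic) returns, so take the formula as one reasonable choice rather than the only one.

```python
def simple_returns(closes):
    """Return r_t = (P_t - P_{t-1}) / P_{t-1} for each consecutive pair of closes."""
    return [(today - prev) / prev for prev, today in zip(closes, closes[1:])]

# Hypothetical closing prices for four consecutive days (not real Bitcoin data).
closes = [100.0, 110.0, 99.0, 99.0]
returns = simple_returns(closes)  # one value per day, starting from the second day
```

Each return would then be paired with the text features of the articles published on the same day.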

Background check: What is BERT?
BERT (Bidirectional Encoder Representations from Transformers) is a neural net designed to be adapted to a wide range of NLP problems. Developed by researchers at Google, it soon became the state of the art, breaking records in many different NLP benchmarks (paper).
Language modeling networks are usually trained by hiding a word in each sentence and trying to predict it from the words before it or, alternatively, from the words after it. For example, we hide the word “mat” in the sentence:
“The cat sat on the ____.”
The network will try to guess the word “mat” from the preceding ones. Given millions of sentences, these networks learn to fill in the blank accurately by looking at the text either before or after the hidden word, but not both. If such a network were naively given context from both sides, then in a multi-layer model each word could indirectly “see itself”, so the prediction would become trivial and nothing useful would be learned. This is where BERT’s core novelty comes into play, since it is designed to learn from the text both before and after the hidden words (that is what Bidirectional means). BERT makes this possible with its Masked LM (MLM) training objective, which hides a random subset of the input words and predicts only those, and it adds a second objective, Next Sentence Prediction (NSP). The details of both are out of the scope of this post. This lets BERT have a much deeper sense of language context than previous solutions.
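The masking step itself is easy to picture with a toy example. The sketch below is only an illustration with names of my own (`mask_tokens`, `MASK`): it hides each token with a fixed probability and remembers the hidden words as training labels. The original BERT recipe masks about 15% of the tokens and adds a few extra replacement rules, which I leave out here.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide each token with probability mask_prob; return the masked list and the labels."""
    rng = random.Random(seed)  # seeded so the toy example is repeatable
    masked, labels = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels[position] = token  # the word the model has to recover
        else:
            masked.append(token)
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
```

During training, the model sees every unmasked word in `masked`, on both sides of each gap, and is scored only on the positions stored in `labels`.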

Creating the dataset
To retrieve articles related to Bitcoin I used some awesome Python packages which came in very handy, like google search and news-please. The former emulates a Google search and retrieves a set of URLs. The latter extracts a lot of information from an article (publication date, authors, main title, main text, etc.) given a URL.
Combining these two, I retrieved the top 5 articles written in English that came up in Google News for the search term “Bitcoin | Cryptocurrency”, for every day between 2019-01-01 and 2020-03-19.
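The day-by-day loop behind this collection can be sketched with the standard library alone. The helper name `daily_queries` is hypothetical, and I deliberately leave out the actual search and scraping calls because their exact signatures depend on the package versions installed.

```python
from datetime import date, timedelta

def daily_queries(start, end, term, per_day=5):
    """Yield (day, term, per_day) for every day from start to end, both inclusive."""
    day = start
    while day <= end:
        yield day, term, per_day
        day += timedelta(days=1)

queries = list(daily_queries(date(2019, 1, 1), date(2020, 3, 19), "Bitcoin | Cryptocurrency"))
n_days = len(queries)        # number of daily searches to run
max_articles = n_days * 5    # upper bound if every day returns 5 usable articles
```

The period covers 444 days, so 5 articles per day gives an upper bound of 2220 articles. A few searches returning fewer usable results would bring the real total slightly below that.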
Being able to simulate a Google search means that the articles at the top of the list are the ones Google ranks as most relevant, and one could guess that they are also the ones that had the most impact. The bar between the two keywords in the search term acts as an OR operator. In total, I collected a dataset of 2210 articles. Here you have a small sample:

Sentiment analysis with BERT
Here comes the interesting part: it’s time to extract the sentiment of all the text we’ve just gathered. BERT is a heavyweight when it comes to computational resources so, after some tests, I decided to work only with the text in the title and description of each article. I split all these pieces of text into sentences. In the end, my dataset consisted of a bunch of sentences grouped by the day they were published. I used a version of BERT available through the Huggingface transformers library which is pre-trained to do sentiment analysis on product reviews. Given a product review, it predicts its “sentiment” as a number of stars (between 1 and 5). Even though product review text and newspaper text are fairly different, we will see that this model works surprisingly well on our data. Here you have an example:
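# tokenizer and bert_model are the pre-trained tokenizer and sentiment model, loaded beforehand
sentence = "Bitcoin futures are trading below the cryptocurrency's spot price"
sentence_ids = tokenizer.encode(sentence)
bert_model.predict([sentence_ids])
# Output: one raw (unnormalized) score per star class
#    1 star      2 stars    3 stars    4 stars     5 stars
# [[ 0.62086743  0.7408671  0.599566  -0.50914824 -1.2169912 ]]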
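The five numbers in this output are raw scores, not probabilities (some of them are negative). If you want to read them as probabilities, a softmax does the job. The sketch below applies it to the scores from the example. It is only meant to help interpret the output and is not necessarily a step of the pipeline in this post, and the names `softmax`, `best_stars` and `expected_stars` are mine.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Raw scores from the example sentence, ordered from 1 star to 5 stars.
logits = [0.62086743, 0.7408671, 0.599566, -0.50914824, -1.2169912]
probs = softmax(logits)

best_stars = probs.index(max(probs)) + 1  # most likely star rating
expected_stars = sum(star * p for star, p in enumerate(probs, start=1))  # probability-weighted rating
```

For this sentence the largest score belongs to the 2-star class, so the model reads the headline as mildly negative.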
The transformer comes in two parts: the main model, in charge of making the sentiment predictions, and the tokenizer, used to transform the sentence into ids which the model can understand. The tokenizer does this by looking up each word (or piece of a word) in a vocabulary and replacing it with its id. Before making predictions for all the sentences in our dataset we need to make sure that the model understands the most important words related to bitcoin. After word-counting my sentences based on a set of keywords, I added these terms to the tokenizer: [‘bitcoin’, ‘cryptocurrency’, ‘crypto’, ‘cryptocurrencies’, ‘blockchain’]. Each of them got assigned its own id. Adding these terms lets the network treat them as individual words; otherwise they would have been broken into smaller sub-word pieces or replaced by the id reserved for “unknown words”.
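To see what happens to a word that is missing from the vocabulary, here is a toy version of the greedy longest-match splitting that WordPiece-style tokenizers such as BERT's use. The miniature vocabulary and the function name `wordpiece` are invented for the illustration. Depending on the vocabulary, an unseen word is either split into smaller pieces or mapped to the unknown token, and adding the term avoids both.

```python
UNK = "[UNK]"

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one lowercase word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [UNK]  # no piece fits, so the whole word becomes "unknown"
        pieces.append(match)
        start = end
    return pieces

# Hypothetical miniature vocabulary (a real BERT vocabulary has tens of thousands of entries).
vocab = {"bit", "##co", "##in", "futures", "price"}

before = wordpiece("bitcoin", vocab)   # split into sub-word pieces
unknown = wordpiece("crypto", vocab)   # nothing matches at all

vocab.add("bitcoin")                   # roughly the effect of adding the term
after = wordpiece("bitcoin", vocab)    # now a single token
```

With the Huggingface transformers library the corresponding calls are `tokenizer.add_tokens([...])` followed by `model.resize_token_embeddings(len(tokenizer))`, since the model needs an embedding row for each new id. Keep in mind that those new embeddings start out untrained.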
5-star predictions to stock returns
Afterward, BERT made 5-star predictions for all the sentences, just as if they were reviews of products available on Amazon. For each day, I averaged each of the five star scores over the sentences published that day, and I trained a simple LSTM network on the resulting data.
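# Assumed imports (the post does not show them)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Flatten, TimeDistributed

def lstm_model(n_steps):
    n_stars = 5  # one input feature per star class
    model = Sequential()
    # Input: n_steps consecutive days, each with the 5 averaged star scores
    model.add(LSTM(100, input_shape=(n_steps, n_stars), return_sequences=True))
    model.add(TimeDistributed(Dense(20, activation='elu')))
    model.add(Flatten())
    # Single output: the predicted return
    model.add(Dense(1, activation='elu'))
    model.compile(loss='mean_squared_error', optimizer='adam')
    model.summary()  # summary() prints by itself and returns None
    return model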
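The network expects inputs of shape (n_steps, n_stars), so the per-sentence scores have to be averaged per day and then cut into windows of consecutive days. Here is a minimal sketch of those two steps in plain Python. The helper names and the tiny dataset are made up, and real code would convert the lists to arrays before fitting.

```python
from collections import defaultdict

def daily_star_means(scored_sentences, n_stars=5):
    """Average the per-sentence star scores of each day, star by star."""
    by_day = defaultdict(list)
    for day, scores in scored_sentences:
        by_day[day].append(scores)
    return {
        day: [sum(row[k] for row in rows) / len(rows) for k in range(n_stars)]
        for day, rows in sorted(by_day.items())  # ISO dates sort chronologically
    }

def make_windows(daily_features, targets, n_steps):
    """Pair each run of n_steps consecutive days with the following day's target."""
    X, y = [], []
    for end in range(n_steps, len(daily_features)):
        X.append(daily_features[end - n_steps:end])
        y.append(targets[end])
    return X, y

# Hypothetical scores: two sentences on the first day, one on each of the next two.
scored = [
    ("2019-01-01", [1.0, 0.0, 0.0, 0.0, 0.0]),
    ("2019-01-01", [0.0, 1.0, 0.0, 0.0, 0.0]),
    ("2019-01-02", [0.0, 0.0, 1.0, 0.0, 0.0]),
    ("2019-01-03", [0.0, 0.0, 0.0, 0.0, 1.0]),
]
means = daily_star_means(scored)
features = list(means.values())
targets = [0.01, -0.02, 0.03]  # made-up daily returns, aligned with the days
X, y = make_windows(features, targets, n_steps=2)
```

Each element of `X` is a block of n_steps rows with 5 values each, which matches the `input_shape` of the first LSTM layer.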
I trained the model with last year’s data (articles between 1/1/2019 and 1/11/2019) and tested it on the remaining data, from November 2019 up to 19/3/2020. After I played with the network’s hyperparameters for a while, it yielded the following result.

What just happened?
I know, it’s far from perfect. But if you look a little closer you will notice that there are trends which the BERT and LSTM duo has recognized. The predictions start in November and correctly go down with the real series. Christmas is a complete mess, since our model keeps predicting a downward trend while Bitcoin is rising abruptly. Then we can see how the model reacts and reaches a maximum which, beautifully, coincides with the one in the real series. As for the predictions at the beginning of March, no comment. In fairness, such a stark change is difficult to foresee (even for newspapers?).
In general, the model can predict small peaks and valleys more or less accurately. This is a very good sign if we take into account that the model is not using anything but text as input. Take a look at the figure with the raw returns below. The quality of our data is not especially good. In the end, we are using the top 5 articles which are most “relevant” according to Google (whatever that means). If we collected the text more thoroughly we would probably obtain better results.

Raw vs predicted returns for the test period
Tesla: a second opinion
I was curious about how this approach would do in a different case. Tesla is a stock which behaves pretty crazily. Elon Musk writes an enigmatic tweet and TSLA stock shakes. This is perfect for our experiment. I followed the exact same process. First, I retrieved the top 5 articles for each day. Then I passed the text through BERT and applied an LSTM network at the end. Here are the results:

We can see how the model predictions follow the initial upward trend and identify some valleys in January and February. However, the model completely ignores the price drop in March (but who saw that coming, right?). Here you have the raw returns:

Key takeaways
The LSTM used sequences of 10 timesteps (that is, using data from the past 10 days to predict tomorrow’s returns). When using a larger or smaller number of timesteps the predictions became unstable. To gain stability, we could use the price difference across days as the target value instead of returns. Another option is to design a strategy that predicts a fixed growth depending on the positivity or negativity of the sentiment. To conclude this post, I want to highlight the following points:
- It seems that news articles influence market movements at times.
- BERT is an extremely powerful network capable of solving many NLP tasks, sentiment analysis among them.
- NLP models based on news data could be useful to complement investment portfolio strategies.
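The fixed-growth idea mentioned in this section could look something like the sketch below. Everything in it is an assumption made for illustration: the function name, the 1% growth, the neutral point at 3 stars and the dead band around it. None of it has been tested on real data.

```python
def fixed_growth_signal(expected_stars, growth=0.01, neutral=3.0, band=0.25):
    """Map a day's average star rating (1 to 5) to a fixed predicted return."""
    if expected_stars > neutral + band:
        return growth    # clearly positive sentiment: predict a fixed rise
    if expected_stars < neutral - band:
        return -growth   # clearly negative sentiment: predict a fixed drop
    return 0.0           # close to neutral: predict no change

positive_day = fixed_growth_signal(4.2)
negative_day = fixed_growth_signal(2.3)
neutral_day = fixed_growth_signal(3.1)
```

Because the output can only take three values, a rule like this cannot become unstable the way a regression on raw returns can, at the cost of ignoring the size of the move.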
To check the code used for this post, take a look at my GitHub repository.
Hi Juan, thanks for the great post. I’ve also been developing market forecasting models based on sentiment. In my case, for the FX market at least, using larger time frames seemed to perform better. For instance, averaging sentiment from 30 days in the past to forecast return (and not actual prices) 2 weeks into the future.