News has a huge impact on the global stock market, but even professionals cannot stay constantly up to date. One may ask: is there any way to automate this process?
Recently in this blog we covered the topic of summarizing news with the TF-IDF algorithm, a powerful way to automatically extract key insights from a given text.
In this post, we will try to estimate the NASDAQ 100 index price by training a Deep Learning model that takes as input the TF-IDF values from a set of news articles.
Data preparation
The first step is to extract the TF-IDF values from a given set of texts. In this case we will use the following dataset of news, which contains 215,447 entries.
For our purpose, we will only use the article body and the publish date, which tells us when each article was originally published.

In the previous figure we see that there isn’t enough data for 2010 and 2011, so we will carry out our study from 2012 onwards.
After generating and merging all the TF-IDF vectors, the resulting input for our model looks like this:
date | gold | release | debt | … | level |
2012-01-01 | NaN | 0.132733 | 0.123489 | … | 0.055327 |
2012-01-02 | 0.067420 | NaN | 0.062239 | … | NaN |
… | … | … | … | … | … |
2020-01-01 | NaN | NaN | 0.030399 | … | 0.025132 |
The previous matrix contains missing values because not every term appears in every document, which leaves some cells empty.
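To make the shape of this matrix more concrete, here is a minimal sketch of how one TF-IDF row per day could be built. It is not necessarily the pipeline used for this post: we assume that all articles from the same day are merged into one document, that tokens are split on whitespace, and that the plain tf × log(N / df) weighting is used. The helper names `daily_tfidf` and `to_matrix` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def daily_tfidf(articles):
    """articles: iterable of (publish_date, body) pairs.

    Returns {date: {term: tfidf}}. Terms that do not appear on a
    given date are simply absent from that date's dictionary.
    """
    # Assumption: all articles published on the same day form one document.
    tokens_by_day = defaultdict(list)
    for date, body in articles:
        tokens_by_day[date].extend(body.lower().split())

    n_days = len(tokens_by_day)

    # Document frequency: number of days on which each term appears.
    df = Counter()
    for tokens in tokens_by_day.values():
        df.update(set(tokens))

    scores = {}
    for date, tokens in tokens_by_day.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[date] = {
            term: (count / total) * math.log(n_days / df[term])
            for term, count in counts.items()
        }
    return scores


def to_matrix(scores, vocabulary):
    """One row per date, one column per term, NaN where the term is absent."""
    return {
        date: [row.get(term, float("nan")) for term in vocabulary]
        for date, row in sorted(scores.items())
    }


# Toy data, invented for illustration only.
articles = [
    ("2012-01-01", "gold debt level"),
    ("2012-01-02", "gold release"),
    ("2012-01-02", "release debt"),
]
matrix = to_matrix(daily_tfidf(articles), ["gold", "release", "debt", "level"])
```

In this toy example "release" never appears on 2012-01-01, so that cell is NaN, exactly like the empty cells in the matrix above. Before training, these gaps have to be handled somehow (for instance by filling them with zeros), which is a design choice the sketch leaves open.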
Model training
Once the features are in the correct form, we train our model to predict the expected return for the next day, using the most recent TF-IDF values as input. The dataset is split chronologically into training (2012-01-01 / 2018-11-30) and test (2018-12-01 / 2020-02-01).
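As an illustration of this setup, the following sketch builds a next-day return target and splits it by date. The split boundaries come from the text above; the use of simple returns, the helper names and the toy prices are assumptions made for the example.

```python
from datetime import date

# Split boundaries taken from the text.
TRAIN_START = date(2012, 1, 1)
TRAIN_END = date(2018, 11, 30)
TEST_END = date(2020, 2, 1)


def next_day_returns(prices):
    """prices: list of (date, close) sorted by date.

    Returns {date: simple return realised on the following trading day}.
    Assumption: simple returns, r = p[t+1] / p[t] - 1.
    """
    return {
        d0: p1 / p0 - 1.0
        for (d0, p0), (_, p1) in zip(prices, prices[1:])
    }


def split_by_date(samples):
    """Chronological split: no shuffling, so the test set is strictly later."""
    train = {d: v for d, v in samples.items() if TRAIN_START <= d <= TRAIN_END}
    test = {d: v for d, v in samples.items() if TRAIN_END < d <= TEST_END}
    return train, test


# Toy prices, invented for illustration only.
prices = [
    (date(2018, 11, 29), 100.0),
    (date(2018, 11, 30), 102.0),
    (date(2018, 12, 3), 101.0),
    (date(2018, 12, 4), 103.0),
]
targets = next_day_returns(prices)
train, test = split_by_date(targets)
```

Splitting by date rather than at random matters here: a random split would let the model see news published after some of the days it is evaluated on.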
The model consists of 3 LSTM layers with 16 units each, followed by 4 Dense layers with 32 and 16 units. Since the prediction is not bounded, the activations are linear.
The model is trained for 300 epochs, but around epoch 45 the training loss reaches a plateau, indicating that the model is no longer learning.

This behaviour usually means that the model is too simple or that the data doesn’t have an exploitable relationship with the target variable.
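A plateau like this one can also be detected automatically instead of by looking at the loss curve. The function below is a small sketch of that idea; `plateau_epoch`, `patience` and `min_delta` are hypothetical names and values chosen for the example, and the loss curve is synthetic.

```python
def plateau_epoch(losses, patience=10, min_delta=1e-4):
    """Return the 0-based epoch of the last meaningful improvement,
    once the loss has failed to improve by at least `min_delta`
    for `patience` consecutive epochs. Return None if that never happens.
    """
    best = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            # Meaningful improvement: remember it and keep going.
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: the curve has flattened.
            return best_epoch
    return None


# Synthetic curve: the loss drops for 45 epochs, then stays flat until epoch 300.
losses = [1.0 - 0.02 * e for e in range(45)] + [0.12] * 255
stalled_at = plateau_epoch(losses)
```

Keras offers the same logic through its `EarlyStopping` callback, which stops training once the monitored loss stops improving and would avoid running the remaining epochs.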
Model validation
As we can see in the following figure, our model is not able to correctly anticipate the NASDAQ returns. In fact, the curve is almost flat, meaning that the model barely learns anything from the data.

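If we generate the expected price from the predicted returns (multiplying yesterday’s price by the expected gross return, that is, one plus the predicted return), we see that the curve doesn’t tell us much. Because the predicted returns are almost constant and close to zero, each predicted price is essentially yesterday’s price, so the predicted series lags 1 day behind the original.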
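The lag is easy to reproduce with a few lines of code. In this sketch the prices and the nearly constant predicted returns are invented for illustration, and `reconstruct_prices` is a hypothetical helper, not the code used for the figure.

```python
def reconstruct_prices(prices, predicted_returns):
    """One-step-ahead forecast: price_hat[t] = price[t-1] * (1 + r_hat[t])."""
    return [
        p_prev * (1.0 + r)
        for p_prev, r in zip(prices[:-1], predicted_returns)
    ]


# Toy values, invented for illustration only.
prices = [100.0, 103.0, 101.0, 104.0, 102.0]
flat_returns = [0.0005] * (len(prices) - 1)  # an almost flat prediction

# forecast[i] is the prediction for prices[i + 1].
forecast = reconstruct_prices(prices, flat_returns)

# Average distance to the price the forecast is supposed to predict...
error_vs_target = sum(
    abs(f - p) for f, p in zip(forecast, prices[1:])
) / len(forecast)

# ...and to the price of the day before.
error_vs_yesterday = sum(
    abs(f - p) for f, p in zip(forecast, prices[:-1])
) / len(forecast)
```

The forecast is much closer to the previous day’s price than to the price it is meant to predict, which is exactly the 1-day lag visible in the figure.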

Conclusion
Although TF-IDF is a powerful algorithm, the results show that it is not an ideal indicator for predicting stock prices.
This lack of predictive power might come from the difficulty TF-IDF has in capturing relationships between words, or from the fact that our dataset contains news that does not anticipate the future but reflects the current market situation.
To tackle this issue, we could use more sophisticated tools such as Word2Vec or Transformers to extract features from the news, which might improve our overall performance.
Would you dare to try them?