Using PyTorch to test the attention mechanism applied to time series forecasting.
Introduction
In the previous post we saw what Transformers are and how they work in their basic form. In this post we will develop one possible way of adapting the original design, which was created to target NLP tasks [1], to time series applications.
We will use PyTorch to implement some models and test the results over different datasets.
Adapting the Embedding
Since the original design of the Transformer targets NLP, we need a different embedding system to encode our time series. This is where Time2Vec [3] comes into play.
Time2Vec [3] provides a model-agnostic vector representation for time, and it is defined as follows:
$$
t2v(\tau)[i] = \begin{cases}
\omega_{i} \tau + \phi_{i}, & \mbox{if } i = 0\\
\mathcal{F} (\omega_{i} \tau + \phi_{i}), & \mbox{if } 1 \leq i \leq k.
\end{cases}
$$
where \( t2v(\tau)[i] \) is the \( i^{th} \) element of \( t2v( \tau) \), \( \mathcal{F} \) is a periodic activation function, and \( \omega_{i} \) and \( \phi_{i} \) are learnable parameters.
I used the implementation provided by Chicheng Zhang, which can be found in his repo.
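To make the definition concrete, below is a minimal sketch of a Time2Vec layer in PyTorch. It is not the exact code from that repo; the choice of sin as the periodic activation \( \mathcal{F} \) and the tensor shapes are assumptions on my part.

```python
import torch
import torch.nn as nn


class Time2Vec(nn.Module):
    """Minimal Time2Vec sketch: one linear term plus k periodic terms."""

    def __init__(self, in_features: int, k: int):
        super().__init__()
        # i = 0 component: omega_0 * tau + phi_0 (linear term)
        self.w0 = nn.Parameter(torch.randn(in_features, 1))
        self.b0 = nn.Parameter(torch.randn(1))
        # 1 <= i <= k components: F(omega_i * tau + phi_i) (periodic terms)
        self.w = nn.Parameter(torch.randn(in_features, k))
        self.b = nn.Parameter(torch.randn(k))

    def forward(self, tau: torch.Tensor) -> torch.Tensor:
        # tau: (batch, seq_len, in_features)
        linear = tau @ self.w0 + self.b0               # (batch, seq_len, 1)
        periodic = torch.sin(tau @ self.w + self.b)    # (batch, seq_len, k)
        return torch.cat([linear, periodic], dim=-1)   # (batch, seq_len, k + 1)
```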
Experiment Setup
Adapting the embedding will be enough for now, since we want to test the attention mechanism described in [1] before moving on to more complex frameworks.
We will measure the contribution of the attention and embedding systems by comparing models against each other. Four models will be used:
- Vanilla LSTM: an LSTM network without any embedding or attention mechanism.
- Attention LSTM: an LSTM network coupled with the attention mechanism of [1].
- Embedding LSTM: an LSTM network coupled with the Time2Vec [3] embedding system.
- Attention Embedding LSTM: an LSTM boosted by both the attention [1] and embedding [3] mechanisms.
By modularizing the models we will be able to measure the contribution of each part to the overall performance.
LSTMs are a natural choice of backbone because the architecture is designed to handle sequences.
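As a rough illustration of how the four variants can be assembled, the sketch below combines an LSTM backbone with an optional Time2Vec embedding and an optional scaled dot-product self-attention layer (here PyTorch's nn.MultiheadAttention with a single head). The hidden size, the single-head choice and the one-step-ahead output are assumptions, not the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn


class ForecastLSTM(nn.Module):
    """LSTM forecaster with optional Time2Vec embedding and optional attention."""

    def __init__(self, n_features: int, hidden: int = 32,
                 t2v_k: int = 0, use_attention: bool = False):
        super().__init__()
        # Reuses the Time2Vec sketch above when t2v_k > 0.
        self.embed = Time2Vec(n_features, t2v_k) if t2v_k > 0 else None
        lstm_in = n_features + t2v_k + 1 if self.embed is not None else n_features
        self.lstm = nn.LSTM(lstm_in, hidden, batch_first=True)
        # Single-head scaled dot-product self-attention over the LSTM outputs.
        self.attn = (nn.MultiheadAttention(hidden, num_heads=1, batch_first=True)
                     if use_attention else None)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, lookback, n_features)
        if self.embed is not None:
            x = torch.cat([x, self.embed(x)], dim=-1)
        out, _ = self.lstm(x)                 # (batch, lookback, hidden)
        if self.attn is not None:
            out, _ = self.attn(out, out, out)
        return self.head(out[:, -1])          # one-step-ahead prediction
```

The four variants would then correspond to `ForecastLSTM(n)`, `ForecastLSTM(n, use_attention=True)`, `ForecastLSTM(n, t2v_k=8)` and `ForecastLSTM(n, t2v_k=8, use_attention=True)`, with the value of k again being an assumption.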
The datasets the models will be tested on are:
- Sine Waves: simple, easy-to-predict sine waves.
- White Noise: stationary white noise with mean 0 and standard deviation 0.01.
- Venice High Waters: a dataset containing the water levels of the city of Venice at different timestamps.
- XAUUSD stock returns: (presumably) non-stationary relative returns of the XAUUSD commodity.
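For the synthetic datasets, the generation could look something like the sketch below. Only the noise parameters (mean 0, standard deviation 0.01) come from the description above; the series length, sine period, seed and the returns formula are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed

t = np.arange(2000)                            # series length is an assumption
sine_waves = np.sin(2 * np.pi * t / 50)        # simple sine wave, period 50
white_noise = rng.normal(loc=0.0, scale=0.01, size=t.shape)  # mean 0, std 0.01

# For XAUUSD, relative returns from a price series `prices` (not shown here):
# returns = prices[1:] / prices[:-1] - 1
```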
The code for the experiments can be found in this GitHub repository.
A note on Hyperparameters and feature engineering
Since the main objective is to compare the models under similar conditions, I have not performed any feature engineering or hyperparameter optimization.
The only hyperparameter tuning done for these experiments was to ensure the models have an approximately equal number of parameters, to keep the comparison as fair as possible.
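A quick way to check this in PyTorch is to compare trainable parameter counts; the helper below is a small sketch, and the model names in the comment are hypothetical.

```python
def count_parameters(model) -> int:
    """Total number of trainable parameters of a PyTorch module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Hidden sizes can then be adjusted until, e.g., count_parameters(vanilla_lstm)
# and count_parameters(attention_embedding_lstm) are roughly equal.
```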
Results
Loss results
The losses don’t tell us much: there is no clear winner in terms of convergence speed or overfitting.

Predictions
Sine waves
All models fit the sine waves almost perfectly, though the models with an attention mechanism produce an odd spike near the end of almost every feature.

Random noise
We can tell at a glance that the models fail to capture the variance of the data. The poor performance on this dataset may be due to the length of the lookback window, which was set to 20: the models probably need more than 20 periods of history to make a successful prediction.
That said, since the data is white noise, we do not really expect the models to capture the process anyway.
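For reference, the lookback windowing mentioned above amounts to something like the following sketch; only the window length of 20 comes from the setup, the helper itself is illustrative.

```python
import numpy as np


def make_windows(series: np.ndarray, lookback: int = 20):
    """Split a 1-D series into (window, next value) pairs for supervised training."""
    X, y = [], []
    for i in range(len(series) - lookback):
        X.append(series[i:i + lookback])   # the past `lookback` observations
        y.append(series[i + lookback])     # the value to predict
    return np.stack(X), np.array(y)
```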

Venice High Waters
The best model seems to be the Embedding LSTM, followed by the Attention Embedding LSTM. Nevertheless, the differences in performance among the models are small, and all of them seem to capture the underlying process.

Stock Returns
Similar to what we observed on the random noise dataset, the models are unable to capture the variance of the data. Apart from the window length and the fact that returns can be considered close to random noise, another reason could be that stock returns are non-stationary (at least we assume so, even though we cannot prove it statistically [4]). All models have similar performance metrics.

Conclusions
In this post we have tested the performance of the attention mechanism described in [1] using different models with a similar number of parameters. The assessment was carried out over four datasets, and the results show little improvement in prediction quality.
One of the reasons this attention mechanism is not behaving as expected may be, as stated in [2], that “The typical attention mechanism reviews the information at each previous time step and selects relevant information to help generate the outputs; however, it fails to capture temporal patterns across multiple time steps.”
Of course, the models applied to random noise datasets are less likely to perform well since the process is, well, random (:p).
In future posts we will review other types of attention mechanisms specifically tailored for time series forecasting.
References
- [1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin – Attention Is All You Need.
- [2] Shun-Yao Shih, Fan-Keng Sun, Hung-yi Lee – Temporal Pattern Attention for Multivariate Time Series Forecasting.
- [3] Seyed Mehran Kazemi, Rishab Goel, Sepehr Eghbali, Janahan Ramanan, Jaspreet Sahota, Sanjay Thakur, Stella Wu, Cathal Smyth, Pascal Poupart, Marcus Brubaker – Time2Vec: Learning a Vector Representation of Time.
- [4] Yves-Laurent Kom Samo – Stationarity and Memory in Financial Markets.