In recent years we’ve seen an increase in the accuracy of NLP models through the use of Transformers. These models rely on the attention mechanism to identify key features. But how do they work? And, most importantly, can we somehow use them in finance?

## Transformers

The transformer is a relatively new network architecture that is **based solely on attention mechanisms**, dispensing with recurrence and convolutions entirely. It consists of an encoder-decoder architecture: the encoder maps an input sequence of symbol representations \( (x_{1}, \dots, x_{n}) \) to a sequence \( z=(z_{1}, \dots, z_{n}) \), while the decoder, given \( z \), generates an output sequence \( (y_1, \dots, y_m) \). At each time step the model consumes the previously generated symbols as additional input when generating the next one (it is auto-regressive) [2].

In the next sections we will explore the different parts of the transformer. In the following posts we will learn how to use the attention mechanism for time series forecasting.

## Embeddings and positional encoding

The first thing the transformer does is transform the input text into numbers. To do that, we build a basic vocabulary from all the distinct words contained in the training data. Each word is then represented by the index at which it is stored in the vocabulary. For example, for the quote “all models are wrong”, we would have the (alphabetically sorted) vocabulary

$$
\begin{bmatrix}
\text{all} \\
\text{are} \\
\text{models} \\
\text{wrong}
\end{bmatrix}
$$

and the final representation

$$
\begin{bmatrix}
\text{all} \\
\text{models} \\
\text{are} \\
\text{wrong}
\end{bmatrix} =
\begin{bmatrix}
0 \\
2 \\
1 \\
3
\end{bmatrix}
$$
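As a minimal sketch, the whole vocabulary-and-indexing step can be written in a few lines (the helper names `build_vocab` and `encode` are illustrative, not from any particular library):

```python
# Build a minimal vocabulary by sorting the unique words of the text,
# then encode a sentence as the indices of its words in that vocabulary.
def build_vocab(text):
    return {word: idx for idx, word in enumerate(sorted(set(text.split())))}

def encode(text, vocab):
    return [vocab[word] for word in text.split()]

quote = "all models are wrong"
vocab = build_vocab(quote)       # {'all': 0, 'are': 1, 'models': 2, 'wrong': 3}
indices = encode(quote, vocab)   # [0, 2, 1, 3]
```

A real tokenizer would also handle punctuation, casing and out-of-vocabulary words, but the principle is the same.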

Next, we attach to each word a vector embedding that allows the model to better represent the relations between words: similar words are mapped to similar vectors.
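In code, the embedding step is just a table lookup. This sketch uses a random table for illustration; in a real model the table entries are learnable parameters:

```python
import numpy as np

# Toy embedding table: one d_model-dimensional vector per vocabulary word.
# Random values stand in for what would be trained parameters.
rng = np.random.default_rng(seed=0)
vocab_size, d_model = 4, 8
embedding_table = rng.normal(size=(vocab_size, d_model))

indices = [0, 2, 1, 3]                 # "all models are wrong"
embedded = embedding_table[indices]    # shape: (sequence length, d_model)
```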

We still need one more step before feeding our data into the model, because **order does indeed matter**. For recurrent networks, word positions are implicitly encoded inside the model since the words are processed sequentially, but remember that the transformer ditched recurrence in favour of the attention mechanism.

The authors’ solution was to **add a vector representing the position of each word** to the embedding vectors.

Geometrically, the position vectors push the word vectors towards regions of the space that cluster words with similar positions [4]; the model thus implicitly learns that words in these clusters follow a specific order.

What does that vector contain? **If we push the words too far away, the semantic information will become irrelevant with respect to the positional information.** Drawing on Fourier analysis, the authors proposed an **index-dependent function** that encodes the position of each word as a sinusoidal wave while keeping the values small.

For even embedding dimensions, the function takes the value

$$
PE_{(pos, 2i)} = \sin \left( \frac{pos}{10000^{2i/d_{model}}} \right)
$$

while for odd embedding dimensions it takes the value

$$
PE_{(pos, 2i+1)} = \cos \left( \frac{pos}{10000^{2i/d_{model}}} \right)
$$

where \(pos\) is the position of the word, \(i\) indexes the dimensions of the word embedding and \(d_{model}\) is the number of dimensions of the embeddings.
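The two formulas can be implemented in a few lines of NumPy. This sketch assumes the common convention of interleaving sine and cosine across the embedding dimensions:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings as defined in [2]."""
    pos = np.arange(seq_len)[:, None]        # word positions, one per row
    i = np.arange(d_model)[None, :]          # embedding dimensions, one per column
    # 2*(i//2) maps dimensions (0,1), (2,3), ... to the same frequency.
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])    # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])    # odd dimensions: cosine
    return pe
```

The resulting matrix has one row per position; it is simply added to the embedding matrix before the first attention layer.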

Using a periodic function ensures that arbitrarily large positions are mapped into a bounded interval, while avoiding the saturation that would occur with sigmoid or hyperbolic tangent functions. **Each position is completely identified by the frequency and offset of the wave** [3].

One thing to notice is the 10,000 value in the denominator. This value is **manually calibrated** to tune the wavelengths for a specific task [4].

## Attention, self-attention and multi-head attention

Let’s try to demystify the attention mechanism. The original paper [2] states that the attention function can be expressed as

$$

\text{Attention}(Q,K,V) = \text{softmax} \left(\frac{QK^T}{\sqrt{d_{k}}} \right) V

$$

where \( Q \) is the query, \( K \) is the key, \( V \) is the value and \( d_{k} \) is the number of dimensions of \( K \). The division by the square root of \( d_{k} \) is introduced to scale the values and reduce possible problems with gradients.
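The formula translates almost directly into NumPy. This sketch handles the single-matrix 2-D case and omits the linear layers that produce \( Q \), \( K \) and \( V \):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the row maximum keeps the exponentials numerically stable.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity between queries and keys
    weights = softmax(scores)         # attention filter: each row sums to 1
    return weights @ V                # weighted combination of the values
```

For self-attention, the same matrix (embeddings plus positional encoding) is passed as all three arguments.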

\( Q \), \( K \) and \( V \) are connected to the matrix multiplication operations through a linear layer with learnable parameters and are built from **identical matrices**: the embeddings plus the positional encoding (the result of applying the operation depicted in Figure 3). Why are \( Q \), \( K \) and \( V \) identical?

\( V \) is weighted by a combination of \( Q \) and \( K \), which means the **attention filter selects the relevant information within the same embeddings and applies that filter over the values**. Remember that this module has learnable parameters that allow the model to optimize the filter during the backpropagation step.

**The transformer has more than one self-attention module**. Using these “heads”, the model will hopefully be able to capture different features. All attention outputs are then concatenated into a single matrix, whose dimensionality is reduced through a linear mapping.
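A sketch of the multi-head pattern, with random matrices standing in for the learnable per-head projections and the final output projection:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(seed=1)
seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads
x = rng.normal(size=(seq_len, d_model))      # embeddings + positional encoding

# Random stand-ins for the learnable projection matrices.
W_q = rng.normal(size=(n_heads, d_model, d_head))
W_k = rng.normal(size=(n_heads, d_model, d_head))
W_v = rng.normal(size=(n_heads, d_model, d_head))
W_o = rng.normal(size=(n_heads * d_head, d_model))

# Each head sees its own projection of the same input.
heads = [attention(x @ W_q[h], x @ W_k[h], x @ W_v[h]) for h in range(n_heads)]
out = np.concatenate(heads, axis=-1) @ W_o   # back to (seq_len, d_model)
```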

## Residual Connections

Notice how the image depicted in Figure 1 shows that **the transformer architecture provides the different layers with residual connections**. This is done to avoid problems with vanishing and exploding gradients: the gradients can flow backwards through these paths without getting lost (or blowing up to infinity) due to the high number of product operations during the backpropagation process [6].
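The idea fits in one line: the output of each sub-layer is added back to its own input, so the identity path is always available for the gradient. A sketch, with `np.tanh` standing in for an arbitrary sub-layer:

```python
import numpy as np

def sublayer(x):
    # Stand-in for an attention or feed-forward sub-layer.
    return np.tanh(x)

def residual_block(x):
    # The identity term x gives gradients a direct path backwards.
    return x + sublayer(x)

x = np.array([0.0, 1.0, -1.0])
y = residual_block(x)   # = x + tanh(x)
```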

## Masked Multi-Head Attention

The last piece of the architecture we need to know is the masked multi-head attention module, which **ensures the model does not have access to future information** in order to make the predictions.

We know the transformer is an encoder-decoder architecture; to train the model, we provide the encoder with the first text sequence and **expect the decoder to generate the correct next sequence, one word at a time**. We then show the decoder which word was the right one so that the model can learn from its mistakes.

The **masking allows us to hide the future words by adding \( -\infty \) values to the attention filter** before feeding the matrix to the softmax layer. The negative-infinity elements become 0 after applying the softmax function, removing any illegal connections [2, 5].

The **masking changes at every time step**, hiding fewer and fewer words. With this system the model preserves the auto-regressive property [2].
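A sketch of the causal mask in NumPy; uniform scores are used here only so the resulting weights are easy to read:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(seq_len):
    # -inf strictly above the diagonal hides future positions; 0 elsewhere.
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.zeros((4, 4))                    # uniform scores, for readability
weights = softmax(scores + causal_mask(4))   # masked entries become exactly 0
# Row i attends only to positions 0..i:
# row 0 -> [1, 0, 0, 0], row 1 -> [0.5, 0.5, 0, 0], ...
```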

## Putting things together

The encoder takes the text and converts it into a vectorized representation. The decoder takes this vectorized representation together with the output generated so far, then makes a new prediction, which is fed back in [5].

The vectorized representations passed to the decoder are just **copies of the final output of the encoder**. We call them (again) the key and the value. The output embedding that is transformed through the masked multi-head attention and then passed to the multi-head attention is called, to no one’s surprise, the query.

The **process is repeated for each word** until we reach an end token and… voilà! We have our transformer.

## Conclusion

In this post we have reviewed how the basic transformers work step by step, covering the positional embeddings and the attention modules. In the next post we will see how we can use the attention mechanism to develop models for time series forecasting.

## References

[1] Bahdanau, Dzmitry and Cho, Kyunghyun and Bengio, Yoshua – Neural Machine Translation by Jointly Learning to Align and Translate

[2] Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, Lukasz and Polosukhin, Illia – Attention is All You Need

[3] Rush, Alexander and Nguyen, Vincent and Klein, Guillaume – The Annotated Transformer, Harvard NLP

[4] Parcalabescu, Letitia – Positional embeddings in transformers EXPLAINED | Demystifying positional encodings

[5] Hedu – Math of Intelligence – Visual Guide to Transformer Neural Networks – (Episode 3) Decoder’s Masked Attention

[6] He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian – Deep Residual Learning for Image Recognition