We have seen in previous posts what machine learning is and even how to create our own framework. Combining machine learning and finance always leads to interesting results. Nevertheless, in supervised learning, it is crucial **to find a set of appropriate labels to train your model**. In today’s post, we are going to see 3 ways to transform our data into a classification problem and 1 way to transform it into a regression problem.

## What is ‘labeling’?

Labeling is the process of designing a **supervisory signal** for a set of data so that a model can infer properties from it. In other words, **a label is an outcome we want our model to learn**. We say that *labeled* data are *annotated* data.

Like features, the way we label our data contains information about the problem itself. That is why it is so important to do it right.

## Binary Labeling

Let’s start with the simplest one. **The easiest way to label returns is to assign a label depending on the sign of each return**: we label positive returns as class 1 and negative returns as class 0. We can call this method *binary labeling*.

```python
import pandas as pd


def binary_labelling(data, name='Close'):
    """Binary labelling.

    Label the data according to its sign. If it is positive, it will be
    labeled as 1; if it is negative, it will be labeled as 0. Returns
    equal to zero, if any, will be left as NaN.

    Parameters
    ----------
    data : pandas.DataFrame or pandas.Series
        The data from which the labels are to be calculated. The data
        should be returns and not prices.
    name : str, optional, default: 'Close'
        Column to extract the labels from.

    Returns
    -------
    labs : pandas.DataFrame
        A pandas dataframe containing the returns and the labels for
        each return.
    """
    # accept a Series, as the docstring promises
    if isinstance(data, pd.Series):
        data = data.to_frame(name)

    # labs to store labels
    labs = pd.DataFrame(index=data.index, columns=[name, 'Label'])

    # get indices for each label
    idx_pos = data[data[name] > 0].index
    idx_neg = data[data[name] < 0].index

    # assign labels depending on indices
    labs[name] = data[name]
    labs.loc[idx_pos, 'Label'] = 1
    labs.loc[idx_neg, 'Label'] = 0

    return labs
```

Result of applying this method to the XAUUSD relative returns time series.

The main drawback of this procedure is that **it does not capture the difference in magnitude between two returns of the same sign**; e.g. 0.01 gets the same label as 1000. Therefore, it is not a very appropriate algorithm in most cases (but it is still useful to build intuition).
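To make this drawback concrete, here is a minimal sketch that applies the same sign rule to a handful of made-up returns. The values are hypothetical and were chosen only to show that very different magnitudes collapse into the same class.

```python
import numpy as np
import pandas as pd

# Hypothetical returns, chosen only to illustrate the point
returns = pd.Series([0.01, 1000.0, -0.02, -5.0, 0.0], name='Close')

# Sign rule: 1 if positive, 0 if negative, NaN if exactly zero
labels = pd.Series(np.nan, index=returns.index, name='Label')
labels.loc[returns > 0] = 1.0
labels.loc[returns < 0] = 0.0

# 0.01 and 1000.0 share a label, and so do -0.02 and -5.0
print(pd.concat([returns, labels], axis=1))
```

The model never sees that the second return is five orders of magnitude larger than the first, which is the information the next methods try to recover.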

## Fixed-time horizon

The first thing we can do to take these differences into account is to **add a threshold from which the labels are computed**. In chapter 3 of [1], Marcos López de Prado presents a method called *fixed-time horizon* as one of the main procedures to label data when processing financial time series for machine learning.

The method is simple and can be defined by the following expression:

$$
y_{i} =
\begin{cases}
-1, & \text{if } r_{t_0,t_1} < -\tau \\
0, & \text{if } |r_{t_0,t_1}| \leq \tau \\
1, & \text{if } r_{t_0,t_1} > \tau
\end{cases}
$$

```python
import pandas as pd


def fixed_time_horizon(data, threshold, name='Close'):
    """Fixed-time horizon labelling.

    Compute the financial labels using the fixed-time horizon procedure.
    See references to understand how this method works.

    Parameters
    ----------
    data : pandas.DataFrame or pandas.Series
        The data from which the labels are to be calculated. The data
        should be returns and not prices.
    threshold : float
        The predefined constant threshold to compute the labels.
    name : str, optional, default: 'Close'
        Column to extract the labels from.

    Returns
    -------
    labs : pandas.DataFrame
        A pandas dataframe containing the returns and the labels for
        each return.

    References
    ----------
    .. [1] Marcos López de Prado (2018). Advances in Financial Machine
       Learning. Wiley & Sons, Inc.
    .. [2] Marcos López de Prado - Machine Learning for Asset Managers.
    """
    # accept a Series, as the docstring promises
    if isinstance(data, pd.Series):
        data = data.to_frame(name)

    # to store labels
    labs = pd.DataFrame(index=data.index, columns=[name, 'Label'])

    # get indices for each label
    idx_lower = data[data[name] < -threshold].index
    idx_middle = data[data[name].abs() <= threshold].index
    idx_upper = data[data[name] > threshold].index

    # assign labels depending on indices
    labs[name] = data[name]
    labs.loc[idx_lower, 'Label'] = -1
    labs.loc[idx_middle, 'Label'] = 0
    labs.loc[idx_upper, 'Label'] = 1

    return labs
```

Results of applying the fixed-time horizon method to the XAUUSD relative returns.

This method improves on binary labeling, but **it assumes the market remains static** (no regime changes, no volatility clustering [3], etc.) because the threshold value is fixed.
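A quick way to see the effect of the fixed threshold is to label two synthetic regimes with the same value of τ. The sketch below is only an illustration: the volatilities, the threshold and the helper name `fixed_threshold_labels` are assumptions, not values taken from the XAUUSD data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic returns: a calm regime followed by a volatile one (assumed values)
calm = rng.normal(0.0, 0.005, size=1000)
volatile = rng.normal(0.0, 0.03, size=1000)

threshold = 0.01  # hypothetical fixed threshold


def fixed_threshold_labels(returns, threshold):
    """Return 1 above the threshold, -1 below its negative, 0 otherwise."""
    return np.where(returns > threshold, 1,
                    np.where(returns < -threshold, -1, 0))


# Share of observations labeled 0 in each regime
calm_zero = np.mean(fixed_threshold_labels(calm, threshold) == 0)
volatile_zero = np.mean(fixed_threshold_labels(volatile, threshold) == 0)

print(f"share of 0 labels, calm regime:     {calm_zero:.2f}")
print(f"share of 0 labels, volatile regime: {volatile_zero:.2f}")
```

With the same threshold, most calm-regime returns fall into class 0 while the volatile regime is dominated by -1 and 1, so the class balance depends on the regime rather than on anything the model should learn.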

Can we do better while keeping a simple procedure? Yes, we can.

## Quantized labeling

Ideally, we would want our method to automatically adapt reasonably well to changes in the market. **Why don’t we use the varying properties of the returns distribution in our favour?** That is exactly how *quantized labeling* [2] works.

Quantized labeling consists of **bucketing the returns into categories derived from the quantile values**. Computing the categories using a sliding/expanding window gives us the dynamic behaviour we seek.

```python
import numpy as np
import pandas as pd


def quantized_labelling(
    data, n_labels, name='Close', window=None, fillnan=None, mode=None
):
    """Quantized labelling.

    Label the data according to a quantile calculation. The quantiles
    can be computed in rolling or expanding modes, as well as for the
    whole dataset at once.

    Parameters
    ----------
    data : pandas.DataFrame or pandas.Series
        The data from which the labels are to be calculated. The data
        should be returns and not prices.
    n_labels : int
        The number of labels you want to compute.
    name : str, optional, default: 'Close'
        Column to extract the labels from.
    window : int, optional, default: None
        The period size to compute the rolling/expanding quantiles.
    fillnan : object, optional, default: None
        If not None, the remaining rows, after bucketing, whose values
        are NaN will be filled with the passed value.
    mode : str, {'rolling', 'expanding', None}
        If None, the data will be bucketed using the whole dataset. If
        'rolling' or 'expanding', the data will be bucketed using the
        selected mode, with a window equal to the 'window' parameter.

    Returns
    -------
    labs : pandas.DataFrame
        A pandas dataframe containing the returns and the labels for
        each return.

    References
    ----------
    .. [1] Udacity - AI for trading
       https://www.udacity.com/course/ai-for-trading--nd880
    """
    def get_qcuts(series, quantiles):
        """Return the quantile bucket of the last value in the window."""
        q = pd.qcut(series, q=quantiles, labels=False, duplicates='drop')
        return q[-1]

    # accept a Series, as the docstring promises
    if isinstance(data, pd.Series):
        data = data.to_frame(name)

    # n_labels buckets need n_labels + 1 quantile edges; linspace hits
    # 0 and 1 exactly, which np.arange with a float step may not
    quantiles = np.linspace(0, 1, n_labels + 1)

    labs = pd.DataFrame(index=data.index, columns=[name])
    labs[name] = data[name]

    if mode is None:
        qc = pd.qcut(data[name], q=quantiles, labels=False)
        # concat to avoid errors with indexes
        labs = pd.concat([data[name], qc], axis=1)
        labs.columns = [name, 'Label']
    else:
        if window is None:
            raise ValueError(f"'window' with value {window} is not valid.")
        pd_obj = getattr(data[name], mode)(window)
        labs['Label'] = pd_obj.apply(
            lambda x: get_qcuts(x, quantiles), raw=True
        )

    # fill nans
    if fillnan is not None:
        labs.fillna(fillnan, inplace=True)

    return labs
```

Note in the code above that the **procedure can be applied in rolling mode, in expanding mode, or to the whole dataset at once.** Here is the result of applying quantized labeling to XAUUSD relative returns (we set *n_labels* to 7).
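To build intuition for what the bucketing does, the following sketch calls `pd.qcut` directly on synthetic returns for the whole-dataset case. The data are randomly generated (an assumption for illustration, not the XAUUSD series), and `retbins=True` is used only to inspect the quantile edges.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic returns, for illustration only
returns = pd.Series(rng.normal(0.0, 0.01, size=700), name='Close')

n_labels = 7

# labels=False returns the integer bucket of each return;
# retbins=True also returns the quantile edges used to build the buckets
labels, edges = pd.qcut(returns, q=n_labels, labels=False, retbins=True)

# Quantile buckets hold (almost) the same number of observations
counts = labels.value_counts().sort_index()
print(counts)

# The edges come from the data, not from a hand-picked threshold
print(edges)
```

Because the edges are quantiles of the data themselves, recomputing them on a rolling or expanding window lets the buckets widen or narrow as the distribution of returns changes.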

## Labeling for regression

The last algorithm we are going to see allows us to **transform our data into a regression problem**. Hence, the labels will be continuous.

The idea is simple: **we apply a rolling window on our returns and select *n* past returns as features and 1 future return as a label**.

```python
import numpy as np
import pandas as pd


def unfold_ts_for_regression(data, look_back=20, look_ahead=1):
    """Unfold a time series for regression.

    This function receives a time series as input and returns two sets,
    X and y.

    Parameters
    ----------
    data : pandas.DataFrame, pandas.Series, numpy.array or list
        The time series to process.
    look_back : int, optional, default: 20
        The number of days to look back to predict the next day.
    look_ahead : int, optional, default: 1
        If 'look_ahead' is 1, the label will be the value right after
        the batch. If it is greater, the label will be the value
        'look_ahead' steps after the last value of the batch.

    Returns
    -------
    X : numpy.array
        An array containing the features.
    y : numpy.array
        An array containing the labels.
    """
    if isinstance(data, (pd.DataFrame, pd.Series)):
        data = data.values
    elif isinstance(data, list):
        data = np.array(data)
    elif not isinstance(data, np.ndarray):
        raise TypeError(f"Non-supported data type: {type(data)}")

    X = []
    y = []

    # number of windows whose label still falls inside the data
    n_samples = len(data) - look_back - look_ahead + 1

    for idx in range(n_samples):
        batch_end = idx + look_back
        ahead_end = batch_end + look_ahead - 1

        X.append(data[idx:batch_end])
        y.append(data[ahead_end])

    return np.array(X), np.array(y)
```

It seems complicated but it is not. Let’s see an example with a list of dummy values to understand the function.

```python
x = [a for a in range(10)]
X, y = unfold_ts_for_regression(data=x, look_back=2, look_ahead=1)
```

The above lines output the following arrays for X and y respectively:

```python
# X = array([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7], [7, 8]])
# y = array([2, 3, 4, 5, 6, 7, 8, 9])
```

See? **It is just a sliding window that looks *n* values into the past (look_back) and selects a value from the future** to forecast (look_ahead). Each iteration creates a new row in the features matrix and a new entry in the labels vector.
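The same sliding window can also be sketched without an explicit loop. The version below is an alternative, not the function used in this post: it relies on `numpy.lib.stride_tricks.sliding_window_view`, which assumes NumPy 1.20 or newer, and reuses the dummy values from the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

data = np.arange(10)
look_back, look_ahead = 2, 1

# Each row holds look_back past values followed by look_ahead future values
windows = sliding_window_view(data, look_back + look_ahead)

# Features: the first look_back values of every window
X = windows[:, :look_back]

# Label: the last value of every window, look_ahead steps after the batch
y = windows[:, -1]

print(X)
print(y)
```

`sliding_window_view` returns a read-only view of the original array, so call `.copy()` on `X` and `y` if you need to modify them afterwards.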

Let’s plot the results in an animated gif to see the sequence:

**Be careful when using this function**, because you may run into a problem called *overlapping outcomes* (see chapter 4 of [1] for more information).

## Conclusions

In this post, we’ve briefly seen 4 simple ways to label your financial data. There are more complex procedures out there, such as the triple-barrier method [1], which I encourage you to study and test.

## Bibliography

[1] Marcos López de Prado – Advances in Financial Machine Learning.

[2] Udacity – AI for trading.

[3] Rama Cont – Volatility Clustering in Financial Markets: Empirical Facts and Agent–Based Models.