# 4 simple ways to label financial data for Machine Learning

### Alejandro Pérez

#### 17/03/2021

We have seen in previous posts what machine learning is and even how to create our own framework. Combining machine learning and finance always leads to interesting results. Nevertheless, in supervised learning, it is crucial to find a set of appropriate labels to train your model. In today’s post, we are going to see 3 ways to transform our data into a classification problem and 1 way to transform it into a regression problem.

## What is ‘labeling’?

Labeling is the process of designing a supervisory signal for a set of data so that a model can infer properties from it. In other words, a label is an outcome we want our model to learn. We say that labeled data are annotated data.

Like features, the way we label our data contains information about the problem itself. That is why it is so important to do it right.

## Binary Labeling

Let’s start with the simplest one. The easiest way to label returns is to assign a label depending on the sign of each return: we label positive returns as class 1 and negative returns as class 0. Returns that are exactly zero are left unlabeled. We can call this method binary labeling.

```python
import pandas as pd


def binary_labelling(data, name='Close'):
    """Binary labelling.

    Label the data according to its sign. If it is positive, it will be
    labeled as 1; if it is negative, it will be labeled as 0.

    Returns equal to zero, if any, will be left as nan.

    Parameters
    ----------
    data : pandas.DataFrame or pandas.Series
        The data from which the labels are to be calculated. The data should be
        returns and not prices.
    name : str, optional, default: 'Close'
        Column to extract the labels from.

    Returns
    -------
    labs : pandas.DataFrame
        A pandas dataframe containing the returns and the labels for each
        return.

    """
    # accept a Series as well as a DataFrame, as documented
    returns = data[name] if isinstance(data, pd.DataFrame) else data

    # labs to store returns and labels
    labs = pd.DataFrame(index=returns.index, columns=[name, 'Label'])

    # get indices for each label
    idx_pos = returns[returns > 0].index
    idx_neg = returns[returns < 0].index

    # assign labels depending on indices
    labs[name] = returns
    labs.loc[idx_pos, 'Label'] = 1
    labs.loc[idx_neg, 'Label'] = 0

    return labs
```


Result of applying this method to the XAUUSD relative returns time series.

The main drawback of this procedure is that it does not capture the difference in magnitude between two returns of the same sign; e.g. 0.01 has the same label as 1000. Therefore, it is not a very appropriate algorithm in most cases (but it is still useful to build intuition).
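To make the drawback concrete, here is a minimal sketch that applies the same sign rule to a handful of made-up returns. The values and the variable names (`returns`, `labels`, `binary`) are hypothetical and only serve as an illustration; they are not XAUUSD data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy returns (not real market data)
returns = pd.Series([0.0001, 0.05, -0.0002, -0.03, 0.0], name="Close")

# Sign-based labels: 1 for positive, 0 for negative, NaN for exact zeros
labels = pd.Series(np.nan, index=returns.index, name="Label")
labels.loc[returns > 0] = 1
labels.loc[returns < 0] = 0

binary = pd.concat([returns, labels], axis=1)
print(binary)
```

The first two returns differ by a factor of 500, yet both end up in class 1, which is exactly the information the binary scheme throws away.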

## Fixed-time horizon

The first thing we can do to take into account these differences is to add a threshold from which the labels are computed. In chapter 3 of [1], by Marcos López de Prado, a method called Fixed-time horizon is presented as one of the main procedures to label data when it comes to processing financial time series for machine learning.

The method is simple and can be defined by the following expression:

$$y_{i} = \begin{cases} -1, & \text{if } r_{t_0,t_1} < -\tau \\ 0, & \text{if } |r_{t_0,t_1}| \leq \tau \\ 1, & \text{if } r_{t_0,t_1} > \tau \end{cases}$$

```python
import pandas as pd


def fixed_time_horizon(data, threshold, name='Close'):
    """Fixed-time horizon labelling.

    Compute the financial labels using the fixed-time horizon procedure. See
    references to understand how this method works.

    Parameters
    ----------
    data : pandas.DataFrame or pandas.Series
        The data from which the labels are to be calculated. The data should be
        returns and not prices.
    threshold : float
        The predefined constant threshold to compute the labels.
    name : str, optional, default: 'Close'
        Column to extract the labels from.

    Returns
    -------
    labs : pandas.DataFrame
        A pandas dataframe containing the returns and the labels for each
        return.

    References
    ----------
    .. [1] Marcos López de Prado (2018). Advances in Financial Machine Learning
        Wiley & Sons, Inc.

    .. [2] Marcos López de Prado - Machine Learning for Asset Managers.

    """
    # accept a Series as well as a DataFrame, as documented
    returns = data[name] if isinstance(data, pd.DataFrame) else data

    # to store returns and labels
    labs = pd.DataFrame(index=returns.index, columns=[name, 'Label'])

    # get indices for each label
    idx_lower = returns[returns < -threshold].index
    idx_middle = returns[returns.abs() <= threshold].index
    idx_upper = returns[returns > threshold].index

    # assign labels depending on indices
    labs[name] = returns
    labs.loc[idx_lower, 'Label'] = -1
    labs.loc[idx_middle, 'Label'] = 0
    labs.loc[idx_upper, 'Label'] = 1

    return labs
```


Results of applying the fixed-time horizon method to the XAUUSD relative returns.

This method improves on the binary labeling procedure, but because the threshold is fixed it implicitly assumes that the market remains static (no regime changes, no volatility clustering [3], etc.).
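The effect of a fixed threshold under changing volatility can be sketched with simulated data. Everything below is an assumption made for illustration: the two Gaussian "regimes", their volatilities, the threshold `tau` and the variable names are hypothetical and are not taken from the XAUUSD series.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Two simulated regimes: a calm one followed by a turbulent one
calm = rng.normal(0.0, 0.002, size=500)
turbulent = rng.normal(0.0, 0.02, size=500)
returns = pd.Series(np.concatenate([calm, turbulent]))

# Same fixed threshold for the whole sample
tau = 0.005
labels = pd.Series(
    np.select([returns < -tau, returns > tau], [-1, 1], default=0),
    index=returns.index,
)

# Share of "neutral" labels (class 0) in each regime
share_zero_calm = (labels.iloc[:500] == 0).mean()
share_zero_turbulent = (labels.iloc[500:] == 0).mean()
print(share_zero_calm, share_zero_turbulent)
```

In the calm regime almost every return falls inside the band and is labeled 0, while in the turbulent regime most returns fall outside it. The class balance therefore depends on the regime, not only on the signal we want the model to learn.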

Can we do better while keeping a simple procedure? Yes, we can.

## Quantized labeling

Ideally, we would want our method to automatically adapt reasonably well to changes in the market. Why don’t we use the varying properties of the returns distribution in our favour? That is exactly how quantized labeling [2] works.

Quantized labeling consists of bucketing the returns into categories derived from their quantile values. Computing the categories over a sliding or expanding window gives us the dynamic behaviour we seek.

```python
import numpy as np
import pandas as pd


def quantized_labelling(
    data,
    n_labels,
    name='Close',
    window=None,
    fillnan=None,
    mode=None
):
    """Quantized labelling.

    Label the data according to a quantile calculation. The quantiles can be
    computed in rolling or expanding modes, as well as for the whole dataset
    at once.

    Parameters
    ----------
    data : pandas.DataFrame or pandas.Series
        The data from which the labels are to be calculated. The data should be
        returns and not prices.
    n_labels : int
        The number of labels you want to compute.
    name : str, optional, default: 'Close'
        Column to extract the labels from.
    window : int, optional, default: None
        The period size to compute the rolling/expanding quantiles.
    fillnan : object, optional, default: None
        If not None, the remaining rows, after bucketing, whose values are NaN
        will be filled with the passed value.
    mode : str, {'rolling', 'expanding', None}
        If None, the data will be bucketed using the whole dataset. If
        'rolling' or 'expanding', the data will be bucketed using the selected
        mode, with a window equal to the 'window' parameter.

    Returns
    -------
    labs : pandas.DataFrame
        A pandas dataframe containing the returns and the labels for each
        return.

    References
    ----------
    .. [1] Udacity - AI for trading

    """
    def get_qcuts(values, quantiles):
        """Return the quantile bucket of the last value in the window."""
        q = pd.qcut(values, q=quantiles, labels=False, duplicates='drop')
        return q[-1]

    # accept a Series as well as a DataFrame, as documented
    returns = data[name] if isinstance(data, pd.DataFrame) else data

    # n_labels buckets need n_labels + 1 equally spaced quantile edges
    quantiles = np.linspace(0, 1, n_labels + 1)

    labs = pd.DataFrame(index=returns.index, columns=[name, 'Label'])
    labs[name] = returns

    if mode is None:
        # bucket the whole dataset at once
        labs['Label'] = pd.qcut(returns, q=quantiles, labels=False)

    else:
        if window is None:
            raise ValueError(f"'window' with value {window} is not valid.")

        # mode is 'rolling' or 'expanding'; label each row with the bucket
        # of its own return relative to the window that ends on it
        pd_obj = getattr(returns, mode)(window)
        labs['Label'] = pd_obj.apply(
            lambda x: get_qcuts(x, quantiles),
            raw=True
        )

    # fill nans
    if fillnan is not None:
        labs.fillna(fillnan, inplace=True)

    return labs
```


Note in the code above that the procedure can be applied in rolling mode, in expanding mode, or to the whole dataset at once. Here is the result of applying quantized labeling to XAUUSD relative returns (we set n_labels to 7).
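To see why the rolling mode adapts to the market, consider the following minimal sketch. It does not use the function above: it swaps `pd.qcut` for `np.quantile` plus `np.searchsorted`, which is an assumption made to keep the example short, and the helper `last_bucket`, the toy returns and the window size are all hypothetical.

```python
import numpy as np
import pandas as pd


def last_bucket(window_values, n_labels):
    """Bucket of the last value relative to the quantiles of its window."""
    # Interior quantile edges, e.g. the 1/3 and 2/3 quantiles for 3 labels
    edges = np.quantile(window_values, np.linspace(0, 1, n_labels + 1)[1:-1])
    # Number of edges strictly below the last value = its bucket index
    return float(np.searchsorted(edges, window_values[-1], side="left"))


# Hypothetical returns: five calm days followed by five volatile days.
# The last return of each block is the same value, 0.01.
returns = pd.Series(
    [-0.002, -0.001, 0.0, 0.001, 0.01, -0.05, 0.04, -0.03, 0.06, 0.01]
)

labels = returns.rolling(5).apply(lambda w: last_bucket(w, 3), raw=True)
print(pd.concat([returns, labels], axis=1, keys=["Close", "Label"]))
```

The return 0.01 lands in the top bucket at the end of the calm block, but only in the middle bucket at the end of the volatile block, because the quantile edges are recomputed from each window.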

## Labeling for regression

The last algorithm we are going to see allows us to transform our data into a regression problem. Hence, the labels will be continuous.

The idea is simple: we apply a rolling window on our returns and select n past returns and 1 future return as a label.

```python
import numpy as np
import pandas as pd


def unfold_ts_for_regression(
    data,
    look_back=20,
    look_ahead=0
):
    """Unfolds ts for regression.

    This function receives as input a time series and returns two sets, X and
    y.

    Parameters
    ----------
    data : pandas.DataFrame, pandas.Series, list or numpy.array
        The time series to process.
    look_back : int, optional, default: 20
        The number of days to look back to predict the next day.
    look_ahead : int, optional, default: 0
        If 'look_ahead' is 0 or 1, the label will be the next data of the
        batch. If it is greater, the labels will be 'look_ahead' data of the
        batch.

    Returns
    -------
    X : numpy.array
        An array containing the features.
    y : numpy.array
        An array containing the labels.

    """
    if isinstance(data, (pd.DataFrame, pd.Series)):
        data = data.values

    elif isinstance(data, list):
        data = np.array(data)

    elif isinstance(data, np.ndarray):
        pass

    else:
        raise TypeError(f"Non-supported data type: {type(data)}")

    X = []
    y = []

    # NOTE: this condition and the computation of local_y were missing from
    # the original listing; they are reconstructed from the docstring and
    # from the example output shown below.
    single_step = look_ahead <= 1

    if single_step:
        _range = range(0, len(data) - look_back)
    else:
        _range = range(0, len(data) - look_back - look_ahead)

    for idx in _range:
        batch_end = idx + look_back

        local_X = data[idx:batch_end]

        if single_step:
            # label: the value right after the batch
            local_y = data[batch_end]
        else:
            # label: the 'look_ahead' values right after the batch
            local_y = data[batch_end:batch_end + look_ahead]

        X.append(local_X)
        y.append(local_y)

    return np.array(X), np.array(y)
```


It seems complicated but it is not. Let’s see an example with a list of dummy values to understand the function.

```python
x = [a for a in range(10)]

X, y = unfold_ts_for_regression(data=x, look_back=2, look_ahead=1)
```


The above lines output the following arrays for X and y respectively:

```python
# X =
array([[0, 1],
       [1, 2],
       [2, 3],
       [3, 4],
       [4, 5],
       [5, 6],
       [6, 7],
       [7, 8]])

# y =
array([2, 3, 4, 5, 6, 7, 8, 9])
```


See? It is just a sliding window that looks n values in the past (look_back) and selects a value from the future to forecast (look_ahead). Each iteration creates a new row in the features and labels matrix.
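For the one-step-ahead case, the same unfolding can also be sketched without an explicit loop. This is an alternative under stated assumptions, not the function used in this post: it relies on `numpy.lib.stride_tricks.sliding_window_view`, which requires NumPy 1.20 or later, and the variable names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(10)
look_back = 2

# Every window of length look_back, including the last one, which has no
# future value left to serve as its label
windows = sliding_window_view(x, look_back)

X = windows[:-1]    # drop the last window (no label available)
y = x[look_back:]   # the value right after each remaining window

print(X)
print(y)
```

With these dummy values, `X` and `y` match the arrays shown above.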

Let’s plot the results in an animated gif to see the sequence:

Be careful when using this function, because you may run into a problem called overlapping outcomes (see chapter 4 of [1] for more information).
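A rough way to picture the overlap is to count how many labels each observation contributes to when a label spans several future values. The snippet below is only an illustrative sketch: the dummy series, the `look_back` and `look_ahead` values and the way the label spans are built are assumptions, and it does not reproduce the treatment in [1].

```python
import numpy as np

data = np.arange(10)
look_back, look_ahead = 2, 3

# For each sample, the indices of the future values that form its label
label_spans = [
    range(start + look_back, start + look_back + look_ahead)
    for start in range(len(data) - look_back - look_ahead + 1)
]

# Count how many different labels each observation takes part in
usage = np.zeros(len(data), dtype=int)
for span in label_spans:
    usage[list(span)] += 1

print(usage)
```

Any count above 1 means that the observation is shared by several labels, so those labels are not independent of each other.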

## Conclusions

In this post, we’ve briefly seen 4 simple ways to label your financial data. There are more complex procedures out there, such as the triple-barrier method [1], which I encourage you to study and test.