4 simple ways to label financial data for Machine Learning

Alejandro Pérez

17/03/2021

In previous posts, we have seen what machine learning is and even how to create our own framework. Combining machine learning and finance always leads to interesting results. Nevertheless, in supervised learning, it is crucial to find a set of appropriate labels to train your model. In today’s post, we are going to see 3 ways to transform our data into a classification problem and 1 way to transform it into a regression one.

What is ‘labeling’?

Labeling is the process of designing a supervisory signal for a set of data so that a model can infer properties from it. In other words, a label is an outcome we want our model to learn. We say that labeled data are annotated data.

Like features, the way we label our data contains information about the problem itself. That is why it is so important to do it right.

Binary Labeling

Let’s start with the simplest one. The easiest way to label returns is to assign a label depending on the sign of each return: we label positive returns as class 1 and negative returns as class 0. We can call this method binary labeling.

import numpy as np
import pandas as pd


def binary_labelling(data, name='Close'):
    """Binary labelling.

    Label the data according to its sign. If it is positive, it will be 
    labeled as 1; if it is negative, it will be labeled as 0.

    Returns equal to zero, if any, will be left as nan.

    Parameters
    ----------
    data : pandas.DataFrame or pandas.Series
        The data from which the labels are to be calculated. The data should be
        returns and not prices.
    name : str, optional, default: 'Close'
        Column to extract the labels from.
    
    Returns
    -------
    labs : pandas.DataFrame
        A pandas dataframe containing the returns and the labels for each 
        return.
    
    """    
    # accept a Series as well as a DataFrame
    returns = data[name] if isinstance(data, pd.DataFrame) else data

    # labs to store labels (returns equal to zero keep a NaN label)
    labs = pd.DataFrame(index=returns.index, columns=[name, 'Label'])

    # assign labels depending on the sign of each return
    labs[name] = returns
    labs.loc[returns > 0, 'Label'] = 1
    labs.loc[returns < 0, 'Label'] = 0

    return labs

Result of applying this method to the XAUUSD relative returns time series.

Binary labeling applied to XAUUSD relative returns.

The main drawback of this procedure is that it does not capture the difference in magnitude between two returns of the same sign; e.g. 0.01 gets the same label as 1000. Therefore, it is not a very appropriate algorithm in most cases (but it is still useful to build intuition).
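To make the drawback concrete, here is a minimal sketch on a handful of made-up returns (the values are purely illustrative and not taken from XAUUSD). It applies the same sign rule directly with pandas rather than through the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical returns: two tiny moves, two huge moves and a flat one
returns = pd.Series([0.01, 1000.0, -0.01, -1000.0, 0.0], name="Close")

# Binary labeling: 1 for positive, 0 for negative, NaN for exactly zero
labels = pd.Series(np.nan, index=returns.index, name="Label")
labels[returns > 0] = 1
labels[returns < 0] = 0

# Returns and labels side by side
print(pd.concat([returns, labels], axis=1))
```

The returns 0.01 and 1000 end up in the same class, so a model trained on these labels has no way to tell a small move from a large one.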

Fixed-time horizon

The first thing we can do to take into account these differences is to add a threshold from which the labels are computed. In chapter 3 of [1], by Marcos López de Prado, a method called Fixed-time horizon is presented as one of the main procedures to label data when it comes to processing financial time series for machine learning.

The method is simple and can be defined by the following expression:

$$
y_{i} =
\begin{cases}
-1, & \text{if } r_{t0,t1} < -\tau \\
0, & \text{if } |r_{t0,t1}| \leq \tau \\
1, & \text{if } r_{t0,t1} > \tau
\end{cases}
$$
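As a rough sketch of how the expression can be evaluated starting from prices, the snippet below computes the return between t and t + h and then applies the three cases. The prices, the horizon `h` and the threshold `tau` are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices
prices = pd.Series([100.0, 102.0, 99.0, 99.5, 103.0, 102.0], name="Close")

h = 1        # horizon in bars (assumption)
tau = 0.01   # threshold (assumption)

# Return from t to t + h, aligned at t; the last h rows have no future value
fwd_ret = prices.pct_change(h).shift(-h)

# Apply the three cases; rows without a return keep a NaN label
labels = pd.Series(np.nan, index=prices.index, name="Label")
labels[fwd_ret < -tau] = -1
labels[fwd_ret.abs() <= tau] = 0
labels[fwd_ret > tau] = 1

print(pd.concat([prices, fwd_ret.rename("Return"), labels], axis=1))
```

Shifting the returns back by `h` aligns each label with the moment at which the prediction would be made, so the label of a row never uses information from before that row.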

def fixed_time_horizon(data, threshold, name='Close'):
    """Fixed-time horizon labelling.

    Compute the financial labels using the fixed-time horizon procedure. See
    references to understand how this method works.

    Parameters
    ----------
    data : pandas.DataFrame or pandas.Series
        The data from which the labels are to be calculated. The data should be
        returns and not prices.
    name : str, optional, default: 'Close'
        Column to extract the labels from.        
    threshold : float
        The predefined constant threshold to compute the labels.

    Returns
    -------
    labs : pandas.DataFrame
        A pandas dataframe containing the returns and the labels for each 
        return.

    References
    ----------
    .. [1] Marcos López de Prado (2018). Advances in Financial Machine Learning 
       Wiley & Sons, Inc.

    .. [2] Marcos López de Prado - Machine Learning for Asset Managers.

    """
    # accept a Series as well as a DataFrame
    returns = data[name] if isinstance(data, pd.DataFrame) else data

    # to store labels
    labs = pd.DataFrame(index=returns.index, columns=[name, 'Label'])

    # assign labels depending on where each return falls
    labs[name] = returns
    labs.loc[returns < -threshold, 'Label'] = -1
    labs.loc[returns.abs() <= threshold, 'Label'] = 0
    labs.loc[returns > threshold, 'Label'] = 1

    return labs

Results of applying the fixed-time horizon method to the XAUUSD relative returns.

Fixed-time horizon applied to XAUUSD relative returns for different threshold values.

This method improves on binary labeling, but because the threshold is fixed, it implicitly assumes that the market remains static (no regime changes, no volatility clustering [3], etc.).
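The effect of a fixed threshold can be seen on synthetic data. The sketch below is only an illustration under stated assumptions: the returns are Gaussian noise with a calm regime followed by a volatile one, and the threshold `tau` is chosen arbitrarily.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic returns (assumption): a calm regime followed by a volatile one
calm = rng.normal(0.0, 0.005, 1000)
volatile = rng.normal(0.0, 0.03, 1000)
returns = pd.Series(np.concatenate([calm, volatile]))

tau = 0.01  # fixed threshold, chosen arbitrarily

# Fixed-time horizon labels: sign of the return, or 0 inside the band
labels = np.sign(returns).where(returns.abs() > tau, 0).astype(int)

# Share of each label within each regime
calm_share = labels[:1000].value_counts(normalize=True)
volatile_share = labels[1000:].value_counts(normalize=True)

print(calm_share.sort_index())
print(volatile_share.sort_index())
```

With the same threshold, almost every return is labeled 0 in the calm regime, while most returns are labeled -1 or 1 in the volatile one. The class balance therefore depends on the regime rather than on the rule itself.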

Can we do better while keeping a simple procedure? Yes, we can.

Quantized labeling

Ideally, we would want our method to automatically adapt reasonably well to changes in the market. Why don’t we use the varying properties of the returns distribution in our favour? That is exactly how quantized labeling [2] works.

Quantized labeling consists of bucketing the returns into categories derived from their quantile values. Computing the categories over a sliding or expanding window gives us the dynamic behaviour we seek.
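Before the full function, here is a minimal sketch of the idea using `pd.qcut` on synthetic returns. The data, the number of labels and the window size are assumptions made for illustration, and `last_bucket` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
returns = pd.Series(rng.normal(0.0, 0.01, 500), name="Close")  # synthetic

n_labels = 5
window = 100

# Whole-sample bucketing: every bucket receives the same number of returns
static_labels = pd.qcut(returns, q=n_labels, labels=False)


def last_bucket(values):
    """Bucket of the most recent return relative to its own window."""
    return pd.qcut(values, q=n_labels, labels=False, duplicates="drop")[-1]


# Rolling bucketing: the first window - 1 rows have no label yet
rolling_labels = returns.rolling(window).apply(last_bucket, raw=True)

print(static_labels.value_counts().sort_index())
print(rolling_labels.dropna().value_counts().sort_index())
```

In the rolling case, each return is ranked only against the returns in its own window, so the bucket edges move with the recent distribution instead of being fixed once for the whole sample.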

def quantized_labelling(
    data,  
    n_labels,
    name='Close',
    window=None,
    fillnan=None,
    mode=None
):
    """Quantized labelling.

    Label the data according to a quantile calculation. The quantiles can be
    computed in rolling or expanding modes, as well as for the whole dataset
    at once.

    Parameters
    ----------
    data : pandas.DataFrame or pandas.Series
        The data from which the labels are to be calculated. The data should be
        returns and not prices.
    n_labels : int
        The number of labels you want to compute.
    name : str, optional, default: 'Close'
        Column to extract the labels from.        
    window : int, optional, default: None
        The period size to compute the rolling/expanding quantiles.
    fillnan : object, optional, default: None
        If not None, the remaining rows, after bucketing, whose values are NaN 
        will be filled with the passed value.
    mode : str, {'rolling', 'expanding', None}, optional, default: None
        If None, the data will be bucketed using the whole dataset. If 
        'rolling' or 'expanding', the data will be bucketed using the selected
        mode, with a window size equal to the 'window' parameter.

    Returns
    -------
    labs : pandas.DataFrame
        A pandas dataframe containing the returns and the labels for each 
        return.

    References
    ----------
    .. [1] Udacity - AI for trading
       https://www.udacity.com/course/ai-for-trading--nd880
    
    """
    def get_qcuts(values, quantiles):
        """Return the quantile bucket of the last value in the window."""
        q = pd.qcut(values, q=quantiles, labels=False, duplicates='drop')
        return q[-1]

    # accept a Series as well as a DataFrame
    returns = data[name] if isinstance(data, pd.DataFrame) else data

    # n_labels + 1 equally spaced quantile edges between 0 and 1
    quantiles = np.linspace(0, 1, n_labels + 1)

    labs = pd.DataFrame(index=returns.index, columns=[name, 'Label'])
    labs[name] = returns

    if mode is None:
        # bucket the whole dataset at once
        labs['Label'] = pd.qcut(returns, q=quantiles, labels=False)

    else:
        if window is None:
            raise ValueError(f"'window' with value {window} is not valid.")

        # bucket each return relative to its rolling/expanding window
        pd_obj = getattr(returns, mode)(window)
        labs['Label'] = pd_obj.apply(
            lambda x: get_qcuts(x, quantiles),
            raw=True
        )

    # fill nans
    if fillnan is not None:
        labs.fillna(fillnan, inplace=True)

    return labs

Note in the code above that the procedure can be applied in rolling mode, in expanding mode, or to the whole dataset at once. Here is the result of applying quantized labeling to XAUUSD relative returns (we set n_labels to 7).

Quantized labeling applied to XAUUSD relative returns for different window sizes.

Labeling for regression

The last algorithm we are going to see allows us to transform our data into a regression problem. Hence, the labels will be continuous.

The idea is simple: we apply a rolling window on our returns and select n past returns and 1 future return as a label.

def unfold_ts_for_regression(
    data,
    look_back=20,
    look_ahead=1,
):
    """Unfolds ts for regression.
    
    This function receives a time series as input and returns two sets, X and
    y.

    Parameters
    ----------
    data : pandas.DataFrame, pandas.Series or numpy.array
        The time series to process.
    look_back : int, optional, default: 20
        The number of days to look back to predict the next day.
    look_ahead : int, optional, default: 1
        If 'look_ahead' is 1, the label will be the value that immediately
        follows the batch. If it is greater, the label will be the value
        'look_ahead' steps after the end of the batch.

    Returns
    -------
    X : numpy.array
        An array containing the features.
    y : numpy.array
        An array containing the labels.
    
    """
    if isinstance(data, pd.DataFrame) or isinstance(data, pd.Series):
        data = data.values

    elif isinstance(data, list):
        data = np.array(data)
        
    elif isinstance(data, np.ndarray):
        pass
        
    else:
        raise TypeError(f"Non-supported data type: {type(data)}")
    
    X = []
    y = []
    
    # stop once the label index would fall past the end of the data
    _range = range(0, len(data) - look_back - look_ahead + 1)
    
    for idx in _range:
        batch_end = idx + look_back
        ahead_end = batch_end + look_ahead - 1

        local_X = data[idx:batch_end]
        local_y = data[ahead_end]

        X.append(local_X)
        y.append(local_y)
    
    return np.array(X), np.array(y)

It seems complicated but it is not. Let’s see an example with a list of dummy values to understand the function.

x = [a for a in range(10)]

X, y = unfold_ts_for_regression(data=x, look_back=2, look_ahead=1)

The above lines output the following arrays for X and y respectively:

# X = 
array([[0, 1],
       [1, 2],
       [2, 3],
       [3, 4],
       [4, 5],
       [5, 6],
       [6, 7],
       [7, 8]])

# y = 
array([2, 3, 4, 5, 6, 7, 8, 9])

See? It is just a sliding window that looks n values in the past (look_back) and selects a value from the future to forecast (look_ahead). Each iteration creates a new row in the features and labels matrix.
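For readers who prefer a vectorized version, the same sliding window can be sketched with NumPy's `sliding_window_view` (available from NumPy 1.20). This is an alternative illustration for the dummy example above, not the function used in this post.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

data = np.arange(10)
look_back, look_ahead = 2, 1

# Every window holds look_back features followed by look_ahead future values
windows = sliding_window_view(data, look_back + look_ahead)

X = windows[:, :look_back]   # first look_back values of each window
y = windows[:, -1]           # value look_ahead steps after the features

print(X)
print(y)
```

The result is a set of read-only views on the original array, so no data is copied. Call `.copy()` on `X` and `y` if you need to modify them.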

Let’s plot the results in an animated gif to see the sequence:

Each frame represents the features (blue) and the label (red) computed using the function above on the relative returns of XAUUSD.

Be careful when using this function: consecutive samples share most of their observations, so you may run into a problem called overlapping outcomes (see chapter 4 of [1] for more information).

Conclusions

In this post, we’ve briefly seen 4 simple ways to label your financial data. There are more complex procedures out there, such as the triple-barrier method [1], which I encourage you to study and test.

Bibliography

[1] Marcos López de Prado – Advances in Financial Machine Learning.

[2] Udacity – AI for trading.

[3] Rama Cont – Volatility Clustering in Financial Markets: Empirical Facts and Agent–Based Models.
