Machine Learning

Dimensionality reduction method through autoencoders

T. Fuertes

11/12/2019

We’ve already talked about dimensionality reduction at length in this blog, usually focusing on PCA. In my latest post I also introduced another way to reduce dimensions, based on autoencoders. That time, however, I focused on how to use autoencoders as predictors, while now I’d like to consider them as a dimensionality reduction technique.

Just a quick reminder of how autoencoders work. The procedure starts by compressing the original data into a short code, ignoring noise. Then the algorithm decompresses that code to generate a reconstruction as close as possible to the original input.

[Figure: Autoencoder process (example of an autoencoder neural network)]
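To make the compress-then-reconstruct idea concrete, here is a minimal from-scratch sketch in NumPy. It is not the model used later in this post: the toy data, the single tanh encoding layer, the linear decoder, the learning rate and the number of iterations are all assumptions chosen only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples of 8 features driven by 3 hidden factors plus noise.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 8))
x = latent @ mixing + 0.05 * rng.normal(size=(200, 8))
x = (x - x.mean(axis=0)) / x.std(axis=0)  # standardise each feature

n_features, code_size = x.shape[1], 3
w_enc = 0.1 * rng.normal(size=(n_features, code_size))
b_enc = np.zeros(code_size)
w_dec = 0.1 * rng.normal(size=(code_size, n_features))
b_dec = np.zeros(n_features)


def encode(inputs):
    """Compress 8 features into a 3-value code."""
    return np.tanh(inputs @ w_enc + b_enc)


def decode(code):
    """Rebuild the 8 features from the 3-value code."""
    return code @ w_dec + b_dec


def reconstruction_error(inputs):
    """Mean squared error between the inputs and their reconstruction."""
    return float(np.mean((decode(encode(inputs)) - inputs) ** 2))


initial_error = reconstruction_error(x)

learning_rate = 0.1
for _ in range(1000):
    code = encode(x)
    x_hat = decode(code)
    grad_out = 2.0 * (x_hat - x) / x.size        # gradient of the MSE w.r.t. x_hat
    grad_w_dec = code.T @ grad_out
    grad_b_dec = grad_out.sum(axis=0)
    grad_pre = (grad_out @ w_dec.T) * (1.0 - code ** 2)  # back through tanh
    grad_w_enc = x.T @ grad_pre
    grad_b_enc = grad_pre.sum(axis=0)
    # In-place updates, so encode() and decode() see the new weights.
    w_enc -= learning_rate * grad_w_enc
    b_enc -= learning_rate * grad_b_enc
    w_dec -= learning_rate * grad_w_dec
    b_dec -= learning_rate * grad_b_dec

final_error = reconstruction_error(x)
print("Reconstruction error before training:", initial_error)
print("Reconstruction error after training: ", final_error)
```

The only training signal is how well the input can be rebuilt from the 3-value code, which is what forces the code to keep the most useful information.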

Practical case

Let’s move to a hot topic in finance: the modeling of interest rates. We’ve already seen that PCA is able to sum up the information in interest rates in only three factors, which represent the level, the slope and the curvature of the zero-coupon curve and preserve around 95% of the information.
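As a reminder of what "preserving around 95% of the information" means in PCA terms, the sketch below computes the share of variance explained by each component. The curves here are synthetic: the level, slope and curvature loadings, the factor volatilities and the noise level are hypothetical values chosen for illustration, so the printed percentages say nothing about the real USA curve.

```python
import numpy as np

rng = np.random.default_rng(42)
n_days, n_maturities = 500, 8

# Hypothetical loadings for level, slope and curvature across 8 maturities.
grid = np.linspace(-1.0, 1.0, n_maturities)
loadings = np.vstack([
    np.ones(n_maturities),            # level: moves all maturities together
    grid,                             # slope: short end vs long end
    grid ** 2 - np.mean(grid ** 2),   # curvature: middle vs both ends
])

# Synthetic factor scores with decreasing volatility, plus small noise.
scores = rng.normal(size=(n_days, 3)) * np.array([1.0, 0.5, 0.2])
curves = scores @ loadings + 0.01 * rng.normal(size=(n_days, n_maturities))

# PCA via the SVD of the centred data.
centred = curves - curves.mean(axis=0)
_, singular_values, components = np.linalg.svd(centred, full_matrices=False)
explained = singular_values ** 2 / np.sum(singular_values ** 2)

print("Explained variance ratio per component:", np.round(explained, 4))
print("Share kept by the first three:", round(float(explained[:3].sum()), 4))
```

On real data you would replace `curves` with the matrix of observed rates (one column per maturity) and read off how many components are needed to reach the share of variance you want to keep.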

I was wondering whether autoencoders are able to capture the same information as PCA by using only the “encoding process”, because this is the part that compresses the data. So, let’s see how to get a dimensionality reduction through autoencoders.

Getting down to business

First, you should import some libraries:

from keras.models import Model
from keras.layers import Input, Dense
from keras import regularizers
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

Once you have downloaded the data, you can start. So, let’s see what kind of data to use. Remember that the idea is to use autoencoders to reduce the dimensions of interest rate data. The data set is the USA zero-coupon curve from 1995 to 2018. The code below assumes it is already loaded in a variable called data, with one column per series.

# Normalise
scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data)
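For reference, with its default settings MinMaxScaler rescales each column to the range [0, 1] using (x - min) / (max - min). The sketch below reproduces that formula by hand in NumPy on a tiny made-up table (the rates are hypothetical), and shows how the scaling can be undone.

```python
import numpy as np

# Hypothetical rates (in %) for three dates (rows) and two maturities (columns).
rates = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 3.0],
])

# Column-wise minimum and maximum, as learned by the scaler when it is fitted.
col_min = rates.min(axis=0)
col_max = rates.max(axis=0)

# Min-max scaling: every column ends up between 0 and 1.
rates_scaled = (rates - col_min) / (col_max - col_min)

# Inverting the scaling recovers the original values.
rates_restored = rates_scaled * (col_max - col_min) + col_min

print(rates_scaled)
```

Keeping the inputs in a bounded range like this is a common choice when the network uses saturating activations such as tanh.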

Now, creating an autoencoder model to reduce the dimensions of interest rates is a matter of seconds. At this point, you should decide how many layers you want in the “encoding process”. As the aim is to get three components, so that we can relate them to PCA, we need four layers of 8 (the original number of series), 6, 4 and 3 (the number of components we are looking for) neurons, respectively.

# Fixed dimensions
input_dim = data.shape[1]  # 8
encoding_dim = 3
# Number of neurons in each Layer [8, 6, 4, 3, ...] of encoders
input_layer = Input(shape=(input_dim, ))
encoder_layer_1 = Dense(6, activation="tanh", activity_regularizer=regularizers.l1(10e-5))(input_layer)
encoder_layer_2 = Dense(4, activation="tanh")(encoder_layer_1)
encoder_layer_3 = Dense(encoding_dim, activation="tanh")(encoder_layer_2)

In the next step, you create the model and use it to predict the compressed data. This compressed data is supposed to contain all the relevant information in the original data, ignoring the noise.

# Create the encoder model
encoder = Model(inputs=input_layer, outputs=encoder_layer_3)
# Use the model to predict the factors which sum up the information of interest rates.
encoded_data = pd.DataFrame(encoder.predict(data_scaled))
encoded_data.columns = ['factor_1', 'factor_2', 'factor_3']
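One thing to keep in mind: the snippets in this post build the encoder but don't show a training step, and an encoder that has never been trained only applies its randomly initialised weights. Below is a minimal sketch of how the encoder might be trained as part of a full autoencoder before extracting the factors. The mirrored decoder (4, 6 and 8 neurons), the sigmoid output layer, the Adam optimiser, the MSE loss, the number of epochs, the batch size and the random stand-in data are all assumptions made for illustration, not the settings behind any results discussed here.

```python
import numpy as np
from keras.models import Model
from keras.layers import Input, Dense

# Hypothetical stand-in for the scaled curve data: 200 rows, 8 series in [0, 1].
rng = np.random.default_rng(0)
data_scaled = rng.random((200, 8))

input_dim = data_scaled.shape[1]  # 8
encoding_dim = 3

# Encoder: 8 -> 6 -> 4 -> 3, the layer sizes used in this post.
input_layer = Input(shape=(input_dim,))
encoded = Dense(6, activation="tanh")(input_layer)
encoded = Dense(4, activation="tanh")(encoded)
encoded = Dense(encoding_dim, activation="tanh")(encoded)

# Decoder (assumed mirror image of the encoder): 3 -> 4 -> 6 -> 8.
decoded = Dense(4, activation="tanh")(encoded)
decoded = Dense(6, activation="tanh")(decoded)
decoded = Dense(input_dim, activation="sigmoid")(decoded)  # outputs in [0, 1]

# Train the full autoencoder to reproduce its own input.
autoencoder = Model(inputs=input_layer, outputs=decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(data_scaled, data_scaled,
                epochs=20, batch_size=32, shuffle=True, verbose=0)

# The encoder shares the trained layers, so it now returns learned factors.
encoder = Model(inputs=input_layer, outputs=encoded)
factors = encoder.predict(data_scaled, verbose=0)
print(factors.shape)
```

Because the encoder model reuses the layers of the trained autoencoder, no weights need to be copied. A real run would likely need more epochs and some monitoring of the reconstruction loss.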

Now, I’ll leave you with some questions: do autoencoders capture more information than PCA? Is this way of building autoencoders the best one for reducing dimensions?

What else in terms of dimensionality reduction and autoencoders?

This technique can be used to reduce dimensions in any machine learning problem. You can tackle high-dimensional problems by applying the same reduction to both the train and test sets. In this way, you’ll have reduced the dimensionality of your problem and, more importantly, you’ll have removed much of the noise from the data set.
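When reducing both train and test sets, the usual pattern is to fit the reduction on the train set only and then apply that same fitted transformation to the test set. The sketch below illustrates the pattern with a PCA projection standing in for the encoder, simply to keep the example light; the data, the 80/20 split and the helper name reduce_dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data set: 300 observations of 8 features, split 80/20 in order.
data = (rng.normal(size=(300, 3)) @ rng.normal(size=(3, 8))
        + 0.05 * rng.normal(size=(300, 8)))
train, test = data[:240], data[240:]

# Fit the reduction on the train set only: its mean and its top 3 directions.
train_mean = train.mean(axis=0)
_, _, components = np.linalg.svd(train - train_mean, full_matrices=False)
projection = components[:3].T  # shape (8, 3)


def reduce_dimensions(x):
    """Apply the reduction fitted on the train set to any data with 8 features."""
    return (x - train_mean) @ projection


train_reduced = reduce_dimensions(train)
test_reduced = reduce_dimensions(test)

print(train_reduced.shape, test_reduced.shape)
```

With an autoencoder the idea is the same: fit the scaler and train the network on the train set, then call the scaler's transform and the encoder's predict on the test set without refitting either of them.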