# Learning with kernels: an introductory approach

### ogonzalez

#### 28/09/2016


Time series pervade financial markets and, although some embrace the so-called efficient market hypothesis, stating that current market prices reflect all available information about a security into its price, I’m more inclined to think they provide us with a lot of information that we rarely know how to exploit for our own benefit.

I agree that financial time series may be damn difficult to predict, but that does not mean we cannot beat a guessing monkey or perform better than our favourite benchmark. Frequently, and this is common to many other disciplines, the two basic characteristics of time series (statistical distribution of amplitudes and time structure), are studied independently.

We all have heard of fat-tailed financial returns and using the autocorrelation function à la Box-Jenkins to understand the time structure of the series but, sadly, in the vast majority of times analysis ends here.

However, real-world financial returns are hardly ever independently distributed, and the relationship between them is not necessarily linear, a fact that tools like the autocorrelation function fail to identify. Remember, the autocorrelation function is basically a sequence of Pearson correlation coefficients (i.e., a measure of linear dependence) between the time series and lagged versions of itself. Let me illustrate this point with an example.

The following figure compares the autocorrelation function of the daily returns of the S&P 500 index with that of white Gaussian noise:

The goal of this type of analysis is usually evaluating if it would be a good idea to come up with an autoregressive model of the form
$$y_t = w_0 + \sum_{i=1}^N w_i y_{t-i},$$
or, equivalently,
$$y_t = \mathbf{w}^T\mathbf{x}_t,$$
where $$\mathbf{x}_t=[1,y_{t-1}, y_{t-2}, \ldots, y_{t-N}]^T$$ denotes the vector of regressors or features used to model $$y_t$$.

In least-squares regression, the coefficients $$\mathbf{w}$$ are obtained by solving the training problem
$$\underset{\mathbf{w}}{\text{argmin}\,}\|\mathbf{X}^T\mathbf{w}-\mathbf{y}\|^2,$$ and then used to predict future outcomes by applying the prediction/classification rule:
$$y_{t+1} = \mathbf{w}^T\mathbf{x}_{t+1}.$$
However, from the autocorrelation function above, it looks like the time series at hand is rather white (autocorrelation function is approximately zero everywhere except for the zero-th lag) and, therefore, no substantial linear correlation exists among lagged versions of the time series. In this case, the aforementioned autoregressive model would not be of much help.

## Going non-linear

Nevertheless, a well-known property of financial returns is the so-called volatility clustering. According to this, similar price changes tend to cluster together, resulting in persistence of the amplitudes of price changes, but not their sign.

A quantitative manifestation of this fact is that squared returns show a slow-decaying autocorrelation function, revealing that the series is not as white as previously indicated:

This observation has historically motivated the introduction of more elaborate tools such as ARCH, GARCH or stochastic volatility models. Another alternative is introducing a non-linear model trying to explain returns as a non-linear function of previous returns. For example, a quadratic model will look like this:
$$y_t=w_0+w_1y_{t-1}+w_3y_{t-2}+w_4y_{t-1}^2+w_5y_{t-1}y_{t-2}+w_6y_{t-2}^2,$$ which writes $$y_t = \mathbf{w}^T\phi(\mathbf{x}_t),$$ with $$\phi(\mathbf{x}_t)$$ being the augmented feature vector
$$\phi(\mathbf{x}_t)=[1,y_{t-1},y_{t-2},y_{t-1}^2,y_{t-1}y_{t-2}, y_{t-2}^2]^T,$$ and everything stays the same as before, with $$\mathbf{x}_t$$ replaced by $$\phi(\mathbf{x}_t)$$. The training problem and prediction rule stay exactly the same, but with a subtle difference in output: by augmenting our feature space, our model has gained in expressiveness, at the cost of making these two tasks a little bit more computationally demanding. For our toy quadratic example this may be affordable, but what if we want to include higher order terms into our model? The computational cost will soon get prohibitive, unless we use the kernel trick.

## Trick and treat

Many classification and regression problems can be written in the same general form, consisting of a training problem,
$$\underset{\mathbf{w}}{\text{argmin}\, }L(\mathbf{X}^T\mathbf{w},\mathbf{y}),$$ where $$L$$ denotes a general loss function which depends on the particular problem, and a prediction/classification rule that depends only on $$\mathbf{w}^T\mathbf{x}$$, where $$\mathbf{x}$$ is the latest available data point.

Some representative examples belonging to this general family of problems are shown in the following table:

Algorithm Loss function name Loss function $$L(z,y)$$
Linear regression Squared loss $$\|z-y\|^2$$
SVM classification Hinge loss $$\sum\max(0,1-y_iz_i)$$
Logistic regression Logistic loss $$-\sum\log(1+e^{-y_iz_i})$$

A collection of results known as representer theorems state that it’s possible to write the solution to any problem in this family as a linear combination of dot products in the feature space. As you may have already noticed, current formulations of the representer theorem sound a bit cryptic (if not, try Wikipedia’s entry). In the following, I will try to put it in simpler terms by resorting to elementary linear algebra facts involving two of the four fundamental subspaces.

First, note that $$\mathbf{X}^T\mathbf{w}$$ is simply the projection of $$\mathbf{w}$$ in the subspace spanned by the columns of $$\mathbf{X}$$ and, consequently, $$\mathbf{w}$$ can be written as the sum of two orthogonal vectors:
$$\mathbf{w} = \mathbf{X} \mathbf{v} + \mathbf{r},$$
where $$\mathbf{X} \mathbf{v}$$ is, by construction, lying in the column space of $$\mathbf{X}$$ and $$\mathbf{r}$$ lies in the null space of $$\mathbf{X}^T$$, i.e., fulfills $$\mathbf{X}^T\mathbf{r} = 0$$.

That being said, both our general training problem (with its loss function) and our prediction/classification rule can be reparameterised as a function of $$\mathbf{v}$$ instead of $$\mathbf{w}$$. In particular, the training problem turns into:
$$\underset{\mathbf{v}}{\text{argmin}\,}L(\mathbf{X}^T\mathbf{X}\mathbf{v},\mathbf{y}),$$ while the prediction/classification rule depends solely on: $$\mathbf{v}^T\mathbf{X}^T\mathbf{x}.$$

And it’s here that the kernel trick comes to fruition. Note that the loss function depends only on what is called the kernel matrix $$\mathbf{K} = \mathbf{X}^T\mathbf{X}$$, which basically contains dot products between all data pairs in the training set. Likewise, the prediction/classification rule depends on $$\mathbf{k} = \mathbf{X}^T\mathbf{x}$$, that is, the dot products between data points in the training set and a newly available data point. Therefore, the computational effort involved in solving the training problem and making a prediction depends only on our ability to quickly evaluate such dot products. In order to apply this idea to large dimensional feature spaces, what we really need is a kernel function, $$k(\mathbf{x}_i,\mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$$, able to compute dot products efficiently in the feature space, and that will do the trick. A commonly used example is the Gaussian or radial basis function (RBF) kernel: $$k(\mathbf{x}_i,\mathbf{x}_j)=\exp(-\gamma\|\mathbf{x}_i-\mathbf{x}_j\|^2).$$

Toolkits like Python’s scikit-learn implement a variety of kernel functions and even allow you to define custom kernels that can be passed to many methods as parameters. As shown in the previous blog post, the best known example of kernel methods are support vector machines (SVMs), which are derived from the hard margin version of the generalised portrait algorithm (which was, indeed, the first algorithm using kernels), but many other kernel methods are also available. Some of them are already available in scikit-learn, such as kernel principal component analysis (K-PCA), kernel ridge regression and Gaussian processes.

However, as you can imagine, the ideas above can be applied to a plethora of machine learning algorithms, thus turning them into kernel methods. To cite some examples, we have: relevance vector machines, and the kernelized version of discriminant analysis, Fisher discriminant analysis, canonical correlation analysis, independent component analysis, K-means, spectral clustering,…

Last but not least, I would like to stress that, although I have omitted it here for the sake of clarity, the training problem for these methods usually includes a regularisation term to avoid overfitting; a common phenomenon in such a high-dimensional feature space.

## Getting fundamental

Going back to where we started, would it be possible to apply this kernelisation idea to perform a non-linear extension of the concept of autocorrelation function? The answer is yes and, indeed, the resulting function is known as generalised autocorrelation function.

Interestingly, when the Gaussian kernel is applied, the generalised autocorrelation function shows some interesting connections with information-theoretic concepts such as the quadratic Renyi’s entropy of the underlying random process, thus closing the gap between the two apparently separated (purely statistical and time structural) approaches I started this post with. For this reason, the generalised autocorrelation function is also known as autocorrentropy, but I will leave the details for a different post.

Meanwhile, you can find out more about the interplay of information theory and efficient market hypotheses in yet-another great Turing Finance post.