Ladies and gentlemen, place your bets! Today we’re going to do our best to beat one of the most frightening opponents that you can face in finance: a monkey.

As you probably already know, on this blog we are all quite obsessed with predicting trends and returns. You can find other brave attempts in ‘Markov Switching Regimes say… bear or bullish?’ by mplanaslasa or ‘Predict returns using historical patterns’ by fjrodriguez2.

Today, we are trying to **predict the sign of tomorrow’s return** for different currency pairs, and I can assure you that a monkey making random bets on the sign and getting it right 50% of the time is going to be a tough benchmark.

We’re going to use an off-the-shelf **machine learning algorithm**: the support vector classifier. Support Vector Machines are an incredibly powerful method to solve regression and classification tasks.

## The Support Vector Machines

The SVM is based on the idea that we can separate classes in a p-dimensional feature space by means of a hyperplane. The SVM algorithm uses a **hyperplane** and a margin to create a decision boundary for the two classes.

In the simplest case, linear classification is possible, and the algorithm selects the decision boundary that **maximizes the margin** between the classes.

In most financial series you’re not going to encounter easy, linearly separable sets; the non-separable case is going to be the norm. The SVM gets around this issue by implementing the so-called **soft margin method**.

In this case, some misclassifications are allowed, but each one adds a penalty to the function being minimized. The penalty is proportional to C (the cost of errors, or the budget of errors that are allowed) and to the distance of the misclassified point from the margin.

Basically, the machine is going to maximize the margin between classes while minimizing the penalization term weighted by C, which in effect acts as a bound on the number of misclassified observations.
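To make the trade-off a bit more concrete, here is a minimal sketch of the penalized objective. It assumes the common hinge-loss form of the soft margin problem, ½‖w‖² + C·Σ max(0, 1 − yᵢ(w·xᵢ + b)), which may not be the exact formulation the author had in mind. The function name and the toy data are invented for illustration.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * (sum of hinge losses). Labels y must be -1 or +1."""
    margins = y * (X @ w + b)                 # >= 1 means correct and outside the margin
    slack = np.maximum(0.0, 1.0 - margins)    # grows with the distance past the margin
    return 0.5 * np.dot(w, w) + C * slack.sum()

# Toy data (hypothetical): the third point sits on the wrong side of the boundary.
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, -1.0, -1.0])
w = np.array([1.0, 0.0])
b = 0.0

low_cost = soft_margin_objective(w, b, X, y, C=1.0)    # 0.5 + 1 * 1.5
high_cost = soft_margin_objective(w, b, X, y, C=10.0)  # 0.5 + 10 * 1.5
print(low_cost, high_cost)
```

The same mistake costs ten times more when C is ten times larger, so a large C pushes the optimizer towards fewer violations at the price of a narrower margin.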

A very cool feature of SVM classification is that **the position and size of the margin are decided only by a subset of the data**, namely, the observations closest to the margin (the so-called support vectors). This characteristic makes the algorithm quite robust against outliers or extreme values that are far from the margin.
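If you want to see this robustness with your own eyes, the following sketch fits a linear SVM with scikit-learn's `SVC` on a tiny made-up dataset, then adds one extreme observation far from the margin and fits again. The data and the value of C are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable groups.
X = np.array([[-1.0, 0.0], [-1.0, 1.0], [-2.0, 0.0],
              [1.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=100.0).fit(X, y)

# Add an extreme observation, far from the margin and on the correct side.
X_out = np.vstack([X, [50.0, 20.0]])
y_out = np.append(y, 1)
clf_out = SVC(kernel="linear", C=100.0).fit(X_out, y_out)

print("support vectors:\n", clf.support_vectors_)
print("weights before:", clf.coef_, "after:", clf_out.coef_)
print("outlier is a support vector:", 6 in clf_out.support_)
```

The fitted weights should be essentially the same in both cases, because the new point never becomes a support vector.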

Too complex for you? Well, I’m afraid the fun is only just getting started.

## The kernels

Imagine now the following situation:

Do you think this situation is going to be easy for our linear margin classifier? Well, it’s simple to classify but clearly it can’t be done linearly. However, we *can* try the kernel trick.

The kernel trick is a very intelligent mathematical technique that **allows us to implicitly solve the linear separation problem** in a higher dimensional feature space. Let’s see how this is done:

When a data set is not linearly separable in, for example, ℝ^{2}, you can use a mapping function Φ(x) that maps the whole dataset from ℝ^{2} to ℝ^{3}. It is sometimes the case that you can separate the dataset in ℝ^{3} (a linear boundary is going to be a plane now instead of a line!) and come back afterwards to ℝ^{2}. By applying the inverse mapping to the plane, you **get a non-linear decision boundary** in your original input space.

In general, if you have d-inputs you can use a mapping from your d-dimensional input space to a p-dimensional **feature space**. Performing the minimization problem stated above will yield as solution a p-dimensional separating hyperplane that will be mapped back into your original **input space**.

In the example above, a hyperplane in the 3-dimensional feature space (basically just a plane) is mapped back into an ellipse in the original 2-d space. Cool, right?
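Here is a small sketch of that idea with an explicit mapping. The data (two concentric rings) and the mapping Φ(x) = (x₁, x₂, x₁² + x₂²) are assumptions picked for illustration, not necessarily the ones behind the original figure.

```python
import numpy as np

def phi(X):
    """Map points from R^2 to R^3 by appending the squared radius."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2 + X[:, 1] ** 2])

# Hypothetical data: an inner ring (radius 1) surrounded by an outer ring (radius 3).
angles = np.linspace(0.0, 2.0 * np.pi, 20, endpoint=False)
ring = np.column_stack([np.cos(angles), np.sin(angles)])
inner, outer = 1.0 * ring, 3.0 * ring

# No straight line separates the rings in R^2, but in R^3 the plane z = 4 does.
z_inner = phi(inner)[:, 2]
z_outer = phi(outer)[:, 2]
separable = bool(z_inner.max() < 4.0 < z_outer.min())
print(separable)
```

Mapped back to the input space, the plane z = 4 becomes the curve x₁² + x₂² = 4, a circle, which is just a special case of an ellipse.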

## But there’s (still) more.

If your head is already exploding, feel free to skip straight to the monkey challenge…

From the mathematical solution of the aforementioned optimisation problem, with a little bit of hard work, it can be shown that the solution depends only on the **dot products** of the samples in the feature space.

This mathematical result is the key for performing the kernel trick. As long as you only need the dot products to perform the margin optimisation, the mapping does not need to be explicit, and the dot products in the high dimensional feature space can be safely computed implicitly from the input space by means of a kernel function (and a little help from Mercer’s theorem).

For example, let’s say that you want to solve your classification problem in a vast feature space, say 100,000-dimensional. Can you imagine the computational power that you would need to map every observation explicitly? I seriously doubt that it’s even practical. On the other hand, kernels allow you to compute these dot products, and therefore the margin, from the comfort of your lower dimensional input space.
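A tiny numerical check may help here. The sketch below uses a degree-2 homogeneous polynomial kernel on ℝ^{2}, chosen only because its explicit feature map is short enough to write down. It compares the dot product computed in the feature space with the kernel evaluated directly in the input space. The function names are made up.

```python
import numpy as np

def phi2(x):
    """Explicit degree-2 feature map from R^2 to R^3."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def k2(x, z):
    """Same dot product, computed without ever leaving the input space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi2(x), phi2(z))  # map first, then take the dot product
implicit = k2(x, z)                  # kernel shortcut
print(explicit, implicit)
```

Both numbers agree, yet the second one never builds the higher dimensional vectors. That saving is what makes huge feature spaces affordable.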

Some widely used kernels are:

- Polynomial Kernel: ( Γ<*x*, *x*′> + r )^{d}
- Gaussian Kernel: exp( -Γ|*x* – *x*′|^{2} )

The Gaussian kernel actually allows you to **compute dot-products** that are implicitly performed in an infinite-dimensional space, but don’t try to figure out what exactly an infinite dimensional feature space is or your brain might explode.

As we’ve just seen in the kernel formulation, there’s a second hyperparameter that you need to tune if you want to use a kernel. This parameter Γ controls the characteristic distance of the influence of a single observation.
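For the curious, the two kernels can be sketched in a few lines of NumPy. The function names and default values below are illustrative choices, with `gamma` playing the role of Γ. The last lines show how Γ sets the reach of a single observation in the Gaussian case.

```python
import numpy as np

def polynomial_kernel(x, z, gamma=1.0, r=0.0, d=3):
    """(gamma * <x, z> + r) ** d"""
    return (gamma * np.dot(x, z) + r) ** d

def gaussian_kernel(x, z, gamma=1.0):
    """exp(-gamma * |x - z|^2)"""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

poly = polynomial_kernel([1.0, 2.0], [3.0, 4.0], gamma=1.0, r=1.0, d=3)  # (11 + 1) ** 3

# Two points at distance 1: a small gamma keeps them "similar",
# a large gamma makes the influence of each point very local.
wide = gaussian_kernel([0.0, 0.0], [1.0, 0.0], gamma=0.1)
narrow = gaussian_kernel([0.0, 0.0], [1.0, 0.0], gamma=10.0)
print(poly, wide, narrow)
```

With a small Γ the kernel value stays close to 1 even for distant points, giving smooth boundaries. With a large Γ it drops to almost 0 very quickly, which allows wiggly boundaries that can overfit.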

Both C and Γ should be carefully chosen in order to perform a nice classification:

For further information there is a very clear explanation of the kernel trick by Eric Kim. Also, you might be interested in this nice and rigorous summary of SVM by Alex Smola.

## The challenge and the monkeys

Now we are prepared to face the challenge of beating Jeff’s predicting abilities. Let’s meet Jeff:

Jeff is a currency market expert and just by random betting is able to get 50% prediction accuracy in forecasting the sign of the next day’s return.

We are going to **use different fundamental series** in addition to the series of spot prices, including returns of up to 10 lags for each series, making 55 features in total.
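The exact composition of the 55 features is not spelled out here, so the following is only a sketch of how lagged-return features could be assembled. It assumes a matrix of daily returns with one column per series. The function name, the column layout, and the convention of counting a zero return as "down" are all assumptions.

```python
import numpy as np

def make_lagged_dataset(returns, n_lags, target_col=0):
    """Build (X, y): row t of X holds lags 1..n_lags of every series,
    and y is the sign of the target series' return on day t."""
    returns = np.asarray(returns, dtype=float)
    n_days = returns.shape[0]
    rows = []
    for t in range(n_lags, n_days):
        window = returns[t - n_lags:t][::-1]  # shape (n_lags, n_series), lag 1 first
        rows.append(window.T.ravel())         # series 0 lags 1..n, then series 1, ...
    X = np.array(rows)
    # Assumption: a return of exactly zero is labelled -1.
    y = np.where(returns[n_lags:, target_col] > 0, 1, -1)
    return X, y

# Toy returns for two hypothetical series over four days.
toy = np.array([[0.01, -0.02],
                [-0.03, 0.04],
                [0.05, -0.06],
                [-0.07, 0.08]])
X, y = make_lagged_dataset(toy, n_lags=2)
print(X)
print(y)
```

Each row only uses information available before the day being predicted, which is the one thing you really cannot get wrong in this kind of exercise.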

The SVM that we are going to train is going to **use a polynomial kernel of degree 3**. Choosing an appropriate kernel is another really tough task, as you can imagine. To calibrate the C and Γ parameters, 3-fold cross validation is performed on a grid of possible parameter combinations, and the best-performing combination is chosen.
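A calibration of this kind could look roughly like the sketch below, which uses scikit-learn's `GridSearchCV`. The synthetic data, the grid values, and the `coef0=1.0` setting are placeholders, not the ones used for the results in this post. Note also that `cv=3` builds folds without regard to time ordering, and how the folds were actually formed is not stated here.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data standing in for the lagged features and next-day signs.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=120) > 0, 1, -1)

# Hypothetical grid: "gamma" plays the role of the kernel parameter, "C" the cost.
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1]}

search = GridSearchCV(
    SVC(kernel="poly", degree=3, coef0=1.0),
    param_grid,
    cv=3,                 # 3-fold cross validation
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```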

The results aren’t very encouraging:

We can see that both linear regression and SVM are able to beat Jeff. Even though the results are not promising, we are able to extract some information from the data – which is already good news, since daily returns in financial series aren’t exactly the most informative series in data science.

After cross validating, the dataset is split into training and test sets, and we record the prediction ability of the trained SVM. We repeat the random splitting 1000 times for each currency in order to have an idea of the stability of the performance.
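The repeated random splitting can be sketched with scikit-learn's `ShuffleSplit`. Everything below is a stand-in: the data is synthetic, the hyperparameters are arbitrary, the test fraction is a guess, and only 50 repetitions are run to keep the example quick.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.svm import SVC

# Placeholder data standing in for one currency pair.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + rng.normal(size=200) > 0, 1, -1)

# Each repetition draws a fresh random train/test partition.
splitter = ShuffleSplit(n_splits=50, test_size=0.25, random_state=0)
model = SVC(kernel="poly", degree=3, C=1.0, gamma=0.1, coef0=1.0)

scores = cross_val_score(model, X, y, cv=splitter, scoring="accuracy")
print("mean accuracy:", scores.mean(), "std:", scores.std())
```

The spread of `scores` is what gives an idea of how stable the performance is from one split to the next.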

SVM:

Linear regression:

So it seems that in some cases SVM outperforms simple linear regression, but the variance of the performance is also slightly higher. In the case of USDJPY we are able to predict the sign 54% of the time, on average. It’s a fairly good result. Let’s have a closer look.

Ted is Jeff’s cousin. Ted is, of course, also a monkey, but he’s smarter than his cousin. Instead of betting at random, Ted looks at the training sample and bets that the sign is always going to be the one that is most frequent in the training output. Let’s use Ted the smarty pants as a benchmark now:
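Ted's strategy is easy to write down. Here is a minimal sketch with made-up labels, where the function name is hypothetical:

```python
import numpy as np

def majority_class_accuracy(y_train, y_test):
    """Accuracy of always betting on the most frequent training label."""
    labels, counts = np.unique(y_train, return_counts=True)
    guess = labels[np.argmax(counts)]
    return float(np.mean(np.asarray(y_test) == guess))

# Toy labels: "up" days (+1) are more frequent in the training sample.
y_train = np.array([1, 1, 1, -1, -1, 1])
y_test = np.array([1, -1, 1, 1])

acc = majority_class_accuracy(y_train, y_test)
print(acc)
```

Whenever the classes are unbalanced, Ted beats a coin flip without looking at a single feature. scikit-learn's `DummyClassifier(strategy="most_frequent")` does the same job if you prefer something off the shelf.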

As we can see now, most of the performance of SVM just came from the fact that **the machine learned that the classes were not equally likely a priori**. Actually, linear regression is not able to get any information at all from the features: only the intercept is meaningful in the regression, and it accounts for the fact that one of the classes was more populated.

A little bit of good news: the SVM is able to get some extra non-linear information from the data that allows us to get an extra 2% of prediction accuracy.

Unfortunately, we have no idea of what this information can be as SVM has the drawback that it’s not as interpretable as we could wish for.