
SVM versus a monkey. Make your bets.

P. López

15/09/2016


Ladies and gentlemen, place your bets. Today we are going to do our best to beat one of the most frightening opponents you can face in finance: a monkey.

As you probably already know, in this blog we are all quite obsessed with predicting trends and returns; you can find other nice attempts in ‘Markov Switching Regimes say… bear or bullish?’ by mplanaslasa or ‘Predict returns using historical patterns’ by fjrodriguez2.

Today, we are trying to predict the sign of tomorrow’s return for different currency pairs, and I can assure you that a monkey making random bets on the sign and getting it right 50% of the time is going to be a tough benchmark.

We are going to use an off-the-shelf machine learning algorithm, the support vector classifier. Support Vector Machines are an incredibly powerful method for solving regression and classification tasks.

The Support Vector Machine

The SVM is based on the idea that we can separate classes in a p-dimensional feature space by means of a hyperplane. The SVM algorithm uses a hyperplane and a margin to create a decision boundary for the two classes.

[Figure: support vector machines margin]

In the simplest case, linear classification is possible, and the algorithm selects the decision boundary that maximizes the margin between the classes.

In most financial series you will come across, you are not going to encounter easy, linearly separable sets; the non-separable case is the norm. The SVM gets around this issue with the so-called soft-margin method.

In this case, some misclassifications are allowed, but they penalize the objective function with a term proportional to C (the cost, or budget of errors allowed) and to the distance of each mistake from the margin.

[Figure: support vector machines diagram]

In short, the machine maximizes the margin between classes while minimizing the penalization term, which is weighted by C and essentially bounds the number of misclassified observations.
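In its usual textbook form (not spelled out in the original post), the soft-margin problem reads:

    minimize over w, b, ξ:   ½ ‖w‖² + C Σᵢ ξᵢ
    subject to:              yᵢ ( ⟨w, xᵢ⟩ + b ) ≥ 1 − ξᵢ   and   ξᵢ ≥ 0

where the slack variables ξᵢ measure how far each margin-violating point sits on the wrong side of the margin.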

A very cool feature of SVM classification is that the position and size of the margin are decided only by a subset of the data: the observations closest to the margin, the so-called support vectors. This characteristic makes the algorithm quite robust against outliers or extreme values that lie far from the margin.
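If you want to see this in practice, here is a minimal sketch (scikit-learn on made-up data, not the data from this post): after fitting, the classifier exposes exactly the subset of observations that defines the boundary.

```python
# Minimal sketch: a linear soft-margin SVC on two synthetic blobs.
# Only the points stored in support_vectors_ determine the decision boundary.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 2, rng.randn(50, 2) + 2])  # two 2-D blobs
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_vectors_.shape)  # the few points that define the margin
print(clf.n_support_)              # number of support vectors per class
```

Moving any of the other points around (as long as they stay on the correct side, away from the margin) would leave the boundary untouched.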

Too complex for you? Well, I’m afraid the fun is only just getting started.

The kernels

Imagine now the following situation:

[Figure: support vector machines 2-D dataset diagram]

Do you think this situation is going to be easy for our linear margin classifier? Well, it’s simple to classify, but clearly it can’t be done linearly. However, we can try the kernel trick.

The kernel trick is a very intelligent mathematical technique that allows us to solve implicitly the linear separation problem in a higher dimensional feature space. Let’s see how this is done:

[Figure: support vector machines 3-D kernel diagram]

[Figure: support vector machines non-linear kernel]

When a data set is not linearly classifiable in, for example, ℝ², you can use a mapping function Φ(x) that maps the whole dataset from ℝ² to ℝ³. It is sometimes the case that you can separate the dataset in ℝ³ (a linear boundary is now a plane instead of a line!) and then come back to ℝ²: by applying the inverse mapping to the plane, you get a non-linear decision boundary in your original input space.

In general, if you have d inputs, you can use a mapping from your d-dimensional input space to a p-dimensional feature space. Solving the minimization problem stated above then yields a p-dimensional separating hyperplane that is mapped back into your original input space.

In the example above, the hyperplane in the 3-D feature space (basically just a plane) is mapped back into an ellipse in the original 2-D space. Cool, right?

But there is still more. Maybe go straight to the monkey challenge if your head is already exploding…

From the mathematical solution of the aforementioned optimization problem, it can be shown (with a little bit of pain) that the solution depends only on the dot products of the samples in the feature space.

This mathematical result is the key for performing the kernel trick. As long as you only need the dot products to perform the margin optimization, the mapping does not need to be explicit, and the dot products in the high dimensional feature space can be safely computed implicitly from the input space by means of a kernel function (and a little help from Mercer’s theorem).
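A tiny numerical check makes this less mysterious. For the homogeneous degree-2 polynomial kernel on ℝ², K(x, z) = ⟨x, z⟩², the explicit map is Φ(x) = (x₁², √2·x₁x₂, x₂²), and the kernel evaluated in the input space matches the dot product taken in the feature space (an illustrative example, not taken from the post):

```python
# Kernel trick, by hand: K(x, z) = (x . z)^2 equals the dot product of the
# explicit degree-2 feature maps phi(x) and phi(z).
import numpy as np

def phi(x):
    # explicit mapping from R^2 to R^3 for the homogeneous degree-2 kernel
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi(x) @ phi(z)   # dot product computed in the feature space
implicit = (x @ z) ** 2      # kernel computed directly in the input space
print(explicit, implicit)    # same number, no explicit mapping needed
```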

For example, let’s say that you want to solve your classification problem in a very vast feature space, say 100,000-dimensional. Can you imagine the computational power you would need? I seriously doubt you could even do it. Well, kernels allow you to compute these dot products, and therefore the margin, from the comfort of your lower-dimensional input space.

Some widely used kernels are:

    1. Polynomial Kernel: K(x, x′) = ( Γ ⟨x, x′⟩ + r )^d
    2. Gaussian Kernel: K(x, x′) = exp( −Γ ‖x′ − x‖² )

The Gaussian kernel actually allows you to compute dot-products that are implicitly performed in an infinite-dimensional space, but don’t try to figure out what exactly an infinite dimensional feature space is or your brain might explode.
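In practice you never build that infinite-dimensional space: the kernel value is computed directly from the inputs. A quick sketch with toy numbers (scikit-learn assumed):

```python
# The Gaussian (RBF) kernel value exp(-gamma * ||x - z||^2) is computed
# entirely in the input space, even though it corresponds to a dot product
# in an infinite-dimensional feature space.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[0.0, 1.0]])
z = np.array([[1.0, 3.0]])
gamma = 0.5

manual = np.exp(-gamma * np.sum((x - z) ** 2))
library = rbf_kernel(x, z, gamma=gamma)[0, 0]
print(manual, library)  # identical values
```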

As we have just seen in the kernel formulation, there is a second hyperparameter that you need to tune if you want to use a kernel. This parameter Γ controls the characteristic distance of the influence of a single observation.

Both C and Γ should be carefully chosen in order to perform a nice classification:

[Figure: scikit-learn SVM classification example]
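If you want to play with this yourself, a rough sketch of such an exploration (on a toy dataset, not the data behind this post) could be:

```python
# Score a small grid of (C, gamma) values with 3-fold cross-validation to see
# how sensitive an RBF SVC is to these two hyperparameters.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X_toy, y_toy = make_moons(n_samples=300, noise=0.3, random_state=0)

for C in (0.1, 1, 10):
    for gamma in (0.01, 0.1, 1):
        acc = cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma),
                              X_toy, y_toy, cv=3).mean()
        print(f"C={C:<4} gamma={gamma:<5} accuracy={acc:.3f}")
```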

For further information there is a very clear explanation of the kernel trick by Eric Kim. Also, you might be interested in this nice and rigorous summary of SVM by Alex Smola.

The challenge and the monkeys

Now we are prepared to face the challenge of beating Jeff’s predictive abilities. Let’s meet Jeff:

[Image: smarty pants monkey doing machine learning]

Jeff is a currency market expert and just by random betting is able to get 50% prediction accuracy in forecasting the sign of the next day’s return.

We are going to use different fundamental series in addition to the series of spot prices, including returns of up to 10 lags for each series, making 55 features in total.
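As a rough sketch of how such a feature matrix could be built (the helper and column names below are hypothetical, not the actual code behind the post):

```python
# Hypothetical feature construction: stack lags 1..10 of the daily returns of
# each input series; the target is the sign of the next day's return of the
# currency pair we want to predict.
import numpy as np
import pandas as pd

def build_dataset(prices: pd.DataFrame, target_col: str, n_lags: int = 10):
    returns = prices.pct_change()
    features = {}
    for col in returns.columns:
        for lag in range(1, n_lags + 1):
            features[f"{col}_lag{lag}"] = returns[col].shift(lag)
    X = pd.DataFrame(features)
    y = np.sign(returns[target_col].shift(-1))  # tomorrow's return sign
    data = pd.concat([X, y.rename("target")], axis=1).dropna()
    return data.drop(columns="target"), data["target"]

# Usage with a hypothetical table of spot prices and fundamental series:
# X, y = build_dataset(prices, target_col="USDJPY", n_lags=10)
```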

The SVM we are going to train uses a polynomial kernel of degree 3. Choosing an appropriate kernel is another really tough task, as you can imagine. To calibrate the C and Γ parameters, 3-fold cross-validation is performed on a grid of possible parameter combinations and the best set is finally chosen.
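In scikit-learn terms, that calibration step could look roughly like this (the grid values are placeholders, not the ones actually used in the post):

```python
# 3-fold cross-validated grid search over C and gamma for a degree-3
# polynomial kernel; X, y come from the feature-construction sketch above.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="poly", degree=3), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```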

The results aren’t very encouraging:

[Figure: SVM results on forex data]

We can see that both linear regression and the SVM are able to beat Jeff. Even though the results are not impressive, we are able to extract some information from the data, which is already good news, since daily returns in financial series are not exactly the most informative data you can find.

After cross-validating, the dataset is split into training and test sets and we record the prediction accuracy of the trained SVM. We repeat the random splitting 1,000 times for each currency to get an idea of the stability of the performance.
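A minimal sketch of that stability check (reusing the objects from the previous sketches; the 70/30 split size is an assumption on my part):

```python
# Repeat a random train/test split many times and record the out-of-sample
# hit ratio of the tuned SVM each time.
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

scores = []
for seed in range(1000):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                               random_state=seed)
    model = SVC(kernel="poly", degree=3, **search.best_params_).fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))

print(np.mean(scores), np.std(scores))  # average accuracy and its dispersion
```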

SVM:

[Figure: support vector machines predicting forex]

Linear regression:

[Figure: linear regression predicting forex]

So it seems that in some cases the SVM outperforms simple linear regression, but the variance of its performance is also slightly higher. In the case of USDJPY we are able to predict the sign 54% of the time, on average. That is a fairly good result, but let’s have a closer look.

Ted is Jeff’s cousin. Ted is, of course, also a monkey, but he is smarter than his cousin. Instead of betting at random, Ted looks at the training sample and always bets on the sign that is most frequent in the training output. Let’s use Ted the smarty pants as our benchmark now.
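Ted’s strategy is nothing more than a majority-class baseline; a minimal sketch of it (scikit-learn assumed, reusing X and y from the earlier sketches) would be:

```python
# Ted as a baseline: always predict the most frequent class of the training sample.
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
ted = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
print(accuracy_score(y_te, ted.predict(X_te)))  # Ted's hit ratio
```

The comparison against this benchmark is shown below: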

[Figure: SVM results on forex data vs. the majority-class benchmark]

As we can see now, most of the SVM’s performance came simply from the fact that the machine learned that the classes were not equally likely a priori. Linear regression, in fact, is not able to extract any information at all from the features: only the intercept is meaningful in the regression, and it merely accounts for the fact that one of the classes was more populated.

A little bit of good news: the SVM is able to extract some extra non-linear information from the data, which buys us an additional 2% of prediction accuracy.

Unfortunately, we have no idea what this information might be, since the SVM’s main drawback is that it is not as interpretable as we would wish.

