Is everyone around you talking about Machine Learning? Have you heard about some algorithms and techniques but are missing the bigger picture? This is a good place to start…
A new generation of intellect
Machine Learning is a hot topic in the science world right now. By combining the capabilities of computers and humans, it is helping to solve problems that once seemed perplexing or downright impossible.
Machines nowadays can more easily handle the ginormous amount of data constantly being produced, and decipher the complexity of scientific discoveries. Researchers have begun to recognise the potential this science can have in a vast variety of fields, and it’s finally being put into practice.
On researching the topic, you'll find that many of the techniques and algorithms seem familiar to statisticians, engineers, programmers, mathematicians and quants. That's because they have actually been around for years. Machine Learning is (relatively) new as a term, but it is not completely foreign territory for data scientists.
This post is a compilation of interesting bits from my initial research of the topic. I wanted to understand how everything was related, and categorise the different parts, therefore enabling me to choose the best solutions in my current projects.
Although I say nothing new and I am far from an expert, I hope this post and the links are useful for those of you who, just like me, are confused about how to start making use of techniques on offer in the fascinating world of Machine Learning.
So…What is Machine Learning?
In Machine Learning, we let machines learn for themselves. They learn from examples given to them via data. Our role is then to use the conclusions drawn by the machines to improve, speed up and automate processes and tasks.
More precisely, and repeating two popular quotes from two “giants” in the field:
“Machine Learning is the science of getting computers to act without being explicitly programmed.” – Andrew Ng (Coursera)
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” – Tom M. Mitchell (1997)
I’ve heard this before!
Machine Learning (ML) sounds very similar to other types of science that you may have already come across. Let’s see how it differs from them.
You may be more familiar with the term Artificial Intelligence (AI). AI is the science of developing systems or programs that replicate human rationale and then independently achieve set goals. ML is a large part of AI, dealing with creating algorithms that adapt behaviour to empirical data.
Much of ML actually originates from Statistics, just with different names! As opposed to traditional Statistics, ML methods typically assume very little about the data, so their conclusions are not limited by initial distributional assumptions. Think about how often in Statistics you hear ‘let X follow a normal distribution’ or ‘given i.i.d. random variables’. Have you ever thought about how likely these assumptions are to hold?
On the other hand, one disadvantage of ML approaches is the lack of visible, interpretable relationships between variables, something that statistical reasoning tends to be good at. To achieve more accurate predictions, the models tend to be more complex and hence harder to interpret.
Losing interpretability is scary to most data scientists, but often necessary to improve results on challenging problems. In ML, what matters most is often capturing the whole story without needing to analyse the smaller details. Can you detect the information hidden in the data without getting caught up in the technicalities? Try to make out the three objects hidden among the colourful shapes. What do you see in the picture?
ML is also similar to Data Mining, but whereas DM is the science of discovering unknown patterns and relationships in data, ML applies previously inferred knowledge to new data to make decisions in real-life applications.
The key is finding a balance between performance and interpretability (e.g. predictive capacity vs understanding why).
Let’s break it down…
The best way I’ve found to understand ML and start using it in projects is to categorise the different parts. Most people familiar with ML will know the main division: supervised and unsupervised learning.
In simple terms, in supervised learning we already know the answers we want (found in past or completed data).
The idea is to find a model that can predict the answers when we don’t know them (future or incomplete data). We give the machine the data, with inputs and outputs (the answers), and let it learn from the relationships between them.
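As a minimal sketch of this idea (using scikit-learn and synthetic data I've made up for illustration, not anything from a real project):

```python
# Supervised learning sketch: fit a model on known (input, output) pairs,
# then predict the output for an input we haven't seen before.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))            # inputs (the "questions")
y = 3.0 * X[:, 0] + rng.normal(0, 0.1, 100)      # outputs (the known "answers")

model = LinearRegression().fit(X, y)             # learn the relationship
prediction = model.predict([[0.5]])              # answer for an unseen input
```

Here the machine recovers the hidden relationship (output roughly three times the input) purely from the examples, and can then answer for inputs it has never seen.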
In unsupervised learning we want to find unknown structures or trends. The data has no associated labels, but we want to organise the data (groups, clusters) or simplify it somehow (reduce dimensions, remove unnecessary variables or detect anomalies).
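A quick sketch of the unsupervised case, again with invented data: no labels are given, yet the algorithm still organises the points into groups.

```python
# Unsupervised learning sketch: k-means groups unlabelled data by itself.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated blobs of points, with no labels attached
data = np.vstack([rng.normal(0, 0.3, (50, 2)),
                  rng.normal(3, 0.3, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
sizes = np.bincount(kmeans.labels_)   # how many points in each cluster
```

We never told the machine which blob each point came from; it discovered the two groups (50 points each) on its own.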
We can further divide these two types of ML into subcategories. Let’s visualise them:
Supervised learning can be split according to the variable type being predicted. If the predicted value is continuous, we are looking at a Regression problem.
On the other hand, if the variable to be predicted is one of several discrete categories, called classes, it’s known as Classification. These classes can be qualitative labels or discrete quantitative values.
- Predicting next week’s returns of the S&P500 index. Since returns are continuous variables, this is a Regression-type problem.
- Deciding whether the current EURUSD trend is up or down. There are two possible choices: bull or bear movements, making this a Classification problem.
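The classification case can be sketched like so. The features here are entirely made up for illustration (this is not a trading model):

```python
# Classification sketch: predict one of two classes ("up" = 1, "down" = 0)
# from toy features. Synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
features = rng.normal(0, 1, (200, 3))                # made-up daily features
direction = (features.sum(axis=1) > 0).astype(int)   # 1 = "up", 0 = "down"

clf = LogisticRegression().fit(features, direction)
label = clf.predict([[0.5, 0.2, 0.1]])               # predicted class for a new day
```

The predicted value is a class label rather than a continuous number, which is exactly what distinguishes Classification from Regression.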
The main subdivisions in unsupervised learning problems are Cluster Analysis, Density Estimation and Dimension Reduction.
In Cluster Analysis, data is grouped according to similarities or distances between them. In Density Estimation, patterns and data are represented by distributions or defined shapes. In Dimension Reduction, duplicated or unnecessary variables are removed to produce a smaller subset of the original data.
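Dimension Reduction is easy to see in a toy sketch: below, four variables secretly contain only two dimensions' worth of information, and PCA finds this out (synthetic data, not from any real dataset):

```python
# Dimension reduction sketch: PCA compresses correlated variables
# into fewer components while keeping almost all of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
# Four variables, but two are near-duplicates of the other two
data = np.hstack([base, base + rng.normal(0, 0.01, (200, 2))])

pca = PCA(n_components=2).fit(data)
kept = pca.explained_variance_ratio_.sum()   # close to 1.0
```

Two components capture essentially all the information in the four original variables, so the duplicated ones can safely be dropped.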
We can categorise specific techniques according to the type of learning and the problem to be tackled. Here are some examples:
ML techniques use a train-and-test procedure (often extended to full cross-validation, where the data is split repeatedly) before applying findings to real situations.
The machine searches over loads of parameter combinations, so we have to watch out for overfitting and running time in ML. Chasing too much accuracy in the training stage often leads to over-optimisation and worse-than-expected results in the test stage. It also means the algorithm takes far longer to converge to a final answer (i.e. to reduce the cost function enough).
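The train/test idea, and what overfitting looks like, can be sketched with a deliberately unconstrained model on noisy synthetic data:

```python
# Train/test sketch: hold out data to measure generalisation.
# A large gap between train and test scores signals overfitting.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)   # noisy target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # no depth limit
train_acc = deep.score(X_tr, y_tr)   # perfect: the tree memorised the noise
test_acc = deep.score(X_te, y_te)    # noticeably weaker on held-out data
```

The unconstrained tree scores perfectly on the data it trained on but worse on the held-out set, which is precisely the over-optimisation the paragraph above warns about.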
So what about some practical applications?
Machine Learning is everywhere. There are many everyday examples of its use that we don’t even notice. To name a few, the techniques are applied in: search engines, spam filtering, facial recognition, social network analysis, market segmentation, data analysis, fraud detection and risk analysis.
Sometimes a description is not enough. With these complex algorithms, it’s often more beneficial to see practical examples of how to use the techniques with actual data and real-life applications. Let’s take a look at some of the different ways to use ML in finance…
Unsupervised learning techniques can be applied to analyse and understand financial data. For example, PCA can be used as an asset allocation tool, k-means clustering as a way of grouping returns in the equity market, or other clustering techniques to reproduce the S&P 500 or even ISOMAPs to help classify stocks into sectors.
Supervised learning techniques are extraordinarily well suited to financial problems. They can be used to make predictions and help make decisions in investment and risk strategies. For example, techniques like Nearest Neighbours, Neural Networks, Decision trees & Random Forests, and Naïve Bayes have applications to detect the type of market movement in currency crosses and the stock market.
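As a sketch of that workflow, here is a Random Forest predicting next-day direction from lagged returns. The price series is randomly generated, so any "accuracy" is purely illustrative of the mechanics, not of any real predictive edge:

```python
# Finance-flavoured sketch: Random Forest on lagged returns.
# The data is simulated, so accuracy should hover around chance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
prices = 100 * np.cumprod(1 + rng.normal(0, 0.01, 500))   # fake price series
rets = np.diff(prices) / prices[:-1]                      # daily returns

# Features: the previous 3 days' returns; label: did the next day go up?
X = np.column_stack([rets[i:i + len(rets) - 3] for i in range(3)])
y = (rets[3:] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=False)  # keep time order
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)   # near 0.5 on random data, as expected
```

Note the split preserves time order (no shuffling), which matters for financial series: testing on the past while training on the future would leak information.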
More Useful Links…
Want to get started? Check out this ML course at Coursera, or have a go in Python with the purpose-built library scikit-learn. Also, why not try out some of the techniques for yourself on Kaggle; the ‘Titanic’ competition is great for beginners.
Just want to start coding and trying stuff out? The caret package (for R) contains a vast amount of useful details, functions and examples. Also, check out this great Cheat Sheet with Python and R code for ways to implement some of the main ML techniques.
Didn’t like this post? Want more information or a different perspective? Check out this detailed post giving an Introductory Primer to Machine Learning. Alternatively, take a look at this innovative introductory visualisation.
There are many words used to describe the same idea in ML. Here are some common terms used for input and output variables:
Also find more ML jargon in this Glossary of helpful terminology.
If you know of any other valuable or interesting links, please feel free to leave them in the comments. Thanks!