Machine Learning

What is the difference between feature extraction and feature selection?


What is feature extraction/selection?

Straight to the point:

  • Extraction: Getting useful features from existing data.
  • Selection: Choosing a subset of the original pool of features.

Why must we apply feature extraction/selection?

Feature extraction is a fairly complex task: translating raw data into the inputs that a particular Machine Learning algorithm requires. The model is the motor, but it needs fuel to work. Features must represent the information in the data in the format that best fits the needs of the algorithm that will be used to solve the problem.

While some inherent features can be obtained directly from raw data, we usually need to derive further features from them that are actually relevant to the underlying problem. A poor model fed with meaningful features will surely perform better than an amazing algorithm fed with low-quality features: "garbage in, garbage out".

Feature extraction fills this requirement: it builds valuable information from raw data by reformatting, combining, and transforming primary features into new ones, until it yields a new set of data that the Machine Learning models can consume to achieve their goals.

Feature selection, for its part, is a clearer task: given a set of candidate features, select some of them and discard the rest. Feature selection is applied either to remove redundant or irrelevant features, or simply to limit the number of features in order to prevent overfitting.

Note that if the features are correlated, we could apply PCA to reduce the dimensionality and eliminate the redundancy. In that case we would be doing feature extraction, since we would be transforming the primary features rather than just selecting a subset of them.
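As a minimal illustration of PCA as extraction, scikit-learn's `PCA` projects the original features onto a smaller set of new, uncorrelated components (the data here is random and only serves to show the shapes):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 100 samples described by 10 (possibly redundant) features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Extract 3 new features: linear combinations of the original 10
pca = PCA(n_components=3)
X_new = pca.fit_transform(X)

print(X_new.shape)  # (100, 3): same samples, fewer (transformed) features
```

Each resulting column is a combination of all the original features, which is why this counts as extraction rather than selection.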

When should we apply feature extraction/selection?

First of all, we have to take into account what kind of algorithm we are going to feed with the produced features. Abstraction capability and sensitivity to irrelevancy and redundancy vary a lot across Machine Learning techniques.

In general, a minimum of feature extraction is always needed. The only case in which we would not need any feature extraction is when the algorithm can perform it by itself, as deep neural networks do: they can learn a low-dimensional representation of high-dimensional data. Even so, it must be pointed out that success is always easier to reach with good features.

We should apply feature selection when there is a suspicion of redundancy or irrelevancy, since these hurt model accuracy or, at best, add noise. Sometimes, even with relevant and non-redundant features, feature selection may be performed just to reduce the number of features, in order to favor interpretability and computational feasibility, or to avoid the curse of dimensionality, i.e., too many features describing too few samples.

How to apply feature extraction/selection?

The answer is pretty well defined for feature selection, where the techniques fall into three groups:

  • Wrappers: a wrapper sequentially evaluates a specific model on different candidate subsets of features to find the subset that works best in the end. Wrappers are computationally costly and prone to overfitting, but, on the other hand, they also have a high chance of success.
  • Filters: a much faster alternative. Filters do not test any particular algorithm; instead, they rank the original features according to their relationship with the problem (the labels) and keep the top of the ranking. Correlation and mutual information are the most widespread criteria. There are many easy-to-use tools, like the feature selection module in scikit-learn.
  • Embedded: this group comprises the Machine Learning techniques that perform feature selection as part of their own training stage. LASSO is an example.

Feature selection: filter, wrapper, embedded
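The three approaches can be sketched with scikit-learn on synthetic data; this is only an illustration, and the dataset sizes, the choice of k = 4, the logistic regression wrapper model, and the LASSO alpha are arbitrary assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import Lasso, LogisticRegression

# Synthetic problem: 10 features, of which only 4 are informative
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

# Filter: rank features by mutual information with the labels, keep the top 4
X_filtered = SelectKBest(mutual_info_classif, k=4).fit_transform(X, y)

# Wrapper: recursive feature elimination around a specific model
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)

# Embedded: LASSO drives irrelevant coefficients to exactly zero while training
lasso = Lasso(alpha=0.1).fit(X, y)
kept_by_lasso = np.flatnonzero(lasso.coef_)  # indices of surviving features

print(X_filtered.shape)        # (200, 4)
print(rfe.support_.sum())      # 4 features kept by the wrapper
```

Note how all three return a subset of the original columns, unchanged: that is what distinguishes selection from extraction.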

However, things are not so clear when discussing feature extraction, as there are no limits to the ways of creating features. Drawing out meaningful features often requires extensive exploration involving expertise, creativity, intuition and time, lots of time. This is why, whereas automatic feature selection is already here, feature extraction is far less developed.

A feature extraction pipeline varies a lot depending on the primary data and the algorithm to be used, so it is difficult to describe in the abstract. This is an example:

Feature extraction pipeline
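A pipeline along these lines can be sketched with scikit-learn's `Pipeline`; this is just one hypothetical arrangement, assuming the Iris toy dataset and scaling plus PCA as the extraction steps:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Chain the extraction steps and the model so they are fit together
pipe = Pipeline([
    ("scale", StandardScaler()),        # put features on a comparable scale
    ("extract", PCA(n_components=2)),   # build 2 new features from the 4 originals
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
accuracy = pipe.score(X, y)
```

Packing extraction into the pipeline keeps every transformation learned on the training data only, which avoids leaking information when the model is later evaluated.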

Furthermore, there is no complete consensus on which of the above tasks actually count as feature extraction:

What is feature construction? Sometimes it is used to refer to manual, as opposed to automatic, feature extraction.

What about feature learning? It usually means automatic feature extraction.

And feature transformation? It usually denotes less sophisticated transformations of the features, like re-scaling data, bucketing, etc. Some people do not consider it feature extraction proper, reserving extraction for the most scientific part of the work.

And preprocessing? It is normally the name for tasks like organizing and cleaning the data, dealing with missing values and outliers, encoding categorical values, etc. Again, not everyone considers this feature extraction, but rather a preliminary task.
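A minimal sketch of two such preprocessing tasks, assuming scikit-learn and made-up toy values:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Missing values: fill the gap in a numeric column with the column median
nums = np.array([[1.0], [np.nan], [3.0]])
nums_clean = SimpleImputer(strategy="median").fit_transform(nums)

# Categorical values: encode each category as a one-hot vector
cats = np.array([["red"], ["blue"], ["red"]])
cats_enc = OneHotEncoder().fit_transform(cats).toarray()

print(nums_clean[1, 0])  # 2.0, the median of 1.0 and 3.0
print(cats_enc.shape)    # (3, 2): one column per category
```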

And feature engineering? Well, it is sometimes used as a synonym for feature extraction, although, unlike with extraction, there seems to be a fairly universal consensus that engineering covers not only creative constructions but preprocessing tasks and naïve transformations as well.

And are these concepts related to data mining? Yes, of course, but… stop!!

Perhaps it is too soon to try to put a label on every task involved in the Machine Learning field; it is good enough to know what makes sense as an input that helps our model succeed, at least until automatic feature extraction comes up with an alternative.

While we wait, and perhaps for less time than we think, don't take features for granted; most of the time the problems are in the data, not in the algorithm. Good recipes need good ingredients, so take care of your features.