Scaling / normalisation / standardisation: a pervasive question

J. González


One of the most frequently asked questions when dealing with several features is how to summarise or transform them to similar scales. As you probably know, many Machine Learning algorithms require the input features to be on similar scales. But what if they aren’t? Can we just work with the raw data and hope that our analysis will be right? Well, in some cases, the answer is “yes”.

  • When you use classification trees there is no need to scale the features, since the algorithm grows by answering questions like feature_k > value? The splits depend only on the ordering of the values of each feature, so it does not matter if the values are in one range or another.
  • The problem arises in algorithms whose progress depends on the rate of change (methods based on gradient descent optimization, for example), in methods that maximize some quantity (the variance in PCA, for instance, where unscaled features overweight the larger measurements), and in methods that involve distances (see issue 2. in Nearest Neighbours). All of these are influenced by the scale of the features used, so it is important to work with features in similar scales. So, how can we transform the features?
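To make the distance case concrete, here is a minimal sketch using only the Python standard library. The data (income and age), the candidate names and the rescaling factors are all made up for illustration; the point is only that the nearest neighbour can change once the features are put on comparable scales.

```python
import math

# Hypothetical observations: (annual income in euros, age in years)
query = (30_000.0, 25.0)
candidates = {
    "A": (30_500.0, 60.0),  # similar income, very different age
    "B": (32_000.0, 26.0),  # somewhat different income, similar age
}


def nearest(point, others):
    """Return the key of the candidate closest to point (Euclidean distance)."""
    return min(others, key=lambda name: math.dist(point, others[name]))


# With raw data, income dominates the distance simply because its numbers are larger.
raw_nearest = nearest(query, candidates)

# Assumed "typical spreads" for each feature, used only to bring them to similar scales.
scales = (10_000.0, 10.0)


def rescale(point):
    """Divide each feature by its assumed typical spread."""
    return tuple(value / scale for value, scale in zip(point, scales))


scaled_candidates = {name: rescale(p) for name, p in candidates.items()}
scaled_nearest = nearest(rescale(query), scaled_candidates)

print(raw_nearest, scaled_nearest)
```

With the raw values the 35-year age gap of candidate A is almost invisible next to the income differences; after rescaling, both features contribute to the distance.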

Standardization (Z-score normalization)

It consists of transforming each feature by subtracting its mean value and dividing by its standard deviation:

$$z = \frac{x - \mu}{\sigma}$$

where μ and σ are the mean and the standard deviation of the feature. When z is 0, the observation is at the sample’s mean. When z is 1, the observation is one standard deviation above the mean.
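As a quick illustration of the z-score, the following sketch standardises a small, made-up list of values with the standard library. It assumes the population standard deviation (`statistics.pstdev`); using the sample version (`statistics.stdev`) is an equally valid choice and only changes the scale slightly.

```python
import statistics


def z_scores(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population std, not sample std
    return [(v - mean) / std for v in values]


# Hypothetical feature values; their mean is 5 and their population std is 2.
momentum = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = z_scores(momentum)
print(z)
```

The two observations equal to 5 get a z-score of 0, and the observation equal to 7 gets a z-score of 1, that is, one standard deviation above the mean.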

  1. This z-scoring can be done as a cross-sectional transformation (the most frequent interpretation), in which the mean and standard deviation are calculated over the values across individuals at a given time point (for example the features momentum, volatility, or others, of several assets at a given date). With this transformation, the mean z-score for every feature is the same (0) on every date.
  2. Another way of interpreting this transformation is within each individual across time. Continuing the previous example, with this transformation each asset has the same mean level (0) in every statistic. Sometimes the features are only centred (mean subtracted) in order to evaluate whether the level of the feature is above or below the mean (ipsatization).
  3. And yet another way to calculate it (mostly with longitudinal data), in order to keep the mean-level evolution over time, is by transforming over time and across individuals together. In other words, you calculate the grand mean and grand standard deviation (taking all the data of each variable across individuals and time) and use them to transform the variables.
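The three ways of z-scoring listed above can be compared on a toy panel. This is only a sketch: the two assets, their values and the use of the population standard deviation are assumptions made for the example.

```python
import statistics

# Hypothetical panel: one feature (say, volatility) for two assets on three dates.
panel = {
    "asset_1": [1.0, 2.0, 3.0],
    "asset_2": [3.0, 4.0, 5.0],
}


def standardise(values, mean, std):
    """Apply (x - mean) / std to every value."""
    return [(v - mean) / std for v in values]


# 1. Cross-sectional: mean and std across assets, separately for each date.
names = list(panel)
cross_sectional = {name: [] for name in names}
for at_date in zip(*panel.values()):
    mean, std = statistics.fmean(at_date), statistics.pstdev(at_date)
    for name, value in zip(names, at_date):
        cross_sectional[name].append((value - mean) / std)

# 2. Within individual: mean and std across time, separately for each asset.
within = {
    name: standardise(series, statistics.fmean(series), statistics.pstdev(series))
    for name, series in panel.items()
}

# 3. Grand: one mean and one std computed over all assets and all dates.
pooled = [v for series in panel.values() for v in series]
grand_mean, grand_std = statistics.fmean(pooled), statistics.pstdev(pooled)
grand = {name: standardise(series, grand_mean, grand_std) for name, series in panel.items()}

for label, result in [("cross", cross_sectional), ("within", within), ("grand", grand)]:
    print(label, result)
```

In this toy panel the cross-sectional version is flat over time (the upward trend disappears), the within-individual version makes both assets look identical (the level difference disappears), and only the grand version keeps both the trend and the difference in level.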

The advantages of z-scoring are its simplicity and the possibility of handling outliers by capping the observations whose z-score is extreme (“winsorising”). Because each feature is only shifted and rescaled, it also leaves the sample correlation between features unchanged.
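A minimal sketch of winsorising based on z-scores could look like the following. The function name, the threshold `z_max` and the data are hypothetical; in practice winsorising is also often done with percentiles instead of z-scores.

```python
import statistics


def winsorise_by_z(values, z_max=3.0):
    """Cap every value whose z-score lies outside [-z_max, z_max].

    z_max is a hypothetical threshold chosen by the user; the population
    standard deviation is assumed.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    lower, upper = mean - z_max * std, mean + z_max * std
    return [min(max(v, lower), upper) for v in values]


# Made-up data with one extreme observation (50.0).
data = [10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0, 11.0, 9.0, 50.0]
capped = winsorise_by_z(data, z_max=2.0)
print(capped)
```

Only the last observation is pulled back to the upper bound; the rest are left untouched. Note that the extreme value itself inflates the mean and the standard deviation used to compute the bounds, which is one reason why percentile-based or robust variants are sometimes preferred.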

The main disadvantage is the loss of information about the level of the mean and the standard deviation at different time points when you apply cross-sectional normalization. Also, z-scoring does not change the shape of the distribution: if the original features are not normally distributed, the transformed ones won’t be either.

POMS / min-max transformation

POMS (Proportion Of Maximum Scoring) or min-max scaling consists of rescaling all features to the [0, 1] range by applying the following transformation:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Again, with this scaling we lose the evolution of the mean and standard deviation levels, unless the transformation is calculated across individuals and time points with fixed min and max values. Besides, controlling outliers is problematic: the minimum and the maximum are always mapped to 0 and 1, no matter the dispersion of the raw data, so a single extreme value squeezes all the other observations into a narrow part of the range.
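The effect of an outlier on min-max scaling is easy to see with a small sketch. The data are made up, and the guard against constant features is an addition for safety rather than part of the definition.

```python
def min_max(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined for a constant feature")
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature, first without and then with one extreme observation.
returns = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = returns + [100.0]

scaled = min_max(returns)
scaled_with_outlier = min_max(with_outlier)

print(scaled)
print(scaled_with_outlier)
```

Without the outlier the five values are spread evenly over the whole range; with it, they are all compressed below 0.05 while the outlier sits alone at 1.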

This type of transformation is often used to prepare the inputs of neural networks.

Rank transformation

This transformation replaces each value with its rank, so any feature ends up (approximately) uniformly distributed.

Each value x becomes y = k, where x is the kth largest value of the feature X.

This transformation defines a specific range of values and smooths the effect of outliers. It is possible to aggregate several rankings with the Kemeny-Young method.

The main problem with this transformation is that it distorts correlation and original distances across and within features.
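Here is a minimal sketch of the rank transformation with the convention used above (rank 1 for the largest value). Breaking ties by order of appearance is an assumption of this example; averaging the ranks of tied values is another common choice.

```python
def rank_transform(values):
    """Map each value to k, where the value is the k-th largest (1 = largest).

    Ties are broken by order of appearance (an assumption of this sketch).
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for k, index in enumerate(order, start=1):
        ranks[index] = k
    return ranks


# Made-up feature with one extreme value.
feature = [0.5, 120.0, 0.7, 0.6]
ranks = rank_transform(feature)
print(ranks)
```

The extreme value 120.0 simply becomes rank 1, which shows how outliers are smoothed. It also shows the distortion of distances: the gap between 120.0 and 0.7 and the gap between 0.7 and 0.6 both become exactly one rank.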

Other methods

More complex transformations that try to make the feature distributions closer to normal are the so-called power transformations, like the Box-Cox transform (for strictly positive values) and the Yeo-Johnson transform (which also handles zero and negative values). Their parameter is usually chosen by maximum likelihood. You can find plenty of information on the Internet about these and other methods.
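To give an idea of how a power transformation works, the sketch below implements the Box-Cox transform for a given parameter λ and a naive grid search over its profile log-likelihood. The data and the grid are made up, and real implementations use a proper numerical optimiser rather than a grid, so take it as an illustration only.

```python
import math
import statistics


def box_cox(values, lam):
    """Box-Cox transform: (x**lam - 1) / lam, or log(x) when lam is 0."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def box_cox_loglik(values, lam):
    """Profile log-likelihood of lam (up to a constant), assuming normality after the transform."""
    n = len(values)
    transformed = box_cox(values, lam)
    log_jacobian = (lam - 1.0) * sum(math.log(v) for v in values)
    return -0.5 * n * math.log(statistics.pvariance(transformed)) + log_jacobian


# Hypothetical right-skewed feature.
skewed = [1.0, 4.0, 9.0, 16.0]

sqrt_like = box_cox(skewed, 0.5)  # behaves like a (shifted, rescaled) square root
logged = box_cox(skewed, 0.0)     # plain logarithm

# Naive grid search for lam between -2 and 2 in steps of 0.1 (an assumed grid).
grid = [(i - 20) / 10 for i in range(41)]
best_lam = max(grid, key=lambda lam: box_cox_loglik(skewed, lam))

print(sqrt_like, logged, best_lam)
```

With λ = 0.5 the unevenly spaced values 1, 4, 9, 16 become the evenly spaced 0, 2, 4, 6, which is the kind of "de-skewing" these transformations aim for.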

The previous transformations and more can be tested with the preprocessing module of the excellent scikit-learn library. Give it a try!

Hope you enjoyed the post. Don’t hesitate to share your favourite scaling method with us.