Principal Component Analysis (PCA) is a technique used to reduce the dimensionality of a data set, finding the causes of variability and sorting them by importance.
If you have a set of observations (features, measurements, etc.) that can be projected on a plane (X, Y) such as:
You can redraw the previous graph on new axes, X* and Y*, which remain orthogonal.
If your observations were these:
And you apply the same change of basis to X* and Y*:
It turns out that you can explain all the observations with the single dimension X*. The Y* axis contains no information, because it takes the same value for every observation, so it can be ignored.
I have therefore reduced the dimensions from two to one, without losing information.
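To make this concrete, here is a minimal sketch in Python with NumPy. The five points are hypothetical (chosen to lie exactly on one line), and the names `rotation`, `x_star` and `y_star` are just illustrative. It rotates the axes, checks that Y* is constant, and rebuilds the original points from X* alone plus that one constant.

```python
import numpy as np

# Hypothetical observations lying exactly on the line y = 0.5 * x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
points = np.column_stack([x, 0.5 * x + 1.0])

# New orthogonal axes: X* runs along the line, Y* is perpendicular to it.
theta = np.arctan(0.5)  # angle of the line with the X axis
rotation = np.array([
    [np.cos(theta), -np.sin(theta)],
    [np.sin(theta),  np.cos(theta)],
])  # column 0 is the X* direction, column 1 is the Y* direction

# Coordinates of every point on the new axes.
new_coords = points @ rotation
x_star, y_star = new_coords[:, 0], new_coords[:, 1]

# Y* takes the same value for all observations, so it carries no information.
print("Y* values:", y_star)

# Rebuild the 2-D points from X* alone plus the single constant Y* value.
reconstructed = np.outer(x_star, rotation[:, 0]) + y_star[0] * rotation[:, 1]
print("max reconstruction error:", np.abs(reconstructed - points).max())
```

With real data the points rarely sit exactly on a line, so Y* would vary a little and dropping it would lose a (hopefully small) amount of information.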
So how did I obtain X* and Y* axes?
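The first Principal Component (X*) is defined as the linear combination of the original variables that has maximum variance. The values in this first component will be represented as:

$$X^* = O a_1$$

where $O$ is the matrix of observations, centred so that each variable has zero mean, and therefore X* has zero mean too. Its variance is:

$$\mathrm{Var}(X^*) = a_1' S a_1$$

where $S$ is the variance-covariance matrix of the observations. Imposing the restriction $a_1' a_1 = 1$ and introducing a Lagrange multiplier $\lambda$:

$$M = a_1' S a_1 - \lambda (a_1' a_1 - 1)$$

Maximizing this expression means differentiating with respect to $a_1$ and setting the result equal to zero:

$$\frac{\partial M}{\partial a_1} = 2 S a_1 - 2 \lambda a_1 = 0$$

which happens to be $S a_1 = \lambda a_1$, where $a_1$ is an eigenvector of the matrix $S$ and $\lambda$ the corresponding eigenvalue. Since $a_1' S a_1 = \lambda a_1' a_1 = \lambda$, the variance is largest when $\lambda$ is the largest eigenvalue of $S$.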
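The result Sa1=λa1 can be checked numerically. The sketch below uses NumPy and randomly generated data (hypothetical, for illustration only). It centres the observations, builds S, takes the eigenvector with the largest eigenvalue, and confirms that the variance of X* equals that eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated 2-D observations (illustration only).
x = rng.normal(size=200)
raw = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=200)])

O = raw - raw.mean(axis=0)      # centre: every column now has mean zero
S = np.cov(O, rowvar=False)     # variance-covariance matrix (columns = variables)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(S)
lam = eigenvalues[-1]           # largest eigenvalue
a1 = eigenvectors[:, -1]        # its eigenvector, already of unit length

x_star = O @ a1                 # values of the first Principal Component

print("S a1 == lambda a1 :", np.allclose(S @ a1, lam * a1))
print("a1' a1            :", a1 @ a1)
print("Var(X*) vs lambda :", x_star.var(ddof=1), lam)
```

`ddof=1` is used so that the variance of X* follows the same convention as `np.cov`, which divides by n - 1 by default.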
Ufff, algebra… so to sum up?
You have to find the X* axis such that the sum of the squared orthogonal distances from the points to the axis is minimum. That axis captures the greatest variability in the data and is therefore the first Principal Component.
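And Y*? Simply take the axis that is orthogonal to X*, and this will be your second Principal Component. (With more than two dimensions, it is the orthogonal direction that captures the most of the remaining variance.)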
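As a sanity check on this geometric view, here is a brute-force sketch with NumPy on hypothetical random data. It tries candidate axes every 0.1 degrees, keeps the one with the smallest sum of squared orthogonal distances, and compares it with the eigenvectors of the covariance matrix. The grid search is only for illustration; it is not how PCA is normally computed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical, centred 2-D observations (illustration only).
x = rng.normal(size=300)
O = np.column_stack([x, 0.5 * x + 0.2 * rng.normal(size=300)])
O = O - O.mean(axis=0)

# Candidate axes through the origin, one every 0.1 degrees.
angles = np.linspace(0.0, np.pi, 1800, endpoint=False)
directions = np.column_stack([np.cos(angles), np.sin(angles)])

# For each candidate axis: project every point on it, take what is left over
# (the orthogonal part), and add up the squared orthogonal distances.
projections = O @ directions.T                                   # (300, 1800)
residuals = O[:, None, :] - projections[:, :, None] * directions[None, :, :]
orth_dist_sq = (residuals ** 2).sum(axis=(0, 2))                 # (1800,)
variance_along = projections.var(axis=0, ddof=1)                 # (1800,)

best = np.argmin(orth_dist_sq)
x_star_axis = directions[best]

# Compare with the eigenvectors of the covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(O, rowvar=False))
a1, a2 = eigenvectors[:, -1], eigenvectors[:, 0]

print("axis with minimum orthogonal distance:", x_star_axis)
print("first eigenvector (sign is arbitrary):", a1)
print("second eigenvector is orthogonal, a1 . a2 =", a1 @ a2)
```

The axis that minimizes the orthogonal distances is also the one with the largest `variance_along`. By Pythagoras the two quantities add up to a constant, so the two descriptions of X* are equivalent.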
Okay, so thus far, you’ve made a change of basis, so you can represent the points in a different way in the plane and sort the new axes by importance… Want an example?
Take 4 assets: 2 fixed income and 2 equities. If we take their annual return, volatility and maximum drawdown, we have the following matrix O.
You can transform this 3D representation into a 2D one while losing almost nothing if, for example, volatility and maximum drawdown provide the same information across the whole set, that is, if they’re strongly correlated.
The eigenvalues of the covariance matrix of the normalized observations O are:
And the representations of the new components are related to the original variables, as follows:
Keeping the first two components retains nearly 100% of the information contained in O. You have reduced the dimensionality while maintaining the relationships between the assets. This lets you view the assets on a plane with two new axes: one measures risk as a combination of volatility and maximum drawdown (MDD), and the other measures return.
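The steps of this example can be sketched with NumPy as follows. The figures in `O` are hypothetical and are NOT the data used in this post; they are only chosen so that volatility and maximum drawdown move together. The sketch normalizes the columns, takes the eigenvalues of the resulting covariance matrix, and projects the 4 assets onto the first two components.

```python
import numpy as np

# HYPOTHETICAL figures (in %), not the article's data.
# Rows: 2 fixed-income assets, then 2 equities.
# Columns: annual return, volatility, maximum drawdown.
O = np.array([
    [2.0,  3.0,  5.0],
    [4.0,  4.0,  6.0],
    [3.0, 15.0, 30.0],
    [9.0, 18.0, 35.0],
])

# Normalize every column to zero mean and unit variance.
Z = (O - O.mean(axis=0)) / O.std(axis=0, ddof=1)
S = np.cov(Z, rowvar=False)     # covariance of normalized data = correlation matrix

# Eigen-decomposition, re-sorted so the most important component comes first.
eigenvalues, eigenvectors = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Share of the total variance carried by each component.
explained = eigenvalues / eigenvalues.sum()

# Coordinates of the 4 assets on the first two components (3-D -> 2-D).
scores = Z @ eigenvectors[:, :2]

print("eigenvalues:", eigenvalues)
print("explained variance ratio:", explained)
print("loadings (rows = return, volatility, MDD):")
print(eigenvectors[:, :2])
print("assets in the new plane:")
print(scores)
```

The columns of `eigenvectors` show how each new axis combines the original variables. Whether they read cleanly as a risk axis and a return axis depends on the data, so inspect the loadings rather than assume it.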
What about adding more dimensions?
It’s definitely possible. PCA lets you understand multidimensional data sets through their most representative components.
The N assets in a set could be the dimensions, and their returns the observations. If each principal component groups the most similar assets, you could filter out duplicates and build rich universes with fewer elements.
And, given an Asset Allocation such as:

$$W = \sum_{i=1}^{n} x_i R_i$$

where $x_i$ are the weights and $R_i$ the returns of the n assets, the first Principal Component of the covariance matrix of the n assets is the one that contains the most information, and its associated eigenvector gives the new weights (of unit norm) that maximize the variance of W.
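A quick numerical check of this claim, as a sketch with NumPy on simulated returns (hypothetical: one common factor plus noise, 5 assets, 500 periods). Note the assumption that the weights have unit norm (their squares sum to 1), as in the eigenvector restriction, rather than summing to 1 as in a usual portfolio.

```python
import numpy as np

rng = np.random.default_rng(2)
n_assets, n_periods = 5, 500

# Hypothetical returns: a common market factor plus asset-specific noise.
common = rng.normal(scale=0.01, size=(n_periods, 1))
exposure = rng.uniform(0.5, 1.5, size=n_assets)
R = common * exposure + rng.normal(scale=0.005, size=(n_periods, n_assets))

S = np.cov(R, rowvar=False)                 # covariance matrix of the n assets
eigenvalues, eigenvectors = np.linalg.eigh(S)
x_pc1 = eigenvectors[:, -1]                 # eigenvector of the largest eigenvalue


def allocation_variance(x):
    """Variance of W = sum_i x_i R_i for a weight vector x, i.e. x' S x."""
    return x @ S @ x


# Variance of W computed directly from the weighted returns.
W = R @ x_pc1
print("Var(W) with first-component weights:", W.var(ddof=1))
print("largest eigenvalue                 :", eigenvalues[-1])

# No random unit-norm weight vector should beat the first eigenvector.
random_x = rng.normal(size=(1000, n_assets))
random_x /= np.linalg.norm(random_x, axis=1, keepdims=True)
random_vars = np.einsum("ij,jk,ik->i", random_x, S, random_x)
print("best of 1000 random unit-norm weights:", random_vars.max())
```

The sign of an eigenvector is arbitrary, so `x_pc1` and `-x_pc1` give the same variance; flip it if you prefer mostly positive weights.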
Uffff, enough for today!!