In recent articles, we discussed PCA and ISOMAP as techniques for dimensionality reduction. This time we focus on t-SNE as a way to visualise and understand multidimensional datasets in a low-dimensional space, where the human eye can find patterns easily.
t-SNE (t-distributed stochastic neighbor embedding) was developed in 2008 by Laurens van der Maaten and Geoffrey Hinton. It comprises two main stages:
- Stage One: t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects have a high probability of being picked, whilst dissimilar objects have an extremely small probability of being picked.
- Stage Two: t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points on the map.
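The two stages can be written down compactly. The block below follows the standard formulation of t-SNE; the notation is not from the text above and is introduced here only for illustration: $x_i$ are the high-dimensional points, $y_i$ their positions on the map, $N$ the number of points, and $\sigma_i$ a per-point Gaussian bandwidth.

```latex
% Stage one: similarities in the high-dimensional space
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

% Stage two: similarities on the low-dimensional map
% (Student t-distribution with one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Cost minimised with respect to the map positions y_i
C = \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Each $\sigma_i$ is chosen so that the conditional distribution around point $i$ has a user-specified perplexity, which is why perplexity is the main tuning parameter of the algorithm.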
In this Google talk, Laurens van der Maaten explains how the algorithm works and compares it with PCA and ISOMAP. He gives a clear example in which he tries to group images of handwritten digits, like those in the photo on the right:
Each colour in the picture below represents one of the digits from 0 to 9. With PCA and ISOMAP some groups, such as orange (number 1) or red (number 0), are clearer than others, but with t-SNE the separation is striking. It is important to realise that the algorithm only sees the images of the digits; the colours are added afterwards to validate the result.
So how can I apply this to finance?
I have 67 ETFs, all of them fixed income from North America in dollars, and I want to plot them according to the correlations between them. I calculate the correlations over a common period of 5 years, which gives a dataset of 67 observations by 67 features (the correlation matrix) covering 7 different fixed income asset types.
    library(quantmod)
    library(RDRToolbox)
    library(tsne)

    tickers <- c("IEF", "SHY", "TLT", "TFI", "AGG", "TIP", "MUB", "HYG", "GBF", "CSJ", # 10
                 "TLH", "IEI", "INY", "PZA", "AGZ", "CIU", "GVI", "MBB", "PHB", "BSV",
                 "EDV", "IPE", "JNK", "CXA", "LWC", "TLO", "VGIT", "VGSH", "VMBS", "CLY",
                 "CMF", "NYF", "SUB", "BAB", "VCIT", "VCSH", "CPI", "US13.PA", "US10.PA", "US57.PA",
                 "US1.PA", "US3.PA", "US7.PA", "SMB", "SMMU", "STIP", "TUZ", "CSBGU7.MI", "IDTM.L", "XUT3.L",
                 "XUTD.L", "XUIT.L", "ITPS.MI", "IBTS.MI", "HYD", "HYLD", "MUNI", "ITM", "MLN", "CORP",
                 "STPZ", "LTPZ", "ZROZ", "UDN", "CRED", "MINT", "SCHP")

    type <- c('Govern', 'Govern', 'Govern', 'Govern', 'Aggreg', 'Govern', 'Govern', 'High Yield', 'LongT', 'Aggreg',
              'Govern', 'Govern', 'Govern', 'Govern', 'Aggreg', 'Aggreg', 'Govern', 'Aggreg', 'High Yield', 'Short-Med T',
              'Govern', 'Aggreg', 'High Yield', 'Govern', 'LongT', 'LongT', 'LongT', 'Govern', 'LongT', 'LongT',
              'Govern', 'Govern', 'Govern', 'LongT', 'Corp', 'Corp', 'Short-Med T', 'LongT', 'LongT', 'LongT',
              'Govern', 'Govern', 'Govern', 'Short-Med T', 'Short-Med T', 'Short-Med T', 'Short-Med T', 'Short-Med T', 'LongT', 'Govern',
              'Govern', 'Govern', 'Short-Med T', 'Short-Med T', 'Govern', 'High Yield', 'LongT', 'Govern', 'Govern', 'Corp',
              'Inf Linked', 'Inf Linked', 'Govern', 'Short-Med T', 'Aggreg', 'Short-Med T', 'Inf Linked')

    typeId <- c(1, 1, 1, 1, 5, 1, 1, 2, 4, 5,
                1, 1, 1, 1, 5, 5, 1, 5, 2, 3,
                1, 5, 2, 1, 4, 4, 4, 1, 4, 4,
                1, 1, 1, 4, 7, 7, 3, 4, 4, 4,
                1, 1, 1, 3, 3, 3, 3, 3, 4, 1,
                1, 1, 3, 3, 1, 2, 4, 1, 1, 7,
                6, 6, 1, 3, 5, 3, 6)

    # Download prices and build the matrix of daily close-to-close returns
    datas <- getSymbols(tickers, from = "2011-01-01", to = "2016-01-01")
    CloseReturns <- do.call(merge, lapply(datas, function(x) dailyReturn(Cl(get(x)))))
    CloseReturns[is.na(CloseReturns)] <- 0
    correlation <- cor(CloseReturns)

    # Colors: one per asset type, selected through typeId
    colors <- rainbow(length(unique(type)))
    names(colors) <- unique(type)

    # PCA
    dev.new()
    pca_etf <- princomp(1 - correlation)$scores[, 1:2]
    plot(pca_etf, type = "n")
    text(pca_etf, labels = type, col = colors[typeId])
    title("PCA")

    # Isomap
    iso <- Isomap(1 - correlation, dims = 2, k = 2, plotResiduals = TRUE)
    plot(iso$dim2, type = "n")
    text(iso$dim2, labels = type, col = colors[typeId])
    title("ISOMAP")

    # TSNE
    tsneM <- tsne(correlation, perplexity = 7, max_iter = 2000)
    plot(tsneM, type = "n")
    text(tsneM, labels = type, col = colors[typeId])
    title("TSNE")
I use PCA, ISOMAP and t-SNE to reduce the data to 2 dimensions. Is any of these algorithms able to form groups in the data without knowing the type tags? These are the 3 resulting plots:
In this case t-SNE doesn’t perform as well as in the previous example. PCA orders the data better with respect to the type tags, perhaps because it is defined so that the first two principal components have the largest possible variance, and that is what we are looking for.
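One way to probe the variance argument is to check how much of the total variance the first two principal components actually capture. The sketch below is only an illustration under stated assumptions: it uses simulated returns for six hypothetical assets in two correlated groups rather than the ETF data, and it uses base R's `prcomp` (SVD-based PCA) instead of `princomp`. All variable names are made up for this example.

```r
# Simulated stand-in for the ETF returns (assumption: not the article's data).
# Six hypothetical assets, three driven by factor A and three by factor B.
set.seed(42)
n_days <- 250
common_a <- rnorm(n_days)
common_b <- rnorm(n_days)
noise <- function() rnorm(n_days, sd = 0.3)

returns <- cbind(
  a1 = common_a + noise(),
  a2 = common_a + noise(),
  a3 = common_a + noise(),
  b1 = common_b + noise(),
  b2 = common_b + noise(),
  b3 = common_b + noise()
)
correlation <- cor(returns)

# PCA on 1 - correlation, mirroring the input used in the article
pca <- prcomp(1 - correlation)

# Share of total variance captured by each principal component
var_share <- pca$sdev^2 / sum(pca$sdev^2)
top_two_share <- sum(var_share[1:2])

print(round(var_share, 3))
print(top_two_share)
```

If `top_two_share` is close to 1, a 2-dimensional PCA plot loses little information, which would be consistent with PCA separating the groups well. Running the same check on the real correlation matrix would show whether that holds for the ETFs.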