In recent articles we discussed PCA and ISOMAP as techniques for **dimensionality reduction**. This time we focus on t-SNE, particularly for visualizing and understanding multidimensional datasets in a low-dimensional space, where the human eye can find patterns easily.

t-SNE was developed in 2008 by Laurens van der Maaten and Geoffrey Hinton. It comprises **two main stages**.

- First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects have a high probability of being picked, whilst dissimilar points have an infinitesimal probability of being picked.
- Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points on the map.
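The two stages above can be written down explicitly; these are the standard formulas from the original van der Maaten and Hinton paper, with $x_i$ the high-dimensional points and $y_i$ their locations on the map:

```latex
% Stage 1: high-dimensional similarities (Gaussian kernel, symmetrized over n points)
p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}
               {\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Stage 2: low-dimensional similarities (Student-t with one degree of freedom)
q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}
              {\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}

% Objective: minimize the Kullback–Leibler divergence over the map points y_i
KL(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The bandwidths $\sigma_i$ are chosen per point so that the conditional distributions match a user-supplied perplexity, which is the parameter we will tune later in the R code.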

In this Google talk, Laurens van der Maaten explains **how the algorithm works** and compares it with PCA and ISOMAP. He gives a clear example where he tries to group images of handwritten digits, like in the photo on the right:

Each colour in the picture below represents one of the digits from 0 to 9. With PCA and ISOMAP you can see some groups, like the orange (number 1) or the red (number 0), some clearer than others, but with t-SNE the separation is remarkable. It is important to realize that the algorithm only sees the images of the digits; the colours are added afterwards to validate the result.

So I wanted to apply this to finance. I have 67 ETFs, all North American Fixed Income in dollars, and I want to plot the ETFs by the correlation between them. I compute the correlations over a common 5-year period, which gives a dataset of 67 observations by 67 features, spanning 7 different Fixed Income asset types.

```r
library(quantmod)
library(RDRToolbox)
library(tsne)

tickers <- c("IEF", "SHY", "TLT", "TFI", "AGG", "TIP", "MUB", "HYG", "GBF", "CSJ", #10
             "TLH", "IEI", "INY", "PZA", "AGZ", "CIU", "GVI", "MBB", "PHB", "BSV",
             "EDV", "IPE", "JNK", "CXA", "LWC", "TLO", "VGIT", "VGSH", "VMBS", "CLY",
             "CMF", "NYF", "SUB", "BAB", "VCIT", "VCSH", "CPI", "US13.PA", "US10.PA", "US57.PA",
             "US1.PA", "US3.PA", "US7.PA", "SMB", "SMMU", "STIP", "TUZ", "CSBGU7.MI", "IDTM.L", "XUT3.L",
             "XUTD.L", "XUIT.L", "ITPS.MI", "IBTS.MI", "HYD", "HYLD", "MUNI", "ITM", "MLN", "CORP",
             "STPZ", "LTPZ", "ZROZ", "UDN", "CRED", "MINT", "SCHP")
type <- c('Govern','Govern','Govern', 'Govern', 'Aggreg', 'Govern', 'Govern', 'High Yield', 'LongT', 'Aggreg',
          'Govern', 'Govern', 'Govern', 'Govern', 'Aggreg', 'Aggreg', 'Govern', 'Aggreg', 'High Yield', 'Short-Med T',
          'Govern', 'Aggreg', 'High Yield', 'Govern', 'LongT', 'LongT', 'LongT', 'Govern', 'LongT', 'LongT',
          'Govern', 'Govern', 'Govern', 'LongT', 'Corp', 'Corp', 'Short-Med T', 'LongT', 'LongT', 'LongT',
          'Govern', 'Govern', 'Govern', 'Short-Med T', 'Short-Med T', 'Short-Med T', 'Short-Med T', 'Short-Med T', 'LongT', 'Govern',
          'Govern', 'Govern', 'Short-Med T', 'Short-Med T', 'Govern', 'High Yield', 'LongT', 'Govern', 'Govern', 'Corp',
          'Inf Linked', 'Inf Linked', 'Govern', 'Short-Med T', 'Aggreg', 'Short-Med T', 'Inf Linked')
typeId <- c(1,1,1,1,5,1,1,2,4,5,1,1,1,1,5,5,1,5,2,3,1,5,2,1,4,4,4,1,4,4,1,1,1,4,7,7,3,4,4,4,1,1,1,3,3,3,3,3,4,1,1,1,3,3,1,2,4,1,1,7,6,6,1,3,5,3,6)

# Download prices and compute daily returns on the close
datas <- getSymbols(tickers, from = "2011-01-01", to = "2016-01-01")
CloseReturns <- do.call(merge, lapply(datas, function(x) dailyReturn(Cl(get(x)))))
CloseReturns[is.na(CloseReturns)] <- 0
correlation <- cor(CloseReturns)

# One colour per asset type
colors <- rainbow(length(unique(type)))
names(colors) <- unique(type)

# PCA (1 - correlation acts as a dissimilarity matrix)
dev.new()
pca_scores <- princomp(1 - correlation)$scores[, 1:2]
plot(pca_scores, t = "n")
text(pca_scores, labels = type, col = colors[typeId])
title("PCA")

# ISOMAP
iso <- Isomap(1 - correlation, dims = 2, k = 2, plotResiduals = TRUE)
plot(iso$dim2, t = "n")
text(iso$dim2, labels = type, col = colors[typeId])
title("ISOMAP")

# t-SNE
tsneM <- tsne(correlation, perplexity = 7, max_iter = 2000)
plot(tsneM, t = "n")
text(tsneM, labels = type, col = colors[typeId])
title("TSNE")
```

I use PCA, ISOMAP and t-SNE for a reduction to 2 dimensions. Are any of these algorithms able to create groups in the data without knowing the type tags? I create these 3 plots:

In this case t-SNE doesn't perform as well as in the digits example. **PCA orders the data better** with respect to the type tags, perhaps because this technique is defined so that the first two principal components have the largest possible variance, and that is what we are looking for here.
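That variance property is general, not specific to the ETF dataset. As a quick illustration (a Python sketch with synthetic data, since the property holds for any PCA), the first two components of data stretched mostly along one direction capture nearly all the variance:

```python
import numpy as np

# Synthetic data: 200 points with very unequal spread per axis,
# so most variance lies along the first couple of directions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([10.0, 3.0, 1.0, 0.5, 0.1])

# PCA via SVD of the centered data matrix
Xc = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)  # variance ratio per component, sorted

# The 2-D PCA plot keeps almost all of the variance here
print(explained[:2].sum())
```

t-SNE, by contrast, optimizes local neighbourhood preservation rather than variance, which may matter less when what separates the asset types is precisely the overall spread of the correlations.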