Awesome data visualizations are one of my favorite things in data science. I’m constantly searching for new ways to represent data and relationships. I am a true believer that “actions speak louder than words” and “a picture is worth a thousand words”. I maintain this philosophy in data science, research and analytics in general.
A good analysis of a dataset requires a thorough understanding of the data itself and a clear, concise explanation. For both, data visualization is key. A sensible choice of how to represent the data can simplify and quicken understanding enormously.
A simple example
Recently whilst carrying out a research study, I needed to analyze and observe relationships between categorical variables with multiple classes. I was interested in the distribution of the values among classes and how they intersect.
Specifically, I was analyzing the positions of a portfolio in terms of asset class (risk family), region and sector. I wanted to understand the weights within each variable but also how they related to the other variables.
I disregarded circle charts almost instantly. Pie charts, donut charts and bubble charts are notoriously difficult to read. Our brain isn’t trained to compare circular areas. There are certainly a lot of people with very strong opinions on them.
The humble bar chart
So to visualize risk family and region together I began by creating bar charts of one variable conditioned on classes of the other:

But did you notice there are Fixed Income positions in both regions? Possibly not. It wasn’t clear due to the color labels. So I tried inverting the order of the variables:

Now the Fixed Income category is clearer but we have lost the comparison of Americas and Global. So I tried stacking the variables instead:

Not bad, but I still don’t know which region has more weight. So another stacking alternative:

That’s it. Much better!!
But I still wasn’t happy.
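For anyone who wants to reproduce the bar chart variants above, here is a minimal sketch. The positions and weights are made up for illustration (they are not the portfolio from my study), and the helper names `cross_tab`, `swap` and `plot_bars` are my own. It assumes matplotlib is installed if you want to draw the charts.

```python
from collections import defaultdict

# Hypothetical positions: (risk family, region, weight in %). Illustrative only.
positions = [
    ("Equity", "Americas", 30.0),
    ("Equity", "Global", 25.0),
    ("Fixed Income", "Americas", 10.0),
    ("Fixed Income", "Global", 20.0),
    ("Alternatives", "Global", 15.0),
]


def cross_tab(rows):
    """Sum weights into a nested dict: table[outer][inner] -> total weight."""
    table = defaultdict(lambda: defaultdict(float))
    for outer, inner, weight in rows:
        table[outer][inner] += weight
    return {outer: dict(inner) for outer, inner in table.items()}


def swap(rows):
    """Invert the order of the two variables (region first, then risk family)."""
    return [(inner, outer, weight) for outer, inner, weight in rows]


def plot_bars(table, stacked=False):
    """Draw one bar group (or one stacked bar) per outer class."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    outer = list(table)
    inner = sorted({name for row in table.values() for name in row})
    width = 0.8 if stacked else 0.8 / len(inner)
    bottoms = [0.0] * len(outer)
    fig, ax = plt.subplots()
    for i, name in enumerate(inner):
        heights = [table[o].get(name, 0.0) for o in outer]
        if stacked:
            ax.bar(list(range(len(outer))), heights, width, bottom=bottoms, label=name)
            bottoms = [b + h for b, h in zip(bottoms, heights)]
        else:
            xs = [x + i * width for x in range(len(outer))]
            ax.bar(xs, heights, width, label=name)
    # Center the tick under each group (grouped) or each bar (stacked).
    offset = 0.0 if stacked else width * (len(inner) - 1) / 2
    ax.set_xticks([x + offset for x in range(len(outer))])
    ax.set_xticklabels(outer)
    ax.set_ylabel("Weight (%)")
    ax.legend()
    return fig


by_family = cross_tab(positions)        # risk family on the x-axis, region as color
by_region = cross_tab(swap(positions))  # region on the x-axis, risk family as color
```

Calling `plot_bars(by_family)` gives the grouped version, `plot_bars(by_region)` the inverted one, and passing `stacked=True` to either gives the two stacked alternatives.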
The awesome Sankey
Bar charts are boring. The words “awesome data visualizations” don’t come to mind when I think of or look at bar charts. So I began the Google search to find something awesome that also tells the data’s story.
After a while, I came across the Sankey diagram, sometimes also called an alluvial diagram. The Sankey wasn’t an obvious choice, since it is typically used to show flows or paths in a network, or how the proportions of related variables change over time. But it is also often used to depict relationships between categorical data.
And in my opinion this graph trumps all the previous bar charts:

This example is a very simple version of the data I usually work with. As the number of categories increases for each variable, the relationships get more complex.
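Under the hood, a two-variable Sankey is just a list of weighted links between the classes of one variable and the classes of the other. The sketch below shows one way to build that list. The positions are hypothetical and `sankey_links` is a helper name of my own, not part of any library.

```python
from collections import defaultdict

# Hypothetical positions: (risk family, region, weight in %). Illustrative only.
positions = [
    ("Equity", "Americas", 30.0),
    ("Equity", "Global", 25.0),
    ("Fixed Income", "Americas", 10.0),
    ("Fixed Income", "Global", 20.0),
    ("Alternatives", "Global", 15.0),
]


def sankey_links(rows):
    """Return (labels, sources, targets, values) for a two-column Sankey.

    Left-hand classes get node indices 0..n-1 and right-hand classes follow,
    so a label that appears on both sides still gets two distinct nodes.
    """
    flows = defaultdict(float)
    for left, right, weight in rows:
        flows[(left, right)] += weight  # merge positions sharing the same pair

    left_labels = sorted({left for left, _ in flows})
    right_labels = sorted({right for _, right in flows})
    left_index = {name: i for i, name in enumerate(left_labels)}
    right_index = {name: i + len(left_labels) for i, name in enumerate(right_labels)}

    sources, targets, values = [], [], []
    for (left, right), weight in sorted(flows.items()):
        sources.append(left_index[left])
        targets.append(right_index[right])
        values.append(weight)
    return left_labels + right_labels, sources, targets, values


labels, sources, targets, values = sankey_links(positions)
```

Each link's value sets the width of its band, and the links touching a node add up to that class's total weight, which is why neither variable has to be treated as the "conditioning" one.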
A bit more complex…
Let’s look at the following example that compares the region and sector weights of another portfolio:

There are many more flow lines, and they are thinner and harder to quantify. Although the exact proportions are not clear, the Sankey still gives great intuition into the portfolio’s exposure. We can deduce many things from this chart almost instantly. Here are just five insights:
- Europe is the main region with around half the weight.
- The main sectors are Consumer Discretionary and Energy.
- Almost all of the investment in North America is in the Energy sector.
- The Asia Pacific investment is mainly Industrials.
- All IT sector investment is in Emerging Markets.
We might have deduced all this information from a bar chart (or two) but it wouldn’t have been as easy or as “pretty”. The main advantage of the Sankey is that both variables have equal importance. We don’t have to choose which one is conditional on which.
But, of course, the Sankey shouldn’t be used mindlessly. Make sure the resulting chart is meaningful rather than just looking impressive.
As for implementation in Python, both matplotlib and plotly have built-in Sankey options available. However, I found this alternative package very useful. It is easy to use and the graphs look good.
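As a rough sketch of one built-in route, assuming plotly is installed: `plotly.graph_objects.Sankey` takes a list of node labels plus index-based `source`, `target` and `value` lists for the links. The flows below are invented for illustration, and `sankey_spec` and `build_figure` are hypothetical helper names, not part of plotly.

```python
# Hypothetical region -> sector flows (weights in %), for illustration only.
labels = ["Europe", "North America", "Energy", "Industrials"]
links = [
    ("Europe", "Energy", 20.0),
    ("Europe", "Industrials", 30.0),
    ("North America", "Energy", 50.0),
]


def sankey_spec(labels, links):
    """Translate named links into the index-based node/link dicts plotly expects."""
    index = {name: i for i, name in enumerate(labels)}
    return {
        "node": {"label": labels, "pad": 15, "thickness": 20},
        "link": {
            "source": [index[src] for src, _, _ in links],
            "target": [index[dst] for _, dst, _ in links],
            "value": [weight for _, _, weight in links],
        },
    }


def build_figure(spec, title="Region to sector weights"):
    """Render the spec with plotly (imported lazily, so the spec builds without it)."""
    import plotly.graph_objects as go

    fig = go.Figure(go.Sankey(**spec))
    fig.update_layout(title_text=title, font_size=12)
    return fig


spec = sankey_spec(labels, links)
```

`build_figure(spec).show()` then opens the interactive chart, where hovering over a band shows its value.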
If you’re an avid Quantdare reader, you’ll know I’m also a fan of another awesome data visualization: the sunburst chart. Use it to represent tree structures (for example, fund classification) in an understandable and attractive way.