Data Science Talk: September 2015

Sunday, September 27, 2015

On The Data-Visualization Revolution

I recently came across an article in Scientific American titled "The Data-Visualization Revolution" by Hidalgo and Almossawi (you can read the full article here). The article proposes that our new abilities to visualize data are as historically significant as Galileo's first Jupiter observations by telescope.

We definitely live in a time where data is more pervasive and increasingly used to justify decisions. However, I think that we still have a long way to go in ensuring that we not only use data to make our decisions, but that a dataset's nuances and implications are well understood before we make a decision. Making a decision that is data informed is about more than simply calculating means, fitting trend lines, and observing past time series. Data visualization offers us a unique opportunity to go beyond simple summary statistics and play with the underlying data, dynamically reducing the number of dimensions on display or perceiving trends that could not be perceived through numerical displays.

Hidalgo and Almossawi state that the increasing amount of publicly available data, in combination with our ability to dynamically visualize this data, is a true game changer. As a data practitioner, I agree, and I believe that the intersection of these two factors will push us closer to decision making that is truly driven by a thorough understanding of complex datasets.

Thursday, September 24, 2015

Voter Turnout: How has Presidential Voter Turnout Worldwide Changed Over Time?

My visualization of presidential election voter turnout by country over time raised some interesting questions (you can check it out here).

First off, why use d3 to visualize this data? As data practitioners, we often have many tools at our disposal. d3 really shines, though, when we want to use its interactivity to explore high-dimensional data.

I started out wanting to know more about presidential voter turnout -- how had it changed over time worldwide? Were there outlier countries that bucked the worldwide trend? Could I look at the time series for each individual country and see the participation rise and fall in response to the country's own historical events?

Ideally, I'd be able to answer these questions using a tool that allows me to generate plots quickly. My two favourite libraries are Python's matplotlib and R's ggplot2. After cleaning the data (you can check out an IPython notebook with the whole process here), I was able to generate this plot.

Bleh! This plot should help us answer our questions, since it graphs a voter turnout time series with each line representing a different country. We even have some nice colours that help us distinguish between the lines. It is virtually illegible, though -- our data simply has too many dimensions, and the visual attribute we have used to represent each time series does not allow the human eye to distinguish effectively between them. We are using colour differences to tell the lines apart, but with this many overlapping series there is simply too much noise for this to be a useful mapping. Furthermore, we can't tell which of these lines belongs to which country, and thus can't answer questions about the countries' individual events. We could add labels to each of the lines, or play around with the line weights a little more, but the result would be visually noisy and unappealing.

What we really need is a better mapping of our data features to visual attributes -- enter d3.js. I went ahead and took the opacity of each line down to 0.2 -- this gives us some line outlines and allows us to see the general shape of the data, without being too noisy. Then, I added an interactive feature that lets users hover over or tap a line to raise its opacity to 1 and bring up a tooltip showing which country's time series they have selected. We are no longer using just colours to tell our lines apart -- we have added shading and a textual element.
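For readers more at home in Python than in d3, the opacity mapping described above can be sketched roughly as follows. This is only an illustration, not the d3 code behind the visualization: the function name `line_opacities` and the list of series names are assumptions made up for the example.

```python
BASE_OPACITY = 0.2       # faded lines that only hint at the overall shape of the data
HIGHLIGHT_OPACITY = 1.0  # the single line the user is hovering over or tapping


def line_opacities(countries, selected=None):
    """Map each country name to the opacity its line should be drawn with."""
    return {
        name: HIGHLIGHT_OPACITY if name == selected else BASE_OPACITY
        for name in countries
    }


# Hypothetical list of series names, for illustration only.
series = ["Panama", "France", "Brazil"]
styles = line_opacities(series, selected="Panama")
print(styles)
```

In a static matplotlib plot, each of these values could be passed as the `alpha` argument when drawing the corresponding line; the hover and tooltip behaviour is the part that needs an interactive tool such as d3.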

We can see our selected time series (Panama), the world average time series, and the shape of the overall dataset in light blue behind our two time series.

The less busy composition of the d3 visualization also allows us to leave the world average time series continuously at an opacity of 1, facilitating comparisons between the world trend and the individual country's trend. And there we have it, an actually useful visualization! By representing our data with a slightly different set of visual attributes and making use of d3's interactive components, we've created an easy way for users to get far more information than they could from the simple line graph created in matplotlib.
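As a rough idea of how a world average series like this could be computed, here is a minimal Python sketch. It assumes the world average is the unweighted mean of turnout across the countries that have data for a given year, which may not match the calculation used in the notebook; the function name `world_average` and the input layout are also assumptions, and the numbers are made-up placeholders rather than real turnout figures.

```python
from collections import defaultdict
from statistics import mean


def world_average(turnout_by_country):
    """Average turnout per year over the countries that have data for that year."""
    values_by_year = defaultdict(list)
    for series in turnout_by_country.values():
        for year, turnout in series.items():
            values_by_year[year].append(turnout)
    # Sort by year so the result can be plotted directly as a time series.
    return {year: mean(values) for year, values in sorted(values_by_year.items())}


# Made-up placeholder numbers, NOT real turnout figures.
example = {
    "Country A": {1990: 60.0, 1994: 70.0},
    "Country B": {1990: 80.0, 1994: 50.0, 1998: 40.0},
}
average = world_average(example)
print(average)
```

Because countries hold elections in different years, each year's average is taken only over the countries that actually report a value for it.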

Git repo with d3 and python files used to generate these visualizations.

Sunday, September 20, 2015

Gender and the Olympics: How has female participation in the Olympic Games changed over time?

Now that Machine Learning Month has ended, I'm going to use this space to discuss some projects that I have undertaken recently, and to comment on data-oriented tools and visualizations that catch my eye.

I found an intriguing dataset earlier this week -- a time series of Olympic medals available at each of the games. I set to exploring the data, and found some interesting patterns. Check out my IPython notebook here -- I go over the data cleaning process (I used Python's pandas and numpy libraries, together with matplotlib to create exploratory visualizations).

After some data cleaning, I made this plot, which showed some intriguing trends. First, we could clearly see where World War I and World War II had prevented the Olympic Games from happening -- thus resulting in no medals being offered for two distinct time intervals.

Additionally, the number of medals available to women did not represent a consistent proportion of the total medals available. While in some years we see the medal count spike (hey there, 1920) or dip (1932), the proportion of medals available to women seems to move independently of these larger trends. Most surprisingly, the number of medals available to women is still significantly smaller than the number of medals available to men in 2008.
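To make the idea of "proportion of medals available to women" concrete, here is a minimal Python sketch of the calculation. It is not taken from the notebook: the function name `womens_share`, the input layout, and the choice to ignore mixed events are assumptions, and the counts are made-up placeholders rather than real medal figures.

```python
def womens_share(medals_by_games):
    """Return {games: share of medals open to women}, skipping games with no medals."""
    shares = {}
    for games, counts in sorted(medals_by_games.items()):
        total = counts["women"] + counts["men"]
        if total == 0:
            # No medals on offer, e.g. games that were cancelled.
            continue
        shares[games] = counts["women"] / total
    return shares


# Made-up placeholder counts, NOT real medal figures.
example = {
    "games_1": {"women": 10, "men": 30},
    "games_2": {"women": 0, "men": 0},
    "games_3": {"women": 50, "men": 50},
}
shares = womens_share(example)
print(shares)
```

Plotting this share next to the raw counts makes it easier to see whether a spike or dip in total medals also changed the balance between women and men.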

I was interested in seeing whether some of the increases in the number of medals available to women had happened due to certain political concerns -- had certain games been more controversial than others for their inclusion of women in particular sports? This then led to this d3 visualization, where I labeled certain watershed moments for women's sports in the Olympics.

Surprisingly, I did not find any particular year that marked a turning point or watershed moment for female participation in the Olympics. I'd be interested to know if readers of this blog are familiar with other key dates and historical moments that I may not have included in my visualization. It appears that the inclusion of women in the Olympics has been a slow and steady project, with additional sports being included at each competition.

Check out the code that I used to create my final visualization with d3 here.