I was recently inspired by a Planet Money podcast that discussed the Skyscraper Bubble Theory (full podcast and transcript here). In short, the theory posits that there is a correlation between the number of skyscrapers being built and the bursting of economic bubbles. Several factors encourage the building of skyscrapers, notably increased demand for office space and low interest rates. Both of these are strongest right before a bubble bursts, so a surge in skyscraper construction heralds the onset of an economic downturn.
I thought I would go ahead and graph the data myself to see if my analysis would support the hypothesis. A simple correlation graph shouldn't take too long, and it would be interesting to examine the results myself, since plenty of articles cited the theory, though few provided actual graphical evidence for it. The first dataset that I explored described tall building completion rates over time. If we look at the number of buildings completed per year that measure more than 200m, we are presented with this graph, courtesy of The Skyscraper Center:
Apologies for the dual y-axis. We are presented with the first problem in analyzing this dataset: completing a skyscraper is a multi-year process, with the average time to completion ranging from 2 to 5 years. So does the theory posit that skyscrapers start being built during a boom, or that they finish being built right before the bust? Either way, given that the time between bubbles bursting can be less than a decade, this makes it harder to draw any definite conclusions.
If we look at the grey bars, which represent the number of skyscrapers completed in each year, we also see a strong increasing trend: as time goes on, more and more tall buildings are being built. If we wish to see whether additional skyscrapers are built before economic depressions, we would need to identify divergences from that long-run trend, not just count completions. This makes our analysis significantly more difficult, as we are attempting to find cycles within an increasing time series -- a difficult task given our limited dataset.
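To make that concrete, here is a minimal sketch of what such a detrending step could look like. The yearly counts below are hypothetical stand-ins rather than the real Skyscraper Center data: fit a long-run linear trend to completions per year, then look for years that sit well above it.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for completions of 200 m+ buildings per year
# (the real series would come from The Skyscraper Center data).
years = np.arange(1960, 2015)
completions = pd.Series(
    np.random.poisson(lam=np.linspace(2, 40, len(years))),
    index=years,
    name="completions",
)

# Fit a simple linear trend and subtract it.
slope, intercept = np.polyfit(years, completions.values, deg=1)
trend = slope * years + intercept
residuals = completions - trend

# Years that sit well above the long-run trend are the candidate "signal"
# we would want to compare against the timing of crashes.
spikes = residuals[residuals > residuals.std()]
print(spikes)
```

Even with a step like this, the multi-year construction times and the short gaps between bubbles would make any alignment between spikes and crashes hard to interpret.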
I poked around some more to see if I could find visual representations of data that supported the Skyscraper hypothesis. One of the most convincing graphs that I found was this one:
Plot by Eric Ross, full blog article available here. This graph illustrates 'super-tall buildings' (not just the tallest ever built), as well as economic crashes. However, we can clearly see some inconsistencies. First, the selection criteria are unclear: as the first chart showed us, dozens of buildings taller than 200 m have been built between 2000 and the present day, so is the graph only showing the top three or so tallest buildings? Secondly, GDP and building height over time are both increasing time series -- which means, of course, that our data sets are not independent and identically distributed, and we cannot look at simple correlation (for a great discussion of this with some excellent visualizations, check out this article).
In short, it's possible that there is a correlation between the two, but we cannot find it by simply drawing a line of correlation between the two unprocessed time series. We'd have to use a more complex model than a simple linear regression, such as a vector autoregression. Mizrach and Mundra undertook exactly this kind of analysis and did not find that changes in building height predicted changes in GDP, though changes in GDP did seem to predict changes in height (check out the full article here).
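For readers curious what that looks like in practice, here is a minimal sketch using made-up annual series for building height and GDP rather than Mizrach and Mundra's actual data: difference both series to remove the trend, then fit a small vector autoregression with statsmodels and run a Granger-style causality test.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
years = pd.RangeIndex(1950, 2015)
data = pd.DataFrame({
    "height": np.cumsum(rng.normal(5, 20, len(years))),  # trending series
    "gdp": np.cumsum(rng.normal(3, 1, len(years))),       # trending series
}, index=years)

# First-difference both series so we work with (roughly) stationary data.
diffed = data.diff().dropna()

model = VAR(diffed)
results = model.fit(maxlags=2, ic="aic")
print(results.summary())

# Does lagged height help predict GDP growth? (Granger-type test)
print(results.test_causality("gdp", ["height"], kind="f").summary())
```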
Wednesday, November 11, 2015
Sunday, September 27, 2015
On The Data-Visualization Revolution
I recently came across an article in Scientific American titled "The Data-Visualization Revolution" by Hidalgo and Almossawi (you can read the full article here). The article proposes that our new abilities to visualize data are as historically significant as Galileo's first telescopic observations of Jupiter.
We definitely live in a time where data is more pervasive and increasingly used to justify decisions. However, I think that we still have a long way to go in ensuring not only that we use data to make our decisions, but that a dataset's nuances and implications are well understood before we act on it. Making a data-informed decision is about more than simply calculating means, fitting trend lines, and observing past time series. Data visualization offers us a unique opportunity to go beyond simple summary statistics and play with the underlying data, dynamically reducing the number of dimensions available to us or perceiving trends that could not be perceived through numerical displays.
Hidalgo and Almossawi state that the increasing amount of publicly available data, in combination with our ability to dynamically visualize this data, is a true game changer. As a data practitioner, I am in agreement and believe that the intersection of these two factors will push us closer to decision making that is truly driven by a thorough understanding of complex datasets.
Thursday, September 24, 2015
Voter Turnout: How has Presidential Voter Turnout Worldwide Changed Over Time?
My visualization of presidential election voter turnout by country over time raised some interesting questions (you can check it out here).
First off, why use d3 to visualize this data? As data practitioners, we often have a multiplicity of tools at our disposal. d3 really shines, though, when we want to use its interactivity to explore highly dimensional data.
I started out wanting to know more about presidential voter turnout -- how had it changed over time worldwide? Were there outlier countries that bucked the worldwide trend? Could I look at the time series for each individual country and see the participation rise and fall in response to the country's own historical events?
Ideally, I'd be able to answer these questions using a tool that allows me to generate plots quickly. My two favourite libraries are Python's matplotlib and R's ggplot2. After cleaning the data (you can check out an IPython notebook with the whole process here), I was able to generate this plot.
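For reference, a rough sketch of that first matplotlib attempt is below. The file and column names ('country', 'year', 'turnout') are my assumptions here, not necessarily those used in the actual notebook.

```python
import pandas as pd
import matplotlib.pyplot as plt

turnout = pd.read_csv("presidential_turnout.csv")  # hypothetical file name

# One column per country, one row per election year.
wide = turnout.pivot_table(index="year", columns="country", values="turnout")

ax = wide.plot(legend=False, figsize=(10, 6))  # one coloured line per country
ax.set_xlabel("Year")
ax.set_ylabel("Voter turnout (%)")
ax.set_title("Presidential voter turnout by country")
plt.show()
```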
Bleh! In principle this plot should help us answer our questions: it graphs a voter turnout time series, with each line representing a different country, and we even have some nice colours to help distinguish between the lines. In practice, though, it is virtually illegible -- our data simply has too many dimensions, and the visual attribute we chose to represent each time series does not allow the human eye to effectively distinguish between the individual series. We are relying on colour differences to tell the lines apart, but given the nature of the dataset, there is simply too much noise for this to be a useful mapping. Furthermore, we can't tell which of these lines belongs to which country, and thus can't answer questions about individual countries' events. We could add labels to each of the lines, or play around with the line weights a little more, but the result would still be visually noisy and unappealing.
What we really need is a better mapping of our data features to visual attributes -- enter d3.js. I went ahead and took the opacity of each line down to 0.2 -- this gives us some line outlines and allows us to see the general shape of the data without being too noisy. Then I added an interactive feature that lets users hover over or tap a line to raise its opacity to 1 and present a tooltip showing which country's time series they have selected. We are now not just using colours to tell our lines apart -- we have added shading and a textual element.
We can see our selected time series (Panama), the world average time series, and the shape of the overall dataset in light blue behind our two time series.
The less busy composition of the d3 visualization also allows us to leave the world average time series at an opacity of 1 at all times, facilitating comparisons between the world trend and an individual country's trend. And there we have it, an actually useful visualization! With a slightly different set of visual attributes -- ones that take advantage of d3's interactive components -- we've created an easy way for users to get far more information than they could from the simple line graph created in matplotlib.
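The interactivity itself needs d3, but the underlying idea of dimming the crowd and highlighting a single series can be sketched statically in matplotlib as well. The sketch below reuses the assumed column names from above and highlights Panama (the country shown in the figure) alongside a simple cross-country mean, which may differ from the world average series used in the actual visualization.

```python
import pandas as pd
import matplotlib.pyplot as plt

turnout = pd.read_csv("presidential_turnout.csv")  # hypothetical file name
wide = turnout.pivot_table(index="year", columns="country", values="turnout")

fig, ax = plt.subplots(figsize=(10, 6))

# Background: every country drawn faintly, to show the shape of the dataset.
for country in wide.columns:
    ax.plot(wide.index, wide[country], color="steelblue", alpha=0.2, lw=1)

# Foreground: one highlighted country plus a cross-country mean,
# both at full opacity (a static stand-in for the d3 hover behaviour).
ax.plot(wide.index, wide["Panama"], color="darkorange", lw=2, label="Panama")
ax.plot(wide.index, wide.mean(axis=1), color="black", lw=2, label="World average")

ax.set_xlabel("Year")
ax.set_ylabel("Voter turnout (%)")
ax.legend()
plt.show()
```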
Git repo with d3 and python files used to generate these visualizations.
Sunday, September 20, 2015
Gender and the Olympics: How has female participation in the Olympic Games changed over time?
Now that Machine Learning Month has ended, I'm going to use this space to discuss some projects that I have undertaken recently, as well as to comment on data-oriented tools and visualizations that catch my eye.
I found an intriguing dataset earlier this week -- a time series of the Olympic medals available at each of the games. I set about exploring the data and found some interesting patterns. Check out my IPython notebook here -- I go over the data cleaning process (I used Python's pandas and numpy libraries, together with matplotlib, to create exploratory visualizations).
After some data cleaning, I made this plot:
which showed some intriguing trends. First, we could clearly see where World War I and World War II had prevented the Olympic games from happening -- thus resulting in no medals being offered for two distinct time intervals.
Additionally, the number of medals available to women did not represent a consistent proportion of the total medals available. While in some years we see the medal count spike (hey there, 1920) or dip (1932), the proportion of medals available to women seems to move independently of these larger trends. Most surprisingly, the number of medals available to women is still significantly less than the number of medals available to men in 2008.
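As a sketch of how that proportion could be computed, assuming a cleaned medals table with 'Year' and 'Gender' columns (my assumed names and labels, not necessarily those used in the notebook):

```python
import pandas as pd
import matplotlib.pyplot as plt

medals = pd.read_csv("olympic_medals.csv")  # hypothetical file name

# Count medals per year and gender, then compute the female share.
counts = medals.groupby(["Year", "Gender"]).size().unstack(fill_value=0)
female_share = counts["Women"] / counts.sum(axis=1)

ax = female_share.plot(marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Share of medals available to women")
plt.show()
```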
I was interested in seeing whether some of the increases in the number of medals available to women had happened due to certain political concerns -- had certain games been more controversial than others for their inclusion of women in particular sports? This then led to this d3 visualization, where I labeled certain watershed moments for women's sports in the Olympics.
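The actual labels live in the d3 code, but a static matplotlib stand-in for annotating key years might look like the following; the two events listed are well-known examples, not the full set of moments marked in the visualization.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Same hypothetical file and computation as in the sketch above.
medals = pd.read_csv("olympic_medals.csv")
counts = medals.groupby(["Year", "Gender"]).size().unstack(fill_value=0)
female_share = counts["Women"] / counts.sum(axis=1)

# Example watershed moments; the d3 version labels more of them.
events = {1900: "Women first compete", 1984: "Women's marathon added"}

ax = female_share.plot(marker="o")
for year, label in events.items():
    if year in female_share.index:
        ax.annotate(
            label,
            xy=(year, female_share.loc[year]),
            xytext=(year, female_share.loc[year] + 0.05),
            arrowprops=dict(arrowstyle="->"),
        )
ax.set_xlabel("Year")
ax.set_ylabel("Share of medals available to women")
plt.show()
```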
Surprisingly, I did not find any particular year that marked a turning point or watershed moment for female participation in the Olympics. I'd be interested to know if readers of this blog are familiar with other key dates and historical moments that I may not have included in my visualization. It appears that the inclusion of women in the Olympics has been a slow and steady project, with additional sports being included at each competition.
Check out the code that I used to create my final visualization with d3 here.