Thursday, June 26, 2014
Discussion of Week 2: Principal Component Analysis
I found Week 2's discussion of Principal Component Analysis quite illuminating. The notebook walks through the actual mathematical transformations that underlie the PCA libraries, detailing how the dimensionality reduction works. Carrying out each of these steps by hand every time PCA is needed would be tedious, but it was helpful to see the process broken down in the notebook's example.
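For readers who want to trace the mechanics, here is a minimal sketch of those steps in NumPy; the toy data and variable names below are my own, not the notebook's.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # toy data: 100 samples, 3 features

# 1. Center each feature (scaling is also common before PCA).
X_centered = X - X.mean(axis=0)

# 2. Compute the covariance matrix of the features.
cov = np.cov(X_centered, rowvar=False)

# 3. Eigendecompose the (symmetric) covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort components by explained variance, largest first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. Project onto the top k components to reduce dimensionality.
k = 2
X_reduced = X_centered @ eigvecs[:, :k]

print(X_reduced.shape)                  # (100, 2)
print(eigvals / eigvals.sum())          # proportion of variance per component
```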
Along those lines, the most intuitive explanation of PCA I've found to date is the one laid out in this blog post: http://www.cerebralmastication.com/2010/09/principal-component-analysis-pca-vs-ordinary-least-squares-ols-a-visual-explination/
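The gist of that post, as I read it, is that OLS minimizes vertical distances to the fitted line while the first principal component minimizes perpendicular distances, so the two lines generally differ. Here is a quick sketch of that contrast; the toy data and names are mine:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.5, size=200)

# OLS: minimizes squared vertical (y-direction) distances.
ols_slope, ols_intercept = np.polyfit(x, y, deg=1)

# PCA: the first eigenvector of the covariance matrix minimizes
# squared perpendicular (orthogonal) distances.
data = np.column_stack([x, y])
data = data - data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
pc1 = eigvecs[:, np.argmax(eigvals)]
pca_slope = pc1[1] / pc1[0]

print(f"OLS slope: {ols_slope:.3f}, first-PC slope: {pca_slope:.3f}")
```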
Week 3: Breiman's "Statistical Modeling: The Two Cultures"
This week's article is Leo Breiman's "Statistical Modeling: The Two Cultures". Published in 2001, the article argues that two diverging cultures are at work in the statistical modeling community: one presumes a stochastic data model, while the other treats the process that generates the data as unknown. Breiman contends that the statistical community's focus on the former to the exclusion of the latter has limited the development of the field, and he proposes expanding the set of tools that data modeling experts rely on.
The article is available here: http://www.uni-leipzig.de/~strimmer/lab/courses/ss09/current-topics/download/breiman2001.pdf
I look forward to reading your thoughts in the comments. As always, if you have any recommendations for future content, please let me know.
Wednesday, June 18, 2014
Week 2: Principal Component Analysis, and a discussion of Week 1: Domingos' Article
Discussion of Domingos' Article:
Hopefully everyone had a chance to read last week's article. A number of points stood out to me as I read it:
1. The problem of multiple testing. With computational tools, it becomes easy to re-run our experiments thousands of times, and it's statistically likely that some of those runs will appear significant purely by chance, even when no real effect is present. The article suggests 'control[ling] the fraction of falsely accepted non-null hypotheses, known as the false discovery rate,' to account for this; a sketch of one standard procedure for doing so appears after this list.
2. The distinction, in Section 6 ('Intuition Fails in High Dimensions'), between the apparent and the effective dimensionality of the sample space. Although the volume of the space grows exponentially as we add dimensions, leaving the data increasingly sparse, the effective dimensionality of a data set is often much lower. Principal Component Analysis and other dimensionality reduction algorithms take advantage of this property (see the second sketch after this list). I thought this section summarized the problem and the potential solutions effectively.
3. The importance of non-algorithmic factors in reaching a better solution. If our goal is to make predictions for a given data set, our instinct might be to focus on model selection, evaluation, and optimization; however, the easiest way to improve performance may simply be to obtain more data.
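On point 1, the article names the false discovery rate but doesn't spell out a procedure. The sketch below uses the Benjamini-Hochberg procedure, one standard way to control the FDR; the choice of method and the toy p-values are mine.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    sorted_p = p[order]
    # Find the largest k with p_(k) <= (k / m) * alpha, then reject ranks 1..k.
    thresholds = (np.arange(1, m + 1) / m) * alpha
    passing = np.nonzero(sorted_p <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if passing.size:
        reject[order[:passing.max() + 1]] = True
    return reject

# Toy example: 1000 "experiments", mostly null, a few real effects.
rng = np.random.default_rng(2)
p_null = rng.uniform(size=950)              # true nulls: uniform p-values
p_real = rng.uniform(high=0.001, size=50)   # real effects: tiny p-values
p_all = np.concatenate([p_null, p_real])

print("naive p < 0.05 rejections:", int(np.sum(p_all < 0.05)))
print("BH rejections at FDR 0.05:", int(benjamini_hochberg(p_all).sum()))
```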
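And on point 2, here is a small illustration of apparent versus effective dimensionality. The data is constructed (by me, purely as a toy example) to live near a 3-dimensional subspace of a 50-dimensional space, and PCA shows that nearly all of the variance sits in three components.

```python
import numpy as np

rng = np.random.default_rng(3)
latent = rng.normal(size=(500, 3))                        # 3 real degrees of freedom
mixing = rng.normal(size=(3, 50))                         # embed them in 50 features
X = latent @ mixing + 0.01 * rng.normal(size=(500, 50))   # plus a little noise

X = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending
explained = eigvals / eigvals.sum()

# The data is nominally 50-dimensional, but nearly all of the variance
# is captured by the first three principal components.
print(np.round(explained[:5], 3))
print("variance in first 3 components:", round(float(explained[:3].sum()), 4))
```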
Next week's article:
We'll be focusing on Dimensionality Reduction this week. I've selected an IPython notebook rather than a journal article: I looked for a journal-article treatment of Principal Component Analysis, but the notebook's explanation struck me as more practical than the articles I had a chance to check out. With that said, if you find an article that is both practical and a good intermediate-level explanation of Principal Component Analysis, please send it my way!
This excellent tutorial was created by Sebastian Raschka and is available on GitHub, along with several of his other tutorials.
Dimensionality Reduction IPython Notebook
And finally, here's a poll to help guide future content for the 12-week Machine Learning Study Group series.
Week 2 Poll
Sunday, June 8, 2014
Week 1: Domingos' 'A Few Useful Things to Know about Machine Learning'
Welcome to Data Science Talk! This blog is intended as an accessible place to discuss Data Science topics.
We'll start off with a focus on Machine Learning: I'll post an article each week and encourage discussion in that article's comments. The goal is to have an online venue for discussing intermediate-level Machine Learning topics. I assume that you are familiar with basic programming and math, and that you have taken at least one introductory Machine Learning or Data Science MOOC or its equivalent (such as The Analytics Edge). The aim is to keep each week's discussion accessible while beginning to explore Machine Learning topics in more depth.
This week, we'll be discussing Domingos' article 'A Few Useful Things to Know about Machine Learning', which I thought was pretty interesting. It gives a nice overview of a number of general lessons that are useful to keep in mind when solving machine learning problems. Here is a link:
http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
I'll go ahead and post some thoughts about the article next week, along with the new article.