Wednesday, June 18, 2014

Week 2: Principal Component Analysis, and a discussion of Week 1: Domingos' Article

Discussion of Domingos' Article:

Hopefully everyone had a chance to read last week's article. A number of points stood out to me as I read it:

1. The problem of multiple testing. Computational tools make it possible to re-run our experiments thousands of times. With that many runs, it's statistically probable that some results will look significant purely by chance, even when no real effect exists. The article suggests 'control[ling] the fraction of falsely accepted non-null hypotheses, known as the false discovery rate,' in order to take this into account.

2. The distinction, in Section 6 (Intuition Fails in High Dimensions), between the apparent and the effective dimensionality of the sample space. Though a fixed-size data set covers an exponentially smaller fraction of the space as we add dimensions, the effective dimensionality of the data is often much lower than the number of features. We take advantage of this property through Principal Component Analysis, or through other algorithms for reducing dimensionality. I thought this section summarized the problem and potential solutions effectively.

3. The importance of non-algorithmic components in reaching a better solution. If our goal is to make predictions for a certain data set, our instinct might be to focus on model selection, evaluation, and optimization. However, the easiest way to improve our performance might just be to obtain more data.
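To make point 1 concrete, here is a minimal simulation sketch. It is my own illustration, not something from the article: it assumes 1000 experiments where no real effect exists, a simple z-test with known unit variance, and the Benjamini-Hochberg procedure as one common way to control the false discovery rate. The function names and the sample sizes are hypothetical choices for the example.

```python
import math
import numpy as np

def two_sided_p_values(samples):
    """z-test p-values for H0: mean == 0, assuming known unit variance."""
    n = samples.shape[1]
    z = samples.mean(axis=1) * math.sqrt(n)
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])

def benjamini_hochberg(p_values, q=0.05):
    """Boolean mask of hypotheses rejected at false discovery rate q."""
    m = len(p_values)
    order = np.argsort(p_values)
    # The k-th smallest p-value is compared against q * k / m.
    thresholds = q * np.arange(1, m + 1) / m
    below = p_values[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank that passes
        reject[order[:k + 1]] = True
    return reject

# Assumption: every experiment is pure noise, so any "hit" is a false positive.
rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=(1000, 30))
p = two_sided_p_values(samples)

naive_hits = int((p < 0.05).sum())
bh_hits = int(benjamini_hochberg(p, q=0.05).sum())
print("significant at p < 0.05:", naive_hits)
print("significant after Benjamini-Hochberg:", bh_hits)
```

With a 0.05 threshold, roughly 5% of the noise-only experiments should come out "significant" by chance, while the corrected procedure should flag far fewer, often none.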


This week's article:

We'll be focusing on Dimensionality Reduction this week. I've selected an IPython notebook, rather than a journal article. I looked for a journal article with a good discussion of Principal Component Analysis, but felt that the notebook's explanation was more practical than the articles I had a chance to check out. With that said, if you find an article that you feel is both practical and a good explanation of intermediate-level Principal Component Analysis, please send it my way!

This excellent tutorial was created by Sebastian Raschka and is available on GitHub, along with some other tutorials that they have created.

Dimensionality Reduction IPython Notebook
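
As a warm-up before the notebook, here is a minimal sketch of the "effective dimensions" idea from the discussion above. It is not taken from the notebook: it assumes synthetic data with 10 features that really lives near a 2-D plane, and it computes PCA through an SVD of the centered data using only NumPy. The variable names and noise level are hypothetical choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: 200 points driven by 2 hidden factors, observed in 10 features.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

# PCA via SVD: center the data, then decompose it.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Squared singular values are proportional to the variance along each component.
explained_variance = S**2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Project onto the first two principal components (rows of Vt).
X_reduced = X_centered @ Vt[:2].T

print("explained variance ratio:", np.round(explained_ratio, 3))
print("reduced shape:", X_reduced.shape)
```

Although the data has 10 apparent dimensions, the first two components should account for nearly all of the variance, which is what lets us drop the remaining eight with little loss.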


And finally, here's a poll to help guide future content for the 12-week Machine Learning Study Group series.

Week 2 Poll

2 comments:

  1. May I suggest the article "Statistical Modeling: The Two Cultures" by (the late) Leo Breiman? It's a bit lengthy, but really really useful!

    1. Absolutely! Keep an eye out for it in upcoming weeks, thank you for suggesting it.
