Scientific Computing with Python
Austin, Texas • July 6-12


Monday 8 a.m.–noon

Birds in Random Kaggle Forests

Matt Wescott


This tutorial is a hands-on introduction to competing on Kaggle. Together we'll recreate my team's 2nd-place model from a bird-classification contest. We'll explore the data in an IPython web notebook and discuss how to approach the learning problem. Then we'll implement one of those approaches using Pandas and Scikit-Learn. At the end, we'll submit our predictions to Kaggle and see how we did.



Duration: ~30 min

  • What are our backgrounds and interests?
  • Overview of the contest, bird calls and the dataset
  • Distribute URLs for the IPython notebooks using a Google document

Explore the Data

Duration: ~30 min

  1. Look at the labels
  2. Look at spectrograms
  3. Look at summary statistics
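
As a rough illustration of the spectrogram step, here is a minimal sketch that computes a spectrogram with SciPy. The audio is a synthetic 2 kHz tone in noise, not contest data, and the sample rate and window sizes are assumed values chosen only for the example.

```python
import numpy as np
from scipy import signal

# Synthetic stand-in for a recording: a 2 kHz tone in light noise (assumed values).
sample_rate = 16000  # Hz, assumed for illustration
t = np.arange(0, 1.0, 1.0 / sample_rate)
rng = np.random.default_rng(0)
audio = np.sin(2 * np.pi * 2000 * t) + 0.1 * rng.standard_normal(t.size)

# Rows of `power` are frequency bins, columns are time frames.
freqs, times, power = signal.spectrogram(
    audio, fs=sample_rate, nperseg=512, noverlap=256
)

# A log scale is usually easier to inspect by eye than raw power.
log_power = 10 * np.log10(power + 1e-12)

# A simple summary statistic: the frequency bin with the most average energy.
peak_freq = freqs[power.mean(axis=1).argmax()]
print(power.shape, peak_freq)
```

In a notebook, `log_power` could be passed to an image plot to look at the spectrogram directly.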

Discuss Learning Approaches

Duration: ~30 min

  1. Intro to Multi-Instance Multi-Label Classification
  2. What classifiers might we use?
  3. What features might we use?
  4. What structure of the data can we take advantage of?
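
To make the multi-label part concrete, here is a small sketch of how per-clip species lists can be turned into a binary indicator matrix, one column per species. The species names and clips are hypothetical, and using Scikit-Learn's `MultiLabelBinarizer` is just one convenient way to do it.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical labels: each clip lists the species heard in it.
# A clip can contain several species, or none at all.
clip_species = [["wren"], ["wren", "robin"], [], ["thrush"]]

mlb = MultiLabelBinarizer()
label_matrix = mlb.fit_transform(clip_species)

# Columns follow mlb.classes_ (sorted species names);
# rows are clips, with 1 marking a species that is present.
print(list(mlb.classes_))
print(label_matrix)
```

The multi-instance part is that the labels attach to a whole clip, while the calls themselves occupy only short stretches of it.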

Implement Approaches

Duration: ~40 min

  1. Window the data
  2. Reduce the dimensionality
  3. Train a random forest
  4. Build clip-level predictions from window-level predictions
  5. Make predictions on the test set
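
The steps above can be sketched end to end on synthetic data. This is not the contest model: the spectrograms are random arrays with an artificial energy band per species, PCA stands in for the dimensionality reduction, and taking the maximum window score per clip is an assumed aggregation rule. All sizes and helper names (`window_clip`, `window_all`) are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_clips, n_freq, n_frames, n_species = 24, 30, 40, 3
window_len = 10  # frames per window (assumed)

# Synthetic labels: clip i contains species j when bit j of i is set.
labels = np.array([[(i >> j) & 1 for j in range(n_species)]
                   for i in range(n_clips)])

# Synthetic spectrograms: noise plus extra energy in one band per species.
clips = rng.random((n_clips, n_freq, n_frames))
for j in range(n_species):
    clips[labels[:, j] == 1, j * 10:(j + 1) * 10, :] += 1.0

def window_clip(clip):
    """Split one (n_freq, n_frames) clip into flattened, non-overlapping windows."""
    n_windows = clip.shape[1] // window_len
    return np.stack([clip[:, k * window_len:(k + 1) * window_len].ravel()
                     for k in range(n_windows)])

def window_all(clip_array):
    """Window every clip; also return the clip index of each window."""
    windows = [window_clip(c) for c in clip_array]
    clip_index = np.repeat(np.arange(len(clip_array)), [len(w) for w in windows])
    return np.vstack(windows), clip_index

train_clips, test_clips = clips[:16], clips[16:]
train_labels = labels[:16]

# 1. Window the data; each window inherits its clip's labels.
X_train, train_idx = window_all(train_clips)
y_train = train_labels[train_idx]

# 2. Reduce the dimensionality (PCA is an assumption here).
pca = PCA(n_components=10, random_state=0).fit(X_train)

# 3. Train a random forest on the multi-label indicator targets.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(pca.transform(X_train), y_train)

# 4. Score test windows, then take the max over each clip's windows.
X_test, test_idx = window_all(test_clips)
proba_per_species = forest.predict_proba(pca.transform(X_test))
window_scores = np.column_stack([p[:, 1] for p in proba_per_species])

# 5. One row of species scores per test clip.
clip_scores = np.array([window_scores[test_idx == c].max(axis=0)
                        for c in range(len(test_clips))])
print(clip_scores.shape)
```

Taking the maximum reflects the idea that a species counts as present if any window contains its call; averaging the window scores is another option worth comparing.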


Duration: ~20 min

  1. Submit to Kaggle
  2. Discuss further improvements
  3. Share what I know about the performance of possible improvements