08:00 AM
12:00 PM
Room: 106

To get the most out of the tutorials, you will need to have the correct software installed and running. Requirements are listed in each tutorial's detailed description, but it's best to start with one of the scientific Python distributions to ensure an environment that includes most of the packages you'll need.

Data Processing with Python

Ben Zaitlen - Continuum Analytics
Clayton Davis -


Part 1

Part 2

Part 3


Ben Zaitlen
Ben received undergraduate degrees in Mathematics and Physics from UC Santa Cruz, and a Master's degree in Physics from Indiana University. He worked at the Biocomplexity Institute developing and supporting a multi-scale modeling environment for developmental biology. Mr. Zaitlen is also passionate about electronics and has developed a number of embedded and wearable hardware projects.

Clayton Davis


This tutorial is a crash course in data processing and analysis with Python. We will explore a wide variety of domains and data types (text, time-series, log files, etc.) and demonstrate how Python and a number of accompanying modules can be used for effective scientific expression. Starting with NumPy and Pandas, we will load, manage, clean, and explore real-world data right off the instrument. We will then continue with scikit-learn, focusing on a common dimensionality-reduction technique: PCA.
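As a taste of the PCA portion, the sketch below reduces a synthetic, correlated dataset to two components using NumPy's SVD; the dataset and numbers here are illustrative only (the tutorial works with real data and scikit-learn's `PCA` class, which wraps the same centering-and-projection idea).

```python
import numpy as np

# Hypothetical stand-in data: 100 samples of 5 correlated features,
# built from 2 underlying factors plus a little noise.
rng = np.random.RandomState(0)
base = rng.normal(size=(100, 2))
X = base @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(100, 5))

# PCA via SVD: center the data, then project onto the top singular vectors.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)   # fraction of variance per component
X2 = Xc @ Vt[:2].T                # coordinates in the first two components

print(X2.shape)                   # (100, 2)
print(explained[:2].sum())        # near 1.0 -- the data is nearly rank-2
```

With scikit-learn, `sklearn.decomposition.PCA(n_components=2).fit_transform(X)` produces the same projection up to sign.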

In the second half of the course, we will turn to Python for big data analysis and cover two common distributed solutions: IPython Parallel and MapReduce. We will develop several routines commonly used for parallel calculation and analysis. Using Disco -- a Python MapReduce framework -- we will introduce the concept of MapReduce and build up several scripts that can process a variety of public data sets. Attendees will also learn how to launch and manage their own clusters using AWS and StarCluster.
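The core MapReduce idea can be sketched in plain Python before any cluster is involved: a map step emits key/value pairs, and a reduce step combines values by key. The word-count example below is a local illustration only; Disco expresses the same two functions through its job API and runs them across machines.

```python
from collections import defaultdict
from itertools import chain

def map_words(line):
    # Map step: emit a (word, 1) pair for every word in one line of input.
    for word in line.split():
        yield word.lower(), 1

def reduce_counts(pairs):
    # Reduce step: sum the counts grouped by word.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["the quick brown fox", "the lazy dog", "The fox"]
counts = reduce_counts(chain.from_iterable(map_words(l) for l in lines))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

Because map emits independent pairs and reduce only needs pairs grouped by key, both steps parallelize naturally across nodes, which is exactly the structure Disco and Hadoop exploit.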


* Setup/Install Check (15)
* NumPy/Pandas (30)
  * Series
  * DataFrame
  * Missing Data
  * Resampling
  * Plotting
* PCA (15)
  * NumPy
  * scikit-learn
  * Parallel-Coordinates
* MapReduce (30)
  * Intro
  * Disco
  * Hadoop
  * Count Words
* EC2 and StarCluster (15)
* IPython Parallel (30)
* Bitly Links Example (30)
* Wiki Log Analysis (30)

An extra 45 minutes is reserved for questions, pitfalls, and a break.

Each student will have access to a 3-node EC2 cluster where they will modify and execute examples. Each cluster will have Anaconda, IPython Notebook, Disco, and Hadoop preconfigured.

Required Packages

All examples in this tutorial will use real data. Attendees are expected to have some familiarity with statistical methods and common NumPy routines. Attendees should come with the latest version of Anaconda pre-installed on their laptop and a working SSH client.


Preliminary work can be found at: