# SciPy 2012 Tutorials

The Tutorials Schedule (July 16th & 17th) is in its final stages of confirmation. There may be changes made to the schedule between now and the conference.

**Monday - July 16th**

| Time | Room 105 | Room 106 |
|---|---|---|
| 08:00 AM - 12:00 PM | *Introductory/Intermediate:* Introduction to NumPy and Matplotlib - Jones, Eric | *Advanced:* scikit-learn - Vanderplas, Jake |
| 01:00 PM - 05:00 PM | *Introductory/Intermediate:* HDF5 is for lovers - Scopatz, Anthony | *Advanced:* Advanced Matplotlib - May, Ryan |

**Tuesday - July 17th**

| Time | Room 105 | Room 106 |
|---|---|---|
| 08:00 AM - 12:00 PM | *Introductory/Intermediate:* Efficient Parallel Python for High-Performance Computing - Smith, Kurt | *Advanced:* Time Series Data Analysis with pandas - McKinney, Wes |
| 01:00 PM - 05:00 PM | *Introductory/Intermediate:* IPython in-depth: Interactive Tools for Scientific Computing - Perez, Fernando | *Advanced:* statsmodels - Seabold, Skipper |

## Introduction to NumPy and Matplotlib - *Eric Jones*

### Bio

Eric has a broad background in engineering and software development and leads Enthought's product engineering and software design. Prior to co-founding Enthought, Eric worked with numerical electromagnetics and genetic optimization in the Department of Electrical Engineering at Duke University. He has taught numerous courses on the use of Python for scientific computing and serves as a member of the Python Software Foundation. He holds M.S. and Ph.D. degrees in electrical engineering from Duke University and a B.S.E. in mechanical engineering from Baylor University.

### Description

NumPy is the most fundamental package for scientific computing with Python. It adds to the Python language a data structure (the NumPy array) that has access to a large library of mathematical functions and operations, providing a powerful framework for fast computations in multiple dimensions. NumPy is the basis for the SciPy packages, which vastly extend the computational and algorithmic capabilities of Python, as well as for many visualization tools such as Matplotlib, Chaco, and Mayavi.

This tutorial will teach students the fundamentals of NumPy, including fast vector-based calculations on NumPy arrays and the origin of its efficiency, along with a short introduction to the Matplotlib plotting library. In the final section, more advanced concepts will be introduced, including structured arrays, broadcasting, and memory mapping.
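
As a taste of what the first sections cover, here is a minimal sketch of a vectorized computation on a NumPy array (the sample values are illustrative, not tutorial material):

```python
import numpy as np

# Five evenly spaced sample points from 0 to 2*pi, then an
# elementwise sine -- no explicit Python loop required.
x = np.linspace(0.0, 2.0 * np.pi, 5)
y = np.sin(x)
print(y.round(3))
```

The same elementwise style applies to arithmetic, comparisons, and most NumPy mathematical functions, which is the source of the speed discussed above.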

### Outline

- NumPy: history and overview
  - History
  - Overview

- Basic plotting with Matplotlib
  - 2D plots
  - Histograms
  - Scatter plots
  - Displaying images

- Fast computations with NumPy arrays
  - Creating NumPy arrays
  - Computations with NumPy arrays
  - Types and shapes of NumPy arrays
  - Built-in operations on a NumPy array
  - Slicing and indexing
  - From data files to arrays and back

- Advanced concepts
  - The underlying data structure
  - Broadcasting
  - Structured arrays
  - Memory mapped arrays
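
The broadcasting rules mentioned in the outline can be sketched in a few lines (the shapes here are chosen purely for illustration):

```python
import numpy as np

row = np.arange(3)                 # shape (3,)
col = np.arange(4).reshape(4, 1)   # shape (4, 1)

# Broadcasting stretches both operands to a common (4, 3) shape,
# building a small table of values without any explicit loops.
table = col * 10 + row
print(table)
```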

### Required Packages

This tutorial requires Python 2.6+ or 3.1+, NumPy 1.6.1+, IPython 0.11+, and Matplotlib 1.0+ to be installed on your laptop. All of these packages are available in various one-click installers, including EPDFree.

In addition:

- Download and unpack the tutorial files.

[introduction_numpy_matplotlib.zip | 5.7MB] - To test whether your installation is working, follow the instructions on page 7 of the manual. The speed-of-light folder is inside the class folder, at student/demo/speed_of_light/.

## scikit-learn - *Jake Vanderplas*

### Bio

Jake Vanderplas is an NSF postdoctoral research fellow, working jointly between the Astronomy and Computer Science departments at the University of Washington, and is interested in topics at the intersection of large-scale machine learning and wide-field astronomical surveys. He is co-author of the book "Statistics, Data Mining, and Machine Learning in Astronomy", which will be published by Princeton University Press later this year. In the Python world, Jake is the author of AstroML, and a maintainer of scikit-learn and SciPy. He gives regular talks and tutorials at various Python conferences, and occasionally blogs his thoughts and his code at Pythonic Perambulations: http://jakevdp.github.com.

### Description

Machine learning has been getting a lot of buzz lately, and many software libraries have been created that implement these routines. scikit-learn is a Python package built on NumPy and SciPy that implements a wide variety of machine learning algorithms, useful for everything from facial recognition to optical character recognition to automated classification of astronomical images. This tutorial will begin with a crash course in machine learning and introduce participants to several of the most common learning techniques for classification, regression, and visualization. Building on this background, we will explore several applications of these techniques to scientific data -- in particular, galaxy, star, and quasar data from the Sloan Digital Sky Survey -- and learn some basic astrophysics along the way. From these examples, tutorial participants will gain the knowledge and experience needed to successfully solve a variety of machine learning and statistical data mining problems with Python.
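
As a hedged sketch of the kind of classification exercise listed below, here is a simple classifier fit to the iris dataset, written against the modern scikit-learn API (several import paths have moved since 2012):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the bundled iris measurements and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a Gaussian Naive Bayes classifier and score it on unseen flowers.
clf = GaussianNB().fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(f"held-out accuracy: {score:.2f}")
```

The same fit/predict/score interface carries over to the regression and dimensionality-reduction estimators covered in the later exercises.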

### Outline

- 0:00 - 0:20 Setup - making sure all participants have the necessary packages and datasets for tutorials (datasets will be posted on my web page)
- 0:20 - 1:20 Machine Learning 101: introduction to Supervised learning (Regression/Classification) and Unsupervised learning (clustering/dimensionality reduction) and the scikit-learn interface for these tools.
- 1:20 - 1:40 Working through a classification exercise using the iris dataset
- 1:40 - 2:20 Practical advice for machine learning (including examples) bias, variance, over-fitting, under-fitting, and cross-validation.
- 2:20 - 2:50 Classification exercise: star-quasar classification with Naive Bayes (example done together) and Gaussian Mixture Models (exercise done independently)
- 2:50 - 3:20 Regression exercise: photometric redshifts with decision trees and random forests (example done together). Applying cross-validation and learning curves method to choose optimal model complexity (exercise done independently)
- 3:20 - 3:50 Dimensionality reduction exercise: reducing the dimensionality of high-dimensional spectral data with PCA (example done together) and with manifold learning methods (exercise done independently)
- 3:50 - 4:00 Conclusion and review

This follows the general outline of the online tutorial I've prepared for scikit-learn (see link above). For the purposes of SciPy 2012, I plan to convert most of this material into IPython notebooks for interactive instruction, though an updated version of the web page will be available as well.

### Packages Required

numpy, scipy, scikit-learn (bleeding-edge not necessary; I'll make everything compatible with the Ubuntu distro versions)

ipython, *including the recently released ipython notebook*. Participants should be able to run "ipython notebook" at the command line and see the ipython dashboard in their web browser (version 2+ is fine).

Note that an EPDfree installation contains all the necessary dependencies, with the exception of scikit-learn.


## HDF5 is for lovers - *Anthony Scopatz*

### Bio

Anthony Scopatz is a computational nuclear engineer/physicist and post-doctoral scholar at the FLASH Center at the University of Chicago. His initial workshop teaching experience came from instructing bootcamps for The Hacker Within, a peer-led teaching organization at the University of Wisconsin. Out of this grew a collaboration teaching Software Carpentry bootcamps in partnership with Greg Wilson. During his tenure at Enthought, Inc., Anthony taught many week-long courses (approx. one per month) on scientific computing in Python.

### Description

HDF5 is a hierarchical, binary database format that has become a *de facto* standard for scientific computing. While the specification may be used in a relatively simple way (persistence of static arrays) it also supports several high-level features that prove invaluable. These include chunking, ragged data, extensible data, parallel I/O, compression, complex selection, and in-core calculations. Moreover, HDF5 bindings exist for almost every language - including two Python libraries (PyTables and h5py).

This tutorial will discuss tools, strategies, and hacks for really squeezing every ounce of performance out of HDF5 in new or existing projects. It will also go over fundamental limitations in the specification and provide creative and subtle strategies for getting around them. Overall, this tutorial will show how HDF5 plays nicely with all parts of an application making the code and data both faster and smaller. With such powerful features at the developer's disposal, what is not to love?!

This tutorial is targeted at a more advanced audience which has a prior knowledge of Python and NumPy. Knowledge of C or C++ and basic HDF5 is recommended but not required.
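
As an illustrative sketch of several of the features named above (chunking, compression, extensible data), here is a tiny example written with h5py, one of the two Python bindings mentioned; treat the exact calls as an assumption for illustration, since the tutorial itself leans on PyTables:

```python
import numpy as np
import h5py

# Create a chunked, gzip-compressed dataset that starts empty and can
# grow without bound along its first axis.
with h5py.File("demo.h5", "w") as f:
    dset = f.create_dataset(
        "measurements",
        shape=(0,),
        maxshape=(None,),   # extensible: resize at will
        chunks=(1024,),     # I/O happens in 1024-element chunks
        compression="gzip",
    )
    new_data = np.arange(100.0)
    dset.resize((len(new_data),))
    dset[:] = new_data

# Read it back; decompression is transparent to the caller.
with h5py.File("demo.h5", "r") as f:
    data = f["measurements"][:]
print(data[:5])
```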

### Outline

- Meaning in layout (20 min)
  - Tips for choosing your hierarchy

- Advanced datatypes (20 min)
  - Tables
  - Nested types
  - Tricks with malloc() and byte-counting

- Exercise on the above topics (20 min)

- Chunking (20 min)
  - How it works
  - How to properly select your chunksize

- Queries and Selections (20 min)
  - In-core vs. out-of-core calculations
  - PyTables.where()
  - Datasets vs. Dataspaces

- Exercise on the above topics (20 min)

- The Starving CPU Problem (1 hr)
  - Why you should always use compression
  - Compression algorithms available
  - Choosing the correct one
  - Exercise

- Integration with other databases (1 hr)
  - Migrating to/from SQL
  - HDF5 in other databases (JSON example)
  - Other databases in HDF5 (JSON example)
  - Exercise

### Packages Required

This tutorial will require Python 2.7, IPython 0.12+, NumPy 1.5+, and PyTables 2.3+. ViTables (http://vitables.org/) and Matplotlib are also recommended. These may all be found in Linux package managers and are also available through EPD or easy_install; ViTables may need to be installed independently.

## Advanced Matplotlib - *Ryan May*

### Bio

Ryan May is a Software Engineer at Enterprise Electronics Corporation and a Doctoral student in the School of Meteorology at the University of Oklahoma. His primary interest in Python is for its application for data visualization and for rapid development and testing of signal processing techniques for weather radar applications. He has also been a developer for the Matplotlib project since 2008, giving an introductory Matplotlib tutorial at SciPy 2010. Among Ryan's contributions to Matplotlib are improvements to its spectral analysis routines, wind barb support (for the meteorological community), and, most recently, simplified support for creating and saving animations.

### Description

Matplotlib is one of the main plotting libraries in use within the scientific Python community. This tutorial covers advanced features of the Matplotlib library, including many recent additions: laying out axes, animation support, Basemap (for plotting on maps), and other tweaks for creating aesthetic plots. The goal of this tutorial is to expose attendees to several of the chief sub-packages within Matplotlib, helping to ensure that users make full use of the capabilities of the library. Additionally, attendees will be walked through a "grab-bag" of tweaks that help to increase the aesthetic appeal of created figures. Attendees should be familiar with creating basic plots in Matplotlib as well as basic use of NumPy for manipulating data.
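
A minimal sketch of the axes-layout topics from the outline: shared axes via `plt.subplots()` plus `tight_layout()` to fix overlapping labels, rendered off-screen with the Agg backend (the data is illustrative, not tutorial material):

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# Two stacked axes sharing an x-axis: only the bottom axes keeps its
# x tick labels, and tight_layout() resolves any overlapping text.
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.plot(x, np.sin(x))
ax2.plot(x, np.cos(x))
fig.tight_layout()

buf = io.BytesIO()
fig.savefig(buf, format="png")
png = buf.getvalue()
```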

### Outline

- Basemap (45 minutes)
  - Explain basics of Basemap with regard to projections
  - Using a Basemap instance to project lat/lon points
  - Basemap support for maps
  - Explain Basemap's wrapping of plot methods
  - Exercise: Scatter plot of point data on a map
  - Exercise: Plotting information from shapefiles

- Animations (1 hour)
  - Show basic draw() loop example and its problems
    - Can't resize
    - Can't interact
  - Solution 1: Callback-based, using only a generic timer object
    - Introduce artist animation
    - Exercise: Animate a scatter plot of random data
  - This works, but can be inefficient... FuncAnimation
    - Eliminates the need to create multiple artists
    - Show an example that sets array data
    - Then move to fully procedurally generated data
    - Exercise: Create a line plot with data from a function
  - Show saving movie files

- Axes Layouts (1 hour)
  - plt.subplots()
    - Sharing axes/labels
    - tight_layout()
    - Exercise: Create common axes with nice spacing, eliminating superfluous labels as well as problems with overlapping text
  - AxesGrid, ImageGrid, AnchoredArtist
    - Exercise: Use ImageGrid to create image plots with shared colorbars and labelled subplots (labelled with text boxes for a, b, c, etc.)
    - Cover more advanced features of AxesGrid/ImageGrid
  - More complex layouts
    - Nested layouts
    - Inset axes

- Tweaks for Aesthetic Plots (45 minutes)
  - Drop shadows on boxes
  - Advanced legends with proxy artists
  - Hatching, line width, other graphics properties
  - Exercise: Take a plot and make it look much better with a few tweaks
    - Drop shadow on legend with rounded corners
    - Move from an outside-axes title to an AnchoredArtist-based text box at the top of the plot
    - Improve scatter plot legend using proxy artists

### Packages Required

Matplotlib 1.1.0 (may need to bump if another release comes out), Basemap, and NumPy. ffmpeg/mencoder can optionally be used to save animations as movie files.

Attendees need to have Matplotlib and Basemap installed. EPDFree is a good start, since matplotlib is included. Instructions for installing Basemap can be found at:

If needed, instructions for installing Matplotlib are located at:

## Efficient Parallel Python for High-Performance Computing - *Kurt Smith*

### Bio

Kurt Smith has been using Python in scientific computing for nearly ten years, and has developed tools to simplify the integration of performance-oriented languages with Python. He has contributed to the Cython project, implementing the initial version of the typed memoryviews and native Cython arrays. He uses Cython extensively in his consulting work at Enthought. He received his B.S. in physics and applied mathematics from the University of Dallas, and his Ph.D. in physics from the University of Wisconsin-Madison. His doctoral research focused on the application of fluid plasma models to astrophysical systems, involving the composition of high-performance parallel simulations of plasma turbulence. Kurt has trained hundreds of scientists, engineers, and researchers in Python, NumPy, Cython, and parallel and high-performance computing as part of Enthought's five-day scientific Python training course. He has developed course material for high-performance and parallel computing with Python, and taught the "Efficient Parallel Python for High-Performance Computing" tutorial.

### Description

This tutorial is targeted at the intermediate-to-advanced Python user who wants to extend Python into High-Performance Computing. The tutorial will provide hands-on examples and essential performance tips every developer should know for writing effective parallel Python. The result will be a clear sense of possibilities and best practices using Python in HPC environments.

Many of the examples you find on parallel Python focus on the mechanics of getting the parallel infrastructure working with your code, and not on actually building good, portable parallel Python. This tutorial is intended to be a broad introduction to writing high-performance parallel Python that is well suited to both the beginner and the veteran developer.

We will discuss best practices for building efficient high-performance Python through good software engineering. Parallel efficiency starts with the speed of the target code itself, so we will first look at how to evolve code from for-loops to list comprehensions and generator comprehensions to using Cython with NumPy. We will also discuss how to optimize your code for speed and memory performance by using profilers.
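
That evolution can be sketched on a toy problem (sum of squares; the problem and sizes are illustrative, and the Cython step is omitted since it requires a compile stage):

```python
import timeit

import numpy as np

data = list(range(10_000))
arr = np.arange(10_000, dtype=float)

def loop_sum_squares():
    # plain for-loop: every iteration pays Python interpreter overhead
    total = 0.0
    for x in data:
        total += x * x
    return total

def gen_sum_squares():
    # generator expression: same work, tighter expression
    return float(sum(x * x for x in data))

def numpy_sum_squares():
    # vectorized: the loop runs in compiled code inside NumPy
    return float(np.dot(arr, arr))

results = {f.__name__: f() for f in
           (loop_sum_squares, gen_sum_squares, numpy_sum_squares)}
for f in (loop_sum_squares, gen_sum_squares, numpy_sum_squares):
    print(f.__name__, timeit.timeit(f, number=50))
```

All three produce the same answer; timing them with a profiler (or `timeit`, as here) is the first step the tutorial recommends before reaching for parallelism.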

The tutorial will cover some of the common parallel communication technologies (multiprocessing, MPI, and cloud computing) and introduce the use of parallel map and map-reduce.

At the end of the tutorial, participants should be able to write simple parallel Python scripts, make use of effective parallel programming techniques, and have a framework in place to leverage the power of Python in High- Performance Computing.

### Packages Required

line_profiler, kernprof, runsnake, NumPy, SciPy, Cython, picloud

### Optional Packages

mpi4py

## Time Series Data Analysis with pandas - *Wes McKinney*

### Bio

Wes McKinney is the author of pandas and vbench, and a contributor to statsmodels. Prior to starting Lambda Foundry, he worked in quantitative finance at AQR Capital Management. He's interested in data analysis, visualization, high performance computing, and testing, profiling, and performance monitoring tools.

### Description

In this tutorial, I'll give a brief overview of pandas basics for new users, then dive into the nuts and bolts of manipulating time series data in memory. This includes such common topics as date arithmetic, alignment and join/merge methods, resampling and frequency conversion, time zone handling, and moving window functions like the moving mean and standard deviation. A strong focus will be placed on working with large time series efficiently using array manipulations. I'll also illustrate visualization tools for slicing and dicing time series to make informative plots. There will be several example data sets taken from finance, economics, ecology, web analytics, and other areas.
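
A small sketch of the resampling and frequency-conversion topics, written in the modern pandas spelling (the data is illustrative, and the method-chaining `resample(...).mean()` form postdates the 2012 release):

```python
import numpy as np
import pandas as pd

# One week of hourly observations...
idx = pd.date_range("2012-07-16", periods=7 * 24, freq="h")
ts = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# ...downsampled to daily means (frequency conversion + aggregation).
daily = ts.resample("D").mean()
print(daily)
```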

The target audience for the tutorial includes individuals who already work regularly with time series data and are looking to acquire additional skills and knowledge as well as users with an interest in data analysis who are new to time series. You will be expected to be comfortable with general purpose Python programming and have a modest amount of experience using NumPy. Prior experience with the basics of pandas's data structures will also be helpful.

### Outline

- IPython Notebook setup and environment check (15 min)

- pandas basics (30 min)
  - Series
  - DataFrame
  - Indexing, data selection and subsetting
  - Reading and writing files

- Date and time types, string conversion (15 minutes)

- pandas TimeSeries basics (30 minutes)
  - Indexing and selection, subsetting
  - Data alignment
  - DataFrame merges / joins
  - Reading time series data from disk

- Fixed frequencies (20 minutes)
  - Frequency aliases
  - Date range generation
  - Date offset classes
  - Shifting (leading and lagging)

- Resampling (30 minutes)
  - Downsampling / aggregation
  - Upsampling + interpolation methods

- Periods and period arithmetic (10 minutes)

- Time zone handling (10 minutes)

- Plotting and visualization (20 min)
  - Time series plots
  - Grouped plots (by year, say)
  - Scatter plots

- Moving window functions (15 minutes)

- Examples (45 minutes)
  - Stock data: implement and backtest simple strategies
  - Macroeconomic data and transformations
  - Weather / ecology or tide data
  - Munging, aggregation, visualization

### Required Packages

Python 2.7 or higher (including Python 3), pandas >= 0.8.0 and its dependencies, NumPy >= 1.6.1, matplotlib >= 1.0.0, pytz, IPython >= 0.12 and the dependencies for the HTML notebook application: pyzmq and tornado. EPDFree is a good starting point, requiring only pandas, dateutil, and pytz to be installed in addition. Optionally: PyTables

## IPython in-depth: Interactive Tools for Scientific Computing - *Fernando Perez*

### Bio

Fernando Perez received his PhD in theoretical physics from the University of Colorado and then worked on numerical algorithm development in the Applied Mathematics Dept. at the same university. He now works as a scientist at the Helen Wills Neuroscience Institute at the University of California, Berkeley, focusing on the development of new analysis methods for brain imaging problems and high-level scientific computing tools. Towards the end of his graduate studies, he became involved with the development of Python tools for scientific computing. He started the open source IPython project in 2001 when looking for a more efficient interactive workflow for everyday scientific tasks. He continues to lead the IPython project along with a growing team of talented developers. He is also a member of the core matplotlib development team, and has contributed to numpy, scipy, sympy, mayavi and other Python projects.

### Description

IPython provides tools for interactive and parallel computing that are widely used in scientific computing. We will show some uses of IPython for scientific applications, focusing on exciting recent developments, such as the network-aware kernel, web-based notebook with code, graphics, and rich HTML, and a high-level framework for interactive parallel computing.

### Outline

- Overview of IPython
  - Introductory description of the project and architecture.
  - Basics: the magic command system, shell aliases, full shell access, the history system, variable caching, object introspection tools.
  - Development workflow: combining the interpreter session with python files via the %run command.
  - Effective use of IPython at the command line for typical development tasks: timing, profiling, debugging.
  - Embedding IPython in various contexts.
  - The IPython Qt console: unique features beyond the terminal.
  - Configuring IPython: the profile and configuration system for multiple applications.

- The IPython notebook:
  - interactive usage of the application
  - the IPython display protocol
  - defining custom display methods for your own objects
  - generating HTML and PDF output

- Parallelism with IPython
  - basic architecture
  - interactive control of a cluster
  - standalone execution of applications
  - integration with MPI
  - blocking and asynchronous parallelism
  - execution in batch-controlled (PBS, SGE, etc.) environments
  - IPython engines in the cloud (illustrated with Amazon EC2 instances and StarCluster)
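
The "custom display methods" item above relies on IPython's duck-typed rich display protocol: an object only needs to define a method such as `_repr_html_`, with no IPython import required. A toy sketch (the `Fraction` class is purely illustrative):

```python
class Fraction:
    """Toy object that renders itself as rich HTML in the IPython notebook."""

    def __init__(self, num, den):
        self.num, self.den = num, den

    def _repr_html_(self):
        # IPython's display machinery looks this method up by name and
        # uses its return value instead of the plain repr().
        return f"<sup>{self.num}</sup>&frasl;<sub>{self.den}</sub>"

html = Fraction(3, 4)._repr_html_()
print(html)
```

In a notebook, simply evaluating `Fraction(3, 4)` in a cell would show the rendered HTML rather than the default repr.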

For full details about IPython, including documentation, previous presentations, and videos of talks, please see the project website.

### Required Packages

- IPython ≥ 0.13
- pyzmq, zeromq ≥ 2.1.11
- numpy
- pandas
- scipy
- matplotlib
- tornado ≥ 2.1.0 (for the notebook)
- pygments (for the QtConsole)
- PyQt or PySide (for the QtConsole)

All of these are available in EPD.

Recommended packages, used in some parallel demos:

- NetworkX
- mpi4py
- BeautifulSoup
- starcluster (for using Amazon EC2)


## statsmodels - *Skipper Seabold*

### Bio

Skipper is a PhD candidate in economics at American University in Washington, D.C. He specializes in applied econometrics, information theory and entropy econometrics, and topics in growth theory. He has been working on statistics in Python throughout his studies. He has presented earlier work on statsmodels at the SciPy Conference and has been an active mentor for new developers working on statsmodels as part of the Google Summer of Code.

### Description

This tutorial will give users an overview of the capabilities of statsmodels, including how to conduct exploratory data analysis, fit statistical models, and check that the modeling assumptions are met.

The use of Python in data analysis and statistics is growing rapidly. It is not uncommon now for researchers to conduct data cleaning steps in Python and then move to some other software to estimate statistical models. Statsmodels, however, is a Python module that attempts to bridge this gap, allowing users to estimate statistical models, perform statistical tests, and conduct data exploration in Python. Researchers in fields ranging from economics and the social sciences to finance and engineering may find that statsmodels meets their needs for statistical computing and data analysis in Python.

All examples in this tutorial will use real data. Attendees are expected to have some familiarity with statistical methods.

With this knowledge, attendees will be ready to jump in and use Python for applied statistical analysis, and will have an idea of how they can extend statsmodels for their own needs.

### Outline

- Introduction to the package structure and philosophy (15 minutes)

- Linear Models (45 minutes)
  - ANOVA and contrasts
  - Regression (OLS, WLS, GLS)

  Will cover the use of formulas in statsmodels, assessing model fit through Wald tests, robust covariance estimators, and diagnostic plotting.

- Robust Linear Models (15 minutes)
  - Fitting models. Cover existing robust norms.

- Discrete Choice Models (45 minutes)

  Models in which the response variable is not continuous. Examples will show usage of Poisson and Logit models. Both maximum likelihood estimators and generalized linear models will be employed.

- Time Series Analysis (45 minutes)
  - Filtering, ARMA, and VAR modeling

  Will cover model selection and estimation, forecasting, and impulse response functions.

- Overview of other parts of the package and ongoing development (5 minutes)

- Working on statsmodels extensions (15 minutes, time permitting)

  Shows users how to get their hands dirty by implementing a new statistical model. This example will implement the Tobit model, a model used when the dependent variable is censored.

### Packages required

statsmodels (master), numpy (>= 1.4), scipy (>= 0.7), pandas (>=0.7.1), patsy (0.1.0), matplotlib (>= 1.0.1)

patsy: http://patsy.readthedocs.org/


### Preliminary Materials

Tutorials will draw from our many examples and our documentation:

ANOVA, Linear Models, Formulas and Contrasts: