SciPy 2012 Tutorials

The Tutorials Schedule (July 16th & 17th) is in its final stages of confirmation. There may be changes made to the schedule between now and the conference.

Monday - July 16th

  • 08:00 AM - 12:00 PM
    • Room 105 (Introductory/Intermediate): Introduction to NumPy and Matplotlib - Eric Jones
    • Room 106 (Advanced): scikit-learn - Jake Vanderplas
  • 01:00 PM - 05:00 PM
    • Room 105 (Introductory/Intermediate): HDF5 is for lovers - Anthony Scopatz
    • Room 106 (Advanced): Advanced Matplotlib - Ryan May

Tuesday - July 17th

  • 08:00 AM - 12:00 PM
    • Room 105 (Introductory/Intermediate): Efficient Parallel Python for High-Performance Computing - Kurt Smith
    • Room 106 (Advanced): Time Series Data Analysis with pandas - Wes McKinney
  • 01:00 PM - 05:00 PM
    • Room 105 (Introductory/Intermediate): IPython in-depth: Interactive Tools for Scientific Computing - Fernando Perez
    • Room 106 (Advanced): statsmodels - Skipper Seabold

Introduction to NumPy and Matplotlib - Eric Jones

Bio

Eric has a broad background in engineering and software development and leads Enthought's product engineering and software design. Prior to co-founding Enthought, Eric worked with numerical electromagnetics and genetic optimization in the Department of Electrical Engineering at Duke University. He has taught numerous courses on the use of Python for scientific computing and serves as a member of the Python Software Foundation. He holds M.S. and Ph.D. degrees from Duke University in electrical engineering and a B.S.E. in mechanical engineering from Baylor University.

Description

NumPy is the fundamental package for scientific computing with Python. It adds to the Python language a data structure (the NumPy array) backed by a large library of mathematical functions and operations, providing a powerful framework for fast computation in multiple dimensions. NumPy is the basis for the SciPy packages, which vastly extend Python's computational and algorithmic capabilities, and for many visualization tools such as Matplotlib, Chaco, and Mayavi.

This tutorial will teach students the fundamentals of NumPy, including fast vector-based calculations on NumPy arrays and the origins of their efficiency, along with a short introduction to the Matplotlib plotting library. The final section will introduce more advanced concepts, including structured arrays, broadcasting, and memory mapping.
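
To give a flavor of the material, here is a minimal sketch (not taken from the tutorial files) of the kind of vectorized computation and broadcasting the tutorial covers:

    import numpy as np
    import matplotlib.pyplot as plt

    # Vectorized arithmetic: one expression operates on the whole array
    # at C speed, replacing an explicit Python loop.
    x = np.linspace(0.0, 2.0 * np.pi, 1000)
    y = np.sin(x) ** 2

    # Broadcasting: a (3, 1) column combines with a (4,) row to give a
    # (3, 4) result without copying data.
    col = np.arange(3).reshape(3, 1)
    row = np.arange(4)
    table = col * 10 + row

    # A first Matplotlib plot.
    plt.plot(x, y)
    plt.show()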

Outline

Required Packages

This tutorial requires Python 2.6+ or 3.1+, NumPy 1.6.1+, IPython 0.11+, and Matplotlib 1.0+ to be installed on your laptop. All of these packages are available in various one-click installers, including EPDFree.

In addition:

  1. Download and unpack the tutorial files.
    [ introduction_numpy_matplotlib.zip | 5.7MB ]
  2. To test whether your installation is working, follow the instructions on page 7 of the manual. The speed-of-light example folder is located at student/demo/speed_of_light/ inside the class folder.

scikit-learn - Jake Vanderplas

Bio

Jake Vanderplas is an NSF postdoctoral research fellow, working jointly between the Astronomy and Computer Science departments at the University of Washington, and is interested in topics at the intersection of large-scale machine learning and wide-field astronomical surveys. He is co-author of the book "Statistics, Data Mining, and Machine Learning in Astronomy", which will be published by Princeton University Press later this year. In the Python world, Jake is the author of AstroML, and a maintainer of scikit-learn and SciPy. He gives regular talks and tutorials at various Python conferences, and occasionally blogs his thoughts and his code at Pythonic Perambulations: http://jakevdp.github.com.

Description

Machine learning has been getting a lot of buzz lately, and many software libraries have been created that implement these routines. scikit-learn is a Python package built on NumPy and SciPy that implements a wide variety of machine learning algorithms, useful for everything from facial recognition to optical character recognition to automated classification of astronomical images. This tutorial will begin with a crash course in machine learning and introduce participants to several of the most common learning techniques for classification, regression, and visualization. Building on this background, we will explore several applications of these techniques to scientific data -- in particular, galaxy, star, and quasar data from the Sloan Digital Sky Survey -- and learn some basic astrophysics along the way. From these examples, tutorial participants will gain the knowledge and experience needed to successfully solve a variety of machine learning and statistical data mining problems with Python.
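
For a flavor of the library (a minimal sketch, not drawn from the tutorial materials), every scikit-learn estimator follows the same fit/predict pattern:

    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier

    # A small example dataset: 150 samples, 4 features, 3 classes.
    iris = load_iris()
    X, y = iris.data, iris.target

    # All scikit-learn estimators share the fit/predict interface.
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(X, y)
    print(clf.predict(X[:5]))   # predicted class labels
    print(clf.score(X, y))      # mean accuracy on the training data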

Outline

This follows the general outline of the online tutorial I've prepared for scikit-learn (see link above). For the purposes of SciPy 2012, I plan to convert most of this material into IPython notebooks for interactive instruction, though an updated version of the web page will be available as well.

Packages Required

numpy, scipy, scikit-learn (bleeding-edge not necessary; I'll make everything compatible with the Ubuntu distro versions)

ipython, *including the recently released IPython notebook*. Participants should be able to run "ipython notebook" at the command line and see the IPython dashboard in their web browser (version 2+ is fine).

Note that an EPDFree installation contains all the necessary dependencies, with the exception of scikit-learn.

HDF5 is for lovers - Anthony Scopatz

Bio

Anthony Scopatz is a computational nuclear engineer / physicist and post-doctoral scholar at the FLASH Center at the University of Chicago. His initial workshop teaching experience came from instructing bootcamps for The Hacker Within, a peer-led teaching organization at the University of Wisconsin. Out of this grew a collaboration teaching Software Carpentry bootcamps in partnership with Greg Wilson. During his tenure at Enthought, Inc., Anthony taught many week-long courses (approx. one per month) on scientific computing in Python.

Description

HDF5 is a hierarchical, binary database format that has become a *de facto* standard for scientific computing. While the specification may be used in a relatively simple way (persistence of static arrays), it also supports several high-level features that prove invaluable. These include chunking, ragged data, extensible data, parallel I/O, compression, complex selection, and in-core calculations. Moreover, HDF5 bindings exist for almost every language, including two Python libraries (PyTables and h5py).

This tutorial will discuss tools, strategies, and hacks for really squeezing every ounce of performance out of HDF5 in new or existing projects. It will also go over fundamental limitations in the specification and provide creative and subtle strategies for getting around them. Overall, this tutorial will show how HDF5 plays nicely with all parts of an application making the code and data both faster and smaller. With such powerful features at the developer's disposal, what is not to love?!
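
As a small illustration of two of those features, here is a sketch of a chunked, compressed, extensible dataset using h5py (the tutorial itself works mainly with PyTables, but the concepts are the same):

    import numpy as np
    import h5py

    # Create an extensible, chunked, gzip-compressed 1-D dataset.
    f = h5py.File('example.h5', 'w')
    dset = f.create_dataset('readings', shape=(0,), maxshape=(None,),
                            chunks=(1024,), compression='gzip', dtype='f8')

    # Append data by resizing; only the affected chunks are rewritten.
    new = np.random.random(500)
    dset.resize((dset.shape[0] + new.shape[0],))
    dset[-new.shape[0]:] = new
    f.close()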

This tutorial is targeted at a more advanced audience with prior knowledge of Python and NumPy. Knowledge of C or C++ and basic HDF5 is recommended but not required.

Outline

Packages Required

This tutorial will require Python 2.7, IPython 0.12+, NumPy 1.5+, and PyTables 2.3+. ViTables (http://vitables.org/) and Matplotlib are also recommended. These may all be found in Linux package managers; they are also available through EPD or easy_install, though ViTables may need to be installed independently.

Advanced Matplotlib - Ryan May

Bio

Ryan May is a Software Engineer at Enterprise Electronics Corporation and a doctoral student in the School of Meteorology at the University of Oklahoma. His primary interest in Python is its application to data visualization and to the rapid development and testing of signal processing techniques for weather radar applications. He has also been a developer for the Matplotlib project since 2008, giving an introductory Matplotlib tutorial at SciPy 2010. Among Ryan's contributions to Matplotlib are improvements to its spectral analysis routines, wind barb support (for the meteorological community), and, most recently, simplified support for creating and saving animations.

Description

Matplotlib is one of the main plotting libraries in use within the scientific Python community. This tutorial covers advanced features of the Matplotlib library, including many recent additions: laying out axes, animation support, Basemap (for plotting on maps), and other tweaks for creating aesthetic plots. The goal of this tutorial is to expose attendees to several of the chief sub-packages within Matplotlib, helping to ensure that users make full use of the library's capabilities. Additionally, attendees will work through a "grab-bag" of tweaks that help to increase the aesthetic appeal of created figures. Attendees should be familiar with creating basic plots in Matplotlib as well as with basic use of NumPy for manipulating data.
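
As one example of the axes-layout material (a minimal sketch, not taken from the tutorial itself), GridSpec gives fine-grained control over subplot placement:

    import matplotlib.pyplot as plt
    from matplotlib.gridspec import GridSpec

    fig = plt.figure()
    gs = GridSpec(2, 2)

    # One axes spanning the full top row, two smaller ones below it.
    ax_top = fig.add_subplot(gs[0, :])
    ax_left = fig.add_subplot(gs[1, 0])
    ax_right = fig.add_subplot(gs[1, 1])

    ax_top.set_title('spans both columns')
    plt.show()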

Outline

Packages Required

Matplotlib 1.1.0 (may need updating if another release comes out), Basemap, NumPy. ffmpeg or mencoder can optionally be used to save animations as movie files.

Attendees need to have Matplotlib and Basemap installed. EPDFree is a good start, since Matplotlib is included. Instructions for installing Basemap and, if needed, Matplotlib are available on the respective project websites.

Efficient Parallel Python for High-Performance Computing - Kurt Smith

Bio

Kurt Smith has been using Python in scientific computing for nearly ten years and has developed tools to simplify the integration of performance-oriented languages with Python. He has contributed to the Cython project, implementing the initial version of the typed memoryviews and native Cython arrays. He uses Cython extensively in his consulting work at Enthought. He received his B.S. in physics and applied mathematics from the University of Dallas and his Ph.D. in physics from the University of Wisconsin-Madison. His doctoral research focused on the application of fluid plasma models to astrophysical systems and involved the composition of high-performance parallel simulations of plasma turbulence. Kurt has trained hundreds of scientists, engineers, and researchers in Python, NumPy, Cython, and parallel and high-performance computing as part of Enthought's five-day scientific Python training course, and has developed course material for high-performance and parallel computing with Python, including this "Efficient Parallel Python for High-Performance Computing" tutorial.

Description

This tutorial is targeted at the intermediate-to-advanced Python user who wants to extend Python into High-Performance Computing. The tutorial will provide hands-on examples and essential performance tips every developer should know for writing effective parallel Python. The result will be a clear sense of possibilities and best practices using Python in HPC environments.

Many of the examples one finds on parallel Python focus on the mechanics of getting the parallel infrastructure working with your code, not on actually building good, portable parallel programs. This tutorial is intended as a broad introduction to writing high-performance parallel Python and is well suited to both the beginner and the veteran developer.

We will discuss best practices for building efficient high-performance Python through good software engineering. Parallel efficiency starts with the speed of the target code itself, so we will first look at how to evolve code from for-loops to list comprehensions and generator comprehensions to using Cython with NumPy. We will also discuss how to optimize your code for speed and memory performance by using profilers.
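
For instance, a single sum-of-squares computation (a toy example, not from the course material) can evolve like this:

    import numpy as np

    data = range(1000000)

    # 1. Explicit for-loop: clear, but slow in pure Python.
    total = 0
    for x in data:
        total += x * x

    # 2. Generator expression: the same work with less overhead
    #    and no temporary list.
    total = sum(x * x for x in data)

    # 3. NumPy: the loop runs in compiled code over a typed array.
    arr = np.arange(1000000, dtype=np.float64)
    total = np.dot(arr, arr)

    # 4. The next step in the progression is moving the hot loop
    #    into Cython with typed memoryviews.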

The tutorial will cover some of the common parallel communication technologies (multiprocessing, MPI, and cloud computing) and introduce the use of parallel map and map-reduce.
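
A parallel map, for example, takes only a few lines with the standard-library multiprocessing module (a minimal sketch with a stand-in work function):

    from multiprocessing import Pool

    def work(x):
        # Stand-in for an expensive, independent computation.
        return x * x

    if __name__ == '__main__':
        pool = Pool(processes=4)
        # map distributes the inputs across the worker processes.
        results = pool.map(work, range(100))
        pool.close()
        pool.join()
        print(results[:5])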

At the end of the tutorial, participants should be able to write simple parallel Python scripts, make use of effective parallel programming techniques, and have a framework in place to leverage the power of Python in High-Performance Computing.

Packages Required

line_profiler, kernprof, runsnake, NumPy, SciPy, Cython, picloud

Optional Packages

mpi4py

Time Series Data Analysis with pandas - Wes McKinney

Bio

Wes McKinney is the author of pandas and vbench, and a contributor to statsmodels. Prior to starting Lambda Foundry, he worked in quantitative finance at AQR Capital Management. He's interested in data analysis, visualization, high performance computing, and testing, profiling, and performance monitoring tools.

Description

In this tutorial, I'll give a brief overview of pandas basics for new users, then dive into the nuts and bolts of manipulating time series data in memory. This includes such common topics as date arithmetic, alignment and join / merge methods, resampling and frequency conversion, time zone handling, and moving window functions like the moving mean and standard deviation. A strong focus will be placed on working with large time series efficiently using array manipulations. I'll also illustrate visualization tools for slicing and dicing time series to make informative plots. There will be several example data sets taken from finance, economics, ecology, web analytics, and other areas.
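
A few lines give the flavor (a sketch assuming the pandas 0.8-era API; later versions spell some of these operations differently):

    import numpy as np
    import pandas as pd

    # One year of daily observations.
    index = pd.date_range('2012-01-01', periods=366, freq='D')
    ts = pd.Series(np.random.randn(366).cumsum(), index=index)

    # Downsample to monthly means.
    monthly = ts.resample('M', how='mean')

    # Lag by one day, and smooth with a 30-day moving average.
    lagged = ts.shift(1)
    smooth = pd.rolling_mean(ts, 30)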

The target audience for the tutorial includes individuals who already work regularly with time series data and are looking to acquire additional skills and knowledge, as well as users with an interest in data analysis who are new to time series. You will be expected to be comfortable with general-purpose Python programming and to have a modest amount of experience using NumPy. Prior experience with the basics of pandas's data structures will also be helpful.

Outline

  • IPython Notebook Setup and Environment check (15 min)
  • Pandas basics (30 min)
    • Series
    • DataFrame
    • Indexing, data selection and subsetting
    • Reading and writing files
  • Date and time types, string conversion (15 minutes)
  • Pandas TimeSeries basics (30 minutes)
    • Indexing and selection, subsetting
    • Data alignment
    • DataFrame merge / joins
    • Reading time series data from disk
  • Fixed frequencies (20 minutes)
    • Frequency aliases
    • Date range generation
    • Date offset classes
    • Shifting (leading and lagging)
  • Resampling (30 minutes)
    • Downsampling / aggregation
    • Upsampling + interpolation methods
  • Periods and Period arithmetic (10 minutes)
  • Time zone handling (10 minutes)
  • Plotting and visualization (20 min)
    • Time series plots
    • Grouped plots (by year, say)
    • Scatter plots
  • Moving window functions (15 minutes)
  • Examples (45 minutes)
    • Stock Data: implement and backtest simple strategies
    • Macroeconomic data and transformations
    • Weather / ecology or tide data
      • Munging, aggregation, visualization

Required Packages

Python 2.7 or higher (including Python 3), pandas >= 0.8.0 and its dependencies, NumPy >= 1.6.1, matplotlib >= 1.0.0, pytz, IPython >= 0.12, and the dependencies for the HTML notebook application: pyzmq and tornado. EPDFree is a good starting point, requiring only pandas, dateutil, and pytz to be installed in addition. Optionally: PyTables.

IPython in-depth: Interactive Tools for Scientific Computing - Fernando Perez

Bio

Fernando Perez received his PhD in theoretical physics from the University of Colorado and then worked on numerical algorithm development at the Applied Mathematics Dept. at the same university. He now works as a scientist at the Helen Wills Neuroscience Institute at the University of California, Berkeley, focusing on the development of new analysis methods for brain imaging problems and high-level scientific computing tools. Towards the end of his graduate studies, he became involved with the development of Python tools for scientific computing. He started the open source IPython project in 2001 while looking for a more efficient interactive workflow for everyday scientific tasks. He continues to lead the IPython project along with a growing team of talented developers. He is also a member of the core matplotlib development team, and has contributed to numpy, scipy, sympy, mayavi, and other Python projects.

Description

IPython provides tools for interactive and parallel computing that are widely used in scientific computing. We will show some uses of IPython for scientific applications, focusing on exciting recent developments, such as the network-aware kernel, web-based notebook with code, graphics, and rich HTML, and a high-level framework for interactive parallel computing.
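
For instance, the parallel framework reduces interactive cluster use to a few calls (a sketch assuming a local cluster started with "ipcluster start", using the IPython 0.13-era API):

    from IPython.parallel import Client

    # Connect to a running cluster (start one with: ipcluster start -n 4).
    rc = Client()
    dview = rc[:]          # a DirectView on all engines

    # Run a function on every input, distributed across the engines.
    results = dview.map_sync(lambda x: x ** 2, range(32))
    print(results[:5])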

Outline

  • Overview of IPython
    • Introductory description of the project and architecture.
    • Basics: the magic command system, shell aliases, full shell access, the history system, variable caching, object introspection tools.
    • Development workflow: combining the interpreter session with python files via the %run command.
    • Effective use of IPython at the command-line for typical development tasks: timing, profiling, debugging.
    • Embedding IPython in various contexts.
    • The IPython Qt console: unique features beyond the terminal.
    • Configuring IPython: the profile and configuration system for multiple applications.
  • The IPython notebook:
    • interactive usage of the application
    • the IPython display protocol
    • defining custom display methods for your own objects
    • generating HTML and PDF output.
  • Parallelism with IPython
    • basic architecture
    • interactive control of a cluster
    • standalone execution of applications
    • integration with MPI
    • blocking and asynchronous parallelism
    • execution in batch-controlled (PBS, SGE, etc.) environments
    • IPython engines in the cloud (illustrated with Amazon EC2 instances and starcluster).

For full details about IPython, including documentation, previous presentations, and videos of talks, please see the project website.

Required Packages

  • IPython ≥ 0.13
  • pyzmq, zeromq ≥ 2.1.11
  • numpy
  • pandas
  • scipy
  • matplotlib
  • tornado ≥ 2.1.0 (for the notebook)
  • pygments (for the QtConsole)
  • PyQt or PySide (for the QtConsole)

All of these are available in EPD.

Recommended packages, used in some parallel demos:

  • NetworkX
  • mpi4py
  • BeautifulSoup
  • starcluster (for using Amazon EC2)


statsmodels - Skipper Seabold

Bio

Skipper is a PhD candidate in economics at American University in Washington, D.C. He specializes in applied econometrics, information theory and entropy econometrics, and topics in growth theory. He has been working on statistics in Python throughout his studies. He has presented earlier work on statsmodels at the SciPy Conference and has been an active mentor for new developers working on statsmodels as part of the Google Summer of Code.

Description

This tutorial will give users an overview of the capabilities of statsmodels, including how to conduct exploratory data analysis, fit statistical models, and check that the modeling assumptions are met.

The use of Python in data analysis and statistics is growing rapidly. It is not uncommon now for researchers to conduct data-cleaning steps in Python and then move to some other software to estimate statistical models. Statsmodels, however, is a Python module that attempts to bridge this gap, allowing users to estimate statistical models, perform statistical tests, and conduct data exploration in Python. Researchers across fields, from economics and the social sciences to finance and engineering, may find that statsmodels meets their needs for statistical computing and data analysis in Python.

All examples in this tutorial will use real data. Attendees are expected to have some familiarity with statistical methods.

With this knowledge, attendees will be ready to jump in and use Python for applied statistical analysis and will have an idea of how they can extend statsmodels for their own needs.
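
As a taste of the workflow (a minimal sketch using the array interface, with simulated data rather than the tutorial's real data sets), an ordinary least squares fit takes a handful of lines:

    import numpy as np
    import statsmodels.api as sm

    # Simulated data: y depends linearly on x plus noise.
    x = np.random.random(100)
    y = 2.0 + 3.0 * x + np.random.randn(100)

    # Add an intercept column and fit by OLS.
    X = sm.add_constant(x)
    results = sm.OLS(y, X).fit()
    print(results.params)     # estimated coefficients
    print(results.summary())  # full regression table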

Outline

  • Introduction to the package structure and philosophy ( 15 minutes )
  • Linear Models ( 45 minutes )
    • ANOVA and contrasts
    • Regression (OLS, WLS, GLS)
Will cover the use of formulas in statsmodels, assessing model fit through Wald tests, robust covariance estimators, and diagnostic plotting.
  • Robust Linear Models ( 15 minutes )
    • Fitting models. Cover existing robust norms.
  • Discrete Choice Models ( 45 minutes )
Models in which the response variable is not continuous. Examples will show usage of Poisson and Logit models (see the sketch after this outline). Both maximum likelihood estimators and generalized linear models will be employed.
  • Time Series Analysis ( 45 minutes )
    • Filtering, ARMA, and VAR modeling
Will cover model selection and estimation, as well as forecasting and impulse response functions.
  • Overview of other parts of the package and ongoing development ( 5 minutes )
  • Working on statsmodels extensions ( 15 minutes, time-permitting )
Show users how to get their hands dirty by implementing a new statistical model. This example will implement the Tobit model, a model used when the dependent variable is censored.
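
For the discrete choice segment above, a Logit fit follows the same pattern (a sketch with simulated data, not the tutorial's example):

    import numpy as np
    import statsmodels.api as sm

    # Simulated binary response driven by a single regressor.
    x = np.random.randn(200)
    X = sm.add_constant(x)
    p = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x)))
    y = (np.random.random(200) < p).astype(float)

    results = sm.Logit(y, X).fit()
    print(results.params)     # estimated coefficients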

Packages required

statsmodels (master), numpy (>= 1.4), scipy (>= 0.7), pandas (>=0.7.1), patsy (0.1.0), matplotlib (>= 1.0.1)

patsy: http://patsy.readthedocs.org/

Preliminary Materials

Tutorials will draw from our many examples and our documentation:

ANOVA, Linear Models, Formulas and Contrasts: