Tutorials Schedule:

The Tutorials Schedule (July 16th & 17th) is in its final stages of confirmation. The schedule may still change between now and the conference.

Tuesday - July 17th

08:00 AM - 12:00 PM
    Room 105 (Introductory/Intermediate): Efficient Parallel Python for High-Performance Computing - Smith, Kurt
    Room 106 (Advanced): Time Series Data Analysis with pandas - McKinney, Wes

01:00 PM - 05:00 PM
    Room 105 (Introductory/Intermediate): IPython in-depth: Interactive Tools for Scientific Computing - Perez, Fernando
    Room 106 (Advanced): statsmodels (video) - Seabold, Skipper


Efficient Parallel Python for High-Performance Computing - Kurt Smith

Bio

Kurt Smith has been using Python in scientific computing for nearly ten years and has developed tools to simplify the integration of performance-oriented languages with Python. He has contributed to the Cython project, implementing the initial version of typed memoryviews and native Cython arrays, and he uses Cython extensively in his consulting work at Enthought. He received his B.S. in physics and applied mathematics from the University of Dallas, and his Ph.D. in physics from the University of Wisconsin-Madison. His doctoral research focused on applying fluid plasma models to astrophysical systems and involved writing high-performance parallel simulations of plasma turbulence. Kurt Smith has trained hundreds of scientists, engineers, and researchers in Python, NumPy, Cython, and parallel and high-performance computing as part of Enthought's five-day scientific Python training course. He has developed course material for high-performance and parallel computing with Python, and taught the "Efficient Parallel Python for High-Performance Computing" tutorial at the SciPy 2012 conference.

Description

This tutorial is targeted at the intermediate-to-advanced Python user who wants to extend Python into High-Performance Computing. The tutorial will provide hands-on examples and essential performance tips every developer should know for writing effective parallel Python. The result will be a clear sense of possibilities and best practices using Python in HPC environments.

Many of the examples you find on parallel Python focus on the mechanics of getting the parallel infrastructure working with your code, rather than on building good, portable parallel Python. This tutorial is intended as a broad introduction to writing high-performance parallel Python and is well suited to both the beginner and the veteran developer.

We will discuss best practices for building efficient high-performance Python through good software engineering. Parallel efficiency starts with the speed of the target code itself, so we will first look at how to evolve code from for-loops to list comprehensions and generator comprehensions to using Cython with NumPy. We will also discuss how to optimize your code for speed and memory performance by using profilers.
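
To make that progression concrete, here is a minimal sketch (illustrative code, not taken from the tutorial materials) of one computation evolving from a plain loop toward vectorized NumPy:

    import numpy as np

    values = list(range(100000))

    # 1. Plain Python loop: clear, but pays interpreter overhead per element.
    squares = []
    for v in values:
        squares.append(v * v)

    # 2. List comprehension: same result with less overhead.
    squares = [v * v for v in values]

    # 3. Generator expression: lazy, avoids materializing the list at all.
    total = sum(v * v for v in values)

    # 4. NumPy: vectorized arithmetic runs in compiled code.
    arr = np.arange(100000)
    squares_arr = arr * arr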

The tutorial will cover some of the common parallel communication technologies (multiprocessing, MPI, and cloud computing) and introduce the use of parallel map and map-reduce.
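
For a flavor of the parallel map idea, the following is a small sketch using the standard-library multiprocessing module (illustrative only; the tutorial's own examples may differ):

    from multiprocessing import Pool

    def work(x):
        # Stand-in for a CPU-bound task.
        return x * x

    if __name__ == "__main__":
        pool = Pool(processes=4)
        results = pool.map(work, range(100))   # parallel map across 4 workers
        pool.close()
        pool.join()
        print(sum(results))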

At the end of the tutorial, participants should be able to write simple parallel Python scripts, make use of effective parallel programming techniques, and have a framework in place to leverage the power of Python in High-Performance Computing.

Packages Required

line_profiler, kernprof, runsnake, NumPy, SciPy, Cython, picloud

Optional Packages

mpi4py


Time Series Data Analysis with pandas - Wes McKinney

Bio

Wes McKinney is the author of pandas and vbench, and a contributor to statsmodels. Prior to starting Lambda Foundry, he worked in quantitative finance at AQR Capital Management. He's interested in data analysis, visualization, high performance computing, and testing, profiling, and performance monitoring tools.

Description

In this tutorial, I'll give a brief overview of pandas basics for new users, then dive into the nuts and bolts of manipulating time series data in memory. This includes common topics such as date arithmetic, alignment and join / merge methods, resampling and frequency conversion, time zone handling, and moving window functions like the moving mean and standard deviation. A strong focus will be placed on working with large time series efficiently using array manipulations. I'll also illustrate visualization tools for slicing and dicing time series to make informative plots. There will be several example data sets taken from finance, economics, ecology, web analytics, and other areas.
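
As a taste of these operations, a minimal sketch with synthetic data might look like the following (method names shown for recent pandas versions; details vary across releases):

    import numpy as np
    import pandas as pd

    # One year of synthetic daily data as a random walk.
    idx = pd.date_range("2012-01-01", periods=366, freq="D")
    ts = pd.Series(np.random.randn(len(idx)).cumsum(), index=idx)

    monthly = ts.resample("M").mean()         # downsample to month-end means
    lagged = ts.shift(1)                      # shifting (leading / lagging)
    smooth = ts.rolling(window=30).mean()     # 30-day moving window mean
    local = ts.tz_localize("UTC").tz_convert("US/Eastern")  # time zones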

The target audience includes individuals who already work regularly with time series data and want to acquire additional skills and knowledge, as well as users with an interest in data analysis who are new to time series. You should be comfortable with general-purpose Python programming and have a modest amount of experience with NumPy. Prior experience with the basics of pandas data structures will also be helpful.

Outline

  • IPython Notebook setup and environment check (15 minutes)
  • Pandas basics (30 minutes)
    • Series
    • DataFrame
    • Indexing, data selection and subsetting
    • Reading and writing files
  • Date and time types, string conversion (15 minutes)
  • Pandas TimeSeries basics (30 minutes)
    • Indexing and selection, subsetting
    • Data alignment
    • DataFrame merge / joins
    • Reading time series data from disk
  • Fixed frequencies (20 minutes)
    • Frequency aliases
    • Date range generation
    • Date offset classes
    • Shifting (leading and lagging)
  • Resampling (30 minutes)
    • Downsampling / aggregation
    • Upsampling + interpolation methods
  • Periods and Period arithmetic (10 minutes)
  • Time zone handling (10 minutes)
  • Plotting and visualization (20 minutes)
    • Time series plots
    • Grouped plots (by year, say)
    • Scatter plots
  • Moving window functions (15 minutes)
  • Examples (45 minutes)
    • Stock data: implement and backtest simple strategies
    • Macroeconomic data and transformations
    • Weather / ecology or tide data
      • Munging, aggregation, visualization

Required Packages

Python 2.7 or higher (including Python 3), pandas >= 0.8.0 and its dependencies, NumPy >= 1.6.1, matplotlib >= 1.0.0, pytz, IPython >= 0.12, and the dependencies for the HTML notebook application: pyzmq and tornado. EPDFree is a good starting point; it requires only pandas, dateutil, and pytz to be installed in addition. Optional: PyTables.


IPython in-depth: Interactive Tools for Scientific Computing - Fernando Perez

Bio

Fernando Perez received his PhD in theoretical physics from the University of Colorado and then worked on numerical algorithm development in the Applied Mathematics Department at the same university. He now works as a scientist at the Helen Wills Neuroscience Institute at the University of California, Berkeley, focusing on the development of new analysis methods for brain imaging problems and high-level scientific computing tools. Toward the end of his graduate studies, he became involved in the development of Python tools for scientific computing. He started the open source IPython project in 2001 while looking for a more efficient interactive workflow for everyday scientific tasks, and he continues to lead the project along with a growing team of talented developers. He is also a member of the core matplotlib development team and has contributed to numpy, scipy, sympy, mayavi, and other Python projects.

Description

IPython provides tools for interactive and parallel computing that are widely used in scientific computing. We will show some uses of IPython for scientific applications, focusing on exciting recent developments, such as the network-aware kernel, web-based notebook with code, graphics, and rich HTML, and a high-level framework for interactive parallel computing.
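
For a flavor of the parallel framework, here is a minimal sketch using the IPython.parallel client interface (IPython 0.13-era API; it assumes a cluster has already been started, e.g. with `ipcluster start -n 4`):

    from IPython.parallel import Client

    rc = Client()            # connect to the running cluster's controller
    dview = rc[:]            # a DirectView spanning all engines

    # Blocking parallel map across the engines.
    squares = dview.map_sync(lambda x: x * x, range(32))
    print(squares)

    # Asynchronous variant: returns an AsyncResult immediately.
    ar = dview.map_async(lambda x: x * x, range(32))
    print(ar.get())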

Outline

  • Overview of IPython
    • Introductory description of the project and architecture.
    • Basics: the magic command system, shell aliases, full shell access, the history system, variable caching, object introspection tools.
    • Development workflow: combining the interpreter session with Python files via the %run command.
    • Effective use of IPython at the command line for typical development tasks: timing, profiling, debugging (a brief session sketch follows this outline).
    • Embedding IPython in various contexts.
    • The IPython Qt console: unique features beyond the terminal.
    • Configuring IPython: the profile and configuration system for multiple applications.
  • The IPython notebook:
    • interactive usage of the application
    • the IPython display protocol
    • defining custom display methods for your own objects
    • generating HTML and PDF output.
  • Parallelism with IPython
    • basic architecture
    • interactive control of a cluster
    • standalone execution of applications
    • integration with MPI
    • blocking and asynchronous parallelism
    • execution in batch-controlled (PBS, SGE, etc.) environments
    • IPython engines in the cloud (illustrated with Amazon EC2 instances and starcluster).
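
A brief session sketch of the command-line features listed above (the magics shown are standard IPython; my_script.py is a hypothetical file):

    In [1]: %timeit sum(x * x for x in range(10000))   # quick micro-benchmarks

    In [2]: %run my_script.py    # run a file in the session's namespace

    In [3]: %debug               # post-mortem debugger after an exception

    In [4]: some_object?         # introspection: docstring, signature, source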

For full details about IPython, including documentation, previous presentations, and videos of talks, please see the project website.

Required Packages

  • IPython ≥ 0.13
  • pyzmq, zeromq ≥ 2.1.11
  • numpy
  • pandas
  • scipy
  • matplotlib
  • tornado ≥ 2.1.0 (for the notebook)
  • pygments (for the QtConsole)
  • PyQt or PySide (for the QtConsole)

All of these are available in EPD.

Recommended packages, used in some parallel demos:

  • NetworkX
  • mpi4py
  • BeautifulSoup
  • starcluster (for using Amazon EC2)

For installation instructions, see the conference website.


statsmodels - Skipper Seabold

Bio

Skipper Seabold is a PhD candidate in economics at American University in Washington, D.C. He specializes in applied econometrics, information theory and entropy econometrics, and topics in growth theory. He has worked on statistics in Python throughout his studies. He has presented earlier work on statsmodels at the SciPy conference and has been an active mentor for new developers working on statsmodels as part of the Google Summer of Code.

Description

This tutorial will give users an overview of the capabilities of statsmodels, including how to conduct exploratory data analysis, fit statistical models, and check that the modeling assumptions are met.

The use of Python in data analysis and statistics is growing rapidly. It is not uncommon now for researchers to do their data cleaning in Python and then move to some other software package to estimate statistical models. Statsmodels is a Python module that attempts to bridge this gap, letting users estimate statistical models, perform statistical tests, and conduct data exploration without leaving Python. Researchers in fields ranging from economics and the social sciences to finance and engineering may find that statsmodels meets their needs for statistical computing and data analysis in Python.
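
As a taste of the workflow, a minimal ordinary least squares fit might look like this (a sketch with synthetic data, illustrating the array-based API):

    import numpy as np
    import statsmodels.api as sm

    # Synthetic data: y depends linearly on x plus noise.
    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 2.0 + 3.0 * x + rng.normal(size=100)

    X = sm.add_constant(x)          # prepend an intercept column
    results = sm.OLS(y, X).fit()
    print(results.summary())        # coefficients, standard errors, diagnostics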

All examples in this tutorial will use real data. Attendees are expected to have some familiarity with statistical methods.

With this knowledge, attendees will be ready to jump in and use Python for applied statistical analysis, and they will have an idea of how to extend statsmodels for their own needs.

Outline

  • Introduction to the package structure and philosophy (15 minutes)
  • Linear Models (45 minutes)
    • ANOVA and contrasts
    • Regression (OLS, WLS, GLS)
    Will cover the use of formulas in statsmodels (see the sketch following this outline), assessing model fit through Wald tests, robust covariance estimators, and diagnostic plotting.
  • Robust Linear Models (15 minutes)
    • Fitting models. Covers the existing robust norms.
  • Discrete Choice Models (45 minutes)
    Models in which the response variable is not continuous. Examples will show usage of Poisson and Logit models. Both maximum likelihood estimators and generalized linear models will be employed.
  • Time Series Analysis (45 minutes)
    • Filtering, ARMA, and VAR modeling
    Will cover model selection and estimation, forecasting, and impulse response functions.
  • Overview of other parts of the package and ongoing development (5 minutes)
  • Working on statsmodels extensions (15 minutes, time permitting)
    Shows users how to get their hands dirty by implementing a new statistical model, using the Tobit model (for censored dependent variables) as the example.
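
Where the outline mentions formulas, the patsy-backed formula interface looks roughly like this (a sketch with synthetic data; C() marks a categorical term):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    df = pd.DataFrame({
        "x": rng.normal(size=200),
        "group": rng.choice(["a", "b", "c"], size=200),
    })
    df["y"] = 1.0 + 2.0 * df["x"] + rng.normal(size=200)

    # R-style formula: intercept, a numeric term, and categorical contrasts.
    results = smf.ols("y ~ x + C(group)", data=df).fit()
    print(results.summary())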

Packages required

statsmodels (master), numpy (>= 1.4), scipy (>= 0.7), pandas (>= 0.7.1), patsy (0.1.0), matplotlib (>= 1.0.1)

patsy: http://patsy.readthedocs.org/

For installation instructions, see the conference website.

Preliminary Materials

Tutorials will draw from our many examples and our documentation:

  • ANOVA, Linear Models, Formulas and Contrasts