# SciPy 2011 Tutorials

This year, there will be two days of tutorials, July 11th and 12th, before the SciPy 2011 Conference. Each of the two tutorial tracks (introductory, advanced) will have a 3-4 hour morning and afternoon session both days, for a total of 4 half-day introductory sessions and 4 half-day advanced sessions.

## Introductory Track

#### Day 1

- Introduction to NumPy with IPython and Matplotlib - *Jonathan Rocher*
- Introduction to SciPy: optimization, linear algebra, statistics, and more... - *Anthony Scopatz*

#### Day 2

- Guide to Symbolic Mathematics with SymPy - *Mateusz Paprocki and Aaron Meurer*
- Introduction to Matplotlib, Traits, and Chaco - *Corran Webster*

## Advanced Track

#### Day 1

- An Introduction to Bayesian Statistical Modeling using PyMC - *Christopher J. Fonnesbeck and Abie Flaxman*
- Statistical Learning with scikit-learn - *Gael Varoquaux*

#### Day 2

- High Performance Parallel Computing in Python using NumPy and the Global Arrays Toolkit - *Jeff Daily*
- Interactive Parallel Computing with IPython and PyZMQ - *Min Ragan-Kelley*

# Descriptions

## Introduction to NumPy with IPython and Matplotlib - *Jonathan Rocher*

##### Bio

Jonathan holds a PhD in physics from the University of Paris and used to be a research scientist specializing in astrophysics and particle physics. He is currently a scientific software developer at Enthought, Inc. in Austin, TX. He contributes to open source projects as well as commercial applications specializing in geophysics and fluid dynamics.

##### Description

NumPy is a fundamental package for scientific computing with Python. It adds to Python a data structure (the NumPy array) backed by a large library of mathematical functions and operations, providing a powerful framework for fast computations in multiple dimensions. This data structure also provides fine-grained control over memory access and allows interfacing with all sorts of data sources. It is the basis for all SciPy packages, which vastly extend the computational and algorithmic capabilities of Python.
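A minimal sketch of the array type and vectorized operations described above (assuming NumPy is installed):

```python
import numpy as np

# Create a 2D array and operate on it element-wise, without Python loops
a = np.arange(12, dtype=float).reshape(3, 4)  # 3x4 array of 0.0 .. 11.0
b = a * 2 + 1                                  # vectorized arithmetic

print(b.shape)        # (3, 4)
print(b[1, 2])        # element at row 1, column 2
print(b.sum(axis=0))  # column sums
```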

##### Outline

- NumPy: history and overview
  - History
  - Overview
- Basic plotting with Matplotlib
  - 2D plots
  - Histograms
  - Scatter plots
  - Displaying images
- Fast computations with NumPy arrays
  - Creating NumPy arrays
  - Computations with NumPy arrays
  - Types and shapes of NumPy arrays
  - Built-in operations on a NumPy array
  - Slicing and indexing
  - From data files to arrays and back
- Advanced concepts
  - The underlying data structure
  - Broadcasting
  - Structured arrays
  - Memory mapped arrays
- NumPy beyond the underlying data structure
  - Random number generation
  - Linear algebra with NumPy arrays
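The broadcasting and slicing topics listed above can be previewed in a few lines (assuming NumPy is installed):

```python
import numpy as np

# Broadcasting: a (3, 1) column combines with a (4,) row to give a (3, 4) grid
col = np.array([[0.0], [10.0], [20.0]])
row = np.array([1.0, 2.0, 3.0, 4.0])
grid = col + row          # shapes (3, 1) and (4,) broadcast to (3, 4)

# Slicing returns views into the same memory, not copies
sub = grid[:2, ::2]       # first two rows, every other column
```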

##### Packages Required

NumPy (1.5+), matplotlib (0.9+)

##### Optional Packages

##### Files

## Introduction to SciPy : optimization, linear algebra, statistics, and more... - *Anthony Scopatz*

##### Bio

Computational scientist and long-time Python developer, Anthony holds a B.S. in physics from UC Santa Barbara and an M.S.E. in mechanical engineering from UT Austin. Currently a Ph.D. candidate in nuclear engineering at UT Austin, Anthony's research interests revolve around physics-based modeling of the nuclear fuel cycle and related information-theoretic metrics. He has published and spoken at numerous conferences on both nuclear engineering and Python.

##### Description

SciPy is open-source software for mathematics, science, and engineering. The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy library is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines, such as routines for numerical integration and optimization. Together, they are powerful enough to be depended upon by some of the world's leading scientists. This tutorial will focus on some of the more often used sub-packages in SciPy, providing coverage of the mathematical models, the software architecture, and examples and exercises.
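As a small taste of the optimization routines mentioned above, a minimal root-finding sketch (assuming SciPy is installed):

```python
import numpy as np
from scipy import optimize

# Find the fixed point of cos: the x where cos(x) - x = 0
f = lambda x: np.cos(x) - x
root = optimize.brentq(f, 0.0, 1.0)  # bracketing root finder on [0, 1]
```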

##### Outline

- Packages discussed
- linear algebra / sparse matrices
- probability and statistics
- optimization
- integration

##### Packages Required

##### Optional Packages

## Guide to Symbolic Mathematics with SymPy - *Mateusz Paprocki and Aaron Meurer*

##### Bio

Mateusz Paprocki has been a core SymPy developer since 2007. He was a Google Summer of Code student and a two-time mentor for SymPy. He has also given talks about SymPy at various conferences and scientific meetings (most notably EuroSciPy, Py4Science and PyCon.PL).

Aaron Meurer has been a core SymPy developer since 2009 and is the current leader of the project. He was twice a Google Summer of Code student for SymPy, and is currently pursuing a bachelor's degree in mathematics at New Mexico Tech.

##### Description

SymPy is a pure Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python and does not require any external libraries.

In this tutorial we will introduce attendees to SymPy. We will start by showing how to install and configure this Python module. Then we will proceed to the basics of constructing and manipulating mathematical expressions in SymPy. We will also discuss the most common issues and differences from other computer algebra systems, and how to deal with them. In the last part of this tutorial we will show how to solve simple, yet illustrative, mathematical problems with SymPy.

This knowledge should be enough for attendees to start using SymPy for solving mathematical problems and hacking SymPy's internals (though hacking core modules may require additional expertise).

We expect attendees of this tutorial to have basic knowledge of Python and mathematics. However, any more advanced topics will be explained during the presentation.
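The basics of constructing and manipulating expressions can be sketched in a few lines (assuming SymPy is installed):

```python
from sympy import symbols, expand, factor, diff

x, y = symbols('x y')          # symbols must be defined explicitly

expr = (x + y)**2
expanded = expand(expr)        # x**2 + 2*x*y + y**2
refactored = factor(expanded)  # back to (x + y)**2
derivative = diff(expr, x)     # 2*x + 2*y

# Substituting subexpressions
value = expr.subs({x: 1, y: 2})  # 9
```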

##### Outline

- installing, configuring and running SymPy
  - Python, IPython, isympy, SymPy in a web browser
  - configuration variables and their meaning
- basics of expressions in SymPy
  - building blocks of expressions
  - core structure of classes
  - ways of defining symbols
  - dummy symbols and their role
- constructing expressions
  - automatic evaluation
  - obtaining parts of expressions
  - substituting subexpressions
  - expressions in data structures
  - turning strings into expressions
- traversal and manipulation of expressions
  - manual and interactive traversal of subexpressions
  - search and replace in expressions
  - most common expression manipulation functions
  - transforming expressions between different forms
- common issues and differences from other CAS
  - why do I have to define symbols?
  - ``1/3`` is not a rational number
  - ``^`` is not the exponentiation operator
  - why you shouldn't write ``10**(-1000)``
  - how to deal with limited recursion depth
  - expression caching and its consequences
- setting up and using printers
  - repr/str, pretty (ASCII, Unicode), LaTeX, MathML
  - generating code with SymPy (C, Fortran)
  - defining your own customized printers
- querying expression properties
  - why doesn't ``sqrt(x**2)`` give ``x``?
  - defining your own query handlers
- not only symbolics: numerical computing (mpmath)
  - evaluation of expressions to arbitrary precision
  - symbolics vs. numerics (limits)

##### Mathematical problem solving with SymPy

- **Computing certain sums of roots of polynomials.** *Given a univariate polynomial ``f(z)`` and a univariate rational function ``g(r)``, compute ``g(r_1) + ... + g(r_n)``, where the ``r_i`` are roots of ``f`` (i.e. ``f(r_i) = 0``).* To solve this problem we will use expression manipulation functions to put the sum in a certain form, and then use symmetric reduction of multivariate polynomials and Viete's formulas to obtain the final result.
- **Vertex coloring of graphs.** *Suppose we are given a graph ``G(V, E)``, where ``V`` is a set of vertices and ``E`` a set of edges, and a positive integer ``k``. We ask whether ``G`` is colorable with ``k`` colors and what the color assignments are.* To handle this task we will transform the graph-theoretic formulation of the graph ``k``-coloring problem into a system of multivariate polynomial equations and solve it using Groebner bases.
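The Groebner-basis encoding of ``k``-coloring can be sketched for ``k = 3`` on a small hypothetical instance, a triangle graph (assuming SymPy is installed; this is an illustrative sketch, not the tutorial's worked solution). Each vertex variable is constrained to be a cube root of unity, and each edge contributes a polynomial forcing its endpoints to differ:

```python
from sympy import symbols, groebner

x, y, z = symbols('x y z')

# Vertex constraints: each variable must be a cube root of unity (a "color")
vertex_polys = [x**3 - 1, y**3 - 1, z**3 - 1]

# Edge constraint for (u, v): u**2 + u*v + v**2 = 0 forces u != v,
# since u**3 - v**3 = (u - v)*(u**2 + u*v + v**2) and both cube to 1
edges = [(x, y), (y, z), (x, z)]          # a triangle graph
edge_polys = [u**2 + u*v + v**2 for u, v in edges]

G = groebner(vertex_polys + edge_polys, x, y, z, order='lex')

# The basis collapses to [1] exactly when the system has no solution,
# i.e. when the graph is not 3-colorable; a triangle is 3-colorable
colorable = 1 not in list(G)
```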

##### Packages Required

Python 2.x (Python 3.x is not supported yet), SymPy (most recent version)

##### Optional Packages

IPython, matplotlib, NetworkX, GMPY

##### Online Guide

http://mattpap.github.com/scipy-2011-tutorial/html/index.html

## Introduction to Matplotlib, Traits, and Chaco - *Corran Webster*

## An Introduction to Bayesian Statistical Modeling using PyMC - *Christopher J. Fonnesbeck and Abie Flaxman*

##### Bio

Chris is a Biostatistician in the Department of Biostatistics at the Vanderbilt University School of Medicine. He specializes in computational statistics, Bayesian methods, meta-analysis, applied decision analysis and machine learning. He originally hails from Vancouver, BC and received his Ph.D. from the University of Georgia.

Abie is a computer scientist hiding out in the Global Health Department at the University of Washington. He was trained in the theoretical aspects of randomized algorithms, but now spends his days measuring population health with statistical models (ideally in Python with PyMC). Abie blogs about applications of computer science to global health at http://healthyalgorithms.wordpress.com.

##### Description

PyMC is a Python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo. Its flexibility and extensibility make it applicable to a large suite of problems across all quantitative disciplines. This hands-on tutorial will introduce users to the key components of PyMC and how to employ them to construct, fit and diagnose models. Though some familiarity with statistics is assumed, the tutorial will begin with a brief overview of Bayesian inference, including an introduction to Markov chain Monte Carlo.
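To make the MCMC idea concrete before touching PyMC itself, here is a minimal random-walk Metropolis sampler in plain Python; this is a sketch of the algorithm PyMC automates, not PyMC's own implementation:

```python
import math
import random

def metropolis(logp, start, n_samples, step=1.0, seed=42):
    """Sample from an unnormalized log-density via random-walk Metropolis."""
    rng = random.Random(seed)
    x = start
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        # Accept with probability min(1, p(proposal)/p(x))
        if math.log(rng.random()) < logp(proposal) - logp(x):
            x = proposal
        samples.append(x)
    return samples

# Target: standard normal (log-density up to an additive constant)
samples = metropolis(lambda x: -0.5 * x * x, start=0.0, n_samples=5000)
mean = sum(samples) / len(samples)  # should be near 0
```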

All code snippets and other resources relevant to the tutorial are available on GitHub.

##### Outline

- Introduction to Bayesian Statistics and PyMC (Chris, 1 hr)
  - Thomas Bayes
  - Bayesian Inference
    - Probability
    - Bayes Theorem
    - Probability Distributions
    - Likelihoods and Priors
    - Posterior Calculation
  - Approximation Methods
    - Markov chain Monte Carlo
  - History of PyMC
- Model Building in PyMC (Abie, 1 hr)
  - Motivating Example, regression of TFR on LDI?
  - Variables and Data: csv2rec, additional data wrangling tricks, and good practices
  - Specifying Models
  - Advanced Examples
- Model Fitting in PyMC (Abie, 1 hr)
  - Fitting Models
  - Tests of convergence and Diagnostics
  - Saving and Managing Output
  - Model Checking: AIC, BIC, DIC; posterior predictive checks; predictive validity
- Extending PyMC (Chris, 1 hr)
  - Implementing new MCMC algorithms
  - Gradient-based MCMC
  - Gaussian processes

##### Packages Required

NumPy (1.6+), matplotlib (1.0+), PyMC (2.2)

##### Optional Packages

## Statistical Learning with scikit-learn - *Gael Varoquaux*

##### Bio

Gael Varoquaux: research fellow at INSERM, associate researcher at INRIA. I work on statistical analysis of brain functional imaging data, applying machine learning techniques to learn empirical models of brain function from large datasets. I am the project leader for scikit-learn, as well as a core contributor to Mayavi and Nipy (Neuroimaging in Python). I have extensive experience teaching scientific Python, via classes and tutorials.

##### Description

The goal of the tutorial is to introduce attendees to statistical data processing in high dimensions and data mining, using scikit-learn: http://scikit-learn.sourceforge.net/

Machine learning is a technique of growing importance, as the size of the datasets experimental sciences face is rapidly growing. The problems it tackles range from building a prediction function linking different observations, to classifying observations, to learning the structure in an unlabeled dataset. By `statistical learning` we denote the use of machine learning techniques with the goal of statistical inference: drawing conclusions from the data at hand.

The tutorial will be hands on, starting from concrete data analysis problems that users can relate to, and progressively introducing machine learning tools that can be used to solve these problems. Examples will be drawn from signal and image processing problems that are ubiquitous in scientific data analysis, financial modeling, and engineering.

The target audience is scientists and developers who know NumPy and SciPy well. No prior knowledge of machine learning, data processing or statistics is required.
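The `fit`/`predict` estimator interface around which scikit-learn is organized can be illustrated with a tiny hand-rolled 1-nearest-neighbor classifier in plain NumPy; this is an illustrative stand-in, not scikit-learn's implementation:

```python
import numpy as np

class OneNearestNeighbor(object):
    """Toy estimator exposing scikit-learn's fit/predict interface."""

    def fit(self, X, y):
        # "Training" is just memorizing the data
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Squared distance from every query point to every training point
        dists = ((X[:, None, :] - self.X_[None, :, :]) ** 2).sum(axis=-1)
        return self.y_[dists.argmin(axis=1)]

clf = OneNearestNeighbor().fit([[0, 0], [1, 1], [5, 5]], [0, 0, 1])
labels = clf.predict([[0.2, 0.1], [4.5, 5.0]])  # label of nearest training point
```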

##### Outline

- Statistical learning: the setting and the estimator object in scikit-learn
- Supervised learning: predicting an output variable from high-dimensional observations
  - Nearest neighbor and the curse of dimensionality
  - Linear model: from regression to sparsity
  - Support vector machines
  - Gaussian process: introducing the notion of posterior estimate
- Unsupervised learning: seeking representations of the data
  - Clustering: grouping observations together
  - Decompositions: from a signal to components and loadings
- Model selection: choosing estimators and their parameters
  - Score, and cross-validated scores
  - Cross-validation generators
  - Grid-search and cross-validated estimators
- Putting it all together
  - Combining estimators with the Pipeline
  - Wrapping up: face classification example

##### Packages Required

NumPy >= 1.3, SciPy >= 0.7, matplotlib, IPython, scikit-learn >= 0.6 (>= 0.7 preferred but not required)

## High Performance Parallel Computing in Python using NumPy and the Global Arrays Toolkit - *Jeff Daily*

##### Bio

Mr. Daily joined the staff of Pacific Northwest National Laboratory in 2005 and received his MS and BS in computer science from Washington State University in 2009 and 2007, respectively. He is currently working on the development of software toolkits for writing high performance parallel computing codes, as well as on high performance parallel climate analysis tools. He has long-standing interests in both the Python programming language and high performance computing, and is currently pursuing research combining these interests. He is the primary author of the Python bindings for the Global Arrays toolkit, as well as of the parallel implementation of NumPy called Global Arrays in NumPy (GAiN).

##### Description

This tutorial will provide an overview of the Global Arrays (GA) parallel programming toolkit while emphasizing application development. Several new functionalities will be highlighted including a parallel implementation of the NumPy module. The target audience is programmers who are familiar with scalar programming but not parallel programming and parallel programmers who use message-passing libraries such as the Message Passing Interface (MPI), but may not be familiar with shared memory and one-sided styles of programming.

The tutorial will begin with an overview of one-sided communication and Partitioned Global Address Space programming models. Next, the basic functionality and programming model provided by GA is described. GA's approach will then be compared to the model provided by MPI via instructive, sample exercises. Compatibility with MPI will be briefly discussed. This will be followed by a discussion of more advanced features of GA coupled with examples of how these are used in actual applications. The examples will be further illustrated with interactive coding demonstrations using the sample code provided.

The new parallel implementation of the NumPy module, GAiN, will be covered last. The audience is assumed to have at least basic knowledge of the NumPy API. Using GAiN as a drop-in replacement for NumPy will be discussed along with the caveats of transitioning from serial to parallel codes. Mixing GA functionality with GAiN will be discussed last.

##### Outline

- Introduction and overview of one-sided communication
  - Distributed data versus global index space
  - One-sided communication vs. message-passing
  - Shared-memory style programming
  - Programmer productivity
  - Programming challenges at large scales
- Overview of the Global Arrays programming model
  - Downloading and building GA using autoconf
  - Compiling and running GA programs
  - Interoperability with MPI
  - Structure of the GA library
  - 10 basic GA commands
    - PUT/GET
    - Synchronization
    - Data locality
  - Global Array model for computation
  - Simple programming example (1D transpose)
- Advanced GA programming concepts illustrated by sample exercises
  - Non-blocking communication (matrix multiply)
  - Global counters (read-inc)
  - Gather/Scatter
  - Processor groups, sparse data, block-cyclic data, ghost cells
- GAiN
  - Overview of GAiN
  - Caveats
    - What types of codes can ``import gain as numpy``
    - Notable differences between GAiN and NumPy
  - Using GAiN
  - Advanced GAiN and GA/GAiN interoperability
    - Processor groups
    - Irregular data distributions
    - Custom data distributions using restricted arrays

##### System Requirements

Global Arrays is currently only compatible with Unix-based operating systems, e.g. Linux and OS X (Mac). GA developers do not have access to Windows HPC platforms, nor has Global Arrays been tested on Windows using Cygwin. Tutorial attendees must either have a Unix-based laptop or be able to SSH to a Unix-based system.

##### Packages Required

mpi4py (latest stable), NumPy (latest stable)

##### Files

## Interactive Parallel Computing with IPython and PyZMQ - *Min Ragan-Kelley*

##### Bio

Min is a member of the core IPython development team. All physicists by training, the IPython developers have made productive interactive tools for scientific computing a priority in their development of IPython and its parallel computing package.

##### Description

Scientific computing in Python is always growing, and parallel computing is a large part of that growth. IPython has focused on bringing interactivity to existing tools. With IPython 0.11, the interactive parallel computing component has been completely rewritten using PyZMQ, resulting in dramatic performance improvements as well as a much improved interface. Gone are the days of passing code as strings to do your work remotely.

This tutorial will cover how to use the new IPython.parallel for various multiplexed and load-balanced workloads. Examples will include synchronous and asynchronous execution, Monte Carlo approximations, Map/Reduce, and various approaches to load-balancing. Gathering results from earlier sessions, profiling execution with IPython's task metadata, and some examples of messaging with PyZMQ will also be covered.
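The Monte Carlo approximation workload mentioned above can be sketched serially; with IPython.parallel, the builtin ``map`` below would become a ``map`` on a DirectView or LoadBalancedView, shipping each chunk to a cluster engine (this sketch assumes no cluster and runs in-process):

```python
import random

def mc_pi_chunk(n, seed):
    """Count hits inside the unit quarter-circle out of n random points."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

# Serially, map distributes independent chunks; on a cluster, a parallel
# view's map would farm each chunk out to a different engine.
chunks = [(100000, seed) for seed in range(4)]
hits = list(map(lambda args: mc_pi_chunk(*args), chunks))
pi_estimate = 4.0 * sum(hits) / (4 * 100000)
```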

##### Outline

- General IPython.parallel overview
  - Using ipcluster and the IPython config system
  - Principal interface classes: the Client and the View
  - Functions as work units
  - Using AsyncResults
  - Task granularity: how small can your tasks be?
  - Safe non-copying sends of arrays
  - Remote exceptions
- Multiplexed computing with IPython
  - Remote imports
  - Distributed array operations with scatter/gather
  - Breaking up iterables with Map/Reduce (Monte Carlo Pi)
  - Reconstructing
- Load-balanced task farming
  - Simple task submission (prime search)
  - Using graph dependencies to control task flow
  - Location dependencies
  - Aborting tasks
  - Using task metadata to inspect execution
  - Requesting results from prior sessions
- PyZMQ and messaging patterns for computing (time permitting)
  - Zero-copy sends of NumPy data
  - XREQ for load-balancing
  - Low-latency logging with PyZMQ
  - Inter-engine communication