## Tutorials Schedule:

The Tutorials Schedule (July 16th & 17th) is in its final stages of confirmation. There may be changes made to the schedule between now and the conference.

Monday - July 16th

| Time | Room 105 | Room 106 |
|---|---|---|
| 08:00 AM - 12:00 PM | Introductory/Intermediate: Introduction to NumPy and Matplotlib (Video) - Jones, Eric | Advanced: scikit-learn - Vanderplas, Jake |
| 01:00 PM - 05:00 PM | Introductory/Intermediate: HDF5 is for lovers (Video) - Scopatz, Anthony | Advanced: Advanced Matplotlib - May, Ryan |

## Introduction to NumPy and Matplotlib - *Eric Jones*

### Bio

Eric has a broad background in engineering and software development and leads Enthought's product engineering and software design. Prior to co-founding Enthought, Eric worked with numerical electromagnetics and genetic optimization in the Department of Electrical Engineering at Duke University. He has taught numerous courses on the use of Python for scientific computing and serves as a member of the Python Software Foundation. He holds M.S. and Ph.D. degrees from Duke University in electrical engineering and a B.S.E. in mechanical engineering from Baylor University.

### Description

NumPy is the most fundamental package for scientific computing with Python. It adds to the Python language a data structure (the NumPy array) that has access to a large library of mathematical functions and operations, providing a powerful framework for fast computations in multiple dimensions. NumPy is the basis for all SciPy packages, which vastly extend the computational and algorithmic capabilities of Python, as well as for many visualization tools such as Matplotlib, Chaco, or Mayavi.

This tutorial will teach students the fundamentals of NumPy, including fast vector-based calculations on numpy arrays, the origin of its efficiency and a short introduction to the matplotlib plotting library. In the final section, more advanced concepts will be introduced including structured arrays, broadcasting and memory mapping.
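As a rough illustration of what "fast vector-based calculations" means in practice, the sketch below computes the same quantity twice: once with an element-by-element Python loop and once with a single array expression. The function names and sample values are made up for this example and are not taken from the tutorial materials.

```python
import numpy as np


def distances_loop(xs, ys):
    """Pure-Python version: one square root per pair of elements."""
    return [(x * x + y * y) ** 0.5 for x, y in zip(xs, ys)]


def distances_vectorized(xs, ys):
    """Same computation expressed on whole arrays at once."""
    return np.sqrt(xs ** 2 + ys ** 2)


# Illustrative inputs: four evenly spaced points per axis.
xs = np.linspace(0.0, 3.0, 4)
ys = np.linspace(0.0, 4.0, 4)

result = distances_vectorized(xs, ys)
```

The vectorized form pushes the loop into compiled code, which is where most of NumPy's speed advantage over plain Python lists comes from.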

### Outline

- NumPy: history and overview
    - History
    - Overview

- Basic plotting with Matplotlib
    - 2D plots
    - Histograms
    - Scatter plots
    - Displaying images

- Fast computations with NumPy arrays
    - Creating NumPy arrays
    - Computations with NumPy arrays
    - Types and shapes of NumPy arrays
    - Built-in operations on a NumPy array
    - Slicing and indexing
    - From data files to arrays and back

- Advanced concepts
    - The underlying data structure
    - Broadcasting
    - Structured arrays
    - Memory mapped arrays
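The advanced concepts listed above can each be shown in a few lines. The following is a minimal sketch, not material from the tutorial itself; the array contents, field names, and the temporary file name are illustrative assumptions.

```python
import os
import tempfile

import numpy as np

# Broadcasting: a (3, 1) column and a (4,) row combine into a (3, 4) grid
# without either operand being copied or tiled by hand.
col = np.arange(3).reshape(3, 1)
row = np.arange(4)
table = col * 10 + row

# Structured array: each element is a record with named, typed fields.
# The two records are illustrative sample data.
stars = np.array(
    [("Vega", 0.03), ("Sirius", -1.46)],
    dtype=[("name", "U10"), ("mag", "f8")],
)
brightest = stars[np.argmin(stars["mag"])]["name"]

# Memory-mapped array: the data lives in a file on disk and is paged in
# on demand, so it can be larger than available RAM.
path = os.path.join(tempfile.mkdtemp(), "demo.dat")  # hypothetical file name
mm = np.memmap(path, dtype="float64", mode="w+", shape=(1000,))
mm[:] = np.arange(1000)
mm.flush()
del mm  # drop the writable mapping before reopening

readback = np.memmap(path, dtype="float64", mode="r", shape=(1000,))
tail = readback[-3:].tolist()
```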

### Required Packages

This tutorial requires Python 2.6+ or 3.1+, NumPy 1.6.1+, IPython 0.11+, and Matplotlib 1.0+ to be installed on your laptop. All these packages are available in various one-click installers, including EPDFree.

In addition:

- Download and unpack the tutorial files: introduction_numpy_matplotlib.zip (5.7 MB).
- To test whether your installation is working, follow the instructions on page 7 of the manual. The speed of light folder is inside the class folder, at student/demo/speed_of_light/.

## scikit-learn - *Jake Vanderplas*

### Bio

Jake Vanderplas is an NSF postdoctoral research fellow, working jointly between the Astronomy and Computer Science departments at the University of Washington, and is interested in topics at the intersection of large-scale machine learning and wide-field astronomical surveys. He is co-author of the book “Statistics, Data Mining, and Machine Learning in Astronomy”, which will be published by Princeton University Press later this year. In the Python world, Jake is the author of AstroML, and a maintainer of scikit-learn and SciPy. He gives regular talks and tutorials at various Python conferences, and occasionally blogs his thoughts and his code at Pythonic Perambulations: http://jakevdp.github.com.

### Description

Machine learning has been getting a lot of buzz lately, and many software libraries have been created which implement these routines. scikit-learn is a Python package built on NumPy and SciPy which implements a wide variety of machine learning algorithms, useful for everything from facial recognition to optical character recognition to automated classification of astronomical images. This tutorial will begin with a crash course in machine learning and introduce participants to several of the most common learning techniques for classification, regression, and visualization. Building on this background, we will explore several applications of these techniques to scientific data -- in particular, galaxy, star, and quasar data from the Sloan Digital Sky Survey -- and learn some basic astrophysics along the way. From these examples, tutorial participants will gain the knowledge and experience needed to successfully solve a variety of machine learning and statistical data mining problems with Python.

### Outline

- 0:00 - 0:20 Setup - making sure all participants have the necessary packages and datasets for tutorials (datasets will be posted on my web page)
- 0:20 - 1:20 Machine Learning 101: introduction to Supervised learning (Regression/Classification) and Unsupervised learning (clustering/dimensionality reduction) and the scikit-learn interface for these tools.
- 1:20 - 1:40 Working through a classification exercise using the iris dataset
- 1:40 - 2:20 Practical advice for machine learning (including examples): bias, variance, over-fitting, under-fitting, and cross-validation.
- 2:20 - 2:50 Classification exercise: star-quasar classification with Naive Bayes (example done together) and Gaussian Mixture Models (exercise done independently)
- 2:50 - 3:20 Regression exercise: photometric redshifts with decision trees and random forests (example done together). Applying cross-validation and learning curves to choose the optimal model complexity (exercise done independently)
- 3:20 - 3:50 Dimensionality reduction exercise: reducing the dimensionality of high-dimensional spectral data with PCA (example done together) and with manifold learning methods (exercise done independently)
- 3:50 - 4:00 Conclusion and review

This follows the general outline of the online tutorial I've prepared for scikit-learn (see link above). For the purpose of SciPy 2012, I plan to convert most of this material into IPython notebooks for interactive instruction, though an updated version of the web page will be available as well.
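To give a flavor of the iris classification exercise and the cross-validation advice in the outline, here is a minimal sketch. It assumes a current scikit-learn release, where the cross-validation helpers live in `sklearn.model_selection`; the versions used at the tutorial exposed them under a different module. The choice of Gaussian Naive Bayes and five folds is an assumption made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# The iris dataset ships with scikit-learn: 150 samples, 4 features, 3 classes.
iris = load_iris()
X, y = iris.data, iris.target

# Estimate generalization accuracy with 5-fold cross-validation instead of
# scoring on the training data, which would hide over-fitting.
model = GaussianNB()
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()
```

Swapping `GaussianNB()` for another estimator leaves the rest of the code unchanged, which is the uniform estimator interface the Machine Learning 101 segment refers to.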

### Packages Required

numpy, scipy, scikit-learn (bleeding-edge not necessary; I'll make everything compatible with the Ubuntu distro versions)

IPython *including the recently released IPython notebook*. Participants should be able to run "ipython notebook" in the command line and see the IPython dashboard in their web browser (version 2+ is fine).

Note that an EPDfree installation contains all the necessary dependencies, with the exception of scikit-learn.

For installation instructions, click here.

## HDF5 is for lovers - *Anthony Scopatz*

### Bio

Anthony Scopatz is a computational nuclear engineer / physicist post-doctoral scholar at the FLASH Center at the University of Chicago. His initial workshop teaching experience came from instructing bootcamps for The Hacker Within, a peer-led teaching organization at the University of Wisconsin. Out of this grew a collaboration teaching Software Carpentry bootcamps in partnership with Greg Wilson. During his tenure at Enthought, Inc., Anthony taught many week-long courses (approx. 1 per month) on scientific computing in Python.

### Description

HDF5 is a hierarchical, binary database format that has become a *de facto* standard for scientific computing. While the specification may be used in a relatively simple way (persistence of static arrays) it also supports several high-level features that prove invaluable. These include chunking, ragged data, extensible data, parallel I/O, compression, complex selection, and in-core calculations. Moreover, HDF5 bindings exist for almost every language - including two Python libraries (PyTables and h5py).

This tutorial will discuss tools, strategies, and hacks for really squeezing every ounce of performance out of HDF5 in new or existing projects. It will also go over fundamental limitations in the specification and provide creative and subtle strategies for getting around them. Overall, this tutorial will show how HDF5 plays nicely with all parts of an application making the code and data both faster and smaller. With such powerful features at the developer's disposal, what is not to love?!

This tutorial is targeted at a more advanced audience which has a prior knowledge of Python and NumPy. Knowledge of C or C++ and basic HDF5 is recommended but not required.

### Outline

- Meaning in layout (20 min)
    - Tips for choosing your hierarchy

- Advanced datatypes (20 min)
    - Tables
    - Nested types
    - Tricks with malloc() and byte-counting

- Exercise on above topics (20 min)
- Chunking (20 min)
    - How it works
    - How to properly select your chunksize

- Queries and Selections (20 min)
    - In-core vs Out-of-core calculations
    - PyTables.where()
    - Datasets vs Dataspaces

- Exercise on above topics (20 min)
- The Starving CPU Problem (1 hr)
    - Why you should always use compression
    - Compression algorithms available
    - Choosing the correct one
    - Exercise

- Integration with other databases (1 hr)
    - Migrating to/from SQL
    - HDF5 in other databases (JSON example)
    - Other Databases in HDF5 (JSON example)
    - Exercise
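The chunking and compression topics in the outline can be sketched in a few lines. The tutorial itself is built around PyTables; the example below uses h5py, the other Python binding mentioned in the description, purely for brevity. The dataset name, chunk size, and gzip level are illustrative assumptions, not recommendations from the tutorial.

```python
import os
import tempfile

import h5py
import numpy as np


def write_compressed(path, data, chunk_rows=100):
    """Store a 2-D array in row-wise chunks with gzip compression."""
    with h5py.File(path, "w") as f:
        dset = f.create_dataset(
            "measurements",                      # hypothetical dataset name
            data=data,
            chunks=(chunk_rows, data.shape[1]),  # each chunk holds whole rows
            compression="gzip",
            compression_opts=4,                  # gzip level, 0-9
        )
        return dset.chunks, dset.compression


def read_rows(path, start, stop):
    """Read a row range; only the chunks overlapping it are decompressed."""
    with h5py.File(path, "r") as f:
        return f["measurements"][start:stop]


data = np.arange(1000 * 8, dtype="float64").reshape(1000, 8)
path = os.path.join(tempfile.mkdtemp(), "demo.h5")

chunks, compression = write_compressed(path, data)
block = read_rows(path, 200, 300)
```

Choosing a chunk shape that matches the dominant access pattern (here, contiguous row ranges) keeps the number of chunks touched per read small.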

### Packages Required

This tutorial will require Python 2.7, IPython 0.12+, NumPy 1.5+, and PyTables 2.3+. ViTables and Matplotlib are also recommended. These may all be found in Linux package managers. They are also available through EPD or easy_install. ViTables may need to be installed independently (http://vitables.org/).

## Advanced Matplotlib - *Ryan May*

### Bio

Ryan May is a Software Engineer at Enterprise Electronics Corporation and a Doctoral student in the School of Meteorology at the University of Oklahoma. His primary interest in Python is for its application for data visualization and for rapid development and testing of signal processing techniques for weather radar applications. He has also been a developer for the Matplotlib project since 2008, giving an introductory Matplotlib tutorial at SciPy 2010. Among Ryan's contributions to Matplotlib are improvements to its spectral analysis routines, wind barb support (for the meteorological community), and, most recently, simplified support for creating and saving animations.

### Description

Matplotlib is one of the main plotting libraries in use within the scientific Python community. This tutorial covers advanced features of the Matplotlib library, including many recent additions: laying out axes, animation support, Basemap (for plotting on maps), and other tweaks for creating aesthetic plots. The goal of this tutorial is to expose attendees to several of the chief sub-packages within Matplotlib, helping to ensure that users maximize the use of the full capabilities of the library. Additionally, the attendees will be run through a "grab-bag" of tweaks for plots that help to increase the aesthetic appeal of created figures. Attendees should be familiar with creating basic plots in Matplotlib as well as basic use of NumPy for manipulating data.

### Outline

- Basemap (45 minutes)
    - Explain basics of Basemap with regard to projections
    - Using a basemap instance to project lat/lon points
    - Basemap support for maps
    - Explain Basemap's wrapping of plot methods
    - Exercise: Scatter plot of point data on map
    - Exercise: Plotting information from shape files

- Animations (1 hour)
    - Show basic draw() loop example and the problems
        - Can't resize
        - Can't interact

    - Solution 1: Callback-based using only generic timer object
        - Introduce artist animation
        - Exercise: Animate a scatter plot of random data

    - This works, but can be inefficient...FuncAnimation
        - Eliminates need to create multiple artists
        - Show example that sets array data
        - Then just go full procedurally generated data
        - Exercise: Create a line plot with data from a function

    - Show saving movie files

- Axes Layouts (1 hour)
    - plt.subplots()
    - sharing axes/labels
    - tight_layout()
    - Exercise: Create common axes with nice spacing, eliminating superfluous labels as well as problems with overlapping text
    - AxesGrid, ImageGrid, AnchoredArtist
    - Exercise: Use image grid to create image plots with shared colorbars and labelled subplots (labelled with text boxes for a,b,c, etc.)
    - Cover more advanced features of axesgrid/imagegrid
    - More complex layouts
    - Nested layouts
    - Inset axes

- Tweaks for Aesthetic plots (45 minutes)
    - Drop shadows on boxes
    - Advanced legends with proxy artists
    - hatching, line width, other graphics properties
    - Exercise: Take a plot and make it look much better with a few tweaks
    - Drop shadow on legend with rounded corners
    - Move from outside axes title to an AnchoredArtist based text box at the top of the plot
    - Improve scatter plot legend using proxy artists
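Two of the outline topics, shared-axes layouts and `FuncAnimation`, can be sketched briefly. This is an illustrative example rather than tutorial material; the plotted functions, frame count, and interval are assumptions, and the Agg backend is selected only so the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.animation import FuncAnimation

x = np.linspace(0, 2 * np.pi, 200)

# Axes layout: a 2x2 grid whose panels share both axes, so tick labels
# appear only on the outer edges; tight_layout() then fixes the spacing.
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
for k, ax in enumerate(axes.flat, start=1):
    ax.plot(x, np.sin(k * x))
    ax.set_title("sin(%dx)" % k)
fig.tight_layout()

# FuncAnimation: reuse a single Line2D artist and only replace its data
# on each frame, instead of creating a new artist per frame.
fig2, ax2 = plt.subplots()
(line,) = ax2.plot(x, np.sin(x))


def update(frame):
    """Shift the sine wave's phase for the given frame number."""
    line.set_ydata(np.sin(x + 0.1 * frame))
    return (line,)


anim = FuncAnimation(fig2, update, frames=50, interval=40, blit=True)
```

Writing the animation to a movie file goes through `anim.save(...)`, which depends on an external encoder such as ffmpeg being installed.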

### Packages Required

Matplotlib 1.1.0 (may need to bump if another release comes out), Basemap, NumPy. ffmpeg/mencoder would optionally be used to save animations as movie files.

Attendees need to have Matplotlib and Basemap installed. EPDFree is a good start, since matplotlib is included. Instructions for installing Basemap can be found at:

If needed, instructions for installing Matplotlib are located at: