SciPy 2013 Tutorials

The conference always kicks off with two days of tutorials. These sessions provide extremely affordable access to expert training, and consistently receive fantastic feedback from participants. This year we are expanding the tutorial session to include three parallel tracks: introductory, intermediate and advanced.


Data Processing with Python

This tutorial is a crash course in data processing and analysis with Python. We will explore a wide variety of domains and data types (text, time series, log files, etc.) and demonstrate how Python and a number of accompanying modules can be used for effective scientific expression. Starting with NumPy and Pandas, we will load, manage, clean and explore real-world data right off the instrument. Next, we will return to NumPy and continue with scikit-learn, focusing on a common dimensionality-reduction technique: principal component analysis (PCA).
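
As a rough illustration of the PCA step, the sketch below computes principal components with NumPy's singular value decomposition rather than with scikit-learn, so that each step is visible. The function name `pca`, the synthetic data, and the random seed are assumptions made for this example and are not taken from the tutorial material.

```python
import numpy as np


def pca(data, n_components):
    """Project the rows of `data` onto its top principal components.

    Hypothetical helper for illustration; not a scikit-learn API.
    """
    # PCA operates on mean-centered data.
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Variance captured along each retained axis.
    explained_variance = singular_values[:n_components] ** 2 / (len(data) - 1)
    return centered @ components.T, components, explained_variance


rng = np.random.default_rng(0)
# Synthetic 2-D samples with much more spread along x than along y.
samples = rng.normal(size=(200, 2)) * np.array([5.0, 0.5])
projected, components, variance = pca(samples, n_components=1)
```

Because the synthetic data varies mostly along the first coordinate, the single retained component should point almost entirely along that axis.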

In the second half of the course, we will turn to Python for big data analysis and introduce two common distributed solutions: IPython Parallel and MapReduce. We will develop several routines commonly used for simultaneous calculations and analysis. Using Disco, a Python MapReduce framework, we will introduce the concept of MapReduce and build up several scripts that can process a variety of public data sets. Attendees will also learn how to launch and manage their own clusters using AWS and StarCluster.
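
The MapReduce concept can be sketched in plain Python, without Disco or a cluster. The example below counts words in three phases (map, shuffle, reduce) on a single machine. All function names are hypothetical and do not correspond to the Disco API; a real framework would distribute the map and reduce calls across workers.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["the quick brown fox", "the lazy dog", "The end"])
```

Since each call to `map_phase` and `reduce_phase` depends only on its own arguments, the calls can in principle run on different machines, which is the property MapReduce frameworks rely on.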


Guide to Symbolic Computing with SymPy

SymPy is a pure Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python and does not require any external libraries.

In this tutorial we will introduce attendees to SymPy. We will start by showing how to install and configure this Python module. Then we will proceed to the basics of constructing and manipulating mathematical expressions in SymPy. We will also discuss the most common issues and differences from other computer algebra systems, and how to deal with them. In the last part of this tutorial we will show how to solve simple, yet illustrative, mathematical problems with SymPy.
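
A minimal sketch of the kind of expression manipulation described above, assuming SymPy is installed. The specific expressions are chosen for illustration only. The sketch also shows two pitfalls that commonly surprise newcomers: `==` tests structural rather than mathematical equality, and exact fractions need `Rational` because `1/3` in plain Python is a float.

```python
from sympy import Rational, diff, expand, simplify, solve, symbols

x = symbols("x")

expr = (x + 1) ** 2
expanded = expand(expr)          # x**2 + 2*x + 1
derivative = diff(expanded, x)   # 2*x + 2
roots = solve(x**2 - 4, x)       # the two roots, -2 and 2

# Pitfall 1: == compares expression structure, not mathematical value.
structurally_equal = expr == expanded
mathematically_equal = simplify(expr - expanded) == 0

# Pitfall 2: use Rational for exact fractions instead of Python's 1/3.
third = Rational(1, 3)
```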

This knowledge should be enough for attendees to start using SymPy for solving mathematical problems and hacking SymPy's internals (though hacking core modules may require additional expertise).

We expect attendees of this tutorial to have basic knowledge of Python and mathematics. However, any more advanced topics will be explained during the presentation.


NumPy and IPython

This tutorial is a hands-on introduction to the two most basic building blocks of the scientific Python stack: the enhanced interactive interpreter IPython and the fast numerical container NumPy. Among other things, you will learn how to structure an interactive workflow for scientific computing and how to create and manipulate numerical data efficiently. You should have some basic familiarity with Python (variables, loops, functions) and basic command-line usage (executing commands, using history).
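
To give a flavor of efficient array manipulation, the sketch below shows vectorized arithmetic, slicing (which returns a view onto the same memory), and boolean masking (which returns a copy). The variable names and values are illustrative assumptions, not tutorial material.

```python
import numpy as np

values = np.arange(10, dtype=float)   # 0.0, 1.0, ..., 9.0

# Vectorized arithmetic: no explicit Python loop is needed.
squares = values ** 2

# Basic slicing returns a view, so writing to it changes `values` too.
evens = values[::2]
evens[0] = -1.0

# Boolean masking returns a new array (a copy) of the selected elements.
mask = values > 5
large = values[mask]
```

The view/copy distinction matters for both performance and correctness: views avoid copying data, but modifying one silently modifies the original array.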


Cython: Speed up Python and NumPy, Pythonize C, C++, and Fortran

Cython is a flexible and multi-faceted tool that brings down the barrier between Python and other languages. With Cython, you can add type information to your Python code to yield dramatic performance improvements. Cython also allows you to wrap C, C++ and Fortran libraries to work with Python and NumPy. It is used extensively in research environments and in end-user applications.

This hands-on tutorial will cover Cython from the ground up, and will include the newest Cython features, including typed memoryviews.

Target audience:

Developers, researchers, scientists, and engineers who use Python and NumPy and who routinely hit bottlenecks and need improved performance.

C / C++ / Fortran users who would like their existing code to work with Python.

Expected level of knowledge:

Intermediate and / or regular user of Python and NumPy. Have used Python's decorators, exceptions, and classes. Knowledge of NumPy arrays, array views, fancy indexing, and NumPy dtypes. Have programmed in at least one of C, C++, or Fortran.

Some familiarity with the Python or NumPy C-API a plus. Familiarity with memoryviews and buffers a plus. Familiarity with OpenMP a plus. Array-based inter-language programming between Python and C, C++, or Fortran a plus.

Goals:

Overall goal: Cython familiarity for newcomers, Cython competence for those with some experience.

Understand what Cython is, what benefit it brings, when it is appropriate to use.

Know how to create and use a setup.py file that will create an extension module using Cython.

Know how to use Cython from within the IPython notebook.

Know how and why to add Cython type declarations to Python code.

Know how to create cdef and cpdef functions and cdef classes in Cython.

Know how to use Cython's typed memoryviews to work with buffer objects and C / C++ / Fortran arrays.

Know how to identify bottlenecks in Cython code and speed them up.

Know how to wrap external C / C++ / Fortran 90 code with Cython.

Know how to handle inter-language error states with Cython.

Know how to apply Cython's OpenMP-based parallelism to straightforward nested loops for further performance.


Anatomy of Matplotlib

This tutorial is an introduction to matplotlib, intended for users who want to become familiar with Python's predominant scientific plotting package. First, we will introduce the available plotting functions so that users know what kinds of graphs can be made. We will then cover the fundamental concepts and terminology, starting from the figure object and working down to the artists. The components of a matplotlib figure, such as the axes, axis, tickers, and labels, are introduced in a logical order. We will explain what an Artist is for and the purpose behind Collections. Finally, we will give an overview of the major toolkits available, particularly AxesGrid, mplot3d and basemap.
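
The hierarchy described above (figure, axes, axis, tickers, artists) can be seen in a few lines. This is a minimal sketch assuming a recent matplotlib release; the non-interactive `Agg` backend, the data, and the labels are choices made for this example only.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; no display is required

import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

# The Figure is the top-level container; the Axes is one plotting area in it.
fig, ax = plt.subplots()

# plot() returns the Line2D artists it creates and adds them to the Axes.
(line,) = ax.plot([0, 1, 2, 3], [0, 1, 4, 9], label="y = x**2")

# Labels are also artists, attached to the Axes.
ax.set_xlabel("x")
ax.set_ylabel("y")

# Each Axes owns an XAxis and a YAxis; a ticker (locator) decides tick positions.
ax.xaxis.set_major_locator(MultipleLocator(1))

ax.legend()
```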


IPython in depth

IPython provides tools for interactive and parallel computing that are widely used in scientific computing, but can benefit any Python developer.

We will show how to use IPython in different ways: as an interactive shell, an embedded shell, a graphical console, a network-aware VM in GUIs, a web-based notebook with code, graphics and rich HTML, and a high-level framework for parallel computing.


Version Control and Unit Testing for Scientific Software

Writing software can be a frustrating process, but developers have come up with ways to make it less stressful and error-prone. Version control saves the history of your project and makes it easier for multiple people to participate in development. Unit testing and testing frameworks help ensure the correctness of your code and help you find errors by quickly executing and testing your entire code base. These tools can save you time and stress and are valuable to anyone writing software of any description.

This collaborative, hands-on tutorial will cover version control with Git plus writing and running unit tests in Python (and IPython!) using the nose testing framework. Attendees should be comfortable with the basics of Python and the command line but no experience with scientific Python is necessary.
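
As a small illustration of what a unit test looks like, the sketch below tests a hypothetical `mean` function. It uses the standard library's `unittest` module rather than nose, so that it runs without extra packages; the function and test names are assumptions made for this example.

```python
import unittest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


class TestMean(unittest.TestCase):
    def test_typical_values(self):
        self.assertAlmostEqual(mean([1.0, 2.0, 3.0]), 2.0)

    def test_empty_input_raises(self):
        # Error handling is part of the behavior and deserves its own test.
        with self.assertRaises(ValueError):
            mean([])


# Run the tests programmatically; a test runner would normally discover them.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMean)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping tests like these under version control alongside the code means every commit can be checked against the same expectations.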


Diving into NumPy code

Do you want to contribute to NumPy but find the codebase daunting? Do you want to extend NumPy (e.g., by adding support for decimal or arbitrary-precision types)? Are you curious about how NumPy works internally? Then this tutorial is for you.

The goal of this tutorial is to dive into the NumPy codebase, in particular the core C implementation. You will learn how to build NumPy from source, how core concepts such as data types and ufuncs are implemented at the C level, and how they are hooked up to the Python runtime. You will also learn how to add a new ufunc and a new data type.

During the tutorial, we will also look at various (Unix-oriented) tools that can help track down bugs or follow a particular NumPy expression from its Python representation to its low-level implementation.

While a working knowledge of C and Python is required, we do not assume prior knowledge of the NumPy codebase. An understanding of Python C extensions is a plus, but is not required.


An Introduction to scikit-learn (I)

Machine learning is the branch of computer science concerned with the development of algorithms that can learn from previously seen data in order to make predictions about future data, and it has become an important part of research in many scientific fields. This set of tutorials will introduce the basics of machine learning and show how these learning tasks can be accomplished using scikit-learn, a machine learning library written in Python and built on NumPy, SciPy, and Matplotlib. By the end of the tutorials, participants will be poised to take advantage of scikit-learn’s wide variety of machine learning algorithms to explore their own data sets. The tutorial will comprise two sessions: Session I in the morning (intermediate track) and Session II in the afternoon (advanced track). Participants are free to attend either one or both, but to get the most out of the material, we encourage those attending in the afternoon to attend in the morning as well.

Session I will assume participants already have a basic knowledge of using NumPy and matplotlib for manipulating and visualizing data. It will require no prior knowledge of machine learning or scikit-learn. The goals of Session I are to introduce participants to the basic concepts of machine learning, to give a hands-on introduction to using scikit-learn for machine learning in Python, and to give participants experience with several practical examples of applying supervised learning to a variety of data. It will cover basic classification and regression problems, regularization of learning models, basic cross-validation, and some examples from text mining and image processing, all using the tools available in scikit-learn.
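
A minimal sketch of supervised classification with regularization and basic cross-validation, assuming a current scikit-learn release (module paths have changed since 2013). The iris data set, the logistic regression model, and the parameter values are choices made for this example, not necessarily those used in the session.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# A small labeled data set: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# C is the inverse regularization strength: smaller C means stronger regularization.
model = LogisticRegression(C=1.0, max_iter=1000)

# 5-fold cross-validation: fit on 4/5 of the data, score on the held-out 1/5.
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()
```

Scoring on held-out folds gives a more honest estimate of how the model will perform on new data than scoring on the training set.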


Statistical Data Analysis in Python

This tutorial will introduce the use of Python for statistical data analysis, using data stored as Pandas DataFrame objects. Much of the work involved in analyzing data resides in importing, cleaning and transforming data in preparation for analysis. Therefore, the first half of the course comprises a two-part overview of basic and intermediate Pandas usage that will show how to effectively manipulate datasets in memory. This includes tasks like indexing, alignment, join/merge methods, date/time types, and handling of missing data. Next, we will cover plotting and visualization using Pandas and Matplotlib, focusing on creating effective visual representations of your data while avoiding common pitfalls. Finally, participants will be introduced to methods for statistical data modeling using some of the advanced functions in NumPy, SciPy and Pandas. This will include fitting your data to probability distributions, estimating relationships among variables using linear and non-linear models, and a brief introduction to Bayesian methods. Each section of the tutorial will involve hands-on manipulation and analysis of sample datasets, to be provided to attendees in advance.
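
Two of the data-preparation tasks mentioned above, merging and handling missing data, can be sketched as follows. The tables, column names, and values are invented for illustration, and filling gaps with the column mean is only one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, with one missing temperature.
readings = pd.DataFrame(
    {"site": ["a", "b", "c"], "temp": [20.5, np.nan, 18.0]}
)
# Hypothetical site metadata; site "c" is absent and site "d" is unused.
sites = pd.DataFrame(
    {"site": ["a", "b", "d"], "elevation": [100, 250, 400]}
)

# A left join keeps every reading, even when no metadata matches.
merged = readings.merge(sites, on="site", how="left")

# Replace the missing temperature with the mean of the observed ones.
filled = merged["temp"].fillna(merged["temp"].mean())
```

After the merge, site "c" has a missing elevation because it has no match in `sites`, which is a typical way for missing values to enter a dataset.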

The target audience for the tutorial includes all new Python users, though we recommend that users also attend the NumPy and IPython session in the introductory track.

Tutorial GitHub repo: https://github.com/fonnesbeck/statistical-analysis-python-tutorial


Using geospatial data with Python

Geographically referenced data is important in many scientific fields, and working with spatial data has become widespread in other domains as well (e.g. Google Maps, geolocated tweets, Foursquare check-ins). Python has become an increasingly important language for working with geospatial data. In this tutorial, students will gain experience working with common geospatial formats using open source Python libraries.

Python bindings are available for (nearly) all the standard libraries for working with geospatial data (proprietary and open source). Some of these libraries (including PROJ.4 and GDAL) will be discussed and used in this tutorial, along with more "pythonic" packages for accessing them, such as Shapely. Using spatially-aware databases will be discussed, with examples and an exercise using PostGIS, an extension to PostgreSQL. Python scripting extensions to Geographic Information Systems (GIS) packages such as QGIS and ArcView will be briefly discussed.

This tutorial should be accessible to anyone who has a basic understanding of NumPy and matplotlib. Prior familiarity with SQL database queries and the Python DB-API will be helpful for the PostGIS section.


An Introduction to scikit-learn (II)

Machine learning is the branch of computer science concerned with the development of algorithms that can learn from previously seen data in order to make predictions about future data, and it has become an important part of research in many scientific fields. This set of tutorials will introduce the basics of machine learning and show how these learning tasks can be accomplished using scikit-learn, a machine learning library written in Python and built on NumPy, SciPy, and Matplotlib. By the end of the tutorials, participants will be poised to take advantage of scikit-learn’s wide variety of machine learning algorithms to explore their own data sets. The tutorial will comprise two sessions: Session I in the morning (intermediate track) and Session II in the afternoon (advanced track). Participants are free to attend either one or both, but to get the most out of the material, we encourage those attending in the afternoon to attend in the morning as well.

Session II will build upon Session I and assume familiarity with the concepts covered there. The goals of Session II are to introduce more involved algorithms and techniques that are vital for successfully applying machine learning in practice. It will cover cross-validation and hyperparameter optimization, unsupervised algorithms, and pipelines, and will go into depth on a few extremely powerful learning algorithms available in scikit-learn: support vector machines, random forests, and sparse models. We will finish with an extended exercise applying scikit-learn to a real-world problem.
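
The combination of pipelines, cross-validation, and hyperparameter optimization can be sketched as below, assuming a current scikit-learn release. The iris data set, the step names `scale` and `svm`, and the parameter grid are assumptions chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# A pipeline chains preprocessing and a model into a single estimator,
# so the scaler is re-fit on the training part of every cross-validation fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svm", SVC(kernel="rbf")),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "svm__C": [0.1, 1.0, 10.0],
    "svm__gamma": [0.01, 0.1, 1.0],
}

# Try every combination in the grid, scoring each with 5-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```

Putting the scaler inside the pipeline, rather than scaling the whole data set up front, keeps information from the held-out folds from leaking into training.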
