Constructing scientific programs using SymPy

We describe a method for constructing scientific programs where SymPy is used to model the mathematical steps in the derivation. With this workflow, each step in the process can be checked by machine, from the derivation of the equations to the generation of the source code. We present an example based on computing the partition function integrals in statistical mechanics.


Introduction
Writing correct scientific programs is a difficult, largely manual process. Steps in the process include deriving of the constituent equations, translating those into source code, and testing the result to ensure correctness. One challenging aspect of testing is untangling the cause of errors. For example, if the result appears incorrect, it is hard to determine whether the problem is with the algorithm, or a mistake was made in the derivation, or a simple transcription error in writing the source code. Confidence in the correctness of the program can be increased if these steps can be checked by computer.
The process of scientific programming also includes the pursuit of performance. Often the code needs to be heavily modified or rewritten to take best advantage of various target systems. Each modification introduces the possibility of further errors, and must be checked.
A standard approach to create a high-level description of the problem that is specialized to the particular domain, often through a Domain Specific Language (DSL), and then transform this to a representation in a general-purpose programming language. This is used in systems such as FEniCS [FEniCS] (representing PDE's) and the Tensor Contraction Engine [TCE] (representing matrix operations in quantum chemistry).
Since the specifications for scientific software are expressed largely as mathematical equations, our high-level description will be formed from a symbolic mathematics representation. A symbolic mathematics package is well-suited for this representation. In addition to this representation of the high-level description, we will model the derivation steps that lead to a computationally useful form.
The approach taken in this work is to operate on a symbolic representation of the scientific program, and then programmatically transform it into the target system. The specifications for scientific software are expressed largely as equations, and are ideally suited for a symbolic mathematics package. We use SymPy [SymPy], a symbolic mathematics package written in Python, for this part of the process.
The target system is likely a source code representation (C, Fortran, Python, etc), but could encompass more than that. For instance, C might be used to target the CPU or GPU, but the source code might look quite different in those cases. Or the user may call different libraries for the same function to compare performance.

Modeling Derivations
The first goal is to model on the computer a set of steps similar to those used when manually performing the derivation. Deriving the equations using in scientific software is similar to a proof where there is a series of logically justified steps connecting each expression until the final result is reached.
The OpenMathDocuments (OMDoc) project [OMDoc] is a representation for mathematics that describes a higher level than expressions in MathML. For instance, it has representations for proofs and lemmas. Similarly, for scientific computation we need to represent structures a higher level. One major difference between proofs and derivations in scientific software is that some steps are approximations.
The steps can be categorized as exact transformations, approximations, or specializations. An exact transform leaves the equality satisfied. Some types of exact transforms are rearranging terms, multiplying by factors, and identities (which operates only on one side of the equation). A specialization is specifying a physical or model parameter, such as the number of spatial dimensions, the number of particles, interaction potential, etc.
Finally, the results can be displayed in rendered mathematics (we use MathML or MathJax in web pages) to make the operation of each step and the results clearly visible.

Modeling Derivations
For the implementation, the basic class named derivation has a constructor that takes an initial equation (lhs and rhs). The primary method is add_step, which takes an operation (or list of operations) to perform and a textual description of the operation(s). There are series of classes for various operations, such approx_lhs, which replaces the left-hand side of the equation with a new value. Also there is add_term, which adds the same term to both sides of the equation.
It is easy to start generating code by simply printing the statements of the target language. However, for greater generality we will use a model of the target language. Currently this work has (incomplete) language models for Python and C.
At the lowest level of transforming expressions, we developed a pattern-matching syntax that concisely captures some of the SymPy idioms.
The Match object matches a SymPy expression. The __call__ method matches the first argument as the type of the expression. Subsequent arguments are variables to be bound to arguments. If the argument is a tuple, it is matched recursively on that argument. In this way the pattern for a tree structure can be built up concisely.
Variables that can match (for later binding) are members of an AutoVar class. This class creates member variables upon first access, and they are bound when the match succeeds.
Here is an example fragment of part of the SymPy to Python expression transformation, that matches addition, subtraction, and the reciprocal. SymPy normalizes subtraction as adding two expressions where the subtractand is multiplied by negative one. (That is, a − b is represented as −1 * b + a). Matching subtraction requires a nested pattern, which is shown here as well.

Simple derivation
The Euler method is the simplest method for solving a differential equation. The steps involve a finite difference approximation to the derivative, rearranging terms, and the result is The derivation is the following code: from sympy import Function, Symbol, diff, sympify from derivation_modeling import derivation, \ approx_lhs, mul_factor, add_term This can be output to MathML (or MathJax) for display in a web browser, which looks approximately like the following: Approximate derivative with finite difference Move f_0 term to left side to get the final result For one of the simplest quadrature formulas, we use the trapezoidal rule [Trapezoid]. The derivation part consists of starting from the rule for single interval, and extending it to a series of intervals. (The rules for a single interval can be derived from interpolating polynomials, but we didn't start there) The starting point for the derivation in Python is to define all the symbols, and the initial expression, then manipulate the expression so the function evaluation of each point is used only once.
from sympy import Symbol, Function, IndexedBase, Sum from derivation_modeling import derivation, identity i = Symbol('i',integer=True) n = Symbol('n',integer=True) # definitions of split_sum, adjust_limits, # peel_terms not shown # split_sum -expand the sum of terms into a term of sums # adjust_limits -adjust the expressions in the # summation variable. This allows matching # the index used in the summand among different sums. # peel_terms -move terms from the either end of the sum # to be an explicit term this allows the sum limits # to match and be combined. The LaTeX representation for the steps was copied from the generated output. Start with a sum of single interval formulas Split into two sums ('Split sum') Adjust the limits so the functions in the sum have compatible indices ('Adjust limits') Peel off some terms so the sum limits match, and combine the sums. ('Peel terms') Now we have the final expression and can move to the transformation step. The approach to multiple dimensional integrals will be iterated one-dimensional integrals.

Partition Function
We start with the partition function from statistical mechanics [Partition]. It incorporates the interactions between particles (think of particles in a box), and contains all the thermodynamic information about a system. The dimension of the integral rises with the number of particles. The complexity for the convergence of gridbased methods is exponential in the number of dimensions, and they quickly become overwhelmed. The convergence of Monte Carlo methods is independent of dimension, and are commonly used to compute these integrals. However, it would be still be useful to use a grid method for a small number of particles as a way to check the Monte Carlo algorithms. The derivation starts as follows: Where V is the inter-particle potential, T is the temperature, k is Boltzmann's constant, and Z is the symbol for the partition function. All of these are defined as SymPy Symbol.

Code Generation
As an example of the language model, the classic 'Hello World' program in python is The example derivations presented here are fairly simple and linear. In reality, the connections are more complex. For instance, one is often interested in multiple properties (energy, pressure, distribution functions) that may branch off the original derivation or have a separate thread of steps, but eventually, for efficiency they should all be evaluated in the same integral. The pattern-matching style makes the lower levels of expression translation fairly clear, but the the translations at the next level up (combining the source code statements) is not very transparent yet. An important future step is enhancing debugging by making the connections between the code generator and the generated code clearer.

Other Work
For solving partial differential equations, there is FEniCS [FEniCS] project and the SAGA (Scientific computing with Algebraic and Generative Abstractions) project [SAGA] .
Ignition [Ignition], [Terrel11]_ is a library that provides support for writing and combining DSL's for describing problems (or aspects of problems) Part of this work is modeling the target language for code generation. Several other projects for modeling programming languages include Pivot [Pivot], a project for modeling C++. CodeBoost [CodeBoost] is the code transformation portion of the SAGA system. PyCUDA [PyCUDA] is a potential target system, and it also has an associated model of C and CUDA for generation of code [CodePy]

Conclusions
We presented a snapshot of some work on some software blocks necessary for a system of scientific computing, including modeling a derivation, transforming to a source code representation, and code generation.