Optimised finite difference computation from symbolic equations

Domain-specific high-productivity environments are playing an increasingly important role in scientific computing due to the levels of abstraction and automation they provide. In this paper we introduce Devito, an open-source domain-specific framework for solving partial differential equations from symbolic problem definitions by the finite difference method. We highlight the generation and automated execution of highly optimized stencil code from only a few lines of high-level symbolic Python for a set of scientific equations, before exploring the use of Devito operators in seismic inversion problems.


Introduction
Domain-specific high-productivity environments are playing an increasingly important role in scientific computing. The level of abstraction and automation provided by such frameworks not only increases productivity and accelerates innovation, but also allows the combination of expertise from different specialised disciplines. This synergy is necessary when creating the complex software stack needed to solve leading edge scientific problems, since domain specialists as well as high performance computing experts are required to fully leverage modern computing architectures. Based on this philosophy we introduce Devito [Lange17], an opensource domain-specific framework for solving partial differential equations (PDE) from symbolic problem definitions by the finite difference method.
Symbolic computation, where optimized numerical code is automatically derived from a high-level problem definition, is a powerful technique that allows domain scientists to focus on algorithmic development rather than implementation details. For this reason Devito exposes an API based on Python (SymPy) [Meurer17] that allow users to express equations symbolically, from which it generates and executes optimized stencil code via just-in-time (JIT) compilation. Using latest advances in stencil compiler research, Devito thus provides domain scientists with the ability to quickly and efficiently generate high-performance kernels from only a few lines of Python code, making Devito composable with existing open-source software.
While Devito was originally developed for seismic imaging workflows, the automated generation and optimization of stencil codes can be utilised for a much broader set of computational problems. Matrix-free stencil operators based on explicit finite difference schemes are widely used in industry and academic research, although they merely represent one of many approaches to solving PDEs [Baba16], [Liu09], [Rai91]. In this paper we therefore limit our discussion of numerical methods and instead focus on the ease with which these operators can be created symbolically. We give a brief overview of the design concepts and key features of Devito and demonstrate its API using a set of classic examples from computational fluid dynamics (CFD). Then we will discuss the use of Devito in an example of a complex seismic inversion algorithm to illustrate its use in practical scientific applications and to showcase the performance achieved by the auto-generated and optimised code.

Background
The attraction of using domain-specific languages (DSL) to solve PDEs via a high-level mathematical notation is by no means new and has led to various purpose-built software packages and compilers dating back to 1962 [Iverson62], [Cardenas70], [Umetani85], [Cook88], [VanEngelen96]. Following the emergence of Python as a widely used programming language in scientific research, embedded DSLs for more specialised domains came to the fore, most notably the FEniCS [Logg12] and Firedrake [Rathgeber16] frameworks, which both implement the unified Form Language (UFL) [Alnaes14] for the symbolic definition of finite element problems in the weak form. The increased level of abstraction that such high-level languages provide decouples the problem definition from its implementation, thus allowing domain scientists and mathematicians to focus on more advanced methods, such as the automation of adjoint models as demonstrated by Dolfin-Adjoint [Farrell13].
The performance optimization of stencil computation on regular cartesian grids for high-performance computing applications has also received much attention in computer science research [Datta08], [Brandvik10], [Zhang12], [Henretty13], [Yount15]. The primary focus of most stencil compilers or DSLs, however, is the optimization of synthetic problems which often limits their applicability for practical scientific applications. The primary consideration here is that most realistic problems often require more than just a fast and efficient PDE solver, which entails that symbolic DSLs embedded in Python can benefit greatly from native interoperability with the scientific Python ecosystem.

Design and API
The primary objective of Devito is to enable the quick and effective creation of highly optimised finite difference operators for use in a realistic scientific application context. As such, its design is centred around composability with the existing Python software stack to provide users with the tools to dynamically generate optimised stencil computation kernels and to enable access to the full scientific software ecosystem. In addition, to accommodate the needs of "real life" scientific applications, a secondary API is provided that enables users to inject custom expressions, such as boundary conditions or sparse point interpolation routines, into the generated kernels.
The use of SymPy as the driver for the symbolic generation of stencil expressions and the subsequent code-generation are at the heart of the Devito philosophy. While SymPy is fully capable of auto-generating low-level C code for pre-compiled execution from high-level symbolic expressions, Devito is designed to combine these capabilities with automatic performance optimization based on the latest advances in stencil compiler technology. The result is a framework that is capable of automatically generating and optimising complex stencil code from high-level symbolic definitions.
The Devito API is based around two key concepts that allow users to express finite difference problems in a concise symbolic notation: • Symbolic data objects: Devito's high-level symbolic objects behave like sympy.Function objects and provide a set of shorthand notations for generating derivative expressions, while also managing user data. The rationale for this duality is that many stencil optimization algorithms rely on data layout changes, mandating that Devito needs to be in control of data allocation and access.
• Operator: An Operator creates, compiles and executes a single executable kernel from a set of SymPy expressions. The code generation and optimization process involves various stages and accepts a mixture of highlevel and low-level expressions to allow the injection of customised code.

Fluid Dynamics Examples
In the following section we demonstrate the use of the Devito API to implement two examples from classical fluid dynamics, before highlighting the role of Devito operators in a seismic inversion context. Both CFD examples are based in part on tutorials from the introductory blog "CFD Python: 12 steps to Navier-Stokes" 1 by the Lorena A. Barba group. We have chosen the examples in this section for their relative simplicity to concisely illustrate the capabilities and API features of Devito. For a more complete discussion on numerical methods for fluid flows please refer to [Peiro05].

Linear Convection
We will demonstrate a basic Devito operator definition based on a linear two-dimensional convection flow (step 5 in the original tutorials) 2 . The governing equation we are implementing here is: A discretised version of this equation, using a forward difference scheme in time and a backward difference scheme in space might be written as where the subscripts i and j denote indices in the space dimensions and the superscript n denotes the index in time, while ∆t, ∆x, ∆y denote the spacing in time and space dimensions respectively. The first thing we need is a function object with which we can build a timestepping scheme. For this purpose Devito provides so-called TimeData objects that encapsulate functions that are differentiable in space and time. With this we can derive symbolic expressions for the backward derivatives in space directly via the u.dxl and u.dyl shorthand expressions (the l indicates "left" or backward differences) and the shorthand notation u.dt provided by TimeData objects to derive the forward derivative in time.
from devito import * c = 1. u = TimeData(name='u', shape=(nx, ny)) eq = Eq(u.dt + c * u.dxl + c * u.dyl) [In] print eq [Out] Eq(-u(t, x, y)/s + u(t + s, x, y)/s + 2.0 * u(t, x, y)/h -1.0 * u(t, x, y -h)/h -1.0 * u(t, x -h, y)/h, 0) The above expression results in a sympy.Equation object that contains the fully discretised form of Eq. 1, including placeholder symbols for grid spacing in space (h, assuming ∆x = ∆y) and time (s). These spacing symbols will be resolved during the code generation process, as described in the code generation section. It is also important to note here that the explicit generation of the space derivatives u_dx and u_dy is due to the use of a backward derivative in space to align with the original example. A similar notation to the forward derivative in time (u.dt) will soon be provided.
In order to create a functional Operator object, the expression eq needs to be rearranged so that we may solve for the unknown u n+1 i, j . This is easily achieved by using SymPy's solve utility and the Devito shorthand u.forward which denotes the furthest forward stencil point in a time derivative (u n+1 i, j ).
from sympy import solve The above variable stencil now represents the RHS of Eq. 2, allowing us to construct a SymPy expression that updates u n+1 i, j and build a devito.Operator from it. When creating this operator we also supply concrete values for the spacing terms h and s via an additional substitution map argument subs.

D R A F T
Using this operator we can now create a similar example to the one presented in the original tutorial by initialising the data associated with the symbolic function u, u.data with an initial flow field. However, to avoid numerical errors due to the discontinuities at the boundary of the original "hat function", we use the following smooth initial condition provided by [Krakos12], as depicted in Figure 1.
The final result after executing the operator for 5s (100 timesteps) is depicted in Figure 2. The result shows the expected displacement of the initial shape, in accordance with the prescribed velocity (c = 1.0), closely mirroring the displacement of the "hat function" in the original tutorial. It should also be noted that, while the results show good agreement with expectations by visual inspection, they do not represent an accurate solution to the linear convection equation. In particular, the low order spatial discretisation introduces numerical diffusion that causes a decrease in the peak velocity. This is a well-known issue that could be addressed with more sophisticated solver schemes as discussed in [LeVeque92].

Laplace equation
The above example shows how Devito can be used to create finite difference stencil operators from only a few lines of high-level symbolic code. However, the previous example only required a single variable to be updated, while more complex operators might need to execute multiple expressions simultaneously, for example to solve coupled PDEs or apply boundary conditions as part of the time loop. For this reason devito.Operator objects can be constructed from multiple update expressions and allow mutiple expression formats as input.
Nevertheless, boundary conditions are currently not provided as part of the symbolic high-level API. For exactly this reason, Devito provides a low-level, or "indexed" API, where custom SymPy expressions can be created with explicitly resolved grid accesses to manually inject custom code into the auto-generation toolchain. This entails that future extensions to capture different types of boundary conditions can easily be added at a later stage.
To illustrate the use of the low-level API, we will use the Laplace example from the original CFD tutorials (step 9), which implements the steady-state heat equation with Dirichlet and Neuman boundary conditions 3 . The governing equation for this problem is The rearranged discretised form, assuming a central difference scheme for second derivatives, is Using a similar approach to the previous example, we can construct the SymPy expression to update the state of a field p. For demonstration purposes we will use two separate function objects of type DenseData in this example, since the Laplace equation does not contain a time-dependence. The shorthand expressions pn.dx2 and pn.dy2 hereby denote the second derivatives in x and y. Just as the original tutorial, our initial condition in this example is p = 0 and the flow will be driven by the boundary conditions To implement these BCs we can utilise the .indexed property that Devito symbols provide to get a symbol of type sympy.IndexedBase, which in turn allows us to use matrix indexing notation (square brackets) to create symbols of type sympy.Indexed instead of sympy.Function. This notation D R A F T allows users to hand-code stencil expressions using explicit relative grid indices, for example p[x, y] -p[x-1, y] / h for the discretized backward derivative ∂ u ∂ x . The symbols x and y hereby represent the respective problem dimensions and cause the expression to be executed over the entire data dimension, similar to Python's : operator.
The Dirichlet BCs in the Laplace example can thus be implemented by creating a sympy.Eq object that assigns either fixed values or a prescribed function, such as the utility symbol bc_right in our example, along the left and right boundary of the domain. To implement the Neumann BCs we again follow the original tutorial by assigning the second grid row from the top and bottom boundaries the value of the outermost row. The resulting SymPy expressions can then be used alongside the state update expression to create our Operator object. After building the operator, we can now use it in a timeindependent convergence loop that minimizes the L 1 norm of p. However, in this example we need to make sure to explicitly exchange the role of the buffers p and pn. This can be achieved by supplying symbolic data objects via keyword arguments when invoking the operator, where the name of the argument is matched against the name of the original symbol used to create the operator. The convergence criterion for this example is defined as the relative error between two iterations and set to p 1 < 10 −4 . The corresponding initial condition and the resulting steady-state solution, depicted in Figures 3 and 4 respectively, agree with the original tutorial implementation. It should again be noted that the chosen numerical scheme might not be optimal to solve steady-state problems of this type, since implicit methods are often preferred.

Seismic Inversion Example
The primary motivating application behind the design of Devito is the solution of seismic exploration problems that require highly optimised wave propagation operators for forward modelling and  adjoint-based inversion. Obviously, the speed and accuracy of the generated kernels are of vital importance. Moreover, the ability to efficiently define rigorous forward modelling and adjoint operators from high-level symbolic definitions also implies that domain scientists are able to quickly adjust the numerical method and discretisation to the individual problem and hardware architecture [Louboutin17a].
In the following example we will show the generation of forward and adjoint operators for the acoustic wave equation and verify their correctness using the so-called adjoint test [Virieux09] 4 . This test, also known as dot product test, verifies that the implementation of an adjoint operator indeed computes the conjugate transpose of the forward operator.
The governing wave equation for the forward operator is defined as where u denotes the pressure wave field, m is the square slowness, q is the source term and η denotes the spatially varying dampening factor used to implement an absorbing boundary condition. On top of fast stencil operators, seismic inversion kernels also rely on sparse point interpolation to inject the modelled wave as a point source (q) and to record the pres-D R A F T sure at individual point locations. To accommodate this, Devito provides another symbolic data type PointData, which allows the generation of sparse-point interpolation expressions using the "indexed" low-level API. These symbolic objects provide utility routines pt.interpolate(expression) and pt.inject(field, expression) to create symbolic expressions that perform linear interpolation between the sparse points and the cartesian grid for insertion into Operator kernels. A separate set of explicit coordinate values is associated with the sparse point objects for this purpose in addition to the function values stored in the data property.

Adjoint Test
The first step for implementing the adjoint test is to build a forward operator that models the wave propagating through an isotropic medium, where the square slowness of the wave is denoted as m. Since m, as well as the boundary dampening function eta, is re-used between forward and adjoint runs the only symbolic data object we need to create here is the wavefield u in order to implement and rearrange our discretised equation eqn to form the update expression for u. It is worth noting that the u.laplace shorthand notation used here expands to the set of second derivatives in all spatial dimensions, thus allowing us to use the same formulation for two-dimensional and three-dimensional problems.
In addition to the state update of u, we are also inserting two additional terms into the forward modelling operator: • src_term injects a pressure source at a point location according to a prescribed time series stored in src.data that is accessible in symbolic form via the symbol src. The scaling factor in src_term is coded by hand but can be automatically inferred.
• rec_term adds the expression to interpolate the wavefield u for a set of "receiver" hydrophones that measure the propagated wave at varying distances from the source for every time step. The resulting interpolated point data will be stored in rec.data and is accessible to the user as a NumPy array. After building a forward operator, we can now implement the adjoint operator in a similar fashion. Using the provided symbols m and eta, we can again define the adjoint wavefield v and implement its update expression from the discretised equation. However, since the adjoint operator needs to operate backwards in time there are two notable differences: • The update expression now updates the backward stencil point in the time derivative v n−1 i, j , denoted as v.backward. In addition to that, the Operator is forced to reverse its internal time loop by providing the argument time_axis=Backward • Since the acoustic wave equation is self-adjoint without dampening, the only change required in the governing equation is to invert the sign of the dampening term eta * u.dt. The first derivative is an antisymmetric operator and its adjoint minus itself.
Moreover, the role of the sparse point objects has now switched: Instead of injecting the source term, we are now injecting the previously recorded receiver values into the adjoint wavefield, while we are interpolating the resulting wave at the original source location. As the injection and interpolations are part of the kernel, we also insure that these two are adjoints of each other. Having established how to build the required operators we can now define the workflow for our adjoint example. For illustration purposes we are using a utility object Model that provides the core information for seismic inversion runs, such as the values for m and the dampening term eta, as well as the coordinates of the point source and receiver hydrophones. It is worth noting that the spatial discretisation and thus the stencil size of the operators is still fully parameterisable. The adjoint test is the core definition of the adjoint of a linear operator. The mathematical correctness of the adjoint is required for mathematical adjoint-based optimizations methods that are only guarantied to converged with the correct adjoint. The test can be written as: < src, ad joint(rec) >=< f orward(src), rec > The adjoint test can be used to verify the accuracy of the forward propagation and adjoint operators and has been shown to agree for 2D and 3D implementations [Louboutin17b]. The shot record of the data measured at the receiver locations after the forward run is shown in Figure 5.

Automated code generation
The role of the Operator in the previous examples is to generate semantically equivalent C code to the provided SymPy expressions, complete with loop constructs and annotations for performance optimization, such as OpenMP pragmas. Unlike many other DSL-based frameworks, Devito employs actual compiler technology during the code generation and optimization process. The symbolic specification is progressively lowered to C code through a series of passes manipulating abstract syntax trees (AST), rather than working with rigid templates. This software engineering choice has an invaluable impact on maintainability, extensibility and composability.
Following the initial resolution of explicit grid indices into the low-level format, Devito is able to apply several types of automated performance optimization throughout the code generation pipeline, which are grouped into two distinct sub-modules: • DSE -Devito Symbolic Engine: The first set of optimization passes consists of manipulating SymPy equations with the aim to decrease the number of floating-point operations performed when evaluating a single grid point. This initial optimization is performed following an initial analysis of the provided expressions and consists of sub-passes such as common sub-expressions elimination, detection and promotion of time-invariants, and factorization of common finite-difference weights. These transformations not only optimize the operation count, but they also improve the symbolic processing and low-level compilation times of later processing stages.
• DLE -Devito Loop Engine: After the initial symbolic processing Devito schedules the optimised expressions in a set of loops by creating an Abstract Syntax Tree (AST). The loop engine (DLE) is now able to perform typical loop-level optimizations in mutiple passes by manipulating this AST, including data alignment through array annotations and padding, SIMD vectorization through OpenMP pragmas and thread parallelism through OpenMP pragmas. On top of that, loop blocking is used to fully exploit the memory bandwidth of a target architecture by increasing data locality and thus cache utilization. Since the effectiveness of the blocking technique is highly architecturedependent, Devito can determine optimal block size through runtime auto-tuning.

Performance Benchmark
The effectiveness of the automated performance optimization performed by the Devito backend engines can be demonstrated using the forward operator constructed in the above example. The following performance benchmarks were run with for a threedimensional grid of size 512 × 512 × 512 with varying spatial discretisations resulting in different stencil sizes with increasing operational intensity (OI). The benchmark runs were performed on on a Intel(R) Xeon E5-2620 v4 2.1Ghz "Broadwell" CPU with a single memory socket and 8 cores per socket and the slope of the roofline models was derived using the Stream Triad benchmark [McCalpin95].
The first set of benchmark results, shown in Figure 6, highlights the performance gains achieved through loop-level optimizations. For these runs the symbolic optimizations were kept at a "basic" setting, where only common sub-expressions elimination is performed on the kernel expressions. Of particular interest are the performance gains achieved by increasing the loop engine mode from "basic" to "advanced", to insert loop blocking and explicit vectorization directives into the generated code. Due to the improved memory bandwidth utilization the performance increased to between 52% and 74% of the achievable peak. It is also worth noting that more aggressive optimization in the "speculative" DLE mode (directives for non-temporal stores and row-wise data alignment through additional padding) did not yield any consistent improvements due to the low OI inherent to the acoustic formulation of the wave equation and the subsequent memory bandwidth limitations of the kernel. On top of loop-level performance optimizations, Figure 7 shows the achieved performance with additional symbolic optimizations and flop reductions enabled. While the peak performance shows only small effects from this set of optimizations due to the inherent memory bandwidth limitations of the kernel, it is interesting to note a reduction in operational intensity between     equivalent stencil sizes in Figures 6 and 7. This entails that, despite only marginal runtime changes, the generated code is performing less flops per stencil point, which is of vital importance for compute-dominated kernels with large OI [Louboutin17a].

Integration with YASK
As mentioned previously, Devito is based upon actual compiler technology with a highly modular structure. Each backend transformation pass is based on manipulating an input AST and returning a new, different AST. One of the reasons behind this software engineering strategy, which is clearly more challenging than a template-based solution, is to ease the integration of external tools, such as the YASK stencil optimizer [Yount16]. We are currently in the process of integrating YASK to complement the DLE, so that YASK may replace some (but not all) DLE passes.
The DLE passes are organized in a hierarchy of classes where each class represents a specific code transformation pipeline based on AST manipulations. Integrating YASK becomes then a conceptually simple task, which boils down to three actions: 1. Adding a new transformation pipeline to the DLE. 2. Adding a new array type, to ease storage layout transformations and data views (YASK employs a data layout different than the conventional row-major format).

Creating the proper Python bindings in YASK so that
Devito can drive the code generation process. It has been shown that real-world stencil codes optimised through YASK may achieve an exceptionally high fraction of the attainable machine peak [Yount15], [Yount16]. Further, initial prototyping (manual optimization of Devito-generated code through YASK) revealed that YASK may also outperform the loop optimization engine currently available in Devito, besides ensuring seamless performance portability across a range of computer architectures. On the other hand, YASK is a C++ based framework that, unlike Devito, does not rely on symbolic mathematics and processing; in other words, it operates at a much lower level of abstraction. These observations, as well as the outcome of the initial prototyping phase, motivate the on-going Devito-YASK integration effort.

Discussion
In this paper we present the finite difference DSL Devito and demonstrate its high-level API to generate two fluid dynamics operators and a full seismic inversion example. We highlight the relative ease with which to create complex operators from only a few lines of high-level Python code while utilising highly optimised auto-generated C kernels via JIT compilation. On top of purely symbolic top-level API based on SymPy, we show how to utilise Devito's secondary API to inject custom expressions into the code generation toolchain to implement Dirichlet and Neumann boundary conditions, as well as the sparse-point interpolation routines required by seismic inversion operators.
Moreover, we demonstrate that Devito-generated kernels are capable of exploiting modern high performance computing architectures by achieving a significant percentage of machine peak. Devito's code-generation engines achieve this by automating wellknown performance optimizations, as well as domain-specific optimizations, such as flop reduction techniques -all while maintaining full compatibility with the scientific software stack available through the open-source Python ecosystem.

Limitations and Future Work
The examples used in this paper have been chosen for their relative simplicity in order to concisely demonstrate the current features of the Devito API. Different numerical methods may be used to solve the presented examples with greater accuracy or achieve more realistic results. Nevertheless, finite difference methods play an important role and are widely used in academic and industrial research due to the relative ease of implementation, verification/validation and high computational efficiency, which is of particular importance for inversion methods that require fast and robust high-order PDE solvers.
The interfaces provided by Devito are intended to create highperformance operators with relative ease and thus increase user productivity. Several future extensions are planned to enhance the high-level API to further ease the construction of more complex operators, including explicit abstractions for symbolic boundary conditions, perfectly matched layer (PML) methods and staggered D R A F T grids. Devito's secondary low-level API and use of several intermediate representations are intended to ease the gradual addition of new high-level features.
Moreover, the addition of YASK as an alternative backend will not only provide more advanced performance optimisation, but also an MPI infrastructure to allow Devito to utilise distribute computing environments. Further plans also exist for integration with linear and non-linear solver libraries, such as PETSc, to enable Devito to handle implicit formulations.