Managing Ensembles of Multi-Processor Jobs with Tex-MECS and PyLauncher

Authors: Tobis, Michael, Planet 3.0; Eijkhout, Victor, Texas Advanced Computing Center

Track: Posters

A number of important statistical methodologies in areas such as optimization and uncertainty quantification involve repeating a computation many times while varying a set of scalar parameters. These approaches can be applied across a wide array of disciplines.

The resulting work is a sort of meta-computation: an algorithm whose steps are themselves complex scientific calculations, with each computation dependent on the results of previous ones. This forms a class of workflows that is iterative (a sequence of steps is performed repeatedly until a criterion is met), contingent (most computations are scheduled algorithmically when other computations complete), and resource-intensive (some of the steps require allocation of significant resources).
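The iterative, contingent pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_model`, `propose`, and `adaptive_ensemble` are hypothetical names, not part of Tex-MECS, and a cheap quadratic stands in for what would in practice be an expensive multi-processor model run.

```python
import random


def run_model(params):
    """Hypothetical stand-in for an expensive simulation; returns a misfit."""
    x, y = params
    return (x - 1.0) ** 2 + (y + 2.0) ** 2


def propose(best_params, spread, n_members, rng):
    """Build the next ensemble by perturbing the best parameters so far."""
    return [
        tuple(p + rng.gauss(0.0, spread) for p in best_params)
        for _ in range(n_members)
    ]


def adaptive_ensemble(start, tol=1e-3, max_rounds=200, n_members=8, seed=0):
    """Repeat ensemble rounds until the misfit drops below tol."""
    rng = random.Random(seed)
    best_params, best_score = start, run_model(start)
    spread = 1.0
    rounds = 0
    # Iterative: loop until a stopping criterion is met.
    while best_score > tol and rounds < max_rounds:
        # Contingent: the new members depend on earlier results.
        members = propose(best_params, spread, n_members, rng)
        # Resource-intensive: each call would be a full model run.
        scores = [run_model(m) for m in members]
        score, params = min(zip(scores, members))
        if score < best_score:
            best_params, best_score = params, score
        else:
            spread *= 0.5  # no improvement: narrow the search
        rounds += 1
    return best_params, best_score, rounds
```

Calling `adaptive_ensemble((0.0, 0.0))` returns the best parameters found, their misfit, and the number of rounds used. A real search algorithm would replace `propose`, and a job submission would replace `run_model`.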

Computations of this sort have been approached in several ways, each with its own drawbacks. Simple desktop Monte Carlo software can be extended, but the result is not robust. Custom scripts can be used, but this is labor-intensive. Scientific workflow software can be adapted, but it may lack the flexibility needed for some algorithms. Large optimization libraries can be integrated, but they have a steep learning curve.

We describe here a portable, lightweight, pure-Python approach to managing adaptive ensembles which involve multi-processor computations.

Tex-MECS, the Model Ensemble Control System, was designed to maximize flexibility and fault tolerance in large ensembles of climate model runs for uncertainty quantification. Domain specialists can adapt any pre-existing parameter search algorithm with ease, using the shell or scripting language of their choice. Algorithm developers can add new algorithms using basic Python skills.

PyLauncher provides the user of today's large shared supercomputing clusters with a virtual pool of CPU nodes that can manage jobs of varying durations and processing demands while minimizing the load on the queue.
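The idea of a virtual pool can be illustrated with a small simulation. The sketch below is not PyLauncher's API: `simulate_pool` is a hypothetical function, and the greedy first-fit policy is an assumption made for illustration. It packs jobs of differing core counts and durations into one fixed allocation, so the batch queue would see a single request.

```python
import heapq


def simulate_pool(total_cores, jobs):
    """Simulate running jobs inside one fixed allocation of cores.

    jobs is a list of (name, cores, duration) tuples.
    Returns {name: (start_time, end_time)}.
    Policy (an assumption): whenever cores free up, start every waiting
    job that fits, in submission order.
    """
    for name, cores, _ in jobs:
        if cores > total_cores:
            raise ValueError(f"{name} needs more cores than the pool has")
    waiting = list(jobs)
    running = []  # min-heap of (end_time, name, cores)
    free = total_cores
    now = 0.0
    schedule = {}
    while waiting or running:
        still_waiting = []
        for name, cores, duration in waiting:
            if cores <= free:
                free -= cores
                heapq.heappush(running, (now + duration, name, cores))
                schedule[name] = (now, now + duration)
            else:
                still_waiting.append((name, cores, duration))
        waiting = still_waiting
        # Advance the clock to the next job completion and free its cores.
        end, name, cores = heapq.heappop(running)
        now = end
        free += cores
    return schedule


jobs = [("a", 4, 10.0), ("b", 4, 5.0), ("c", 8, 3.0), ("d", 2, 1.0)]
schedule = simulate_pool(8, jobs)
```

With an 8-core pool, jobs `a` and `b` start immediately, the small job `d` slips in when `b` finishes, and the 8-core job `c` waits until the whole pool is free.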

In combination, these packages provide a portable, accessible platform for implementing reproducible computations from an important class of scientific, engineering or technical problems.