DMTCP: Bringing Checkpoint-Restart to Python
Authors: Arya, Kapil, Northeastern University; Cooperman, Gene, Northeastern University
DMTCP is a mature user-space checkpoint-restart package. One can think of checkpoint-restart as a generalization of pickling. Instead of saving an object to a file, one saves the entire Python session to a file. Checkpointing Python visualization software is as easy as checkpointing a VNC session with Python running inside.
A DMTCP plugin can be built in the form of a Python module. This Python module provides functions by which a Python session can checkpoint itself to disk. The same ideas extend to IPython.
Two classical uses of this feature are a saveWorkspace function (including visualization and the distributed processes of IPython). In addition, at least three novel uses of DMTCP for helping debug Python are demonstrated.
FReD --- a Fast Reversible Debugger that works closely with the Python pdb debugger, as well as other Python debuggers.
Reverse Expression Watchpoint --- A bug occurred in the past. It is associated with the point in time when a certain expression changed. Bring the user back to a pdb session at the step before the bug occurred.
Fast/Slow Computation --- Cython provides both traditional interpreted functions and compiled C functions. Interpreted functions are slow, but correct. Compiled functions are fast, but users sometimes define them incorrectly, whereupon the compiled function silently returns a wrong answer. The idea of fast/slow computation is to run the compiled version on one core, with checkpoints at frequent intervals, and to copy a checkpoint to another core. The second core re-runs the computation over that interval, but in interpreted mode.
DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop. Ansel, Arya, Cooperman. IPDPS-2009 http://dmtcp.sourceforge.net/ FReD: Automated Debugging via Binary Search through a Process Lifetime http://arxiv.org/abs/1212.5204 Distributed Speculative Parallelization using Checkpoint Restart, Ghoshal et al. Procedia Computer Science, 2011.