Dask : Parallel Computation with Blocked algorithms and Task Scheduling

Dask enables parallel and out-of-core computation. We couple blocked algorithms with dynamic and memory aware task scheduling to achieve a parallel and out-of-core NumPy clone. We show how this extends the effective scale of modern hardware to larger datasets and discuss how these ideas can be more broadly applied to other parallel collections.


Introduction
The Scientific Python stack [Oli07] rarely leverages parallel computation.Code built off of NumPy [vdW11] or Pandas [McK10] generally runs in a single thread on data that fits comfortably in memory.Advances in hardware in the last decade in multi-core processors and solid state drives provide significant and yet largely untapped performance advantages.
However, the Scientific Python stack consists of hundreds of software packages, papers, PhD theses, and developer-years.This stack is a significant intellectual and financial investment that, for the most part, does not align well with modern hardware.We seek software solutions to parallelize this software stack without triggering a full rewrite.
This paper introduces dask, a specification to encode parallel algorithms, using primitive Python dictionaries, tuples, and callables.We use dask to create dask.array a parallel Ndimensional array library that copies the NumPy interface, uses all of the cores in a modern processor, and manages data well from disk.Dask.array serves both as a general library for parallel out-of-core ndarrays and also as a demonstration that we can parallelize complex codebases like NumPy in a straightforward manner using blocked algorithms and task scheduling.
We first define dask graphs and give a trivial example of their use.We then share the design of dask.array a parallel ndarray.Then we discuss dynamic task scheduling and policies to minimize memory footprint.We then give two examples using dask.array on computational problems.We then briefly discuss dask.bag and dask.dataframe,two other collections in the dask library.We finish with thoughts about extension of this approach into the broader Scientific Python ecosystem.

Dask Graphs
Normally humans write programs and then compilers/interpreters interpret them (e.g.python, javac, clang).Sometimes humans disagree with how these compilers/interpreters choose to interpret and execute their programs.In these cases humans often bring the analysis, optimization, and execution of code into the code itself.
Commonly a desire for parallel execution causes this shift of responsibility from compiler to human developer.In these cases we often represent the structure of our program explicitly as data within the program itself.
Dask is a specification that encodes task schedules with minimal incidental complexity using terms common to all Python projects, namely dicts, tuples, and callables.Ideally this minimum solution is easy to adopt and understand by a broad community.
We define a dask graph as a Python dictionary mapping keys to tasks or values.A key is any Python hashable, a value is any Python object that is not a task, and a task is a Python tuple with a callable first element.We encode this as a dictionary below: While less pleasant than our original code this representation can be analyzed and executed by other Python code, not just the CPython interpreter.We don't recommend that users write code in this way, but rather that it is an appropriate target for automated systems.Also, in non-toy examples the execution times are likely much larger than for inc and add, warranting the extra complexity.

Specification
We represent a computation as a directed acyclic graph of tasks with data dependencies.Dask is a specification to encode such a graph using ordinary Python data structures, namely dicts, tuples, functions, and arbitrary Python values.

'x' ('x', 2, 3)
A task is a tuple with a callable first element.Tasks represent atomic units of work meant to be run by a single worker.
(add, 'x', 'y') We represent a task as a tuple such that the first element is a callable function (like add), and the succeeding elements are arguments for that function.
An argument may be one of the following: 1) Any key present in the dask like 'x' 2) Any other value like 1, to be interpreted literally 3) Other tasks like (inc, 'x') 4) List of arguments, like [1, 'x', (inc, 'x')] So all of the following are valid tasks The dask spec provides no explicit support for keyword arguments.
In practice we combine these into the callable function with functools.partialor toolz.curry.

Dask Arrays
The dask.array submodule uses dask graphs to create a NumPy-like library that uses all of your cores and operates on datasets that do not fit in memory.It does this by building up a dask graph of blocked array algorithms.The dask.array submodule is not the first library to implement a "Big NumPy Clone".Other partial implementations exist including Biggus an out-of-core ndarray specialized for climate science, Spartan [Pow14] a distributed memory ndarray, and Distarray a distributed memory ndarray that interacts well with other distributed array libraries like Trillinos.There have also been numerous projects in traditional high performance computing space including Elemental [Pou13], High Performance Fortran, etc.. Finally Theano [Ber10], an array compiler in Python with powerful optimizations and GPU support, statically schedules and reasons about array computations and has proven particularly valuable in machine learning applications.
Each of these implementations focuses on a particular application or problem domain.Dask.array distinguishes itself in that it focuses on a very general class of NumPy operations and streaming execution through dynamic task scheduling.

Blocked Array Algorithms
Blocked algorithms compute a large result like "take the sum of these trillion numbers" with many small computations like "break up the trillion numbers into one million chunks of size one million, sum each chunk, then sum all of the intermediate sums."Through tricks like this we can evaluate one large problem by solving very many small problems.
Blocked algorithms have proven useful in modern numerical linear algebra libraries like Flame [Gei08] and Plasma [Agu09] and more recently in data parallel systems like Dryad [Isa07] and Spark [Zah10].These compute macroscopic operations with a collection of related in-memory operations.
Dask.array takes a similar approach to linear algebra libraries but focuses instead on the more pedestrian ndarray operations, like arithmetic, reductions, and slicing common in interactive use.

Array metadata
In the example above x and z are both dask.array.Array objects.These objects contain the following data 1) A dask graph, .dask2) Information about shape and chunk shape, called .chunks 3) A name identifying which keys in the graph correspond to the result, .name4) A dtype The second item here, chunks, deserves further explanation.A normal NumPy array knows its shape, a dask array must know its shape and the shape of all of the internal NumPy blocks that make up the larger array.These shapes can be concisely described by a tuple of tuples of integers, where each internal tuple corresponds to the lengths along a single dimension.
In the example above we have a 20 by 24 array cut into uniform blocks of size 5 by 8.The chunks attribute describing this array is the following: chunks = ((5, 5, 5, 5), (8, 8, 8)) Where the four fives correspond to the heights of the blocks along the first dimension and the three eights correspond to the widths of the blocks along the second dimension.This particular example has uniform sizes along each dimension but this need not be the case.Consider the chunks of the following example operations Every dask.arrayoperation, like add, slicing, or transpose must take the graph and all metadata, add new tasks into the graph and determine new values for each piece of metadata.

Capabilities and Limitations
Adding subgraphs and managing metadata for most of NumPy is difficult but straightforward.At present dask.array is around 5000 lines of code (including about half comments and docstrings).It encompasses most commonly used operations including the following: • Arithmetic and scalar mathematics, +, * , exp, log, ...

•
Reductions along axes, sum(), mean(), std(), sum(axis=0), ... The shape of this array depends on the number of positive elements in x.This shape is not known given only metadata; it requires knowledge of the values underlying x, which are not available at graph creation time.Note however that this case is fairly rare; for example it is possible to determine the shape of the output in all other cases of slicing and indexing, e.g.

Dynamic Task Scheduling
We now discuss how dask executes task graphs.How we execute these graphs strongly impacts performance.Fortunately we can tackle this problem with a variety of approaches without touching the graph creation problem discussed above.Graph creation and graph execution are separable problems.The dask library contains schedulers for single-threaded, multi-threaded, multi-process, and distributed execution.
Current dask schedulers all operate dynamically, meaning that execution order is determined during execution rather than ahead of time through static analysis.This is good when runtimes are not known ahead of time or when the execution environment contains uncertainty.However dynamic scheduling does preclude certain clever optimizations.
Dynamic task scheduling has a rich literature and numerous projects, both within the Python ecosystem with projects like Spotify's Luigi for bulk data processing and projects without the ecosystem like DAGuE [Bos12] for more high performance task scheduling.Additionally, data parallel systems like Dryad or Spark contain their own custom dynamic task schedulers.
None of these solutions, nor much of the literature in dynamic task scheduling, suited the needs of blocked algorithms for shared memory computation.We needed a lightweight, easily installable Python solution that had latencies in the millisecond range and was mindful of memory use.Traditional task scheduling literature usually focuses on policies to expose parallelism or chip away at the critical path.We find that for bulk data analytics these are not very relevant as parallelism is abundant and critical paths are comparatively short relative to the depth of the graph.
The logic behind dask's schedulers reduces to the following situation: A worker reports that it has completed a task and that it is ready for another.We update runtime state to record the finished task, mark which new tasks can be run, which data can be released, etc..We then choose a task to give to this worker from among the set of ready-to-run tasks.This small choice governs the macroscale performance of the scheduler.
Instead of these metrics found in the literature we find that for out-of-core computation we need to choose tasks that allow us to release intermediate results and keep a small memory footprint.This lets us avoid spilling intermediate values to disk which hampers performance significantly.After several other policies we find that the policy of last in, first out is surprisingly effective.That is we select tasks whose data dependencies were most recently made available.This causes a behavior where long chains of related tasks trigger each other, forcing the scheduler to finish related tasks before starting new ones.We implement this with a simple stack, which can operate in constant time.
We endeavor to keep scheduling overhead low at around 1ms per task.Updating executing state and deciding which task to run must be made very quickly.To do this we maintain a great deal of state about the currently executing computation.The set of ready-to-run tasks is commonly quite large, in the tens or hundreds of thousands in common workloads and so in practice we must maintain enough state so that we can choose the right task in constant time (or at least far sub-linear time).
Finally, power users can disregard the dask schedulers and create their own.Dask graphs are completely separate from the choice of scheduler and users may select the right scheduler for their class of problem or, if no ideal scheduler exists, build one anew.The default single-machine scheduler is about three hundred significant lines of code and has been adapted to single-threaded, multi-threaded, multi-processing, and distributed computing variants.

Example: Matrix Multiply
We benchmark dask's blocked matrix multiply on an out-of-core dataset.This demonstrates the following: 1) How to interact with on-disk data 2) The blocked algorithms in dask.arrayachieve similar performance to modern BLAS implementations on computebound tasks We set up a trivial input dataset  The Dask convenience method, da.from_array, creates a graph that can pull data from any object that implements NumPy slicing syntax.The da.store function can then store a large result in any object that implements NumPy setitem syntax.Results: We do this same operation in different settings.
We use either use NumPy or dask.array: 1) Use NumPy on a big-memory machine 2) Use dask.array in a small amount of memory, pulling data from disk, using four threads We compare different BLAS implementations: 1) ATLAS BLAS, single threaded, unblocked 2) OpenBLAS, single threaded 3) OpenBLAS, multi-threaded For each configuration we compute the number of floating point operations per second.
We note the following 1) Compute-bound tasks are computationally bound by memory; we don't experience a slowdown 2) Dask.array can effectively parallelize and block ATLAS BLAS for matrix multiplies 3) Dask.array doesn't significantly improve when using an optimized BLAS, presumably this is because we've already reaped most of the benefits of blocking and multicore 4) One should not mix multiple forms of multi-threading.
Four dask.arraythreads each spawning multi-threaded OpenBLAS DGEMM calls results in worse performance.

Example: Meteorology
Performance is secondary to capability.In this example we use dask.array to manipulate climate datasets that are larger than memory.This example shows the following: 1) Use concatenate and stack to manage large piles of HDF5 files (a common case) 2) Use reductions and slicing to manipulate stacks of arrays 3) Interact with other libraries in the ecosystem using the __array__ protocol.
We start with a typical setup, a large pile of NetCDF files.: We can collect many of these files together using da.concatenate, resulting in a single large array.We can now play with this array as though it were a NumPy array.Because dask.arraysimplement the __array__ protocol we can dump them directly into functions of other libraries.These libraries will trigger computation when they call np.array(...) on their input.This computation took about a minute on an old notebook computer.It was bound by disk access.Meteorological cases tend to be I/O bound rather than compute bound, taking more advantage of dask's memory-aware schedulers rather than parallel computation.In other cases, such as parallel image processing, this trend is reversed.

Other Collections
The dask library contains parallel collections other than dask.array.We briefly describe dask.bag and dask.dataframeThe dask.bag API contains functions like map and filter and generally follows the PyToolz API.We find that it is particularly useful on the front lines of data analysis, particularly in parsing and cleaning up initial data dumps like JSON or log files because it combines the streaming properties and solid performance of projects like cytoolz with the parallelism of multiple processes.Currently dask.dataframeuses the threaded scheduler but does not achieve the same parallel performance as dask.arraydue to the GIL.We are enthusiastic about ongoing work in Pandas itself to release the GIL.
The dask dataframe can compute efficiently on partitioned datasets where the different blocks are well separated along an index.For example in time series data we may know that all of January is in one block while all of February is in another.Join, groupby, and range queries along this index are significantly faster when working on partitioned datasets.
Dask.dataframe benefits users by providing trivial access to larger-than-memory datasets and, where Pandas does release the GIL, parallel computation.

Dask for General Computing
The higher level collections dask.array/bag/dataframedemonstrate the flexibility of the dask graph specification to encode sophisticated parallel algorithms and the capability of the dask schedulers to execute those graphs intelligently on a multicore machine.Opportunities for parallel execution extend beyond beyond ndarrays and dataframes.
In the beginning of this document we gave the following toy example to help define dask graphs.d = {'x': 1, 'y': (inc, 'x'), 'z': (add, 'y', 10)} While this example of dask graphs is trivial it represents a broader class of free-form computations that don't fit neatly into a single high-level abstraction like arrays or dataframes but are instead just a bunch of related Python functions with data dependencies.In this context Dask offers a lightweight spec and range of schedulers as well as excellent error reporting and diagnostic facilities.In private projects we have seen great utility and performance from using the dask threaded scheduler to refactor and execute existing processing pipelines on large multi-core computers.

Low Barrier to Entry
The simplicity of dask graphs (no classes or frameworks) presents a very low barrier to entry.Users only need to understand basic concepts common to Python (or indeed most modern languages) like dictionaries, tuples, and functions as variables.As an example consider the work in [Tep15] in which the authors implement out-of-core parallel non-negative matrix factorizations on top of dask.arraywithout significant input from dask core developers.This demonstrates that algorithmic domain experts can implement complex algorithms with dask and achieve good results with a minimum of framework investment.
To demonstrate complexity we present the graph of an out-ofcore singular value decomposition contributed by those authors to the dask.array.linalglibrary.
>>> import dask.array as da >>> x = da.ones((5000,1000), chunks=(1000, 1000)) >>> u, s, v = da.svd(x) This algorithm is complex enough without having to worry about software frameworks.Mathematical experts were able to implement this without having to simultaneously develop expertise in a complex parallel programming framework.

Final Thoughts
Extend the Scale of Convenient Data: The dask collections (array, bag, dataframe) provide reasonable access to parallelism and out-of-core execution.These significantly extend the scale of data that is convenient to manipulate.
Low Barrier to Entry: More importantly these collections demonstrate the feasibility of dask graphs to describe parallel algorithms and of the dask schedulers to execute those algorithms efficiently in a small space.The lack of a more baroque framework drastically reduces the barrier to entry and the ability of developers to use dask within their own libraries.Dask is available on github, PyPI, and is now included in the Anaconda distribution.It is BSD licensed, runs on Python 2.6 to 3.4 and is tested against Linux, OSX, and Windows.
This document was compiled from numerous blogposts that chronicle dask's development and go more deeply into the computational concerns encountered during dask's construction.
Dask is used on a daily basis, both as a dependency in other projects in the SciPy ecosystem (xray, scikit-image, ...) and also in production in private business.

Fig. 3 :
Fig. 3: We use typical NumPy slicing and reductions on a large volume of data to show the average temperature difference between noon and midnight for year 2014

•
dask.array = numpy + threading • dask.bag= toolz + multiprocessing • dask.dataframe= pandas + threading BagA bag is an unordered collection with repeats.It is like a Python list but does not guarantee the order of elements.Because we typically compute on Python objects in dask.bagwe are bound by the Global Interpreter Lock and so switch from using a multithreaded scheduler to a multi-processing one.

TABLE 1 :
Matrix Multiply GigaFLOPS for NumPy/Dask.array and for ATLAS and OpenBLAS with one and four threads Each of these files contains the temperature at two meters above ground over the earth at quarter degree resolution, every six hours.