PMDA-Parallel Molecular Dynamics Analysis

MDAnalysis is an object-oriented Python library to analyze trajectories from molecular dynamics (MD) simulations in many popular formats. With the development of highly optimized MD software packages on high performance computing (HPC) resources, the size of simulation trajectories is growing up to many terabytes in size. However efficient usage of multicore architecture is a challenge for MDAnalysis, which does not yet provide a standard interface for parallel analysis. To address the challenge, we developed PMDA, a Python library that builds upon MDAnalysis to provide parallel analysis algorithms. PMDA parallelizes common analysis algorithms in MDAnalysis through a task-based approach with the Dask library. We implement a simple split-apply-combine scheme for parallel trajectory analysis. The trajectory is split into blocks, analysis is performed separately and in parallel on each block ("apply"), then results from each block are gathered and combined. PMDA allows one to perform parallel trajectory analysis with pre-defined analysis tasks. In addition, it provides a common interface that makes it easy to create user-defined parallel analysis modules. PMDA supports all schedulers in Dask, and one can run analysis in a distributed fashion on HPC machines, ad-hoc clusters, a single multi-core workstation or a laptop. We tested the performance of PMDA on single node and multiple nodes on a national supercomputer. The results show that parallelization improves the performance of trajectory analysis and, depending on the analysis task, can cut down time to solution from hours to minutes. Although still in alpha stage, it is already used on resources ranging from multi-core laptops to XSEDE supercomputers to speed up analysis of molecular dynamics trajectories. PMDA is available as open source under the GNU General Public License, version 2 and can be easily installed via the pip and conda package


Introduction
Classical molecular dynamics (MD) simulations have become an invaluable tool to understand the function of biomolecules [KM02], [DDG + 12], [SB14], [Oro14], [BLL18], [HBD + 19] (often with a view towards drug discovery [BS12]) and diverse problems in materials science [Rot09], [LS15], [VMMC + 15], [LJYH18], [KAHC18], [FPM18].Systems are modeled as particles (for example, atoms) whose interactions are approximated with a classical potential energy function [FS02], [BGM + 18].Forces on the particles are derived from the potential and Newton's equations of motion for the particles are solved with an integrator algorithm, typically using highly optimized MD codes that run on high performance computing (HPC) resources or workstations (often equipped with GPU accelerators).The resulting trajectories, the time series of particle positions r(t) (and possibly velocities), are analyzed with statistical mechanics approaches [Tuc10], [BGM + 18] to obtain predictions or to compare to experimentally measured quantities.Currently simulated systems may contain millions of atoms and the trajectories can consist of hundreds of thousands to millions of individual time frames, thus resulting in file sizes ranging from tens of gigabytes to tens of terabytes.Processing and analyzing these trajectories is increasingly becoming a rate limiting step in computational workflows [CR15], [BFJ18].Modern MD packages are highly optimized to perform well on current HPC clusters with hundreds of cores such as the XSEDE supercomputers [TCD + 14] but current general purpose trajectory analysis packages [Gio19] were not designed with HPC in mind.
In order to scale up trajectory analysis from workstations to HPC clusters with the MDAnalysis Python library [MADWB11], [GLB + 16] we leveraged Dask [Roc15], [Das16], a task-graph parallel framework, together with Dask's various schedulers (in particular distributed), and created the Parallel MDAnalysis (PMDA) library.By default, PMDA follows a simple split-apply-combine [Wic11] approach for trajectory analysis, whereby each task analyzes a single trajectory segment and reports back the individual results that are then combined into the final result [KPJB17].Our previous work established that Dask worked well with MDAnalysis [KPJB17] and that this approach was competitive with other task-parallel approaches [PLK + 18].However, we did not provide a general purpose framework to write parallel analysis tools with MDAnalysis.Here we show how the split-apply-combine approach lends itself to a generalizable Python implementation that makes it straightforward for users to implement their own parallel analysis tools.At the heart of PMDA is the idea that the user only needs to provide a function that analyzes a single trajectory frame.PMDA provides the remaining framework via the ParallelAnalysisBase class to split the trajectory, apply the user's function to trajectory frames, run the analysis in parallel via Dask/distributed, and combines the data.It also contains a growing library of ready-to-use analysis classes, thus enabling users to immediately accelerate analysis that they previously performed in serial with the standard MDAnalysis analysis classes [GLB + 16].

Methods
At the core of PMDA is the idea that a common interface makes it easy to create code that can be easily parallelized, especially if the analysis can be split into independent work over multiple trajectory slices and a final step, in which all data from the trajectory slices are combined.We first describe typical steps in analyzing MD trajectories and then outline the approach taken in PMDA.

Trajectory analysis
A trajectory with T saved time steps consists of a sequence of coordinates r 1 (t), r 2 (t), . . .r N (t) 1≤t≤T where r i (t) are the Cartesian coordinates of particle i at time step t with N particles in the simulated system, i.e., T × N × 3 floating point numbers in total.To simplify notation, we consider t as an integer that indexes the trajectory frames; each frame index corresponds to a physical time in the trajectory that we could obtain if needed.In general, the coordinates are passed to a function A ({r i (t)}) to compute a time-dependent quantity (1) This quantity does not have to be a simple scalar; it may be a vector or a function of another parameter.In many cases, the time series A(t) is the desired result.It is, however, also common to perform some form of reduction on the data, which can be as simple as a time average to compute a thermodynamic average A ≡ Ā = T −1 ∑ T t=1 A(t).Such an average can be easily calculated in a post-analysis step after the time series has been obtained.An example of a more complicated reduction is the calculation of a histogram such as a radial distribution function (RDF) [FS02], [Tuc10] between two types of particles with numbers N a and N b , where the Dirac delta function counts the occurrences of particles i and j at distance r.To compute a RDF, we could generate a time series of histograms along the spatial coordinate r, i.e., A(t; r) for each frame, and then perform the average in post-analysis.However, storage of such histograms becomes problematic, especially if instead of 1-dimensional RDFs, densities on 3-dimensional grids are being calculated.It is therefore better to reformulate the algorithm to perform a partial average (or reduction) during the analysis on a per-frame basis.For histograms, this could mean building a partial histogram and updating counts in the bins after every frame.PMDA supports the simple time series data collection and the per-frame reduction.

Split-apply-combine
The split-apply-combine strategy can be thought of as a simplified map-reduce [Wic11] that provides a conceptually simple approach to operate on data in parallel.It is based on the fundamental assumption that the data can be partitioned into blocks that can be analyzed independently.The trajectory is split along the time axis into M blocks of approximately equal size, τ = T /M.One trajectory block can be viewed as a slice of a trajectory, e.g., for block k, r 1 (t), r 2 (t), . . .r N (t) t k ≤t<t k +τ k with τ k frames in the block.Each block k is analyzed in parallel by applying the function A to the frames in each block.Finally, the results from all blocks are gathered and combined.
The advantage of this approach is its simplicity.Many typical analysis tasks are based on calculations of time series from single trajectory frames as in Eq. 1 and it is this calculation that varies from task to task while the book-keeping and trajectory slicing Steps are labeled with the methods in pmda.parallel.ParallelAnalysisBase that perform the corresponding function.Methods in red (_single_frame() and _conclude()) must be implemented for every analysis function because they are not general.The blue method _reduce() must be implemented unless a simple time series is being calculated.The _prepare() method is optional and provides a hook to initialize custom data structures.
is the same.Given a function A that performs the single frame calculation, PMDA provides code to perform the other necessary steps (Fig. 1).
As explained in more detail later, a class derived from pmda.parallel.ParallelAnalysisBase encapsulates one trajectory analysis calculation.Individual methods correspond to different steps and in the following (and in Fig. 1) we will mention the names of the relevant methods to make clear how PMDA abstracts parallel analysis.The calculation with M parallel workers is prepared by setting up data structures to hold the final result (method _prepare()).The indices for the M trajectory slices are created in such a way that the number of frames τ k are balanced and do not differ by more than 1.For each slice or block k, the single frame analysis function A (_single_frame()) is sequentially applied to all frames in the slice.The result, A(t), is reduced, i.e., added to the results for this block.For time series, A(t) is simply appended to a list to form a partial time series for the block.More complicated reductions (method _reduce()) can be implemented, for example, the data may be histogrammed and added to a partial histogram for the block (as necessary for the implementation of the parallel RDF Eq. 2).

Implementation
PMDA is written in Python and, through MDAnalysis [GLB + 16], reads trajectory data from the file system into NumPy arrays [Oli07], [VDWCV11].Dask's delayed() function is used to build a task graph that is then executed using any of the schedulers available to Dask [Das16].
MDAnalysis combines a trajectory file (frames of coordinates that change with time) and a topology file (list of particles, their names, charges, bonds -all information that does not change with time) into a Universe(topology, trajectory) object.Arbitrary selections of particles (often atoms) are made available as an AtomGroup and the common approach in MDAnalysis is to work with these objects [GLB + 16]; for instance, all coordinates of an AtomGroup with N atoms named protein are accessed as the N × 3 NumPy array protein.positions.pmda.parallel.ParallelAnalysisBase is the base class for defining a split-apply-combine parallel multi frame analysis in PMDA.It requires a Universe to operate on and any AtomGroup instances that will be used.A parallel analysis class must be derived from ParallelAnalysisBase and at a minimum, must implement the _single_frame(ts, agroups) and _conclude() methods.The arguments of _single_frame(ts, agroups) are a MDAnalysis Timestep instance and a tuple of AtomGroup instances so that the following code could be run (the code is a simplified version of the current implementation): 1 @delayed 2 def analyze_block(blockslice): The task graph is constructed by wrapping the above code into delayed() and appending a delayed instance for each trajectory slice to a (delayed) list: Calling the compute() method of the delayed list object hands the task graph over to the scheduler, which then executes the graph on the available Dask workers.For example, the multiprocessing scheduler can be used to parallelize task graph execution on a single multiprocessor machine while the distributed scheduler is used to run on multiple nodes of a HPC cluster.After all workers have finished, the variable results contains a list of results from the individual blocks.PMDA actually stores these raw results as ParallelAnalysisBase._results and leaves it to the _conclude() method to process the results; this can be as simple as numpy.hstack(self._results)to generate a time series by concatenating the individual time series from each block.
The default _reduce() method appends the results and is equivalent to line 6.In general, line 6 reads 6 result = self._reduce(result,A) where variable result should have been properly initialized in _prepare().In order to be parallelizable, the _reduce() method must be a static method that does not access any class variables but returns its modified first argument.For example, the default "append" reduction is @staticmethod def _reduce(res, result_single_frame): res.append(result_single_frame) return res In general, the ParallelAnalysisBase controls access to instance attributes via a context manager ParallelAnalysisBase.readonly_attributes().It sets them to "read-only" for all parallel parts to prevent the common mistake to set an instance attribute in a parallel task, which breaks under parallelization as the value of an attribute in an instance in a parallel process is never communicated back to the calling process.
Using PMDA PMDA allows one to perform parallel trajectory analysis with predefined analysis tasks.In addition, it provides a common interface that makes it easy to create user-defined parallel analysis modules.
Here, we will introduce some basic usages of PMDA.

Pre-defined Analysis
PMDA contains a growing number of pre-defined analysis classes that are modeled after functionality in MDAnalysis.analysis and that can be used right away.Current examples are pmda.rmsfor RMSD analysis, pmda.contacts for native contacts analysis, pmda.rdf for radial distribution functions, and pmda.leaflet for the LeafletFinder analysis tool [MADWB11], [PLK + 18] for the topological analysis of lipid membranes.While the first three modules are based on pmda.parallel.ParallelAnalysisBase as described above and follow the strict split-apply-combine approach, pmda.leaflet is an example of a more complicated task-based algorithm that can also easily be implemented with MDAnalysis and Dask [PLK + 18].All PMDA classes can be used in a similar manner to classes in MDAnalysis.analysis, which makes it easy for users of MDAnalysis to switch to parallelized versions of the algorithms.One example is the calculation of the root mean square distance (RMSD) of C α atoms of the protein with pmda.rms.RMSD.An analysis class object is instantiated with the necessary input data such as the AtomGroup containing the C α atoms and a reference structure.To perform the analysis, the run() method is called.Here the only difference between using the serial version and the parallel version is that the run() method takes additional arguments n_jobs and n_blocks, which determine the level of parallelization.When using the multiprocessing scheduler (the default), n_jobs is the number of processes to start and typically the number of blocks n_blocks is set to the number of available CPU cores.When the distributed scheduler is used, Dask will automatically learn the number of available Dask worker processes and n_jobs is meaningless; instead it makes more sense to set the number of trajectory blocks that are then spread across all available workers.

User-defined Analysis
PMDA makes it easy to create analysis classes such as the ones discussed above.If the per-frame analysis can be expressed as a simple function, then an analysis class can be created with a factory function.Otherwise, a class has to be derived from pmda.parallel.ParallelAnalysisBase.Both approaches are described below.pmda.custom.AnalysisFromFunction(): PMDA provides helper functions in pmda.custom to rapidly build a parallel class for users who already have a single frame function that 1. takes one or more AtomGroup instances as input, 2. analyzes one frame in a trajectory and returns the result for this frame.For example, if we already have a function to calculate the radius of gyration [MM14] of a protein given in AtomGroup ag, namely ag.radius_of_gyration() (as available in MDAnalysis), then we can write a simple function rgyr() that returns for each trajectory frame a tuple containing the time at the current time step and the value of the radius of gyration: This new parallel analysis class can be run just as the existing ones: parallel_rgyr.run(n_jobs=4, n_blocks=4) print(parallel_rgyr.results) The time series of the results is stored in the attribute parallel_rgyr.results; for our example where each perframe result is a tuple (time, Rgyr), the time series is stored as a T × 2 array that can be plotted with 1) the single frame analysis method _single_frame() (required), 2) the final results conclusion method _conclude() (required), 3) the additional preparation method _prepare() (optional), 4) the reduce method for frames within the same block _reduce() (optional for time series, required for anything else).
As an example, we show how one can build a class to calculate the radius of gyration of a protein given in AtomGroup protein; of course, in this case the simple approach with pmda.custom.AnalysisFromFunction() would be easier.Because we want to return a time series, it is not necessary to define a _reduce() method.This class can be used in the same way as the class that we defined with pmda.custom.AnalysisFromFunction: parallel_rgyr = RGYR(protein) parallel_rgyr.run(n_jobs=4,n_blocks=4) print(parallel_rgyr.results)

Performance Evaluation
In order to characterize the performance of PMDA on a typical HPC machine we performed computational experiments for two different analysis tasks, the RMSD calculation after optimum superposition (RMSD) and the water oxygen radial distribution function (RDF).
For the RMSD task we computed the time series of root mean square distance after optimum superposition (RMSD) of all 564 C α atoms of a protein with the initial coordinates at the first frame as reference, as implemented in class pmda.rms.RMSD.The RMSD calculation with optimum superposition was performed with the fast QCPROT algorithm [The05] as implemented in MDAnalysis [MADWB11].
As a second test case we computed the water oxygen-oxygen radial distribution function (RDF, Eq. 2) in 75 bins up to a cut-off of 5 Å for all 24,239 oxygen atoms in the water molecules in our test system, using the class pmda.rdf.InterRDF.The RDF calculation is compute-intensive due to the necessity to calculate and histogram a large number (O(N) because of the use of a cutoff) of distances for each time step; it additionally exemplifies a non-trivial reduction.
These two common computational tasks differ in their computational cost and represent two different requirements for data reduction and thus allow us to investigate two distinct use cases.We investigated a long (9000 frames) and a short trajectory (900 frames) to assess to which degree parallelization remained practical.The computational experiments were performed in different scenarios to assess the influence of different Dask schedulers (multiprocessing and distributed) and the role of the file storage system (shared Lustre parallel file system and local SSD), as described below and summarized in Table 1.We tested different combinations of Dask schedulers (distributed, multiprocessing) with different means to read the trajectory data (either from the shared Lustre parallel file system or from local SSD) as shown in Table 1.Using either the multiprocessing scheduler or the SSD restrict runs to a single node (maximum 24 CPU cores).With distributed (and Lustre) we tested fully utilizing all cores on a node and also only occupying half the available cores, while doubling the total number of nodes.In all cases the trajectory were split in as many blocks as there were available processes or Dask workers.We performed five independent repeat runs for all scenarios in Table 1 and plotted the mean of the reported timing quantity together with the standard deviation from the mean to indicate the variance of the runs.

Measured timing quantities
The ParallelAnalysisBase class collects detailed timing information for all blocks and all frames and makes these data available in the attribute ParallelAnalysisBase.timing: We measured the time t prepare for _prepare(), the time t wait k that each task k waits until it is executed by the scheduler, the  to perform the computation in _single_frame() and reduction in _reduce(), the time t conclude k to perform the final processing of all data in _conclude(), and the total wall time to solution t total .
We analyzed the total time to completion as a function of the number of CPU cores, which was equal to the number of trajectory blocks, so that each block could be processed in parallel.We quantified the strong scaling behavior by calculating the speedup for running on M CPU cores with M parallel Dask tasks as S(M) = t total (1)/t total (M), where t total (1) is the performance of the PMDA code using the serial scheduler.The efficiency was calculated as E(M) = S(M)/M.The errors of these quantities were derived by the standard error propagation.
To gain better insight into the performance-limiting steps in our algorithm (Fig. 1) we plotted the maximum times over all ranks because the overall time to completion cannot be faster than the slowest parallel process.For example, for the read I/O time we calculated the total read I/O time for each rank k as t k,t and then reported max k t I/O k .

RMSD analysis task
The parallelized RMSD analysis in pmda.rms.RMSD scaled well only to about half a node (12 cores), as shown in Fig. 2 A, D, regardless of the length of the trajectory.The efficiency dropped below 0.8 (Fig. 2 B, E) and the maximum achievable speed-up remained below 10 for the short trajectory (Fig. 2 C) and below 20 for the long one (Fig. 2 F).Overall, using the multiprocessing scheduler and either Lustre or SSD gave the best performance and shortest time to solution.The distributed scheduler with SSD gave widely variable results as seen by large standard deviations over multiple repeats.It still performed better than when the Lustre file system was used but overall, for a single node, the multiprocessing scheduler always gave better performance with less variation in run time.These results were consistent with findings in our earlier pilot study where we had looked at the RMSD task with Dask and had found that multiprocessing with both SSD and Lustre had given good single node performance but, using distributed, had not scaled well beyond a single SDSC Comet node [KPJB17].
A detailed look at the maximum times (Fig. 3) that the Dask worker processes spent on waiting to be executed, performing the RMSD calculation with data in memory, and reading the trajectory frame data from the file into memory showed that the waiting time (Fig. 3 A, D) either increased from about 0.02 s to 0.1 s for multiprocessing or was roughly a constant 1 s for distributed (on Lustre).For reasons that were not clear, the distributed scheduler with SSD had on average the largest wait times, with large fluctuations, ranging from 0.1 s to 10 s (red lines in Fig. 3 A, D).The computation itself scaled very well (Fig. 3 B, E) with only small variations, indicating that split-applycombine is a robust approach to parallelization, once the data are in memory.The reading time scaled fairly well but exhibited some variation beyond a single node (24 cores) and an unexplained decline in performance for the longer trajectory, as seen in Fig. 3 C, F. The read I/O results indicated that both Lustre and SSD can perform equally well.Beyond 12 cores, the waiting time started approaching the time for read I/O (compute was an order of magnitude less than I/O) and hence parallel speed-up was limited by the wait time.
The second major component that limited scaling performance was the time to create the Universe data structure (Fig. 4 A, D).The time to read the topology and open the trajectory file on the shared file system typically increased from 1 s to about 2 s and thus, for the given total trajectory lengths, also became comparable to the time for read I/O.The other components (prepare and conclude, see Fig. 4) remained negligible with times below 10 −3 s.
The parallelizable fraction of the workload consisted of the compute and read I/O steps.Because this fraction was relatively small and was dominated by the wait time from the Dask scheduler and the time to initialize the Universe data structure (Fig. 5), the overall performance gain by parallelization remained modest, as explained by Amdahl's law [Amd67].Thus, for a highly optimized and fast computation such as the RMSD calculation, the best performance (speed-up on the order of 10 fold) could already be achieved on the equivalent of a modern workstation.The multiprocessing scheduler seemed to be the more consistent and better performing choice in this scenario; therefore PMDA defaults to multiprocessing.Performance would likely improve with longer trajectories because the "fixed" serial costs (waiting, Universe  creation) would decrease in relevance to the time spent on computation and data ingestion (see Fig. 5 B), which benefit from parallelization [Gus88].However, all things considered, a single node seemed sufficient to accelerate RMSD analysis.

RDF analysis task
Unlike the RMSD analysis task, the parallelized RDF analysis in pmda.rdf.InterRDF showed decreasing total time to solution up to the highest number of CPU cores tested (see Fig. 6 A, D).The efficiency on a single node remained above 0.6 for almost all cases (Fig. 6 B, E) and remained above 0.6 for the best case (distributed on Lustre and half-filling of nodes for the long trajectory), up to 3 nodes (72 cores, Fig. 6 E).Even when filling complete nodes, the efficiency for the long trajectory remained above 0.5 (Fig. 6 E).Consequently, a sizable speed-up could be maintained that approached 40 fold in the best case (Fig. 6 F), which cut down the time to solution from about 40 min to under 1 min.On a single node, all approaches performed similarly well, with the distributed scheduler now having a slight edge over multiprocessing (Fig. 6), with the exception of the combination of distributed with the SSD, which for unknown reasons performed much worse than everything else (similar to the situation observed for the RMSD case).The detailed analysis of the individual components in Fig. 7 clearly showed that the RDF analysis task required much more computational effort than the RMSD task and that it was dominated by the compute component (Fig. 8), which scaled very well to the highest core numbers (Fig. 7 B, E).However, multiprocessing and especially distributed with SSD took longer for the computational part at ≥ 8 cores (one third of a single node), indicating that in these cases some sort of competition reduced performance.For comparison, serial computation required about 250 s while read I/O required less than 10 s, and this ratio was approximately maintained as the read I/O also scaled reasonably well (Fig. 7 C, F) Although the variance increased markedly when multiple nodes were included such as when using six half-filled nodes, this effect did not strongly impact overall performance because t compute k,t t I/O k,t .The differences between using all cores on a node compared to only using half the cores on each node were small but only using half a node was consistently better, especially in the compute time, and hence the overall performance of the latter approach was better.For the shorter trajectory, the wait time was a factor in reducing performance at higher core numbers (Fig. 7 A).The other components (t Universe k < 2 s, t prepare < 3 × 10 −5 s , t conclude k < 4 × 10 −4 s) were similar or better (i.e., shorter) than the ones shown for the RMSD task in Fig. 4 and are not shown; only the time to set up the Universe played a role in reducing the scaling performance in the Lustre-distributed-3nodes scenario at 60 or more CPU cores.
In summary, the performance increase for a compute-intensive task such as RDF was sizable and, although not extremely efficient, was large enough (about 30-40) to justify the use of about 100 cores on a HPC supercomputer.Because scaling seemed mostly limited by constant costs such as the scheduling wait time (see Fig. 8), processing longer trajectories, for which more work has to be done in the parallelizable compute and read I/O steps, should improve the scaling behavior [Gus88].

Conclusions
The PMDA Python package provides a framework to parallelize analysis of MD trajectories with a simple split-apply-combine approach by combining Dask with MDAnalysis.Although still in early development, it provides useful functionality for users to speed up analysis, ranging from a growing library of included tools to different approaches for users to write their own parallel analysis.In simple cases, just wrapping a user supplied function is enough to immediately use PMDA but the package also provides a documented API to derive from the pmda.parallel.ParallelAnalysisBase class.We showed that performance depends on the type of analysis that is being performed.Compute-intensive tasks such as the RDF calculation can show good strong scaling up to about a hundred cores on a typical supercomputer and speeding up the time to

Fig. 1 :
Fig.1: High-level view of the split-apply-combine algorithm in PMDA.Steps are labeled with the methods in pmda.parallel.ParallelAnalysisBase that perform the corresponding function.Methods in red (_single_frame() and _conclude()) must be implemented for every analysis function because they are not general.The blue method _reduce() must be implemented unless a simple time series is being calculated.The _prepare() method is optional and provides a hook to initialize custom data structures.

Fig. 3 :
Fig.3: Detailed per-task timing analysis for parallel components of RMSD analysis task.Individual times per task were measured for different testing configurations (Table1).A and D: Maximum waiting time for the task to be executed by the Dask scheduler.B and E: Maximum total compute time per task.C and F: Maximum total read I/O time per task.Points represent the mean over five repeats with the standard deviation shown as error bars.

Fig. 4 :Fig. 5 :
Fig.4: Detailed timing analysis for other components of the RMSD analysis task.Individual times per task were measured for different testing configurations (Table1).A and D: Maximum time for a task to load the Universe.B and E: Time t prepare to execute _prepare().C and F: Time t conclude k to execute _conclude().Points represent the mean over five repeats with the standard deviation shown as error bars.

Fig. 7 :
Fig.7: Detailed per-task timing analysis for parallel components of the RDF analysis task.Individual times per task were measured for different testing configurations (Table1).A and D: Maximum waiting time for the task to be executed by the Dask scheduler.B and E: Maximum total compute time per task.C and F: Maximum total read I/O time per task.Points represent the mean over five repeats with the standard deviation shown as error bars.

import numpy as np from pmda.parallel import ParallelAnalysisBase class
The _conclude() method reshapes the attribute self._results, which always holds the results from all blocks, into a time series.The call signature for method _single_frame() is fixed and ts must contain the current MDAnalysis Timestep and agroups must be a tuple of AtomGroup instances.The current frame number, time and radius of gyration are returned as the single frame results: RGYR(ParallelAnalysisBase): def __init__(self, protein):

TABLE 1 :
Testing configurations on SDSC Comet.maxnodes is the maximum number of nodes that were tested; the multiprocessing scheduler is limited to a single node.maxprocesses is the maximum number of processes or Dask workers that were employed.packagesexceptPMDA and MDAnalysis were installed with the conda package manager from the conda-forge channel.PMDA and MDAnalysis development versions were installed from source in a conda environment with pip install.Benchmarks were run on the CPU nodes of XSEDE's [TCD + 14] SDSC Comet supercomputer, a 2 PFlop/s cluster with 1,944 Intel Haswell Standard Compute Nodes in total.Each node contains two Intel Xeon CPUs (E5-2680v3, 12 cores, 2.5 GHz) with 24 CPU cores per node, 128 GB DDR4 DRAM main memory, and a non-blocking fat-tree InfiniBand FDR 56 Gbps node interconnect.All nodes share a Lustre parallel file system and have access to node-local 320 GB SSD scratch space.Jobs are run through the SLURM batch queuing system.Our SLURM submission shell scripts and Python benchmark scripts for SDSC Comet are available in the repository https: Test system, benchmarking environment, and data filesWe tested PMDA 0.2.1, MDAnalysis 0.20.0 (development version), Dask 1.2.0, and NumPy 1.15.4 under Python 3.6.All Strong scaling performance of the RMSD analysis task with short (900 frames) and long (9000) frames trajectories on SDSC Comet, where a single node contains 24 cores.The total time to completion t total was measured for different testing configurations (Table1).A and D: t total as a function of processes or Dask workers, i.e., the number of CPU cores that were actually used.The number of trajectory blocks was the same as the number of CPU cores.B and E: efficiency E. The ideal case is E = 1.C and F: speed-up S. The dashed line represents ideal strong scaling S(M) = M. Points represent the mean over five repeats with the standard deviation shown as error bars.Universe for each Dask task (which includes opening the shared trajectory and topology files and loading the topology into memory), the time t I/O k,t to read each frame t in each block k from disk into memory, the time t k to create a new Strong scaling performance of the RDF analysis task.The total time to completion t total was measured for different testing configurations (Table1).A and D: t total as a function of processes or Dask workers, i.e., the number of CPU cores that were actually used.The number of trajectory blocks was the same as the number of CPU cores.B and E: efficiency E. The ideal case is E = 1.C and F: speed-up S. The dashed line represents ideal strong scaling S(M) = M. Points represent the mean over five repeats with the standard deviation shown as error bars.