MDAnalysis : A Python Package for the Rapid Analysis of Molecular Dynamics Simulations

MDAnalysis (http://mdanalysis.org) is a library for structural and temporal analysis of molecular dynamics (MD) simulation trajectories and individual protein structures. MD simulations of biological molecules have become an important tool to elucidate the relationship between molecular structure and physiological function. Simulations are performed with highly optimized software packages on HPC resources but most codes generate output trajectories in their own formats so that the development of new trajectory analysis algorithms is confined to specific user communities and widespread adoption and further development is delayed. MDAnalysis addresses this problem by abstracting access to the raw simulation data and presenting a uniform object-oriented Python interface to the user. It thus enables users to rapidly write code that is portable and immediately usable in virtually all biomolecular simulation communities. The user interface and modular design work equally well in complex scripted work flows, as foundations for other packages, and for interactive and rapid prototyping work in IPython / Jupyter notebooks, especially together with molecular visualization provided by nglview and time series analysis with pandas. MDAnalysis is written in Python and Cython and uses NumPy arrays for easy interoperability with the wider scientific Python ecosystem. It is widely used and forms the foundation for more specialized biomolecular simulation tools. MDAnalysis is available under the GNU General Public License v2.


Introduction
Molecular dynamics (MD) simulations of biological molecules have become an important tool to elucidate the relationship between molecular structure and physiological function [DDG + 12], [Oro14].Simulations are performed with highly optimized software packages on HPC resources but most codes generate output trajectories in their own formats so that the development of new trajectory analysis algorithms is confined to specific user communities and widespread adoption and further development is delayed.Typical trajectory sizes range from gigabytes to terabytes so it is typically not feasible to convert trajectories into a range of different formats just to use a tool that requires this specific format.Instead, a framework is required that provides a common interface to raw simulation data.Here we describe the MDAnalysis library [MADWB11] that addresses this problem by abstracting access to the raw simulation data.MDAnalysis presents a uniform object-oriented Python interface to the user.Since its original publication in 2011 [MADWB11], MDAnalysis has been widely adopted and has undergone substantial changes.Here we provide a short introduction to MDAnalysis and its capabilities and an overview over recent improvements.
MDAnalysis was initially inspired by MDTools for Python (J.C.Phillips, unpublished) and MMTK [Hin00].MDTools pioneered the key idea to use an extensible and object-oriented language, namely, Python, to provide a high-level interface for the construction and analysis of molecular systems for MD simulations.MMTK became a tool kit to build MD simulation applications on the basis of a concise object model of a molecular system.MDAnalysis was built on an object model similar to that of MMTK with a strong focus on providing universal high-level building blocks for the analysis of MD trajectories, but for a much wider range of formats than previously available.MDAnalysis has been publicly available since January 2008 and is one of the longest actively maintained Python packages for the analysis of molecular simulations.Since then many other packages have appeared that primarily function as libraries for providing access to simulation data from within Python.Three popular examples are PyLOOS [RLG14], mdtraj [MBH + 15], and pytraj [NRSC16].PyLOOS [RLG14] consists of Python bindings to the C++ LOOS library [RG09]; in order to aid novice users, LOOS also provides about 140 small stand-alone tools that each focus on a single task.mdtraj [MBH + 15] is similar to MDAnalysis in many aspects but focuses even more on being a light-weight building block for other packages; it also includes a number of innovative performance optimizations.pytraj [NRSC16] is a versatile Python frontend to the popular and powerful cpptraj tool [RCI13] and is particularly geared towards users of the Amber MD package [CCD + 05].These three packages and MDAnalysis have in common that they are built on an object model of the underlying data (such as groups of particles or a trajectory), use compiled code in C, C++ or Cython to accelerate time critical bottlenecks, and have a "Pythonic" user interface.LOOS and MDAnalysis share a similar object-oriented philosophy in their user interface design.In contrast, mdtraj and pytraj expose a functional user interface.Both approaches have advantages and the existence of different "second generation" Python packages for the analysis of MD simulations provides many good choices for users and a fast moving and stimulating environment for developers.

Overview
MDAnalysis is specifically tailored to the domain of molecular simulations, in particularly in biophysics, chemistry, and biotechnology as well as materials science.The user interface provides physics-based abstractions (e.g., atoms, bonds, molecules) of the data that can be easily manipulated by the user.It hides the complexity of accessing data and frees the user from having to implement the details of different trajectory and topology file formats (which by themselves are often only poorly documented and just adhere to certain community expectations that can be difficult to understand for outsiders).MDAnalysis currently supports more than 25 different file formats and covers the vast majority of data formats that are used in the biomolecular simulation community, including the formats required and produced by the most popular packages such as NAMD [PBW + 05], Amber [CCD + 05], Gromacs [AMS + 15], CHARMM [BBIM + 09], LAMMPS [Pli95], DL_POLY [TSTD06], HOOMD [GNA + 15] as well as the Protein Data Bank PDB format [BWF + 00] and various other specialized formats.
Since the original publication [MADWB11], improvements in speed and data structures make it now possible to work with terabyte-sized trajectories containing up to ~10 million particles.MDAnalysis also comes with specialized analysis classes in the MDAnalysis.analysismodule that are unique to MDAnalysis such as LeafletFinder (in the leaflet module), a graphbased algorithm for the analysis of lipid bilayers [MADWB11], or Path Similarity Analysis (psa) for the quantitative comparison of macromolecular conformational changes [SKTB15].

Code base
MDAnalysis is written in Python and Cython with about 42k lines of code and 24k lines of comments and documentation.It uses NumPy arrays [VCV11] for easy interoperability with the wider scientific Python ecosystem.Although the primary dependency is NumPy, other Python packages such as netcdf4 and BioPython [HM03] also provide specialized functionality to the core of the library (Figure 1).

Availability
MDAnalysis is available in source form under the GNU General Public License v2 from GitHub as MDAnalysis/mdanalysis, and as PyPi and conda packages.The documentation is extensive and includes an introductory tutorial.

Development process
The development community is very active with more than five active core developers and many community contributions in every release.We use modern software development practices [WAB + 14], [SM14] with continuous integration (provided by Travis CI) and an extensive automated test suite (containing over 3500 tests with >92% coverage for our core modules).Development occurs on GitHub through pull requests that are reviewed by core developers and other contributors, supported by The MDAnalysis.analysis package contains independent modules that make use of the core to implement a wide range of algorithms to analyze MD simulations.The MDAnalysis.visualization package contains a growing number of tools that are specifically geared towards calculating visual representations such as, for instance, streamlines of molecules.
the results from the automated tests, test coverage reports provided by Coveralls, and QuantifiedCode code quality reports.Users and developers communicate extensively on the community mailing list (Google groups) and the GitHub issue tracker; new users and developers are very welcome and most user contributions are eventually integrated into the code base.The development and release process is transparent to users through open discussions and announcements and a full published commit history and changes.Releases are numbered according to the semantic versioning convention so that users can immediately judge the impact of a new release on their existing code base, even without having to consult the CHANGELOG documentation.Old code is slowly deprecated so that users have ample opportunity to update the code although we generally attempt to break as little code as possible.When backwards-incompatible changes are inevitable, we provide tools (based on the Python standard library's lib2to3) to automatically refactor code or warn users of possible problems with their existing code.

Basic usage
The core object in MDAnalysis is the Universe which acts as a nexus for accessing all data contained within a simulation.It is initialized by passing the file names of the topology and trajectory files, with a multitude of different formats supported in these roles.The topology acts as a description of all the particles in the system while the trajectory describes their behavior over time.The select_atoms method allows for AtomGroups to be created using a human readable syntax which allows queries according to properties, logical statements and geometric criteria.
# Select all solvent within a set distance from protein atoms ag = u.select_atoms('resnameSOL and around 5.0 protein') # Select all heavy atoms in the first 20 residues ag = u.select_atoms('resid1:20 and not prop mass < 10.0') # Use a preexisting AtomGroup as part of another selection sel1 = u.select_atoms('nameN and not resname MET') sel2 = u.select_atoms('around2.5 group Nsel', Nsel=sel1) The AtomGroup acts as a representation of a group of particles, with the properties of these particles made available as NumPy arrays.

ag.names ag.charges ag.positions ag.velocities ag.forces
The data from MD simulations comes in the form of a trajectory which is a frame by frame description of the motion of particles in the simulation.Today trajectory data can often reach sizes of hundreds of GB.Reading all these data into memory is slow and impractical.To allow the analysis of such large simulations on an average workstation (or even laptop) MDAnalysis will only load a single frame of a trajectory into memory at any time.
The trajectory data can be accessed through the trajectory attribute of a Universe.Changing the frame of the trajectory object updates the underlying arrays that AtomGroups point to.In this way the positions attribute of an AtomGroup within the iteration over a trajectory will give access to the positions at each frame.Through this approach only a single frame of data is present in memory at any time, allowing for large data sets, from half a million particles to tens of millions (see also section Analysis of large systems), to be dissected with minimal resources.
# the trajectory is an iterable object len(u.trajectory)implemented a mechanism by which the trajectory was read once on loading and frame offsets on disk were computed that could be used to directly seek to individual frames.Based on this idea, MDAnalysis implements a fast frame scanning algorithm for TRR and XTC files and also saves the offsets to disk (as a compressed NumPy array).When a trajectory is loaded again then instead of reading the whole trajectory, only the persistent offsets are read (provided they have not become stale as checked by conservative criteria such as changes in file name, modification time, and size of the original file, which are all saved with the offsets).In cases of terabyte-sized trajectories, the persistent offset approach can save hundreds of seconds for the initial loading of the Universe (after an initial one-time cost of scanning the trajectory).Current development work is extending the persistent offset scheme to all trajectory readers, which will provide random access for all trajectories in a completely automatic and transparent manner to the user.
Example: Per-residue RMSF As a complete example consider the calculation of the C α root mean square fluctuation (RMSF) ρ i that characterizes the mobility of a residue i in a protein: The code in Figure 2 A shows how MDAnalysis in combination with NumPy can be used to implement Eq. 1.The topology information and the trajectory are loaded into a Universe instance; C α atoms are selected with the MDAnalysis selection syntax and stored as the AtomGroup instance ca.The main loop iterates through the trajectory using the MDAnalysis trajectory iterator.The coordinates of all selected atoms become available in a NumPy array ca.positions that updates for each new time step in the trajectory.Fast operations on this array are then used to calculate variance over the whole trajectory.The final result is plotted with matplotlib [Hun07] as the RMSF over the residue numbers, which are conveniently provided as an attribute of the AtomGroup (Figure 2 B).The example demonstrates how the abstractions that MDAnalysis provides enable users to write concise code where the computations on data are cleanly separated from the task of extracting the data from the simulation trajectories.These characteristics make it easy to rapidly prototype new algorithms.In our experience, most new analysis algorithms are developed by first prototyping a simple script (like the one in Figure 2), often inside a Jupyter notebook (see section Interactive Use and Visualization).Then the code is cleaned up, tested and packaged into a module.In section Analysis Module, we describe the analysis code that is included as modules with MDAnalysis.

Interactive use and visualization
The high level of abstraction and the pythonic API, together with comprehensive Python doc strings, make MDAnalysis well suited for interactive and rapid prototyping work in IPython [PG07] and Jupyter notebooks.It works equally well as an interactive analysis tool, especially with Jupyter notebooks, which then contain an executable and well-documented analysis protocol that can be easily shared and even accessed remotely.Universes and Atom-Groups can be visualized in Jupyter notebooks using nglview, which interacts natively with the MDAnalysis API (Figure 3).Other Python packages that have become extremely useful in notebook-based analysis work flows are pandas [McK10] for rapid analysis of time series analysis, distributed [Roc15] for simple parallelization, FireWorks [JOC + 15] for complex work flows, and MDSynthesis [DGS + 16] for organizing, bundling and querying many simulations.

Analysis module
In the MDAnalysis.analysismodule we provide a large variety of standard analysis algorithms, like RMSD (root mean square distance) and RMSF (root mean square fluctuation) calculations, RMSD-optimized structural superposition [LAT10], native contacts [BHE13], [FKDD07], or analysis of hydrogen bonds as well as unique algorithms, such as the LeafletFinder in MDAnalysis.analysis.leaflet[MADWB11] and Path Similarity Analysis (MDAnalysis.analysis.psa)[SKTB15].Historically these algorithms were contributed by various researchers as individual modules to satisfy their own needs but this lead to some fragmentation in the user interface.We have recently started to unify the interface to the different algorithms with an AnalysisBase class.Currently PersistenceLength, InterRDF, LinearDensity and Contacts analysis have been ported.PersistenceLength calculates the persistence length of a polymer, InterRDF calculates the pairwise radial distribution function inside of a molecule, LinearDensity generates a density along a given axis and Contacts analysis native contacts, as described in more detail below.The API to these different algorithms is being unified with a common AnalysisBase class, with an emphasis on keeping it as generic and universal as possible so that it becomes easy to, for instance, parallelize analysis.Most other tools hand the user analysis algorithms as black boxes.We want to avoid that and allow the user to adapt an analysis to their needs.
The new Contacts class is a good example of a generic API that allows straightforward implementation of algorithms while still offering an easy setup for standard analysis types.The Contacts class is calculating a contact map for atoms in a frame and compares it with a reference map using different metrics.The used metric then decides which quantity is measured.A common quantity is the fraction of native contacts, where native contacts are all atom pairs that are close to each other in a reference structure.The fraction of native contacts is often used in protein folding to determine when a protein is folded.For native contacts two major types of metrics are considered: ones based on differentiable functions [BHE13] and ones based on hard cut-offs [FKDD07] (which we set as the default implementation).We have designed the API to choose between the two metrics and pass user defined functions to develop new metrics or measure other quantities.This generic interface allowed us to implement a "q 1 q 2 " analysis [FKDD07] on top of the Contacts class; q 1 and q 2 refer to the fractions of native contacts that are present in a protein structure relative to two reference states 1 and 2. Below is an incomplete code example that shows how to implement a q 1 q 2 analysis, the default value for the method keyword argument is overwritten with a user defined method radius_cut_q.A more detailed explanation can be found in the documentation.This type of flexible analysis algorithm paired with a collection of base classes enables rapid and easy analysis of simulations as well as development of new ones.

Visualization module
The new MDAnalysis.visualizationname space contains modules that primarily produce visualizations of molecular systems.Currently it contains functions that generate specialized streamline visualizations of lipid diffusion in membrane bilayers [CRG + 14].In short, the algorithm decomposes any given membrane into a grid and tracks the displacement of lipids between different grid elements, emphasizing collective lipid motions.Both 2D (MDAnalysis.visualization.streamlines)and 3D (MDAnalysis.visualization.streamlines_3D)implementations are available in MDAnalysis, with output shown in Figure 4. Sample input data files are available online from the Flows website along with the expected output visualizations.

Improvements in the internal topology data structures
Originally MDAnalysis followed a strict object-oriented approach with a separate instance of an Atom object for each particle in the simulation data.The AtomGroup then simply stored its contents as a list of these Atom instances.With simulation data now commonly exceeding 10 6 particles this solution did not scale well and so recently this design was overhauled to improve the scalability of MDAnalysis.
Because all Atoms have the same property fields (i.e.mass, position) it is possible to store this information as a single NumPy array for each property.Now an AtomGroup can keep track of its contents as a simple integer array, which can be used to slice these property arrays to yield the relevant data.
Overall this approach means that the same number of Python objects are created for each Universe, with the number of particles TABLE 1: Performance comparison of subselecting an AtomGroup from an existing one using the new system (upcoming release v0.16.0) against the old (v0.15.0).Subselections were slices of the same size (82,056 atoms).Shorter processing times are better.The benchmarks systems were taken from the vesicle library [KB15] and are listed with their approximate number of particles ("# atoms").Benchmarks were performed on a laptop with an Intel Core i5 2540M 2.6 GHz processor, 8 GB of RAM and a SSD drive.only changing the size of the arrays.This translates into a much smaller memory footprint (1.3 GB vs. 3.6 GB for a 10.1 M atom system), highlighting the memory cost of millions of simple Python objects.This transformation of the data structures from an Array of Structs to a Struct of Arrays also better suits the typical access patterns within MDAnalysis.It is quite common to compare a single property across many Atoms, but rarely are different properties within a single Atom compared.Additionally, it is possible to utilize NumPy's faster indexing capabilities rather than using a list comprehension.This new data structure has lead to performance improvements in our whole code base.The largest improvement is in accessing subsets of Atoms which is now over 40 times faster (Table 1), an operation that is used everywhere in MDAnalysis.Speed-ups of a factor of around five to seven were realized for accessing Atom attributes for whole AtomGroup instances (Table 2).The improved topology data structures are also much faster to initialize, which translates into speed-ups of about three for the task of loading a system from a file (for instance, in the Gromacs GRO format or the Protein Databank PDB format) into a Universe instance (Table 3).Given that for systems with 10 M atoms this process used to take over 100 s, the reduction in load time down to a third is a substantial improvement -and it came essentially "for free" as a by-product of improving the underlying topology data structures.

Analysis of large systems
MDAnalysis has been used extensively to study extremely large simulation systems for long simulation times.More recently, a 12.7 M CG particle system combining the influenza A envelope and a model of a plasma membrane [KS15] were simulated together (Figure 5 A).MDAnalysis was used to assess the stability of this enormous system by tracking, for example, the changes in Z coordinate values for different system components (Figure 5 B).In this case, the membrane appeared to rise too rapidly over the course of 50 ns, which suggests that the simulation system will likely have to be redesigned.Such large systems are challenging to work with, including their visualization, and analysis of quantities based on particle coordinates is essential to assess the correct behavior of the simulations.

Other packages that use MDAnalysis
The user interface and modular design work well in complex scripted work flows and for interactive work, as discussed in section Interactive Use and Visualization.MDAnalysis also serves as foundation for other packages.For example, ProtoMD [SMO16] is a toolkit that facilitates the development of algorithms for multiscale (MD) simulations and uses MDAnalysis for on-thefly calculations of the collective variables that drive the coarsegrained degrees of freedom.The ENCORE package [TPB + 15] enables users to compare conformational ensembles generated either from simulations alone or synergistically with experiments.MDAnalysis is also the back end for ST-analyzer [JJW + 14], a standalone graphical user interface tool set to perform various trajectory analyses.MDSynthesis [DGS + 16] (which is based on datreant (Dotson et al, this issue)) gives a Pythonic interface to molecular dynamics trajectories using MDAnalysis, giving the ability to work with the data from many simulations scattered throughout the file system with ease.It makes it possible to write analysis code that can work across many varieties of simulation, but even more importantly, MDSynthesis allows interactive work with the results from hundreds of simulations at once without much effort.

Conclusions
MDAnalysis provides a uniform interface to simulation data, which comes in a bewildering array of formats.It enables users to rapidly write code that is portable and immediately usable in virtually all biomolecular simulation communities.It has an active international developer community with researchers that are expert developers and users of a wide range of simulation codes.MDAnalysis is widely used (the original paper [MADWB11] has been cited more than 195 times) and forms the foundation for more specialized biomolecular simulation tools.Ongoing and future developments will improve performance further, introduce transparent parallelization schemes to utilize multi-core and GPU systems efficiently, and interface with the SPIDAL library for high performance data analytics algorithms [QJLF14].
Imaging Fellowship from the Department of Physics at Arizona State University.IMK was supported by a REU supplement to grant ACI-1443054 from the National Science Foundation.OB was supported in part by grant ACI-1443054 from the National Science Foundation; computational resources for OB's work were in part provided by the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1053575 (allocation MCB130177 to OB).The MDAnalysis Atom logo was designed by Christian Beckstein.

Fig. 1 :
Fig. 1: Structure of the MDAnalysis package.MDAnalysis consists of the core with the Universe class as the primary entry point for users.The MDAnalysis.analysis package contains independent modules that make use of the core to implement a wide range of algorithms to analyze MD simulations.The MDAnalysis.visualization package contains a growing number of tools that are specifically geared towards calculating visual representations such as, for instance, streamlines of molecules.

BFig. 2 :
Fig. 2: Example for how to calculate the root mean square fluctuation (RMSF) for each residue in a protein with MDAnalysis and NumPy.A: Based on the input simulation data (topology and trajectory in the Gromacs format (TPR and XTC), MDAnalysis makes coordinates of the selected C α atoms available as NumPy arrays.From these coordinates, the RMSF is calculated by averaging over all frames in the trajectory.The RMSF is then plotted with matplotlib.The algorithm to calculate the variance in a single pass is due to Welford [Wel62].B: C α RMSF for each residue.

Fig. 3 :
Fig.3: MDAnalysis can be used with nglview to directly visualize molecules and trajectories in Jupyter notebooks.The adenylate kinase (AdK) protein from one of the included test trajectories is shown. .

Fig. 5 :
Fig. 5: Simulation of a coarse-grained model of the influenza A virion membrane (purple/red) close to a model of the human plasma membrane (brown).A: Left: initial frame.Right: system after 40 ns .A horizontal black guide line is used to emphasize the rising plasma membrane position.The images were produced with VMD [HDS96].B Maximum Z (vertical) coordinate values for the influenza A virus envelope and the plasma membrane are tracked over the course of the simulation, indicating that the membrane rises to rapidly.

TABLE 2 :
Performance comparison of accessing attributes with new AtomGroup data structures (upcoming release v0.16.0) compared with the old Atom classes (v0.15.0).Shorter access times are better.The same benchmark systems as in Table1were used.
Marrink and coworkers [IME + 14] used MDAnalysis to analyze a realistic model of the membrane of a mammalian cell with 63 different lipid species and over half a million particles for 40 µs.They discovered that transient domains with liquid-ordered character formed and disappeared on the microsecond time scale, with different lipid

TABLE 3 :
Performance comparison of loading a topology file with 1.75 to 10 million atoms with new AtomGroup data structures (upcoming release v0.16.0) compared with the old Atom classes (v0.15.0).Shorter loading times are better.The same benchmark systems as in Table1were used.