Analyzing Particle Systems for Machine Learning and Data Visualization with freud

The freud Python library analyzes particle data output from molecular dynamics simulations. The library's design and its variety of high-performance methods make it a powerful tool for many modern applications. In particular, freud can be used as part of the data generation pipeline for machine learning (ML) algorithms for analyzing particle simulations, and it can be easily integrated with various simulation visualization tools for simultaneous visualization and real-time analysis. Here, we present numerous examples both of using freud to analyze nano-scale particle systems by coupling traditional simulation analyses to machine learning libraries and of visualizing per-particle quantities calculated by freud analysis methods. We include code and examples of this visualization, showing that in general the introduction of freud into existing ML and visualization workflows is smooth and unintrusive. We demonstrate that among Python packages used in the computational molecular sciences, freud offers a unique set of analysis methods with efficient computations and seamless coupling into powerful data analysis pipelines.


Introduction
The availability of "off-the-shelf" molecular dynamics engines (e.g. HOOMD-blue [ALT08], [GNA+15], LAMMPS [Pli95], GROMACS [BvdSvD95]) has made simulating complex systems possible across many scientific fields. Simulations of systems ranging from large biomolecules to colloids are now common, allowing researchers to ask new questions about reconfigurable materials [CDA+18] and develop coarse-graining approaches to access increasing timescales [SZR+19]. Various tools have arisen to facilitate the analysis of these simulations, many of which are immediately interoperable with the most popular simulation tools. The freud library is one such analysis package that differentiates itself from others through its focus on colloidal and nano-scale systems.
Due to their diversity and adaptability, colloidal materials are a powerful model system for exploring soft matter physics [GS07]. Such materials are also a viable platform for harnessing photonic [CDA+18], plasmonic [TCLC11], and other useful structurally derived properties. In colloidal systems, features like particle anisotropy play an important role in creating complex crystal structures, some of which have no atomic analogues [DEG12]. Design spaces encompassing wide ranges of particle morphology [DEG12] and interparticle interactions [AADG18] have been shown to yield phase diagrams filled with complex behavior.
The freud Python package offers a unique feature set that targets the analysis of colloidal systems. The library avoids trajectory management and the analysis of chemically bonded structures, which are the province of most other analysis platforms like MDAnalysis and MDTraj (see also Fig. 1) [MADWB11], [MBH+15]. In particular, freud excels at performing analyses based on characterizing local particle environments, which makes it a powerful tool for tasks such as calculating order parameters to track crystallization or finding prenucleation clusters. Among the unique methods present in freud are the potential of mean force and torque, which allows users to understand the effects of particle anisotropy on entropic self-assembly [vAAS+14], [vAKA+14], [KGG16], [HMA+15], [AAM+17], and various tools for identifying and clustering particles by their local crystal environments [TvAG19]. All such tasks are accelerated by freud's extremely fast neighbor finding routines and are automatically parallelized, making it an ideal tool for researchers performing peta- or exascale simulations of particle systems. The freud library's scalability is exemplified by its use in computing correlation functions on systems of over a million particles, calculations that were used to illuminate the elusive hexatic phase transition in two-dimensional systems of hard polygons [AAM+17]. More details on the use of freud can be found in [RDH+19]. In this paper, we will demonstrate that freud is uniquely well-suited to usage in the context of data pipelines for visualization and machine learning applications.

Data Pipelines
The freud package is especially useful because it can be organically integrated into a data pipeline. Many research tasks in computational molecular sciences can be expressed in terms of data pipelines; in molecular simulations, such a pipeline typically involves:
1) Generating an input file that defines a simulation.
2) Simulating the system of interest, saving its trajectory to a file.
3) Analyzing the resulting data by computing and storing various quantities.
4) Visualizing the trajectory, using colors or styles determined from previous analyses.

Fig. 1: Common Python tools for simulation analysis at varying length scales. The freud library is designed for nanoscale systems, such as colloidal crystals and nanoparticle assemblies. In such systems, interactions are described by coarse-grained models where particles' atomic constituents are often irrelevant and particle anisotropy (non-spherical shape) is common, thus requiring a generalized concept of particle "types" and orientation-sensitive analyses. These features contrast with the assumptions of most analysis tools designed for biomolecular simulations and materials science.

However, in modern workflows the lines between these stages are typically blurred, particularly with respect to analysis. While direct visualization of simulation trajectories can provide insights into the behavior of a system, integrating higher-order analyses is often necessary to provide real-time interpretable visualizations that allow researchers to identify meaningful features like defects and ordered domains of self-assembled structures. Studies of complex systems are also often aided or accelerated by a real-time coupling of simulations with on-the-fly analysis. This simultaneous usage of simulation and analysis is especially relevant because modern machine learning techniques frequently involve wrapping this pipeline entirely within a higher-level optimization problem; for instance, analysis methods can be used to construct objective functions targeting a specific materials design problem.
Below, we provide demonstrations of how freud can be integrated with popular tools in the scientific Python ecosystem like TensorFlow, Scikit-learn, SciPy, or Matplotlib. In the context of machine learning algorithms, we will discuss how the analyses in freud can reduce the 6N-dimensional space of particle positions and orientations into a tractable set of features that can be fed into machine learning algorithms. We will further show that freud can be used for visualizations even outside of scripting contexts, enabling a wide range of forward-thinking applications including Jupyter notebook integrations, versatile 3D renderings, and integration with various standard tools for visualizing simulation trajectories. These topics are aimed at computational molecular scientists and data scientists alike, with discussions of real-world usage as well as theoretical motivation and conceptual exploration. The full source code of all examples in this paper can be found online¹.

Performance and Integrability
Using freud to compute features for machine learning algorithms and visualization is straightforward because it adheres to a UNIX-like philosophy of providing modular, composable features. This design is evidenced by the library's reliance on NumPy arrays [Oli06] for all inputs and outputs, a format that is naturally integrated with most other tools in the scientific Python ecosystem. In general, the analyses in freud are designed around raw particle trajectories, meaning that the inputs are typically (N, 3) arrays of particle positions and (N, 4) arrays of particle orientations, and analyses that involve many frames over time use accumulate methods that are called once for each frame. This general approach enables freud to be used for a range of input data, including molecular dynamics and Monte Carlo simulations as well as experimental data (e.g. positions extracted via particle tracking) in both 3D and 2D. The direct usage of numerical arrays indicates a different usage pattern than that of tools such as MDAnalysis [MADWB11] and MDTraj [MBH+15], for which trajectory parsing is a core feature. Because many such tools can read simulation engines' output files, and certain formats like gsd² provide their own parsers, freud eschews any form of trajectory management and instead relies on other tools to provide input arrays. If input data is to be read from a file, binary data formats such as gsd or NumPy's npy or npz are strongly preferred for efficient I/O. Though it is possible to use a library like Pandas to load data stored in a comma-separated value (CSV) or other text-based data format, such files are often much slower when reading and writing large numerical arrays. Decoupling freud from file parsing and specific trajectory representations allows it to be efficiently integrated into simulations, machine learning applications, and visualization toolkits with no I/O overhead and limited additional code complexity, while the universal usage of NumPy arrays makes such integrations very natural.

1. https://github.com/glotzerlab/freud-examples
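As a minimal sketch of the array-based convention described above, the following uses only NumPy to write positions and orientations to a binary npz archive and read them back frame by frame. The file name, array keys, and random data are illustrative assumptions and are not taken from the freud examples; the point is only that each frame arrives as an (N, 3) array of positions and an (N, 4) array of orientations, ready to hand to an array-based analysis.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
n_frames, n_particles = 5, 1000

# Hypothetical trajectory: positions (x, y, z) and unit quaternions per particle.
positions = rng.uniform(-5.0, 5.0, size=(n_frames, n_particles, 3)).astype(np.float32)
orientations = rng.normal(size=(n_frames, n_particles, 4)).astype(np.float32)
orientations /= np.linalg.norm(orientations, axis=-1, keepdims=True)

# Binary formats store the arrays directly, so no text parsing is needed on load.
path = os.path.join(tempfile.mkdtemp(), "trajectory.npz")
np.savez(path, positions=positions, orientations=orientations)

with np.load(path) as data:
    loaded_positions = data["positions"]
    loaded_orientations = data["orientations"]

# Iterate over frames; an analysis would be called once per frame here.
frame_shapes = [
    (frame_positions.shape, frame_orientations.shape)
    for frame_positions, frame_orientations in zip(loaded_positions, loaded_orientations)
]
```

The same per-frame loop applies whichever reader supplies the arrays, which is what keeps the analysis step independent of the trajectory format.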
In keeping with this focus on composable features, freud also abstracts and directly exposes the task of finding particle neighbors, the task most central to all other analyses in freud. Since neighbor finding is a common need, the neighbor finding routines in freud are highly optimized and natively support periodic systems, a crucial feature for any analysis of particle simulations (which often employ periodic boundary conditions).
In figure 2, a comparison is shown between the neighbor finding algorithms in freud and SciPy [JOPo01]. For each system size, N particles are uniformly distributed in a 3D periodic cube such that each particle has an average of 12 neighbors within a distance of r_cut = 1.0. Neighbors are found for each particle by searching within the cutoff distance r_cut. The methods compared are scipy.spatial.cKDTree's query_ball_tree, freud.locality.AABBQuery's queryBall, and freud.locality.LinkCell's compute. The benchmarks were performed with 5 replicates on a 3.6 GHz Intel Core i3-8100B processor with 16 GB 2667 MHz DDR4 RAM.
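The benchmark setup above can be sketched with NumPy alone. The box length follows from requiring that the number density times the volume of a sphere of radius r_cut equals the target neighbor count, and neighbors are then found with the minimum image convention. The function names here are hypothetical, and the brute-force O(N^2) search is only a reference implementation for small systems; it is not one of the algorithms benchmarked in figure 2.

```python
import numpy as np


def box_length_for_neighbors(n_particles, r_cut, mean_neighbors):
    """Cubic box length giving `mean_neighbors` expected neighbors within r_cut."""
    # The number density rho satisfies rho * (4/3) * pi * r_cut**3 = mean_neighbors.
    sphere_volume = 4.0 / 3.0 * np.pi * r_cut**3
    return (n_particles * sphere_volume / mean_neighbors) ** (1.0 / 3.0)


def periodic_neighbor_pairs(positions, box_length, r_cut):
    """Brute-force neighbor search in a periodic cube (reference only)."""
    delta = positions[:, None, :] - positions[None, :, :]
    # Minimum image convention: shift each component by whole box lengths.
    delta -= box_length * np.round(delta / box_length)
    dist = np.linalg.norm(delta, axis=-1)
    np.fill_diagonal(dist, np.inf)  # exclude self-pairs
    return np.nonzero(dist < r_cut)


rng = np.random.default_rng(42)
n_particles, r_cut = 1000, 1.0
box_length = box_length_for_neighbors(n_particles, r_cut, mean_neighbors=12)
positions = rng.uniform(0.0, box_length, size=(n_particles, 3))

i, j = periodic_neighbor_pairs(positions, box_length, r_cut)
mean_count = len(i) / n_particles  # close to 12 for uniformly random points
```

The pairwise displacement array grows as N^2, which is exactly the cost that tree-based and cell-list methods avoid.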
Evidently, freud performs very well on this core task and scales well to larger systems. The parallel C++ backend implemented with Cython and Intel Threading Building Blocks makes freud perform quickly even for large systems [BBC+11], [Int18]. Furthermore, freud supports periodicity in arbitrary triclinic volumes, a common feature found in many simulations. This support distinguishes it from other tools like scipy.spatial.cKDTree, which only supports periodicity in rectangular (untilted) boxes. The fast neighbor finding in freud and the ease of integrating its outputs into other analyses not only make it easy to add fast new analysis methods into freud, but are also central to why freud can be easily integrated into workflows for machine learning and visualization.
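To illustrate what periodicity in a triclinic volume involves, the sketch below wraps displacement vectors by converting them to fractional coordinates of the box, removing whole lattice vectors, and converting back. The function name and the example box are assumptions made for illustration, and this is not freud's implementation. For strongly tilted boxes, rounding in fractional space yields a valid periodic image but not necessarily the shortest one, so a production implementation would also need to check neighboring images.

```python
import numpy as np


def wrap_triclinic(vectors, box_matrix):
    """Map displacement vectors to a periodic image near the origin.

    The columns of `box_matrix` are the lattice vectors of the box.
    """
    # Fractional coordinates f solve box_matrix @ f = v for each vector v.
    fractional = np.linalg.solve(box_matrix, vectors.T).T
    # Remove integer multiples of the lattice vectors.
    fractional -= np.round(fractional)
    # Convert back to Cartesian coordinates.
    return fractional @ box_matrix.T


# Hypothetical box: lattice vectors (10, 0, 0), (2, 10, 0), and (0, 0, 10).
box_matrix = np.array([
    [10.0, 2.0, 0.0],
    [0.0, 10.0, 0.0],
    [0.0, 0.0, 10.0],
])

# A small displacement offset by one copy each of the first two lattice vectors.
small = np.array([0.3, -0.2, 0.1])
shifted = small + box_matrix[:, 0] + box_matrix[:, 1]

wrapped = wrap_triclinic(shifted[None, :], box_matrix)[0]
```

In this example the wrapped vector recovers the small displacement, because the offset consists of whole lattice vectors.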

Machine Learning
A wide range of problems in soft matter and nano-scale simulations have been addressed using machine learning techniques, such as crystal structure identification [SG18]. In machine learning workflows, freud is used to generate features, which are then used in classification or regression models, clusterings, or dimensionality reduction methods. For example, Harper et al. used freud to compute the cubatic order parameter and generate high-dimensional descriptors of structural motifs, which were visualized with t-SNE dimensionality reduction [HWG19], [vdMH08]. The library has also been used in the optimization and inverse design of pair potentials [AADG18], to compute fitness functions based on the radial distribution function. The open-source pythia³ library offers a number of descriptor sets useful for crystal structure identification, leveraging freud for fast computations. Included among the descriptors in pythia are quantities based on bond angles and distances, spherical harmonics, and Voronoi diagrams.
Computing a set of descriptors tuned for a particular system of interest (e.g. using values of Q_l, the higher-order Steinhardt W_l parameters, or other order parameters provided by freud) is possible with just a few lines of code. Descriptors like these (exemplified in the pythia library) have been used with TensorFlow for supervised and unsupervised learning of crystal structures in complex phase diagrams [SG18], [AAB+15]. Another useful module for machine learning with freud is freud.cluster, which uses a distance-based cutoff to locate clusters of particles while accounting for 2D or 3D periodicity. Locating clusters in this way can identify crystalline grains, helpful for building a training set for machine learning models.
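The idea of distance-based clustering with periodicity can be sketched in plain NumPy: particles closer than a cutoff (under the minimum image convention) are linked, and linked particles are merged with a union-find structure. This is a hypothetical reference implementation for a cubic box with a brute-force distance matrix, not the algorithm or interface of freud.cluster.

```python
import numpy as np


def cluster_periodic(positions, box_length, r_cut):
    """Label particles so that any two closer than r_cut share a cluster ID."""
    n = len(positions)
    delta = positions[:, None, :] - positions[None, :, :]
    delta -= box_length * np.round(delta / box_length)  # minimum image convention
    close = np.linalg.norm(delta, axis=-1) < r_cut

    parent = np.arange(n)

    def find(a):
        # Follow parent links to the root, shortening the path along the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Merge the clusters of every linked pair (upper triangle avoids repeats).
    for a, b in zip(*np.nonzero(np.triu(close, k=1))):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    roots = np.array([find(a) for a in range(n)])
    # Relabel the roots as consecutive integers 0, 1, 2, ...
    _, labels = np.unique(roots, return_inverse=True)
    return labels


# Three particles straddle the periodic boundary in x; two sit near the center.
positions = np.array([
    [0.2, 5.0, 5.0],
    [9.8, 5.0, 5.0],
    [0.5, 5.0, 5.0],
    [5.0, 5.0, 5.0],
    [5.4, 5.0, 5.0],
])
labels = cluster_periodic(positions, box_length=10.0, r_cut=1.0)
# labels -> [0, 0, 0, 1, 1]
```

Without the minimum image step, the particle at x = 9.8 would be assigned to its own cluster, which is why periodicity matters when identifying grains that cross a box boundary.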
To demonstrate a concrete example, we focus on a common challenge in molecular sciences: identifying crystal structures. Recently, several approaches have been developed that use machine learning for detecting ordered phases [SCKL15], [SG18], [FSM19], [SNR83], [LD08]. The Steinhardt order parameters are often used as a structural fingerprint, and are derived from rotationally invariant combinations of spherical harmonics. In the example below, we create face-centered cubic (fcc), body-centered cubic (bcc), and simple cubic (sc) crystals with added Gaussian noise, and use Steinhardt order parameters with a support vector machine to train a simple crystal structure identifier. Steinhardt order parameters characterize the spherical arrangement of neighbors around a central particle, and combining values of Q_l for a range of l often gives a unique signature for simple crystal structures. This example demonstrates a simple case of how freud can be used to help solve the problem of structural identification, which often requires a sophisticated approach for complex crystals.
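To make the notion of a Q_l fingerprint concrete, the sketch below evaluates Q_l for the ideal nearest-neighbor shells of sc, fcc, and bcc lattices using only NumPy. It is not how freud computes the quantity: instead of evaluating spherical harmonics, it uses the spherical harmonic addition theorem, under which Q_l^2 equals the average of the Legendre polynomial P_l(cos γ_jk) over all ordered pairs (j, k) of neighbor bonds, where γ_jk is the angle between bonds j and k. The function name and the noise-free shells are assumptions made for illustration.

```python
import itertools

import numpy as np
from numpy.polynomial import legendre


def steinhardt_ql(bond_vectors, l):
    """Q_l for one particle from the vectors pointing to its neighbors."""
    unit = bond_vectors / np.linalg.norm(bond_vectors, axis=1, keepdims=True)
    cos_gamma = np.clip(unit @ unit.T, -1.0, 1.0)
    # Coefficient vector that selects the Legendre polynomial P_l.
    coeffs = np.zeros(l + 1)
    coeffs[l] = 1.0
    # Addition theorem: Q_l^2 = (1 / N_b^2) * sum over j, k of P_l(cos gamma_jk).
    ql_squared = legendre.legval(cos_gamma, coeffs).mean()
    return np.sqrt(max(ql_squared, 0.0))


grid = list(itertools.product((-1, 0, 1), repeat=3))
shells = {
    # 6 neighbors along the axes, 12 along face diagonals, 8 along body diagonals.
    "sc": np.array([v for v in grid if sum(map(abs, v)) == 1], dtype=float),
    "fcc": np.array([v for v in grid if sum(map(abs, v)) == 2], dtype=float),
    "bcc": np.array([v for v in grid if sum(map(abs, v)) == 3], dtype=float),
}

# One feature vector per structure, using l in {4, 6, 8, 10, 12}.
features = {
    name: [steinhardt_ql(shell, l) for l in (4, 6, 8, 10, 12)]
    for name, shell in shells.items()
}
q6 = {name: steinhardt_ql(shell, 6) for name, shell in shells.items()}
```

For these ideal shells, Q_6 is approximately 0.354 (sc), 0.575 (fcc), and 0.629 (bcc, first shell of 8 neighbors only). Thermal noise broadens these values into distributions, which is why several values of l are combined into a feature vector before classification.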
In figure 3, we show the distribution of Q_6 values for sample structures with 4000 particles. Here, we demonstrate how to compute the Steinhardt Q_6, using neighbors found via a periodic Voronoi diagram. Neighbors with small facets in the Voronoi polytope are filtered out to reduce noise.

Visualization
Fig. 4: UMAP of particle descriptors computed for simple cubic, body-centered cubic, and face-centered cubic structures of 4000 particles with added Gaussian noise. The particle descriptors include Q_l for l ∈ {4, 6, 8, 10, 12}. Some noisy configurations of bcc can be confused as fcc and vice versa, which accounts for the small number of errors in the support vector machine's test classification.

Many analyses performed by the freud library provide a plot(ax=None) method (new in v1.2.0) that allows their computed quantities to be visualized with Matplotlib. Additionally, these plottable analyses offer IPython representations, allowing Jupyter notebooks to render a graph such as a radial distribution function g(r) just by returning the compute object at the end of a cell. Analyses like the radial distribution function or correlation functions return data that is binned as a one-dimensional histogram; these are visualized with a line graph via matplotlib.pyplot.plot, with the bin locations and bin counts given by properties of the compute object. Other classes provide multi-dimensional histograms, like the Gaussian density or Potential of Mean Force and Torque, which are plotted with matplotlib.pyplot.imshow. The most complex case for visualization is that of per-particle properties, which also comprise some of the most useful features in freud. Quantities that are computed on a per-particle level can be continuous (e.g. Steinhardt order parameters) or discrete (e.g. clustering, where the integer value corresponds to a unique cluster ID). Continuous quantities can be plotted as a histogram over particles, but typically the most helpful visualizations use these quantities with a color map assigned to particles in a two- or three-dimensional view of the system itself. For such particle visualizations, several open-source tools exist that interoperate well with freud. Below are examples of how one can integrate freud with plato⁴, fresnel⁵, and OVITO⁶ [Stu10].
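As a sketch of what a binned one-dimensional analysis result looks like, the following computes a radial distribution function g(r) for a periodic cubic box with NumPy and returns the bin centers together with the normalized values. The function name and normalization details are assumptions made for illustration, not freud's implementation; the cutoff r_max is assumed to be at most half the box length so that the minimum image convention is valid.

```python
import numpy as np


def radial_distribution(positions, box_length, r_max, n_bins):
    """Return bin centers and g(r) for particles in a periodic cube."""
    n = len(positions)
    delta = positions[:, None, :] - positions[None, :, :]
    delta -= box_length * np.round(delta / box_length)  # minimum image convention
    # Keep each unique pair once (upper triangle of the distance matrix).
    dist = np.linalg.norm(delta, axis=-1)[np.triu_indices(n, k=1)]

    counts, edges = np.histogram(dist, bins=n_bins, range=(0.0, r_max))
    centers = 0.5 * (edges[:-1] + edges[1:])
    shell_volumes = 4.0 / 3.0 * np.pi * (edges[1:] ** 3 - edges[:-1] ** 3)
    density = n / box_length**3
    # Each unique pair counts toward both particles, hence the factor of 2.
    g = 2.0 * counts / (n * density * shell_volumes)
    return centers, g


# Uniformly random points (an ideal gas) should give g(r) close to 1.
rng = np.random.default_rng(7)
box_length = 10.0
positions = rng.uniform(0.0, box_length, size=(1000, 3))
centers, g = radial_distribution(positions, box_length, r_max=4.0, n_bins=20)
```

The two returned arrays are what a line graph needs: passing centers and g to matplotlib.pyplot.plot would draw the curve described above.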
plato is an open-source graphics package that expresses a common interface for defining two- or three-dimensional scenes which can be rendered as an interactive Jupyter widget or saved to a high-resolution image using one of several backends (PyThreejs, Matplotlib, fresnel, POVray⁷, and Blender⁸, among others). Below is an example of how to render particles from a HOOMD-blue snapshot, colored by the density of their local environment [ALT08].

fresnel⁹ is a GPU-accelerated ray tracer designed for particle simulations, with customizable material types and scene lighting, as well as support for a set of common anisotropic shapes. Its feature set is especially well suited for publication-quality graphics. Its use of ray tracing also means that an image's rendering time scales most strongly with the image size, instead of the number of particles, a desirable feature for extremely large simulations. An example of how to integrate fresnel is shown below and rendered in figure 6.

Conclusions
The freud library offers a unique set of high-performance algorithms designed to accelerate the study of nanoscale and colloidal systems. These algorithms are enabled by a fast, easy-to-use set of tools for identifying particle neighbors, a common first step in nearly all such analyses. The efficiency of both the core neighbor finding algorithms and the higher-level analyses makes them suitable for incorporation into real-time visualization environments, and, in conjunction with the transparent NumPy-based interface, allows integration into machine learning workflows using iterative optimization routines that require frequent recomputation of these analyses. The use of freud for real-time visualization has the potential to simplify and accelerate existing simulation visualization pipelines, which typically involve slower and less easily integrable solutions to performing real-time analysis during visualization. The application of freud to machine learning, on the other hand, opens up entirely new avenues of research based on treating well-known analyses of particle simulations as descriptors or optimization targets. In these ways, freud can facilitate research in the field of computational molecular science, and we hope these examples will spark new ideas for scientific exploration in this field.

Getting freud
The freud library is tested for Python 2.7 and 3.5+ and is compatible with Linux, macOS, and Windows. To install freud, execute

    conda install -c conda-forge freud

or

    pip install freud-analysis

Its source code is available on GitHub¹⁰ and its documentation is available via ReadTheDocs¹¹.

Fig. 2: Comparison of runtime for neighbor finding algorithms in freud and SciPy for varied system sizes.See text for details.

Fig. 3: Histogram of the Steinhardt Q_6 order parameter for 4000 particles in simple cubic, body-centered cubic, and face-centered cubic structures with added Gaussian noise.

Fig. 5: Interactive visualization of a Lennard-Jones particle system, rendered in a Jupyter notebook using plato with the pythreejs backend.

Fig. 7: A crystalline grain identified using freud's LocalDensity module and cut out for display using OVITO. The image shows a tP30-CrFe structure formed from an isotropic pair potential optimized to generate this structure [AADG18].

7. https://www.povray.org/
8. https://www.blender.org/

OVITO is a GUI application with features for particle selection, making movies, and support for many trajectory formats [Stu10]. OVITO has several built-in analysis functions (e.g. Polyhedral Template Matching), which complement the methods in freud. The Python scripting functionality built into OVITO enables the use of freud modules, demonstrated in the code below and shown in figure 7.