datreant: persistent, Pythonic trees for heterogeneous data

In science the filesystem often serves as a de facto database, with directory trees being the zeroth-order scientific data structure. But it can be tedious and error prone to work directly with the filesystem to retrieve and store heterogeneous datasets. datreant makes working with directory structures and files Pythonic with Treants: specially marked directories with distinguishing characteristics that can be discovered, queried, and filtered. Treants can be manipulated individually and in aggregate, with mechanisms for granular access to the directories and files in their trees. Disparate datasets stored in any format (CSV, HDF5, NetCDF, Feather, etc.) scattered throughout a filesystem can thus be manipulated as meta-datasets of Treants. datreant is modular and extensible by design to allow specialized applications to be built on top of it, with MDSynthesis as an example for working with molecular dynamics simulation data. http://datreant.org/


Introduction
In many scientific fields, especially those analyzing experimental or simulation data, there is an existing ecosystem of specialized tools and file formats which new tools must work around. Consequently, specialized database systems may be unsuitable for data management and storage. In these cases the filesystem ends up serving as a de facto database, with directory trees the zeroth-order data structure for scientific data. This is particularly true for fields centered around simulation: simulation systems can vary widely in size, composition, rules, parameters, and starting conditions. And with ever-increasing computational power, it is often necessary to store intermediate results from large amounts of simulation data so that they may be accessed and explored interactively.
These problems make data management difficult, and ultimately serve as a barrier to answering scientific questions. To address this, we present datreant, a Pythonic interface to the filesystem. datreant deals primarily in Treants: specially marked directories with distinguishing characteristics that can be discovered, queried, and filtered. Treants can be manipulated individually and in aggregate, with mechanisms for granular access to the directories and files in their trees. By way of Treants, datreant adds a lightweight abstraction layer to the filesystem, allowing researchers to focus more on what is stored and less on where. This greatly reduces the tedium of storing, retrieving, and operating on datasets of interest, no matter how they are organized.

Treants as filesystem manipulators
The central object of datreant is the Treant. A Treant is a directory in the filesystem that has been specially marked with a state file. A Treant is also a Python object. We can create a Treant with:

>>> import datreant.core as dtr
>>> t = dtr.Treant('maple')
>>> t
<Treant: 'maple'>

This creates a directory maple/ in the filesystem (if it did not already exist), and places a special state file inside which stores the Treant's state. This file also serves as a flagpost indicating that this is more than just a directory. The name of this file includes the type of Treant to which it corresponds, as well as the uuid of the Treant, its unique identifier. The state file contains all the information needed to generate an identical instance of this Treant, so that we can start a separate Python session and immediately use the same Treant there.
Internally, advisory locking is done to avoid race conditions, making a Treant multiprocessing-safe. A Treant can also be moved, either locally within the same filesystem or to a remote filesystem, and it will continue to work as expected.
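The idea of a state file that acts as both marker and persistent store can be sketched with the standard library alone. The snippet below is not datreant's implementation: the `Treant.<uuid>.json` naming, the JSON layout, and the helper name `open_marked_dir` are assumptions made for illustration, and advisory locking is left out.

```python
import glob
import json
import os
import tempfile
import uuid


def open_marked_dir(path, kind="Treant"):
    """Create `path` if needed and return (state_file, state).

    Hypothetical layout: one JSON file named '<kind>.<uuid>.json'
    marks the directory and holds its state.
    """
    os.makedirs(path, exist_ok=True)
    existing = glob.glob(os.path.join(path, kind + ".*.json"))
    if existing:
        # Directory is already marked: regenerate the state from the file.
        with open(existing[0]) as f:
            return existing[0], json.load(f)
    # First use: write a fresh state file carrying a new unique identifier.
    state = {"kind": kind, "uuid": str(uuid.uuid4()), "tags": [], "categories": {}}
    statefile = os.path.join(path, "{}.{}.json".format(kind, state["uuid"]))
    with open(statefile, "w") as f:
        json.dump(state, f)
    return statefile, state


root = tempfile.mkdtemp()
first_file, first = open_marked_dir(os.path.join(root, "maple"))
# A second call (or a separate Python session) recovers the same identity.
second_file, second = open_marked_dir(os.path.join(root, "maple"))
```

Because the identifier lives in the directory rather than in the Python process, moving the directory carries the identity with it.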

Introspecting a Treant's Tree
A Treant can be used to introspect and manipulate its filesystem tree. We can, for example, work with directory structures rather easily:

>>> data = t['a/place/for/data/']
>>> data
<Tree: 'maple/a/place/for/data/'>

This Tree object points to a path in the Treant's own tree, but it need not necessarily exist. We can check this with:

>>> data.exists
False

This behavior is by design for Tree objects (as well as Leaf objects; see below). We want to be able to work freely with paths without creating filesystem objects for each, at least until we are ready.
We can make a Tree exist in the filesystem easily enough:

>>> data.makedirs()

Further directories can be made in the same way. Using Treant, Tree, and Leaf objects, we can work with the filesystem Pythonically without giving much attention to precisely where these objects live within that filesystem. This becomes especially powerful when we have many directories/files we want to work with, possibly in many different places.
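The same "path first, filesystem object later" pattern is available in the standard library's pathlib, which gives a rough feel for how Tree and Leaf objects behave. This is a sketch of the idea only, not of datreant's classes; the file name `notes.txt` is a placeholder.

```python
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp()) / "maple"
data = root / "a" / "place" / "for" / "data"

# Building the path touches nothing on disk.
existed_before = data.exists()

# Create the directory (and any missing parents) only when ready.
data.mkdir(parents=True, exist_ok=True)
existed_after = data.exists()

# A file path under the directory behaves the same way.
leaf = data / "notes.txt"
leaf_existed_before = leaf.exists()
leaf.write_text("sample text")
```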

Aggregation and splitting on Treant metadata
What makes a Treant distinct from a Tree is its state file. This file stores metadata that can be used to filter and split Treant objects when treated in aggregate. It also serves as a flagpost, making Treant directories discoverable.
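Discoverability follows directly from the marker file: a directory walk only has to look for files whose names match the state-file pattern. The sketch below assumes a hypothetical `Treant.*.json` naming and a helper called `find_marked_dirs`; neither is taken from datreant itself.

```python
import fnmatch
import os
import tempfile


def find_marked_dirs(root, pattern="Treant.*.json"):
    """Return sorted directories under `root` holding a file matching `pattern`."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        if fnmatch.filter(filenames, pattern):
            found.append(dirpath)
    return sorted(found)


root = tempfile.mkdtemp()
layout = [("an/elm", True), ("the/oldest/oak", True), ("plain/dir", False)]
for rel, marked in layout:
    path = os.path.join(root, *rel.split("/"))
    os.makedirs(path)
    if marked:
        # An empty placeholder stands in for a real state file.
        open(os.path.join(path, "Treant.0000.json"), "w").close()

marked_dirs = [os.path.relpath(p, root) for p in find_marked_dirs(root)]
```

Unmarked directories such as `plain/dir` are passed over, which is what distinguishes a marked directory from an ordinary one in this scheme.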
If we have many more Treants, perhaps scattered about the filesystem, we can work with them in aggregate as a Bundle. A Bundle can be constructed in a variety of ways, most commonly using existing Treant instances or paths to Treants in the filesystem.
We can use a Bundle to subselect Treants in typical ways, including integer indexing and slicing, fancy indexing, boolean indexing, and indexing by name. But in addition to these, we can use metadata features such as tags and categories to filter and group Treants as desired.
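The indexing styles listed above can be illustrated with a small stand-in class. `MiniBundle` is hypothetical and holds plain name strings rather than Treants; it only shows how one `__getitem__` can dispatch on the type of the index, and makes no claim about how Bundle is written.

```python
class MiniBundle:
    """Ordered collection of named members with several indexing styles."""

    def __init__(self, names):
        self.names = list(names)

    def __getitem__(self, index):
        if isinstance(index, bool):
            raise TypeError("a single bool is not a valid index")
        if isinstance(index, int):
            return self.names[index]              # integer indexing
        if isinstance(index, slice):
            return MiniBundle(self.names[index])  # slicing
        if isinstance(index, str):
            # Indexing by name: every member with that name.
            return MiniBundle(n for n in self.names if n == index)
        index = list(index)
        if index and all(isinstance(i, bool) for i in index):
            if len(index) != len(self.names):
                raise IndexError("boolean index must match bundle length")
            # Boolean indexing: keep members where the mask is True.
            return MiniBundle(n for n, keep in zip(self.names, index) if keep)
        # Fancy indexing: pick members by a list of positions.
        return MiniBundle(self.names[i] for i in index)


# 'oak' and 'elm' are made-up member names for the example.
b = MiniBundle(["maple", "oak", "elm", "sequoia"])
second = b[1]
middle = b[1:3].names
picked = b[[3, 0]].names
masked = b[[True, False, True, False]].names
named = b["elm"].names
```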

Filtering Treants with tags
Tags are individual strings that describe a Treant. Tags can be set for each of our Treants separately.
When grouping on categories, we get only a single member for the pair of keys ('fibrous', 'california') since 'sequoia' is the only Treant having the 'home' category. Categories are useful as labels to denote the types of data that a Treant may contain or how the data were obtained. By leveraging the groupby method, one can extract Treants by selected categories without having to explicitly access each member. This feature can be particularly powerful in cases where many Treants have been created and categorized to handle incoming data over an extended period of time; one can quickly gather any data needed without having to think about low-level details.
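A minimal sketch of tag filtering and category grouping, using plain dictionaries in place of Treants. The member metadata, the category name 'bark', and the helper names `with_tags` and `group_by_categories` are invented for illustration; the rule that members lacking a requested category are left out of the grouping is inferred from the 'sequoia' example in the text.

```python
from collections import defaultdict

# Hypothetical metadata; only 'sequoia' has a 'home' category.
members = {
    "maple": {"tags": {"deciduous", "syrup"}, "categories": {"bark": "fibrous"}},
    "oak": {"tags": {"deciduous"}, "categories": {"bark": "mossy"}},
    "sequoia": {
        "tags": {"evergreen"},
        "categories": {"bark": "fibrous", "home": "california"},
    },
}


def with_tags(members, *tags):
    """Names of members carrying every one of the given tags."""
    return sorted(n for n, m in members.items() if set(tags) <= m["tags"])


def group_by_categories(members, keys):
    """Group member names by their values for the category names in `keys`.

    Members lacking any of the names are left out, so each group key is a
    complete tuple of category values.
    """
    groups = defaultdict(list)
    for name, meta in sorted(members.items()):
        cats = meta["categories"]
        if all(k in cats for k in keys):
            groups[tuple(cats[k] for k in keys)].append(name)
    return dict(groups)


deciduous = with_tags(members, "deciduous")
by_bark = group_by_categories(members, ["bark"])
by_bark_home = group_by_categories(members, ["bark", "home"])
```

Grouping on 'bark' alone places every member in a group, while adding 'home' narrows the result to the one member that defines both.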

Treant modularity with attachable Limbs
Treant objects manipulate their tags and categories using Tags and Categories objects, respectively. These are examples of Limb objects: attachable components which serve to extend the capabilities of a Treant. While Tags and Categories are attached by default to all Treant objects, custom Limb subclasses can be defined for additional functionality. datreant is a namespace package, with the dependency-light core components included in datreant.core. The dependencies of datreant.core include backports of standard library modules such as pathlib and scandir, as well as lightweight modules such as fuzzywuzzy and asciitree.
datreant.core remains lightweight because other packages in the datreant namespace can have any dependencies they require. One such package is datreant.data, which includes a set of convenience Limb objects for storing and retrieving Pandas and NumPy [vdW11] datasets in HDF5 using PyTables and h5py internally.
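The attachable-component idea behind Limb objects can be sketched in a few lines. Everything here is hypothetical (the `Host` stand-in, the `Notes` limb, and the `attach` signature); it shows one way an object can gain a named component at runtime, not how datreant wires its limbs.

```python
class Limb:
    """Base class for attachable components; `name` is the attribute used."""

    name = None

    def __init__(self, owner):
        self.owner = owner


class Notes(Limb):
    """Hypothetical custom limb keeping free-form notes in the owner's state."""

    name = "notes"

    def add(self, text):
        self.owner.state.setdefault("notes", []).append(text)

    def __iter__(self):
        return iter(self.owner.state.get("notes", []))


class Host:
    """Stand-in for an object that limbs can be attached to."""

    def __init__(self):
        self.state = {}

    def attach(self, limb_class):
        # Expose the limb as an attribute named after the limb.
        setattr(self, limb_class.name, limb_class(self))


host = Host()
host.attach(Notes)
host.notes.add("collected in autumn")
```

Keeping the limb's data in the owner's state, rather than on the limb itself, means the component can be re-attached later without losing anything.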
We can attach a Data limb to a Treant and use it to store a dataset by name. Looking at the directory structure of "maple", we see that the data was stored in an HDF5 file under a directory corresponding to the name we stored it with.
What's more, datreant.data also includes a corresponding AggLimb for Bundle objects, allowing for automatic aggregation of datasets by name across all member Treant objects. If we collect and store similar datasets for each member in our Bundle, we can retrieve them in aggregate, which we can use for aggregated analysis, or perhaps just pretty plots (Figure 1).
>>> for name, group in sines.groupby(level=0):
...     s = group.reset_index(level=0, drop=True)
...     s.plot(legend=True, label=name)

The Data limb stores Pandas and NumPy objects in the HDF5 format within a Treant's own tree. It can also store arbitrary (but pickleable) Python objects as pickles, making it a flexible interface for quick data storage and retrieval. However, it ultimately serves as an example for how Treant and Bundle objects can be extended to do complex but convenient things.
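Storage by name inside a tree, and retrieval by name across many trees, can be sketched with pickles and the standard library. The `<tree>/<name>/data.pkl` layout and the helper names are assumptions made for this example; datreant.data uses HDF5 for Pandas and NumPy objects and its actual file layout may differ.

```python
import os
import pickle
import tempfile


def store(tree, name, obj):
    """Pickle `obj` under '<tree>/<name>/data.pkl' (hypothetical layout)."""
    directory = os.path.join(tree, name)
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, "data.pkl"), "wb") as f:
        pickle.dump(obj, f)


def retrieve(tree, name):
    """Load the object stored under `name`, or None if it is absent."""
    path = os.path.join(tree, name, "data.pkl")
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def retrieve_all(trees, name):
    """Aggregate a dataset by name across trees, keyed by directory name."""
    found = {}
    for tree in trees:
        obj = retrieve(tree, name)
        if obj is not None:
            found[os.path.basename(tree)] = obj
    return found


root = tempfile.mkdtemp()
trees = [os.path.join(root, n) for n in ("maple", "sequoia")]
store(trees[0], "heights", [1.0, 2.5])   # made-up sample values
store(trees[1], "heights", [80.0])
heights = retrieve_all(trees, "heights")
missing = retrieve(trees[0], "widths")
```

Keying the aggregate by member name mirrors the way the `sines` dataset above is grouped on its outer index level.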

Using Treants as the basis for dataset access and manipulation with the PyData stack
Although it is possible to extend datreant objects with limbs to do complex operations on a Treant's tree, it isn't necessary to build specialized interfaces such as these to make use of the extensive PyData stack. datreant fundamentally serves as a Pythonic interface to the filesystem, bringing value to datasets and analysis results by making them easily accessible now and later.
As data structures and file formats change, datreant objects can always be used in the same way to supplement the way these tools are used.
Because each Treant is both a Python object and a filesystem object, Treants work remarkably well with distributed computation libraries such as dask.distributed [Roc15] and workflow execution frameworks such as Fireworks [Jai15]. Treant metadata features such as tags and categories can be used for automated workflows, including backups and remote copies to external compute resources, making work on datasets less imperative and more declarative when desired.

Building domain-specific applications on datreant
Built-in datreant.core objects are general-purpose, while packages like datreant.data provide extensions to these objects that are more specific. But it is possible, and very useful, for domain-specific applications to define their own domain-specific Treant subclasses, with tightly-coupled limbs for domain-specific needs. Not only do objects such as Bundle work just fine with Treant subclasses and custom Limb classes; they are designed explicitly with this need in mind.
The first example of a domain-specific package built around datreant is MDSynthesis, a module that enables high-level management and exploration of molecular dynamics simulation data. MDSynthesis gives a Pythonic interface to molecular dynamics trajectories using MDAnalysis [MiA11], giving the ability to work with the data from many simulations scattered throughout the filesystem with ease. This package makes it possible to write analysis code that can work across many varieties of simulation, but even more importantly, MDSynthesis allows interactive work with the results from hundreds of simulations at once without much effort.

Leveraging molecular dynamics data with MDSynthesis
MDSynthesis defines a Treant subclass called a Sim. A Sim features special limbs for storing an MDAnalysis Universe definition and custom atom selections within its state file, allowing for painless recall of raw simulation data and groups of atoms of interest.
As an example of effectively using Sims, say we have 50 biased molecular dynamics simulations that sample the conformational change of the ion transport protein NhaA [Lee14] from the inward-open to outward-open state (Figure 2). Let's also say that we are interested in how many hydrogen bonds exist at any given time between the two domains as they move past each other. These Sim objects already exist in the filesystem, each having a Universe definition already set to point to its unique trajectory file(s).
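To make the idea of keeping a Universe definition and named atom selections in a state file concrete, here is a sketch using JSON. The key names, the file names (`system.gro`, `run1.xtc`, `run2.xtc`), and the helper functions are all hypothetical and do not reflect the MDSynthesis state-file schema; the selection string is written in the style of the MDAnalysis selection language.

```python
import json
import os
import tempfile


def save_sim_state(statefile, topology, trajectories, selections):
    """Write a hypothetical Sim-like state: file paths plus named selections."""
    state = {
        "universe": {"topology": topology, "trajectories": list(trajectories)},
        "selections": dict(selections),
    }
    with open(statefile, "w") as f:
        json.dump(state, f, indent=2)


def load_sim_state(statefile):
    """Read the state back, e.g. in a later session."""
    with open(statefile) as f:
        return json.load(f)


statefile = os.path.join(tempfile.mkdtemp(), "Sim.state.json")
save_sim_state(
    statefile,
    topology="system.gro",
    trajectories=["run1.xtc", "run2.xtc"],
    selections={"backbone": "protein and backbone"},
)
state = load_sim_state(statefile)
```

Storing only paths and selection strings keeps the state file small; the raw trajectory data stays where it is and is opened on demand.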
We can use the MDAnalysis HydrogenBondAnalysis class to collect the data for each Sim, applying a function get_hbonds with Bundle.map for process parallelism and storing the results using the datreant.data limb:

# process parallelism provided internally
# with `multiprocessing`
b.map(get_hbonds, processes=16)

Then we can retrieve the datasets in aggregate using the Bundle datreant.data limb and visualize the result (Figure 3).
By making it relatively easy to work with what can often be many terabytes of simulation data spread over tens or hundreds of trajectories, MDSynthesis greatly reduces the time it takes to iterate on new ideas toward answering real biological questions.
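The mapping pattern itself can be sketched with the standard library. `map_members` and `count_characters` are hypothetical names, the paths are placeholders, and this is not how Bundle.map is implemented; the sketch only shows a function applied to every member, serially by default and through a `multiprocessing` pool when a process count is given.

```python
import multiprocessing


def map_members(func, members, processes=None):
    """Apply `func` to each member, optionally across worker processes.

    With `processes` left as None the work runs serially. Otherwise a
    process pool is used, so `func` and the members must be picklable
    (for example, a module-level function applied to path strings).
    """
    members = list(members)
    if processes is None:
        return [func(m) for m in members]
    with multiprocessing.Pool(processes) as pool:
        return pool.map(func, members)


def count_characters(path):
    """Placeholder for a per-member analysis; returns the path length."""
    return len(path)


paths = ["sims/a", "sims/bb", "sims/ccc"]
serial = map_members(count_characters, paths)
# In a script, place a parallel call such as
# map_members(count_characters, paths, processes=4)
# under an `if __name__ == "__main__":` guard.
```

Passing paths rather than live objects to the workers keeps the pickled payload small, which fits the notion that a member can be regenerated from its directory in any process.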

Final thoughts
datreant is a young project that started as a domain-specific package for working with molecular dynamics data, but has quickly morphed into a powerful, general-purpose tool for managing and manipulating filesystems and the data spread about them. The dependency-light datreant.core package is pure Python, BSD-licensed, and openly developed, and the datreant namespace is designed to support useful extensions to the core objects. It is the hope of the authors that datreant continues to grow in a way that benefits the wider scientific community, smoothing the common pain point of data glut and filesystem management.