Molecular Dynamics databases: at the cusp of an upward trajectory?

By Richard Norman and Adam Hospital

With the recent publication describing the BioExcel COVID-19 Molecular Dynamics (MD) trajectory database as a new paradigm for how these computational data are stored and made accessible to the community (1), we take a look at previous and current efforts, their importance, and why we believe that we may be at the cusp of an upward trajectory when it comes to investment and the value-add these resources will provide in future.

Our ability to conduct more complex and informative MD simulations has evolved dramatically since the first MD simulation of a small protein was reported in 1977. Traditionally, MD simulations were carried out, analysed, published and eventually discarded, without being shared or validated beyond the groups that performed them. During the early 2000s, BioSimGrid, P-found, Dynasome and Dynameomics emerged as some of the first examples of MD databases. Among these, Dynameomics stood out as it included the protocols and quality control procedures to ensure reliability and reproducibility of data for future use. The data in these databases were characterised by short simulations (around 30 ns) under near-physiological conditions and by simple force fields that are outdated by today’s standards; they were used to understand protein unfolding and the effects of single point mutations on protein structure.

The Molecular Dynamics Extended Library (MoDEL) database (2) and MDWeb (3) are examples of how MD databases have evolved beyond databases per se into extensive platforms which support the automatic setup of MD simulations and contain tools for validation, error detection and analysis of trajectories, a data warehouse, a webserver and related web applications. The MoDEL database contains longer trajectories (up to 1 µs) of cytoplasmic monomeric proteins generated via a robust and flexible workflow structure using a number of popular modern force fields (GROMOS, OPLS, AMBER, CHARMM) and MD codes (AMBER, NAMD, GROMACS). The integrated webserver facilitates access to the data and the analysis of protein motion and its solvent environment at various levels of resolution, which can be key in the study of protein flexibility, identification of cryptic pockets and small molecule binding using associated tools like GRID-MD (4). Additional specific databases exist, such as BigNASim for nucleic acids and MoDEL_CNS for proteins involved in Central Nervous System processes, as well as other databases for GPCRs, membrane proteins and the nucleosome.

The aforementioned BioExcel COVID-19 MD database is one of the latest efforts, storing more than 1500 MD simulations on a number of diverse SARS-CoV-2 protein units. The infrastructure incorporates data from multiple sources (research groups) and multiple methods, including current state-of-the-art enhanced sampling, biased and multiple replica simulations. In addition, thanks to the NoSQL MongoDB technology used in the backend, it allows programmatic access to the MD trajectories and meta-analyses across sets of MD simulations.
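To illustrate what document-style storage makes possible, here is a minimal sketch. It is not the BioExcel API: the records, the field names (`protein`, `method`, `length_ns`) and the helpers `matches` and `find` are hypothetical, and plain Python dictionaries stand in for documents in a MongoDB-like collection. The sketch filters simulations by their metadata and then computes an aggregate over the selected set, which is the kind of meta-analysis mentioned above.

```python
from statistics import mean

# Hypothetical metadata documents, standing in for entries in a document store.
simulations = [
    {"id": "sim-001", "protein": "spike", "method": "unbiased", "length_ns": 1000.0},
    {"id": "sim-002", "protein": "spike", "method": "replica-exchange", "length_ns": 200.0},
    {"id": "sim-003", "protein": "main protease", "method": "unbiased", "length_ns": 500.0},
    {"id": "sim-004", "protein": "spike", "method": "unbiased", "length_ns": 400.0},
]


def matches(doc, query):
    """Return True if every key/value pair in the query equals the document's value."""
    return all(doc.get(key) == value for key, value in query.items())


def find(collection, query):
    """Return all documents matching an equality-only query (hypothetical helper)."""
    return [doc for doc in collection if matches(doc, query)]


# Select unbiased spike simulations, then aggregate over the selected set.
selected = find(simulations, {"protein": "spike", "method": "unbiased"})
selected_ids = [doc["id"] for doc in selected]
mean_length_ns = mean(doc["length_ns"] for doc in selected)

print(selected_ids, mean_length_ns)
```

In a real deployment the filter would be evaluated by the database rather than in client code, but the shape of the request (a metadata query followed by an analysis over the matching trajectories) would be similar.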

Despite the progress being made in the development of MD databases, there is still work to be done to allow the storage and management of the large number of simulations of large macromolecular systems that are currently being produced by diverse groups across the community. It is unlikely that a single, centralized database for macromolecular MD trajectories, akin to the PDB for biomolecular structures, will exist in future: MD simulations can now write TBs of information per run, which makes the vision of a PDB-like database storing all simulations simply unrealistic. Instead, new projects should move to a distributed/federated infrastructure, made up of a number of interconnected nodes accessible from a single point, following in the steps of the genomics field.
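A back-of-the-envelope estimate shows how quickly a single run reaches the terabyte scale. The sketch below assumes uncompressed single-precision positions only (three 4-byte coordinates per atom per frame, no velocities, forces or compression); the system size, simulated time and saving interval are illustrative values chosen for the example, not figures from any of the databases discussed here.

```python
def trajectory_size_bytes(n_atoms, n_frames, bytes_per_coordinate=4):
    """Size of an uncompressed, positions-only trajectory (x, y, z per atom)."""
    return n_atoms * n_frames * 3 * bytes_per_coordinate


# Illustrative assumptions: a 1-million-atom system simulated for
# 1 microsecond, with one frame saved every 10 ps.
n_atoms = 1_000_000
simulated_ns = 1000
save_interval_ps = 10

n_frames = simulated_ns * 1000 // save_interval_ps  # ps simulated / ps per frame
size_tb = trajectory_size_bytes(n_atoms, n_frames) / 1e12

print(f"{n_frames} frames, about {size_tb:.1f} TB")
```

Under these assumptions one trajectory occupies roughly a terabyte before any replicas are counted. Compressed formats and sparser saving reduce the footprint, but the scaling with system size and simulated time remains.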

The need for open-access MD simulation data storage under FAIR principles is clear and undisputed, as highlighted by a recent example of MD trajectories of phosphatidylcholine lipid bilayers being used to benchmark conformational dynamics of various force fields (5). Generative AI algorithms which will access these data to further optimize existing force fields or generate new ones will rely on such databases. New coarse-grained and mesoscopic models will also be trained using these data. For biosimulation data to be widely used and accepted, akin to how fluid dynamics simulation data are currently relied on in engineering applications, we need to be transparent and rigorous with how these data are generated, stored and accessed. From this perspective it is encouraging that the development of MD databases like MDDB (6) is being funded at the European level.

The next few years will be key in understanding whether today’s MD databases will evolve into, or have an impact akin to that of, the PDB. Will they develop independently of each other, or should they be integrated in the PDBe-KB (7), thus providing a platform where users can access macromolecular structures, derived experimentally or via AI algorithms (e.g. AlphaFold), together with their associated simulation data to conduct further analyses of their dynamic behaviour in silico? This period will also define the path and future success of MD databases, where community buy-in, collaboration and additional funding will play fundamental roles. From our perspective we are at the cusp and the trajectory is upward.

(1) Beltran et al (2023) DOI: 10.1093/nar/gkad991

(2) Meyer et al (2010) DOI: 10.1016/j.str.2010.07.013

(3) Hospital et al (2012) DOI: 10.1093/bioinformatics/bts139

(4) Carrillo and Orozco (2008) DOI: 10.1002/prot.21592

(5) Antila et al (2021) DOI: 10.1021/acs.jcim.0c01299

(6) https://mddbr.eu/

(7) https://www.ebi.ac.uk/pdbe/pdbe-kb/