After decades of development, Molecular Dynamics (MD) has reached maturity. Millions of supercomputer hours are dedicated to collecting trajectories, producing an overwhelming amount of simulation data that the community struggles to manage. Without standardised data-sharing practices and appropriate infrastructure, much of this data is lost after limited analysis that reveals only a fraction of the valuable information it contains. The most significant obstacle to MD data sharing is the absence of interoperable standards and simulation ontologies, which hinders the ability to automatically find, access, interoperate with, and reuse data according to the FAIR (Findable, Accessible, Interoperable, and Reusable) principles.
BioExcel has partnered with the MDDB project to spearhead a community challenge aimed at implementing FAIR principles in biomolecular simulations. Building on the lessons learned from the design and development of the BioExcel-CV19 MD database, and leveraging the expertise of BioExcel partners along with feedback from the entire biomolecular simulation community, the collaboration developed a comprehensive proposal. This proposal defines FAIR MD data and outlines practical steps for implementation, illustrated by real case studies. More than 120 renowned researchers, including Nobel Laureates, contributed to this effort, resulting in a community white paper available on the arXiv repository. The list of authors reflects the broad consensus within the community on the key points discussed in the paper.
Implementing FAIR principles in biomolecular simulation data is essential for avoiding redundancy and saving valuable HPC computational resources. By making data easily findable and accessible, researchers can reuse existing datasets, reducing the need for resource-intensive re-runs. FAIR MD data is also vital for training AI models, optimising force fields and simulation protocols, and designing new coarse-grained and mesoscopic models. Additionally, large collections of MD data enable the generation of meta-trajectories by combining individual trajectories for the same system and facilitate meta-analyses to extract information from systems with shared characteristics (see Figure).
High-quality datasets that are interoperable and reusable promote seamless integration and validation, accelerate AI innovation, foster collaboration, and drive scientific discovery, ensuring transparency and reproducibility in research.
The MD data storage pipeline should include five stages: conducting high-quality simulations; performing rigorous quality-control checks; applying the FAIR principles so that data are findable, accessible, interoperable, and reusable; sharing the data openly to promote transparency and collaboration; and using the data to gain insights and advance research, for example through AI modelling methods.
Reusable and reproducible MD data save HPC computing time because the same calculations do not need to be re-run. In addition, stored MD (big) data is essential for training AI methods (e.g. generative models), parameterising new force fields, and developing new coarse-grained and mesoscopic models. Meta-trajectories and meta-analyses allow researchers to obtain rich information from systems with shared characteristics.
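To make the meta-trajectory idea concrete, the following is a minimal Python sketch of how independently deposited trajectories of the same system could be combined into one longer trajectory while recording which dataset each frame came from. It is an illustration under stated assumptions, not the MDDB or BioExcel implementation: the class names, the field names (`dataset_id`, `system_id`) and the in-memory coordinate lists are all hypothetical, and a real workflow would read trajectory files and compare topologies rather than only atom counts.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# One frame = one (x, y, z) tuple per atom. Hypothetical in-memory layout.
Frame = List[Tuple[float, float, float]]


@dataclass
class TrajectoryRecord:
    """A deposited trajectory plus the minimal metadata needed to group it.

    Field names are assumptions made for this sketch only.
    """
    dataset_id: str   # persistent identifier of the deposited dataset
    system_id: str    # identifier shared by simulations of the same system
    frames: List[Frame]


@dataclass
class MetaTrajectory:
    """Concatenated frames for one system, with per-frame provenance."""
    system_id: str
    frames: List[Frame] = field(default_factory=list)
    provenance: List[str] = field(default_factory=list)  # dataset_id per frame


def build_meta_trajectories(records: List[TrajectoryRecord]) -> Dict[str, MetaTrajectory]:
    """Group records by system_id and concatenate their frames in input order."""
    metas: Dict[str, MetaTrajectory] = {}
    for rec in records:
        meta = metas.setdefault(rec.system_id, MetaTrajectory(rec.system_id))
        for frame in rec.frames:
            # Refuse to merge frames whose atom count differs from the first
            # frame already stored for this system.
            if meta.frames and len(frame) != len(meta.frames[0]):
                raise ValueError(
                    f"{rec.dataset_id}: atom count {len(frame)} does not match "
                    f"system {rec.system_id} ({len(meta.frames[0])} atoms)"
                )
            meta.frames.append(frame)
            meta.provenance.append(rec.dataset_id)
    return metas


# Toy data: two replicas of one two-atom system and one unrelated system.
replica_a = TrajectoryRecord(
    "run-001", "systemA", [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)] for _ in range(3)]
)
replica_b = TrajectoryRecord(
    "run-002", "systemA", [[(0.0, 0.1, 0.0), (1.0, 0.1, 0.0)] for _ in range(2)]
)
other = TrajectoryRecord("run-003", "systemB", [[(0.0, 0.0, 0.0)]])

metas = build_meta_trajectories([replica_a, replica_b, other])
```

Keeping a per-frame provenance list means that any result derived from the combined trajectory can still be traced back to the original deposited dataset, which is the property that findable and reusable metadata are meant to provide.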