Simulation of biological macromolecules has evolved from a niche statistical-mechanics method into one of the most widely applied biophysical research tools, and is used far outside theoretical chemistry. Supercomputers are now as important as centrifuges or test tubes in chemistry. As showcased by the 2013 Nobel Prize in Chemistry, molecular dynamics based on statistical mechanics makes it possible to simulate the motions of atoms in realistic environments at room temperature, for systems ranging from materials chemistry to proteins, DNA, RNA and membranes containing millions of atoms. The fundamental algorithm of molecular dynamics evaluates forces on all atoms in a system and updates the velocities and positions of the atoms according to Newton’s equations of motion. This numerical integration scheme is iterated for billions of steps, and it generates a series of samples that describe the thermodynamic ensemble of the system. This is the true strength of the technique, since it predicts experiments: it can accurately describe how molecules such as proteins move, but it also enables the calculation of free energies that describe chemical reactions, for instance the binding free energy of a candidate drug compound in a protein active site or how a ligand will stabilize a particular conformation to open or close an ion channel. Since the calculation of forces is required for a large number of algorithms, several other packages use molecular simulation toolkits as libraries to evaluate energies, for instance in docking or when refining structures with experimental restraints such as X-ray, NMR, or cryo-EM data.
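The force-evaluation and update loop described above can be illustrated with a minimal sketch. It assumes a single particle in a one-dimensional harmonic potential and uses the velocity Verlet scheme, which is one common integrator and not necessarily the one used in any particular package. The names `harmonic_force`, `velocity_verlet` and `total_energy` are hypothetical and chosen only for this illustration.

```python
import math


def harmonic_force(x, k=1.0):
    """Force from the potential U(x) = 0.5 * k * x**2 (illustrative toy model)."""
    return -k * x


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equations of motion with the velocity Verlet scheme.

    Returns a list of (position, velocity) samples, one per step,
    including the initial state.
    """
    f = force(x)
    trajectory = [(x, v)]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # re-evaluate forces at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        trajectory.append((x, v))
    return trajectory


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the toy oscillator."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


# Assumed toy parameters: unit mass, unit force constant, 1000 steps of 0.01.
traj = velocity_verlet(x=1.0, v=0.0, force=harmonic_force,
                       mass=1.0, dt=0.01, n_steps=1000)
x_end, v_end = traj[-1]
print("final position:", x_end, "analytic:", math.cos(10.0))
print("energy drift:", total_energy(x_end, v_end) - total_energy(1.0, 0.0))
```

In a real simulation the force routine sums bonded and non-bonded interactions over all atoms, which is where nearly all the computational cost lies. The structure of the loop stays the same.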

This development would not have been possible without significant research efforts in simulation algorithms, optimization, and parallelization. The emergence of standardized packages for molecular modelling such as GROMACS, NAMD, AMBER, and CHARMM has been critical since they have helped commoditize simulation research, making the techniques available to life science application researchers who are not specialists in simulation development. All these packages have complementary strengths and profiles – the field has moved from historical competition to extensive sharing of ideas. GROMACS is one of the most widely used scientific software packages in the world with about 20,000 citations in total (Hess et al. 2008, Pronk et al. 2013); it is the largest free software and open source application in biomolecular research, and the only one of the major molecular dynamics simulation packages where development is led in Europe.

The GROMACS project started in 1995 as one of the first-ever parallel simulation codes. The international development team is led by the KTH partner, and the project is strongly focused on simulation efficiency and generality. It is the only package to support all common force fields and it has a very wide range of simulation algorithms. This, combined with the very liberal (and business-friendly) licensing, is likely a major reason why it is used as a simulation, minimization and energy evaluation library by several other applications, e.g. in bioinformatics or distributed computing projects such as Folding@Home. The code is portable to a very wide range of platforms (including embedded ones), and it includes manually tuned assembly kernels for a dozen different architecture instruction sets as well as accelerator support for Nvidia GPUs with CUDA, AMD GPUs with OpenCL, and Xeon Phi processors natively. The package uses state-of-the-art neutral territory domain decomposition and multi-level parallelization to enable both scaling to tens of thousands of nodes on supercomputers and efficient high-throughput computing with accelerators (Pall et al. 2014).

GROMACS can already use thousands of cores and hundreds of accelerators efficiently in parallel, even for a single, quite small system. When ensemble-level parallelization with Copernicus is added, the total problem scaling extends by another two orders of magnitude.

Molecular dynamics simulation in general, and GROMACS in particular, has made it possible to study large and complex biomolecular systems such as membranes and membrane proteins and probe atomic detail that is not accessible to any experimental methods. Molecular simulations provided some of the first high-resolution models of resting states of ion channels based on X-ray structures of open channels (Vargas et al. 2012), and they were critical to model transient intermediate conformations during structural transitions of membrane proteins (Henrion et al. 2012). GROMACS was also used to predict the first specific molecular recognition of lipids by membrane proteins (Contreras et al. 2012) and for the simulations that identified separate potentiating and inhibitory binding sites in the ligand-gated ion channels of our nervous system (Murail et al. 2012) – results that are now used by several groups in attempts to design better drugs.

In the context of BioExcel, both the KTH and MPG partners will contribute to improving the performance, scalability, quality and usability of both GROMACS and other simulation codes:

  1. QA, unit testing and a general library for biomolecular modelling.
    GROMACS will be turned into a state-of-the-art module-based C++ library with full unit testing and up-to-date user & developer documentation for all modules. The project is moving to a professional QA setup by introducing strict code review (including from the main developers) and automatic continuous integration where all patches are compiled and unit-tested on a wide range of hardware and compilers to QA-approve every single change, and to make it possible for any installation site to guarantee the quality of their compiled install.
  2. Heterogeneous parallelization.
    We will develop a new heterogeneous parallelization implementation where all available CPU, accelerator and communication resources are used in parallel on each node through explicit multithreading and multi-level load balancing, as well as new support for OpenCL and Xeon Phi accelerators in addition to CUDA.
  3. Efficient ensemble techniques.
    Some of the most powerful approaches today are based on using hundreds or thousands of simulations for ensemble sampling techniques such as Markov state models or free energy calculations. We will make these approaches accessible to users in general by fully integrating our Copernicus framework for ensemble simulation with GROMACS (Pronk et al. 2011). This will make it possible to formulate high-level sampling and free energy calculation problems as black-box computation problems that can employ hundreds of thousands of processors internally. This is particularly important for high-throughput free energy screening applications. Notably, the framework is not limited to GROMACS, but it can be used with any code.
  4. To facilitate exchange of data with other applications, and to enable fully automated high-throughput simulation, we are developing public data formats to describe molecules with XML, highly compressed trajectory formats that support digital hashes and signatures to guarantee data integrity, and new tools to automatically create interaction descriptions (topologies) for arbitrary small molecules, used e.g. as drug compounds. These tools target a number of different force fields such as CHARMM, GAFF, or OPLS-AA (Lundborg & Lindahl 2014).
  5. Some of the most promising potential applications of free energy calculation include the prediction of amino acid scanning experiments or how small molecules should be altered to improve binding. Currently, this is hampered by the requirement of either calculating absolute free energies for large changes (which causes large statistical errors), or manually designing topologies where residues or drugs are morphed directly into related molecules. As part of BioExcel, we will make free energy calculations applicable in these high-throughput settings by developing and integrating new modules to automatically morph any amino acid into others, and automatically turn drug compounds into related derivatives while keeping the perturbation as small as possible. In combination with automatic topology generation and ensemble simulation, this will turn molecular simulations into a tool that can screen molecular and binding stability in 24-48 hours, with large implications for drug design in the pharmaceutical industry.
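To illustrate why small perturbations matter for the free energy calculations discussed in the list above, the following minimal sketch estimates a free energy difference from sampled energy differences using exponential averaging (the Zwanzig relation, ΔG = -kT ln ⟨exp(-ΔU/kT)⟩). This is a textbook estimator used here only as an example; it is not presented as the method used in GROMACS or Copernicus. The function name `zwanzig_free_energy` and the synthetic Gaussian energy differences are assumptions made for this illustration.

```python
import math
import random

KB = 0.0083144626  # Boltzmann constant in kJ/(mol K)


def zwanzig_free_energy(delta_u, temperature):
    """Estimate a free energy difference (kJ/mol) by exponential averaging.

    delta_u holds energy differences U_B - U_A (kJ/mol) evaluated on
    configurations sampled in state A.
    """
    kt = KB * temperature
    scaled = [-du / kt for du in delta_u]
    # Log-sum-exp shift to avoid overflow for large energy differences.
    shift = max(scaled)
    mean_exp = sum(math.exp(s - shift) for s in scaled) / len(scaled)
    return -kt * (shift + math.log(mean_exp))


# Synthetic data (an assumption, not simulation output): Gaussian energy
# differences, for which the exact answer is mu - sigma**2 / (2 kT).
rng = random.Random(0)
temperature = 300.0
mu, sigma = 5.0, 1.0
samples = [rng.gauss(mu, sigma) for _ in range(20000)]

estimate = zwanzig_free_energy(samples, temperature)
expected = mu - sigma ** 2 / (2.0 * KB * temperature)
print("estimated:", estimate, "analytic:", expected)
```

The estimator converges quickly here because the spread of the energy differences is small compared to kT. For large perturbations the average is dominated by rare samples and the statistical error grows rapidly, which is the motivation for keeping each morphing step as small as possible.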