Fast, flexible, free: GROMACS

Choosing the protonation state for (de)protonatable groups on proteins is problematic. Furthermore, most biomolecular systems show important changes in response to pH, and the local pH a molecule experiences can change as it moves through the system. Changes in protonation state can have rather large effects due to the local change in electrostatic potential. Dynamic protonation will enable simulating many interesting biological processes and also makes the initial choice of protonation states less critical or even superfluous. We have recently developed a very efficient algorithm for simulating dynamic protonation (doi:10.1021/acs.jctc.2c00516), which is available in a branch of GROMACS. It needs to be refactored and fully integrated before it can be part of an official GROMACS release.

The development of the feature is described at the GROMACS issue tracker on gitlab (https://gitlab.com/gromacs/gromacs/-/issues/4273).

Ensemble algorithms provide another level of parallelism, which is becoming critical for efficient usage of large HPC resources even for small systems such as proteins. While this is already supported in many ways in the main code base, the data analysis can often only be done using external tools, as is the case when constructing Markov state models (MSMs). Here, users would benefit from being able to use GROMACS tools directly to perform some of the tasks needed to construct those MSMs, such as efficient clustering and feature extraction, as well as from automatically setting up simulations through the Python API. We plan to invest development effort into streamlining those processes and integrating them into a new GROMACS tool that will take advantage of the improvements to the API and the analysis tool suite.
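
As a rough illustration of the kind of analysis such a tool could streamline, the sketch below clusters pre-extracted per-frame features with scikit-learn and builds a simple transition-count matrix; the feature file name is hypothetical and the workflow assumes a single continuous trajectory.

    import numpy as np
    from sklearn.cluster import KMeans

    # Per-frame feature vectors (e.g., backbone torsions) from one continuous
    # trajectory; "ensemble_features.npy" is a hypothetical, pre-extracted file.
    features = np.load("ensemble_features.npy")      # shape: (n_frames, n_features)
    n_states = 100
    labels = KMeans(n_clusters=n_states, n_init=10).fit_predict(features)

    # Count transitions at a chosen lag time and row-normalize into a transition matrix.
    lag = 10                                          # frames
    counts = np.zeros((n_states, n_states))
    for a, b in zip(labels[:-lag], labels[lag:]):
        counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    transition_matrix = counts / np.where(row_sums == 0, 1.0, row_sums)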

With the breakthrough of AI, machine learning potentials have quickly gained popularity. Such potentials can be trained on (expensive) ab-initio calculations, as well as on large ensembles of structures. These potentials are currently not supported by GROMACS’ standard pair-potential setup. There are two ways of leveraging the capabilities of GROMACS with machine learning potentials. One is adding an interface to (Py)Torch. This provides full flexibility and support for most potentials, but compared to the speed of GROMACS, (Py)Torch is orders of magnitude slower, which will limit the time scales that can be covered. A second approach is native GPU support for shallow neural networks. This has more limited flexibility, but will likely be orders of magnitude faster. We plan to pursue both approaches.
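
As a rough, hypothetical illustration of the first approach, the sketch below defines a shallow neural-network potential in PyTorch that predicts an energy from inverse pairwise distances and obtains forces via automatic differentiation; it is a toy model, not the planned GROMACS interface.

    import torch

    class ShallowPotential(torch.nn.Module):
        """Toy shallow network: energy from inverse pairwise distances."""
        def __init__(self, n_pairs, hidden=64):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Linear(n_pairs, hidden),
                torch.nn.Tanh(),
                torch.nn.Linear(hidden, 1),
            )

        def forward(self, coords):
            n = coords.shape[0]
            i, j = torch.triu_indices(n, n, offset=1)   # all unique atom pairs
            rij = (coords[i] - coords[j]).norm(dim=-1)  # pairwise distances
            return self.net(1.0 / rij).squeeze()        # toy descriptor: inverse distances

    n_atoms = 8
    model = ShallowPotential(n_pairs=n_atoms * (n_atoms - 1) // 2)
    coords = torch.randn(n_atoms, 3, requires_grad=True)
    energy = model(coords)
    forces = -torch.autograd.grad(energy, coords)[0]    # forces from autograd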

Force fields are continuously evolving and updated to account for the variety of complex biomolecular systems. The most recent versions are usually the ones recommended by the force field developers. Including newer force field versions in the default GROMACS distribution would make it easier for users to benefit from these developments. However, this comes with the burden of converting the force fields to a format suitable for GROMACS and subsequently testing the conversions to ensure correctness. In addition, it is also important to retain old versions of force fields for backward compatibility. A balance between usability and sustainability will have to be found, in dialogue with the force field developers and/or the contributors involved in the conversion process.

The particle-mesh Ewald algorithm is what limits the scaling of MD simulations. For CPU-only or combined CPU-GPU runs we have a relatively efficient MPI-parallel version of both the spread (atoms to grid) and gather (grid to atoms) operations as well as of the 3D FFT. However, as GPUs get faster, direct GPU communication becomes available, and HPC machines appear (e.g., LUMI) where the network is coupled directly to the GPU instead of the CPU, there are large benefits to running PME on multiple GPUs. To achieve this, we will first move the spread and gather kernels to the GPU, together with the grid-overlap communication, while keeping the FFTs on the CPU. As better parallel 3D-FFT GPU libraries become available, we will interface with those to achieve maximum performance on new (pre-)exascale machines such as LUMI. Prototype implementations are already available, but the code needs to be consolidated and tested.
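
For illustration, the NumPy sketch below shows the structure of the spread operation (with simple trilinear weights instead of the higher-order B-splines PME actually uses); the periodic wrap at the grid edges is exactly the part that turns into grid-overlap (halo) communication when the grid is distributed over multiple GPUs.

    import numpy as np

    def spread_charges(coords, charges, grid_dim, box):
        """Spread point charges onto a regular grid with trilinear weights.
        PME uses higher-order B-splines, but the data flow is the same."""
        grid = np.zeros(grid_dim)
        scaled = coords / box * grid_dim           # fractional grid coordinates
        base = np.floor(scaled).astype(int)        # lower-left grid cell per atom
        frac = scaled - base                       # position within that cell
        for offset in np.ndindex(2, 2, 2):         # the 2x2x2 surrounding grid points
            weight = np.prod(np.where(offset, frac, 1.0 - frac), axis=1)
            idx = (base + offset) % grid_dim       # periodic wrap; with a distributed grid
                                                   # this wrap becomes halo communication
            np.add.at(grid, tuple(idx.T), weight * charges)
        return grid

    coords = np.random.rand(1000, 3) * 4.0         # toy system in a 4 nm cubic box
    charges = np.random.choice([-1.0, 1.0], size=1000)
    grid = spread_charges(coords, charges, np.array([32, 32, 32]), box=4.0)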

Simulations of large systems pose particular challenges for simulation software. Large systems occur often in flow applications, but are also becoming more common in the biomolecular field with simulations of virus capsids and small cells. Large systems span larger coordinate ranges in space, which means less absolute precision is available. The main effect that deteriorates the quality of the results is larger noise due to rounding errors in the integration. One remedy is compiling GROMACS in double precision (for small systems single precision has proved sufficient), but that affects performance significantly and excludes the use of general-purpose GPUs (even GPUs that support double precision do so with a significant performance hit). A more elegant solution is storing the coordinates in fixed precision using 32-bit integers. This results in rounding errors that are about a factor of 100 lower, at little extra cost, and because the precision is fixed, the rounding errors are independent of the magnitude of the coordinates.
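
The small NumPy experiment below illustrates the point: the absolute rounding error of single-precision floats grows with the coordinate value, whereas a 32-bit fixed-point representation has a uniform quantization step across the box (the box size and scale factor are illustrative, not the exact GROMACS design).

    import numpy as np

    box = 1000.0                                    # nm, a very large system
    coords = np.random.rand(5, 3) * box             # double-precision reference coordinates

    # float32: absolute rounding error grows with the size of the coordinate
    err_float32 = np.abs(coords.astype(np.float32).astype(np.float64) - coords)

    # 32-bit fixed point: map the box onto the int32 range with a constant step
    scale = (2**31 - 1) / box                       # integer units per nm
    fixed = np.round(coords * scale).astype(np.int32)   # coordinates stored as 32-bit integers
    err_fixed = np.abs(fixed / scale - coords)

    print("max float32 error [nm]:    ", err_float32.max())  # roughly 3e-5 near x = 1000 nm
    print("max fixed-point error [nm]:", err_fixed.max())    # roughly 2e-7, uniform across the box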

We have previously worked on enabling the steps between pair search and domain decomposition partitioning to run almost entirely on the GPU. Only the temperature and pressure calculation and the thermo- and barostat integration run on the CPU, but these typically occur only every 10 steps. In the 2023 GROMACS release we increased this period to 100 steps by default. This removes the bottleneck where high-end GPUs must presently be paired with an expensive CPU, and it also removes the bottleneck of PCIe transfer overheads, particularly on high-end systems.

We would like to extend this to all major accelerator platforms, but that will likely require additional development resources. The remaining common operations should also be ported to the GPU, both for performance and to simplify the run schedule. This will provide a setup where special algorithms, such as density-driven fitting or other methods whose forces can be applied with multiple time stepping, need to be written only once, for the CPU, which minimizes developer time while still giving close to optimal total run performance.

Many knowledge-based force fields, particularly in materials science, produce interaction functions that are not based on an easily described mathematical formula. The potential and force for such interactions must be looked up from tables for each particle pair. The use of such kernels needs care because one must choose interpolation schemes that are able to run fast on the hardware, while providing a bounded error range.
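
As an illustration of the "fast with bounded error" requirement, the sketch below uses one common scheme, cubic Hermite interpolation from a table that stores both the potential and the force at each point, and checks the maximum error against the analytic function; the Lennard-Jones potential and table spacing are chosen only for the example.

    import numpy as np

    # Tabulate a Lennard-Jones potential and its force on an evenly spaced grid.
    r_table = np.linspace(0.3, 1.2, 901)                       # nm, spacing h = 0.001
    V_table = 4.0 * (r_table**-12 - r_table**-6)
    F_table = 4.0 * (12.0 * r_table**-13 - 6.0 * r_table**-7)  # F = -dV/dr

    def lookup(r, r0=0.3, h=0.001):
        """Cubic Hermite interpolation of V(r) from tabulated V and F = -V'."""
        x = (r - r0) / h
        i = np.floor(x).astype(int)
        t = x - i
        V0, V1 = V_table[i], V_table[i + 1]
        d0, d1 = -F_table[i] * h, -F_table[i + 1] * h          # derivatives on the unit interval
        h00 = 2*t**3 - 3*t**2 + 1
        h10 = t**3 - 2*t**2 + t
        h01 = -2*t**3 + 3*t**2
        h11 = t**3 - t**2
        return h00*V0 + h10*d0 + h01*V1 + h11*d1

    r_test = np.random.uniform(0.4, 1.1, 10000)
    exact = 4.0 * (r_test**-12 - r_test**-6)
    print("max interpolation error:", np.abs(lookup(r_test) - exact).max())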

To make use of such tabulated interactions, the topology description within GROMACS must be extended so that it is possible to choose tables for unique atom pairs, or for pairs of all types, etc. This would significantly enhance the usability of GROMACS for non-traditional applications, including materials science and special potentials currently only available in slow or non-scaling simulation codes.

To enable improved access to individual parts of GROMACS functionality, more of the internals will be exposed through stable interfaces at both the C++ and the Python level. The current versions of these features allow users to specify different workflows and dependencies at the Python level, as well as to use all of the GROMACS tools from Python. As the need for workflows, and especially ensemble simulation, is increasing, there is a strong need for more flexibility in the APIs. Many workflows and algorithms only need to change coordinates or a few parameters. It would be much more performant, as well as simpler for the user, if this could be done by modifying only the variables or parameters of interest while keeping all data in memory, rather than regenerating the whole run setup, as is currently often required. Having this functionality will allow scripting of more advanced workflows and algorithms instead of needing to code them inside GROMACS. This will significantly lower the barrier for trying new simulation approaches and also improve the maintainability of GROMACS.
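
A sketch of the kind of workflow this enables through the Python API (gmxapi) is shown below; the pattern follows the current gmxapi documentation, but exact function names and behaviour may differ between versions, and keeping all data in memory for such modifications is the goal described above rather than guaranteed current behaviour.

    # Sketch of parameter modification through the gmxapi Python interface.
    import gmxapi as gmx

    tpr_input = gmx.read_tpr("topol.tpr")                       # existing run input
    modified = gmx.modify_input(input=tpr_input,
                                parameters={"nsteps": 500000})  # change only one parameter
    md = gmx.mdrun(modified)
    md.run()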

In addition, future work will be aimed at stabilizing the exposed API at the C++ level by improving the modularization of the code, which is also a requirement for enabling changes in the state and parameters of the system.

While work during BioExcel has led to major improvements to the trajectory analysis tool suite and now allows replacing part of the legacy tools for writing modified trajectories and extracting information from them, there is still significant work to be done porting legacy tools to the new framework. Another focus point of future work is the ability to express the requirements of analysis tools regarding the format of the provided trajectory data in a programmatic way. This will ensure that users will no longer have to perform manual conversions between the results of a simulation and the input for the analysis. The project aims to fully implement those requirements for the existing tools and extend them to all other tools that are being ported to the new analysis suite.

With ever more components of GROMACS supporting GPU offloading, it can become challenging for end-users to configure a given simulation system for optimal performance on their particular hardware. There are simply too many combinations of options to try, and while BioExcel has developed short-run benchmarks, these may not always give a realistic picture, as performance characteristics also depend on the individual simulation data and settings.

This challenge should be addressed by a combination of pre-computed cross-over points and auto-tuning during simulation by dynamically moving computation between CPU and GPU depending on load and actual performance. This will help unlock the full GROMACS performance for every user.

GROMACS, like most molecular simulation packages, uses the particle-mesh Ewald (PME) algorithm to compute long-range electrostatic interactions. PME achieves high serial performance through the use of fast Fourier transforms (FFTs), but these strongly limit parallel scaling due to the large amount of communication they require.
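
Schematically (with prefactors, the self-energy term and excluded-pair corrections omitted), the Ewald splitting behind PME divides the Coulomb energy into a short-ranged real-space sum and a smooth reciprocal-space sum; PME evaluates the structure factor S(k) on a charge grid so that the reciprocal sum becomes FFTs, and it is these FFTs that require the communication-heavy all-to-all steps:

    E \;\approx\; \sum_{i<j} q_i q_j \,\frac{\mathrm{erfc}(\beta r_{ij})}{r_{ij}}
      \;+\; \sum_{\mathbf{k}\neq 0} \frac{e^{-k^{2}/4\beta^{2}}}{k^{2}}\,\bigl|S(\mathbf{k})\bigr|^{2},
      \qquad S(\mathbf{k}) = \sum_{j} q_j\, e^{\,i\mathbf{k}\cdot\mathbf{r}_j}

The first sum is local and parallelizes well; the second requires globally distributed grid data, which is why the FFT grid transposes dominate communication at scale.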

A hierarchical method such as the fast multipole method (FMM) has much lower communication needs and therefore scales better, which becomes critical when scaling large systems to millions of cores. A fundamental issue with FMM has been that it does not conserve energy, which is a requirement for molecular dynamics. We have recently developed and demonstrated improvements to FMM that achieve energy conservation, paving the way for the application of FMM to molecular dynamics. We are working with two leading groups in Europe and Japan to incorporate their FMM codes into GROMACS.

In addition to the core simulation engine and sampling algorithm libraries, GROMACS also packages numerous tools for preparing simulation topologies and conformations. Currently the interfaces to these are complex and some tools are slow. For many users, and especially novices, it would be useful to be able to generate a run setup by specifying only a PDB file, a force field and a simulation length. This would also simplify setting up pipelines for studies of many different systems. For standard run setups no further user interaction should be needed, and the system can be automatically solvated in water and neutralized with counter-ions. These changes will remove the need for users to interact directly with the legacy pdb2gmx conversion tool, which has been identified as one of the major hurdles for users starting new simulations with GROMACS.
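
The sketch below lists the individual preparation steps such an automated setup would wrap; the tool names and flags are those of current GROMACS, while the file names are illustrative and error handling is omitted.

    # The separate preparation steps that a future one-command setup tool could automate.
    import subprocess

    def run(args):
        subprocess.run(args, check=True)

    run(["gmx", "pdb2gmx", "-f", "protein.pdb", "-o", "conf.gro", "-p", "topol.top",
         "-ff", "amber99sb-ildn", "-water", "tip3p"])
    run(["gmx", "editconf", "-f", "conf.gro", "-o", "boxed.gro", "-d", "1.0", "-bt", "dodecahedron"])
    run(["gmx", "solvate", "-cp", "boxed.gro", "-cs", "spc216.gro", "-o", "solvated.gro", "-p", "topol.top"])
    run(["gmx", "grompp", "-f", "ions.mdp", "-c", "solvated.gro", "-p", "topol.top", "-o", "ions.tpr"])
    # genion prompts for the group of solvent molecules to replace with ions
    subprocess.run(["gmx", "genion", "-s", "ions.tpr", "-o", "neutral.gro",
                    "-p", "topol.top", "-neutral"], check=True, input=b"SOL\n")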

HADDOCK currently uses CNS as a backend for a large part of its functionality. It would be advantageous to replace parts of the HADDOCK workflow with an open-source backend, and GROMACS is the obvious candidate. HADDOCK needs many specific functionalities, in particular for restraints, which are not present in most molecular dynamics packages. In addition, HADDOCK will need a flexible selection syntax. To work towards HADDOCK’s use of GROMACS we plan to implement dynamic atom group selections, more restraint functionality and rigid-body dynamics.

In particular when combined with domain decomposition between compute nodes, the rapidly increasing core count makes it difficult to parallelize all tasks over all cores due to the limited amount of data on each node. To address this, GROMACS developers plan to reformulate the entire molecular dynamics algorithm into fine-grained task graphs where each task runs on only a subset of cores, communication is itself formulated as tasks, and dynamic data dependencies are tracked to avoid global barriers. Presently there are few, if any, task libraries that support the extremely low-latency tasks needed for such an approach. We are therefore both evaluating the new OpenMP 5.0 task features and working with two hardware vendors on co-design projects that involve future task libraries.
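
The toy Python sketch below only illustrates the scheduling idea, dependency-driven execution in which communication is itself a task and work starts as soon as its inputs are available, with no global barrier; it is not the planned GROMACS design, and the task names are purely illustrative.

    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

    tasks = {
        # name: (work function, names of tasks whose results it depends on)
        "nonbonded_local":  (lambda: "F_local",     []),
        "halo_exchange":    (lambda: "coords_halo", []),            # communication is a task too
        "nonbonded_nonloc": (lambda: "F_nonlocal",  ["halo_exchange"]),
        "bonded":           (lambda: "F_bonded",    []),
        "reduce_forces":    (lambda: "F_total",     ["nonbonded_local", "nonbonded_nonloc", "bonded"]),
        "integrate":        (lambda: "x_next",      ["reduce_forces"]),
    }

    done, futures = set(), {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        while len(done) < len(tasks):
            # launch every task whose dependencies have completed
            for name, (func, deps) in tasks.items():
                if name not in futures and all(d in done for d in deps):
                    futures[name] = pool.submit(func)
            # wait for at least one running task to finish, then record it as done
            finished, _ = wait({f for n, f in futures.items() if n not in done},
                               return_when=FIRST_COMPLETED)
            done.update(n for n, f in futures.items() if f in finished)
    print(futures["integrate"].result())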

Historically, GROMACS defined its own plain-text key-value input file, implementing its own parser and requiring users to understand how to use it. This was a good decision when it was made, but now that standard formats and libraries for handling structured text input files exist, we plan to transition to using these. The best candidate format is currently JSON. This will eliminate many kinds of possible bugs in the code and support the modularization efforts elsewhere. Users will be able to edit the structured files in a syntax-aware editor of their choice. Developers extending the functionality of GROMACS will only need to make changes in their new module, and not in multiple parts of the code.
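
Purely as an illustration of what such a structured input could look like, the sketch below writes a JSON run input whose keys simply mirror familiar .mdp options; the actual schema is an open design question.

    # Illustrative only: a structured replacement for the plain-text .mdp input.
    import json

    run_input = {
        "integrator": {"type": "md", "dt": 0.002, "nsteps": 500000},
        "output": {"nstxout-compressed": 5000, "compressed-x-grps": "Protein"},
        "electrostatics": {"coulombtype": "PME", "rcoulomb": 1.0},
        "temperature-coupling": {"tcoupl": "v-rescale", "ref-t": [300.0], "tau-t": [0.1]},
    }

    with open("run_input.json", "w") as f:
        json.dump(run_input, f, indent=2)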

Most GROMACS output formats are by now more than two decades old. Although they have served well, they are showing their age. A decade ago we developed the TNG trajectory format, which is more modern and self-describing, but maintaining a self-developed format is no longer ideal. Both for extensibility and for performance with large amounts of data, it is better to rely on available standards. The move to a new file format is far more a question of design choices than of the actual implementation. At the moment HDF5 looks like the best candidate to build on.
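
As an illustration only, the sketch below uses h5py to write an extensible, compressed, self-describing trajectory layout; the dataset names, layout and attributes are hypothetical and do not represent a GROMACS file-format specification.

    import h5py
    import numpy as np

    n_atoms = 10000
    with h5py.File("trajectory.h5", "w") as f:
        pos = f.create_dataset("positions", shape=(0, n_atoms, 3),
                               maxshape=(None, n_atoms, 3),      # extensible along the frame axis
                               chunks=(1, n_atoms, 3),
                               dtype="f4", compression="gzip")
        time = f.create_dataset("time", shape=(0,), maxshape=(None,), dtype="f8")
        pos.attrs["units"] = "nm"                                # self-describing metadata
        time.attrs["units"] = "ps"

        for step in range(5):                                    # append frames as they are produced
            frame = np.random.rand(n_atoms, 3).astype("f4")
            pos.resize(pos.shape[0] + 1, axis=0)
            time.resize(time.shape[0] + 1, axis=0)
            pos[-1] = frame
            time[-1] = step * 2.0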

We are continuously improving the GROMACS SIMD acceleration layer to target all major platforms. Recently, SIMD support was extended to a European RISC-V processor in a co-design effort within the EUPilot project, and algorithms are currently being re-evaluated to improve performance on the RISC-V vector unit. The code also takes advantage of GPU accelerators and now supports the calculation of almost all interaction types on CUDA-based platforms, and of non-bonded short-range interactions and PME long-range interactions on devices that support OpenCL.

GROMACS has support for non-NVIDIA GPUs through the OpenCL platform. However, this open standard has lost momentum, and the newer SYCL platform is instead being adopted by hardware vendors such as Intel. During BioExcel-2, the build system, basic infrastructure and non-bonded kernels for SYCL support were implemented. This was, and continues to be, a significant effort, as SYCL uses a completely different approach to asynchronous task scheduling than CUDA. The AMD GPU partition of the new pre-exascale system LUMI recently came online. GROMACS can now make use of the GPUs on LUMI, but the software stack is problematic: GROMACS currently relies on hipSYCL to run SYCL code on AMD GPUs. We are in discussion with AMD to set up a co-design effort to add a native HIP back-end to GROMACS.

For CUDA we have an ongoing co-design project to use CUDA graphs for task scheduling. This is a completely different approach to tasking than that used by current tasking frameworks. Another significant performance improvement can be achieved through device-initiated communication, using NVSHMEM in the case of CUDA, but this is also of interest to other vendors. Both approaches require considerable effort to get the scheduling correct. By removing the CPU from the scheduling, they can significantly reduce the overhead of scheduling and launching tasks, which is critical for improving the absolute performance of molecular simulations. Similar functionality is expected to be available in the next version of the SYCL standard, after which these improvements will be extended to a wider range of hardware.