By Xavier Anthony Raj

This brief write-up provides an overview of the recent “How to run GROMACS efficiently on the LUMI supercomputer” workshop, held 24-25 January 2024, and highlights some of its key lessons.

The workshop kicked off with Rasmus Kronberg presenting a comprehensive lecture on the hardware architecture of the LUMI pre-exascale supercomputer. The talk provided insight into the hardware design factors that influence GROMACS performance on LUMI. This was followed by a detailed walk-through of a SLURM submit script custom-designed for running GROMACS on the LUMI platform. A highlight of this talk was the binding of tasks to resources, which takes LUMI’s CPU-GPU topology into account to achieve the best performance for GROMACS.
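
For illustration, a minimal LUMI-G batch script following this binding approach might look like the sketch below. This is an assumption-laden sketch, not the script used in the workshop: the project ID, module version, and input file are placeholders, and the core mapping follows the node topology described in the LUMI documentation.

```bash
#!/bin/bash
#SBATCH --partition=standard-g       # LUMI-G GPU partition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8          # one MPI rank per GCD (8 GCDs per node)
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00
#SBATCH --account=project_XXXXXXXXX  # placeholder project ID

module load GROMACS                  # exact module name/version is site-specific

# Wrapper exposing exactly one GPU (GCD) to each rank via its local task ID.
cat << 'EOF' > select_gpu
#!/bin/bash
export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"
EOF
chmod +x ./select_gpu

# Pin each rank to the CCD directly linked to its GCD, so every rank
# talks to the GPU closest to its cores.
CPU_BIND="map_cpu:49,57,17,25,1,9,33,41"

srun --cpu-bind=${CPU_BIND} ./select_gpu \
    gmx_mpi mdrun -nb gpu -s topol.tpr   # topol.tpr is a placeholder input
```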

Szilard Pall gave a detailed description of the heterogeneous parallelization scheme implemented in GROMACS for offloading different tasks onto GPU and CPU resources. He also explained how to assess GROMACS efficiency and fine-tune parameters to achieve optimum performance. The fundamental GROMACS algorithms have been redesigned to suit modern heterogeneous HPC architectures such as LUMI, whose compute nodes combine CPUs and GPUs.
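In practice, much of this tuning happens through mdrun’s offload flags. As a rough sketch (the values shown are common starting points on a LUMI-G node, not tuned optima, and topol.tpr is a placeholder input):

```bash
# Offload the short-range nonbonded, PME, bonded forces and the coordinate
# update/constraints to the GPU; 7 OpenMP threads per rank matches the
# 7 usable cores per GCD on a LUMI-G node.
gmx_mpi mdrun -nb gpu -pme gpu -bonded gpu -update gpu \
              -ntomp 7 -s topol.tpr
```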

Andrey Alekseenko provided a more technical description of the AMD GPU architecture and of the SYCL application programming interface (API), whose OpenSYCL implementation enables newer versions of GROMACS to run on AMD GPUs; GPU acceleration in GROMACS was originally developed for NVIDIA GPUs through the CUDA programming model. He also explained how tasks are scheduled between the GPU and the CPU. This technical insight is essential for understanding the GROMACS implementation on LUMI, whose GPU partition consists exclusively of AMD GPUs.
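As a rough illustration of what this looks like at build time (the exact options depend on the GROMACS and compiler versions used, so treat this as a sketch rather than a recipe), a SYCL build of GROMACS targeting LUMI’s AMD MI250X GPUs is configured along these lines:

```bash
# Configure GROMACS with the SYCL GPU backend via OpenSYCL (hipSYCL),
# targeting the gfx90a architecture of AMD MI250X GPUs.
cmake .. -DGMX_MPI=ON \
         -DGMX_GPU=SYCL \
         -DGMX_SYCL_HIPSYCL=ON \
         -DHIPSYCL_TARGETS='hip:gfx90a'
```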

To round it off, Alessandra Villa provided a detailed account of molecular dynamics simulations.

The practical part of the workshop consisted of well-designed hands-on exercises: starting from running a GROMACS simulation on a single CPU, gradually progressing towards running a simulation on a heterogeneous platform (a CPU-GPU node) using different permutations for offloading tasks, and exploring scaling across multiple GPUs through staged communication. This gave participants a clear perspective on how GROMACS performance varies between environments (homogeneous/CPU-only and heterogeneous/CPU-GPU) and with the fine-tuning of other parameters, such as the number of parallel tasks (threads) calculating the particle-particle (PP) interactions and the particle mesh Ewald (PME) summation.
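As an illustrative sketch of the kind of multi-GPU run explored in the exercises (the flag values are starting points rather than tuned optima, and topol.tpr is a placeholder input), a single-node run separating PP and PME ranks and enabling direct GPU communication might be launched as follows:

```bash
# Enable direct GPU-to-GPU communication for halo exchange and PME transfers
# (available in recent GROMACS versions; otherwise transfers are staged
# through the CPU).
export GMX_ENABLE_DIRECT_GPU_COMM=1

# 8 ranks on one node: 7 PP ranks plus 1 dedicated PME rank, with the
# nonbonded, PME and update tasks offloaded to the GPUs.
srun --ntasks=8 gmx_mpi mdrun -nb gpu -pme gpu -update gpu \
     -npme 1 -ntomp 7 -s topol.tpr
```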

Participants were actively involved during the lectures and hands-on exercises, as clearly seen from the questions they posted on the live document (HackMD) designated for the course. The feedback provided after the course clearly shows that the training was well received and appreciated. The workshop gave a clear understanding of how to run GROMACS efficiently on the LUMI platform, and we believe this knowledge will be very useful for optimal utilization of LUMI-G resources by the large GROMACS user community across the EU. We highly encourage GROMACS users to go through the training material and familiarize themselves with the GROMACS setup on LUMI; this will surely help them achieve optimum performance from the resources used.