Biomolecules are hardly monogamous; studying their interactions at atomic resolution is therefore fundamental to understanding their functions, designing inhibitors or drugs that can modulate their activity, and rationalizing the effect of genetic mutations. High-throughput experimental techniques generate a wealth of qualitative and quantitative data, but the structural dimension is often missing, which calls for complementary modelling approaches. Moreover, the large number of interactions translates into an even larger amount of data, which requires HPC and HTC solutions, automated workflows and cutting-edge technologies for the interactive and integrative manipulation, analysis and visualization of the data.

HADDOCK (High Ambiguity Driven protein-protein DOCKing) is a holistic and versatile information-driven docking software for the modelling of biomolecular interactions. It differs from ab initio docking approaches in that it can integrate various sources of information derived from biochemical, biophysical or bioinformatics methods to enhance sampling, scoring, or both. HADDOCK also allows direct and flexible modelling of large assemblies consisting of up to six different molecules, which, together with its rich data support, provides a truly integrative modelling platform.

Molecular dynamics simulation is one of the most popular methods for studying, at the atomic level, the motions of large and complex biomolecular systems in realistic environments at room temperature. This technique is used to investigate the thermodynamic ensemble of a system and to understand its dynamics, and it can accurately predict binding free energies for protein-small ligand complexes.

What are we doing in BioExcel?

Bridging these two methods will enhance the automated modelling of biomolecular interactions, leveraging the performance of the HPC/HTC infrastructure supported by BioExcel. Our HADDOCK engine lies at the center of the workflow and generates the models of the complexes. MD engines, such as GROMACS, will be used both to sample conformations prior to docking and to evaluate, through systematic MD simulations, the stability of the best cluster representatives generated by HADDOCK (post-docking). The input is a protein, peptide or nucleic acid structure, obtained either from a PDB code or from a protein sequence (EBI, IRB APIs). At this stage, we are testing and benchmarking the use of GROMACS to improve HADDOCK's scoring on a selection of cases taken from the protein-protein docking benchmark 5.0 and on specific CAPRI targets for which scoring was particularly challenging.
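The three stages described above (MD sampling before docking, docking itself, and MD-based stability checks on the best cluster representatives) can be pictured as a small pipeline. The following is only a minimal sketch under stated assumptions, not the BioExcel workflow itself: every name in it (`DockedModel`, `run_workflow`, the `fake_*` stages, the file-like labels) is hypothetical, the stages return made-up numbers instead of calling HADDOCK or GROMACS, and it assumes that lower values are better for both the docking score and the stability measure.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class DockedModel:
    """One cluster representative produced by the docking stage (hypothetical type)."""
    label: str
    docking_score: float                  # assumption: lower is better
    md_stability: Optional[float] = None  # assumption: drift-like measure, lower is better


def run_workflow(
    receptor: str,
    ligand: str,
    sample: Callable[[str], List[str]],
    dock: Callable[[List[str], List[str]], List[DockedModel]],
    check_stability: Callable[[DockedModel], float],
    top_n: int = 3,
) -> List[DockedModel]:
    """Chain the three stages; each stage is passed in as a plain function."""
    # 1. Pre-docking: sample conformations of each partner (an MD engine in practice).
    receptor_confs = sample(receptor)
    ligand_confs = sample(ligand)

    # 2. Docking: generate models of the complex (the docking engine in practice).
    models = dock(receptor_confs, ligand_confs)

    # 3. Post-docking: run the stability check only on the best-scoring models.
    best = sorted(models, key=lambda m: m.docking_score)[:top_n]
    for model in best:
        model.md_stability = check_stability(model)

    # 4. Re-rank the selected models by their stability measure.
    return sorted(best, key=lambda m: m.md_stability)


# --- Placeholder stages with made-up numbers (no real simulation is run) ---

def fake_sample(structure: str) -> List[str]:
    return [f"{structure}_conf{i}" for i in range(2)]


def fake_dock(receptor_confs: List[str], ligand_confs: List[str]) -> List[DockedModel]:
    return [
        DockedModel(label=f"{r}+{l}", docking_score=-100.0 + 10 * (i + j))
        for i, r in enumerate(receptor_confs)
        for j, l in enumerate(ligand_confs)
    ]


def fake_stability(model: DockedModel) -> float:
    return abs(model.docking_score) / 100


ranked = run_workflow("receptor", "ligand", fake_sample, fake_dock, fake_stability)
for model in ranked:
    print(model.label, model.docking_score, model.md_stability)
```

Passing the stages in as functions keeps the orchestration independent of any particular engine, so a placeholder could later be swapped for a wrapper around a real docking or MD run without touching `run_workflow`.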

Small molecule docking using SMILES information for the ligand will also be considered at a later stage, potentially including an estimate of the binding affinity of the complex for this particular use case.

A bit more about our collaborators and the context of the Use Case

In this context, we collaborate with the group of Dr. Daan Geerke (molecular toxicology group, VU Amsterdam), who develops MDstudio, an open software framework for the integration of Molecular Dynamics simulation workflows. MDstudio relies on the Web Application Messaging Protocol (WAMP), which allows building distributed systems out of application components that are loosely coupled and can communicate in real time. Currently, both protein-small ligand docking and binding affinity prediction workflows have been implemented in MDstudio, and there was a common interest in adding HADDOCK as a new module to the initial design. This work is also supported by the Dutch e-Science center.
Within the BioExcel community, we also benefit from having the GROMACS core developers on this project. Their input on performance tuning and optimization will be decisive for applying this protocol systematically to the ~75-100 job submissions we receive daily on our HADDOCK web server.
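To illustrate what "loosely coupled components" means in the WAMP description above, here is a toy, in-process sketch of the two messaging patterns WAMP provides: remote procedure calls and publish/subscribe. It is not a WAMP implementation and does not use MDstudio's code or any WAMP library; the `ToyRouter` class, the `example.docking.*` names and the file names are all hypothetical, and calls here are synchronous, whereas a real router would deliver messages over the network.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List


class ToyRouter:
    """In-process stand-in for a message router (hypothetical, not real WAMP)."""

    def __init__(self) -> None:
        self._procedures: Dict[str, Callable[..., Any]] = {}
        self._subscribers: Dict[str, List[Callable[[Any], None]]] = defaultdict(list)

    # --- Remote-procedure-call pattern ---
    def register(self, name: str, handler: Callable[..., Any]) -> None:
        self._procedures[name] = handler

    def call(self, name: str, *args: Any, **kwargs: Any) -> Any:
        if name not in self._procedures:
            raise KeyError(f"no procedure registered under {name!r}")
        return self._procedures[name](*args, **kwargs)

    # --- Publish/subscribe pattern ---
    def subscribe(self, topic: str, handler: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: Any) -> None:
        for handler in self._subscribers[topic]:
            handler(payload)


router = ToyRouter()


# Component A: offers a docking procedure and announces when it finishes.
def run_docking(receptor: str, ligand: str) -> dict:
    result = {"complex": f"{receptor}:{ligand}", "status": "done"}
    router.publish("example.docking.finished", result)
    return result


router.register("example.docking.run", run_docking)

# Component B: only knows the topic name, not who publishes to it.
events: List[dict] = []
router.subscribe("example.docking.finished", events.append)

# Component C: only knows the procedure name, not who implements it.
result = router.call("example.docking.run", "receptor.pdb", "ligand.pdb")
print(result, len(events))
```

The point of the sketch is that the three components never reference each other directly; they share only names known to the router, which is what makes it possible to add a new module without changing the existing ones.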

Why is this Use Case particularly interesting for BioExcel?

Such a workflow needs to integrate a variety of HPC and HTC computing resources to make optimal use of existing computing infrastructures. The HADDOCK web server already makes use of distributed computing (via the EGI and the US Open Science Grid resources) and also runs on HTC resources. We plan to further integrate server and analysis tools, offering workflow and self-contained solutions (e.g. cloud), which could be particularly attractive to industrial users who would rather not use public servers.

This workflow engine will also be key in reaching exascale for interactome modelling as explained in the previous section.

When will this be ready?

We aim to release the workflow at the end of Q1 2018. With the addition of new use cases relevant for BioExcel, we would also like to organise workshops, webinars and tutorials under BioExcel branding that bring together the lead developers of HADDOCK, MDstudio and GROMACS.