Proteins are inherently dynamic, and their function depends on these movements. Short molecular dynamics (MD) simulations can predict simple conformational changes, but the large-scale conformational transitions that are essential for biological function remain difficult to predict. Current methods such as atomistic MD simulations, even with enhanced sampling techniques, are resource-intensive and rely on prior knowledge of the transition pathway. Specifically, they depend on well-defined collective variables (CVs): low-dimensional functions of the atomic coordinates that summarize the system’s behavior. The project also faces the challenge of predicting unknown protein conformations and uncovering novel pathways.
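To make the idea of a collective variable concrete, here is a minimal sketch of two commonly used geometric CVs: the distance between two atoms and an unweighted radius of gyration. The function names and the toy coordinates are invented for illustration and are not taken from the project; a real CV would normally be mass-weighted and computed on simulation frames.

```python
import math


def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.dist(a, b)


def radius_of_gyration(coords):
    """Unweighted radius of gyration (all atoms treated as equal mass)."""
    n = len(coords)
    center = [sum(c[i] for c in coords) / n for i in range(3)]
    return math.sqrt(sum(math.dist(c, center) ** 2 for c in coords) / n)


# Two toy "conformations" of a four-atom chain (hypothetical coordinates, in nm).
compact = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
extended = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]

for name, conf in (("compact", compact), ("extended", extended)):
    d = distance(conf[0], conf[-1])   # CV 1: end-to-end distance
    rg = radius_of_gyration(conf)     # CV 2: overall compactness
    print(f"{name}: end-to-end = {d:.3f} nm, Rg = {rg:.3f} nm")
```

Each conformation, which has twelve coordinates here and many thousands in a real protein, is reduced to two numbers that still distinguish the compact state from the extended one. That reduction is what enhanced sampling methods rely on, and it is also why a poorly chosen CV can hide the transition of interest.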
Why is this project particularly interesting for BioExcel?
The release of AlphaFold2 (AF2) and its database has revolutionized protein structure research by making over 200 million predicted structures available. This has opened up the possibility of understanding protein conformations on an unprecedented scale. By combining AF2 with new tools that can predict ensembles of conformational structures (like AF-Cluster and AlphaFlow) and incorporating coevolutionary data, this project has the potential to uncover previously unexplored alternative conformations and unearth novel transition pathways. The goal is to create a comprehensive dataset of conformational transition trajectories and establish an ontology of protein conformational movements, offering valuable insights into the full range of protein transitions.
What are we doing in BioExcel?
The project is following a four-step pipeline to create a comprehensive dataset and framework for studying protein movements.
- Generating a new dataset: We are using AF2-based tools (specifically AF-Cluster) to generate a new dataset of protein conformations. We are tuning the AF2 multiple sequence alignment (MSA) input with coevolutionary information to produce new intermediate structures not found in existing databases. These new structures are validated using a combination of the predicted local distance difference test (pLDDT) score and a metric based on normal mode analysis (NMA).
- Calculating transition trajectories: The new conformations are used to calculate transition pathways with GOdMD, a fast, coarse-grained method. GOdMD can compute a transition in less than an hour, does not distort the protein’s chemical structure, and can reproduce complex, non-linear transitions.
- Extracting transition coordinates: The trajectories from the GOdMD simulations are used to extract collective variables (CVs) using an AI-based method called Deep Cartograph. This new Python library uses an AutoEncoder to transform the trajectory into a compact representation, filter out less informative features, and apply dimensionality reduction. This process generates a PLUMED input file for the next step.
- Running enhanced-sampling simulations: The final step uses the new GROMACS version 2025 in combination with the PLUMED plugin to run enhanced-sampling atomistic MD simulations. These simulations are driven by the CVs and input files generated by Deep Cartograph. The resulting free energy landscape is then analyzed to further validate the predicted transition pathways.
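As an illustration of the last step, a free energy profile along a single CV can be estimated from a histogram of sampled CV values using F(s) = -kT ln P(s). The sketch below is a generic textbook estimate, not the project's analysis code or PLUMED's own tooling. It assumes the samples are either unbiased or come with per-frame reweighting weights (the optional `weights` argument); the function name, the bin count, and the toy data are hypothetical.

```python
import math

KB = 0.008314462618  # Boltzmann (gas) constant in kJ/(mol*K)


def free_energy_profile(cv_values, weights=None, n_bins=10, temperature=300.0):
    """Histogram-based estimate of F(s) = -kT ln P(s), shifted so min(F) = 0.

    Assumes cv_values are unbiased samples, or that `weights` holds the
    reweighting factor of each sample from a biased simulation.
    """
    if weights is None:
        weights = [1.0] * len(cv_values)
    lo, hi = min(cv_values), max(cv_values)
    width = (hi - lo) / n_bins
    if width == 0.0:  # all samples identical: fall back to a dummy bin width
        width = 1.0

    # Accumulate the (weighted) histogram.
    hist = [0.0] * n_bins
    for s, w in zip(cv_values, weights):
        idx = min(int((s - lo) / width), n_bins - 1)  # clamp the right edge
        hist[idx] += w

    total = sum(hist)
    kt = KB * temperature
    # Empty bins were never visited, so their free energy is left as infinity.
    energies = [-kt * math.log(h / total) if h > 0 else math.inf for h in hist]
    f_min = min(energies)
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    return centers, [f - f_min for f in energies]


# Toy data: 80 samples in one state (s = 0.1) and 20 in another (s = 0.9).
samples = [0.1] * 80 + [0.9] * 20
centers, profile = free_energy_profile(samples, n_bins=2)
for s, f in zip(centers, profile):
    print(f"s = {s:.2f}  F = {f:.3f} kJ/mol")
```

With a 4:1 population ratio, the less populated state sits kT ln 4 above the more populated one. In an actual enhanced-sampling run the bias must be accounted for first, for example through reweighting, before a histogram like this reflects the true landscape.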