Virtual screening (VS) is a computational technique used in the early phases of a drug discovery project to filter compound libraries that can contain up to millions of structures. When applied from a structure-based point of view, the goal is to narrow the library down to a few hundred compounds that display a high degree of geometric and electronic complementarity to the target under study. This complementarity is expected to enrich the selected top fraction in compounds with at least marginal (typically single- or double-digit micromolar) activity. In the best cases, compounds with single-digit nanomolar activity can be found. By testing only this top fraction in vitro, one can increase the chances of finding hits without having to test the millions of compounds in the original database, which would be prohibitive in terms of resources. Hits are then taken as the starting point of a medicinal chemistry program that optimizes their potency and their physicochemical and ADMET properties.

In order to carry out the in silico screen, each molecule must be docked in turn against a (typically) rigid image of the macromolecular receptor (usually a protein). Once docked, each molecule is assigned an in silico estimate of its binding free energy (ΔG of binding), so that all the molecules in the database can be ranked by a theoretical value of potency. However, because the conformational degrees of freedom of each compound are explored during docking, the screening of millions of molecules is a highly demanding computational task even if the protein is held rigid.
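To make the link between potency and ΔG concrete, the sketch below converts a dissociation constant into a binding free energy with the standard relation ΔG = RT ln(Kd) (1 M standard state) and then keeps the best-scoring fraction of a ranked library. The compound names and scores are hypothetical, and docking scores are only rough estimates of ΔG, so this is an illustration of the ranking idea rather than of any particular docking program.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def delta_g_from_kd(kd_molar, temperature_k=298.15):
    """Binding free energy (kcal/mol) from a dissociation constant in mol/L."""
    return R_KCAL * temperature_k * math.log(kd_molar)


def top_fraction(scores, fraction):
    """Return the ids of the best-scoring compounds (lowest score first)."""
    ranked = sorted(scores, key=scores.get)
    # Small tolerance so that floating-point noise does not add an extra compound.
    n_keep = max(1, math.ceil(fraction * len(ranked) - 1e-9))
    return ranked[:n_keep]


# Hypothetical docking scores in kcal/mol (more negative = better).
scores = {"cpd_a": -9.1, "cpd_b": -6.0, "cpd_c": -10.2, "cpd_d": -7.5}
selected = top_fraction(scores, 0.5)

print(round(delta_g_from_kd(1e-6), 2))  # 1 micromolar binder
print(round(delta_g_from_kd(1e-9), 2))  # 1 nanomolar binder
print(selected)
```

At room temperature a micromolar binder corresponds to roughly -8 kcal/mol and a nanomolar binder to roughly -12 kcal/mol, which gives a feel for the score range that separates marginal hits from potent ones.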

Several computational techniques are available to search for complementarity between small molecules and a protein target. However, receptor conformational variability is largely neglected, and in most cases the whole library is screened against a single fixed conformation of the protein.

The use of a structural ensemble of the protein receptor is the most suitable approach to account for receptor flexibility. This can be done in a number of ways. If more than one experimentally solved structure is available, each of them can be used separately to screen the whole library. If only one is available, one can resort to simulation techniques such as molecular dynamics (MD) to generate an ensemble of structures that makes it possible to account for receptor flexibility during the screening procedure. However, generating an MD trajectory for the receptor adds to the already significant computational cost.
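Screening against an ensemble yields one score per compound and per receptor conformation, and these have to be combined before ranking. A minimal sketch, assuming the simplest aggregation rule (keep the best, i.e. lowest, score over all conformations), is shown below. The rule, the ligand names and the conformer labels are illustrative assumptions; other schemes such as averaging are equally possible.

```python
def best_over_ensemble(scores_by_conformer):
    """For each ligand, keep the conformer giving the lowest (best) score.

    scores_by_conformer maps ligand id -> {conformer id -> docking score}.
    Returns ligand id -> (best conformer id, best score).
    """
    best = {}
    for ligand, per_conformer in scores_by_conformer.items():
        conformer = min(per_conformer, key=per_conformer.get)
        best[ligand] = (conformer, per_conformer[conformer])
    return best


# Hypothetical scores (kcal/mol) against one crystal structure and one MD frame.
ensemble_scores = {
    "cpd_a": {"xray": -7.2, "md_frame_10": -8.9},
    "cpd_b": {"xray": -8.1, "md_frame_10": -6.4},
}
best = best_over_ensemble(ensemble_scores)
print(best)
```

In this toy case cpd_a would be ranked below cpd_b if only the crystal structure were used, but comes out on top once the MD-derived conformation is included.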

Therefore, performing an extensive virtual screening of a huge compound library, including docking against conformational ensembles of the receptor (and possibly its sequence variants), is challenging and, depending on the characteristics of the system, approaches the exascale level.
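The multiplicative growth of the cost can be illustrated with a back-of-the-envelope estimate. All the numbers below (library size, ensemble size, number of variants, time per docking run) are assumed purely for illustration and are not BioExcel measurements; the cost of the MD simulations that generate the ensembles comes on top of this.

```python
def docking_core_hours(n_compounds, n_conformations, n_variants, seconds_per_docking):
    """Core-hours needed if every compound is docked to every receptor model."""
    n_dockings = n_compounds * n_conformations * n_variants
    return n_dockings * seconds_per_docking / 3600.0


# Assumed figures: 1 million compounds, 60 s per docking run on one core.
single_structure = docking_core_hours(1_000_000, 1, 1, 60.0)
# Same library, 100 conformations for each of 5 sequence variants.
full_ensemble = docking_core_hours(1_000_000, 100, 5, 60.0)

print(f"{single_structure:,.0f} core-hours for one rigid structure")
print(f"{full_ensemble:,.0f} core-hours for the full ensemble")
```

Under these assumptions the ensemble screen costs 500 times more than the single-structure screen, which is why such campaigns only become practical on massively parallel machines.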

What are we doing in BioExcel?

BioExcel is working on providing the software modules and workflows needed to automate the whole VS process described above: selection of targets and their available structures, identification of reference compounds, access to compound databases and related information, generation of conformational ensembles of the target through MD simulations, generation of structures and conformations of the target’s sequence variants, docking of the compounds against the protein ensembles, and final scoring and analysis. The whole process can be broken down into the following steps (see workflow figure below):

 

  1.     Access to reference compounds and associated pharmacological data: Open PHACTS
  2.     Access to compound databases and catalogues (ZINC is just one example)
  3.     Decoys recovery: Directory of Useful Decoys (DUD)
  4.     Access to experimentally solved receptor structures
  5.     Access to sequence variants for those receptors
  6.     Model construction for new variants not solved experimentally
  7.     Receptor structure ensemble generation from MD
  8.     Biomolecular recognition: Docking methods
  9.     Scoring and rescoring of docked compounds and their analysis
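Steps 1, 3 and 9 are typically combined to validate a screening protocol: known actives are mixed with decoys, everything is docked and ranked, and one checks whether the actives concentrate at the top of the list. A common metric for this is the enrichment factor, EF = (actives in top fraction / size of top fraction) / (total actives / library size). The sketch below is a minimal implementation under that definition; the compound ids and scores are hypothetical.

```python
import math


def enrichment_factor(scores, actives, fraction):
    """Enrichment factor of known actives in the best-scoring fraction.

    scores maps compound id -> docking score (lower = better);
    actives is the set of ids known to be active; the rest count as decoys.
    """
    ranked = sorted(scores, key=scores.get)
    # Small tolerance so that floating-point noise does not add an extra compound.
    n_top = max(1, math.ceil(fraction * len(ranked) - 1e-9))
    hits_in_top = sum(1 for cid in ranked[:n_top] if cid in actives)
    total_actives = sum(1 for cid in ranked if cid in actives)
    if total_actives == 0:
        raise ValueError("no known actives in the scored set")
    return (hits_in_top / n_top) / (total_actives / len(ranked))


# Hypothetical library of 10 compounds; c0 has the best score, c9 the worst.
scores = {f"c{i}": -10.0 + i for i in range(10)}
actives = {"c0", "c3"}
ef_20 = enrichment_factor(scores, actives, 0.2)
print(ef_20)
```

An enrichment factor of 1 corresponds to random selection, so values well above 1 in the top few percent indicate that the docking and scoring setup is discriminating actives from decoys.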

The library used to assemble the pipeline, designed and developed by BioExcel partners, will allow the workflow to be used on different computational architectures, including HPC supercomputers. Running this VS workflow efficiently in a massively parallel environment will permit the scanning of millions of compounds in a single run. The BioExcel code GROMACS, in combination with BioExcel software libraries, provides all the functionality needed to generate conformational ensembles of proteins.

A bit more about our collaborators and the context of the Use Case

This use case takes advantage of several existing initiatives.

Why is this Use Case particularly interesting for BioExcel?

Virtual screening is widely used in the drug discovery field as the initial step in selecting starting points suitable for further development. The computational cost involved usually precludes the consideration of receptor variability, whether due to sequence or to conformational changes. A complete experiment including these aspects is beyond the computational capabilities of typical users. However, BioExcel software and practices can provide a bridge between these operations and large-scale HPC, making it possible to perform complete VS experiments in a competitive time frame.

Working from a particular example:

For the design and development of the VS workflow, a target of great interest to the pharmaceutical industry was chosen as a validation use case: the Epidermal Growth Factor Receptor (EGFR). EGFRs are transmembrane receptors located on the cell membrane. They have an extracellular binding domain, to which Epidermal Growth Factor (EGF) binds, a transmembrane domain and an intracellular tyrosine kinase domain. EGFRs play an important role in controlling normal cell growth, apoptosis and differentiation. Mutations of EGFRs can lead to abnormal activation and signal transduction, causing unregulated cell division and ultimately driving some types of cancer. Thus, dysregulation of EGFR activity has been implicated in the oncogenic transformation of various types of cells, and EGFR represents an important drug target.

Currently, there are two therapeutic approaches targeting EGFR. One is based on monoclonal antibodies that bind to the extracellular domain of the receptor, antagonizing either the interaction with its cognate ligand (EGF) or its homo- or heterodimerization. The second approach is to inhibit its tyrosine kinase activity. This is also a very attractive option, as several therapies with marketing authorisation target the kinase domain. Approved small-molecule drugs in this category are ATP-competitive inhibitors, either reversible or covalent. Examples are Gefitinib, Vandetanib, Lapatinib, Erlotinib and Afatinib. Although structurally related, some of them require conformational changes in the receptor and thus bind to the EGFR kinase domain with some degree of induced fit.

Importantly, the administration of these treatments imposes a selection pressure on the cancer cells, which eventually develop mutations in the kinase domain that lead to resistance. One of the most prevalent mutations found in treated patients is T790M. This change affects the so-called “gatekeeper” residue, in the interior of the ATP binding site. The replacement of the small threonine by a much bulkier methionine precludes or partially hinders the binding of the ATP-competitive treatments listed above. This problem has spawned the development of a next generation of ATP-competitive inhibitors that target the T790M mutant, such as osimertinib, rociletinib, HM61713, ASP8273, EGF816 and PF-06747775.
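The notation T790M reads as "threonine (T) at position 790 replaced by methionine (M)". Before a structural model of a variant can be built, such a label has to be parsed and applied to the receptor sequence. The sketch below shows one minimal way to do this; the function name is hypothetical, and the six-residue fragment is a toy stand-in used only for illustration, not a sequence to rely on.

```python
import re


def apply_point_mutation(sequence, mutation, first_residue_number=1):
    """Apply a point mutation such as 'T790M' to a one-letter protein sequence.

    first_residue_number is the residue number of sequence[0], which allows
    working with a fragment that keeps the full-length numbering.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unrecognised mutation label: {mutation!r}")
    wild_type, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    index = position - first_residue_number
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[index] != wild_type:
        raise ValueError(
            f"expected {wild_type} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + mutant + sequence[index + 1:]


# Toy fragment assumed to start at residue 788 (illustrative only).
fragment = "LITQLM"
mutated = apply_point_mutation(fragment, "T790M", first_residue_number=788)
print(mutated)
```

Checking the wild-type residue before substituting is a cheap safeguard against numbering offsets between databases and structure files, which are a frequent source of error when variants are modelled automatically.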

Thus, a wide range of first- and latest-generation small-molecule inhibitors is available for this target, some of them hitting the wild-type (WT) sequence, others specifically designed to hit mutant variants and having little or no activity on the WT. This whole body of knowledge can be exploited to set up and fine-tune the BioExcel VS workflow, and to test its performance and reliability on a real target that is currently exploited in the clinic.