One month after the BioExcel/PDBe Hackathon, Diego, Panos and Brian share their insights about the event.
Nowadays, being a structural biologist is synonymous with being a user of the Protein Data Bank (PDB). However, the way the PDB is used has changed conceptually in recent years. Rather than just downloading single PDB-formatted files one at a time, the current initiative is to offer added-value services that simplify and centralize different types of structural information in a single space. Such services include access to the new standard mmCIF files, the raw data (e.g. electron densities or NOEs) and validation reports, as well as a variable set of tools that depend on the different groups maintaining the PDB database: PDBe, RCSB and PDBj.
In our case, in Europe, the European Bioinformatics Institute (EMBL-EBI) provides a REST API, a programmatic point of access to the data in the PDB and the EMDB. This is an invaluable resource, since it allows users to find structures of interest and apply data-mining pipelines to them.
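To give a flavour of what programmatic access looks like, here is a minimal Python sketch. The base URL, the `pdb/entry/summary` path and the response layout (a dictionary keyed by PDB ID, each holding a list of records with a `title` field) are assumptions on our side and should be checked against the current PDBe documentation. The helper names `summary_url`, `extract_titles` and `fetch_summary` are ours, not part of any PDBe library.

```python
import json
from urllib.request import urlopen

# Assumed base URL of the PDBe REST API; verify it in the PDBe documentation.
BASE_URL = "https://www.ebi.ac.uk/pdbe/api"


def summary_url(pdb_id: str) -> str:
    """Build the URL of the (assumed) entry summary endpoint for one PDB ID."""
    return f"{BASE_URL}/pdb/entry/summary/{pdb_id.strip().lower()}"


def extract_titles(payload: dict) -> dict:
    """Map each PDB ID in a response to the title of its first record.

    Assumes the layout {pdb_id: [{"title": ...}, ...]}; IDs with an empty
    record list are skipped.
    """
    return {
        pdb_id: records[0].get("title")
        for pdb_id, records in payload.items()
        if records
    }


def fetch_summary(pdb_id: str, timeout: float = 10.0) -> dict:
    """Download and decode the summary of one entry (needs network access)."""
    with urlopen(summary_url(pdb_id), timeout=timeout) as response:
        return json.load(response)
```

With network access, `extract_titles(fetch_summary("1cbs"))` would return a one-item dictionary if the endpoint behaves as assumed. Keeping the URL building and the parsing separate from the download makes the first two easy to check offline.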
At the moment, the PDBe REST API exposes dozens of useful endpoints giving access to a huge amount of data on PDB entries, compounds, SIFTS, structural validation and more, intended for users to query the database in order to answer their biological questions. However, in the current state of the database, many of these questions may require complicated, chained pipelines. Thus, the PDBe team asked themselves: could we improve the way data is connected and exposed to the users?
The Hackathon and the use cases
A Hackathon is an event where software developers meet with a goal and try to achieve it as fast as they can in a cooperative way. The goal of a Hackathon is always realistic, ambitious and grounded in a strong base of knowledge, hardware and software. By the time the BioExcel/PDBe Hackathon was arranged, the PDBe team had already been working for more than a year on a new graph database using the Neo4j technology. The idea behind the event was to expose it to users and get feedback on how to improve it. Our goal was to develop a variety of use cases that made the most of the new database. But why did they choose to develop a graph database?
The REST API accesses the current database and retrieves the desired data in a process that is invisible to the user, who does not have to deal with SQL or other languages. Despite the extraordinary performance of their current database, their preliminary tests showed that the graph database could become an even better option. The key concept behind any graph database is the organization of the data in nodes and edges. The nodes represent any kind of unit (e.g. a PDB ID, a chain in the structure, a small ligand), and the edges connect the nodes according to their interactions or relationships (e.g. a given PDB ID contains a certain ligand). This arrangement of the data is particularly useful when you use the Cypher language, which was developed to query the database in terms of the relationships between the nodes. This technology would speed up many of the current REST API calls and allow new calls that are currently unavailable.
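The node and edge idea can be illustrated with a toy in-memory graph in Python. This is only a sketch of the concept, not the PDBe schema: the labels `Entry` and `Ligand`, the relationship `HAS_LIGAND`, the identifiers and the `TinyGraph` class are all hypothetical. The Cypher string shows what an equivalent relationship query could look like under those made-up names.

```python
from collections import defaultdict


class TinyGraph:
    """Toy directed graph: nodes are (label, id) tuples, edges are typed."""

    def __init__(self):
        # (source node, relationship type) -> set of target nodes
        self._edges = defaultdict(set)

    def add_edge(self, source, relationship, target):
        self._edges[(source, relationship)].add(target)

    def targets(self, source, relationship):
        """Nodes reached from `source` through `relationship`."""
        return set(self._edges.get((source, relationship), ()))

    def sources(self, relationship, target):
        """Nodes that point to `target` through `relationship`."""
        return {
            src
            for (src, rel), dsts in self._edges.items()
            if rel == relationship and target in dsts
        }


# Placeholder data: two entries sharing one ligand (all identifiers invented).
graph = TinyGraph()
graph.add_edge(("Entry", "entry_1"), "HAS_LIGAND", ("Ligand", "LIG_A"))
graph.add_edge(("Entry", "entry_2"), "HAS_LIGAND", ("Ligand", "LIG_A"))
graph.add_edge(("Entry", "entry_2"), "HAS_LIGAND", ("Ligand", "LIG_B"))

# "Which entries contain LIG_A?" is a single step backwards along the edges.
entries_with_lig_a = graph.sources("HAS_LIGAND", ("Ligand", "LIG_A"))

# The same question in Cypher, using the hypothetical labels above.
CYPHER_QUERY = (
    "MATCH (e:Entry)-[:HAS_LIGAND]->(l:Ligand {id: $ligand_id}) "
    "RETURN e.id"
)
```

The point of the sketch is that the question is phrased directly as a relationship between nodes, which is what Cypher expresses in its `MATCH` pattern.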
Coming back to the main goal of the Hackathon, we wanted to develop six use cases of interest to us, some of which are briefly described below. It is worth mentioning that all the use cases were practically solved on paper with the help of the new database, but the real pipelines were only partially developed, since their level of complexity required us to sit down independently to program them, which was not the real objective of the Hackathon.
One of the use cases was the creation of a validated set of ribonucleotides (no outliers, no clashes and a good fit to the experimental data). This use case required a first step to identify all RNA structures in the database (either “naked” RNA or RNA in complex with proteins), which took just one line of code in the Cypher language to query the graph database. The pipeline would continue with a series of queries to the current REST API to get the desired validation data, followed by the steps needed to filter the initial dataset of PDB structures.
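The final filtering step could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the pipeline we wrote: the `ResidueValidation` record, its field names and the `min_fit` threshold of 0.9 are hypothetical, and real validation data would need to be mapped onto them first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResidueValidation:
    """Hypothetical per-ribonucleotide validation record."""

    pdb_id: str
    chain: str
    number: int
    is_outlier: bool   # flagged as a geometric outlier
    clash_count: int   # number of clashes involving the residue
    fit_score: float   # assumed 0-1 score of fit to the experimental data


def select_validated(residues, min_fit=0.9):
    """Keep residues with no outlier flag, no clashes and a good enough fit.

    `min_fit` is an arbitrary illustrative threshold.
    """
    return [
        r
        for r in residues
        if not r.is_outlier and r.clash_count == 0 and r.fit_score >= min_fit
    ]
```

All three criteria from the use case map onto one condition each, so tightening or relaxing the definition of "validated" only means changing that single expression.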
Another use case also revolved around the creation of a dataset, but this time with an emphasis on protein-protein interactions. For this, all biological units of proteins that matched certain search criteria would have to be identified. Those criteria could be similarity to a query sequence, membership of a Pfam/SCOPe classification level or overlap with certain GO terms. After identifying the relevant proteins, they would be clustered at a predefined sequence identity level and representative structures would be selected. This could be done by selecting the structures closest to the cluster centroids. Alternatively, the highest-resolution structures could be selected. Finally, some useful metrics such as protein-protein interface conservation or functional annotations of key interacting residues would be calculated and summarised over the entire set.
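As a sketch of the second selection strategy, the Python function below picks the best-resolution member of each cluster. It assumes the clustering has already been done elsewhere and that each member comes as a hypothetical `(pdb_id, resolution_in_angstrom)` pair; the centroid-based alternative is not covered here.

```python
def pick_representatives(clusters):
    """Choose one representative per cluster by resolution.

    `clusters` maps a cluster ID to a list of (pdb_id, resolution) pairs,
    with resolution in angstroms (lower is better). Ties are broken
    alphabetically by PDB ID so the result is deterministic, and empty
    clusters are skipped.
    """
    representatives = {}
    for cluster_id, members in clusters.items():
        if not members:
            continue
        best = min(members, key=lambda member: (member[1], member[0]))
        representatives[cluster_id] = best[0]
    return representatives
```

Entries without a resolution value (for instance, some NMR structures) would need to be handled before calling this function, since the comparison expects a number for every member.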
From our point of view, the Hackathon was a complete success. We were introduced to a technology with brilliant prospects for biology, which allows us users to focus on the actual data and not the technology itself. We are extremely grateful to the PDBe development team for their time, enthusiasm and willingness to help us access, gather and shape the valuable data which is stored in the PDB and has been contributed by so many people.
If we ever dreamed about accessing biological data in the PDB in a more semantic way, the PDBe team is making our wish come true. Now, we look forward to seeing the new infrastructure that will come out of such amazing work. Probably, the next stepping stone in our dream will be an interface powered by Artificial Intelligence (AI), which will understand our needs in natural language and provide us with new API endpoints automatically. If that ever happens, computational biologists’ work will become easier than ever, allowing us to forget completely about the non-biological technical details. However, will we be needed when AI applications become so powerful? We will have to wait and see. Welcome, Bio-HAL.
About the authors
Diego Gallego, Panos Koukos, Brian Jiménez-García