December 2015 saw the start of BioExcel – a new EU-wide Centre of Excellence to support academia and industry in using high-performance and high-throughput computing (HPC and HTC) to perform biomolecular research. Funded by the EC Horizon 2020 programme, BioExcel is supported by partners across the EU, including EPCC. The project has recently passed its 6-month milestone, so what better time to provide a broad overview of BioExcel so far?
Biomolecular research, like life science research in general, is having a growing impact on our lives, from health research to agriculture. Focus has recently shifted towards ensuring researchers can process information quickly and accurately. Technology improvements in gene sequencing, for example, have increased the amount of data that must be analysed. Drug research now incorporates increasingly complex simulations. In general, biomolecular research requires ever more computational resources and tools, yet many researchers are not computational experts. This is where BioExcel comes in. The project aims to provide expertise and support for research on the building blocks of living organisms: proteins, DNA, molecules, membranes, etc.
Excellence in Software, Usability and Consultancy
BioExcel is based on improving three aspects of biomolecular research. The first is the performance and scalability of the most commonly used software, such as GROMACS (www.gromacs.org), HADDOCK (www.haddocking.org) and CPMD (www.cpmd.org), so that it can take advantage of next-gen HPC systems and cope with the expected increase in the amount of data produced. The second is how easy it is for users to access and use these types of software. Not all researchers have experience in efficiently handling data and software, so BioExcel aims to provide customizable workflow environments, which will allow relatively novice HPC/HTC users to take advantage of the analysis software provided in ways that suit their specific research. Finally, hands-on training and public webinars are already under way, teaching researchers best practices and how to make the best use of the software and resources available.
BioExcel will also collaborate with other EU-wide projects in both biological and computational research, such as PRACE, EUDAT and ELIXIR. Bringing together biomolecular researchers and computational experts from across the continent will provide a central source of expertise and infrastructure that researchers can draw on to improve their research.
DNA Sequencing and BCBio
Over the next decade, one of the largest research areas to develop for use with HPC, HTC and ‘Big Data’ systems will be genomics. To put it into perspective, in one day a single genome sequencing machine can produce the same amount of DNA sequence data that the Human Genome Project collected in 12 years of operation, and at a fraction of the cost. A small sequencing centre alone is expected to produce around 1 petabyte of data per year, which creates significant processing and storage needs.
Edinburgh University has a wealth of expertise in genetics and DNA sequencing across The Roslin Institute, Edinburgh Genomics, the Institute of Genetics and Molecular Medicine and the Centre for Genomic and Experimental Medicine. Collaborating with members of these groups, we have begun working on increasing the efficiency and usability of BCBio, a series of high-throughput best-practice pipelines for variant calling, RNA-seq and small RNA analysis. These pipelines are typically used on HTC systems, and we are investigating how they could be altered to take advantage of HPC systems like Archer. Not only will this benefit users and developers of such pipelines, it will also inform HPC architects on how to design future systems to best support this growing area of computing.
We hope to have preliminary results to present at an upcoming conference at the Barcelona Supercomputing Centre in October. If you are interested in the work we or our partners are doing in this area, please feel free to come along. Information will be announced shortly.
Darren White, EPCC