Sequencing workflows are becoming more complex and could benefit from greater use of HPC and from greater automation.
This use case brings together existing processes and techniques used in genomics into a sequencing workflow which, if executed quickly at scale, could support novel research and, potentially, clinical applications. The workflow that we are studying is illustrated in the figure below.
A key component in this workflow is bcbio, itself a workflow system with many component parts. The bcbio software is widely used and offers support for parallel execution. It is most commonly used on systems from multicore workstations up to medium-sized clusters.
What are we doing in BioExcel?
In BioExcel we are seeking to automate and scale up this workflow, which will benefit both our partners at IGMM and the wider community: all of the workflows and software created during this work will be made publicly available through BioExcel’s workflow component platform.
Part of this work will be to evaluate the suitability of larger-scale HPC resources such as EPCC’s Cirrus for parts of this workflow. If it turns out that this workflow can indeed benefit from such a platform, then there is scope for BioExcel to offer some or all of this workflow as part of a future service.
A bit more about our collaborators and the context of the Use Case
IGMM participates in ICGC, a large international consortium at the cutting edge of WGS analysis in cancer genomics. To date, there is no established consensus for robust (never mind rapid) variant calling at multiple levels (i.e. single nucleotide, small indel events, larger structural variants). What does exist is a variety of groups “making the best of it”, constructing analysis pipelines that “kind of work” against a backdrop of a rapidly evolving thicket of different algorithms. Most of the data produced is aimed at individual research publications, and is certainly not useful for clinical diagnostics. IGMM would ultimately like to construct the first rapid-turnaround analysis pipeline for ovarian tumours, one that is sufficiently rapid and accurate to generate reports for routine clinical use in the Scottish NHS. The workflow being studied in this use case is probably a necessary prerequisite for such a pipeline, but improving it also has wider application, since similar workflows involving bcbio are widely used in the field.
There are certainly other players in this field. To give one example, Seven Bridges provides a cloud-based platform for sequence analysis. This service, whilst powerful, is more general: it is not tailored to ovarian cancer data, or to the rapid-turnaround generation of reports for clinicians. It also does not include the most widely used SV callers (e.g. lumpy, manta) in its current offering. There also remains the key issue of confidentiality. Although there are potentially organizational and administrative solutions which would allow data to be processed in the cloud using a commercial service, these are likely to come with constraints that would make it harder to tailor workflows to specific applications and to make the modifications required for novel research.
Why is this Use Case particularly interesting for BioExcel?
This use case differs from most of the other use cases in that HPC is being explored for analysis rather than simulation. Many problems in biology are currently being studied using small to medium-sized HPC clusters, locally and in the cloud, and there is considerable expertise in the computational biology field which allows biologists to make good use of these resources. In many cases such machines will be “the best tool for the job”, but the ability to exploit more powerful HPC machines to address bigger and more complex problems has scope to broaden the use of large-scale HPC for computational biology. The BioExcel CoE has partners with experience and expertise in bringing new users into large-scale HPC, and this potentially large “market” is an area in which BioExcel could have significant impact.
conda install bioexcel_seqqc
Alternatively, see the source code: https://github.com/bioexcel/BioExcel_SeqQC