The COMPSs programming model is one of the workflow managers developed and provided to the communities by the BioExcel CoE. It powers GUIDANCE, a pipeline for large-scale genetic studies, by parallelizing it at task level and enabling it to run on distributed computing platforms.
The computational needs of large genetic studies keep growing, both in capacity and in complexity. The full analysis of the genotypes of thousands of individuals, including phasing, imputation and association testing, requires thousands of tasks of different types, each with its own computational requirements. To cover these needs, the BioExcel partner team at the Barcelona Supercomputing Center developed GUIDANCE, a modular compilation of programs for a complete genetic association analysis. GUIDANCE is a novel integrated solution for complete large-scale genome- and phenome-wide association analysis. It includes the phasing of genotypes into haplotypes, genotype imputation using multiple reference genetic variation panels, and association testing with one or multiple phenotypes. GUIDANCE can perform all these steps in a single execution, or in a modular way with optional user intervention.
The GUIDANCE implementation is based on COMPSs, a task-based programming framework that aims to facilitate the development and execution of parallel applications and workflows in distributed infrastructures such as HPC clusters, grids and clouds. This makes GUIDANCE portable across multiple parallel platforms. COMPSs can parallelize, at task level, sequential applications written in Python, Java and C/C++. At execution time, its runtime builds a workflow of the application's tasks, which lets it discover the potential parallelism of the application and execute it in a distributed environment. The runtime is responsible for scheduling, balancing and internally organising all the necessary subtasks to ensure efficient usage of the computing resources. It also takes care of the data transfers between tasks when these are distributed across remote nodes.
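The idea of discovering parallelism from a sequentially written workflow can be illustrated with a minimal sketch. It uses only the Python standard library and is not the COMPSs API: `ToyRuntime`, the stage functions `phase`, `impute` and `associate`, and the panel names are hypothetical stand-ins chosen for illustration. The sketch records a dependency edge whenever a task receives the pending output of an earlier task, and lets independent tasks (here, different chromosomes and panels) run concurrently.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class ToyRuntime:
    """Toy task runtime (an illustration, not the COMPSs API)."""

    def __init__(self, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._task_of = {}  # Future -> name of the task that produces it
        self.edges = []     # (producer, consumer) dependency edges

    def submit(self, name, func, *args):
        # A Future argument is the output of an earlier task,
        # so it marks a data dependency between the two tasks.
        for arg in args:
            if isinstance(arg, Future):
                self.edges.append((self._task_of[arg], name))

        def run():
            # Wait for upstream results, then execute the task body.
            resolved = [a.result() if isinstance(a, Future) else a for a in args]
            return func(*resolved)

        future = self._pool.submit(run)
        self._task_of[future] = name
        return future

    def shutdown(self):
        self._pool.shutdown(wait=True)


# Hypothetical stand-ins for the pipeline stages; they only build strings.
def phase(chromosome):
    return f"haplotypes(chr{chromosome})"


def impute(haplotypes, panel):
    return f"imputed({haplotypes}, {panel})"


def associate(*imputed_sets):
    return "assoc[" + " + ".join(imputed_sets) + "]"


def run_pipeline(chromosomes=(1, 2), panels=("panelA", "panelB")):
    rt = ToyRuntime()
    pending = {}
    # The loop reads like sequential code; the runtime finds the parallelism.
    for chrom in chromosomes:
        hap = rt.submit(f"phase_{chrom}", phase, chrom)
        imputed = [rt.submit(f"impute_{chrom}_{p}", impute, hap, p) for p in panels]
        pending[chrom] = rt.submit(f"assoc_{chrom}", associate, *imputed)
    final = {chrom: fut.result() for chrom, fut in pending.items()}
    rt.shutdown()
    return final, rt.edges


if __name__ == "__main__":
    results, edges = run_pipeline()
    for producer, consumer in edges:
        print(f"{producer} -> {consumer}")
    print(results[1])
```

In this sketch the user code stays sequential and the dependency edges come entirely from the data passed between tasks. A real runtime would additionally place tasks on remote nodes and move their data, which the toy thread pool does not attempt.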
GUIDANCE’s protocol has been applied to the identification and study of several loci associated with different complex diseases and traits (PMID: 27680694 and 24241537). One of the most recent examples is a study to identify new loci associated with type 2 diabetes (T2D). This study, based on the reanalysis of 70,000 publicly available genetic samples from type 2 diabetic and control individuals, allowed the replication and fine-mapping of 50 known T2D loci, as well as the identification of seven new loci associated with the disease. These novel loci included variants of low and rare frequency in the population, which could only have been found using the GUIDANCE methodology.
Functional enrichment studies have shown that the loci identified in this study are statistically correlated with insulin resistance, pancreas biology and other T2D-related processes, indicating that the cohort generated from public data and the results of the association analysis have captured a large fraction of T2D-related biology (figure below).
Horikoshi M, et al. Genome-wide associations for birth weight and correlations with adult disease. Nature. 2016 Oct 13;538(7624):248-252. doi: 10.1038/nature19806. PMID: 27680694
Bønnelykke K, et al. A genome-wide association study identifies CDHR3 as a susceptibility locus for early childhood asthma with severe exacerbations. Nature Genetics. 2014 Jan;46(1):51-5. doi: 10.1038/ng.2830. PMID: 24241537