Common Workflow Language



Scientific workflow systems are popular for organising and executing analytical and computational pipelines, particularly when combining disparate command line tools, repetitive executions and structured data capture. Unlike scripts written in Bash or Python, a workflow can also be described graphically, used for communication with domain scientists and referenced from publications.

For a while the field of workflow systems has been growing, each system having its own specialities, peculiarities, benefits and communities. For a review of bioinformatics workflow systems, see https://doi.org/10.1093/bib/bbw020 and https://doi.org/10.1186/s13062-015-0071-8, as well as https://doi.org/10.1007/s41019-017-0047-z with a particular focus on scalable pipelines.

The different workflow engines are generally not interchangeable and have different computational needs (e.g. support for cloud or grid infrastructure). The choice of workflow system has therefore traditionally determined how the workflow definitions are written, effectively a “vendor lock-in” of your own analytical pipeline.

Community-led standard

The Common Workflow Language (CWL) is a community-led effort to counter this limitation by specifying a portable way to express workflow and tool descriptions, supported by multiple leading workflow engine implementations. Unlike previous standardization attempts, CWL has taken a pragmatic approach and focused on what most workflow systems are able to do: execute command line tools and pass files around in a top-to-bottom pipeline.

At the heart of CWL workflows are the tool descriptions. Each command line tool is described, with its parameters, input files and output files, in a YAML format, so that the descriptions can be shared across workflows and linked to from registries like ELIXIR’s bio.tools. These are then combined and wired together in a second YAML file to form a workflow template, which can be executed repeatedly and on different platforms by specifying input files and workflow parameters. The workflows, tools and parameters can be further annotated using the EDAM ontology and schema.org.
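To make the split between tool descriptions, workflow template and input parameters more concrete, here is a minimal sketch. CWL documents are normally written by hand in YAML; the sketch uses Python dictionaries (which serialize to JSON, itself a subset of YAML) only so that the wiring can be checked programmatically. The two tools, the file names `echo-tool.cwl` and `count-tool.cwl`, and the helper `unresolved_sources` are all hypothetical illustrations, not taken from any BioExcel or EBI workflow.

```python
import json

# Hypothetical tool description (would be saved as echo-tool.cwl):
# wraps `echo` and captures its standard output in a file.
echo_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": "echo",
    "inputs": {
        "message": {"type": "string", "inputBinding": {"position": 1}},
    },
    "stdout": "message.txt",
    "outputs": {"printed": {"type": "stdout"}},
}

# Hypothetical second tool (count-tool.cwl): counts lines with `wc -l`.
count_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": ["wc", "-l"],
    "inputs": {
        "text_file": {"type": "File", "inputBinding": {"position": 1}},
    },
    "stdout": "count.txt",
    "outputs": {"count": {"type": "stdout"}},
}

# Workflow template: wires the output of the first tool into the second.
workflow = {
    "cwlVersion": "v1.0",
    "class": "Workflow",
    "inputs": {"message": "string"},
    "outputs": {
        "line_count": {"type": "File", "outputSource": "count/count"},
    },
    "steps": {
        "echo": {
            "run": "echo-tool.cwl",
            "in": {"message": "message"},
            "out": ["printed"],
        },
        "count": {
            "run": "count-tool.cwl",
            "in": {"text_file": "echo/printed"},
            "out": ["count"],
        },
    },
}

# Input parameters for one particular execution of the template.
job = {"message": "Hello CWL"}


def unresolved_sources(wf):
    """Return sources that match neither a workflow input nor a step output."""
    known = set(wf["inputs"])
    for step_name, step in wf["steps"].items():
        known.update(f"{step_name}/{out}" for out in step["out"])
    wanted = [src for step in wf["steps"].values() for src in step["in"].values()]
    wanted += [out["outputSource"] for out in wf["outputs"].values()]
    return [src for src in wanted if src not in known]


if __name__ == "__main__":
    print(json.dumps(workflow, indent=2))
    print("unresolved sources:", unresolved_sources(workflow))
```

If the three tool and workflow documents and the parameters were saved to files under the names used above, a CWL engine such as the reference runner could in principle execute them with a command along the lines of `cwltool workflow.cwl job.json`. The same workflow template can then be rerun with a different parameter file without touching the tool descriptions.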

The CWL User Guide gives a gentle introduction to the language, while the more detailed CWL specifications formalize CWL concepts so they can be implemented by the different workflow systems. In the BioExcel webinar “Introduction to the Common Workflow Language (CWL) project”, CWL’s community lead Michael Crusoe explains how the language is structured and how it is being developed.

Case study: ELIXIR

As part of BioExcel’s partnership with ELIXIR, we have worked with their Interoperability platform to promote the use of CWL, and BioExcel and CWL are now both core components of the ELIXIR workflow and tool interoperability plan.

We are also taking part in an ELIXIR implementation study of reuse, extension, scaling, and reproducibility of scientific workflows which looks at scalability and portability of CWL across diverse compute resources.

Our starting point was EBI’s metagenomics group, who have translated their internal EBI MetaGenomics pipeline (doi:10.1093/nar/gkv1195) to portable CWL workflows:

EMG pipeline v3.0: Sequence reads from an Illumina machine are trimmed for low quality (trimmomatic), converted from FASTQ to FASTA, then filtered for RNA-encoding sequences (rRNASelector, nested workflow rna-selector.cwl). Functional analysis prediction using InterProScan (nested workflow functional_analysis) is done after finding reads with predicted coding sequences above 60 nucleotides in length (FragGeneScan), alongside a taxonomic analysis of rRNAs annotated using the Greengenes reference database (nested workflow 16S_taxonomic_analysis.cwl).

Moving to CWL gives EBI flexibility in its choice of workflow engines and computational backends, which allows it to balance privacy on its private cloud (for sensitive data) against additional capacity from commercial cloud providers. In addition, using CWL has led to closer collaboration and workflow sharing with the MG-RAST team at Argonne National Laboratory near Chicago.

The EMG Pipeline is built as an exemplar CWL workflow, with help from CWL co-founder Michael R Crusoe.

Workflow blocks

Building CWL workflows typically follows a gradual increase in workflow quality, starting from a “scruffy” script-like all-in-one-go CWL file and transitioning to a neater composition of reusable nested workflows which combine independently described tools. In BioExcel we are creating such workflow blocks so that end-users can compose their own workflows without starting from scratch.
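The idea of composing a workflow from a reusable nested block can be sketched as follows. As before, the CWL structure is written as Python dictionaries purely so it can be inspected in code. The block, tool file names (`clean-tool.cwl`, `analyse-tool.cwl`) and helper functions are hypothetical and do not correspond to actual BioExcel workflow blocks. The one CWL-specific detail the sketch relies on is that a workflow which runs another workflow as a step declares `SubworkflowFeatureRequirement`.

```python
# Hypothetical reusable block: a small workflow that could live in its
# own file (e.g. prepare.cwl) and be shared between pipelines.
prepare_block = {
    "cwlVersion": "v1.0",
    "class": "Workflow",
    "inputs": {"raw": "File"},
    "outputs": {
        "cleaned": {"type": "File", "outputSource": "clean/result"},
    },
    "steps": {
        "clean": {
            "run": "clean-tool.cwl",  # independently described tool (assumed)
            "in": {"data": "raw"},
            "out": ["result"],
        },
    },
}

# Outer workflow: uses the block as a single step next to an ordinary tool.
# The block is embedded here to keep the sketch self-contained; in practice
# `run` would more likely point at the block's file.
outer = {
    "cwlVersion": "v1.0",
    "class": "Workflow",
    "requirements": {"SubworkflowFeatureRequirement": {}},
    "inputs": {"sample": "File"},
    "outputs": {
        "report": {"type": "File", "outputSource": "analyse/report"},
    },
    "steps": {
        "prepare": {
            "run": prepare_block,
            "in": {"raw": "sample"},
            "out": ["cleaned"],
        },
        "analyse": {
            "run": "analyse-tool.cwl",  # independently described tool (assumed)
            "in": {"data": "prepare/cleaned"},
            "out": ["report"],
        },
    },
}


def nested_steps(wf):
    """Names of steps whose `run` is itself an embedded Workflow."""
    return [
        name
        for name, step in wf["steps"].items()
        if isinstance(step["run"], dict) and step["run"].get("class") == "Workflow"
    ]


def declares_subworkflows(wf):
    """True if the workflow declares SubworkflowFeatureRequirement."""
    return "SubworkflowFeatureRequirement" in wf.get("requirements", {})


if __name__ == "__main__":
    print("nested steps:", nested_steps(outer))
    print("requirement declared:", declares_subworkflows(outer))
```

From the outer workflow's point of view the block behaves like any other step with named inputs and outputs, which is what lets a “scruffy” all-in-one file be refactored into blocks one piece at a time.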

Related to this, we are also working with BioConda and BioContainers to provide easier installation and containerization of tools like GROMACS for CWL engines, desktop and cluster installations.

Viewing CWL workflows

CWL workflows can be visualized using the CWL Viewer, which is developed by The University of Manchester eScienceLab and BioExcel. The CWL Viewer has now become the de facto standard tool in the growing CWL community for showcasing and exploring workflows on the web, as evidenced by the growing collection of visualized workflows. The CWL Viewer was presented at ISMB (BOSC 2017), where it won the best poster award.


Growing interest in CWL

CWL continues to receive growing interest from a wide range of researchers and developers, and not just in the bioinformatics domain. BioExcel is involved with the US Food and Drug Administration (FDA) effort to create HTS Computational Standards for Regulatory Sciences, where CWL is seen to have a key role for portability and reproducibility in regulatory submissions for personalized medicine as part of BioCompute Objects (doi:10.1101/191783).

CWL is a key component of the Global Alliance for Genomics and Health (GA4GH)’s Cloud work stream for distributed execution of workflows and tools.

For large-scale HPC, IBM is developing support for CWL workflows on IBM Spectrum LSF, using Toil and IBM’s LSF Process Manager. IBM introduced its open source implementation in BioExcel’s webinar on CWLEXEC.

While born out of the Open Bioinformatics Foundation (OBF) and BOSC, the Common Workflow Language project has now joined the Software Freedom Conservancy as its new legal home, alongside well-known open source projects like Boost, Git and Homebrew. This neutral ground reflects not just the independence of the CWL leadership team, but also the growing interest in CWL outside bioinformatics, including astronomy and medical imaging.

BioExcel is very proud to be part of this development and continues its involvement with the CWL community.


Interested in workflows? Sign up to the BioExcel Interest Group for Workflows, and join our Gitter chat room bioexcel/workflows!


BioExcel blog post is © 2017-2018 The University of Manchester, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The Common Workflow Language logo is © 2016 The Common Workflow Language Project, distributed under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License. The workflow EMG Pipeline v3 is © 2016-2018 EMBL – European Bioinformatics Institute, distributed under the Apache License, 2.0.
