- Community-led standard
- Case study: ELIXIR
- Workflow blocks
- Viewing CWL workflows
- Growing interest in CWL
Scientific workflow systems are popular for organising and executing analytical and computational pipelines, particularly when combining disparate command line tools, repetitive executions and structured data capture. Unlike scripts written in languages like Bash or Python, a workflow can also be described graphically, used for communication with domain scientists and referenced from publications.
For a while the field of workflow systems has been growing, with each system having its own specialities, peculiarities, benefits and communities. For a review of bioinformatics workflow systems, see https://doi.org/10.1093/bib/bbw020 and https://doi.org/10.1186/s13062-015-0071-8, as well as https://doi.org/10.1007/s41019-017-0047-z with a particular focus on scalable pipelines.
The different workflow engines are generally not interchangeable and have different computational needs (e.g. support for cloud or grid infrastructure). The choice of workflow system has therefore traditionally determined how the workflow definitions are created, effectively a “vendor lock-in” of your own analytical pipeline. This limits collaboration and effective science, as researchers build similar workflows in different “camps”.
Community-led standard
The Common Workflow Language (CWL) is a community-led effort to counter this limitation by specifying a portable way to express workflow and tool descriptions, supported by multiple leading workflow engine implementations. Unlike previous standardization attempts, CWL has taken a pragmatic approach and focused on what most workflow systems are able to do: execute command line tools and pass files around in a top-to-bottom pipeline.
At the heart of CWL workflows are the tool descriptions. Each one describes a command line, with its parameters, input files and output files, in a YAML format, so that it can be shared across workflows and linked to from registries like ELIXIR’s bio.tools. Manual installation of tools is avoided by using container technologies like Docker and Singularity, or packaging systems like Conda and the classic HPC Modules. BioConda has more than 3000 installable packages, to which BioExcel are contributing both the workflow building blocks (e.g. biobb_md) and GROMACS. For CWL, these packages mean the workflow can be portable without manual software dependency management, which is particularly important for distributed cloud execution.
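To make the idea of a tool description concrete, here is a minimal sketch that builds one in Python and prints it as JSON, which is also valid YAML. The wrapped tool (`wc -l`), the parameter names `text_file` and `count`, and the container image are assumptions chosen for illustration. They do not come from BioExcel or from any of the pipelines discussed here.

```python
import json

# Hypothetical tool description: wraps `wc -l` to count the lines in a file.
# The tool, the parameter names and the container image are illustrative assumptions.
line_count_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": ["wc", "-l"],
    # A hint tells the engine where to find the software, avoiding manual installation.
    "hints": {
        "DockerRequirement": {"dockerPull": "debian:stable-slim"},
    },
    "inputs": {
        "text_file": {
            "type": "File",
            # Place the file as the first positional argument on the command line.
            "inputBinding": {"position": 1},
        },
    },
    # Capture standard output into a file and expose it as the tool's output.
    "stdout": "line_count.txt",
    "outputs": {
        "count": {"type": "stdout"},
    },
}


def to_cwl_document(description: dict) -> str:
    """Serialise a description as JSON, which is a subset of YAML."""
    return json.dumps(description, indent=2)


if __name__ == "__main__":
    print(to_cwl_document(line_count_tool))
```

Saved to a file, a description like this should be something a CWL engine can load, but treat it as a starting point to check against the CWL User Guide rather than a tested tool wrapper.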
In CWL, such command line tools are then combined and wired together in a second YAML file to form a workflow template, which can be executed repeatedly and on different platforms by specifying input files and workflow parameters. The workflows, tools and parameters can be further annotated using the EDAM ontology and schema.org.
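The wiring between tools can be sketched in the same way. The example below is a hypothetical two-step workflow, written as a Python dictionary that mirrors the CWL structure, together with a separate job object holding the input files and parameters. The step names, the file names `unpack.cwl` and `search.cwl`, and all parameter names are assumptions. The `step_order` helper is not part of CWL. It only illustrates how an engine could derive execution order from the wiring.

```python
import json

# Hypothetical workflow template; step, file and parameter names are assumptions.
workflow = {
    "cwlVersion": "v1.0",
    "class": "Workflow",
    "inputs": {"archive": "File", "pattern": "string"},
    "outputs": {
        "matches": {"type": "File", "outputSource": "search/matches"},
    },
    "steps": {
        "unpack": {
            "run": "unpack.cwl",  # a separately described tool
            "in": {"archive": "archive"},
            "out": ["extracted"],
        },
        "search": {
            "run": "search.cwl",
            # "unpack/extracted" wires the output of one step into the next.
            "in": {"text_file": "unpack/extracted", "pattern": "pattern"},
            "out": ["matches"],
        },
    },
}

# The job object supplies concrete inputs, so one template can be run many times.
job = {
    "archive": {"class": "File", "path": "sample.tar"},
    "pattern": "ATG",
}


def step_order(wf: dict) -> list:
    """Order steps so that each runs after the steps whose outputs it consumes."""
    # A source written as "step/output" refers to another step;
    # a bare name refers to a workflow-level input.
    deps = {
        name: {src.split("/")[0] for src in step["in"].values() if "/" in src}
        for name, step in wf["steps"].items()
    }
    ordered = []
    while deps:
        ready = sorted(n for n, d in deps.items() if d <= set(ordered))
        if not ready:
            raise ValueError("cyclic wiring between steps")
        ordered.extend(ready)
        for name in ready:
            del deps[name]
    return ordered


if __name__ == "__main__":
    print(json.dumps(workflow, indent=2))
    print(step_order(workflow))
```

Keeping the job object apart from the workflow is what lets the same template be executed repeatedly with different data, and on different platforms.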
Below is a 64 second explanation of CWL by BioExcel developer Mark Robinson:
The CWL User Guide gives a gentle introduction to the language, while the more detailed CWL specifications formalize CWL concepts so they can be implemented by the different workflow systems. In the BioExcel webinar Introduction to the Common Workflow Language (CWL) project we hear from CWL’s community lead Michael Crusoe how the language is structured and how it is being developed:
Case study: ELIXIR
As part of BioExcel’s partnership with ELIXIR, we have worked with their Interoperability platform to promote the use of CWL, and BioExcel and CWL are now both core components of the ELIXIR workflow and tool interoperability plan.
We are also taking part in an ELIXIR implementation study of reuse, extension, scaling, and reproducibility of scientific workflows which looks at scalability and portability of CWL across diverse compute resources.
Our starting point was with EBI’s metagenomics group who have translated their internal EBI MetaGenomics pipeline (doi:10.1093/nar/gkv1195) to portable CWL workflows:
Moving to CWL gives EBI flexibility in its choice of workflow engines and computational backends, which allows it to balance privacy on its private cloud (for sensitive data) against using additional capacity from commercial cloud providers. In addition, using CWL accelerated closer collaboration and workflow sharing with the MG-RAST team at Argonne labs in Chicago:
MG-Rast (@FolkerMeyer & Andreas) @mg_rast are on-site to discuss metagenomics exchange (e.g. exchanging analysis workflows etc.)
— EBI Metagenomics (@EBImetagenomics) 26 April 2017
The EMG Pipeline is built as an exemplar CWL workflow, with help from CWL co-founder Michael R Crusoe.
Workflow blocks
Building CWL workflows follows a gradual increase in workflow quality, starting from a “scruffy” script-like all-in-one-go CWL file with locally installed tools and transitioning to a neater composition of reusable nested workflows, which combine independently described portable tools.
As part of the BioExcel H2020 project we are creating a set of portable workflow building blocks for using the BioExcel-supported tools, described as CWL tools. We create corresponding Python wrappers that provide a uniform interface across the tools and handle any adaptation of input and output parameters. These blocks, biobb, are available for use from PyCOMPSs, CWL, KNIME, Galaxy ToolShed and Jupyter Notebooks, providing portability of BioExcel-supported tools even beyond workflow systems that implement CWL.
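As a rough illustration of what a uniform interface across tools can mean, the sketch below defines a small wrapper class that accepts the same kinds of parameters for every tool (an input path, an output path and a dictionary of properties) and translates them into a tool-specific command line. The class `ToolBlock`, the executable `simtool` and its flags are all hypothetical. This is not the biobb API, only one possible shape for such a wrapper.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class ToolBlock:
    """Hypothetical building block: one command line tool behind a uniform interface."""

    executable: str
    input_flag: str
    output_flag: str
    input_path: str
    output_path: str
    properties: dict = field(default_factory=dict)

    def command(self) -> list:
        """Translate the uniform parameters into the tool's own argument style."""
        args = [self.executable,
                self.input_flag, self.input_path,
                self.output_flag, self.output_path]
        # Sort the properties so the generated command line is deterministic.
        for key, value in sorted(self.properties.items()):
            args += [f"-{key}", str(value)]
        return args

    def command_line(self) -> str:
        """Render the command as a shell-safe string, e.g. for logging."""
        return shlex.join(self.command())


if __name__ == "__main__":
    # `simtool` and its flags are made up for this example.
    block = ToolBlock("simtool", "-f", "-o", "in.pdb", "out.pdb", {"d": 1.0})
    print(block.command_line())
```

Because every block exposes the same handful of parameters, the same wrapper can in principle be driven from a CWL tool description, a notebook cell or another workflow system without changing the tool itself.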
Viewing CWL workflows
CWL workflows are visualized using the CWL Viewer, which is developed by The University of Manchester eScienceLab and BioExcel. The CWL Viewer has become the de facto standard tool in the CWL community for showcasing and exploring workflows on the web, as evidenced by the growing collection of visualized workflows (>5000). The CWL Viewer was presented at ISMB (BOSC 2017), where it won the best poster award.
The CWL Viewer was not intended as a workflow repository, but to present individual workflows. BioExcel has also helped initiate development of a multi-platform workflow repository, which is carried on together with ELIXIR in H2020 projects IBISBA and EOSC-Life to share workflows across the European Open Science Cloud aiming for federated execution on shared computational resources.
Reproducible workflows — capturing provenance from CWL
We are facing a reproducibility crisis, where more than half of researchers admit to being unable to reproduce their own experiments. For computation in particular this is ironic, as software systems should be fully capable of recording what they computed and how. While the mere use of scientific workflow systems can improve reproducibility, it is not enough, as the researcher also needs to capture the configuration and input data used, and to describe the context and limitations of the computation. A workflow can describe the computation, but re-running it will require a similar workflow system setup, which can be hard to replicate.
With its emphasis on interoperability, automated tool installation and flexible annotations, CWL is a prime ingredient for improving the reproducibility of computational workflows. BioExcel believes that ensuring reproducible workflows is a particular challenge in high-performance pipelines, and has thus explored this topic. Based on earlier work from the FP7 project Wf4Ever, we helped develop CWLProv (doi:10.5281/zenodo.1966881), a profile of research objects that captures detailed W3C PROV provenance of a CWL workflow run. We are developing this further together with BioExcel partner ELIXIR, as part of the implementation study Enabling the reuse, extension, scaling, and reproducibility of scientific workflows, to add CWLProv support to the CWL implementation Toil. We are also continuing this work in NIH Data Commons with Seven Bridges and Mendeley Data, where we focus on long-term availability of large data.
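The kind of information a provenance record holds can be sketched in a few lines. The example below records which files a single step used and generated, identifying each file by a checksum of its content and borrowing the W3C PROV terms entity, activity, used and wasGeneratedBy. The function names, the identifier prefixes `data:` and `run:`, and the overall layout are assumptions made for illustration. This is not the CWLProv format, which is defined by the CWLProv profile itself.

```python
import hashlib
import json
from datetime import datetime, timezone


def entity_id(data: bytes) -> str:
    """Identify a file by the SHA-256 checksum of its content."""
    return "data:" + hashlib.sha256(data).hexdigest()


def record_step(step: str, inputs: dict, outputs: dict,
                started: datetime, ended: datetime) -> dict:
    """Build a provenance record for one step run.

    `inputs` and `outputs` map file names to file content (bytes).
    Files with identical content share one entity in this simplified sketch.
    """
    activity = "run:" + step
    record = {
        "activity": {activity: {"startTime": started.isoformat(),
                                "endTime": ended.isoformat()}},
        "entity": {},
        "used": [],
        "wasGeneratedBy": [],
    }
    for name, data in inputs.items():
        record["entity"][entity_id(data)] = {"name": name, "size": len(data)}
        record["used"].append({"activity": activity, "entity": entity_id(data)})
    for name, data in outputs.items():
        record["entity"][entity_id(data)] = {"name": name, "size": len(data)}
        record["wasGeneratedBy"].append({"entity": entity_id(data),
                                         "activity": activity})
    return record


if __name__ == "__main__":
    start = datetime(2018, 1, 1, 12, 0, tzinfo=timezone.utc)
    end = datetime(2018, 1, 1, 12, 5, tzinfo=timezone.utc)
    rec = record_step("count_lines", {"in.txt": b"a\nb\n"},
                      {"out.txt": b"2\n"}, start, end)
    print(json.dumps(rec, indent=2))
```

Identifying data by checksum rather than by file name means a later reader can verify that the files they hold are the ones the run actually used, which is the core of what detailed provenance adds on top of the workflow definition.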
Growing interest in CWL
CWL continues to receive growing interest from a wide range of researchers and developers, and not just in the bioinformatics domain. BioExcel are involved with the US Food and Drug Administration (FDA) effort to create HTS Computational Standards for Regulatory Sciences, where CWL is seen to have a key role for portability and reproducibility in regulatory submissions for personalized medicine as part of BioCompute Objects (doi:10.1371/journal.pbio.3000099).
CWL is a key component of the Global Alliance for Genomics and Health (GA4GH)’s Cloud work stream for distributed execution of workflows and tools.
Going large scale on HPC, IBM are developing support for CWL workflows on IBM Spectrum LSF with Toil and IBM’s LSF Process Manager; the open source implementation was introduced by IBM in BioExcel’s Webinar on CWLEXEC.
While born out of the Open Bioinformatics Foundation (OBF) and BOSC, the Common Workflow Language project has now joined the Software Freedom Conservancy as its new legal home, neighboring well-known open source projects like Boost, Git and Homebrew. This neutral ground reflects not just the independence of the CWL leadership team, but also the growing interest in CWL outside bioinformatics, including astronomy and medical imaging.
BioExcel is very proud to be part of this development and will continue its involvement with the CWL community.
Interested in workflows? Sign up to the BioExcel Interest Group for Workflows, and join our Gitter chat room bioexcel/workflows!
BioExcel blog post is © 2017-2018 The University of Manchester, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The Common Workflow Language logo is © 2016 The Common Workflow Language Project, distributed under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License. The workflow EMG Pipeline v3 is © 2016-2018 EMBL – European Bioinformatics Institute, distributed under the Apache License, 2.0.