Scientific workflow systems are popular for organising and executing analytical and computational pipelines, particularly when combining disparate command line tools, repetitive executions and structured data capture. Compared to scripts written in Bash or Python, a workflow can also be described graphically, used for communication with domain scientists and referenced from publications.
The field of workflow systems has been growing for some time, with each system having its own specialities, peculiarities, benefits and communities. For reviews of bioinformatics workflow systems, see https://doi.org/10.1093/bib/bbw020 and https://doi.org/10.1186/s13062-015-0071-8, as well as https://doi.org/10.1007/s41019-017-0047-z with a particular focus on scalable pipelines.
The different workflow engines are generally not interchangeable and have different computational needs (e.g. support for cloud or grid infrastructure). The choice of workflow system has therefore traditionally determined how the workflow definitions are created, effectively a “vendor lock-in” of your own analytical pipeline.
The Common Workflow Language (CWL) is a community-led effort to counter this limitation by specifying a portable way to express workflow and tool descriptions, supported by multiple leading workflow engine implementations. Unlike previous standardization attempts, CWL has taken a pragmatic approach and focused on what most workflow systems are able to do: execute command line tools and pass files around in a top-to-bottom pipeline.
At the heart of CWL workflows are the tool descriptions. Each command line is described, with its parameters, input files and output files, in a YAML format, so that the descriptions can be shared across workflows and linked to from registries like ELIXIR’s bio.tools. These are then combined and wired together in a second YAML file to form a workflow template, which can be executed repeatedly and on different platforms by specifying input files and workflow parameters. The workflows, tools and parameters can be further annotated using the EDAM ontology and schema.org.
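To make the split between tool descriptions, workflow template and parameters more concrete, here is a minimal sketch in Python. It builds the three kinds of document as plain dictionaries, using field names from the CWL v1.0 specification, and serialises the workflow as JSON (which is also valid YAML). The tools (`echo`, `wc -l`), the file names `echo-tool.cwl` and `count-tool.cwl`, and the helper `dangling_sources` are illustrative assumptions and not part of any BioExcel or EBI pipeline. The helper is a toy consistency check, not a substitute for a real CWL validator.

```python
import json

# Tool description for one command line (`echo`), using CWL v1.0 field names.
echo_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": "echo",
    "inputs": {"message": {"type": "string", "inputBinding": {"position": 1}}},
    "outputs": {"printed": {"type": "stdout"}},
    "stdout": "message.txt",
}

# A second, independent tool description (`wc -l`) that takes a file.
count_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": ["wc", "-l"],
    "inputs": {"text_file": {"type": "File", "inputBinding": {"position": 1}}},
    "outputs": {"count": {"type": "stdout"}},
    "stdout": "count.txt",
}

# Workflow template wiring the two tools together.
# The file names in "run" are hypothetical.
workflow = {
    "cwlVersion": "v1.0",
    "class": "Workflow",
    "inputs": {"message": "string"},
    "outputs": {"line_count": {"type": "File", "outputSource": "count/count"}},
    "steps": {
        "echo": {
            "run": "echo-tool.cwl",
            "in": {"message": "message"},
            "out": ["printed"],
        },
        "count": {
            "run": "count-tool.cwl",
            "in": {"text_file": "echo/printed"},
            "out": ["count"],
        },
    },
}

# Parameters for one execution; the same template can be rerun with others.
job = {"message": "Hello CWL"}


def dangling_sources(wf):
    """Return sources that match neither a workflow input nor a step output."""
    known = set(wf["inputs"])
    for step_name, step in wf["steps"].items():
        known.update(f"{step_name}/{out}" for out in step["out"])
    sources = [src for step in wf["steps"].values() for src in step["in"].values()]
    sources += [out["outputSource"] for out in wf["outputs"].values()]
    return [src for src in sources if src not in known]


if __name__ == "__main__":
    print(dangling_sources(workflow))  # [] when every link resolves
    print(json.dumps(workflow, indent=2))
```

In this sketch the workflow only refers to the tools by file name, which is what lets the same tool description be reused in other workflows. The `job` dictionary plays the role of the input parameters supplied at execution time.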
The CWL User Guide gives a gentle introduction to the language, while the more detailed CWL specifications formalize CWL concepts so they can be implemented by the different workflow systems. In the BioExcel webinar “Introduction to the Common Workflow Language (CWL) project”, CWL’s community lead Michael Crusoe explains how the language is structured and how it is being developed.
Case study: ELIXIR
As part of BioExcel’s partnership with ELIXIR, we have worked with their Interoperability platform to promote the use of CWL, and BioExcel and CWL are now both core components of the ELIXIR workflow and tool interoperability plan.
We are also taking part in an ELIXIR implementation study of reuse, extension, scaling, and reproducibility of scientific workflows which looks at scalability and portability of CWL across diverse compute resources.
Moving to CWL gives EBI flexibility in the choice of workflow engines and computational backends, which allows them to balance privacy on their private cloud (for sensitive data) against the additional capacity of commercial cloud providers. In addition, using CWL has accelerated collaboration and workflow sharing with the MG-RAST team at Argonne National Laboratory near Chicago:
— EBI Metagenomics (@EBImetagenomics) 26 April 2017
The EMG Pipeline is built as an exemplar CWL workflow, with help from CWL co-founder Michael R Crusoe.
Building CWL workflows follows a gradual increase in workflow quality, starting from a “scruffy” script-like all-in-one-go CWL file and transitioning to a neater composition of reusable nested workflows which combine independently described tools. In BioExcel we are creating such workflow blocks so that end-users can compose their own workflows without starting from scratch.
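The “neater” end of that progression can be sketched as follows, again with Python dictionaries standing in for CWL documents. A small workflow acts as a reusable block, and an outer workflow runs it as a single step. In CWL v1.0 an outer workflow that nests another workflow declares `SubworkflowFeatureRequirement`. All step names and file names (`qc-block.cwl`, `qc-tool.cwl`) and both helper functions are hypothetical and chosen only for illustration.

```python
# A reusable block: a small workflow wrapping one independently described tool.
qc_block = {
    "cwlVersion": "v1.0",
    "class": "Workflow",
    "inputs": {"reads": "File"},
    "outputs": {"report": {"type": "File", "outputSource": "check/report"}},
    "steps": {
        "check": {"run": "qc-tool.cwl", "in": {"reads": "reads"}, "out": ["report"]},
    },
}

# The outer workflow uses the block as one step instead of repeating its steps.
pipeline = {
    "cwlVersion": "v1.0",
    "class": "Workflow",
    "requirements": [{"class": "SubworkflowFeatureRequirement"}],
    "inputs": {"reads": "File"},
    "outputs": {"qc_report": {"type": "File", "outputSource": "qc/report"}},
    "steps": {
        "qc": {"run": "qc-block.cwl", "in": {"reads": "reads"}, "out": ["report"]},
    },
}

# Stand-in for files on disk: maps a file name to its parsed document.
documents = {"qc-block.cwl": qc_block}


def nested_steps(wf, docs):
    """Names of steps whose `run` target is itself a workflow."""
    return [
        name
        for name, step in wf["steps"].items()
        if docs.get(step["run"], {}).get("class") == "Workflow"
    ]


def declares_subworkflows(wf):
    """True if the workflow lists SubworkflowFeatureRequirement."""
    return any(
        req.get("class") == "SubworkflowFeatureRequirement"
        for req in wf.get("requirements", [])
    )


if __name__ == "__main__":
    print(nested_steps(pipeline, documents))
    print(declares_subworkflows(pipeline))
```

The outer workflow sees only the block's inputs and outputs, so the block can be refined internally or swapped out without touching the workflows that use it.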
Viewing CWL workflows
CWL workflow visualization is done using the CWL Viewer, which is developed by The University of Manchester eScienceLab and BioExcel. The CWL Viewer has now become the de facto standard tool in the growing CWL community for showcasing and exploring workflows on the web, as evidenced by the growing collection of visualized workflows. The CWL Viewer was presented at ISMB (BOSC 2017), where it won the best poster award.
Growing interest in CWL
CWL continues to receive growing interest from a wide range of researchers and developers, and not just in the bioinformatics domain. BioExcel is involved with the US Food and Drug Administration (FDA) effort to create HTS Computational Standards for Regulatory Sciences, where CWL is seen to have a key role for portability and reproducibility in regulatory submissions for personalized medicine as part of BioCompute Objects (doi:10.1101/191783).
Going large scale on HPC, IBM is developing support for CWL workflows on IBM Spectrum LSF with Toil and IBM’s LSF Process Manager. IBM introduces its open source implementation in BioExcel’s webinar on CWLEXEC.
While born out of the Open Bioinformatics Foundation (OBF) and BOSC, the Common Workflow Language project has now adopted Software Freedom Conservancy as its new legal home, joining well-known open source projects like Boost, Git and Homebrew. This neutral ground reflects not just the independence of the CWL leadership team, but also the growing interest in CWL outside bioinformatics, including astronomy and medical imaging.
BioExcel is very proud to be part of this development and to continue our involvement with the CWL community.
BioExcel blog post is © 2017-2018 The University of Manchester, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The Common Workflow Language logo is © 2016 The Common Workflow Language Project, distributed under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License. The workflow EMG Pipeline v3 is © 2016-2018 EMBL – European Bioinformatics Institute, distributed under the Apache License, 2.0.