Date: Wednesday 29 April 2020
Time: 13:00 – 15:30 BST

This free training course will be conducted remotely using GoToTraining, supported by collaborative web technologies such as GitHub and Google Docs. The course is structured as webinars interspersed with tutorial exercises and is limited to 20 participants.

– Registration now closed as we have reached maximum capacity for this event –

Materials

The slides from the course are now available here:

Lectures – https://slides.com/soilandreyes/2020-04-29-cwl-virtual-training#/

Tutorials – https://slides.com/soilandreyes/2020-04-29-cwl-tutorial#/

About this course

This course will provide a brief introduction to the Common Workflow Language (CWL) and help you get started with the CWL reference implementation cwltool, as well as giving you a taster of the implementation Toil using the BioExcel Cloud Portal. At the end of this session, you will be able to:

  • Write workflows in CWL and execute them
  • Execute workflows with Docker containers
  • Make your own tool descriptions in CWL
  • Share and reuse workflow and tool descriptions with the CWL community
  • Follow best practice guidelines for writing CWL

Please note the following is not covered in the course:

  • This course will not teach you how to design a scientifically valid bioinformatics pipeline
    • Bioinformatics tools/workflows may be used as examples, alongside generic UNIX tools
  • This course will not explain how to set up other CWL implementations (e.g. Galaxy or CWLEXEC for LSF)
    • You will be shown how to install the reference implementation cwltool, and cloud support in Toil will be demonstrated
    • The strengths and pitfalls of the different engine choices will be covered in brief
  • This course will not detail how to use particular compute/cloud infrastructure
    • Cloud computing may be used as an example
  • This course will not provide support for rewriting your existing pipeline in CWL
    • Instructors will assist attendees with tutorial lessons

Prerequisites

No prior knowledge of workflows, CWL or Python is required.

A basic understanding of the command line (UNIX shell as on Linux/OS X) is advised; guidance will also be provided for Windows users.  

Experience with writing JSON or textual scripts in any programming/scripting language is advised, e.g. understanding of quotes, indentation, blocks; however a full introduction to the CWL syntax in YAML will be provided.

No prior knowledge of bioinformatics is required.

Using workflows to improve reproducibility

To help improve replicability and reproducibility in computational analysis, workflows provide a way to describe an analysis as a pipeline of tool executions. A workflow management system can then execute the pipeline either locally or across a distributed architecture such as cloud instances or HPC clusters, taking care of details such as parallelism, file handling and job submission. Popular workflow systems in bioinformatics include Galaxy, Nextflow and Snakemake; overall, more than 270 workflow systems have emerged.

Common Workflow Language

Many workflow management systems exist, and it can be difficult to move a pipeline written for one workflow system to another. The Common Workflow Language was created as a community-driven standard to formally define a workflow language that is executable by multiple workflow engines, focusing on what is deemed their biggest commonality: parallel pipeline execution of command line tools that exchange files. At the core of CWL is therefore the reusable description of command line tools, each of which explicitly declares its configuration options, input files and expected output files, as well as their binding to particular filenames and command line arguments. Reproducible and portable execution of a tool is enabled using containers (Docker, Singularity), as well as packages from BioConda and Linux distributions where available.
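As a rough illustration of what such a tool description contains, the sketch below builds one as a Python dictionary and prints it as JSON, which CWL accepts alongside YAML. It is not taken from the course materials: the wrapped command (echo), the input name message, the output name out, the file name output.txt and the container image debian:stable-slim are all assumptions chosen for the example.

```python
import json

# A minimal CWL CommandLineTool description for the UNIX "echo" command.
# All names below (message, out, output.txt) and the Docker image are
# illustrative assumptions, not part of the course materials.
echo_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": "echo",
    # Hint that the tool can run inside a container for portability.
    "hints": {"DockerRequirement": {"dockerPull": "debian:stable-slim"}},
    # Each input declares its type and how it binds to the command line.
    "inputs": {
        "message": {
            "type": "string",
            "inputBinding": {"position": 1},
        }
    },
    # Capture standard output in a named file and expose it as an output.
    "stdout": "output.txt",
    "outputs": {"out": {"type": "stdout"}},
}


def render(document: dict) -> str:
    """Serialise a CWL document as indented JSON text."""
    return json.dumps(document, indent=2)


if __name__ == "__main__":
    print(render(echo_tool))
```

Saving the printed text to a file such as echo.cwl (a hypothetical name) would give a description that a CWL engine could in principle run with a value supplied for message.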

The CWL tool descriptions, while executable on their own (as a kind of one-step workflow), are usually combined as steps of a CWL workflow, which declares the data dependencies between the inputs and outputs of the steps, as well as scalability parameters such as parallel concurrency and job scattering over multiple values. While CWL is domain-independent, it has received particular emphasis for building workflows in data-intensive sciences such as bioinformatics, medical imaging, astronomy, physics and chemistry.
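The following sketch shows how steps, data dependencies and scattering might look in a small CWL workflow, again written as a Python dictionary. It assumes two hypothetical tool descriptions, echo.cwl and wc.cwl, with the input and output names shown; the helper step_dependencies is an illustration of how an engine could read the dependencies from the document, not how any particular engine is implemented.

```python
import json

# A two-step CWL workflow. The tool files echo.cwl and wc.cwl, and every
# input/output name used here, are hypothetical and chosen for illustration.
workflow = {
    "cwlVersion": "v1.0",
    "class": "Workflow",
    # Scattering is an optional feature that must be requested explicitly.
    "requirements": {"ScatterFeatureRequirement": {}},
    "inputs": {"messages": "string[]"},
    "outputs": {"counts": {"type": "File[]", "outputSource": "count/out"}},
    "steps": {
        "say": {
            "run": "echo.cwl",
            "scatter": "message",          # one job per array element
            "in": {"message": "messages"},  # fed by a workflow input
            "out": ["out"],
        },
        "count": {
            "run": "wc.cwl",
            "scatter": "text",
            "in": {"text": "say/out"},      # fed by the output of step "say"
            "out": ["out"],
        },
    },
}


def step_dependencies(document: dict) -> dict:
    """Map each step to the set of steps whose outputs it consumes.

    A source written as "step/output" refers to another step; a bare
    name refers to a workflow-level input and creates no dependency.
    """
    dependencies = {}
    for name, step in document["steps"].items():
        upstream = set()
        for source in step["in"].values():
            if isinstance(source, str) and "/" in source:
                upstream.add(source.split("/", 1)[0])
        dependencies[name] = upstream
    return dependencies


if __name__ == "__main__":
    print(json.dumps(workflow, indent=2))
    for step, upstream in step_dependencies(workflow).items():
        print(step, "depends on", sorted(upstream))
```

Because the dependencies are declared as data rather than as an execution order, an engine is free to run independent jobs, including the scattered ones, in parallel.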

Technical requirements

Trainers

Stian Soiland-Reyes, The University of Manchester
Robin Long, The University of Manchester

Contact

This session has been created in the context of the BioExcel remote training programme. The course is free to attend, but we ask that you provide us with feedback after the training to help us optimise our training programme. If you have any questions, email Marta Lloret at marta.lloret@ebi.ac.uk.