Date: Wednesday 29 April 2020
Time: 13:00 – 15:30 BST
This free training course will be conducted remotely using GoToTraining, supported by collaborative web technologies like GitHub and Google Docs. The course is structured as webinars interspersed with tutorial exercises and is limited to 20 participants.
– Registration now closed as we have reached maximum capacity for this event –
Materials
The slides from the course are now available here:
Lectures – https://slides.com/soilandreyes/2020-04-29-cwl-virtual-training#/
Tutorials – https://slides.com/soilandreyes/2020-04-29-cwl-tutorial#/
About this course
This course will provide a brief introduction to the Common Workflow Language (CWL) and help you get started with the CWL reference implementation cwltool, as well as giving you a taster of the implementation Toil using the BioExcel Cloud Portal. At the end of this session, you will be able to:
- Write workflows in CWL and execute them
- Execute workflows with Docker containers
- Make your own tool descriptions in CWL
- Share and reuse workflow and tool descriptions with the CWL community
- Follow best practice guidelines for writing CWL
Please note that the following is not covered in the course:
- This course will not teach you how to design a scientifically valid bioinformatics pipeline
  - Bioinformatics tools/workflows may be used as examples, alongside generic UNIX tools
- This course will not explain how to set up other CWL implementations (e.g. Galaxy or CWLEXEC for LSF)
  - You will be shown how to install the reference implementation cwltool, and cloud support in Toil will be demonstrated
  - The strengths and pitfalls of the different engine choices will be covered in brief
- This course will not detail how to use particular compute/cloud infrastructure
  - Cloud computing may be used as an example
- This course will not provide support for rewriting your existing pipeline to CWL
  - Instructors will assist attendees with tutorial lessons
Prerequisites
No prior knowledge of workflows, CWL or Python is required.
A basic understanding of the command line (UNIX shell as on Linux/OS X) is advised; guidance will also be provided for Windows users.
Experience with writing JSON or scripts in any programming/scripting language is advised (e.g. an understanding of quotes, indentation and blocks); however, a full introduction to the CWL syntax in YAML will be provided.
No prior knowledge of bioinformatics is required.
Using workflows to improve reproducibility
To help improve replicability and reproducibility in computational analysis, workflows provide a way to describe an analysis as a pipeline of tool executions. A workflow management system can execute such a pipeline either locally or across a distributed architecture such as cloud instances or HPC clusters, taking care of details such as parallelism, file handling and job submission. Popular workflow systems in bioinformatics include Galaxy, Nextflow and Snakemake; overall, more than 270 workflow systems have emerged.
Common Workflow Language
Many workflow management systems exist, and it can be difficult to move a pipeline encoded for one workflow system to another. The Common Workflow Language was created as a community-driven standard to formally define a workflow language that is executable by multiple workflow engines, focusing on what is deemed their biggest commonality: parallel pipeline execution of command line tools that exchange files. At the core of CWL is therefore the reusable description of command line tools, each of which explicitly declares its configuration options, input files and expected output files, as well as their binding to particular filenames and command line arguments. Reproducible and portable execution of a tool is enabled using containers (Docker, Singularity), as well as packages from BioConda and Linux distributions where available.
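As a rough illustration of what such a tool description contains, the sketch below builds one as a Python dictionary and writes it out as JSON, which a YAML parser can read. It is not taken from the course materials: the wrapped command (`echo`), the input name `message`, the output name `out`, the file name `echo.cwl` and the container image are all assumptions chosen for the example.

```python
import json


def make_echo_tool():
    """Return a minimal CWL CommandLineTool description wrapping `echo`.

    All names here (message, out, output.txt) are illustrative assumptions.
    """
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        # The command line tool being described.
        "baseCommand": "echo",
        # Hypothetical container image, listed as a hint rather than a hard requirement.
        "hints": {"DockerRequirement": {"dockerPull": "debian:stable-slim"}},
        # Each input declares its type and how it is bound to the command line.
        "inputs": {
            "message": {
                "type": "string",
                "inputBinding": {"position": 1},
            }
        },
        # Capture standard output in a named file and expose it as an output.
        "stdout": "output.txt",
        "outputs": {"out": {"type": "stdout"}},
    }


def write_description(description, path):
    """Serialise a description to disk as JSON."""
    with open(path, "w") as handle:
        json.dump(description, handle, indent=2)
        handle.write("\n")


if __name__ == "__main__":
    write_description(make_echo_tool(), "echo.cwl")
```

With cwltool installed, a file like this would typically be run as `cwltool echo.cwl --message "Hello"`; in practice CWL descriptions are usually written by hand in YAML rather than generated.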
The CWL tool descriptions, while executable on their own (as a kind of one-step workflow), are usually combined as steps of a CWL workflow, which declares the data dependencies between the inputs and outputs of the steps, as well as scalability parameters such as concurrency and job scattering over multiple values. While CWL is domain-independent, it has received particular emphasis for building workflows in data-intensive sciences such as Bioinformatics, Medical Imaging, Astronomy, Physics, and Chemistry.
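The way a workflow declares data dependencies and scattering can be sketched in the same style. The example below is an assumption-laden illustration, not course material: the step names `echo` and `count`, the referenced files `echo.cwl` and `wc.cwl`, and all input and output names are hypothetical. The helper `step_dependencies` is not part of any CWL tooling; it only shows that the ordering of steps can be derived from the declared sources.

```python
def make_workflow():
    """Return a two-step CWL Workflow description (illustrative names only)."""
    return {
        "cwlVersion": "v1.0",
        "class": "Workflow",
        # Scattering must be enabled explicitly in CWL v1.0.
        "requirements": {"ScatterFeatureRequirement": {}},
        "inputs": {"messages": "string[]"},
        "steps": {
            "echo": {
                "run": "echo.cwl",      # hypothetical tool description
                "scatter": "message",   # one job per element of `messages`
                "in": {"message": "messages"},
                "out": ["out"],
            },
            "count": {
                "run": "wc.cwl",        # hypothetical tool description
                "scatter": "text",
                # "echo/out" refers to output `out` of step `echo`.
                "in": {"text": "echo/out"},
                "out": ["counts"],
            },
        },
        "outputs": {
            "counts": {"type": "File[]", "outputSource": "count/counts"},
        },
    }


def step_dependencies(workflow):
    """Map each step name to the set of steps whose outputs it consumes."""
    dependencies = {}
    for name, step in workflow["steps"].items():
        upstream = set()
        for source in step["in"].values():
            # A source of the form "step/output" points at another step;
            # a bare name points at a workflow-level input.
            if "/" in source:
                upstream.add(source.split("/", 1)[0])
        dependencies[name] = upstream
    return dependencies


if __name__ == "__main__":
    for name, upstream in sorted(step_dependencies(make_workflow()).items()):
        print(name, "depends on", sorted(upstream) or "workflow inputs only")
```

Because the steps only state where their inputs come from, a workflow engine is free to decide the execution order and to run independent or scattered jobs in parallel.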
Technical requirements
- You will need a laptop or workstation where you have the privileges to install software.
  - Unprivileged Windows with domain login? Ask your administrator to install Miniconda for Windows (with Python 3.7), which allows user-space installation of tools
- Python 3.5 or later
  - Linux and OS X: Usually already installed, try starting python3
  - Windows: Download Python 3.8: https://www.python.org/downloads/
  - For consistent results also install Miniconda: https://docs.anaconda.com/anaconda/install/
  - Installation of cwltool will be covered in training
- Docker (recommended): https://docs.docker.com/engine/install/
  - or Singularity: https://singularity.lbl.gov/
- Text editor: Any editor you are comfortable with
  - Recommended: Visual Studio Code; installable for Windows, Linux and OS X.
    - To access BioExcel Cloud Portal VMs, install the VSCode Remote Development extension
  - Expert alternative: ssh to a VM in BioExcel Cloud Portal and use a terminal-based editor like vim
- SSH client
  - A compatible OpenSSH client
  - Windows users: SSH comes with the install of Git for Windows
- We recommend using Google Chrome to access the session. You can check the system requirements for attendees on the GoToTraining website.
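
The software requirements above can be checked before the session with a short script. This is only a sketch under assumptions: it looks for executables named `docker`, `singularity` and `ssh` on the PATH, which may not match every installation, and it does not verify that a found tool actually works. It avoids f-strings so that it also runs on Python 3.5.

```python
import shutil
import sys

# Minimum Python version stated in the technical requirements.
MINIMUM_PYTHON = (3, 5)


def check_environment():
    """Return a dict mapping each requirement to True (found) or False (missing)."""
    return {
        "python": sys.version_info[:2] >= MINIMUM_PYTHON,
        # Either container engine is enough; Docker is the recommended one.
        "container engine": any(
            shutil.which(name) is not None for name in ("docker", "singularity")
        ),
        "ssh": shutil.which("ssh") is not None,
    }


if __name__ == "__main__":
    for requirement, found in sorted(check_environment().items()):
        print("{:<18}{}".format(requirement, "found" if found else "missing"))
```

Run it with `python3` on Linux and OS X, or with the Python launcher installed on Windows.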
Trainers
Stian Soiland-Reyes, The University of Manchester
Robin Long, The University of Manchester
Contact
This session has been created in the context of the BioExcel remote training programme. This course is free to attend but we ask that you provide us with feedback after the training to help us optimise our training programme. If you have any questions, email Marta Lloret at marta.lloret@ebi.ac.uk