Workflows


Scientific workflow systems are popular for organising and executing analytical and computational pipelines, particularly when combining disparate command line tools, repetitive executions and structured data capture. Compared to scripts written in Bash or Python, a workflow can also be described graphically, used to communicate with domain scientists, and referenced more easily from publications.

Workflow systems

One of the challenges for BioExcel is to provide easy access to computing and data resources through a range of workflow environments, focusing particularly on usage of the BioExcel core codes. The first step in this direction was a thorough analysis of the state of the art, which identified the following workflow systems for BioExcel to support:

Common Workflow Language

The Common Workflow Language (CWL) is a community-led specification for expressing portable workflow and tool descriptions, which can be executed by multiple leading workflow engine implementations. Unlike previous standardisation attempts, CWL has taken a pragmatic approach and focused on what most workflow systems are able to do: execute command line tools and pass files around in a top-to-bottom pipeline. At the heart of CWL workflows are the tool descriptions. Each command line is described, with its parameters, input and output files, in a YAML format so that it can be shared across workflows and linked to from registries like ELIXIR’s bio.tools. These tool descriptions are then combined and wired together in a second YAML file to form a workflow template, which can be executed repeatedly, on any of the supported implementations and on different platforms, by specifying input files and workflow parameters. The CWL User Guide gives a step-by-step introduction to the language, while the more formal CWL specifications define CWL concepts so they can be implemented by the different workflow systems. The CWL Viewer, developed with support from BioExcel, provides a graphical visualization of CWL workflows.
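The split between a tool description and a workflow template can be sketched in Python by building both documents as dictionaries and serialising them as JSON, which CWL accepts alongside YAML. This is only an illustration: the wrapped command (`wc -l`), the file name `wc-tool.cwl` and every identifier below are assumptions chosen for the example, not BioExcel tool descriptions.

```python
import json

# Hypothetical tool description: wraps the Unix command `wc -l`.
# It declares one input file and captures standard output as the result.
wc_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": ["wc", "-l"],
    "inputs": {
        "text_file": {"type": "File", "inputBinding": {"position": 1}},
    },
    "stdout": "line_count.txt",
    "outputs": {
        "count": {"type": "stdout"},
    },
}

# Hypothetical workflow template: a second document that refers to the
# tool description by file name and wires workflow inputs to step inputs.
workflow = {
    "cwlVersion": "v1.0",
    "class": "Workflow",
    "inputs": {"text_file": "File"},
    "outputs": {
        "count": {"type": "File", "outputSource": "count_lines/count"},
    },
    "steps": {
        "count_lines": {
            "run": "wc-tool.cwl",  # assumed file name of the tool above
            "in": {"text_file": "text_file"},
            "out": ["count"],
        },
    },
}


def to_cwl_json(document):
    """Serialise a CWL document (a plain dict) as indented JSON text."""
    return json.dumps(document, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(to_cwl_json(wc_tool))
    print(to_cwl_json(workflow))
```

Saving the two printed documents as `wc-tool.cwl` and a workflow file of your choice should give a pair that a CWL implementation can load, with the input file supplied at execution time.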

Copernicus

Copernicus is a peer-to-peer distributed computing platform designed for high level parallelization of statistical problems. It provides an easy and effective consolidation of heterogeneous compute resources, automatic resource matching of jobs against compute resources, automatic fault tolerance of distributed work, a workflow execution engine to easily define a problem and trace its results live, as well as flexible plugin facilities allowing programs to be integrated into the workflow execution engine. Copernicus consists of four components: the Server, the Worker, the Client and the Workflow execution engine. The Server is the backbone of the platform: it manages projects, generates jobs (computational work units) and matches these to the best computational resource. Workers are programs residing on the computational resources. They are responsible for executing jobs and returning the results to the Server. Workers can reside on any type of machine: desktops, laptops, cloud instances or a cluster environment. The Client is the tool for setting up projects and monitoring them. Nothing ever runs on the Client itself; it only sends commands to the Server. That way the researcher can run the Client on a laptop, fire up a project, close the laptop, open it up after some time and see the progress of the project. All communication between the Server, the Workers and the Client is encrypted and has to be authorized.
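The idea of matching jobs to the best available resource can be illustrated with a small Python sketch. It does not reproduce the Copernicus matching algorithm: the `Job` and `Worker` classes, the single "cores" criterion and the best-fit rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    name: str
    cores: int  # cores the job needs


@dataclass
class Worker:
    name: str
    free_cores: int  # cores the worker currently offers


def match_jobs(jobs: List[Job], workers: List[Worker]) -> Dict[str, Optional[str]]:
    """Assign each job to the smallest worker that can still fit it.

    Returns a mapping from job name to worker name, or None when no worker
    has enough free cores. The workers' free_cores are updated in place.
    """
    assignment: Dict[str, Optional[str]] = {}
    # Place the most demanding jobs first so small jobs do not crowd them out.
    for job in sorted(jobs, key=lambda j: j.cores, reverse=True):
        candidates = [w for w in workers if w.free_cores >= job.cores]
        if not candidates:
            assignment[job.name] = None  # the job stays queued
            continue
        best = min(candidates, key=lambda w: w.free_cores)
        best.free_cores -= job.cores
        assignment[job.name] = best.name
    return assignment


if __name__ == "__main__":
    jobs = [Job("md_run", 8), Job("analysis", 2), Job("huge", 64)]
    workers = [Worker("laptop", 4), Worker("cluster_node", 16)]
    print(match_jobs(jobs, workers))
```

In this toy setting the 64-core job remains unassigned, the 8-core job goes to the cluster node and the 2-core job goes to the laptop, which is the tightest fit.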

Galaxy

Galaxy is an open, web-based platform for data intensive biomedical research. Galaxy can be accessed on a free public server, or installed locally in the lab. Rather than building a workflow up-front, Galaxy uses a data playground approach, effectively building a workflow implicitly by applying a series of operations on the data items. It keeps a History of all intermediate data items that are produced (and how they were made), making it easy to rerun parts of the workflow and share the results with others. Galaxy has tight integration with a large collection of tools for genomics and sequence analysis, and is therefore popular for making Next-Gen Sequencing (NGS) pipelines. Adding a new tool to Galaxy (if it is not already in the Galaxy Toolshed) is done by making a little Python wrapper and a description. Maintaining a Galaxy instance can be a challenge, as it also means keeping track of and updating all the installed tools and reference datasets. Recently Galaxy has also become available as cloud images and as a Docker image, which simplifies the installation. Galaxy is working on Common Workflow Language support.

KNIME

The KNIME Analytics Platform is popular in cheminformatics for data analysis, statistics and visualization. KNIME runs as a graphical desktop application, but can also be used on the command line, remotely on the cloud, or as a server. KNIME workflows are written as a dataflow, connecting a series of operations that pass table-based data items. A typical workflow operation will extend the table by adding new columns (e.g. calculated properties) or summarize inputs to a new, smaller table. KNIME has rich visualization and plotting for supported data types, and allows each operation to be run step by step, or, when data or services have changed, all “outdated” operations to be re-run as indicated by a traffic light system. A KNIME workspace contains a workflow and the data values produced by the latest executions, and can be shared as a ZIP file or folder. KNIME can be extended with plugins developed in Java. KNIME is heavily used in Open PHACTS and by pharmaceutical companies.
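The table-based dataflow and the re-running of outdated operations can be sketched in plain Python. This is not KNIME code or its API: the `Node` and `Pipeline` classes, the column names and the caching rule are assumptions that only mimic the behaviour described above, and the sketch assumes the input table stays the same between runs.

```python
from typing import Callable, Dict, List, Optional

Table = List[Dict[str, float]]


class Node:
    """One operation in the dataflow, with a cached output table."""

    def __init__(self, name: str, operation: Callable[[Table], Table]):
        self.name = name
        self.operation = operation
        self.output: Optional[Table] = None  # None means "outdated"


class Pipeline:
    """A linear chain of nodes that re-runs only the outdated ones."""

    def __init__(self, nodes: List[Node]):
        self.nodes = nodes
        self.executed: List[str] = []  # log of node executions

    def invalidate(self, name: str) -> None:
        """Mark a node and everything downstream of it as outdated."""
        index = [node.name for node in self.nodes].index(name)
        for node in self.nodes[index:]:
            node.output = None

    def run(self, table: Table) -> Table:
        for node in self.nodes:
            if node.output is None:
                node.output = node.operation(table)
                self.executed.append(node.name)
            table = node.output
        return table


def add_density(table: Table) -> Table:
    # Extend the table with a calculated column, leaving the input untouched.
    return [{**row, "density": row["mass"] / row["volume"]} for row in table]


def summarise(table: Table) -> Table:
    # Reduce the table to a single row holding the mean density.
    mean = sum(row["density"] for row in table) / len(table)
    return [{"mean_density": mean}]


if __name__ == "__main__":
    data = [{"mass": 10.0, "volume": 2.0}, {"mass": 9.0, "volume": 3.0}]
    pipeline = Pipeline([Node("add_density", add_density), Node("summarise", summarise)])
    print(pipeline.run(data))
    pipeline.invalidate("summarise")  # only the last node is re-run
    print(pipeline.run(data), pipeline.executed)
```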

Apache Taverna

Apache Taverna (incubating) is a Java-based scientific workflow system with a graphical design interface. Taverna workflows can combine many different service types, including REST and WSDL services, command line tools, scripts (e.g. BeanShell, R) and custom plugins. Taverna is used in a wide range of sciences for data analysis and processing, including bioinformatics, cheminformatics, biodiversity and musicology. Workflow engine features include provenance tracking, implicit parallelism/iterations, retry/failover and looping. Taverna workflows are commonly shared on myExperiment, and can either be created graphically in the Taverna workbench, programmatically using the Taverna Language API or by generating workflow definitions in the SCUFL2 format. With support from BioExcel, Apache Taverna is working on Common Workflow Language support.

COMPSs

COMPSs is a framework, composed of a programming model and a runtime system, which aims to ease the development and deployment of distributed applications and web services. The core of the framework is its programming model, which allows the programmer to write applications in a sequential way and execute them on top of heterogeneous infrastructures, exploiting the inherent parallelism of the applications. The COMPSs programming model is task-based, allowing the programmer to select the methods of the sequential application to be executed remotely. This selection is done by means of an annotated interface where all the methods that have to be considered as tasks are defined with annotations describing their data accesses and constraints on the execution resources. At execution time this information is used by the runtime to build a dependency graph and orchestrate the tasks on the available resources. One important feature of the COMPSs runtime is the ability to exploit cloud elasticity by adjusting the amount of resources to the current workload. When the number of tasks is higher than the number of available cores, the runtime turns to the cloud, looking for a provider offering the type of resources that best meets the requirements of the application at the lowest economic cost. Analogously, when the runtime detects an excess of resources for the actual workload, it will power off unused instances in a cost-efficient way. Such decisions are based on resource information that contains the details of the software images and instance templates available for every cloud provider. Since each cloud provider offers its own API, COMPSs defines a generic interface to manage resources and to query details concerning the execution cost of multiple cloud providers during one and the same execution. Components called connectors are responsible for translating the generic requests to the actual provider’s API.
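How declared data accesses can yield a dependency graph is sketched below in plain Python. This is not the COMPSs runtime or its annotation syntax: the task tuples, function names and example task names are assumptions, and the sketch tracks only read-after-write dependencies.

```python
from typing import Dict, List, Set, Tuple

# Each task declares the data items it reads and the ones it writes.
Task = Tuple[str, Set[str], Set[str]]  # (name, reads, writes)


def build_dependency_graph(tasks: List[Task]) -> Dict[str, Set[str]]:
    """Map each task to the earlier tasks whose outputs it reads."""
    last_writer: Dict[str, str] = {}
    graph: Dict[str, Set[str]] = {}
    for name, reads, writes in tasks:  # tasks are listed in program order
        graph[name] = {last_writer[item] for item in reads if item in last_writer}
        for item in writes:
            last_writer[item] = name
    return graph


def schedule_levels(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into waves; tasks in the same wave can run in parallel."""
    remaining = dict(graph)
    done: Set[str] = set()
    levels: List[List[str]] = []
    while remaining:
        ready = sorted(t for t, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        levels.append(ready)
        done.update(ready)
        for task in ready:
            del remaining[task]
    return levels


if __name__ == "__main__":
    tasks: List[Task] = [
        ("prepare", set(), {"topology"}),
        ("minimise_a", {"topology"}, {"min_a"}),
        ("minimise_b", {"topology"}, {"min_b"}),
        ("merge", {"min_a", "min_b"}, {"report"}),
    ]
    graph = build_dependency_graph(tasks)
    print(schedule_levels(graph))
```

The program is written sequentially, yet the two hypothetical minimisation tasks end up in the same wave because neither reads what the other writes. This is the kind of inherent parallelism that a task-based runtime can exploit.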

Workflow blocks

As part of the BioExcel H2020 project we are creating a set of portable workflow building blocks to use the BioExcel-supported tools. We will be curating their bio.tools descriptions with the EDAM ontology and then describing their execution as Common Workflow Language tools. We create corresponding Python wrappers that provide a uniform interface across the tools and handle any input/output parameter adaptations. These wrappers are configured using YAML/JSON, for example:
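As a purely hypothetical illustration of this pattern, the sketch below loads a JSON configuration and hands it to a small wrapper that assembles a GROMACS `pdb2gmx` command line. The step name, the `paths` and `properties` keys, the file names and the `Pdb2gmxWrapper` class are assumptions made for the example and do not reproduce the actual BioExcel wrappers or their configuration schema. The command is only built, not executed.

```python
import json
from typing import Dict, List

# Hypothetical configuration: key names and values are assumptions.
CONFIG_JSON = """
{
  "step1_pdb2gmx": {
    "paths": {
      "input_pdb": "structure.pdb",
      "output_gro": "structure.gro",
      "output_top": "topol.top"
    },
    "properties": {
      "force_field": "amber99sb-ildn",
      "water_model": "tip3p"
    }
  }
}
"""


class Pdb2gmxWrapper:
    """Illustrative wrapper exposing a uniform paths/properties interface."""

    def __init__(self, paths: Dict[str, str], properties: Dict[str, str]):
        self.paths = paths
        self.properties = properties

    def command(self) -> List[str]:
        """Translate the generic configuration into tool-specific flags."""
        return [
            "gmx", "pdb2gmx",
            "-f", self.paths["input_pdb"],
            "-o", self.paths["output_gro"],
            "-p", self.paths["output_top"],
            "-ff", self.properties["force_field"],
            "-water", self.properties["water_model"],
        ]


def load_step(config_text: str, step_name: str) -> Pdb2gmxWrapper:
    """Build a wrapper for one named step of the configuration."""
    step = json.loads(config_text)[step_name]
    return Pdb2gmxWrapper(step["paths"], step["properties"])


if __name__ == "__main__":
    wrapper = load_step(CONFIG_JSON, "step1_pdb2gmx")
    print(" ".join(wrapper.command()))
```

Keeping file paths and tool properties in separate sections is one way to let the same workflow code drive different tools, since only the wrapper needs to know the tool-specific flags.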

BioExcel is working on a pilot use case of Virtual Screening, with the GitHub project bioexcel/pymdsetup as our initial testing ground, where we have created wrappers for Gromacs, scwrl and other tools, as well as the corresponding CWL tool descriptions (e.g. pdb2gmx.cwl). Note that this repository is under active development and its structure is subject to change; see the master branch for the latest version.

The workflow is expressed in different ways:

CWL workflow schema_w.cwl

CWL Viewer

The above visualization is made using the CWL Viewer, developed with support from BioExcel.

CWL Viewer has become the de facto standard tool for presenting CWL definitions in the Common Workflow Language and ELIXIR communities.

Talking about workflows

To talk to BioExcel about workflows or provide feedback, feel free to contact us:

You may also want to sign up to BioExcel’s Workflows Interest Group and newsletter to be informed of upcoming webinars and workflow-related events.

References

  1. Hospital, Adam; Montras, Anna; Soiland-Reyes, Stian; Bonvin, Alexandre; Melquiond, Adrien; Gelpí, Josep Lluís; Lezzi, Daniele; Newhouse, Steven; Dianes, Jose A.; Abraham, Mark; Apostolov, Rossen; Ippoliti, Emiliano; Carter, Adam; White, Darren J. (2016): BioExcel Deliverable 2.1 – State of the art and gap analysis. https://doi.org/10.5281/zenodo.263963
  2. Hospital, Adam; Montras, Anna; Gelpí, Josep Lluís; Badia, Rosa M.; Newhouse, Steven; Dianes, Jose A.; Andrio, Pau; Soiland-Reyes, Stian; White, Darren J.; Carter, Adam; Ippoliti, Emiliano; Melquiond, Adrien; de Groot, Bert (2016): BioExcel Deliverable 2.2 – First Release of Workflow Blocks and Portals. https://doi.org/10.5281/zenodo.263965

Creative Commons Licence
BioExcel page is © 2017 The University of Manchester & KTH Royal Institute of Technology, licensed under a Creative Commons Attribution 4.0 International License. The Common Workflow Language logo is © 2016 The Common Workflow Language Project, distributed under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License. Apache, Apache Taverna and its logo are trademarks of The Apache Software Foundation, logo © 2014-2017 The Apache Software Foundation, distributed under the Apache License, 2.0. The KNIME trademark and logo are registered in the United States and/or Germany, owned by KNIME GmbH. Galaxy logo distributed under the Academic Free License version 3.0.