Scientific workflow systems are popular for organising and executing analytical and computational pipelines, particularly when combining disparate command line tools, repetitive executions and structured data capture. Compared to scripts written in languages like Bash or Python, a workflow can also be described graphically, used to communicate with domain scientists, and more easily referenced from publications.

  • Workflow systems

    One of the challenges for BioExcel is to provide easy access to computing and data resources through a range of workflow environments, focusing particularly on usage of the BioExcel core codes. The first step in this direction was a thorough analysis of the state of the art, which identified the following workflow systems to be supported by BioExcel:

    Common Workflow Language

    The Common Workflow Language (CWL) is a community-led specification for expressing portable workflow and tool descriptions. These can be executed by multiple leading workflow engine implementations, which in turn support multiple execution frameworks (e.g. Slurm, Azure, GCP, SGE, HTCondor, LSF, AWS, Spark, PBS/Torque, TES). Unlike previous standardisation attempts, CWL has taken a pragmatic approach and focused on what most workflow systems are able to do: execute command line tools and pass files around in a top-to-bottom pipeline. At the heart of CWL workflows are the tool descriptions. Each command line is described, with its parameters, input and output files, in a YAML format so that descriptions can be shared across workflows and linked to from registries like ELIXIR’s bio.tools. These tool descriptions are then combined and wired together in a second YAML file to form a workflow template, which can be executed on any of the supported implementations, repeatedly and on different platforms, by specifying input files and workflow parameters. The CWL User Guide gives a step-by-step introduction to the language, while the more formal CWL specifications define CWL concepts so they can be implemented by the different workflow systems. The CWL Viewer, developed with support from BioExcel, provides a graphical visualization of CWL workflows. The BioExcel workflow building blocks are accessible as CWL. Read more in the BioExcel success story about CWL.
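
    The two-level structure described above (tool descriptions, then a workflow that wires them together) can be sketched in Python by building both documents as plain dictionaries and serialising them as JSON, which is also valid YAML. This is only an illustration: the `echo` tool, the `message` parameter, the step name `say` and the file name `echo-tool.cwl` are assumptions made for the example, not BioExcel building blocks.

```python
import json

# Hypothetical tool description wrapping the `echo` command line tool.
echo_tool = {
    "cwlVersion": "v1.2",
    "class": "CommandLineTool",
    "baseCommand": "echo",
    "inputs": {
        # The message becomes the first positional argument on the command line.
        "message": {"type": "string", "inputBinding": {"position": 1}},
    },
    "stdout": "output.txt",
    "outputs": {
        # Capture standard output as the tool's output file.
        "out": {"type": "stdout"},
    },
}

# Hypothetical workflow template: one step that runs the tool above.
workflow = {
    "cwlVersion": "v1.2",
    "class": "Workflow",
    "inputs": {"message": "string"},
    "outputs": {
        "result": {"type": "File", "outputSource": "say/out"},
    },
    "steps": {
        "say": {
            "run": "echo-tool.cwl",          # assumed file name of the tool description
            "in": {"message": "message"},    # wire the workflow input to the tool input
            "out": ["out"],
        },
    },
}


def to_cwl_text(description):
    """Serialise a description as JSON text that could be saved as a .cwl file."""
    return json.dumps(description, indent=2)


tool_text = to_cwl_text(echo_tool)
workflow_text = to_cwl_text(workflow)
```

    Saving `tool_text` as `echo-tool.cwl` and `workflow_text` next to it would give a pair of files in the shape the paragraph describes. Whether a given engine accepts them unchanged should be checked against that engine.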

     

    Galaxy

    Galaxy is an open, web-based platform for data intensive biomedical research. Galaxy can be accessed on a free public server, or installed locally in the lab. Rather than building a workflow up-front, Galaxy uses a data playground approach: a workflow is built implicitly by applying a series of operations to the data items, while a History keeps all intermediate data items that are produced (and how they were made). This makes it easy to rerun parts of the workflow and share the results with others. Galaxy has tight integration with a large collection of tools for genomics and sequence analysis, and is therefore popular for making Next-Gen Sequencing (NGS) pipelines. Adding a new tool to Galaxy (if it is not already in the Galaxy Toolshed) is done by making a little Python wrapper and a description. Maintaining a Galaxy instance can be a challenge, as it also means keeping track of and updating all the installed tools and reference datasets. Recently Galaxy has also become available as cloud images and as a Docker image, which simplifies the installation. Galaxy is working on Common Workflow Language support.
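
    The idea of a history that implicitly records a workflow can be illustrated with a toy Python class. This is a conceptual sketch only and does not use or mirror the Galaxy API; the `History` class and its method names are hypothetical.

```python
class History:
    """Toy history: every dataset remembers the operation and inputs that made it."""

    def __init__(self):
        self.items = []  # one record per dataset, in creation order

    def upload(self, value):
        """Add a raw dataset and return its id."""
        self.items.append({"value": value, "op": None, "inputs": []})
        return len(self.items) - 1

    def apply(self, op, *input_ids):
        """Run an operation on existing datasets and record how the result was made."""
        value = op(*(self.items[i]["value"] for i in input_ids))
        self.items.append({"value": value, "op": op, "inputs": list(input_ids)})
        return len(self.items) - 1

    def extract_workflow(self, dataset_id):
        """Return the ordered operation names that led to a dataset."""
        record = self.items[dataset_id]
        steps = []
        for input_id in record["inputs"]:
            steps.extend(self.extract_workflow(input_id))
        if record["op"] is not None:
            steps.append(record["op"].__name__)
        return steps


history = History()
raw = history.upload("acgt\nACGT\n")
upper = history.apply(str.upper, raw)
lines = history.apply(str.splitlines, upper)
recovered_steps = history.extract_workflow(lines)
```

    Because each record keeps its operation and input ids, the chain of steps behind any intermediate dataset can be recovered afterwards, which is the property that makes partial reruns and sharing straightforward.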

    KNIME

    The KNIME Analytics Platform is popular in cheminformatics for data analysis, statistics and visualization. KNIME runs as a graphical desktop application, but can also be used on the command line, remotely on the cloud, or as a server. KNIME workflows are written as a dataflow, connecting a series of operations that pass table-based data items. A typical workflow operation will extend the table by adding new columns (e.g. calculated properties) or summarize inputs into a new, smaller table. KNIME has rich visualization and plotting for supported data types, and allows each operation to be run step by step or, when data or services have changed, all “outdated” operations to be re-run, as indicated by a traffic light system. A KNIME workspace contains a workflow and the data values produced by the latest executions, and can be shared as a ZIP file or folder. KNIME can be extended with plugins developed in Java. KNIME is heavily used in Open PHACTS and by pharmaceutical companies.
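
    The two typical kinds of table operation mentioned above (extending a table with a calculated column, and summarizing it into a smaller table) can be sketched in plain Python. This is not KNIME code; the function names, the column names and the compound data are invented for illustration.

```python
from collections import defaultdict


def add_column(rows, name, func):
    """Operation that extends the table with a calculated column."""
    return [{**row, name: func(row)} for row in rows]


def summarize(rows, key, value):
    """Operation that reduces the table to one row per group, with the mean of `value`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return [{key: k, f"mean_{value}": sum(v) / len(v)} for k, v in groups.items()]


# Made-up input table: one row per compound.
compounds = [
    {"name": "A", "series": "s1", "mass": 100.0, "volume": 50.0},
    {"name": "B", "series": "s1", "mass": 300.0, "volume": 100.0},
    {"name": "C", "series": "s2", "mass": 90.0, "volume": 30.0},
]

# Dataflow: input table -> add calculated property -> summarize per series.
with_density = add_column(compounds, "density", lambda r: r["mass"] / r["volume"])
summary = summarize(with_density, "series", "density")
```

    Each function takes a table and returns a new one, so operations can be chained in the same way that nodes are connected in a dataflow.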

    COMPSs

    COMPSs is a framework, composed of a programming model and a runtime system, which aims to ease the development and deployment of distributed applications and web services. The core of the framework is its programming model, which allows the programmer to write applications in a sequential way and execute them on top of heterogeneous infrastructures, exploiting the inherent parallelism of the applications. The COMPSs programming model is task-based, allowing the programmer to select the methods of the sequential application to be executed remotely. This selection is done by means of an annotated interface in which all the methods to be treated as tasks are declared, with annotations describing their data accesses and their constraints on the execution resources. At execution time the runtime uses this information to build a dependency graph and orchestrate the tasks on the available resources. One important feature of the COMPSs runtime is its ability to exploit cloud elasticity by adjusting the amount of resources to the current workload. When the number of tasks is higher than the number of available cores, the runtime turns to the cloud, looking for a provider offering the type of resources that best meets the requirements of the application at the lowest cost. Analogously, when the runtime detects an excess of resources for the actual workload, it powers off unused instances in a cost-efficient way. Such decisions are based on resource type information, which details the software images and instance templates available for every cloud provider. Since each cloud provider offers its own API, COMPSs defines a generic interface to manage resources and to query details of the execution cost across multiple cloud providers during one and the same execution. The components behind this interface, called connectors, are responsible for translating the generic requests to the actual provider’s API.
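
    The task-based idea (write sequential code, mark some functions as tasks, and let a runtime derive the dependency graph from the data passed between them) can be illustrated with a small toy in plain Python. This is a conceptual sketch and not the COMPSs or PyCOMPSs API: the `task` decorator, the `TaskCall` class and the `run` function are hypothetical, and the toy executes tasks one after another rather than distributing them.

```python
import functools


class TaskCall:
    """A recorded task invocation; its dependencies are the TaskCall arguments."""

    def __init__(self, func, args):
        self.func = func
        self.args = args
        self.dependencies = [a for a in args if isinstance(a, TaskCall)]


def task(func):
    """Mark a function as a task: calling it records a graph node instead of running it."""
    @functools.wraps(func)
    def wrapper(*args):
        return TaskCall(func, args)
    return wrapper


def run(call, cache=None):
    """Execute a task after its dependencies, running each node only once."""
    cache = {} if cache is None else cache
    if id(call) not in cache:
        values = [run(a, cache) if isinstance(a, TaskCall) else a for a in call.args]
        cache[id(call)] = call.func(*values)
    return cache[id(call)]


@task
def square(x):
    return x * x


@task
def add(a, b):
    return a + b


# Written as ordinary sequential code. The two square() calls do not depend on
# each other, so a real runtime would be free to schedule them in parallel.
total = add(square(3), square(4))
result = run(total)
```

    Here `total.dependencies` holds the two `square` nodes, which is the information a scheduler would need to decide what may run concurrently and what must wait.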

    Workflow blocks

    As part of the BioExcel H2020 project we are creating a set of portable workflow building blocks to use the BioExcel-supported tools. We will be curating their bio.tools descriptions with the EDAM ontology and then describing their execution as Common Workflow Language tools. We create corresponding Python wrappers that provide a uniform interface across the tools and handle any input/output parameter adaptations. These blocks, biobb, are then made available for use from PyCOMPSs, CWL, KNIME, Galaxy ToolShed and Jupyter Notebooks.
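
    What a uniform Python wrapper around command line tools might look like can be sketched as follows. This is a hypothetical illustration and not the actual biobb interface: the `ToolBlock` and `UpperCaseFile` classes, the `input_path` and `output_path` keys, and the wrapped one-line Python program are all invented for the example.

```python
import subprocess
import sys


class ToolBlock:
    """Hypothetical uniform wrapper: input paths, output paths and optional properties."""

    command = []  # base command line, set by each subclass

    def __init__(self, inputs, outputs, properties=None):
        self.inputs = inputs
        self.outputs = outputs
        self.properties = properties or {}

    def build_args(self):
        """Adapt the uniform inputs/outputs to the arguments of the wrapped tool."""
        raise NotImplementedError

    def launch(self):
        """Run the wrapped tool and return its exit code."""
        completed = subprocess.run(self.command + self.build_args(), check=False)
        return completed.returncode


class UpperCaseFile(ToolBlock):
    """Toy block wrapping a Python one-liner that upper-cases a text file."""

    command = [
        sys.executable,
        "-c",
        "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())",
    ]

    def build_args(self):
        return [self.inputs["input_path"], self.outputs["output_path"]]
```

    A caller would construct a block with dictionaries of paths, for example `UpperCaseFile({"input_path": "in.txt"}, {"output_path": "out.txt"}).launch()`, without needing to know the argument order of the underlying tool. Keeping that adaptation inside `build_args` is what lets the same block be driven from different workflow systems.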

    Virtual Screening workflow

    BioExcel worked on a pilot use case of Virtual Screening, with the GitHub project bioexcel/virtualscreening, where we used the wrappers for GROMACS, scwrl and other tools, as well as the corresponding CWL tool descriptions (e.g. pdb2gmx.cwl).

    Building on this approach we later split out the biobb building blocks as independent and reusable modules.

    The workflow is expressed in different ways:

    The workflow can be installed using Anaconda, or using apt and pip.

    CWL Viewer

    The above workflow visualization is made using the CWL Viewer, developed with support from BioExcel.

    CWL Viewer has become the de facto standard tool for presenting CWL definitions in the Common Workflow Language and ELIXIR communities.

    Talking about workflows

    To talk to BioExcel about workflows or provide feedback, feel free to contact us:

    You may also want to sign up to BioExcel’s newsletter to be informed of upcoming webinars and workflow-related events.

    References

        1. Hospital, Adam; Montras, Anna; Soiland-Reyes, Stian; Bonvin, Alexandre; Melquiond, Adrien; Gelpí, Josep Lluís; Lezzi, Daniele; Newhouse, Steven; Dianes, Jose A.; Abraham, Mark; Apostolov, Rossen; Ippoliti, Emiliano; Carter, Adam; White, Darren J. (2016): BioExcel Deliverable 2.1 – State of the art and gap analysis. https://doi.org/10.5281/zenodo.263963
        2. Hospital, Adam; Montras, Anna; Gelpí, Josep Lluís; Badia, Rosa M.; Newhouse, Steven; Dianes, Jose A.; Andrio, Pau; Soiland-Reyes, Stian; White, Darren J.; Carter, Adam; Ippoliti, Emiliano; Melquiond, Adrien; de Groot, Bert (2016): BioExcel Deliverable 2.2 – First Release of Workflow Blocks and Portals. https://doi.org/10.5281/zenodo.263965

    Creative Commons Licence
    BioExcel page is © 2017 The University of Manchester & KTH Royal Institute of Technology, licensed under a Creative Commons Attribution 4.0 International License. The Common Workflow Language logo is © 2016 The Common Workflow Language Project, distributed under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License. Apache, Apache Taverna and its logo are trademarks of The Apache Software Foundation, logo © 2014-2017 The Apache Software Foundation, distributed under the Apache License, 2.0. The KNIME trademark and logo are registered in the United States and/or Germany, owned by KNIME GmbH. Galaxy logo distributed under the Academic Free License version 3.0.