Google Summer of Code (GSOC) is an annual programme where Google sponsors students around the world to contribute to all kinds of open source initiatives, paying a stipend of $5500 to successful students who work full-time on their project for the summer. The programme is highly competitive. It gives selected students not just a fun time coding and learning something new (and experience for their CV), but also teaches them about distributed collaboration and open source development practices, which are often missing from Software Engineering and Computer Science classes in higher education.

An important part of GSOC is that each accepted proposal has one or two dedicated mentors who are already part of the open source project. Mentoring is part technical, part social: helping the students get to know the software stack and make good coding decisions, showing them how to communicate their ideas with the community (mailing lists, group chats, wikis, blogs), encouraging them to provide regular GitHub pull requests, and teaching them about the often dreaded task of ensuring the licensing is OK for new files and dependencies that are added to the open source project.

BioExcel’s work package on workflows is exploring how to build reusable workflow components for biomolecular simulation in systems like Apache Taverna (incubating), Galaxy, COMP Superscalar, BCBio and KNIME, so we knew that interoperability and ease of deployment were important aspects.

I was therefore delighted that two of the GSOC 2016 proposals for Apache Taverna got accepted: Nadeesh Dilanga adding support for the container stack Docker, and Thilina Manamgoda adding support for Common Workflow Language. Through the application process we also found two GSOC-inspired students, Rajan Maurya and Sagar, who volunteered to self-mentor and improve our Android app Taverna Mobile – itself a product of GSOC 2015.


Google Summer of Code 2016

Below I summarize the GSOC 2016 contributions to Apache Taverna. Note that these are just two of the many accepted student proposals across the Apache Software Foundation and other organizations.

Docker

Docker is a Linux container virtualisation platform that is popular for distributing and running server and command line applications on cloud instances in a reproducible manner, where they can then form a distributed and horizontally scalable microservice architecture. For research software engineers, Docker can be particularly valuable as it simplifies installing and running open source software, scientific codes and tools. No more ./autoconf.sh && ./configure --with-fortran=/store/lib/fortran!

While Apache Taverna can execute remote and local services and command line tools, it has no built-in support for distributing those services and tools with the workflow. Thus portability and reuse are reduced if a workflow uses custom installed software, even though those are two big reasons for using workflows in the first place.

In his project, Nadeesh has therefore added a new taverna-docker-activity to Taverna, which can be used to create and start Docker containers from within the workflow. This means that the workflow can start its own web services, or download a ready-to-run container image, for instance with GROMACS installed; such workflows could then execute just as well for a different user or on a remote cloud instance. No more grid certificates!
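To make the idea of running a containerised tool concrete, here is a minimal Python sketch of the kind of `docker run` command line a workflow step could issue. It is not how taverna-docker-activity is implemented; the function name `docker_run_command`, the image name `example/gromacs:latest` and the mounted paths are hypothetical and chosen only for illustration. The sketch only builds and prints the command, so it does not need Docker installed.

```python
import shlex


def docker_run_command(image, command, volumes=None):
    """Build the argument list for a one-off `docker run` invocation.

    image   -- container image name (hypothetical in the example below)
    command -- list of arguments to run inside the container
    volumes -- optional dict mapping host paths to container paths
    """
    # --rm removes the container once the command has finished
    argv = ["docker", "run", "--rm"]
    for host_path, container_path in (volumes or {}).items():
        # -v mounts a host directory inside the container
        argv += ["-v", f"{host_path}:{container_path}"]
    argv.append(image)
    argv += list(command)
    return argv


# Assumption: an image with GROMACS installed is published under this name.
argv = docker_run_command(
    "example/gromacs:latest",
    ["gmx", "--version"],
    volumes={"/tmp/simulation": "/data"},
)
print(shlex.join(argv))
```

Passing the resulting list to `subprocess.run(argv, check=True)` would execute it on a machine where Docker is available; the point is that the image name carries the whole software installation, so the same step can run unchanged for another user or on a cloud instance.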

In BioExcel we are planning to wrap several tools as Docker images and register them in Elixir’s bio.tools. With Nadeesh’s work users can take advantage of those images directly from within Apache Taverna.

Nadeesh Dilanga was sponsored by GSOC 2016 and mentored by Alan Williams from University of Manchester.

Common Workflow Language

Common Workflow Language (CWL) is a specification for describing analysis workflows and tools that are portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high performance computing (HPC) environments. CWL is designed to meet the needs of data-intensive science, such as Bioinformatics, Medical Imaging, Astronomy, Physics, and Chemistry. Workflow systems that have or are working on CWL support include the reference implementation cwltool, Arvados, Galaxy, MG-RAST AWE, bcbio and Rabix Bunny.

Steps in CWL workflows are thoroughly described with metadata about command line parameters, and often provided as Docker images. Thus running a CWL workflow can be seen as composing a series of Docker-wrapped command line tools.
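As an illustration of such a description, the following Python sketch assembles a small CWL v1.0 CommandLineTool document and prints it as JSON, which CWL accepts alongside YAML. The helper `cwl_command_line_tool`, the `echo` tool, the `message` input and the `debian:stable-slim` image are assumptions picked for the example; they are not taken from Thilina's plugin or from any BioExcel tool.

```python
import json


def cwl_command_line_tool(base_command, docker_image, inputs):
    """Return a dict describing a CWL v1.0 CommandLineTool.

    base_command -- the executable to run, e.g. "echo"
    docker_image -- image the tool should be run in (DockerRequirement)
    inputs       -- list of (name, cwl_type) pairs, bound positionally
    """
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "baseCommand": base_command,
        # A hint (rather than a requirement) lets engines without
        # Docker support still attempt to run the tool locally.
        "hints": [
            {"class": "DockerRequirement", "dockerPull": docker_image}
        ],
        # Each input is placed on the command line in the order given.
        "inputs": {
            name: {"type": cwl_type, "inputBinding": {"position": position}}
            for position, (name, cwl_type) in enumerate(inputs, start=1)
        },
        "outputs": [],
    }


# Hypothetical example: an `echo` tool with one string parameter.
tool = cwl_command_line_tool("echo", "debian:stable-slim", [("message", "string")])
print(json.dumps(tool, indent=2))
```

Saved to a file, a description like this could be handed to a CWL engine such as the reference implementation cwltool, together with a value for `message`.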

In his project, Thilina added taverna-cwl-activity-ui, a plugin to browse and add CWL tools within the Taverna Workbench, including metadata from the CWL Command Line Tool Description, as well as a prototype taverna-cwl-activity for invoking CWL tools. Further integrations will look at using the Docker activity and loading CWL workflows using the taverna-language API for building and manipulating Taverna workflows.

In BioExcel we are planning to describe wrapped tools (workflow components) using the EDAM ontology and CWL. Thilina’s work will make it possible to add such tools directly to the Taverna Workbench.

Thilina Manamgoda was sponsored by GSOC 2016 and mentored by Stian Soiland-Reyes from University of Manchester.

Learn more about workflows in Barcelona

Do you want to learn more about scientific workflow systems? Why not come to Barcelona this October for BioExcel’s workflow training workshop? The sessions include tasters of many of the workflow systems, as well as a chance to meet the developers in the Bring Your Own Workflow session.

Consider GSOC 2017 for your academic projects!

GSOC is open to anyone creating Open Source software, so make sure your organization applies when the GSOC 2017 programme opens in February 2017. Read up on what it means to be a GSOC mentor. It doesn’t take that much time per week: just keep your student informed and engaged.

Many academic institutions join GSOC, even with quite specific projects. You can encourage your own students to apply, as well as draw from the vast pool of students across the world. To enthuse the students to choose you, develop several Project Ideas on your wiki, with plenty of links to reference materials. Students like to read up before they apply, as they have many potential organizations and ideas to choose between!

Here are most of the academic open source projects from GSOC 2016:

Bioinformatics:

Medical:

Scientific programming:

HPC and cloud:

Mathematics and statistics:

Engineering and Robotics:

Physics and astronomy:

Machine learning:

Natural Language Processing:

Geoinformatics:

Others:

Research groups:

(Source: https://summerofcode.withgoogle.com/organizations/ 2016-09-15)

Will you be on the list next year?


This article is authored by Stian Soiland-Reyes, University of Manchester and distributed under a Creative Commons Attribution-ShareAlike 4.0 International License.

The GSOC 2016 logo is reused under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 Unported License. Source: developers.google.com

The Docker logo is a trademark used by permission from Docker, Inc. Source: docker.com/brand-guidelines

The Common Workflow Language logo is (C) Copyright 2016 the Common Workflow Language Project and is released under a Creative Commons Attribution-ShareAlike 3.0 Unported License.