Google Summer of Code (GSOC) is an annual programme where Google sponsors students around the world to contribute to all kinds of open source initiatives, paying a stipend of $5500 for successful students to work full-time on their project for the summer. The programme is highly competitive. It gives selected students not just a fun time to code and learn something new (and experience for their CV), but also teaches them about distributed collaboration and open source development practices, which are often missing from Software Engineering and Computer Science classes in higher education.
An important part of GSOC is that each accepted proposal has one or two dedicated mentors who are already part of the open source project. Mentoring is part technical, part social: helping the students get to know the software stack and make good coding decisions, showing them how to communicate their ideas with the community (mailing lists, group chats, wikis, blogs), encouraging them to provide regular GitHub pull requests, and teaching them about the often dreaded task of ensuring the licensing is OK for new files and dependencies that are added to the open source project.
BioExcel’s work package on workflows is exploring how to build reusable workflow components for biomolecular simulation in systems like Apache Taverna (incubating), Galaxy, COMP Superscalar, BCBio and KNIME, and we knew that interoperability and ease of deployment were important aspects of that work.
I was therefore delighted that two of the GSOC 2016 proposals for Apache Taverna got accepted: Nadeesh Dilanga adding support for the container stack Docker, and Thilina Manamgoda adding support for Common Workflow Language. Through the application process we also found two GSOC-inspired students, Rajan Maurya and Sagar, who volunteered to self-mentor and improve our Android app Taverna Mobile – itself a product of GSOC 2015.
Below I summarize the GSOC 2016 contributions to Apache Taverna. Note that these are just two of the many accepted student proposals across the Apache Software Foundation and other organizations.
Docker
Docker is a Linux container virtualisation platform that is popular for distributing and running server and command line applications on cloud instances in a reproducible manner, where they can then form a distributed and horizontally scalable microservice architecture. For research software engineers, Docker can be particularly valuable as it simplifies installing and running open source software, scientific codes and tools. No more ./autoconf.sh && ./configure --with-fortran=/store/lib/fortran!
While Apache Taverna can execute remote and local services and command line tools, it has no built-in support for distributing those services and tools with the workflow. Thus portability and reuse are reduced if a workflow uses custom installed software – but those are two big reasons for using workflows in the first place.
In his project, Nadeesh has therefore added a new taverna-docker-activity to Taverna, which can be used to create and start Docker containers from within the workflow. This means that the workflow can start its own web services, or download a ready-to-run container image, for instance with GROMACS installed; such workflows could then execute just as well for a different user or on a remote cloud instance. No more grid certificates!
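To make the idea concrete, here is a minimal Python sketch of what "running a tool from a ready-made container image" amounts to on the command line. It is not how taverna-docker-activity is implemented; the function name, the image name example/gromacs and the /data mount point are hypothetical and chosen only for illustration. The sketch only assembles the docker run arguments, so it can be read and tested without Docker installed.

```python
def docker_run_command(image, tool_args, workdir=None, remove=True):
    """Build the argument list for a `docker run` invocation.

    image     -- name of the container image (hypothetical in the example below)
    tool_args -- command and arguments to execute inside the container
    workdir   -- optional absolute host directory, mounted at /data in the container
    remove    -- if True, ask Docker to delete the container after it exits
    """
    cmd = ["docker", "run"]
    if remove:
        cmd.append("--rm")
    if workdir is not None:
        # Bind-mount the host directory and make it the working directory,
        # so the tool reads and writes files the workflow can see.
        cmd += ["-v", f"{workdir}:/data", "-w", "/data"]
    cmd.append(image)
    cmd += list(tool_args)
    return cmd


# Hypothetical image name; any image with the tool installed would do.
command = docker_run_command(
    "example/gromacs", ["gmx", "--version"], workdir="/tmp/run1"
)
print(" ".join(command))
```

The resulting list could be handed to subprocess.run on a machine where Docker is available. Because the image carries the installed software, the same command should behave the same for a different user or on a cloud instance, which is the portability gain described above.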
In BioExcel we are planning to wrap several tools as Docker images and register them in Elixir’s bio.tools. With Nadeesh’s work users can take advantage of those images directly from within Apache Taverna.
Nadeesh Dilanga was sponsored by GSOC 2016 and mentored by Alan Williams from University of Manchester.
Common Workflow Language
Common Workflow Language (CWL) is a specification for describing analysis workflows and tools that are portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high performance computing (HPC) environments. CWL is designed to meet the needs of data-intensive science, such as Bioinformatics, Medical Imaging, Astronomy, Physics, and Chemistry. Workflow systems that have or are working on CWL support include the reference implementation cwltool, Arvados, Galaxy, MG-RAST AWE, bcbio and Rabix Bunny.
Steps in CWL workflows are thoroughly described with metadata about command line parameters, and often provided as Docker images. Thus running a CWL workflow can be seen as composing a series of Docker-wrapped command line tools.
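As a rough illustration of that metadata, the Python sketch below holds a tool description as a dictionary shaped like a CWL CommandLineTool (baseCommand, inputs with inputBinding, and a DockerRequirement hint) and turns a set of input values into a command line. It is a heavily simplified assumption of how binding works, not the behaviour of cwltool or of the Taverna plugins: files are plain strings rather than CWL File objects, only position and prefix are honoured, and the image name example/grep-image is hypothetical.

```python
# A tool description shaped like a CWL CommandLineTool document.
tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": ["grep"],
    # Hypothetical image name, for illustration only.
    "hints": {"DockerRequirement": {"dockerPull": "example/grep-image"}},
    "inputs": {
        "pattern": {"type": "string", "inputBinding": {"position": 1}},
        "ignore_case": {
            "type": "boolean",
            "inputBinding": {"position": 0, "prefix": "-i"},
        },
        "infile": {"type": "File", "inputBinding": {"position": 2}},
    },
}


def build_command(tool, job):
    """Turn input values (job) into an argument list, ordered by position."""
    bound = []
    for name, spec in tool["inputs"].items():
        binding = spec.get("inputBinding")
        if binding is None or name not in job:
            continue
        value = job[name]
        args = []
        if spec["type"] == "boolean":
            # A boolean input contributes only its prefix, and only when true.
            if value and "prefix" in binding:
                args = [binding["prefix"]]
        else:
            if "prefix" in binding:
                args.append(binding["prefix"])
            args.append(str(value))
        bound.append((binding.get("position", 0), name, args))
    # Sort by position, then by input name to break ties.
    bound.sort(key=lambda item: (item[0], item[1]))
    command = list(tool["baseCommand"])
    for _, _, args in bound:
        command += args
    return command


job = {"pattern": "ATG", "ignore_case": True, "infile": "seq.fa"}
print(build_command(tool, job))
print(tool["hints"]["DockerRequirement"]["dockerPull"])
```

A workflow engine would run the resulting command inside the image named by dockerPull, and pass each step's outputs on as the inputs of the next step.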
In his project, Thilina added taverna-cwl-activity-ui, a plugin to browse and add CWL tools within the Taverna Workbench, including metadata from the CWL Command Line Tool Description, as well as a prototype taverna-cwl-activity for invoking CWL tools. Further integrations will look at using the Docker activity and loading CWL workflows using the taverna-language API for building and manipulating Taverna workflows.
In BioExcel we are planning to describe wrapped tools (workflow components) using the EDAM ontology and CWL. Thilina’s work will make it possible to add such tools directly to the Taverna Workbench.
Thilina Manamgoda was sponsored by GSOC 2016 and mentored by Stian Soiland-Reyes from University of Manchester.
Learn more about workflows in Barcelona
Do you want to learn more about scientific workflow systems? Why not come to Barcelona this October for BioExcel’s workflow training workshop? The sessions include tasters of many of the workflow systems, as well as a chance to meet the developers in the Bring Your Own Workflow session.
Consider GSOC 2017 for your academic projects!
GSOC is open to anyone creating Open Source software, so make sure your organization applies when the GSOC 2017 programme opens in February 2017. Read up on what it means to be a GSOC mentor; it doesn’t take that much time per week, as long as you keep your student informed and engaged.
Many academic institutions join GSOC, even with quite specific projects. You can encourage your own students to apply, as well as drawing from the vast pool of students across the world. To encourage students to choose you, develop several Project Ideas on your wiki, with plenty of links to reference materials. Students like to read up before they apply – they have many potential organizations and ideas to choose between!
Here are most of the academic open source projects from GSOC 2016:
Bioinformatics:
- BioJs – visualize biological data on the web
- Canadian Centre for Computational Genomics (C3G)
- cBioPortal for Cancer Genomics
- Computational Biology @ University of Nebraska-Lincoln
- Ensembl Genome Browser
- National Resource for Network Biology (NRNB)
- GA4GH – Global Alliance for Genomics & Health
Medical:
- International Neuroinformatics Coordinating Facility
- Monarch Initiative – Crowdsourcing rare disease patient symptoms
- OpenMRS – software for health care in developing countries
- Stony Brook University Biomedical Informatics
- Open Ephys – neuroscientists open-source tools
Scientific programming:
- NumFOCUS supports open source scientific software
- SciRuby – Tools for Scientific Computing in Ruby
- Sustainable Computing Research Group (SCoRe) at University of Colombo
- Timelab Scientific Software
HPC and cloud:
- Cray Chapel – a productive parallel programming language
- Distributed and Unified Numerics Environment (DUNE) – solving partial differential equations (PDE) using parallel and super-computers
- STE||AR Group – new approach to parallel computation
Mathematics and statistics:
- GNU Octave – high-level interpreted language, primarily intended for numerical computations
- R project for statistical computing
- Sage Mathematical Software System
Engineering and Robotics:
- ASCEND – Equation-solving software for engineering system modelling
- ArchC – Architecture Description Language
- JSK Robotics Laboratory
- CVXPY – a Python-embedded modeling language for convex optimization problems
- Mobile Robot Programming Toolkit (MRPT)
- MBDyn, Department of Aerospace Science and Technology at Politecnico di Milano
- McGill Space Institute
- Open Source Robotics Foundation
- Portland State University
- Scilab – Numerical computation IDE for engineering and scientific applications
Physics and astronomy:
- CERN SFT – Software for Experiments group at the European Organization for Nuclear Research
- OpenAstronomy
- TARDIS – radiative transfer tool to determine theoretical observables for exploding stars by relying on Monte Carlo techniques
Machine learning:
- mlpack – a scalable C++ machine learning library
- Shogun Machine Learning Toolbox
- The Center for Connected Learning and Computer-Based Modeling, Northwestern University
- Orange – Data Mining Fruitful & Fun
- OpenCV, the Open Source Computer Vision Library
Natural Language Processing:
- Classical Language Toolkit
- Red Hen Lab – research into multimodal communication
- Unitex/GramLab – multilingual, lexicon- and grammar-based corpus processing suite
Geoinformatics:
- 52°North Initiative for Geospatial Open Source Software
- OSGeo – The Open Source Geospatial Foundation
Others:
- BuildmLearn – creating open source tools and enablers for teachers and students
- Gambit – Software Tools for Game Theory
- Public Lab – help communities measure and analyze pollution
Research groups:
- AOSSIE – The Australian National University’s Open-Source Software Innovation and Education
- Berkman Center for Internet and Society
- Boston University
- Computational Science and Engineering at TU Wien
- OSU – The Open Source Lab at Oregon State University
- PLASMA Lab at the University of Massachusetts Amherst
(Source: https://summerofcode.withgoogle.com/organizations/ as of 2016-09-15)
Will you be on the list next year?
This article is authored by Stian Soiland-Reyes, University of Manchester and distributed under a Creative Commons Attribution-ShareAlike 4.0 International License.
The GSOC 2016 logo is reused under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 Unported License. Source: developers.google.com
The Docker logo is a trademark used by permission from Docker, Inc. Source: docker.com/brand-guidelines
The Common Workflow Language logo is (C) Copyright 2016 the Common Workflow Language Project and is released under a Creative Commons Attribution-ShareAlike 3.0 Unported License.