Expanding the Galaxy: meeting (the needs of) ELIXIR Communities

As data analysis is now commonplace in the life sciences, we need scalable ways to develop and share analysis workflows and to train researchers to use them. Training entails an end-to-end approach: from access to data, through selection and proper use of the appropriate workflow, to deployment on available (cloud) resources.

The ELIXIR Communities bring together domain experts. This makes them an ideal setting in which to identify and develop standard workflows for the commonly used analyses of each domain. Since summer 2016, the Galaxy Training Network has been collaboratively collecting and further developing training material for data analysis in Galaxy and for Galaxy development and administration (https://training.galaxyproject.org).

This project has three main goals:

  1. Expand the portfolio of Community workflows, including training material to describe them
  2. Facilitate access to data in Core Data Resources and Deposition Databases
  3. Improve the user experience of the Galaxy platform

Background

Galaxy is a workflow management system that 1) provides support for reproducible science, 2) facilitates sharing of data and results and 3) removes the need for users to compile and install tools. Galaxy offers a user interface, accessed through a web browser, into which virtually any command line tool can be integrated. This is done by defining the tool's inputs, outputs and parameters in a wrapper script. As analyses usually consist of multiple steps, tools can be composed into workflows, which facilitates the processing of multiple samples and the reproduction of analyses. Galaxy is available worldwide as a free-to-use online portal, is developed following an open-source policy, and can be freely downloaded for local installation.
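The idea of describing a command line tool by its inputs, outputs and parameters can be sketched in a few lines of Python. This is an illustration only and not Galaxy's actual wrapper format: the `ToolWrapper` class, its fields and the example `sort` command are all assumptions made for this sketch.

```python
import shlex
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToolWrapper:
    """Hypothetical tool description (NOT Galaxy's real wrapper format)."""
    name: str
    command_template: str        # str.format-style placeholders
    inputs: List[str]            # names of input files
    outputs: List[str]           # names of output files
    parameters: Dict[str, str]   # parameter name -> default value

    def build_command(self, values: Dict[str, str]) -> str:
        """Fill the template, using defaults for parameters not supplied."""
        merged = {**self.parameters, **values}
        missing = [k for k in self.inputs + self.outputs if k not in merged]
        if missing:
            raise ValueError(f"{self.name}: missing values for {missing}")
        # Quote every value so file names with spaces stay one argument.
        quoted = {k: shlex.quote(str(v)) for k, v in merged.items()}
        return self.command_template.format(**quoted)


# Illustrative tool: sort a tab-separated file on a chosen column.
sort_tool = ToolWrapper(
    name="sort_by_column",
    command_template="sort -k {column} {infile} > {outfile}",
    inputs=["infile"],
    outputs=["outfile"],
    parameters={"column": "2"},
)

# The same description is reused for every sample, as a workflow step would be.
samples = ["sample_a.tsv", "sample_b.tsv"]
commands = [
    sort_tool.build_command({"infile": s, "outfile": f"sorted_{s}"})
    for s in samples
]
```

Because the tool is described once and applied to each sample, the same step runs identically across samples, which is the property that makes multi-sample processing and reproduction of an analysis straightforward.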

The Galaxy workflow system is extensively used as part of national infrastructures in several ELIXIR Nodes. Many bioinformatics researchers and core facility groups consider Galaxy an integral part of their bioinformatics infrastructure because it enables simplified access to data and analysis tools under a single “intuitive” interface. Education and training are an integral part of the Galaxy community. The Galaxy Training Network (GTN) has been working for several years with GOBLET and the ELIXIR Training Platform to enhance and deliver first-class training to the scientific community, targeting not only scientists but also developers and administrators.

Goals

In this project we will engage with ELIXIR Communities that are not yet well represented in the Galaxy ecosystem. The aim is to deliver deployable workflows for commonly used analyses in these scientific domains. This encompasses the whole stack needed: the availability of tools and exemplar workflows in Galaxy, as well as access to data, both reference data and published research data.

In collaboration with the Training Platform and the Community involved, we will perform a gap analysis to identify which components of the stack are missing or can be improved. We will bring together experts in the scientific discipline, in training and in Galaxy at a hackathon to address these gaps and to develop training materials that document the resulting workflows and enable training. This training material will be included in the Galaxy Training Network initiative https://training.galaxyproject.org (which is indexed by TeSS). Since these events bring together established and potential new trainers, we will combine them with a Train-the-Trainer event.

We will also organise training events for researchers within each Community, using both existing and newly developed material. We will also use these events to assess the usability of Galaxy.

WP1: Enable commonly used workflows for Communities

The bulk of the funding in this project will be allocated to organising events that bring together experts from a scientific Community, Galaxy and the Training Platform. We have selected five Communities (Plant, Metagenomics, Metabolomics, Proteomics and 3D BioInfo). For each one we will organise three events: a hackathon, a training event for researchers and a Train-the-Trainer event. To reduce costs and travel, we envision co-locating these events and running them back to back (per Community).

The aim of the hackathon is to address (selected) issues identified through a gap analysis, so that researchers can perform standard analyses in Galaxy. The gap analysis is carried out in preparation for the hackathon, in collaboration with all stakeholders involved. The issues can range from wrapping tools for Galaxy and building workflows to providing visualisation plugins. Access to data is also in scope, in collaboration with WP2 of this proposal. To disseminate the work done, training materials will be developed based on the hackathon results.

We will ensure that all developments are appropriately referenced in ELIXIR registries, building on the expertise available in the ELIXIR Tools and Interoperability Platforms: tools will be added to bio.tools, containers will be made available in BioContainers where applicable, and workflows will be registered in myExperiment. This will be done in alignment with, and complementary to, the development of this infrastructure in EOSC-Life (WP Tools Collaboratory).

We will combine this hackathon with a Train-the-Trainer event, building on the expertise of the Training Platform. This event aims to improve the teaching skills of the trainers and to make them more familiar with the Galaxy platform and with how it can support trainers and training events.

WP2: Access to data

Access to data is currently a major bottleneck for many users. In collaboration with data providers, we will incorporate ELIXIR Deposition Databases and consult the Communities on which additional resources are of interest.

Access to data from UCSC has been integrated in Galaxy through dedicated additions to the web pages that allow searching this resource. Based on work started in the PhenoMeNal project, a Galaxy tool is being developed to communicate with MetaboLights. Dedicated tools can both retrieve data from and submit data to Deposition Databases. However, these current approaches are too labour-intensive to scale to all ELIXIR Core Data Resources and Deposition Databases, and they are difficult to maintain.

The Omics Discovery Index (OmicsDI, http://omicsdi.org) provides an integrated metadata resource for 20 different databases, currently covering 450,000 datasets (October 2018). These include four ELIXIR Core Data Resources (ArrayExpress, EGA, ENA, PRIDE) and two additional ELIXIR Deposition Databases (BioModels, MetaboLights). OmicsDI already provides web service access for search and metadata retrieval across all integrated resources. To facilitate data access in Galaxy workflows, we will add a method to OmicsDI that provides the direct data download URLs for any selected dataset. This method will be integrated into Galaxy workflows to automatically download and process relevant datasets selected on the basis of standardised metadata criteria.
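The intended flow, selecting datasets by standardised metadata criteria and then resolving them to direct download URLs, can be sketched as follows. This is a minimal illustration under stated assumptions: the `DatasetRecord` structure, its field names, the accessions and the `example.org` URLs are all hypothetical and do not represent the OmicsDI web service or its response format. The sketch makes no network calls.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    accession: str
    database: str
    metadata: Dict[str, str] = field(default_factory=dict)
    file_urls: List[str] = field(default_factory=list)


def select_datasets(records: Iterable[DatasetRecord],
                    criteria: Dict[str, str]) -> List[DatasetRecord]:
    """Keep records whose metadata matches every criterion (case-insensitive)."""
    wanted = {key: value.lower() for key, value in criteria.items()}
    return [
        rec for rec in records
        if all(rec.metadata.get(key, "").lower() == value
               for key, value in wanted.items())
    ]


def download_urls(records: Iterable[DatasetRecord]) -> List[Tuple[str, str]]:
    """Flatten records into (accession, url) pairs, skipping duplicate URLs."""
    seen = set()
    pairs = []
    for rec in records:
        for url in rec.file_urls:
            if url not in seen:
                seen.add(url)
                pairs.append((rec.accession, url))
    return pairs


# Made-up records standing in for a metadata search result.
records = [
    DatasetRecord("DS0001", "database_a",
                  {"organism": "Homo sapiens", "omics_type": "Proteomics"},
                  ["https://example.org/DS0001/run1.raw"]),
    DatasetRecord("DS0002", "database_b",
                  {"organism": "Mus musculus", "omics_type": "Proteomics"},
                  ["https://example.org/DS0002/run1.raw"]),
    DatasetRecord("DS0003", "database_a",
                  {"organism": "homo sapiens", "omics_type": "Proteomics"},
                  ["https://example.org/DS0003/a.raw",
                   "https://example.org/DS0003/b.raw"]),
]

selected = select_datasets(records, {"organism": "Homo sapiens",
                                     "omics_type": "proteomics"})
urls = download_urls(selected)
```

Keeping selection and URL resolution as two separate steps means the same selection logic can feed a Galaxy workflow, another workflow system or a plain script.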

The proposed integration of OmicsDI in Galaxy as a data source enables access to datasets from different data sources through a common entry point. The data access method developed in this project will be independent of Galaxy. This makes integration in other workflow systems or scripts possible, broadening the impact beyond the Galaxy community. We envision this work as a step towards a common way to (programmatically) access ELIXIR Core Data Resources and Deposition Databases, based on both keyword search and commonly used identifiers.

This aligns with the objectives of the ELIXIR Interoperability Platform and the Galaxy community. It will result in improved access to ELIXIR Core Data Resources and Deposition Databases for the whole life science community, as well as a specific integration in Galaxy. The work will be disseminated, for example, through its use in the training materials developed in WP1. In consultation with the ELIXIR Hub, a dedicated webinar can be organised.

Duration
1st June 2019 - 31st May 2021