Data Integration for Compute

The goal of this study is to host and provide standardised compute access to important public reference and sensitive data sets in relevant (cloud) providers such that:

  • Datasets can be accessed locally via consistent, versatile interfaces (e.g. S3 or POSIX)
  • Local copies are updated as new versions are released
  • Users and cloud managers are able to manage their own data set subscriptions
  • There are secure but easy to use access control mechanisms
  • There is assured network connectivity (bandwidth and security) through software-defined networking

Many bioinformatics analyses depend on reference data sets. Transferring data sets on demand delays the start of an analysis, because moving large data sets takes time. Pre-caching relevant data sets on popular cloud resources instead means that they are already available when they are needed. This involves three tasks:

WP2.1: Provisioning of federated storage namespace

Lead: Christine Staiger (ELIXIR NL)

Provide a location-independent mechanism for identifying data that can then be resolved to the location(s) of the data in the physical infrastructure. This will allow a researcher to find where a specific data set is located and to decide whether they can move their workload to that location or whether a Site-2-Site data transfer is needed before starting computations.
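
The resolution step described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual namespace service: the identifier format, the site names, the URLs and the functions resolve and plan_access are all hypothetical, and the namespace is a plain in-memory dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    site: str  # hypothetical site name
    url: str   # hypothetical physical location (S3 or POSIX path)


# Hypothetical namespace: location-independent identifier -> known replicas.
NAMESPACE = {
    "dataset:example-reference/v1": [
        Replica("site-a", "s3://site-a-bucket/example-reference/v1/"),
        Replica("site-b", "file:///data/example-reference/v1/"),
    ],
}


def resolve(dataset_id):
    """Return every known physical location for an identifier."""
    return list(NAMESPACE.get(dataset_id, []))


def plan_access(dataset_id, compute_sites):
    """Decide between running next to the data and requesting a transfer.

    compute_sites is the set of sites where the researcher can run workloads.
    """
    replicas = resolve(dataset_id)
    if not replicas:
        raise KeyError(f"unknown data set: {dataset_id}")
    for replica in replicas:
        if replica.site in compute_sites:
            # The workload can move to a site that already holds the data.
            return ("run-at-data", replica)
    # No overlap: a Site-2-Site transfer from some replica is needed first.
    return ("site-2-site-transfer", replicas[0])
```

A researcher with access to site-b would be directed to the existing copy there, while a researcher with access only to a site that holds no replica would be told that a transfer is needed first.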

WP2.2: Site-2-Site Data Transfers

Lead: Andrea Cristofori (EMBL-EBI)

The Reference Data Set Distribution Service (RDSDS), complemented where needed by other methods for site-to-site transfer (e.g. BioMaj), will be used to allow sites to subscribe to specific public data sets. A subscribed data set is provisioned onto the site's cloud resources at the specified location, and again whenever a new version of that data set is made available. Sensitive data sets will be made available through a secure cloud environment in which the data set is hosted remotely and remains encrypted in situ. The right to access the sensitive data is verified each time the data is accessed by the user. An ELIXIR webinar in May 2018 introduced the proposed technologies for secure data transfer between two ELIXIR nodes.
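
The subscription idea above can be illustrated with a minimal sketch of how a site might work out which subscribed data sets need to be provisioned or refreshed. It does not reflect the RDSDS or BioMaj interfaces: the function pending_updates, its arguments and the version strings are assumptions made purely for illustration.

```python
def pending_updates(subscriptions, local_versions, published_versions):
    """Return {data set name: version to fetch} for one site.

    subscriptions      -- names of data sets the site has subscribed to
    local_versions     -- {name: version} already provisioned at the site
    published_versions -- {name: latest version} announced by the source
    All three structures are hypothetical stand-ins for real service state.
    """
    updates = {}
    for name in subscriptions:
        latest = published_versions.get(name)
        if latest is None:
            continue  # nothing has been published under this name
        if local_versions.get(name) != latest:
            # Either never provisioned locally or a newer release exists.
            updates[name] = latest
    return updates
```

Only data sets the site has subscribed to are considered, so a new release of an unsubscribed data set does not trigger a transfer.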

WP2.3: Site-2-User Data Transfer

Lead: Giacinto Donvito (ELIXIR IT)

Provide a means for large data sets to be delivered asynchronously from their source to where a user needs them for their analysis.