ELIXIR-CONVERGE WP9: Mobilisation of SARS-CoV-2 variant surveillance data tracking services and tools

The European COVID-19 Data Platform enables the collection and sharing of research data for the European and global research communities. Data mobilisation is a core role for the Platform: via the SARS-CoV-2 Data Hubs, viral raw sequences and assembled genomes can be systematically processed and openly shared, and variants can be examined in a wide ecosystem of visualisation and phylogenetic analysis tools.

The objective of this work-package is to mobilise viral genomes from national sequencing efforts in Europe and beyond into truly open data resources for surveillance of COVID-19 variants and expose variation data rapidly for analysis. This will be achieved by linking nascent and established national sequencing efforts to the COVID-19 Data Platform by setting up submission and annotation pipelines to the European Nucleotide Archive (ENA). Participating efforts will commit to open sharing of variant data through ENA.

Objectives

O9.1	Coordinate nascent and established national data hubs focusing on brokering services to define and foster common best practices.	Task 9.1
O9.2	Mobilise open SARS-CoV-2 genome data into the COVID-19 Data Platform (https://www.covid19dataportal.org) from individual data hubs.	Task 9.2
O9.3	Catalyse agreement on SARS-CoV-2 data standards for variants and lineages.	Task 9.3
O9.4	Drive development of SARS-CoV-2 variant analysis tools.	Task 9.4

Tasks

Task 9.1 Strengthen central COVID-19 Data Portal help desk for individual centres and brokers

Subtask 9.1.1 Strengthen central SARS-CoV-2 Data Hubs Help Desk for easier and improved data submission into the European COVID-19 Data Platform for scalability: This subtask will ensure appropriate availability of data and tools from the COVID-19 Data Portal from EU research projects in which EMBL-EBI is a partner (other than VEO and ReCoDID, as these are covered in their respective projects). These projects include (aligned EU projects shown in parentheses):

deeply mined COVID-19-related literature from Europe PMC. Development work on this will include COVID-19-specific text mining as an EMBL-EBI contribution to the OpenAIRE initiative (OpenAIRE);
compound screening and assay data relating to ongoing COVID-19-related work (EUbOPEN and eTRANSAFE);
chemoinformatics tools that will assist with the integration of compound-related COVID-19 data (EU-ToxRisk, EUbOPEN and TransQST);
access to tools and interfaces relating to clinical and epidemiological data (CINECA).

Subtask 9.1.2 Provide supported SARS-CoV-2 data brokering interfaces for external data platforms: The national/regional SARS-CoV-2 Data Hubs offer a broad portfolio of tools, from simple intuitive "drag and drop" web pages to RESTful APIs. We will extend these tools to reflect the emerging needs of the SARS-CoV-2 data brokering community. To provide clear and efficient data routing for all data providers, we will coordinate across subtasks 9.1.1 and 9.1.2 to ensure that countries and laboratories have a single service provider for their data, be this direct (subtask 9.1.1) or via brokering (subtask 9.1.2).

Leadership: EMBL-EBI

Task 9.2 Coordinate nascent and established national data hubs focusing on brokering services to mobilise data and define and foster common best practices

Subtask 9.2.1 Coordinate nascent and established regional/national data hubs: This subtask will engage national SARS-CoV-2 data hubs and support software engineering and curation resources at each centre, to establish a network of open data submission pipelines. It will bring together the resource managers in charge of the regional/national SARS-CoV-2 data hubs to share expertise and tools, and to define common best practices for capacity building and harmonised approaches.

This will ensure that hubs share tools and code as much as possible to avoid redundancy, and that the data standards adopted in Task 9.3 are well implemented across the national hubs. The network will also enable discussion of the open data policies adopted in each country and support sharing of experience, to encourage countries to support or require ENA data submission in addition to GISAID. The network will also discuss how the open source tools can be adapted to genomic surveillance beyond SARS-CoV-2, in a One Health approach. Finally, the network will work to ensure that data flows between stakeholders' platforms, so that together they build a globally comprehensive set of viral sequence and variation data.

Subtask 9.2.2 Establish dedicated user support and capacity building for viral data management and submission: This task will open up already established national tools for broad use across European member states, and provide training and capacity building for countries and institutes that are now rapidly scaling up their efforts.

Robust platforms already in operation in many ELIXIR Nodes provide the foundations for this work. For example, SIB (CH) has developed the Swiss Pathogen Surveillance Platform (SPSP.ch) and ELIXIR-BE (in collaboration with others) has developed a Galaxy-based submission system, as well as the underlying command-line tools.

Further functionality to submit consensus sequences to ENA as well as to GISAID will be developed. Methods to submit variant data to EVA will be explored and integrated. Brokering functions operated by these platforms will be supported. Use-case based documentation to disseminate the submission tools and best practices will be integrated into the RDMKit. This targeted approach complements the information available in the COVID-19 Data Portal.

Leadership: SIB/ELIXIR-CH

Task 9.3 SARS-CoV-2 data standards

Subtask 9.3.1 Variation calling and variant lineage nomenclature standards: In this subtask, we will establish standards and best practices for variant calling, naming, observation and citation. To achieve this, we will engage key players in viral lineage analysis and naming (e.g. Pangolin, Nextstrain, WHO) to drive harmonisation and establish authoritative naming schemes.

Subtask 9.3.2 Ontologies and controlled vocabularies for metadata: National and regional hubs are collecting metadata in various forms. These metadata are then submitted to the ENA using the ERC000033 checklist, which has both compulsory and recommended fields. Many of the fields of this minimum metadata standard are currently free text. These data are, however, likely to be stored in a structured form at local hubs, using ontologies and controlled vocabularies. In this subtask, the need for ontologies and controlled vocabularies will be discussed, and best-practice recommendations will be outlined and shared within the network established in Task 9.2.
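
As an illustration of the distinction between compulsory fields, recommended fields and controlled vocabularies, the following is a minimal Python sketch of how a hub might check a metadata record before submission. The field names and permitted terms are hypothetical placeholders chosen for the example; they are not the actual content of the ERC000033 checklist, and the function validate_record is not part of any ENA tool.

```python
# Hypothetical checklist definition for illustration only.
# These field names and terms are NOT the real ERC000033 content.
COMPULSORY_FIELDS = {"collection date", "country", "host"}
RECOMMENDED_FIELDS = {"host sex", "host health state"}
CONTROLLED_VOCABULARIES = {
    "host sex": {"female", "male", "not provided"},
    "host health state": {"diseased", "healthy", "not provided"},
}


def validate_record(record):
    """Return a report of missing fields and out-of-vocabulary terms."""
    report = {
        "missing_compulsory": [],
        "missing_recommended": [],
        "invalid_terms": [],
    }
    # A field counts as missing if it is absent or only whitespace.
    for name in sorted(COMPULSORY_FIELDS):
        if not record.get(name, "").strip():
            report["missing_compulsory"].append(name)
    for name in sorted(RECOMMENDED_FIELDS):
        if not record.get(name, "").strip():
            report["missing_recommended"].append(name)
    # Free-text values are compared case-insensitively with the vocabulary.
    for name in sorted(CONTROLLED_VOCABULARIES):
        value = record.get(name, "").strip().lower()
        if value and value not in CONTROLLED_VOCABULARIES[name]:
            report["invalid_terms"].append(f"{name}={value}")
    return report


example = {
    "collection date": "2021-03-01",
    "country": "Switzerland",
    "host": "Homo sapiens",
    "host sex": "F",  # free text that does not match the vocabulary
}
example_report = validate_record(example)
```

In this sketch the free-text value "F" is flagged rather than silently mapped to "female"; whether hubs should map, reject or pass through such values is exactly the kind of question the best-practice recommendations would need to settle.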

Leadership: UKZN/South Africa

Task 9.4 SARS-CoV-2 Analysis

Subtask 9.4.1 Adoption of VEO (Versatile Emerging Disease Observatory) variant and lineage analysis tools into the portal: Supporting work in VEO WP16, we will provide compute capacity and an appropriate workflow engine, integrated in the Data Hubs, to accommodate the evolving and new analytical workflows that emerge. We will provide appropriate adaptation and configuration of these workflows and make available the global viral data content to these.

These workflows will generate processed data that will be integrated in the COVID-19 Data Portal, ensuring common analyses across all data. Expected work will include the hosting of high-throughput phylogenetics tools, a reference implementation of lineage naming, tools to link variation to variants, and country report generation tools.

Subtask 9.4.2 COVID-19 Data Portal data access and analysis: This subtask aims to enable data analysis for all researchers across Europe and globally, irrespective of the availability of local computational infrastructure. We will provide a common environment through Galaxy to enable analysis for all global stakeholders.

We will integrate existing, relevant workflows, including the workflows used within the SARS-CoV-2 Data Hubs. We will ensure relevant data is readily available in the platform, through automatic procedures integrated with the European COVID-19 Data Portal through, for example, APIs or FTP.
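
To make the idea of API-based data access concrete, here is a minimal Python sketch that only constructs a search URL for SARS-CoV-2 sequence records. It assumes the ENA Portal API search endpoint and its result, query, fields, format and limit parameters; these, and the example field names, should be checked against the current ENA documentation before use. The helper build_search_url is a hypothetical name introduced for this example, and this text does not specify which interface the automatic procedures will actually use.

```python
from urllib.parse import urlencode

# Assumed endpoint of the ENA Portal API; verify against ENA documentation.
ENA_PORTAL_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"
# NCBI Taxonomy identifier for SARS-CoV-2.
SARS_COV_2_TAXON_ID = 2697049


def build_search_url(fields, limit=10, result="sequence"):
    """Build (but do not send) a search URL for SARS-CoV-2 records."""
    params = {
        "result": result,
        # tax_tree(...) is assumed to select the taxon and its descendants.
        "query": f"tax_tree({SARS_COV_2_TAXON_ID})",
        "fields": ",".join(fields),
        "format": "tsv",
        "limit": limit,
    }
    # urlencode percent-encodes parentheses and commas in the values.
    return f"{ENA_PORTAL_SEARCH}?{urlencode(params)}"


url = build_search_url(["accession", "country"], limit=5)
```

The resulting URL could then be fetched by a scheduled job in an analysis platform; the sketch stops short of the network call so that it stays independent of service availability.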

Synchronising the data, combined with the mobilisation of data in this WP, allows the development of monitoring services and visualisations. Relevant visualisations can be integrated in Galaxy through regular (static) visualisation tools, interactive Galaxy tools or through integration with external services e.g. viral Beacons, Nextstrain.

The workflows will be made available on public servers (e.g. https://usegalaxy.eu), as well as for local deployment (e.g. through containers). All code developed will be available under Open Source licenses.

Subtask 9.4.3 Dissemination and capacity building: An important aspect of Task 9.4 is training of researchers and support personnel to use the workflows and resources for data analysis and sharing. The training in data submission is covered in Task 9.2. Through hackathons, we will engage with the broader community to develop and implement workflows, data access tools and features of Galaxy. These dissemination efforts will build on the Galaxy Training Network and the online course materials that have been developed in recent months.

Leadership: ALU-FR/ELIXIR-DE

Deliverables

D9.1	Report on mobilised sequence and variation data	March 2023
D9.2	Document where code from the various regional/national data hubs is available; links to analysis tools made available to Data Hubs; training documentation on data submission by the platforms.	May 2023
D9.3	Best practices documents on data standards	July 2023

WP leaders