ELIXIR Compute Platform (2024-26)

The ELIXIR Compute Platform ensures that European cloud, compute, storage and access services fulfil the requirements and are available for the life-science research community. European regulation on health data processing is evolving, computing capacity in national clouds is increasing and international standards are improving interoperability between the compute environments.

Such capabilities are also increasingly important to service the growing volumes of biodiversity, food security and pathogen data which comprise complex data types and must be linked to climate data resources. Also, supercomputer investments - especially EuroHPC - provide completely new kinds of computing capacities for researchers across these domains.

All this shapes the landscape of European compute resources. In particular, computing on sensitive data close to where the data is held is becoming obligatory, and the same increasingly applies to non-sensitive data, e.g. plant phenotyping data and associated imaging data.

In 2024-26, the Platform will build technological capability to enable compute in new European-wide federated and multi-cloud settings by building on existing development and utilising previous work, especially Life Science Login. Beyond this, the platform actively seeks sustainable ways to operate the developed services for European researchers.

The Platform will deliver the services to support federated data management and analytics in life science through five complementary WPs:

Lead partners: CSC (Finland), University of Masaryk (Czech Republic), Berlin Institute of Health at Charité (Germany)

The purpose of this work package is to provide the overall project management and coordination of the activities of ELIXIR Compute Platform. This work package ensures that Platform work progresses as planned.

The work package will also ensure that the platform work plan is aligned with ELIXIR’s strategic vision and key EC funded projects where ELIXIR leads or participates including GDI, ENTRUST, STEERS and EVERSE which are relevant for the overall development of European research infrastructures, data spaces and regulation.

The work package facilitates active dialog with life-science research communities in order to understand the real needs of researchers. This is done especially by collaborating with ELIXIR Communities that represent researchers’ use cases.

Besides technology development, delivering impact on research requires that the technology can be sustained and provided for researchers. This work package seeks ways to sustain the developed technology as part of operational services, both by integrating technology into existing services and by establishing new services where applicable.

The landscape of European research infrastructures, data spaces and related initiatives, such as EOSC, EuroHPC and GAIA-X, is evolving, and the work package is actively participating in this development.

Establishing services in a cross-infrastructure fashion and providing them in wider collaboration is seen as a favourable option. In particular, partnering with other research infrastructures and providing services as part of the EOSC service offering supports the objective of long-term service sustainability. The work package also ensures that the platform activities are well connected and communicated to the research community.

The work package will also bridge the platform activities to GA4GH and participate in the steering of GA4GH by utilising the ELIXIR and GA4GH strategic partnership.

Activity 1: Project management and coordination

The ExCo will set up and facilitate regular Platform-wide meetings, both online and face-to-face. It will also establish proper communication channels for the work and monitor and report the progress regularly.

Activity 2: Sustainability and dissemination

The ExCo will prioritize sustainability planning for platform development activities to ensure long-term viability. It will establish effective bi-directional communication channels to engage research communities, leveraging platforms like ELIXIR Communities, the ELIXIR Training Platform, and existing channels such as ELIXIR webinars.

Additionally, the ExCo will promote knowledge sharing with other ELIXIR platforms and active participation in joint-platform activities to maximize collaborative efforts. It will also serve as the platform's representative in the strategic partnership between ELIXIR and GA4GH. Furthermore, the ExCo will oversee dissemination initiatives aimed at promoting the Compute Platform's advancements and benefits to relevant stakeholders.

Lead partners: CSC (Finland), University of Masaryk (Czech Republic)

An increasing number of services are moving towards working with sensitive data. Such services need to perform authentication of the person working with data and enforce authorization rules with a high level of assurance.

ELIXIR has built and operated the Authentication and Authorization Infrastructure (AAI) service for such tasks, which formed the basis for the joint effort to create a common Life Science AAI (LS AAI) in the EOSC-Life project. The idea of a common AAI service is the ability to provide unified authentication mechanisms, a central place to execute authorization decisions and provide the ability for the relying services to outsource such tasks on the operated infrastructure.

For authentication purposes, ELIXIR AAI and LS AAI rely on the eduGAIN interfederation to enable researchers to access the services using their home organisations. The service is now a mature production service with over 13,000 users who log in per month and increasing adoption by other Research Infrastructures in the EOSC-Life Science Cluster, including BBMRI-ERIC and Instruct-ERIC (https://doi.org/10.5281/zenodo.8144216).

In the next phase, the authentication needs to move towards improving the assurance of the authentication process by adopting various mechanisms providing additional security for user accounts. As for the authorization part, ELIXIR has been piloting the GA4GH Passports and Visa standard, as well as defining its own mechanisms for enabling distributed authorization. In the following development, distributed authorization will focus more on effectively defining authorization rules in different contexts and places, efficiently collecting them and applying them in a broad spectrum of places.

This will boost real-world capabilities for LS AAI usage in critical projects such as GDI, where it is to be implemented as the first line of user access and control for accessing and processing sensitive genomic data across participating 1+MG European member states. New solutions will be tested in an even broader cross-domain context across the science clusters through the EOSC AAI common framework. Last but not least, identity governance will be another point of interest for this task.

Activity 1: Advanced authentication mechanisms

The goal of this work package is to explore new areas of authentication mechanisms by extending the basis built in the previous work packages. This WP will concentrate on exploring new approaches to working with digital identity like Self-Sovereign Identity, e-wallets, and similar technologies and rising alternatives to the traditional Identity Federations approach, while it will also try to integrate these solutions as alternatives to the authentication mechanisms in place.

The federated approach will be iterated on to reach out to entities not participating in the eduGAIN federation, like commercial entities from the pharmaceutical industry, who are not able to join such a federation. Thus, a focus will be building standardised solutions and procedures for integrating with such entities. The aim is also to extend the security aspects of the authentication process by exploring Multi-Factor Authentication (MFA) in the widely distributed EOSC AAI environment.

This WP will additionally try to explore the possibility of providing passwordless authentication as an accompanying authentication mechanism in a federated login environment. By shifting towards this authentication mechanism we expect to achieve a more streamlined authentication process and expect the federative authentication to move towards the support role in the area of identity vetting and backup authentication solutions.

From the perspective of managing the identity for authentication purposes, we want to investigate the ability to correctly de-provision identity data, especially in distributed environments and improve the identity governance aspects by defining strict lifecycle policies of the identity and relevant data which are conformant with auditing and accounting needs.

Activity 2: Authorization in distributed infrastructures

The traditional role of the Authentication and Authorization Infrastructures lies in basic authorization execution and mostly in providing authorization data to the relying parties, which can then perform local authorization based on this data. In the years of operating ELIXIR AAI and participation in Life Science AAI, we have learned about a lot of use cases shared across various relying parties.

This work package should focus on extracting as many repetitive authorization scenarios as possible and integrating them directly into the AAI environment, to remove the burden of designing and implementing such authorization scenarios from the relying parties, enabling them to focus more on the provided functionality.

Iterating on this approach, the goal is to provide Relying Parties, organisations, or managers within the AAI with a framework, where they can define authorization rules in the AAI itself and AAI will evaluate the conditions defined at the time of performing other authorization tasks, resulting in a simple AUTHORISED/UNAUTHORISED information provided to the service with a possible reasoning why the decision has been carried out.
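The rule framework described above could be sketched as follows. This is a minimal illustration only, not the LS AAI implementation: the `Rule` class, the `evaluate` function and the user attribute keys (`groups`, `aup_accepted`, `mfa`) are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical user attributes as an AAI might hold them; the keys are
# illustrative and not taken from any LS AAI schema.
UserAttributes = Dict[str, Any]


@dataclass(frozen=True)
class Rule:
    """One named authorization condition defined by a service manager."""
    name: str
    check: Callable[[UserAttributes], bool]


def evaluate(rules: List[Rule], user: UserAttributes) -> Tuple[str, List[str]]:
    """Return the decision and the names of the rules that failed (the reasoning)."""
    failed = [rule.name for rule in rules if not rule.check(user)]
    decision = "UNAUTHORISED" if failed else "AUTHORISED"
    return decision, failed


rules = [
    Rule("member of project group", lambda u: "project-x" in u.get("groups", [])),
    Rule("accepted acceptable use policy", lambda u: u.get("aup_accepted") is True),
    Rule("logged in with MFA", lambda u: u.get("mfa") is True),
]

alice = {"groups": ["project-x"], "aup_accepted": True, "mfa": True}
bob = {"groups": ["project-x"], "aup_accepted": False, "mfa": True}

print(evaluate(rules, alice))
print(evaluate(rules, bob))
```

The relying service only receives the decision and the list of failed rule names, so it does not need to know how each condition is evaluated.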

Such an approach could be even escalated into environments similar to EOSC AAI to explore these possibilities in a multi-federated and multi-infrastructure environment with an emphasis on efficient definition, sharing, execution and communication of these rules and their results.

We also expect continuous work on existing standards such as entitlements or GA4GH Passports and Visas, with the possibility of extending their applicability to areas outside genomics data and extending authorization to a finer level of granularity, such as a single file in a data set or sections of a file. Last but not least, with the increased emphasis on compliance with policies like GDPR and the need to manage approvals of policies like AUP, Privacy Notice, and Terms of Use, this work package will explore possibilities to record such approvals and re-use these records as further authorization data inputs.
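As an illustration of using visas and recorded policy approvals together as authorization inputs, the sketch below checks already decoded visa claims. It assumes the claim layout of the GA4GH Passport specification as we understand it (a `ga4gh_visa_v1` object with `type`, `value`, `source`, `by` and `asserted`, plus a standard `exp` expiry); signature verification is deliberately left out, and all URLs and the `may_read` helper are hypothetical.

```python
from typing import Any, Dict, List, Set

Visa = Dict[str, Any]

# Hypothetical identifiers used only for this example.
DATASET = "https://data.example.org/datasets/710"
POLICY = "https://aai.example.org/policies/aup-v2"


def active_values(visas: List[Visa], visa_type: str, now: int) -> Set[str]:
    """Collect the `value` of every unexpired visa of the given type."""
    values = set()
    for visa in visas:
        claim = visa.get("ga4gh_visa_v1", {})
        if claim.get("type") == visa_type and visa.get("exp", 0) > now:
            values.add(claim.get("value"))
    return values


def may_read(visas: List[Visa], dataset_url: str, policy_url: str, now: int) -> bool:
    """Require both an access grant for the dataset and an accepted policy."""
    grants = active_values(visas, "ControlledAccessGrants", now)
    accepted = active_values(visas, "AcceptedTermsAndPolicies", now)
    return dataset_url in grants and policy_url in accepted


visas = [
    {"exp": 2_000_000_000,
     "ga4gh_visa_v1": {"type": "ControlledAccessGrants", "asserted": 1_690_000_000,
                       "value": DATASET, "source": "https://dac.example.org", "by": "dac"}},
    {"exp": 2_000_000_000,
     "ga4gh_visa_v1": {"type": "AcceptedTermsAndPolicies", "asserted": 1_690_000_000,
                       "value": POLICY, "source": "https://aai.example.org", "by": "self"}},
]

print(may_read(visas, DATASET, POLICY, now=1_700_000_000))
```

A finer-grained variant could use file-level URLs as visa values, which is one of the extensions this activity intends to explore.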

Activity 3: User Experience, community engagement, training and outreach

The two main ideas of Authentication and Authorization Infrastructures are to let users access integrated services via a unified, familiar process and to enable authorization for the whole infrastructure to be performed in a single place.

Looking at the regular user process, the service access workflow might get fairly complicated, especially in the federated model, where users use their Home Organizations as authentication providers or can fall back to social logins or other authentication means if they cannot or do not want to use their institute, university, or similar Identity Provider.

The situation might get even more complicated when users hop between services or use multiple external accounts to access them. To mitigate the risks of leaving the user confused, this task will focus on continuous improvements of the AAI user-facing interfaces to improve the user experience and achieve streamlining of the user workflows. These activities will involve developing methodologies for KPIs collection and user evaluation, user testing, as well as applying the user-centric design from the start.

We expect to cooperate with Tasks 1 and 2 of WP1 as well as with WPs 3-5, especially on their deliverables that overlap with authentication and authorization. We envision this cooperation mainly in providing feedback and inputs into the design aspects of the activities executed in different tasks.

To support the goal of providing the best possible user experience, this task will explore ways of evaluating the Authentication and Authorization Infrastructure as a service not just from the end-user perspective, but also from the point of view of people in management roles who use the AAI primarily for the second A, authorization. We would like to approach this in an Agile way by applying Continuous Discovery, meaning periodic, small, incremental improvements proposed for the most critical parts of the infrastructure.

From the integration perspective, we would like to explore building an AAI-centric SDK. We do not envision directly developing a set of tools, but rather picking and evaluating existing software components and providing in-depth guidance on their usage with the AAI.

To spread the utilisation of the Authentication and Authorization Infrastructure, this Task will also focus on building the community. The primary focus will be on organising events based primarily on inputs from Relying Party integrators, without omitting end-users and Identity Providers, to discuss AAI-relevant topics and integrations or to share recent developments in the AAI area. Last but not least, this Task will also focus on outreach and education for AAI-related customers.

Lead partners: CSC (Finland), University of Masaryk (Czech Republic), Berlin Institute of Health at Charité (Germany)

Sensitive data is a critical asset for the ability to carry out essential research in even potentially critical situations, e.g. as proven during the COVID-19 pandemic. Providing easy access to such data speeds up the research process, resulting in faster development of new disease cures, or exploring and further understanding of rare diseases.

However, by nature, input data for such research tasks contain sensitive information and thus needs to be correctly handled and protected. In the previous ELIXIR programme, the Compute Platform has been working in this area already, and we would like to continue this effort.

We expect to deliver to researchers a framework of products for secure access to sensitive data for use in a wide range of research settings. Our aim is to build a complex solution covering the process from the start (data discovery) through the analysis/processing up to the very end (research publication finalisation). It involves building and evolving a framework of tools for the secure discovery of sensitive data, enabling research groups to then implement and adapt to their needs using standard components in the framework.

Moving further in the process, tools for possible access negotiation, secure transfer of the data to a computing environment, executing the computing workflow, and last but not least, delivery or interpretation of the result in a secure way must be considered. The work will be supported by new external funding in the Horizon Europe INFRA-EOSC and INFRA-DEV Programmes as well as the ongoing development in the GDI project.

The whole process also needs to cover Authentication, Authorization, Auditing, and optional accounting, and to be coherent with the regulations in a transnational, federated environment. To this end we will closely align the activities of WP3 with WP2, WP4 and WP5. We will identify, build and gather technical solutions on top of the outcomes of work previously carried out in the ELIXIR programme, for example by leveraging and further developing the AAI and cloud infrastructures, sensitive data management platforms (REMS), federated EGA repositories, and similar.

In order to understand how such environments can be utilised by researchers the aim is to provide guidance documentation on both the technical and organisational measures to secure the data as well as outreach to researchers through workshops in order to get feedback on how such services can be utilised.

Activity 1: Sensitive data access as a process

Researchers may require access to sensitive data in different scenarios, which we here describe as use cases. The aim of this task is to collect and document potential use cases for accessing sensitive data in detail. These use cases will be refined, revised and extended over the course of the project. Currently, the different use cases include:

Direct access to data in a repository: Here, data is archived in a single database. The user requests access to the data, either directly or indirectly via a platform. The user is then authorised to use a specific set of data. To do this, the user logs on to a database or platform and transfers the data in encrypted form to their own secure environment. In the secure environment, the user can decrypt the data and analyse it further. For example, a Trusted Research Environment (TRE) can be used here by the user.
Federated database: The data of interest is stored in a federated form on different data nodes. The data nodes provide metadata via APIs, which can be browsed in a central portal. Through the central portal, a user can request access to selected data at the respective data nodes. The nodes then grant the user access to individual datasets. The data is then collected from the nodes, combined and made available to the user either for data transfer or for processing within a TRE.
Centralised platform: The platform contains centralised sensitive data. Users can request access to a subset of the data. If authorised, users can then log into the platform and work directly with the data. This is done, for example, by running workflows or interactive analysis using scripts. The platform provides users with a secure Virtual Research Environment (VRE) for this purpose.
Federated processing: Here, sensitive data resides on different nodes that are not directly connected and the transfer of data to other environments is restricted. The nodes provide APIs through which analysis can be performed on the data, and together they form a secure processing environment (SPE). The user can perform an analysis that combines data from different sources (nodes) by executing an analysis on a platform, which then distributes individual tasks to different nodes based on where the data is available. The results are then combined and made available to the user. A special case of this is federated learning, where models are trained in several iterations and updated on different data sets. Application of technologies such as Confidential Containers (https://github.com/confidential-containers) will also be in the scope of this use case.
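The federated processing use case can be illustrated with a small sketch in which each node returns only an aggregate and a coordinating platform combines the partial results. The node names, the data values and the `local_task` and `federated_mean` helpers are all hypothetical; a real deployment would call node APIs rather than read local lists.

```python
from typing import Dict, List, Tuple

# Hypothetical per-node measurements; in a real deployment these values
# would never leave the node that holds them.
NODE_DATA: Dict[str, List[float]] = {
    "node-a": [4.0, 6.0],
    "node-b": [10.0],
    "node-c": [1.0, 2.0, 3.0],
}


def local_task(values: List[float]) -> Tuple[float, int]:
    """Runs at a node: return only the sum and the count, not the records."""
    return sum(values), len(values)


def federated_mean(nodes: Dict[str, List[float]]) -> float:
    """Runs at the platform: distribute the task and combine partial results."""
    partials = [local_task(values) for values in nodes.values()]
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count


print(federated_mean(NODE_DATA))
```

The combined mean equals the mean over the pooled data, although the platform only ever sees one sum and one count per node. Federated learning follows the same pattern, with model updates in place of sums.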

Activity 2: Sensitive Processing Environment Services

In WP3.2 we will support the technical implementation of secure platforms by, among other things, linking the use cases from WP3.1 with the framework components provided by WP2 and WP4. On the basis of a catalogue of services, we will enable the development of secure environments that meet the requirements of at least one of the use cases defined in WP3.1.

These environments will combine authentication and authorisation features from WP2 with the solutions and services from WP4. For demonstration purposes, the platform will be implemented, tested and documented in cloud environments. Experiences, gaps and issues from the use case implementations will be fed back to WP2 and WP4.

Activity 3: Guidance and outreach

The general direction of subpackage WP3.3 is towards landscaping and making available documentation and policy, and towards alignment and mutual exchange on standards, guidelines, SOPs and the practical handling of sensitive data access.

These Sensitive Data Processing environments will be implemented, tested and documented as part of WP 3.2, yet the providers of such environments need to be able to communicate to researchers how such environments are run and under what conditions they can be used. For these reasons WP 3.3 will seek to bring together providers of SPEs and consult with Legal and Ethics Officers in order to list:

User policies for SPEs on working with sensitive data (out of scope is compliance with legislation and frameworks)
Service policies
Infrastructure policies
This will result in a guidance document on technical and organisational security measures for protecting and working with sensitive data.

At the same time during this work package we will investigate through workshops:

Involvement of patient organisations / other stakeholder organisations / representatives as guests / participants of the workshop(s)
Connection to national medical informatics infrastructures (e.g. NUM & MII in Germany)
EU cross-border data sharing, or international (US, Asia) work with and sharing of sensitive data

Lead Partners: CSC (Finland), University of Masaryk (Czech Republic), Berlin Institute of Health at Charité (Germany)

With use cases such as outscaling peak performance needs and bringing compute to where the data resides, rather than fetching the data first and computing on it on premise, data analytics in the Life Sciences is increasingly shifting towards the cloud. Previously, ELIXIR Compute addressed these developments by exploring hybrid cloud and containerization techniques and developing services based on widely adopted community standards, including Cloud APIs defined by the Global Alliance for Genomics and Health (GA4GH), for which ELIXIR Compute is leading a Driver Project.

In the next phase, the various previous efforts will be bundled, hardened and - with the help of ELIXIR-internal Driver Projects including GDI - further matured. In particular, this task will coordinate the development of an ELIXIR-wide federated hybrid- and multicloud-ready set of packages consisting of various GA4GH-compatible backend microservices deployed across the ELIXIR Nodes, and a micro-frontend-based web portal to enable end users to operationalise the platform.

Through it, users will be able to discover, fetch and execute a range of workflow types (e.g., CWL, Nextflow, Snakemake, Galaxy) on HPC clusters, native cloud clusters and commercial clouds, accessing data on commonly used storage solutions (e.g., s3). Access control management, the handling of sensitive data and compute-related provenance will be addressed together with WPs 2, 3 and 5, respectively.
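As a rough illustration of how a portal might hand a workflow to an execution service, the sketch below assembles a run request. The field names follow our reading of the GA4GH Workflow Execution Service (WES) run request and should be checked against the specification version actually deployed. The `SUPPORTED` table, its type identifiers and versions, the `build_run_request` helper and the example URLs are assumptions made for this example only.

```python
import json
from typing import Any, Dict, List

# Hypothetical capabilities of one execution service instance. Real
# services advertise their own workflow types and versions.
SUPPORTED: Dict[str, List[str]] = {"CWL": ["v1.0", "v1.2"], "NFL": ["DSL2"]}


def build_run_request(workflow_url: str, workflow_type: str,
                      version: str, params: Dict[str, Any]) -> Dict[str, str]:
    """Validate the workflow type and return the form fields of a run request."""
    if version not in SUPPORTED.get(workflow_type, []):
        raise ValueError(f"unsupported workflow: {workflow_type} {version}")
    return {
        "workflow_url": workflow_url,
        "workflow_type": workflow_type,
        "workflow_type_version": version,
        # Workflow inputs travel as a JSON-encoded string.
        "workflow_params": json.dumps(params, sort_keys=True),
    }


request = build_run_request(
    "https://workflows.example.org/align.cwl",
    "CWL",
    "v1.2",
    {"reads": {"class": "File", "location": "s3://example-bucket/sample.fastq"}},
)
print(request["workflow_type"], request["workflow_type_version"])
```

Nothing is sent over the network here; the returned fields would be submitted to the runs endpoint of the chosen execution service.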

In addition to the ELIXIR::GA4GH cloud platform, the task will further coordinate the bundling of backend and frontend components into an "ELIXIR Cloud SDK", which will allow (1) systems administrators to easily deploy on-premise, hybrid and multicloud solutions, and (2) service developers to quickly adopt the necessary APIs and guidelines to interoperate with the ELIXIR cloud platform.

Finally, the WP will coordinate the engagement of use cases, Nodes and service developers, provide documentation, training and support mechanisms, and liaise and align with developments in relevant organisations beyond ELIXIR, such as GA4GH and EOSC.

Activity 1: Harden and expand ELIXIR Cloud SDK use case portfolio

The activity will harden existing service components by improving resilience, scalability, and service security and by integrating work provided by other WPs with respect to access control, data security, and provenance/accounting/reproducibility (e.g., via RO-Crates).

It will expand the existing use case portfolio with regard to compute federation, e.g., by providing support for different workflow engines/languages, compute backends and storage solutions, including integration with commercial/public cloud providers (“hybrid cloud”) and specialised workload distribution logics.

Finally, it will support use cases requiring service-to-service communication (e.g., distributed computing, federated learning, federated imputation), as well as additional use cases as requested by the user base.

Activity 2: Harden and expand ELIXIR Cloud SDK use case portfolio

The activity will deploy centralised and federated ELIXIR Cloud SDK service components across different ELIXIR Nodes, and deploy and maintain a service registry and a system to automatically upgrade service instances in the network. It will further develop and deploy web portal and CLI entry points that operationalise the supported ELIXIR Cloud use cases for end users.
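To show what a client of such a service registry might do, the sketch below selects deployed services by API type. The record layout (an `id`, a `type` with `group`, `artifact` and `version`, an `organization` and a `url`) follows our understanding of the GA4GH Service Registry and service-info conventions; the registry entries and the `find_services` helper are hypothetical.

```python
from typing import Any, Dict, List

Service = Dict[str, Any]

# Hypothetical registry content; a real client would fetch this list
# from the registry API instead of defining it inline.
REGISTRY: List[Service] = [
    {"id": "org.example.node-a.tes", "name": "Node A task execution",
     "type": {"group": "org.ga4gh", "artifact": "tes", "version": "1.1.0"},
     "organization": {"name": "Node A", "url": "https://node-a.example.org"},
     "url": "https://tes.node-a.example.org"},
    {"id": "org.example.node-b.tes", "name": "Node B task execution",
     "type": {"group": "org.ga4gh", "artifact": "tes", "version": "1.1.0"},
     "organization": {"name": "Node B", "url": "https://node-b.example.org"},
     "url": "https://tes.node-b.example.org"},
    {"id": "org.example.node-a.wes", "name": "Node A workflow execution",
     "type": {"group": "org.ga4gh", "artifact": "wes", "version": "1.0.0"},
     "organization": {"name": "Node A", "url": "https://node-a.example.org"},
     "url": "https://wes.node-a.example.org"},
]


def find_services(registry: List[Service], artifact: str) -> List[str]:
    """Return the base URLs of all registered services of one API type."""
    return sorted(s["url"] for s in registry if s["type"]["artifact"] == artifact)


print(find_services(REGISTRY, "tes"))
```

A web portal or CLI entry point could use such a lookup to decide which Nodes can receive a given task.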

Activity 3: Policies, documentation and outreach

The activity will promote ELIXIR Cloud and the ELIXIR Cloud SDK within ELIXIR and the global community, including industry partners, and provide extensive documentation for various audiences (end users, developers and administrators), including tutorials and interactive examples.

Lead partners: Forschungszentrum Jülich (Germany), Athena Research Center (Greece), VIB (Belgium)

To address the increasing complexity of life science analytics, bioinformaticians need access to advanced and large scale compute capacities, often in an international context. They need to be able to do so using sustainable compute resources which follow appropriate regulatory compliance, accounting, and provenance.

While aspects of this have been and are being addressed by the e-infrastructures in Europe, life scientists have additional demands for computing on sensitive data, which only increases the importance of appropriate audit trails that fit the relevant information governance and security requirements.

In this Task ELIXIR will identify best practices in our communities, Nodes and projects in the reporting of resources (facilities, services) used to lay the groundwork for billing or other financial accounting of services (within or between institutions, and eventually across borders). The Platform will identify automated, lightweight solutions which are FAIR in practice in order to track provenance. In turn this will improve the reproducibility of scientific analysis and the recognition of relevant service providers as well as researchers at the individual and organisational level.

The Compute Platform is directly involved in relevant technical developments in secure and distributed computing in the Nodes and can help address the integration of non-automated systems such as electronic notebooks.

Activity 1: Dashboard of ELIXIR Computing Providers

This activity will create a centralised place to publish data on which compute resources are available across ELIXIR, along with compute resource reporting summaries. Output will be in both machine-readable and human-friendly forms (i.e. visualisations that can be understood by non-specialists). It will be a living (continuously updated) successor to the one-off snapshots from EOSC-Life WP7 “Cloud Observatory Report”, ELIXIR EXCELERATE - D4.3, CONVERGE. It will link to the WP3.1 ELIXIR Cloud Service Registry, which is based on the GA4GH Service Registry API; an implementation is under development and currently deployed at the CH Node. A matching GUI client, implemented as a reusable Web Component, is also under development. Both could be used to power (part of) a dashboard, which may also include a “living map” of ELIXIR compute.
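The two output forms mentioned above, machine-readable and human-friendly, could be produced from the same provider records, as in this minimal sketch. The provider names, the capacity figures and the `summary` and `text_chart` helpers are invented for illustration and do not describe real ELIXIR resources.

```python
import json
from typing import Any, Dict, List

# Hypothetical provider records with made-up capacity figures.
PROVIDERS: List[Dict[str, Any]] = [
    {"node": "Node A", "vcpus": 2000, "storage_tb": 500},
    {"node": "Node B", "vcpus": 500, "storage_tb": 1200},
]


def summary(providers: List[Dict[str, Any]]) -> Dict[str, int]:
    """Machine-readable totals across all providers."""
    return {
        "providers": len(providers),
        "total_vcpus": sum(p["vcpus"] for p in providers),
        "total_storage_tb": sum(p["storage_tb"] for p in providers),
    }


def text_chart(providers: List[Dict[str, Any]], key: str, width: int = 20) -> List[str]:
    """Human-friendly bar chart, scaled to the largest provider."""
    top = max(p[key] for p in providers)
    lines = []
    for p in providers:
        bar = "#" * round(width * p[key] / top)
        lines.append(f"{p['node']:<8}{bar} {p[key]}")
    return lines


print(json.dumps(summary(PROVIDERS), indent=2))
print("\n".join(text_chart(PROVIDERS, "vcpus")))
```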

Activity 2: Provenance consumption and production by infrastructure

It is important for platforms executing computational research experiments to utilise RO-Crates, aiming for reproducibility and reusability. This packaging format allows all the components and dependencies of a research experiment to be bundled together, ensuring that it can be reproduced, re-executed, and reused in the future.

More specifically, reusability can be enhanced with the use of WorkflowRun profile RO-Crates that capture and encapsulate the entire research experiment, including data, code, parameters, and workflows. This allows the experiment to be accurately reused, so that other researchers can validate and build upon the results. Moreover, by packaging all the necessary components, including data sources, software versions, and configurations, RO-Crates preserve the context in which the experiment was conducted. This enhances transparency and enables others to understand the experimental setup and potentially identify any issues or improvements.
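As a rough sketch of what such packaging looks like, the code below builds a skeleton `ro-crate-metadata.json` that links a workflow file, a result file and the run that produced the result. It is based on our understanding of RO-Crate 1.1 and is not claimed to conform to the WorkflowRun profiles; properties such as description, licence and publication date are left out. The file names and the `minimal_crate` helper are hypothetical.

```python
import json
from typing import Any, Dict


def minimal_crate(name: str, workflow_file: str, output_file: str) -> Dict[str, Any]:
    """Build a skeleton RO-Crate metadata document describing one run."""
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            # Descriptor entity pointing at the root dataset.
            {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
             "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
             "about": {"@id": "./"}},
            # Root dataset listing the packaged files.
            {"@id": "./", "@type": "Dataset", "name": name,
             "hasPart": [{"@id": workflow_file}, {"@id": output_file}],
             "mentions": {"@id": "#run-1"}},
            {"@id": workflow_file, "@type": ["File", "SoftwareSourceCode"],
             "name": "Analysis workflow"},
            {"@id": output_file, "@type": "File", "name": "Result table"},
            # The run itself: which tool produced which result.
            {"@id": "#run-1", "@type": "CreateAction", "name": "Workflow run",
             "instrument": {"@id": workflow_file},
             "result": {"@id": output_file}},
        ],
    }


crate = minimal_crate("Example analysis", "workflow.cwl", "results.tsv")
print(json.dumps(crate, indent=2))
```

A platform consuming such a crate can follow the `instrument` and `result` links to find out which workflow produced which output.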

Additionally the use of RO-Crates can facilitate collaboration and sharing of research experiments. They provide a standardised format that can be easily shared among researchers, enabling them to reproduce, validate, and build upon each other's work. This promotes open science practices and fosters scientific advancements.

Use cases will come from MGnify; WP5.1; and the ELIXIR Communities.

Activity 3: Reporting of resource usage (accounting)

Systems that provide compute resources to researchers need to implement mechanisms for tracking the resource usage of each researcher and enforcing limits on that usage. In addition, the usage data kept (including consumed resources, user numbers, geolocation, or carbon footprint) can be a valuable resource to steer user behaviour towards greening, and needs to be reported to funders and to inform long-term policy-making. At the same time, such data can also be useful when shared with the research community at large and with society for transparency.

This work package will investigate the needs of the ELIXIR communities, Nodes, and projects regarding resource tracking and accounting/reporting and will explore existing mechanisms that have been implemented in various ELIXIR Nodes. Moreover, the work package will collect (where possible) feedback on the related challenges and expectations of the relevant stakeholders. Finally, the WP will develop a reference framework that can provide a standardised method for monitoring and reporting resource usage in research compute systems.
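The tracking and limit enforcement described in this activity could look like the following minimal sketch. The record layout (user, CPU hours), the quota value and the `usage_by_user` and `over_quota` helpers are assumptions for illustration and do not reflect any existing ELIXIR Node accounting system.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical usage records: (user, CPU hours consumed by one job).
RECORDS: List[Tuple[str, float]] = [
    ("alice", 10.0),
    ("bob", 4.0),
    ("alice", 15.5),
]


def usage_by_user(records: List[Tuple[str, float]]) -> Dict[str, float]:
    """Aggregate consumed CPU hours per researcher."""
    totals: Dict[str, float] = defaultdict(float)
    for user, cpu_hours in records:
        totals[user] += cpu_hours
    return dict(totals)


def over_quota(usage: Dict[str, float], quota: float) -> List[str]:
    """List researchers whose total usage exceeds the quota."""
    return sorted(user for user, total in usage.items() if total > quota)


usage = usage_by_user(RECORDS)
print(usage)
print(over_quota(usage, quota=20.0))
```

The same aggregated totals can feed reports to funders, while the quota check supports enforcement at job submission time.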