Theme: Federated Data Analysis
Federated analysis (FA) is transforming genomics research by enabling collaborative computation across distributed datasets, all while preserving data privacy. It supports comprehensive insight generation without centralising sensitive data – a crucial advancement in genomic medicine. Federated access and analysis of human datasets is a key component of the ELIXIR Scientific Programme. ELIXIR is actively involved in several major initiatives, including the European Cancer Imaging Initiative (EUCAIM) and coordination of the European Genomic Data Infrastructure (GDI) project, which aims to facilitate federated access to over one million whole genome sequences (WGS). While GDI is exploring federated solutions for data analysis, it does not plan to deploy full FA systems for evaluation at this stage.
In genomics, many well-established algorithms assume centralised data environments and are not directly compatible with federated architectures. From an infrastructure perspective, FA introduces technical and organisational challenges. These include deploying and maintaining new software stacks, ensuring data interoperability across sites, and securely isolating compute environments capable of executing remote algorithms.
Project objectives and implementation
This project will implement federated analysis across four ELIXIR Nodes, using both synthetic and publicly accessible Genome-Wide Association Studies (GWAS) datasets. It will leverage the EUCAIM orchestration framework, Flower, to conduct real federated computations on synthetic genomic data. Through this work, we aim to:
- Develop technical capacity at each Node for managing FA infrastructure (e.g. server configuration, security, software stack maintenance)
- Investigate data skewness and its effects on result reliability and generalisability
- Compare federated and centralised analysis outputs to assess statistical power, consistency and potential bias
- Identify practical deployment barriers that only surface through real-world testing
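As a minimal sketch of the federated-versus-centralised comparison listed above, the following Python example simulates three hypothetical sites that differ in size and allele frequency. It estimates a single-variant effect in two ways: from aggregate statistics shared by each site, and from the pooled individual-level data. The simulation model, the site sizes and all function names are assumptions made for illustration. The example does not use Flower and is not the project's pipeline.

```python
import random
from dataclasses import dataclass


@dataclass
class SiteSummary:
    """Aggregate statistics a site shares instead of individual-level rows."""
    n: int
    sx: float
    sy: float
    sxx: float
    sxy: float


def simulate_site(rng, n, allele_freq, effect=0.5):
    """Toy model: genotype dosage (0/1/2) and a quantitative trait."""
    xs = [int(rng.random() < allele_freq) + int(rng.random() < allele_freq)
          for _ in range(n)]
    ys = [effect * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def summarise(xs, ys):
    """Compute the sums needed for a simple linear regression of y on x."""
    return SiteSummary(
        n=len(xs),
        sx=sum(xs),
        sy=sum(ys),
        sxx=sum(x * x for x in xs),
        sxy=sum(x * y for x, y in zip(xs, ys)),
    )


def combine(summaries):
    """Central aggregation step: add up the per-site sums."""
    return SiteSummary(
        n=sum(s.n for s in summaries),
        sx=sum(s.sx for s in summaries),
        sy=sum(s.sy for s in summaries),
        sxx=sum(s.sxx for s in summaries),
        sxy=sum(s.sxy for s in summaries),
    )


def slope(s):
    """Ordinary least-squares slope computed from summary sums."""
    return (s.n * s.sxy - s.sx * s.sy) / (s.n * s.sxx - s.sx ** 2)


rng = random.Random(42)
# Hypothetical, deliberately skewed sites: (sample size, allele frequency).
site_specs = [(200, 0.10), (500, 0.30), (1000, 0.50)]
sites = [simulate_site(rng, n, freq) for n, freq in site_specs]

# Federated route: only summaries leave each site.
site_summaries = [summarise(xs, ys) for xs, ys in sites]
federated = slope(combine(site_summaries))

# Centralised route: pool all individual-level records (for comparison only).
pooled_x = [x for xs, _ in sites for x in xs]
pooled_y = [y for _, ys in sites for y in ys]
centralised = slope(summarise(pooled_x, pooled_y))

# Naive alternative: unweighted mean of per-site slopes, sensitive to skew.
naive = sum(slope(s) for s in site_summaries) / len(site_summaries)

print(f"federated={federated:.6f} centralised={centralised:.6f} naive={naive:.6f}")
```

In this toy setting the summed statistics reproduce the pooled estimate up to floating-point error, whereas the unweighted mean of per-site estimates can drift when sites differ in size or allele frequency. Real GWAS models with covariates or iterative fitting will generally not aggregate this neatly, which is part of what the comparison is meant to assess.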
Improving provenance and compliance
Beyond infrastructure, the project contributes to improving provenance tracking in FA settings. We will adapt the RO-Crate standard to generate provenance packages at each participating site. These will capture both machine- and human-readable metadata about local data, tools, and processes. RO-Crates will then be aggregated, either automatically or via secure manual upload, into a central provenance dashboard. This supports traceability and helps demonstrate compliance with regulatory frameworks such as GDPR.
The RO-Crate packages will reflect the Five Safes framework (safe data, projects, people, settings, outputs) within federated settings that may include varied infrastructure configurations. This approach also supports the FAIR principles by documenting each computational step in a standardised, shareable format. These efforts lay essential groundwork for building scalable, trustworthy federated infrastructures across national and international networks.
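To illustrate what a per-site provenance package might contain, here is a minimal sketch that builds an RO-Crate 1.1 metadata document (the content of `ro-crate-metadata.json`) using only the Python standard library. The site name, tool name, file name, identifiers and licence are hypothetical placeholders. The sketch follows the base RO-Crate 1.1 structure and does not implement any Five Safes-specific profile or the adaptations the project plans to make.

```python
import json


def build_site_crate(site_name, tool_name, tool_version, run_date):
    """Return a minimal RO-Crate 1.1 metadata document as a dict.

    All names and identifiers are illustrative placeholders.
    """
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": f"Federated analysis provenance for {site_name}",
        "description": "Provenance of one local analysis step (toy example).",
        "datePublished": run_date,
        # Placeholder licence; a real crate would state the actual terms.
        "license": {"@id": "https://creativecommons.org/licenses/by/4.0/"},
        "hasPart": [{"@id": "summary_stats.json"}],
        "mentions": [{"@id": "#run-1"}],
    }
    # The local cohort stays on site, so it is described but not packaged.
    local_data = {
        "@id": "#local-cohort",
        "@type": "Dataset",
        "name": f"Local cohort held at {site_name} (not included in the crate)",
    }
    tool = {
        "@id": "#tool",
        "@type": "SoftwareApplication",
        "name": tool_name,
        "version": tool_version,
    }
    output = {
        "@id": "summary_stats.json",
        "@type": "File",
        "name": "Aggregate statistics shared with the coordinator",
        "encodingFormat": "application/json",
    }
    # A CreateAction links the tool, its input and its output.
    run = {
        "@id": "#run-1",
        "@type": "CreateAction",
        "name": "Local summary-statistics computation",
        "endTime": run_date,
        "instrument": {"@id": "#tool"},
        "object": [{"@id": "#local-cohort"}],
        "result": [{"@id": "summary_stats.json"}],
    }
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root, local_data, tool, output, run],
    }


crate = build_site_crate("example-node", "toy-gwas", "0.1.0", "2025-01-01")
metadata_json = json.dumps(crate, indent=2)
print(metadata_json)
```

Because each site would emit the same JSON-LD structure, a central dashboard could in principle index the crates by their `CreateAction` entries without needing access to the underlying data.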
Broader context and collaborations
This project builds on existing collaborations to deploy and test FA solutions across sensitive data projects including GDI, EUCAIM, BY-COVID and TRE-FX. It also aligns with the BRIDGE staff exchange between ELIXIR and the NIH’s Division of Cancer Epidemiology and Genetics (DCEG), where the Flower framework has been selected for collaborative work.
We aim to reinforce this partnership by using Flower to evaluate and compare FA frameworks and by sharing relevant datasets. All participating Nodes are active members of the ELIXIR Human Data Communities, particularly the Federated Human Data and Cancer Data communities. Results and insights from this project will be shared broadly across these communities and other ELIXIR initiatives where federated analysis is relevant.
Co-leads
Dilza Campos, Carles Hernandez