E-PAN: Enhancing pan-genome analysis in plants

With the declining cost of genome sequencing, the focus of plant researchers is shifting towards characterising the wide genomic diversity present within a species. Crop pan-genomes are built by sequencing, comparing and integrating multiple genomes from the same agriculturally important species, such as wheat, rice and potato. Exploiting the information encoded within these pan-genomes can lead to the development of new cultivars that are more resilient to upcoming challenges such as increased drought and heat stress.

Multiple consortia are independently generating and integrating these pan-genomes, but there has so far been little progress in streamlining and harmonising these efforts. Sequence quality is no longer a major issue, but the completeness of the assembly and of the subsequent gene annotation is much harder to quantify correctly, yet these are the major drivers in explaining the adaptive differences between genotypes. Where efforts exist to visualise and browse pan-genomes, for example using graph representations, it is still not easy to retrieve gene Presence Absence Variation information or structural rearrangements, which hampers knowledge extraction.

E-PAN aims to streamline the efforts of different research groups within the ELIXIR Plant Science Community. This encompasses the development of effective standards, computational pipelines and tutorials to assess the quality of pan-genomes and provide solutions to identified problems. We will also evaluate and integrate different approaches for data visualisation and browsing, which will be used by the partners that share pan-genomics results. A one-day meeting and an online workshop will be organised to disseminate results and initiate new collaborative projects. These concerted efforts will lead to a standardised approach for future pan-genome projects, less duplication of effort across consortia, and a set of tools to visualise and mine pan-genomics results.

Project objectives

The adaptive differences between genotypes of the same species can only be explained by exploring genomic diversity through pan-genomics. The sequencing and assembly of pan-genomes no longer pose significant scientific challenges, but the subsequent data integration and exploration are hampered by a lack of standards and tools. For example, a simple comparison of the sequences of 44 potato genomes would not lead to any significant insights, whereas the investigation of gene Presence Absence Variation (PAV) and structural rearrangements highlighted loci of interest for linking genotypic with phenotypic diversity.

The global objective of the E-PAN project is to accelerate the use of pan-genome data through advances in data quality control, curation, integration and visualisation. This objective is divided into three axes, which can be pursued in parallel:

Methods for ensuring high-quality pan-genome data. Quality control (QC) and standardisation are needed to ascertain that PAVs are biological and not technical artefacts. Using BUSCO alone to estimate the genome and gene space completeness of pan-genomes will miss nearly all intra-species differentiation. New standards and methodologies, benchmarked against real data from a phylogenetically broad selection of species, are therefore required. This will result in new guidelines and tutorials for pan-genome QC.

Computation and standardised reporting of gene-based PAVs and structural variants. In addition to implementing efficient and scalable algorithms to quantify different types of structural variants, we will share the results in a standardised manner between the different plant resources, giving better access to pan-genomics results for diverse plant species.

Visualisation and integration of pan-genomes and derived results. Current approaches to visualising pan-genomes do not readily yield interpretable results when combined with the current set of data integration tools. We therefore aim to evaluate and prototype a set of visualisation modules that can be (re)used in multiple plant databases. These will connect complementary resources and enhance the extraction of biological knowledge for key agricultural species.

These objectives will be addressed in multiple work packages, drawing on the expertise of all partners and on their numerous interactions with stakeholders involved in data generation, tool development or adaptation, and the biological interpretation of plant genome information.

Project outcomes

Quality control standardisation

The E-PAN project will create standards for the Quality Control (QC) of pan-genomes, and provide reference implementations that adhere to these standards.
Additional standardisation proposals for working with pan-genomes will also be provided, such

Pan-genomics PAV characterisation and visualisation

The E-PAN project will develop and implement algorithms for the characterisation of pan-genome dynamics at the gene level. As a direct follow-up to the detection of PAVs, visualisation templates will be provided so that researchers can inspect the PAVs further.
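As an illustration of what gene-level PAV characterisation can involve, the following minimal Python sketch labels gene families as core, dispensable or private from a presence/absence matrix. The genotype names, family identifiers, category labels and the `classify_pav` helper are hypothetical and chosen for illustration only; they are not E-PAN standards, algorithms or results.

```python
from collections import Counter


def classify_pav(pav):
    """Label each gene family from a presence/absence matrix.

    `pav` maps a gene family identifier to a dict of
    {genotype name: True if the family is present, False otherwise}.
    The category names used here are illustrative assumptions.
    """
    labels = {}
    for family, presence in pav.items():
        n_present = sum(presence.values())
        n_total = len(presence)
        if n_present == 0:
            labels[family] = "absent"       # found in no genotype
        elif n_present == n_total:
            labels[family] = "core"         # found in every genotype
        elif n_present == 1:
            labels[family] = "private"      # found in exactly one genotype
        else:
            labels[family] = "dispensable"  # found in some, but not all
    return labels


# Hypothetical toy matrix: three gene families across three genotypes.
pav_matrix = {
    "famA": {"G1": True, "G2": True, "G3": True},
    "famB": {"G1": True, "G2": False, "G3": True},
    "famC": {"G1": False, "G2": False, "G3": True},
}

labels = classify_pav(pav_matrix)
summary = Counter(labels.values())

for family, label in labels.items():
    print(family, label)
```

A table of this kind is only as reliable as the underlying assemblies and annotations, which is why an absence call should be checked by QC before it is interpreted as biological.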

Sharing pan-genome information

The development of standards for data exchange, and the prototyping of the corresponding data formats, will improve pan-genome research, as results and data can then be shared and interpreted more easily across multiple platforms.

Co-leads

Klaas Vandepoele, Sebastian Beier, Uwe Scholz, Keywan Hassani-Pak

Duration: 2024 to 2027