Scalable extraction of human genetic and phenotypic data from peer-reviewed literature (2022-23)

Detecting associations between genetic variants and disease traits in population samples deepens our understanding of disease aetiology and so advances health research. The GWAS Central and DisGeNET data services provide extensive gene/variant-phenotype/disease associations. However, a lack of tools and resources for text mining comprehensive data sources prevents the scalable import of association data: import currently relies either on text mining of abstracts alone or on manual curation.

This project will extend and integrate the participants’ existing text mining tools to provide a reusable workflow for extracting human genotype-phenotype associations from the full texts, tables and supplementary materials of scientific literature. These data will be imported into GWAS Central and DisGeNET, accelerating FAIR access to pioneering findings such as those from COVID-19 GWAS. The project will also develop an annotated GWAS corpus based on full-text articles, which will enable the evaluation of existing and future text mining methodologies for extracting genotype-phenotype associations and their metadata.