Indexing and search support for single-cell cytometry data repositories

Single-cell flow and mass cytometry have become widely used methods for biological research in developmental biology, immunology, clinical oncology, and various other areas. In recent years, a multitude of algorithms for analyzing the collected samples has emerged, allowing vast simplifications and speed-ups in processing large samples and sample sets. Various research efforts (including clinical data collection and, for example, the International Mouse Phenotyping Consortium (IMPC)) have collected hundreds of thousands of annotated samples.

While it is beneficial to store the collected sample sets, e.g. for further reference, the possibilities for storing, accessing and searching the stored data are usually limited (most often to searching the manually extracted metadata). Recent advances in indexing of multidimensional databases (CZ Node, mainly concerning cheminformatics), in the speed of automated cytometry sample processing (LU Node, GigaSOM.jl), and in visualization and interpretation of cytometry data (CZ Node, EmbedSOM) create a new combination of techniques that allows efficient indexing of the actual single-cell distributions, including individual populations and their properties and relations.

The main aim of the project is to implement and document the software for this kind of search. The proposed system should be able to provide quick answers to questions such as “Was this phenotype ever observed before?” and “Is there any historical metadata available from samples with a similar distribution of cells in a given population?” even in very large datasets, thus exposing dataset properties that were not easily searchable before and making more specific data easy to find. The software created in the project will provide a baseline for standardization of searchable multi-sample cytometry databases, allowing improvements in the interoperability of published datasets.
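
To make the idea of searching by cell distribution more concrete, the following is a minimal sketch and not the project's actual design. It assumes that each sample has already been reduced to per-cell cluster labels (for example by a self-organizing map), summarizes every sample as a normalized histogram over clusters, and answers a "which stored samples look like this one?" query by brute-force comparison using total variation distance. The function names, the choice of distance, and the linear scan are illustrative assumptions; a real system would replace the scan with a proper multidimensional index.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def cluster_histogram(assignments: Sequence[int], n_clusters: int) -> List[float]:
    """Turn per-cell cluster labels into fractions of cells per cluster."""
    if not assignments:
        raise ValueError("sample contains no cells")
    if any(label < 0 or label >= n_clusters for label in assignments):
        raise ValueError("cluster label out of range")
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(c, 0) / total for c in range(n_clusters)]


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two histograms of equal length."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of clusters")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def most_similar(
    query: Sequence[float],
    index: Dict[str, List[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k stored samples closest to the query (linear scan)."""
    scored = [(name, total_variation(query, hist)) for name, hist in index.items()]
    scored.sort(key=lambda item: item[1])
    return scored[:k]


if __name__ == "__main__":
    # Hypothetical toy data: three clusters, two stored samples.
    stored = {
        "sample_a": cluster_histogram([0, 0, 1, 2], 3),
        "sample_b": cluster_histogram([2, 2, 2, 1], 3),
    }
    query = cluster_histogram([0, 0, 1, 1], 3)
    for name, distance in most_similar(query, stored, k=2):
        print(name, round(distance, 3))
```

In this sketch the metadata of the closest stored samples would then be looked up by sample name, which corresponds to the second example question above.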

As the main outcome, the community will be able to use the new system to improve the findability of information in existing datasets, e.g. from mass phenotyping efforts (such as IMPC) or large-scale standardized clinical screening.


1 February 2020 to 31 May 2020

Nodes involved: 

People involved: 

Jiri Vondrasek
Christophe Trefois
Miroslav Kratochvíl