3D-BioInfo: Protein Language Models, Design and Disorder

Tue 7 May 2024, 17:00 CEST
This webinar is part of a series run by the ELIXIR 3D-BioInfo Community. There is a complete list of webinars here.  

The event is hosted by

Prof. Shoshana Wodak (Chair), VIB-VUB Center for Structural Biology

Dr. Gonzalo Parra, Barcelona Supercomputing Center (BSC)

Dr. Neeladri Sen, University College London (UCL)

Programme:

ProstT5: Bilingual Language Model for Protein Sequence and Structure

Dr. Michael Heinzinger (Technical University Munich, Germany)

Adapting large language models (LLMs) to protein sequences spawned the development of powerful protein language models (pLMs). Concurrently, AlphaFold2 broke through in protein structure prediction.
Now we can systematically and comprehensively explore the dual nature of proteins that act and exist as three-dimensional (3D) machines and evolve as linear strings of one-dimensional (1D) sequences.

Here, we leverage pLMs to simultaneously model both modalities by combining 1D sequences with 3D structure in a single model. We encode protein structures as token sequences using the 3Di-alphabet introduced by the 3D-alignment method Foldseek. This new foundation pLM extracts the features and patterns of the resulting “structure-sequence” representation. Toward this end, we built a non-redundant dataset from AlphaFoldDB and fine-tuned an existing pLM (ProtT5) to translate between 3Di and amino acid sequences.
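To make the "structure-sequence" representation concrete, here is a minimal sketch of how the two kinds of input string might be prepared before tokenization. It assumes the convention described in the publicly released ProstT5 model card (upper-case amino acids, lower-case 3Di letters, space-separated tokens, and the direction prefixes `<AA2fold>` and `<fold2AA>`); none of these details are stated in this abstract, and the helper name `format_for_prostt5` and the example sequences are hypothetical.

```python
import re


def format_for_prostt5(seq: str, is_3di: bool) -> str:
    """Build a space-separated, prefixed input string (assumed convention)."""
    if is_3di:
        # Assumption: 3Di structure tokens are written in lower case and
        # "<fold2AA>" requests translation from structure to amino acids.
        return "<fold2AA> " + " ".join(seq.lower())
    # Assumption: rare or ambiguous residues (U, Z, O, B) are mapped to X and
    # "<AA2fold>" requests translation from amino acids to 3Di.
    cleaned = re.sub(r"[UZOB]", "X", seq.upper())
    return "<AA2fold> " + " ".join(cleaned)


# Invented toy sequences, for illustration only.
aa_input = format_for_prostt5("PRTEINO", is_3di=False)
tdi_input = format_for_prostt5("DDVVLLC", is_3di=True)
```

Because both modalities end up as plain token strings, a single sequence-to-sequence model can be fine-tuned to translate in either direction.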

As a proof-of-concept for our novel approach, dubbed Protein structure-sequence T5 (ProstT5), we showed improved performance for subsequent prediction tasks, and for “inverse folding”, namely the generation of novel protein sequences adopting a given structural scaffold (“fold”). Our work showcased the potential of pLMs to tap into the information-rich protein structure revolution fueled by AlphaFold2. ProstT5 paves the way to develop new tools integrating the vast resource of 3D predictions, and opens new research avenues in the post-AlphaFold2 era. Our model is freely available for all.

Protein Embeddings Predict Binding Residues in Disordered Regions

Celine Marquet (PhD Student, Technical University Munich, Germany)

Identifying binding residues helps to explain a protein’s biological role, since protein function is often defined through binding to ligands such as other proteins, small molecules, ions, or nucleotides. Today’s methods for predicting binding residues often err for intrinsically disordered proteins or regions (IDPs/IDPRs).

Here, we presented a novel machine learning (ML) model trained to predict binding regions specifically in IDPRs. The proposed model, IDBindT5, leveraged embeddings from the protein language model (pLM) ProtT5 to reach a balanced accuracy of 57.2±3.6% (95% confidence interval). This was numerically slightly higher than the performance of the state-of-the-art (SOTA) methods ANCHOR2 (52.4±2.7%) and DeepDISOBind (56.9±5.6%) that rely on expert-crafted features and/or evolutionary information from multiple sequence alignments (MSAs).
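For readers unfamiliar with the metric, balanced accuracy is the mean of sensitivity and specificity, so a predictor that labels every residue as non-binding scores 50% regardless of class imbalance. The sketch below computes it for invented per-residue labels; it is only an illustration of the metric and makes no claim about how the authors evaluated IDBindT5 or derived the confidence intervals.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = binding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # fraction of binding residues recovered
    specificity = tn / (tn + fp)  # fraction of non-binding residues recovered
    return (sensitivity + specificity) / 2


# Toy labels invented for illustration: 2 binding and 6 non-binding residues.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 1, 0]
score = balanced_accuracy(y_true, y_pred)
```

In this toy case sensitivity is 1/2 and specificity is 5/6, giving a balanced accuracy of 2/3.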

IDBindT5 delivers its SOTA predictions much faster than other methods, easily enabling full-proteome analyses. Our findings emphasize the potential of pLMs as a promising approach for exploring and predicting features of disordered proteins. The model and a comprehensive manual are publicly available here.

You can find previous webinars from the 3D-BioInfo Community on the Community webinars page.

 

Speakers

  • Dr. Michael Heinzinger (Technical University Munich, Germany)
  • Celine Marquet (Technical University Munich, Germany)

Contacts

Event administration: Dr. Neeladri Sen (n.sen@ucl.ac.uk)

See also: 3D-BioInfo Community