
ProtST: Intel and Mila Collaborate on a New Multi-Modal Protein Language Model

Santiago_Miret

Santiago Miret is an AI research scientist at Intel Labs, where he focuses on developing artificial intelligence (AI) solutions and exploring the intersection of AI and the physical sciences.

 

Highlights:

  • Intel and Mila collaborated on ProtST, a multi-modal protein language model that enables users to create protein sequences and structures based on natural language prompts.
  • The ProtST research paper was presented in a prestigious spotlight talk at this year’s International Conference on Machine Learning (ICML 2023).
  • Using ProtST’s natural language interface, a user can simply input the description of a protein, and ProtST will generate the corresponding structures.

 

An ongoing research collaboration between Intel and the Mila - Quebec AI Institute, aimed at using advanced artificial intelligence (AI) to solve some of the world’s most critical and challenging issues, has produced ProtST, a first-of-its-kind multi-modal protein language model (PLM) that enables users to create protein sequences and structures from natural language prompts. This generative AI protein language model can aid the understanding and discovery of novel protein-based medicines, such as antibody-based drugs that directly target malignant organisms.

The ProtST research paper from Intel Labs and Mila’s Jian Tang group was presented in a prestigious spotlight talk at this year’s International Conference on Machine Learning (ICML 2023). ProtST aligns the representation of protein sequences with relevant natural language descriptions to train a multi-modal deep learning model.

While recent studies have shown that PLMs trained on protein sequences hold great promise for predicting protein structures and functionality, these models cannot explicitly capture protein functions and other important properties, such as subcellular location. To reach the end goal of biomedical-text-enhanced protein sequence representation learning, ProtST is trained on ProtDescribe, a paired dataset that augments protein sequences with text descriptions of protein functions and other important properties.

 

Natural Language Interface


 

Figure 1: ProtST designs proteins based on a textual prompt describing binding to ATP (shown on top), a major component of human muscles.

 

Figure 1 shows how a user can simply input the description of a protein, and ProtST will generate the corresponding structures. For example, by entering adenosine triphosphate (ATP) in a text prompt, ProtST will generate protein structures that bind to ATP, an energy-carrying molecule found in the cells of all living things. ATP deficiency has been linked to various body dysfunctions, including diabetes and liver malfunctions, and binding a protein to ATP could help regulate the activity of ATP in the body to mitigate those effects. In addition to introducing a fundamentally new modeling interface for researchers and practitioners in protein design, the ProtST model also provides a significant performance boost for many protein modeling tasks, such as predicting protein properties in various settings.
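To make the interface concrete, the sketch below shows one way a text prompt could be matched against candidate protein sequences using a pair of aligned encoders, which is the core idea behind this kind of text-conditioned design. The `text_encoder` and `protein_encoder` callables are hypothetical stand-ins for ProtST’s aligned text and protein encoders, not the released API.

```python
# Minimal sketch (not the official ProtST API): rank candidate protein
# sequences by how well they match a free-text description. The encoder
# callables are assumed to return one embedding per input string.
import torch
import torch.nn.functional as F

def rank_candidates(prompt, candidate_sequences, text_encoder, protein_encoder):
    """Return (sequence, score) pairs sorted by similarity to the text prompt."""
    with torch.no_grad():
        t = F.normalize(text_encoder([prompt]), dim=-1)                # (1, d)
        p = F.normalize(protein_encoder(candidate_sequences), dim=-1)  # (N, d)
    scores = (p @ t.T).squeeze(-1)                                     # cosine similarities
    order = torch.argsort(scores, descending=True)
    return [(candidate_sequences[int(i)], float(scores[i])) for i in order]

# Example prompt, mirroring Figure 1:
# ranked = rank_candidates("A protein that binds ATP", sequences, text_enc, prot_enc)
```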

 

Multi-Modal Training Objectives


Figure 2: ProtST model training method using protein sequence information (orange) and text descriptions (blue). Based on the data, ProtST is trained using a set of multi-modal training objectives.

 

The success of ProtST rests on jointly learning protein sequence representations and natural language in a unified manner. To do so, we constructed and jointly released the ProtDescribe dataset, which contains more than 500,000 high-quality protein sequences aligned with text-based descriptions. With ProtDescribe, we can train the ProtST model using a variety of self-supervised learning techniques that make use of the data without requiring specific labels.
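As an illustration of what paired data means here, the short sketch below exposes sequence-description pairs to a training loop. The file layout and column names are assumptions for illustration, not the released ProtDescribe schema.

```python
# Minimal sketch of a paired sequence/description dataset. The TSV layout and
# column names ("sequence", "description") are illustrative assumptions.
import csv
from torch.utils.data import Dataset

class SequenceTextPairs(Dataset):
    """Yields (protein_sequence, text_description) pairs from a TSV file."""
    def __init__(self, path):
        with open(path, newline="") as f:
            reader = csv.DictReader(f, delimiter="\t")
            self.rows = [(row["sequence"], row["description"]) for row in reader]

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        return self.rows[idx]
```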

As shown in Figure 2, this process involves three distinct learning objectives:

  1. Masked protein modeling (MPM): This training objective encourages the ProtST model to correctly fill in the blanks of masked protein sequence data in the ProtDescribe dataset.
  2. Multi-modal mask prediction (MMP): This training objective encourages the ProtST model to fill in the blanks of protein sequences and their text descriptions at the same time.
  3. Global contrastive (GC): This training objective ensures that the ProtST model matches protein sequences with their correct textual descriptions. Contrastive learning has shown tremendous success in enabling image generation from textual descriptions in the computer vision domain. A minimal sketch of this objective is shown below.
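The global contrastive objective is the piece that most directly aligns the two modalities. The sketch below shows a symmetric InfoNCE-style loss of the kind popularized by CLIP: each protein embedding is pulled toward the embedding of its own description and pushed away from the other descriptions in the batch. The exact loss formulation in ProtST may differ in its details.

```python
# Minimal sketch of a global contrastive objective over paired embeddings.
# protein_emb and text_emb are (batch, dim) tensors where row i of each
# tensor comes from the same protein/description pair.
import torch
import torch.nn.functional as F

def global_contrastive_loss(protein_emb, text_emb, temperature=0.07):
    p = F.normalize(protein_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = p @ t.T / temperature                        # (batch, batch) similarities
    targets = torch.arange(p.size(0), device=p.device)    # matching pairs lie on the diagonal
    # Symmetric cross-entropy: protein-to-text and text-to-protein directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```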

In addition to the training objectives, Figure 2 shows how ProtST builds on prior work by applying pre-trained protein language models and biomedical language models. These pre-trained models have already shown success on prior tasks and contain implicit knowledge about protein sequences and biomedical text. Rather than training from scratch, ProtST makes use of this implicit knowledge and improves upon it by leveraging the aligned data in the ProtDescribe dataset. In many ways, the effective fusion of the two data modalities is what makes ProtST such a powerful tool. The results in the paper show that the multi-modal ProtST outperforms state-of-the-art protein language models across many protein modeling tasks, demonstrating the benefit of training on data with aligned text descriptions.
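As a rough illustration of this initialization strategy, the sketch below wires a pre-trained protein encoder and a pre-trained biomedical text encoder into a shared embedding space through small projection heads. The Hugging Face checkpoints named here (ProtBert and PubMedBERT) are publicly available examples of the model families the work builds on; they are not the trained ProtST weights.

```python
# Minimal sketch of combining pre-trained encoders, in the spirit of Figure 2.
# The checkpoint IDs are illustrative; the released ProtST models may differ.
import torch.nn as nn
from transformers import AutoModel

class ProteinTextAligner(nn.Module):
    def __init__(self, shared_dim=512):
        super().__init__()
        self.protein_encoder = AutoModel.from_pretrained("Rostlab/prot_bert")
        self.text_encoder = AutoModel.from_pretrained(
            "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")
        # Small projection heads map both encoders into one shared space.
        self.protein_proj = nn.Linear(self.protein_encoder.config.hidden_size, shared_dim)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, shared_dim)

    def forward(self, protein_inputs, text_inputs):
        # Use the first-token hidden state as a whole-sequence summary.
        p = self.protein_encoder(**protein_inputs).last_hidden_state[:, 0]
        t = self.text_encoder(**text_inputs).last_hidden_state[:, 0]
        return self.protein_proj(p), self.text_proj(t)
```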

 

Experimental Results

 


Figure 3: Benchmark results on protein localization and property prediction, taken from the ProtST paper. ProtST models consistently outperform other methods across multiple tasks, as shown in blue highlights (darker blue indicates better performance).

 

ProtST-based PLMs clearly outperform the vanilla PLMs. ProtST-ProtBert outperforms the vanilla ProtBert on 21 out of 24 benchmark metrics, spanning both fix-encoder learning (where the pre-trained encoder is frozen) and full-model training. ProtST-ESM-1b surpasses the vanilla ESM-1b on 22 out of 24 benchmark metrics, and ProtST-ESM-2 outperforms the vanilla ESM-2 on all 24. The results demonstrate that ProtST pre-training is generally beneficial across different PLMs, boosting their performance on diverse downstream tasks in both supervised and zero-shot settings.
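For readers unfamiliar with the two evaluation regimes mentioned above, the sketch below (an assumption about the setup, not the paper’s code) shows the difference: fix-encoder learning freezes the pre-trained encoder and trains only a small task head, while full-model training updates every parameter.

```python
# Minimal sketch of fix-encoder learning vs. full-model training.
import torch

def build_optimizer(encoder, head, fix_encoder=True, lr=1e-4):
    if fix_encoder:
        for param in encoder.parameters():
            param.requires_grad = False              # pre-trained encoder stays frozen
        trainable = list(head.parameters())          # only the task head is updated
    else:
        trainable = list(encoder.parameters()) + list(head.parameters())
    return torch.optim.Adam(trainable, lr=lr)
```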

 

 


Figure 4: Benchmark results on protein function annotation tasks, taken from the ProtST paper. ProtST models consistently outperform other methods across multiple tasks, as shown in blue highlights (darker blue indicates better performance).

 

ProtST-ESM-1b performs best on fitness prediction while ProtST-ESM-2 performs best on localization prediction and function annotation, which demonstrates their potential as new state-of-the-art PLMs.
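One practical consequence of aligning protein and text representations is zero-shot prediction: a property such as subcellular localization can be predicted by comparing a protein’s embedding against embeddings of short text descriptions of each candidate label, with no task-specific training. The sketch below illustrates the idea; the encoders and label phrasings are assumptions, not the paper’s exact setup.

```python
# Minimal sketch of zero-shot classification with aligned encoders: the
# predicted label is the one whose text description is most similar to the
# protein's embedding.
import torch
import torch.nn.functional as F

def zero_shot_classify(sequence, label_descriptions, protein_encoder, text_encoder):
    with torch.no_grad():
        p = F.normalize(protein_encoder([sequence]), dim=-1)        # (1, d)
        t = F.normalize(text_encoder(label_descriptions), dim=-1)   # (K, d)
    scores = (p @ t.T).squeeze(0)                                   # similarity to each label
    return label_descriptions[int(scores.argmax())]

# Example labels for subcellular localization:
# labels = ["A protein located in the nucleus.",
#           "A protein located in the mitochondrion."]
```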

While ProtST offers a new way to integrate concepts from many fields to solve a difficult scientific problem, there is still more work to do in applying AI to protein design. In addition to continuing to improve protein models’ predictive capabilities, including property prediction and protein folding, researchers at Mila and Intel Labs are exploring new ways to apply AI to the design of completely new protein structures.
