
\section*{A.1.8.3. Residue Annotations Track}

Residue annotations label a protein's functional residues with a vocabulary of 1,474 multi-hot labels emitted by InterProScan. To gather this data, we run InterProScan with the SFLD, CDD, and PIR databases on all cluster members in our UniRef and MGnify datasets (clustered at 90% sequence identity). We take all unique residue annotation descriptions that occur in more than $1\mathrm{k}$ proteins across all of UniRef90 and MGnify90, and deduplicate labels by punctuation and case insensitivity. We join these annotations into our UniRef, MGnify, AFDB, and ESMAtlas datasets for training.

As introduced in Appendix A.1.5.1, ESM3 has a track dedicated to processing residue annotations that supports input conditioning, and an output head for generation. The residue annotation labels for a protein are tokenized into a sequence of token-sets equal in length to the protein. At each position there is an unordered set of tokens representing the residue annotations present at that position. The tokens are input to ESM3 first through an embedding lookup, followed by a sum over embeddings. The permutation invariance of the sum ensures that the labels are presented to the model as an unordered set. The per-position embedding sums are then added onto the per-position sequence embedding before input into the first transformer block. Positions with no residue annotations are represented by a token whose embedding is fixed to zeros. The residue annotations track has an output head that outputs a set of binary classification logits, predicting for each position the presence or absence of each residue annotation in the vocabulary. We apply a masking procedure to partially or fully mask residue annotation labels, and train the output head with a binary cross-entropy loss to reconstruct the full residue annotations. In pre-training, with $90\%$ probability all residue annotations are masked; otherwise we independently sample positions to mask with a square root schedule. The head is trained to predict the presence of any residue annotation label that was masked.
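The token-set embedding described above can be sketched in a few lines. This is a minimal illustration with made-up dimensions and token ids, not the actual ESM3 implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 1474          # residue annotation vocabulary size
D = 8                 # embedding dimension (tiny, for illustration)
L = 5                 # protein length

# Embedding table; index 0 is the "no annotation" token, fixed to zeros.
emb = rng.normal(size=(VOCAB + 1, D))
emb[0] = 0.0

# Per-position unordered sets of annotation token ids (0 = none).
annotations = [{3, 17}, {0}, {42}, {0}, {3, 99, 512}]

# Sum embeddings within each position's set: summation is permutation
# invariant, so the labels behave as an unordered set.
annot_emb = np.stack([emb[list(s)].sum(axis=0) for s in annotations])

# Added onto the per-position sequence embedding before the first block.
seq_emb = rng.normal(size=(L, D))
x = seq_emb + annot_emb
print(x.shape)  # (5, 8)
```

Note that unannotated positions contribute exactly zero to the sum, so they leave the sequence embedding unchanged.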

Please explain this in detail to a beginning grad student in machine learning.

Response: Understanding the Residue Annotations Track in ESM3

Introduction

In protein biology, understanding the function of individual amino acid residues within a protein is crucial. Some residues may be part of an active site, binding domain, or other functional regions. Annotating these residues helps researchers understand protein function and interactions.

In machine learning, especially with models like ESM3 (a transformer-based model for proteins), incorporating residue annotations can improve the model's understanding and predictions about proteins. This explanation will break down how residue annotations are collected, processed, and integrated into the ESM3 model for training.


1. What Are Residue Annotations?

Residue annotations are labels assigned to specific amino acid residues within a protein to indicate their functional roles. For example, a residue might be labeled as part of an enzyme's active site or a binding site for a particular molecule.

In the context of ESM3, we have a vocabulary of 1,474 unique annotation labels, and each residue can carry multiple labels at once (a multi-hot representation).


2. Data Collection for Residue Annotations

To gather data for training, the following steps are taken:

a. Data Sources: the UniRef and MGnify protein datasets, clustered at 90% sequence identity (UniRef90 and MGnify90).

b. Annotation Tool: InterProScan, which scans protein sequences against member databases of functional signatures.

c. Databases Used with InterProScan: SFLD, CDD, and PIR.

d. Process:

  1. Run InterProScan on all cluster members in the UniRef90 and MGnify90 datasets using the above databases.
  2. Collect all unique residue annotation descriptions that occur in more than 1,000 proteins across both datasets.
  3. Deduplicate labels by treating punctuation and case variants as the same label.
  4. Result: a filtered and deduplicated vocabulary of 1,474 residue annotation labels.

e. Integration into Datasets: the resulting annotations are joined into the UniRef, MGnify, AFDB, and ESMAtlas datasets used for training.
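The deduplication step might look like the following sketch. The normalization rule here (lowercasing plus stripping punctuation) is an assumption; the paper only states that labels are deduplicated by punctuation and case insensitivity:

```python
import string
from collections import Counter

def normalize(label: str) -> str:
    """Collapse case and strip punctuation so label variants share one key."""
    table = str.maketrans("", "", string.punctuation)
    return label.lower().translate(table)

def build_vocab(annotation_counts: Counter, min_proteins: int = 1000) -> dict:
    """Merge counts of normalized variants, then filter by frequency."""
    representative = {}
    totals = Counter()
    for label, count in annotation_counts.items():
        key = normalize(label)
        totals[key] += count
        representative.setdefault(key, label)  # first-seen spelling wins
    return {representative[k]: n for k, n in totals.items() if n > min_proteins}

counts = Counter({"Active site.": 1200, "active site": 300, "Binding site": 900})
print(build_vocab(counts))  # {'Active site.': 1500}
```

"Active site." and "active site" merge into one label whose combined count clears the 1,000-protein threshold, while "Binding site" is filtered out.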


3. Incorporating Residue Annotations into ESM3

ESM3 is an advanced protein language model that uses transformers, similar to models like BERT in natural language processing.

a. Structure of the Input: for a protein of length L, the annotation input is a length-L sequence in which each position holds an unordered (possibly empty) set of annotation tokens.

b. Tokenization and Embedding:

  1. Tokenization: each annotation label present at a position is mapped to a token from the 1,474-label vocabulary.
  2. Embedding Look-Up: each token is converted to a learned embedding vector.
  3. Summing Embeddings: the embeddings of all tokens at a position are summed; because summation is permutation invariant, the labels are treated as an unordered set.
  4. Combining with Residue Embeddings: the per-position annotation sum is added onto the per-position amino acid sequence embedding.
  5. Zero Embedding for Unannotated Positions: positions with no annotations are represented by a special token whose embedding is fixed to zeros, so they contribute nothing to the sum.

c. Feeding into the Transformer: the combined per-position embeddings are passed into the first transformer block, so every subsequent layer conditions jointly on the sequence and its annotations.


4. Residue Annotations Output Head and Training

a. Output Head for Annotations: a dedicated output head produces, for each position, one binary classification logit per label in the 1,474-label vocabulary, predicting the presence or absence of that annotation.

b. Masking Strategy for Training:

To train the model to predict annotations, a masking strategy is employed:

  1. Purpose of Masking: by hiding annotations and asking the model to recover them, training forces the model to infer function from the sequence (and other tracks) rather than simply copy its input.
  2. Masking Procedure: with 90% probability, all residue annotations in a protein are masked; otherwise, positions are masked independently with a rate drawn from a square root schedule.
  3. Square Root Schedule Explained: the mask rate is sampled under a schedule that biases training toward higher mask rates (e.g., drawing the rate as $\sqrt{u}$ for $u \sim \mathrm{Uniform}(0,1)$), so the model frequently sees heavily corrupted inputs, matching the generation setting.
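A sketch of the masking procedure, assuming the square root schedule means drawing the per-position mask rate as the square root of a uniform random variable (the paper does not spell out the exact form):

```python
import random

def mask_positions(length, p_full=0.9, rng=None):
    """Return a boolean mask over positions; True means the annotation is hidden."""
    rng = rng or random.Random()
    if rng.random() < p_full:
        return [True] * length          # mask everything (90% of the time)
    rate = rng.random() ** 0.5          # sqrt schedule: biased toward high rates
    return [rng.random() < rate for _ in range(length)]

mask = mask_positions(10, rng=random.Random(0))
print(mask)
```

Because `sqrt(u)` for uniform `u` concentrates mass near 1, partial masking still tends to hide most positions, which suits a model meant to generate annotations from scratch.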

c. Training Objective: the output head is trained with a binary cross-entropy loss to predict the presence of every annotation label that was masked, reconstructing the full annotations from the partially observed input.
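The objective can be sketched as a per-position, per-label binary cross-entropy averaged over the masked positions. Again, this is an illustration under assumed shapes, not the actual implementation:

```python
import numpy as np

def bce_loss(logits, targets, mask):
    """Binary cross-entropy over masked positions only.

    logits:  (L, V) raw scores from the output head
    targets: (L, V) multi-hot ground-truth annotations
    mask:    (L,)   True where annotations were masked and must be predicted
    """
    probs = 1.0 / (1.0 + np.exp(-logits))            # sigmoid
    eps = 1e-12                                      # numerical stability
    per_label = -(targets * np.log(probs + eps)
                  + (1 - targets) * np.log(1 - probs + eps))
    return per_label[mask].mean()                    # average over masked positions

L, V = 4, 6
rng = np.random.default_rng(0)
logits = rng.normal(size=(L, V))
targets = (rng.random(size=(L, V)) < 0.1).astype(float)  # sparse multi-hot
mask = np.array([True, True, False, True])
print(float(bce_loss(logits, targets, mask)))
```

Treating each label as an independent binary prediction (rather than a softmax over the vocabulary) is what allows multiple annotations to be present at one position.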


5. Key Takeaways

  1. Residue annotations form a 1,474-label multi-hot vocabulary derived from InterProScan and filtered by frequency.
  2. Annotations enter ESM3 as per-position unordered sets, via summed embeddings added to the sequence embedding.
  3. A per-label binary classification head, trained with masked reconstruction and a binary cross-entropy loss, lets the model both condition on and generate annotations.


Conclusion

This approach integrates residue-level functional annotations into a protein language model, enabling it to learn rich representations that combine sequence and functional information. The masking strategy ensures that the model doesn't just memorize annotations but learns to predict them based on the protein sequence, which is valuable for understanding protein function and aiding in tasks like protein engineering and drug discovery.

By processing residue annotations as per-position unordered sets and training the model to reconstruct masked annotations, ESM3 effectively learns the complex relationships between amino acid sequences and their functional roles within proteins.


This explanation should provide a clear understanding of how residue annotations are incorporated into the ESM3 model, how the model processes these annotations, and how it is trained to predict them. It covers the data collection process, model architecture, input processing, output prediction, and training methodology, all of which are crucial for a graduate student beginning in machine learning to grasp the concepts involved.