\section*{A.1.8.3. Residue Annotations Track}
Residue annotations label a protein's sites of functional residues with a vocabulary of 1,474 multi-hot labels emitted by InterProScan. To gather this data, we run InterProScan with the SFLD, CDD, and PIR databases on all cluster members in our UniRef and MGnify datasets (clustered at 90\% sequence identity). We take all unique residue annotation descriptions that occur in more than 1,000 proteins across UniRef90 and MGnify90, and deduplicate labels up to punctuation and case. We join these annotations into our UniRef, MGnify, AFDB, and ESMAtlas datasets for training.
As introduced in Appendix A.1.5.1, ESM3 has a track dedicated to processing residue annotations that supports input conditioning and has an output head for generation. The residue annotation labels for a protein are tokenized into a sequence of token-sets equal in length to the protein. At each position there is an unordered set of tokens representing the residue annotations present at that position. The tokens are input to ESM3 first through an embedding lookup, followed by a sum over the embeddings at each position. The permutation invariance of the sum ensures that the labels are presented to the model as an unordered set. The per-position embedding sums are then added onto the per-position sequence embeddings before input into the first transformer block. Positions with no residue annotations are represented by a \texttt{<pad>} token whose embedding is fixed to zero.
Please explain this in detail to a beginning grad student in machine learning.
Response: Understanding the Residue Annotations Track in ESM3
Introduction
In protein biology, understanding the function of individual amino acid residues within a protein is crucial. Some residues may be part of an active site, binding domain, or other functional regions. Annotating these residues helps researchers understand protein function and interactions.
In machine learning, especially with models like ESM3 (a transformer-based model for proteins), incorporating residue annotations can improve the model's understanding and predictions about proteins. This explanation will break down how residue annotations are collected, processed, and integrated into the ESM3 model for training.
1. What Are Residue Annotations?
Residue annotations are labels assigned to specific amino acid residues within a protein to indicate their functional roles. For example, a residue might be labeled as part of an enzyme's active site or a binding site for a particular molecule.
In the context of ESM3, we have a fixed vocabulary of 1,474 annotation labels, and each residue position can carry any subset of them, so the target at a position is a multi-hot vector rather than a single class.
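To make "multi-hot" concrete, here is a minimal sketch of a per-protein target tensor. The label indices and their meanings are hypothetical; only the vocabulary size of 1,474 comes from the paper.

```python
import torch

VOCAB_SIZE = 1474  # size of the residue annotation vocabulary

# Multi-hot targets for a protein of length L: one {0,1} vector per residue.
L = 5
targets = torch.zeros(L, VOCAB_SIZE)

# Hypothetical labels: residue 2 is both an "active site" (index 17) and a
# "metal binding site" (index 903); residue 4 carries a single label.
targets[2, 17] = 1.0
targets[2, 903] = 1.0
targets[4, 42] = 1.0
```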
2. Data Collection for Residue Annotations
To gather data for training, the following steps are taken:
a. Data Sources: Proteins from the UniRef and MGnify datasets, taking all members of clusters built at 90% sequence identity (UniRef90 and MGnify90).
b. Annotation Tool: InterProScan, which scans protein sequences against the member databases of InterPro to assign functional annotations to specific residues.
c. Databases Used with InterProScan: SFLD (Structure-Function Linkage Database), CDD (Conserved Domain Database), and PIR (Protein Information Resource).
d. Process: Keep every unique residue annotation description that occurs in more than 1,000 proteins across UniRef90 and MGnify90, and merge labels that differ only in punctuation or letter case; this yields the vocabulary of 1,474 labels (see the sketch after this list).
e. Integration into Datasets: The resulting annotations are joined into the UniRef, MGnify, AFDB, and ESMAtlas datasets used for training.
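As a rough illustration of step (d), here is a minimal sketch of the vocabulary-building logic, assuming the raw InterProScan output is available as (protein_id, description) pairs; the exact normalization rule is an assumption chosen to match "deduplicate by punctuation and case insensitivity":

```python
import string
from collections import defaultdict

def normalize(description: str) -> str:
    """Canonical label form: lowercase with punctuation removed."""
    table = str.maketrans("", "", string.punctuation)
    return description.lower().translate(table).strip()

def build_vocab(annotations, min_proteins=1000):
    """annotations: iterable of (protein_id, description) pairs.

    Keeps every normalized label seen in more than `min_proteins`
    distinct proteins and assigns it an integer id.
    """
    proteins_per_label = defaultdict(set)
    for protein_id, description in annotations:
        proteins_per_label[normalize(description)].add(protein_id)
    kept = sorted(
        label for label, prots in proteins_per_label.items()
        if len(prots) > min_proteins
    )
    return {label: i for i, label in enumerate(kept)}
```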
3. Incorporating Residue Annotations into ESM3
ESM3 is an advanced protein language model that uses transformers, similar to models like BERT in natural language processing.
a. Structure of the Input: A protein's annotations are tokenized into a sequence of token-sets with the same length as the protein. At each position there is an unordered set of tokens, one for each annotation present at that residue; most positions carry no annotations at all.
b. Tokenization and Embedding: Each annotation token is mapped to a vector via an embedding lookup, and the vectors within each position's set are summed. Positions with no annotations are represented by a <pad> token whose embedding is fixed to zero, so they contribute nothing to the sum.
c. Feeding into the Transformer: The per-position embedding sums are added onto the per-position sequence embeddings before the first transformer block, making the annotation signal available throughout the network. A minimal sketch of steps (b) and (c) follows.
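The sketch below illustrates steps (b) and (c) in PyTorch. The module name, tensor shapes, and fixed set size are illustrative assumptions, not ESM3's actual implementation; the two properties it demonstrates — a zero-fixed <pad> embedding and a permutation-invariant sum — are the ones described above.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 1474
PAD_ID = 0  # reserved id meaning "no annotation in this slot"

class ResidueAnnotationEmbedding(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # padding_idx=PAD_ID fixes that embedding row to zeros, so empty
        # slots in a position's token set contribute nothing to the sum.
        self.embed = nn.Embedding(VOCAB_SIZE + 1, d_model, padding_idx=PAD_ID)

    def forward(self, annotation_ids: torch.Tensor) -> torch.Tensor:
        # annotation_ids: (batch, seq_len, max_set_size), padded with PAD_ID.
        # Summing over the set dimension is permutation-invariant: the order
        # in which a position's annotations appear cannot change the result.
        return self.embed(annotation_ids).sum(dim=-2)

# Usage: add annotation embeddings onto the per-position sequence embeddings
# before the first transformer block.
d_model = 64
ann_embed = ResidueAnnotationEmbedding(d_model)
annotation_ids = torch.zeros(2, 10, 4, dtype=torch.long)  # mostly empty sets
annotation_ids[0, 3, :2] = torch.tensor([17, 903])        # two labels at one site
seq_embeddings = torch.randn(2, 10, d_model)
hidden = seq_embeddings + ann_embed(annotation_ids)
```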
4. Residue Annotations Output Head and Training
a. Output Head for Annotations: An output head maps each position's final transformer representation to 1,474 logits, one per label in the vocabulary. Each logit is an independent presence/absence prediction, so generating annotations is framed as per-position multi-label classification.
b. Masking Strategy for Training:
To train the model to predict annotations, a masking strategy is employed (a sketch follows this list):
With 90% Probability: the protein's residue annotations are masked, i.e., hidden from the input, so the model must predict them from the other inputs (such as the sequence) alone.
With 10% Probability: the annotations are kept visible as input, so the model learns to condition on them when they are available.
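Here is a minimal sketch of this corruption step, under two assumptions the excerpt does not spell out: the 90/10 coin flip is made once per protein, and "masking" means replacing the entire annotation input with the empty (<pad>) set.

```python
import torch

PAD_ID = 0

def mask_annotations(annotation_ids: torch.Tensor, p_mask: float = 0.9):
    """annotation_ids: (batch, seq_len, max_set_size) annotation token ids.

    Returns the corrupted input plus a per-protein flag recording which
    proteins had their annotations hidden (the ones to reconstruct).
    """
    masked = torch.rand(annotation_ids.shape[0]) < p_mask  # per-protein flip
    corrupted = annotation_ids.clone()
    corrupted[masked] = PAD_ID  # hide every annotation for masked proteins
    return corrupted, masked
```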
c. Training Objective:
Binary Cross-Entropy Loss:
The model is trained to minimize the binary cross-entropy loss between the predicted annotations and the true annotations at each position.
This loss function is appropriate for multi-label classification problems where each position can have multiple annotations.
Reconstruction of Masked Annotations:
The model's task during training is to correctly predict any annotations that were masked.
By reconstructing these annotations, the model learns to associate sequence patterns with functional annotations.
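Putting the head and loss together, here is a minimal sketch of the training objective; the linear head and tensor shapes are illustrative assumptions, while the binary cross-entropy over per-position multi-hot targets is exactly what the text above describes.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 1474
d_model = 64

# Output head: one presence/absence logit per vocabulary label per position.
head = nn.Linear(d_model, VOCAB_SIZE)
criterion = nn.BCEWithLogitsLoss()  # multi-label binary cross-entropy

hidden = torch.randn(2, 10, d_model)  # final transformer representations
targets = torch.randint(0, 2, (2, 10, VOCAB_SIZE)).float()  # true multi-hot labels

logits = head(hidden)              # (batch, seq_len, VOCAB_SIZE)
loss = criterion(logits, targets)  # averaged over positions and labels
loss.backward()
```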
5. Key Takeaways
Multi-Label Classification:
Residue annotation prediction is a multi-label classification task at each residue position.
Permutation Invariance:
Summing embeddings ensures that the model treats the set of annotations at each position as unordered, which is appropriate since the order doesn't matter biologically.
Input Conditioning:
The model can condition on available annotations when they are not masked, providing additional context for prediction.
Generalization:
By learning to predict annotations from sequence context, the model can generalize to unseen proteins and predict annotations for residues without prior annotations.
Conclusion
This approach integrates residue-level functional annotations into a protein language model, enabling it to learn rich representations that combine sequence and functional information. The masking strategy ensures that the model doesn't just memorize annotations but learns to predict them based on the protein sequence, which is valuable for understanding protein function and aiding in tasks like protein engineering and drug discovery.
By processing residue annotations as per-position unordered sets and training the model to reconstruct masked annotations, ESM3 effectively learns the complex relationships between amino acid sequences and their functional roles within proteins.
In summary: the data collection pipeline, the per-position set tokenization and embedding, the multi-label output head, and the masking-based training objective are the four pieces to internalize; together they explain how ESM3 incorporates residue annotations, processes them, and learns to predict them.