\section*{A.1.8.3. Residue Annotations Track}
Residue annotations label a protein's functional residues with a vocabulary of 1474 multi-hot labels emitted by InterProScan. To gather this data, we run InterProScan with the SFLD, CDD, and PIR databases on all cluster members in our UniRef and MGnify datasets (clustered at 90% sequence identity). We take all unique residue annotation descriptions that occur in more than $1 \mathrm{k}$ proteins across all of UniRef90 and MGnify90, and deduplicate labels by punctuation and case insensitivity. We join these annotations into our UniRef, MGnify, AFDB, and ESMAtlas datasets for training.
As introduced in Appendix A.1.5.1, ESM3 has a track dedicated to processing residue annotations that supports input conditioning, and an output head for generation. The residue annotation labels for a protein are tokenized into a sequence of token-sets equal in length to the protein. At each position there is an unordered set of tokens representing the residue annotations present at that position. The tokens are input to ESM3 first through an embedding lookup, followed by a sum over the embeddings. The permutation invariance of the sum ensures that the labels are presented to the model as an unordered set. The per-position embedding sums are then added onto the per-position sequence embedding before input into the first transformer block. Positions with no residue annotations are represented by a <pad> token.
Please explain this in detail to a beginning grad student in machine learning.
Response: Understanding the Residue Annotations Track in ESM3
Introduction for a Beginning Grad Student in Machine Learning:
In modern computational biology, understanding protein functions at the residue level is crucial. Proteins are made up of amino acids, and specific residues (amino acid positions) within a protein can have important functional roles, like binding to other molecules or catalyzing reactions. Machine learning models can help predict these functional sites, and one such model is ESM3.
This explanation will walk you through how the ESM3 model incorporates residue annotations into its architecture to predict functional sites in proteins. We'll cover data collection, preprocessing, model input and output, and the training procedure.
1. Residue Annotations: What Are They?
Proteins and Residues:
Proteins are sequences of amino acids.
Each amino acid in the sequence is referred to as a residue.
Functionally Important Residues:
Some residues play crucial roles in the protein's function (e.g., active sites, binding sites).
Identifying these residues helps in understanding protein function and drug design.
Residue Annotations:
Labels assigned to specific residues indicating their functional roles.
They can include annotations like "catalytic site," "binding site," etc.
These annotations are multi-label—a single residue can have multiple annotations.
2. Collecting Residue Annotation Data
InterProScan:
A bioinformatics tool that scans protein sequences to predict functional regions and domains.
It integrates multiple databases to provide comprehensive annotations.
Databases Used:
SFLD (Structure-Function Linkage Database): Focuses on enzymes and their functions.
CDD (Conserved Domain Database): Contains conserved domain alignments.
PIR (Protein Information Resource): Provides protein sequences and annotations.
Datasets:
UniRef90: A database of clustered protein sequences with at least 90% sequence identity.
MGnify90: Similar clustering for metagenomic sequences.
Data Collection Steps:
Running InterProScan:
Scan all cluster members of the UniRef90 and MGnify90 datasets against the SFLD, CDD, and PIR databases to collect candidate residue annotations.
Filtering Annotations:
Keep only annotation descriptions that occur in more than 1k proteins across all of UniRef90 and MGnify90, discarding rare or noisy labels.
Deduplication:
Merge labels that differ only in punctuation or letter case, so near-identical descriptions map to a single label.
Building the Vocabulary:
The surviving descriptions form the fixed vocabulary of 1474 labels; a sketch of the filtering and deduplication step follows this list.
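Here is a minimal sketch of how one might implement the frequency filter and deduplication described above. The `normalize` helper and the `MIN_COUNT` name are illustrative assumptions, not taken from the paper:

```python
import re
from collections import Counter

MIN_COUNT = 1000  # the "> 1k proteins" threshold from the paper

def normalize(label: str) -> str:
    """Deduplicate by case and punctuation: lowercase, strip punctuation."""
    return re.sub(r"[^\w\s]", "", label.lower()).strip()

def build_vocab(protein_annotations):
    """protein_annotations: iterable of sets of raw label strings,
    one set per protein. Returns the sorted label vocabulary."""
    counts = Counter()
    for labels in protein_annotations:
        # Count each normalized label at most once per protein.
        counts.update({normalize(label) for label in labels})
    return sorted(label for label, c in counts.items() if c > MIN_COUNT)
```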
3. Incorporating Annotations into the ESM3 Model
ESM3 Model Overview:
ESM3 is a transformer-based language model designed for protein sequences.
Similar to models like BERT in natural language processing.
It uses self-attention mechanisms to capture relationships in the data.
Residue Annotations Track:
A dedicated pathway within ESM3 for processing residue annotations.
Supports both input conditioning (incorporating annotations into inputs) and output generation (predicting annotations).
Input Representation:
Tokenization:
The annotations for a protein are tokenized into a sequence of token-sets, one unordered set per residue, so the sequence has the same length as the protein.
Embedding Lookup:
Each annotation token is mapped to a learned embedding vector through a lookup table.
Annotation Embeddings:
Each annotation token has an embedding vector.
All annotations at a position are embedded separately.
Summing Embeddings:
At each position, sum the embeddings of all annotation tokens.
Permutation Invariance:
Because addition is commutative, the summed embedding is identical no matter the order of the annotation tokens, matching the fact that the set is unordered.
Protein Embeddings:
The amino acid tokens are also embedded.
Combining Embeddings:
The per-position annotation sums are added onto the per-position sequence embeddings before input to the first transformer block.
Handling Missing Annotations:
Positions with no residue annotations are represented by a <pad> token, whose embedding is a vector of zeros, so it contributes nothing to the per-position sum. A minimal sketch of this input pathway appears below.
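The following PyTorch sketch makes the input pathway concrete. The module name, the embedding dimension (512), and the convention of reserving token id 0 for <pad> are illustrative assumptions; the mechanism itself (lookup, permutation-invariant sum, zero <pad> embedding) follows the description above:

```python
import torch
import torch.nn as nn

class ResidueAnnotationEmbedding(nn.Module):
    """Sum-pooled embeddings of per-position annotation token sets."""

    def __init__(self, vocab_size: int = 1474, d_model: int = 512, pad_idx: int = 0):
        super().__init__()
        # padding_idx fixes the <pad> embedding at all zeros, so padded
        # slots contribute nothing to the per-position sum.
        self.embed = nn.Embedding(vocab_size + 1, d_model, padding_idx=pad_idx)

    def forward(self, annotation_ids: torch.Tensor) -> torch.Tensor:
        # annotation_ids: (batch, seq_len, max_set_size) integer ids;
        # positions with no annotations contain only pad ids (0).
        # Summing over the set dimension is permutation invariant.
        return self.embed(annotation_ids).sum(dim=2)

# Usage: the result is added to the sequence embedding before the
# first transformer block, e.g.
# x = seq_embedding + ResidueAnnotationEmbedding()(annotation_ids)
```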
4. Model Output and Training Procedure
Output Head:
The residue annotations track has a specialized output layer.
Per-Position Predictions:
For every residue position, the head produces one prediction per label in the 1474-label vocabulary.
Binary Classification Logits:
Each label gets an independent presence/absence logit, so a single position can carry several labels at once (multi-hot rather than a softmax over classes).
Training Objective:
Binary Cross-Entropy Loss:
Applied independently to every (position, label) logit against the multi-hot targets; see the formula below.
The goal is to minimize this loss, improving the model's annotation predictions.
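As a concrete reference, the standard per-protein binary cross-entropy over $L$ positions and $V = 1474$ labels (written here from the general definition, not quoted from the paper) is

$$\mathcal{L} = -\frac{1}{LV} \sum_{i=1}^{L} \sum_{j=1}^{V} \Big[ y_{ij} \log \sigma(z_{ij}) + (1 - y_{ij}) \log\big(1 - \sigma(z_{ij})\big) \Big],$$

where $z_{ij}$ is the logit for label $j$ at position $i$, $y_{ij} \in \{0,1\}$ is the multi-hot target, and $\sigma$ is the sigmoid. Whether the loss is averaged or summed over positions and labels is not specified here.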
Masking Procedure:
Purpose of Masking:
Hiding annotations during training prevents the model from simply copying them from input to output and forces it to infer them from the sequence and surrounding context.
Full Masking (90% Probability):
With 90% probability, every residue annotation in the protein is masked, so the model must predict all of them.
Partial Masking (10% Probability):
With 10% probability, only a subset of the annotations is masked, and the model can condition on those left visible.
Training Task:
Reconstruct the masked annotations at every position as a per-position multi-label prediction over the 1474-label vocabulary; a sketch of one training step follows.
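Below is one plausible training step under the 90%/10% masking scheme. The `model` signature, the per-position masking rate `p_pos`, and the choice to zero out hidden annotations are all assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def training_step(model, seq_tokens, annot_targets, p_full=0.9, p_pos=0.5):
    """One illustrative step. annot_targets: (B, L, 1474) multi-hot floats.
    `model` stands in for ESM3, taking sequence tokens plus the visible
    annotations; its exact signature here is hypothetical."""
    B, L, V = annot_targets.shape
    # With probability p_full, mask every annotation in the protein;
    # otherwise mask a random subset of positions (p_pos is assumed).
    full = torch.rand(B, 1) < p_full                    # (B, 1)
    partial = torch.rand(B, L) < p_pos                  # (B, L)
    mask = full | partial                               # broadcasts to (B, L)
    visible = annot_targets * (~mask).float().unsqueeze(-1)
    logits = model(seq_tokens, visible)                 # (B, L, V)
    # Independent binary cross-entropy for every (position, label) pair.
    loss = F.binary_cross_entropy_with_logits(logits, annot_targets)
    loss.backward()
    return loss
```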
5. Key Concepts and Takeaways
Multi-Label Classification:
Unlike multi-class classification (where each input belongs to one class), multi-label allows for multiple classes (annotations) per input (residue).
Requires loss functions and evaluation metrics that account for multiple simultaneous labels.
Permutation Invariance:
The sum of embeddings is not affected by the order of annotations.
Important because the set of annotations at a residue is unordered.
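A tiny check of this property in plain PyTorch, purely for illustration:

```python
import torch

emb = torch.randn(5, 8)    # embeddings of 5 annotation tokens, dim 8
perm = torch.randperm(5)   # a random reordering of the set
assert torch.allclose(emb.sum(dim=0), emb[perm].sum(dim=0))  # order-free
```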
Embedding Strategy:
Combining annotation embeddings with sequence embeddings enriches the input.
Helps the model learn nuanced relationships between sequence patterns and functional sites.
Masking Strategies:
Similar to dropout, masking helps prevent over-reliance on certain inputs.
Forces the model to learn from context and generalize better.
6. Why Is This Important?
Understanding Protein Function:
Predicting residue annotations helps in identifying important functional sites.
Useful for drug targeting, understanding diseases, and bioengineering.
Advancing Machine Learning in Biology:
Demonstrates how advanced NLP techniques (like transformers) can be adapted to biological data.
Encourages interdisciplinary approaches combining ML and bioinformatics.
7. Further Considerations
Model Evaluation:
Assessing performance in multi-label classification can be complex.
Metrics like accuracy, precision, recall, and F1-score need to be adapted.
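For instance, scikit-learn handles multi-hot targets directly; a small illustration with toy numbers, not tied to ESM3's actual evaluation protocol:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-hot targets and predictions: 4 residues x 3 labels.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

print(f1_score(y_true, y_pred, average="micro"))  # pools all label decisions
print(f1_score(y_true, y_pred, average="macro"))  # averages per-label F1
```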
Data Imbalance:
Some annotations may be much rarer than others.
Techniques like weighting loss functions or resampling may be necessary.
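One common option (an illustration, not something the paper states it uses) is the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`, which up-weights rare positive labels. The counts below are hypothetical:

```python
import torch
import torch.nn as nn

# Suppose label_counts[j] = number of positives for label j out of
# n_total examples (made-up numbers for illustration).
label_counts = torch.tensor([50_000.0, 1_200.0, 90.0])
n_total = 1_000_000.0
# pos_weight = negatives / positives per label: rare labels weigh more.
pos_weight = (n_total - label_counts) / label_counts
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
```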
Biological Relevance:
Understanding the biological significance of predictions is crucial.
Collaborating with domain experts enhances model applicability.
Conclusion
By integrating residue annotations into its architecture, the ESM3 model leverages both sequence information and functional annotations to improve the prediction of functional sites in proteins. This approach combines advanced machine learning techniques with biological domain knowledge, opening avenues for more accurate and insightful models in computational biology.
As a beginning grad student, you should now have a working picture of how residue annotations are collected, filtered into a 1474-label vocabulary, embedded as unordered sets, and predicted per position by ESM3.
Suggested Next Steps:
Explore Transformer Models:
Study how the transformer architecture works, focusing on self-attention mechanisms.
Dive into Multi-Label Classification:
Learn about different loss functions (e.g., binary cross-entropy) and evaluation metrics suited for multi-label problems.
Understand Embeddings:
Look into how embeddings represent discrete tokens and how they can be summed or averaged.
Learn About Masked Language Modeling:
Investigate models like BERT to see how masking inputs during training improves performance.
Apply to a Project:
Consider working on a project that involves sequence prediction or annotation, applying these concepts hands-on.
I hope this detailed explanation helps you grasp how residue annotations are integrated into the ESM3 model and inspires you to delve deeper into machine learning applications in biology!