\section*{A.1.8.3. Residue Annotations Track}
Residue annotations label a protein's sites of functional residues with a vocabulary of 1,474 multi-hot labels emitted by InterProScan. To gather this data, we run InterProScan with the SFLD, CDD, and PIR databases on all cluster members in our UniRef and MGnify datasets (clustered at 90\% sequence identity). We take all unique residue annotation descriptions that occur in more than 1,000 proteins across UniRef90 and MGnify90, and deduplicate labels up to punctuation and case. We join these annotations into our UniRef, MGnify, AFDB, and ESMAtlas datasets for training.
As introduced in Appendix A.1.5.1, ESM3 has a track dedicated to processing residue annotations that supports input conditioning and has an output head for generation. The residue annotation labels for a protein are tokenized into a sequence of token-sets equal in length to the protein. At each position there is an unordered set of tokens representing the residue annotations present at that position. The tokens are input to ESM3 first through an embedding lookup, followed by a sum over the embeddings at each position. The permutation invariance of the sum ensures that the labels are presented to the model as an unordered set. The per-position embedding sums are then added onto the per-position sequence embeddings before input into the first transformer block. Positions with no residue annotations are represented by a \texttt{<pad>} token whose embedding is fixed to zero.
Please explain this in detail to a beginning grad student in machine learning.
Response: Understanding the Residue Annotations Track in ESM3
Introduction
In protein biology, understanding the function of individual amino acid residues within a protein is crucial. Some residues may be part of an active site, binding domain, or other functional regions. Annotating these residues helps researchers understand protein function and interactions.
In machine learning, especially with models like ESM3 (a transformer-based model for proteins), incorporating residue annotations can improve the model's understanding and predictions about proteins. This explanation will break down how residue annotations are collected, processed, and integrated into the ESM3 model for training.
1. What Are Residue Annotations?
Residue annotations are labels assigned to specific amino acid residues within a protein to indicate their functional roles. For example, a residue might be labeled as part of an enzyme's active site or a binding site for a particular molecule.
In the context of ESM3, we have a fixed vocabulary of 1,474 annotation labels, and each residue position can carry any subset of them, so the target at a position is a multi-hot vector rather than a single class.
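To make "multi-hot" concrete, here is a minimal sketch of a per-protein target tensor. The label indices and their meanings are hypothetical; only the vocabulary size of 1,474 comes from the paper.

```python
import torch

VOCAB_SIZE = 1474  # size of the residue annotation vocabulary

# Multi-hot targets for a protein of length L: one {0,1} vector per residue.
L = 5
targets = torch.zeros(L, VOCAB_SIZE)

# Hypothetical labels: residue 2 is both an "active site" (index 17) and a
# "metal binding site" (index 903); residue 4 carries a single label.
targets[2, 17] = 1.0
targets[2, 903] = 1.0
targets[4, 42] = 1.0
```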
2. Data Collection for Residue Annotations
To gather data for training, the following steps are taken:
a. Data Sources: Proteins from the UniRef and MGnify datasets, taking all members of clusters built at 90% sequence identity (UniRef90 and MGnify90).
b. Annotation Tool: InterProScan, which scans protein sequences against the member databases of InterPro to assign functional annotations to specific residues.
c. Databases Used with InterProScan: SFLD (Structure-Function Linkage Database), CDD (Conserved Domain Database), and PIR (Protein Information Resource).
d. Process: Keep every unique residue annotation description that occurs in more than 1,000 proteins across UniRef90 and MGnify90, and merge labels that differ only in punctuation or letter case; this yields the vocabulary of 1,474 labels (see the sketch after this list).
e. Integration into Datasets: The resulting annotations are joined into the UniRef, MGnify, AFDB, and ESMAtlas datasets used for training.
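As a rough illustration of step (d), here is a minimal sketch of the vocabulary-building logic, assuming the raw InterProScan output is available as (protein_id, description) pairs; the exact normalization rule is an assumption chosen to match "deduplicate by punctuation and case insensitivity":

```python
import string
from collections import defaultdict

def normalize(description: str) -> str:
    """Canonical label form: lowercase with punctuation removed."""
    table = str.maketrans("", "", string.punctuation)
    return description.lower().translate(table).strip()

def build_vocab(annotations, min_proteins=1000):
    """annotations: iterable of (protein_id, description) pairs.

    Keeps every normalized label seen in more than `min_proteins`
    distinct proteins and assigns it an integer id.
    """
    proteins_per_label = defaultdict(set)
    for protein_id, description in annotations:
        proteins_per_label[normalize(description)].add(protein_id)
    kept = sorted(
        label for label, prots in proteins_per_label.items()
        if len(prots) > min_proteins
    )
    return {label: i for i, label in enumerate(kept)}
```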
3. Incorporating Residue Annotations into ESM3
ESM3 is an advanced protein language model that uses transformers, similar to models like BERT in natural language processing.
a. Structure of the Input: A protein's annotations are tokenized into a sequence of token-sets with the same length as the protein. At each position there is an unordered set of tokens, one for each annotation present at that residue; most positions carry no annotations at all.
b. Tokenization and Embedding: Each annotation token is mapped to a vector via an embedding lookup, and the vectors within each position's set are summed. Positions with no annotations are represented by a <pad> token whose embedding is fixed to zero, so they contribute nothing to the sum.
c. Feeding into the Transformer: The per-position embedding sums are added onto the per-position sequence embeddings before the first transformer block, making the annotation signal available throughout the network. A minimal sketch of steps (b) and (c) follows.
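The sketch below illustrates steps (b) and (c) in PyTorch. The module name, tensor shapes, and fixed set size are illustrative assumptions, not ESM3's actual implementation; the two properties it demonstrates — a zero-fixed <pad> embedding and a permutation-invariant sum — are the ones described above.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 1474
PAD_ID = 0  # reserved id meaning "no annotation in this slot"

class ResidueAnnotationEmbedding(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # padding_idx=PAD_ID fixes that embedding row to zeros, so empty
        # slots in a position's token set contribute nothing to the sum.
        self.embed = nn.Embedding(VOCAB_SIZE + 1, d_model, padding_idx=PAD_ID)

    def forward(self, annotation_ids: torch.Tensor) -> torch.Tensor:
        # annotation_ids: (batch, seq_len, max_set_size), padded with PAD_ID.
        # Summing over the set dimension is permutation-invariant: the order
        # in which a position's annotations appear cannot change the result.
        return self.embed(annotation_ids).sum(dim=-2)

# Usage: add annotation embeddings onto the per-position sequence embeddings
# before the first transformer block.
d_model = 64
ann_embed = ResidueAnnotationEmbedding(d_model)
annotation_ids = torch.zeros(2, 10, 4, dtype=torch.long)  # mostly empty sets
annotation_ids[0, 3, :2] = torch.tensor([17, 903])        # two labels at one site
seq_embeddings = torch.randn(2, 10, d_model)
hidden = seq_embeddings + ann_embed(annotation_ids)
```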
4. Residue Annotations Output Head and Training
a. Output Head for Annotations: An output head maps each position's final transformer representation to 1,474 logits, one per label in the vocabulary. Each logit is an independent presence/absence prediction, so generating annotations is framed as per-position multi-label classification.
b. Masking Strategy for Training:
To train the model to predict annotations, a masking strategy is employed (a sketch follows this list):
With 90% Probability: the protein's residue annotations are masked, i.e., hidden from the input, so the model must predict them from the other inputs (such as the sequence) alone.
With 10% Probability: the annotations are kept visible as input, so the model learns to condition on them when they are available.
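Here is a minimal sketch of this corruption step, under two assumptions the excerpt does not spell out: the 90/10 coin flip is made once per protein, and "masking" means replacing the entire annotation input with the empty (<pad>) set.

```python
import torch

PAD_ID = 0

def mask_annotations(annotation_ids: torch.Tensor, p_mask: float = 0.9):
    """annotation_ids: (batch, seq_len, max_set_size) annotation token ids.

    Returns the corrupted input plus a per-protein flag recording which
    proteins had their annotations hidden (the ones to reconstruct).
    """
    masked = torch.rand(annotation_ids.shape[0]) < p_mask  # per-protein flip
    corrupted = annotation_ids.clone()
    corrupted[masked] = PAD_ID  # hide every annotation for masked proteins
    return corrupted, masked
```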
c. Training Objective:
Binary Cross-Entropy Loss:
The model is trained to minimize the binary cross-entropy loss between the predicted annotations and the true annotations at each position.
This loss function is appropriate for multi-label classification problems where each position can have multiple annotations.
Reconstruction of Masked Annotations:
The model's task during training is to correctly predict any annotations that were masked.
By reconstructing these annotations, the model learns to associate sequence patterns with functional annotations.
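Putting the head and loss together, here is a minimal sketch of the training objective; the linear head and tensor shapes are illustrative assumptions, while the binary cross-entropy over per-position multi-hot targets is exactly what the text above describes.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 1474
d_model = 64

# Output head: one presence/absence logit per vocabulary label per position.
head = nn.Linear(d_model, VOCAB_SIZE)
criterion = nn.BCEWithLogitsLoss()  # multi-label binary cross-entropy

hidden = torch.randn(2, 10, d_model)  # final transformer representations
targets = torch.randint(0, 2, (2, 10, VOCAB_SIZE)).float()  # true multi-hot labels

logits = head(hidden)              # (batch, seq_len, VOCAB_SIZE)
loss = criterion(logits, targets)  # averaged over positions and labels
loss.backward()
```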
5. Key Takeaways
Multi-Label Classification:
Residue annotation prediction is a multi-label classification task at each residue position.
Permutation Invariance:
Summing embeddings ensures that the model treats the set of annotations at each position as unordered, which is appropriate since the order doesn't matter biologically.
Input Conditioning:
The model can condition on available annotations when they are not masked, providing additional context for prediction.
Generalization:
By learning to predict annotations from sequence context, the model can generalize to unseen proteins and predict annotations for residues without prior annotations.
Conclusion
This approach integrates residue-level functional annotations into a protein language model, enabling it to learn rich representations that combine sequence and functional information. The masking strategy ensures that the model doesn't just memorize annotations but learns to predict them based on the protein sequence, which is valuable for understanding protein function and aiding in tasks like protein engineering and drug discovery.
By processing residue annotations as per-position unordered sets and training the model to reconstruct masked annotations, ESM3 effectively learns the complex relationships between amino acid sequences and their functional roles within proteins.
In summary: the data collection pipeline, the per-position set tokenization and embedding, the multi-label output head, and the masking-based training objective are the four pieces to internalize; together they explain how ESM3 incorporates residue annotations, processes them, and learns to predict them.