
\section*{A.1.8.3. Residue Annotations Track}

Residue annotations label the functional residues of a protein using a vocabulary of 1474 multi-hot labels emitted by InterProScan. To gather this data, we run InterProScan with the SFLD, CDD, and PIR databases on all cluster members in our UniRef and MGnify datasets (clustered at $90\%$ sequence identity). We keep every unique residue annotation description that occurs in more than $1\mathrm{k}$ proteins across all of UniRef90 and MGnify90, and deduplicate labels that differ only in punctuation or case. We join these annotations into our UniRef, MGnify, AFDB, and ESMAtlas datasets for training.
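The filtering and deduplication steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `normalize` and `build_vocabulary`, the input layout (one list of description strings per protein), and the choice to delete punctuation rather than replace it are all hypothetical and are not taken from the ESM3 pipeline.

```python
import string
from collections import Counter

def normalize(description: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed dedup rule)."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(description.lower().translate(table).split())

def build_vocabulary(protein_annotations, min_proteins=1000):
    """protein_annotations: iterable of per-protein lists of description strings.

    Keeps descriptions seen in more than `min_proteins` proteins, then merges
    descriptions that differ only in punctuation or case.
    """
    counts = Counter()
    for descriptions in protein_annotations:
        # Count each description at most once per protein.
        counts.update(set(descriptions))
    frequent = [d for d, n in counts.items() if n > min_proteins]
    return sorted({normalize(d) for d in frequent})
```

On a toy input such as `[["Active site.", "metal binding"], ["Active site."]]` with `min_proteins=1`, only the description seen in two proteins survives, and it is stored in its normalized form.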

As introduced in Appendix A.1.5.1, ESM3 has a track dedicated to processing residue annotations that supports input conditioning, and an output head for generation. The residue annotation labels for a protein are tokenized into a sequence of token-sets whose length equals the length of the protein. At each position there is an unordered set of tokens representing the residue annotations present at that position. The tokens are input to ESM3 first through an embedding lookup followed by a sum over embeddings. The permutation invariance of the sum ensures that the labels are presented to the model as an unordered set. The per-position embedding sums are then added to the per-position sequence embedding before input into the first transformer block. Positions with no residue annotations are represented by a token whose embedding is fixed to zeros.

The residue annotations track has an output head that outputs a set of binary classification logits, predicting for each position the presence or absence of each residue annotation in the vocabulary. We apply a masking procedure to partially or fully mask residue annotation labels, and train the output head with a binary cross-entropy loss function to reconstruct the full residue annotation. In pre-training, with $90\%$ probability all residue annotations are masked, and otherwise we independently sample positions to mask with a square root schedule. The head is trained to predict the presence of any residue annotation label that was masked.
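To make the input side concrete, here is a minimal pure-Python sketch of the embedding lookup, the per-position sum, and the addition to the sequence embedding. It is not the ESM3 implementation: `PAD_ID`, `make_embedding_table`, `embed_annotations`, and `add_to_sequence` are hypothetical names, the table is random rather than learned, and plain lists stand in for tensors.

```python
import random

PAD_ID = 0  # hypothetical id for "no annotation"; its embedding is fixed to zeros

def make_embedding_table(vocab_size, dim, seed=0):
    """Random stand-in for a learned embedding table, with a zero row at PAD_ID."""
    rng = random.Random(seed)
    table = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]
    table[PAD_ID] = [0.0] * dim
    return table

def embed_annotations(token_sets, table):
    """token_sets: one set of label ids per residue. Returns one summed vector per residue."""
    dim = len(table[0])
    out = []
    for tokens in token_sets:
        ids = tokens if tokens else {PAD_ID}  # empty set -> zero embedding
        vec = [0.0] * dim
        for t in ids:
            # Summation does not depend on the order in which ids are visited.
            vec = [v + e for v, e in zip(vec, table[t])]
        out.append(vec)
    return out

def add_to_sequence(seq_embeddings, annotation_embeddings):
    """Element-wise sum of the two per-position embeddings before the first block."""
    return [[s + a for s, a in zip(sv, av)]
            for sv, av in zip(seq_embeddings, annotation_embeddings)]
```

Because positions without annotations contribute a zero vector, a protein with no residue annotations at all passes through this step with its sequence embedding unchanged.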

Please explain this in detail to a beginning grad student in machine learning.

Response: Understanding the Residue Annotations Track in ESM3

Introduction for a Beginning Grad Student in Machine Learning:

In modern computational biology, understanding protein functions at the residue level is crucial. Proteins are made up of amino acids, and specific residues (amino acid positions) within a protein can have important functional roles, like binding to other molecules or catalyzing reactions. Machine learning models can help predict these functional sites, and one such model is ESM3.

This explanation will walk you through how the ESM3 model incorporates residue annotations into its architecture to predict functional sites in proteins. We'll cover data collection, preprocessing, model input and output, and the training procedure.


1. Residue Annotations: What Are They?


2. Collecting Residue Annotation Data

  1. Running InterProScan:

  2. Filtering Annotations:

  3. Deduplication:

  4. Building the Vocabulary:


3. Incorporating Annotations into the ESM3 Model

  1. Tokenization:

  2. Embedding Lookup:

  3. Combining Embeddings:

  4. Handling Missing Annotations:


4. Model Output and Training Procedure

  1. Purpose of Masking:

  2. Full Masking (90% Probability):

  3. Partial Masking (10% Probability):

  4. Training Task:
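
The masking procedure and loss described in the excerpt can be sketched as follows. Several details here are assumptions rather than facts from the paper: the excerpt does not define the square root schedule, so drawing a mask rate as the square root of a uniform sample is only one possible reading; restricting the loss to masked positions is likewise an interpretation of "any residue annotation label that was masked"; and the function names `sample_mask`, `bce_with_logits`, and `masked_bce` are hypothetical.

```python
import math
import random

def sample_mask(length, rng, p_full=0.9):
    """Return one boolean per position; True means annotations are hidden from the input."""
    if rng.random() < p_full:
        return [True] * length  # mask every position
    # ASSUMPTION: one possible reading of a "square root schedule".
    rate = math.sqrt(rng.random())
    return [rng.random() < rate for _ in range(length)]

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy for one logit and a 0/1 target."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def masked_bce(logits, targets, mask):
    """Mean BCE over every (position, label) pair at masked positions.

    logits, targets: nested lists shaped [length][vocab_size]; mask: [length].
    """
    total, count = 0.0, 0
    for row_logits, row_targets, hidden in zip(logits, targets, mask):
        if not hidden:
            continue  # ASSUMPTION: visible positions do not contribute to the loss
        for z, y in zip(row_logits, row_targets):
            total += bce_with_logits(z, y)
            count += 1
    return total / count if count else 0.0
```

Each label gets its own sigmoid-style binary decision instead of competing in a softmax, which is what lets a single residue carry several annotations at once.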


5. Key Concepts and Takeaways


6. Why Is This Important?


7. Further Considerations


Conclusion

By integrating residue annotations into its architecture, the ESM3 model leverages both sequence information and functional annotations to improve the prediction of functional sites in proteins. This approach combines advanced machine learning techniques with biological domain knowledge, opening avenues for more accurate and insightful models in computational biology.


As a Beginning Grad Student, You Should Now:


Suggested Next Steps:


I hope this detailed explanation helps you grasp how residue annotations are integrated into the ESM3 model and inspires you to delve deeper into machine learning applications in biology!