\section*{A.1.8.2. FUNCTION PREDICTION}
ESM3 is trained to predict all 8 function tokens, each spanning 256 possible values. To extract interpretable predictions of protein function from ESM3, we decode the predicted function tokens into function keywords using a separately trained function token decoder.
\section*{A.1.8.2.1. Function Token Decoder}
We train a 3-layer transformer model to learn the inverse map of the function tokenization process. The model takes as input the 8 function tokens representing the locality-sensitive hash of function keywords. For each residue, it outputs binary classification predictions of the presence of each function keyword, as well as predictions of the InterPro annotations from which the keywords originate. We find that unpacking the 8-bit LSH tokens into single-bit tokens improves the training dynamics of the function token decoder. We train the function token decoder offline using combinations of InterPro tags from the UniRef annotated proteins. Since the function token vocabulary is fixed, the decoder is applied identically across different ESM3 model sizes.
\section*{A.1.8.2.2. Evaluation}
To evaluate ESM3's performance in predicting protein function, we compute Average Precision, a standard measure in information retrieval, using the validation set of UniRef proteins and their associated InterProScan function annotations. We present results in Fig. S8.
Please explain this in detail to a beginning grad student in machine learning.
Response:
### Understanding Function Prediction in ESM3
#### Overview
In this explanation, we'll delve into how ESM3, a protein language model, predicts protein functions and how this process is evaluated. We'll break down each concept to ensure clarity for someone new to machine learning.
### Section A.1.8.2: Function Prediction
#### What is ESM3?
- ESM3 is a machine learning model designed to understand proteins by learning from large amounts of protein sequence data.
- It's a type of language model, similar to those used in natural language processing, but tailored for protein sequences.
#### Why Predict Protein Function?
- Proteins are fundamental to biological processes, and understanding their functions is crucial in biology and medicine.
- Predicting protein function helps in identifying how proteins interact, which can lead to discoveries in drug development, understanding diseases, and more.
#### How Does ESM3 Predict Function?
- ESM3 is trained not just on protein sequences but also to predict function tokens, which are representations of protein functions.
- Function tokens are special tokens added to the sequence that encode information about the protein's function.
### Section A.1.8.2.1: Function Token Decoder
#### Understanding Function Tokens
- Function Tokens: In ESM3, each protein is associated with 8 function tokens.
- Each function token can take 256 possible values (since 256 = 2^8, each token carries 8 bits of information).
- These tokens are generated using a technique called Locality Sensitive Hashing (LSH).
- LSH: A method to hash input items so that similar items map to the same hash with high probability. It's useful for handling large datasets by reducing dimensionality while preserving similarity.
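To make the LSH idea concrete, here is a minimal sketch of one common LSH family, random-hyperplane hashing (SimHash). This is an illustration of the general technique, not necessarily the exact hashing scheme ESM3 uses; the 16-dimensional "keyword space" and the plane count are made-up numbers for the demo.

```python
import numpy as np

def simhash_bits(vec, planes):
    # One bit per random hyperplane: the bit records which side of the
    # plane the vector falls on. Similar vectors land on the same side
    # of most planes, so they share most hash bits.
    return (vec @ planes.T >= 0).astype(int)

rng = np.random.default_rng(0)
planes = rng.normal(size=(8, 16))  # 8 hyperplanes over a toy 16-dim space

a = rng.normal(size=16)
b = a + 0.01 * rng.normal(size=16)  # near-duplicate of a

bits_a = simhash_bits(a, planes)
bits_b = simhash_bits(b, planes)
# Near-duplicates agree on (almost) all bits; unrelated vectors agree
# on roughly half of them on average.
```

The key property on display: hashing preserves similarity, so nearby inputs collide, which is exactly what makes the hash a useful compressed code for function keywords.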
#### Why Use LSH for Function Tokens?
- Proteins can have a vast number of possible functions.
- LSH helps to encode these functions into a manageable number of tokens that still capture functional similarities.
#### The Challenge with Function Tokens
- The LSH process and function tokens are not directly interpretable; the tokens don't immediately tell us the specific functions in human-readable terms.
- To make sense of these tokens, we need a way to decode them back into actual function annotations (keywords).
#### The Function Token Decoder
- Purpose: To convert the 8 function tokens back into function keywords and annotations we can understand.
- Model Architecture: A 3-layer transformer model.
- Transformers are a type of neural network architecture well-suited for processing sequence data.
- They have been highly successful in natural language processing tasks like translation and text generation.
- Input: The decoder receives the 8 function tokens as input.
- Output: It predicts, for each residue (amino acid) in the protein:
- Binary classification for each function keyword (indicating presence or absence).
- InterPro annotations: These are standardized protein function classifications from the InterPro database.
#### Improving the Decoder with Single-Bit Tokens
- Issue with 8-Bit Tokens: Each function token can be any value from 0 to 255, making the prediction task more complex.
- Solution: Unpack each 8-bit token into single-bit tokens.
- This means instead of 8 tokens with 256 possible values each, we have 64 tokens (8 function tokens × 8 bits) with binary values (0 or 1).
- Benefit: Simplifies the learning task for the decoder, as predicting binary values is generally easier than predicting one of 256 possible values.
- Improved Training Dynamics: The decoder learns faster and more effectively when dealing with binary predictions.
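The unpacking step itself is just bit manipulation. A minimal sketch (the example token values are arbitrary, chosen only to make the bit patterns easy to check):

```python
def unpack_tokens_to_bits(tokens):
    """Unpack eight 8-bit function tokens (values 0-255) into 64 binary
    tokens, most-significant bit first."""
    bits = []
    for t in tokens:
        if not 0 <= t <= 255:
            raise ValueError(f"token out of range: {t}")
        # Walk the 8 bits of t from bit 7 down to bit 0.
        bits.extend((t >> i) & 1 for i in range(7, -1, -1))
    return bits

tokens = [0, 255, 170, 5, 9, 3, 128, 64]
bits = unpack_tokens_to_bits(tokens)
# 8 tokens x 8 bits = 64 binary values; e.g. 170 = 0b10101010
# unpacks to 1,0,1,0,1,0,1,0.
```

Each position in the 64-bit output now carries a simple yes/no signal, which is the easier binary prediction target described above.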
#### Training the Decoder
- Offline Training: The decoder is trained separately from ESM3.
- Data Used: Combinations of InterPro tags from proteins annotated in the UniRef database.
- UniRef: A comprehensive database of protein sequences and their annotations.
- InterPro: A resource that categorizes proteins into families and predicts domains and important sites.
- Why Offline Training?
- Allows the decoder to focus solely on learning the mapping from tokens to functions.
- Can use extensive annotated data for better learning without interfering with ESM3's primary training.
- Applicability Across Models: Since the function token vocabulary (the way functions are encoded into tokens) is fixed, the same decoder can be used with any size of ESM3 model.
### Section A.1.8.2.2: Evaluation
#### Assessing ESM3's Function Prediction Performance
- Objective: To determine how well ESM3 can predict protein functions through its function tokens and decoder.
#### Average Precision (AP) as a Metric
- Average Precision (AP): A common evaluation metric in information retrieval and classification tasks.
- Combines both precision (the proportion of true positives among all positive predictions) and recall (the proportion of true positives detected among all actual positives) at various threshold levels.
- High AP indicates that the model ranks correct function annotations higher than incorrect ones consistently across thresholds.
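A small self-contained implementation makes the definition precise: sort predictions by score, and average the precision at each rank where a true positive appears.

```python
def average_precision(scores, labels):
    """AP = mean of the precision values at each rank where a true
    positive occurs, after sorting predictions by descending score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

# Both true keywords ranked at the top: AP = (1/1 + 2/2) / 2 = 1.0
ap_perfect = average_precision([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])

# One true keyword at rank 2, one at rank 3: AP = (1/2 + 2/3) / 2 = 7/12
ap_mixed = average_precision([0.9, 0.2, 0.8, 0.1], [0, 1, 1, 0])
```

Note that AP is threshold-free: it scores the full ranking, so it rewards models that put correct annotations ahead of incorrect ones regardless of where a decision cutoff would be drawn.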
#### Using a Validation Set
- Validation Set: A subset of data not used during training, used exclusively for evaluation.
- Data Used: Proteins from the UniRef database with known function annotations determined by InterProScan (a tool that annotates protein sequences with InterPro annotations).
- Process:
- Run ESM3 on the validation proteins to get predicted function tokens.
- Use the decoder to obtain predicted function keywords.
- Compare these predictions against the actual known annotations.
- Compute the Average Precision to quantify performance.
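The steps above can be sketched as a small evaluation loop. Everything here is a hypothetical stand-in: `toy_predictor` plays the role of "ESM3 tokens → decoder → keyword scores", and the three-protein validation set is invented purely to exercise the loop; the real pipeline, models, and data are far larger.

```python
def evaluate(predict_fn, validation_set, keywords):
    """Macro-average AP over keywords: for each keyword, rank proteins by
    the model's score for that keyword and compare against truth."""
    def ap(scores, labels):
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        hits, precs = 0, []
        for rank, i in enumerate(order, start=1):
            if labels[i]:
                hits += 1
                precs.append(hits / rank)
        return sum(precs) / max(hits, 1)

    aps = []
    for kw in keywords:
        scores = [predict_fn(seq)[kw] for seq, _ in validation_set]
        labels = [kw in true_kws for _, true_kws in validation_set]
        aps.append(ap(scores, labels))
    return sum(aps) / len(aps)

# Hypothetical stand-in for the decoded keyword scores.
def toy_predictor(seq):
    return {"kinase": 0.9 if "K" in seq else 0.1,
            "binding": 0.8 if "B" in seq else 0.2}

val = [("MKV", {"kinase"}), ("ABC", {"binding"}), ("XYZ", set())]
macro_ap = evaluate(toy_predictor, val, ["kinase", "binding"])
# The toy predictor ranks every true protein first -> macro AP = 1.0
```

This mirrors the described process end to end: score each validation protein, compare against the InterProScan-derived ground truth, and summarize with AP.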
#### Results Presentation
- Fig. S8: (Not provided here) Likely shows a plot or table summarizing the AP scores across various functions or comparing ESM3's performance to other models.
### Key Takeaways for a Grad Student in Machine Learning
- Function Tokenization and Decoding:
- Encodes complex, high-dimensional function annotations into fixed-size, manageable tokens using LSH.
- Decoding back to function keywords requires a targeted model due to the lossy nature of hashing.
- Model Design Considerations:
- Unpacking tokens into binary bits can simplify learning and improve model performance.
- Using transformers leverages their strength in handling sequential data and capturing long-range dependencies.
- Training Strategies:
- Training specialized components (like the decoder) offline allows for focused learning and flexibility.
- Reusing the decoder across different model sizes ensures consistency and efficiency.
- Evaluation Metrics:
- Choosing appropriate metrics (like AP) is crucial for evaluating model performance in a way that reflects practical relevance.
- Domain Knowledge Integration:
- Incorporating domain-specific databases (UniRef, InterPro) enhances the model's applicability and relevance to biological research.
### Final Thoughts
Understanding how ESM3 predicts protein functions provides valuable insights into integrating machine learning with biological data. The approach of encoding complex annotations into tokens and decoding them back illustrates the challenges and solutions in handling specialized domain knowledge. As a grad student, appreciating these methods broadens your perspective on applying machine learning to diverse fields.