\section*{A.1.8. Function Tokenization}
ESM3 processes annotations of the functional characteristics of proteins through two tracks: function tokens and residue annotations. Both support input conditioning and output heads for generation. Appendix A.1.5.1 outlines how these tokens are processed by the network; this section describes how the tokens themselves are created.
\section*{A.1.8.1. FUNCTION TOKENS}
Function tokens are a dense semantic representation of functional characteristics of proteins derived from free-text descriptions of the InterPro and Gene Ontology (GO) terms at each residue. At training time, function tokens are produced from each protein's InterPro annotations by a multi-step process illustrated in Fig. S7. At a high level:
\begin{enumerate}
\item For each residue, we gather free text for each InterPro annotation via annotation term names, associated GO terms per annotation (via the InterPro2GO mapping), and all ancestor GO terms. We parse the free text into counts over a vocabulary of 68,103 keywords. The vocabulary is composed of unigrams and bigrams extracted from the free text of all valid InterPro annotations (and their associated GO and ancestor GO terms) in our training datasets.
\item The keywords are converted to a sparse TF-IDF vector per InterPro annotation. During training, we also produce a corrupted version by dropping keywords at the protein level (i.e., the same keywords have their counts set to 0 across all residues) with a 15\% probability per keyword.
\item To create a vector per residue from the per-annotation vectors, we max pool the TF-IDF vectors of the annotations covering each residue. During training, we further corrupt the ``corrupted'' version by dropping annotations at the protein level (i.e., the same annotations are removed from the max pool across all residues) with a 15\% probability per annotation.
\item We then quantize each residue's vector (a highly sparse vector with float entries) into a discrete representation suitable for input to the language model by applying a fixed series of 8 locality sensitive hashes (LSH), each with 8 hyperplanes.
\end{enumerate}
The result is a sequence of 8 tokens per residue, each ranging in value from 0 to 255. We reserve a special token to represent positions with an empty set of InterPro annotations. For proteins that lack any functional annotations, the tokens are filled with the \texttt{<pad>} token, whose embedding is fixed to all zeros. At test time, we can produce per-residue vectors using the process described above, or by directly creating a TF-IDF vector from keywords.
During pre-training we use the corrupted versions of the function tokens as input, predicting the uncorrupted function tokens at positions which have been masked. 90\% of the time, the entire input is replaced with \texttt{<mask>}. The other 10\% of the time, we replace all 8 tokens of each selected residue with \texttt{<mask>}, with the per-residue selection probability sampled from a cosine masking schedule per protein. The model has an output head which predicts each of the 8 function tokens at positions with \texttt{<mask>} as input, and is trained with a categorical cross-entropy loss.
Function tokenization offers several advantages over simpler approaches, such as using a dedicated InterPro tag vocabulary. Encoding functional annotations with a generic keyword vocabulary enables flexible prompting of the model at test time with combinations of keywords that were not encountered during training. This enhances the programmability of ESM3 for designing novel proteins with functional characteristics not previously observed.
Function tokenization can also be viewed through the lens of data compression. This choice of representation reduces the input/output space from all possible InterPro combinations, which would naively require roughly 35,000 bits, to 8 tokens $\times$ 8 bits/token $= 64$ bits. This also affords significant memory savings during pre-training by eliminating the need to perform multi-class, multi-label binary classification.
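To make the comparison explicit, a back-of-the-envelope calculation (treating the naive encoding as one binary indicator per InterPro entry):
\[
\underbrace{\sim 35{,}000~\text{bits}}_{\text{one bit per InterPro entry}}
\;\longrightarrow\;
\underbrace{8~\text{tokens} \times \log_2 256~\tfrac{\text{bits}}{\text{token}} = 64~\text{bits}}_{\text{LSH function tokens}}
\]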
Please explain this in detail to a beginning grad student in machine learning.
Response:
Certainly! Let's break down the process described and explain it step by step.
Background
When building machine learning models for proteins, such as ESM3 (Evolutionary Scale Modeling version 3), incorporating functional annotations can significantly enhance the model's understanding and predictions about protein functions. Functional annotations provide insights into what a protein does, how it interacts with other molecules, and its role in biological processes.
However, incorporating these annotations into a machine learning model isn't straightforward due to their complexity and richness. The annotations come from databases like InterPro and Gene Ontology (GO), which contain vast amounts of free-text descriptions, hierarchical relationships, and a large number of unique terms.
To effectively use this information in a model, we need to convert these complex annotations into a form that the model can process—specifically, into tokens suitable for input into a language model. This process is known as function tokenization.
Objective
The goal is to create a dense, semantic representation of functional characteristics (function tokens) from free-text annotations so that they can be input into the language model. The process also aims to handle the vast number of possible annotations efficiently and to allow for flexibility in handling new or unseen functional terms.
Function Tokenization Process
The function tokenization process is a multi-step method that converts free-text functional annotations into tokens that the language model can understand.
Step 1: Gathering and Parsing Functional Annotations
a. Collect Free-Text Annotations
For each amino acid residue in a protein:
- InterPro Annotations: Obtain the names (terms) of all InterPro annotations associated with that residue. InterPro is a database that classifies proteins into families and predicts the presence of domains and functional sites.
- Gene Ontology (GO) Terms: For each InterPro annotation, get the associated GO terms using the InterPro2GO mapping. GO provides a hierarchical ontology of terms for molecular functions, biological processes, and cellular components.
- Ancestor GO Terms: Collect all ancestor GO terms for each associated GO term. This means including higher-level categories in the GO hierarchy, not just the most specific terms.
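A toy sketch of this ancestor-collection step. The parent map below is simplified and illustrative; the real ontology is a DAG where a term can have several parents and the chains contain more intermediate terms.

```python
# Simplified, illustrative GO parent map (real GO has many more edges and terms).
GO_PARENTS = {
    "GO:0004672": ["GO:0016301"],   # protein kinase activity -> kinase activity
    "GO:0016301": ["GO:0016740"],   # kinase activity -> transferase activity (simplified)
    "GO:0016740": ["GO:0003824"],   # transferase activity -> catalytic activity
    "GO:0003824": [],               # catalytic activity (near the root)
}

def go_ancestors(term: str) -> set:
    """All ancestor GO terms of `term`, following every parent edge upward."""
    ancestors, stack = set(), list(GO_PARENTS.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in ancestors:
            ancestors.add(parent)
            stack.extend(GO_PARENTS.get(parent, []))
    return ancestors

print(go_ancestors("GO:0004672"))
# e.g. {'GO:0016301', 'GO:0016740', 'GO:0003824'}
```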
b. Build a Vocabulary
- Extract Keywords: From all the free-text collected (annotation names and GO terms), extract unigrams (single words) and bigrams (pairs of words).
- Create a Keyword Vocabulary: Compile a vocabulary of these keywords, resulting in a list of 68,103 unique keywords.
c. Parse Text into Counts
- Keyword Counts: For each residue, parse the free-text annotations to count how many times each keyword appears.
At this point, each residue is represented by a sparse vector (most entries are zero) where each dimension corresponds to a keyword from the vocabulary, and the value is the count of that keyword.
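A minimal sketch of this parsing step, with a toy vocabulary and whitespace tokenization standing in for the real 68,103-keyword vocabulary and text normalization (all names here are illustrative):

```python
from collections import Counter

# Toy keyword vocabulary of unigrams and bigrams; the real one has 68,103 entries
# extracted from InterPro/GO free text in the training set.
VOCAB = {"kinase", "binding", "atp", "domain", "atp binding", "protein kinase"}

def keyword_counts(free_text: str) -> Counter:
    """Count vocabulary unigrams and bigrams appearing in one annotation's free text."""
    words = free_text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return Counter(tok for tok in words + bigrams if tok in VOCAB)

print(keyword_counts("Protein kinase domain with ATP binding activity"))
# 'kinase', 'domain', 'atp', 'binding', 'protein kinase', 'atp binding' each counted once
```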
Step 2: Converting to TF-IDF Vectors and Introducing Corruption
a. TF-IDF Transformation
- Term Frequency-Inverse Document Frequency (TF-IDF): Convert the keyword counts into TF-IDF vectors for each InterPro annotation. TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection. It increases with the number of times a word appears in the document but is offset by how frequently the word appears in the entire corpus.
- Purpose of TF-IDF: This transformation helps in emphasizing keywords that are particularly characteristic of specific annotations while downplaying keywords that are common across many annotations.
b. Corrupting the TF-IDF Vectors (Keyword Dropping)
- During Training: Introduce corruption to the TF-IDF vectors to make the model robust to missing data.
- Corruption Process: At the protein level (affecting all residues in a protein), with a 15% probability for each keyword, set its count to zero (i.e., drop the keyword). This simulates the case where certain functional information might be missing.
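A numpy sketch of steps 2a and 2b, assuming per-annotation count matrices and a precomputed IDF vector (the names `counts` and `idf` and the toy numbers are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def tfidf(counts: np.ndarray, idf: np.ndarray) -> np.ndarray:
    """counts: (n_annotations, vocab_size) keyword counts for one protein's annotations;
    idf: (vocab_size,) inverse document frequencies estimated on the training corpus."""
    return counts * idf                                # simple tf * idf weighting

def drop_keywords(counts: np.ndarray, p: float = 0.15) -> np.ndarray:
    """Corruption: zero each keyword with probability p, with the same decision
    applied across all annotations (and hence all residues) of the protein."""
    keep = rng.random(counts.shape[1]) >= p
    return counts * keep                               # broadcasts over annotations

# Toy protein with 3 annotations over a 6-keyword vocabulary.
counts = np.array([[1., 0., 2., 0., 0., 1.],
                   [0., 1., 0., 0., 1., 0.],
                   [1., 1., 0., 0., 0., 0.]])
idf = np.log(1_000 / np.array([10., 50., 5., 400., 80., 20.]))   # toy document frequencies
clean_vectors = tfidf(counts, idf)                     # used as the prediction target
corrupted_vectors = tfidf(drop_keywords(counts), idf)  # used as model input during training
```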
Step 3: Creating Per-Residue Vectors and Further Corruption
a. Max Pooling Across Annotations
- Per-Residue Vector: For each residue, aggregate the TF-IDF vectors from all its annotations by taking the maximum value for each keyword across all annotations. This is known as max pooling.
- Purpose of Max Pooling: It compresses the information from multiple annotations into a single vector per residue, capturing the most significant TF-IDF score for each keyword.
b. Further Corruption (Annotation Dropping)
- Annotation Dropping: Again, at the protein level, with a 15% probability for each annotation, completely remove the TF-IDF vector of that annotation from the max pooling process for all residues. This simulates missing annotations.
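A sketch of steps 3a and 3b, assuming each residue knows which annotations cover it (the data layout and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def max_pool_residue(annotation_tfidf: np.ndarray) -> np.ndarray:
    """Max-pool the TF-IDF vectors of the annotations covering one residue.
    annotation_tfidf: (n_covering_annotations, vocab_size) -> (vocab_size,)."""
    if annotation_tfidf.shape[0] == 0:                 # residue with no surviving annotations
        return np.zeros(annotation_tfidf.shape[1])
    return annotation_tfidf.max(axis=0)

# Toy protein: 3 annotations, 6-keyword vocabulary; residue 0 covered by annotations 0 and 2.
annotation_tfidf = rng.random((3, 6))
covering_residue0 = [0, 2]

# Annotation-level corruption: one keep/drop decision per annotation, shared by all residues.
keep = rng.random(annotation_tfidf.shape[0]) >= 0.15
surviving = [a for a in covering_residue0 if keep[a]]
residue0_vector = max_pool_residue(annotation_tfidf[surviving])
```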
Step 4: Quantization into Discrete Tokens
a. High-Dimensional to Discrete Representation
- Issue with Current Vectors: The per-residue vectors are high-dimensional (68,103 dimensions), sparse, and contain floating-point numbers, which aren't suitable for input into a language model that expects sequences of tokens.
b. Locality Sensitive Hashing (LSH)
- LSH Overview: LSH is a technique used to hash input items in such a way that similar items map to the same hash bucket with high probability. It helps in reducing dimensionality while preserving the locality (similarity) of the data.
- Application in Tokenization:
- Apply 8 separate LSH functions to each per-residue vector.
- Each LSH function uses 8 hyperplanes to divide the space. Each hyperplane can be thought of as a decision boundary in the vector space.
- For each LSH function, the per-residue vector is projected onto the 8 hyperplanes; which side of each hyperplane the vector falls on gives one binary decision, so each hash function yields an 8-bit code.
- The 8-bit code is an integer between 0 and 255 (2^8 possibilities).
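A compact numpy sketch of this quantization step. The random hyperplanes here are a stand-in; in the actual pipeline they are fixed once so that the same TF-IDF vector always maps to the same tokens.

```python
import numpy as np

VOCAB_SIZE = 68_103
N_HASHES, N_HYPERPLANES = 8, 8                 # 8 tokens per residue, 8 bits per token

rng = np.random.default_rng(0)
# Fixed hyperplanes: one set of 8 per hash function.
hyperplanes = rng.standard_normal((N_HASHES, N_HYPERPLANES, VOCAB_SIZE))

def lsh_tokens(residue_vector: np.ndarray) -> np.ndarray:
    """Quantize one residue's TF-IDF vector into 8 integer tokens in [0, 255]."""
    bits = (hyperplanes @ residue_vector) > 0  # (8, 8): one bit per hyperplane
    weights = 1 << np.arange(N_HYPERPLANES)    # [1, 2, 4, ..., 128]
    return bits @ weights                      # pack each 8-bit code into an integer

tokens = lsh_tokens(rng.random(VOCAB_SIZE))
print(tokens)                                  # 8 integers, each between 0 and 255
```

Because nearby vectors tend to fall on the same side of most hyperplanes, residues with similar keyword profiles receive mostly identical tokens, which is what makes these codes meaningful to the language model.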
c. Resulting Tokens
- Sequence of Tokens: Each residue is now represented by a sequence of 8 tokens, each ranging from 0 to 255.
- Special Tokens:
- A reserved special token: represents residues with an empty set of InterPro annotations.
- <pad>: used for proteins lacking any functional annotations entirely; its embedding is fixed to all zeros.
d. Benefits of Quantization
- Reduced Dimensionality: Instead of handling large sparse vectors, we now have compact tokens suitable for sequential models.
- Preserved Similarity: Similar functional annotations map to similar token sequences due to the properties of LSH.
Using Function Tokens in the Model
a. Input and Output of the Language Model
- Input Conditioning: The function tokens are used as additional input to the language model alongside the amino acid sequence.
- Output Head: The model has an output head specifically for generating function tokens, enabling it to predict functional annotations.
b. Masking Strategy During Training
- Corrupted Inputs: Use the corrupted version of function tokens as input.
- Masking Positions:
- 90% of the time: the entire input sequence of function tokens is replaced with the special <mask> token.
- 10% of the time: residues are selected with a per-residue probability drawn from a cosine masking schedule (a schedule that varies the masking probability smoothly per protein), and all 8 tokens of each selected residue are replaced with <mask>.
- Training Objective: The model is trained to predict the original (uncorrupted) function tokens for the masked positions using a categorical cross-entropy loss.
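A sketch of this masking scheme for a single protein, assuming `MASK` is a reserved token id, `function_tokens` is an (L, 8) integer array, and the cosine schedule is parameterized as below (all of these specifics are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
MASK = 256                                    # hypothetical id reserved for <mask>

def mask_function_tokens(function_tokens: np.ndarray):
    """Return (masked_input, target) for one protein. In the real setup the
    unmasked input positions come from the *corrupted* tokens while the target
    is the clean version; here one array stands in for both for brevity."""
    masked = function_tokens.copy()
    if rng.random() < 0.9:
        masked[:] = MASK                      # 90%: mask the entire function-token track
    else:
        u = rng.random()
        rate = 1.0 - np.cos(0.5 * np.pi * u)  # one common cosine-schedule parameterization
        rows = rng.random(masked.shape[0]) < rate
        masked[rows] = MASK                   # mask all 8 tokens of each selected residue
    return masked, function_tokens

clean = rng.integers(0, 256, size=(120, 8))   # toy protein of length 120
masked_input, target = mask_function_tokens(clean)
# The categorical cross-entropy loss is computed only where masked_input == MASK.
```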
c. Purpose of Masking and Corruption
- Robustness: Encourages the model to rely on context and learn meaningful relationships rather than memorizing the input.
- Generalization: Helps the model handle missing or noisy data and infer functions based on sequence information.
Advantages of Function Tokenization
- Flexibility and Programmability
- Generic Keyword Vocabulary: By using keywords extracted from annotations rather than fixed annotation IDs, the model can handle new combinations of functions not seen during training.
- Flexible Prompting: At test time, users can prompt the model with combinations of keywords to design proteins with novel functional characteristics.
- Data Compression
- Reduced Representation Size: The potential space of all InterPro annotations is huge (would require approximately 35,000 bits to represent directly). By reducing this to 8 tokens of 8 bits each (64 bits), the representation is much more compact.
- Reduced Computational Complexity: A smaller input/output space makes the model more efficient to train and evaluate.
- Memory Savings
- Efficient Training: By avoiding multi-class multi-label binary classification over a vast number of possible annotations, the model requires less memory and computational resources.
- Simplified Output Space: The model predicts tokens rather than probabilities over thousands of classes.
Conclusion
Function tokenization is a strategic method to represent complex functional annotations in a form suitable for language models. It allows ESM3 to incorporate rich functional information efficiently and flexibly, enhancing its ability to model protein functions and to be used in protein design with new functional characteristics.
Key Takeaways for a Beginning Grad Student:
- Understanding Functional Annotations: Familiarize yourself with biological databases like InterPro and Gene Ontology, which provide functional annotations for proteins.
- Tokenization Challenges: Recognize the difficulties in representing complex, high-dimensional, and hierarchical data (like functional annotations) for machine learning models.
- Dimensionality Reduction Techniques: Learn about TF-IDF for text representation and Locality Sensitive Hashing for reducing high-dimensional data while preserving similarity.
- Model Training Strategies: Appreciate the use of data corruption (e.g., masking, dropping) during training to improve model robustness and generalization.
- Advantages of Compact Representations: Understand how data compression and efficient representations can lead to better-performing and more manageable models.
If you have any questions or need further clarification on any of these steps or concepts, feel free to ask!