
\section*{A.1.8. Function Tokenization}

ESM3 processes annotations of functional characteristics of proteins through two tracks: function tokens and residue annotations. Both support input conditioning and output heads for generation. Appendix A.1.5.1 outlines how the tokens are processed into the network; in this section we describe how the tokens themselves are created.

\section*{A.1.8.1. FUNCTION TOKENS}

Function tokens are a dense semantic representation of functional characteristics of proteins derived from free-text descriptions of the InterPro and Gene Ontology (GO) terms at each residue. At training time, function tokens are produced from each protein's InterPro annotations by a multi-step process illustrated in Fig. S7. At a high level:

  1. For each residue, we gather free text for each InterPro annotation via annotation term names, associated GO terms per annotation (via the InterPro2GO mapping), and all ancestor GO terms. We parse the free text into counts from a vocabulary of 68,103 keywords. The vocabulary is composed of unigrams and bigrams extracted from the free text of all valid InterPro annotations (and their associated GO and ancestor GO terms) in our training datasets.

  2. The keywords are converted to a sparse TF-IDF vector per InterPro annotation. During training, we also produce a corrupted version by dropping keywords at the protein level (i.e. the same keywords have their counts set to 0 across all residues) at a $15 \%$ probability per keyword.

  3. To create a vector per residue from the per annotation vectors, we max pool the TF-IDF vectors for the annotations per residue. During training, we further corrupt the "corrupted" version by dropping annotations at the protein level (i.e. the same annotations are removed from the max pool across all residues) at a $15 \%$ probability per annotation.

  4. We then quantize each residue's vector (a highly sparse vector with float entries) into a discrete representation suitable for input to the language model as tokens by applying a fixed series of 8 locality sensitive hashes (LSH), each with 8 hyperplanes.

The result is a sequence of 8 tokens per residue, each ranging in value from 0 to 255. We reserve a special token to represent positions with an empty set of InterPro annotations. For proteins that lack any functional annotations, the tokens are filled with the <pad> token, which has an embedding value fixed to all zeros. At test time, we can produce per-residue vectors using the process described above, or by directly creating a TF-IDF vector from keywords.
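Below is a minimal NumPy sketch of the quantization in step 4, assuming a dense per-residue TF-IDF vector. The dimensions (8 hashes of 8 hyperplanes over the 68,103-keyword vocabulary) follow the description above, but the function name, the random hyperplanes, and the example weights are illustrative assumptions, not the released implementation.

```python
# Sketch only: quantize a per-residue TF-IDF vector into 8 function tokens
# with locality sensitive hashing (8 hashes x 8 hyperplanes -> values 0..255).
import numpy as np

VOCAB_SIZE = 68_103   # keyword vocabulary size from the text
N_HASHES = 8          # one token per hash
N_HYPERPLANES = 8     # 8 sign bits per hash -> token values 0..255

rng = np.random.default_rng(0)
# One fixed set of random hyperplanes per hash, shared across all residues.
hyperplanes = rng.standard_normal((N_HASHES, N_HYPERPLANES, VOCAB_SIZE))

def lsh_tokens(tfidf_vec: np.ndarray) -> np.ndarray:
    """Map a (VOCAB_SIZE,) TF-IDF vector to 8 integer tokens in [0, 255]."""
    # Sign bit for each hyperplane: 1 if the vector lies on its positive side.
    bits = (hyperplanes @ tfidf_vec > 0).astype(int)   # shape (8, 8)
    # Pack each hash's 8 sign bits into a single byte-valued token.
    weights = 1 << np.arange(N_HYPERPLANES)            # [1, 2, 4, ..., 128]
    return bits @ weights                              # shape (8,), values 0..255

# Example: a residue with a few active keywords (hypothetical weights).
vec = np.zeros(VOCAB_SIZE)
vec[[10, 500, 40_000]] = [0.7, 0.2, 0.9]
print(lsh_tokens(vec))   # prints 8 integers, each between 0 and 255
```

Because the hyperplanes are fixed, residues with similar keyword profiles tend to receive similar token patterns, which is what makes the discrete tokens usable as a semantic representation.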

During pre-training we use the corrupted versions of the function tokens as input and predict the uncorrupted function tokens at positions that have been masked. $90\%$ of the time, the entire input is replaced with <mask>. The other $10\%$ of the time, we replace all 8 tokens of selected residues with <mask>, with the per-residue selection probability sampled from a cosine masking schedule per protein. The model has an output head that predicts each of the 8 function tokens at positions with <mask> as input, and is trained with a categorical cross-entropy loss.
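The following sketch illustrates this masking scheme (it is not the training code). The 90%/10% split and the per-protein cosine schedule follow the text; the exact schedule parameterization, the MASK placeholder, and the (L, 8) token layout are assumptions.

```python
# Hedged sketch of function-token masking for one protein.
import numpy as np

MASK = -1  # stand-in for the <mask> token id

def mask_function_tokens(tokens: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """tokens: (L, 8) array of (corrupted) function tokens for one protein."""
    masked = tokens.copy()
    if rng.random() < 0.9:
        # 90% of the time: the entire function-token input is replaced with <mask>.
        masked[:] = MASK
    else:
        # Otherwise: draw a per-protein mask rate from a cosine schedule, then
        # mask all 8 tokens of each selected residue.
        u = rng.random()
        mask_rate = 1.0 - np.cos(u * np.pi / 2)   # one possible cosine schedule
        selected = rng.random(tokens.shape[0]) < mask_rate
        masked[selected, :] = MASK
    return masked

rng = np.random.default_rng(0)
toks = rng.integers(0, 256, size=(120, 8))        # a 120-residue protein
print((mask_function_tokens(toks, rng) == MASK).mean())  # fraction of masked tokens
```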

Function tokenization offers several key advantages compared to simpler approaches, such as using a dedicated InterPro tag vocabulary. Encoding functional annotations with a generic keyword vocabulary enables flexible prompting of the model at test time with combinations of keywords that were not encountered during training. This enhances the programmability of ESM3 for designing novel proteins with previously unobserved functional characteristics.

Function tokenization can also be viewed through the lens of data compression. This choice of representation reduces the input/output space from all possible InterPro combinations, which would naively be represented by $35 \mathrm{k}$ bits, to 8 tokens $\times$ 8 bits/token $= 64$ bits. This also affords significant memory savings during pre-training by eliminating the need to perform multi-class, multi-label binary classification.
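As a quick check of this arithmetic (interpreting the $35 \mathrm{k}$ figure as roughly one bit per possible InterPro entry in a multi-hot encoding, which is our reading of the sentence above rather than an explicit statement in it):

$$
8~\text{tokens} \times \log_2(256)~\tfrac{\text{bits}}{\text{token}} = 64~\text{bits}
\quad \ll \quad
\sim\!35{,}000~\text{bits for a one-bit-per-entry (multi-hot) encoding.}
$$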

Please explain this in detail to a beginning grad student in machine learning.

Response: Certainly! Let's break down the process described and explain it step by step.

Background

When building machine learning models for proteins, such as ESM3 (Evolutionary Scale Modeling version 3), incorporating functional annotations can significantly enhance the model's understanding and predictions about protein functions. Functional annotations provide insights into what a protein does, how it interacts with other molecules, and its role in biological processes.

However, incorporating these annotations into a machine learning model isn't straightforward due to their complexity and richness. The annotations come from databases like InterPro and Gene Ontology (GO), which contain vast amounts of free-text descriptions, hierarchical relationships, and a large number of unique terms.

To effectively use this information in a model, we need to convert these complex annotations into a form that the model can process—specifically, into tokens suitable for input into a language model. This process is known as function tokenization.

Objective

The goal is to create a dense, semantic representation of functional characteristics (function tokens) from free-text annotations so that they can be input into the language model. The process also aims to handle the vast number of possible annotations efficiently and to allow for flexibility in handling new or unseen functional terms.

Function Tokenization Process

The function tokenization process is a multi-step method that converts free-text functional annotations into tokens that the language model can understand.

Step 1: Gathering and Parsing Functional Annotations

a. Collect Free-Text Annotations

For each amino acid residue in a protein, we gather all of the free text associated with its InterPro annotations: the name of each InterPro entry, the GO terms mapped to that entry via the InterPro2GO mapping, and every ancestor of those GO terms in the GO hierarchy.

b. Build a Vocabulary

The keyword vocabulary is fixed ahead of time: 68,103 unigrams and bigrams extracted from the free text of all valid InterPro annotations (and their associated and ancestor GO terms) in the training data.

c. Parse Text into Counts

The collected free text for each annotation is parsed into counts over this vocabulary, i.e. how many times each keyword appears.

At this point, each InterPro annotation on a residue is represented by a sparse vector (most entries are zero) where each dimension corresponds to a keyword from the vocabulary, and the value is the count of that keyword.

Step 2: Converting to TF-IDF Vectors and Introducing Corruption

a. TF-IDF Transformation

The raw keyword counts for each InterPro annotation are converted into a TF-IDF (term frequency-inverse document frequency) vector, which up-weights keywords that are distinctive for an annotation and down-weights keywords that appear everywhere.

b. Corrupting the TF-IDF Vectors (Keyword Dropping)

During training only, a corrupted copy is produced by dropping keywords at the protein level: each keyword is zeroed out across all residues of the protein with probability 15%. A sketch of both operations follows.
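This small sketch uses scikit-learn's TfidfVectorizer in place of whatever tooling was actually used; the toy vocabulary and annotation texts are invented for the example, and only the 15% per-keyword drop rate comes from the text.

```python
# Illustrative sketch: keyword counts -> TF-IDF per annotation -> keyword dropping.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# The real vocabulary has 68,103 unigrams/bigrams; this toy one is for illustration.
vocab = ["kinase", "atp binding", "transferase", "zinc", "dna binding"]
annotation_texts = [
    "protein kinase domain atp binding",   # free text gathered for annotation 1
    "zinc finger dna binding",             # free text gathered for annotation 2
]

vectorizer = TfidfVectorizer(vocabulary=vocab, ngram_range=(1, 2), lowercase=True)
tfidf = vectorizer.fit_transform(annotation_texts).toarray()  # (n_annotations, |vocab|)

# Keyword dropping: each keyword is zeroed across *all* of the protein's
# annotations with probability 0.15.
rng = np.random.default_rng(0)
keep = rng.random(len(vocab)) >= 0.15
corrupted = tfidf * keep   # broadcasts over the keyword axis
print(tfidf.round(2))
print(corrupted.round(2))
```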

Step 3: Creating Per-Residue Vectors and Further Corruption

a. Max Pooling Across Annotations

A residue can carry several InterPro annotations, each with its own TF-IDF vector. To obtain a single vector per residue, we take the element-wise maximum over the vectors of the annotations covering that residue.

b. Further Corruption (Annotation Dropping)

On the corrupted copy, whole annotations are additionally dropped at the protein level: each annotation is removed from the max pool across all residues with probability 15%. A minimal sketch is shown below.
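In this sketch the residue-to-annotation mapping, array shapes, and random values are invented for illustration; only the max pooling and the 15% protein-level annotation dropping mirror the description.

```python
# Sketch: per-annotation TF-IDF vectors -> per-residue vectors by max pooling,
# with protein-level annotation dropping during training.
import numpy as np

rng = np.random.default_rng(0)
n_annotations, vocab_size = 3, 10
annotation_vecs = rng.random((n_annotations, vocab_size))   # per-annotation TF-IDF

# Which annotations cover which residues (an annotation spans a residue range).
residue_to_annotations = {0: [0], 1: [0, 1], 2: [1, 2], 3: [2]}

# Annotation dropping: each annotation is removed for *all* residues with prob. 0.15.
kept = np.flatnonzero(rng.random(n_annotations) >= 0.15)

per_residue = np.zeros((len(residue_to_annotations), vocab_size))
for res, anns in residue_to_annotations.items():
    anns = [a for a in anns if a in kept]
    if anns:  # residues with no surviving annotations keep an all-zero vector
        per_residue[res] = annotation_vecs[anns].max(axis=0)  # element-wise max pool

print(per_residue.round(2))
```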

Step 4: Quantization into Discrete Tokens

a. High-Dimensional to Discrete Representation

The per-residue TF-IDF vector is high-dimensional (68,103 entries) and real-valued, which is not directly usable as language-model input, so it is quantized into a small set of discrete tokens.

b. Locality Sensitive Hashing (LSH)

Quantization applies a fixed series of 8 locality sensitive hashes, each defined by 8 hyperplanes. Each hash records which side of each of its hyperplanes the vector falls on (8 sign bits), so similar vectors tend to receive the same hash values.

c. Resulting Tokens

Packing 8 bits per hash yields 8 tokens per residue, each taking a value from 0 to 255. A special token marks residues with no InterPro annotations, and entirely unannotated proteins use the <pad> token, whose embedding is fixed to all zeros.

d. Benefits of Quantization

The sparse, continuous vector is compressed into a short, fixed-length sequence of discrete tokens that a language model can both consume and predict, while LSH preserves similarity: functionally similar residues map to similar token patterns.

Using Function Tokens in the Model

a. Input and Output of the Language Model

The 8 function tokens per residue are embedded and fed into the network alongside the other tracks (see Appendix A.1.5.1), and a dedicated output head predicts each of the 8 tokens at masked positions with a categorical cross-entropy loss.

b. Masking Strategy During Training

The corrupted tokens are used as input, and the model predicts the uncorrupted tokens at masked positions. 90% of the time the entire function-token input is replaced with <mask>; the other 10% of the time, all 8 tokens of selected residues are replaced with <mask>, with the per-residue selection probability drawn from a cosine schedule per protein.

c. Purpose of Masking and Corruption

Corruption and masking force the model to infer missing functional information from sequence, structure, and the remaining annotations rather than simply copying its input, which is what enables function-conditioned generation at test time.

Advantages of Function Tokenization

  1. Flexibility and Programmability: a generic keyword vocabulary lets the model be prompted at test time with combinations of keywords never seen together during training, supporting the design of proteins with novel functional characteristics.
  2. Data Compression: the input/output space shrinks from roughly 35,000 bits (one bit per possible InterPro tag) to 8 tokens $\times$ 8 bits/token $= 64$ bits per residue.
  3. Memory Savings: predicting 8 categorical tokens avoids multi-class, multi-label binary classification over the full InterPro vocabulary during pre-training.

Conclusion

Function tokenization is a strategic method to represent complex functional annotations in a form suitable for language models. It allows ESM3 to incorporate rich functional information efficiently and flexibly, enhancing its ability to model protein functions and to be used in protein design with new functional characteristics.

Key Takeaways for a Beginning Grad Student:

  1. Rich free-text annotations (InterPro and GO) are reduced to a fixed keyword vocabulary, turned into TF-IDF vectors, pooled per residue, and quantized with locality sensitive hashing into 8 discrete tokens per residue.
  2. Corruption (keyword and annotation dropping) together with masking during pre-training teaches the model to reconstruct functional information it cannot simply copy from the input.
  3. Compared to predicting raw InterPro tags, this representation is compact (64 bits per residue), memory-efficient, and allows prompting with keyword combinations never seen during training.

If you have any questions or need further clarification on any of these steps or concepts, feel free to ask!