
\section*{A.1.3. Tokenization}

During tokenization, special beginning-of-sequence (BOS) and end-of-sequence (EOS) tokens are prepended and appended to mark the true start and end of a sequence. When a sequence is cropped due to length, the BOS and EOS tokens are cropped out to indicate that the input is a protein fragment. In all cases, one token per track is used for each amino acid.
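
The cropping rule can be made concrete with a short sketch. The snippet below is illustrative only (token ids and the function name are our own, not from the ESM3 codebase): a complete sequence receives BOS and EOS, while a crop of it does not, so the model can distinguish whole proteins from fragments.

```python
# Minimal sketch of BOS/EOS handling; token ids are illustrative.
BOS, EOS = 0, 1

def add_special_tokens(token_ids: list[int], is_fragment: bool) -> list[int]:
    """Prepend BOS and append EOS only when the sequence is complete.

    Cropped fragments keep bare token ids, signalling that the true
    start and end of the protein are not present in the input.
    """
    if is_fragment:
        return list(token_ids)
    return [BOS] + list(token_ids) + [EOS]

full = add_special_tokens([7, 8, 9], is_fragment=False)  # [0, 7, 8, 9, 1]
crop = add_special_tokens([8, 9], is_fragment=True)      # [8, 9]
```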

Sequence Protein sequences are tokenized as the 20 canonical amino acids, plus BOS, EOS, mask, pad, and unknown. We keep four non-standard amino acids as in Lin et al. (5): B (asparagine), U (selenocysteine), Z (glutamic acid), and O (ornithine). This totals 29 tokens.
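
As a sanity check on the count (20 canonical + 4 non-standard + 5 special = 29), here is a hypothetical vocabulary layout; the ordering and token strings are assumptions, not the actual ESM3 mapping.

```python
# Illustrative 29-token sequence vocabulary (ordering is an assumption).
CANONICAL = list("ACDEFGHIKLMNPQRSTVWY")   # 20 canonical amino acids
NONSTANDARD = list("BUZO")                 # non-standard residues, as in Lin et al.
SPECIAL = ["<bos>", "<eos>", "<mask>", "<pad>", "<unk>"]

VOCAB = SPECIAL + CANONICAL + NONSTANDARD
assert len(VOCAB) == 29

token_to_id = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize_sequence(seq: str) -> list[int]:
    """Map one amino-acid character per residue to a token id."""
    unk = token_to_id["<unk>"]
    return [token_to_id.get(aa, unk) for aa in seq]
```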

Structure Structure tokenization is described in Appendix A.1.7.1. ESM3 uses a codebook size of 4096 with 4 special tokens: EOS, BOS, mask, and pad.
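
A trivial accounting sketch of the structure-track vocabulary (names and ids are placeholders; the tokenizer itself is defined in Appendix A.1.7.1):

```python
# Structure-track vocabulary: 4096 VQ codebook entries plus 4 specials.
CODEBOOK_SIZE = 4096
STRUCTURE_SPECIALS = ["<eos>", "<bos>", "<mask>", "<pad>"]
STRUCTURE_VOCAB_SIZE = CODEBOOK_SIZE + len(STRUCTURE_SPECIALS)  # 4100
```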

Secondary Structure Secondary structure is tokenized as the canonical 8-class assignments (60), plus unknown and mask, for a total of 10 tokens. The mask token is forced to be the 0-vector during embedding.
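
One common way to force a token to embed as the 0-vector is to pin its row of the embedding table, e.g. via padding_idx in PyTorch. The sketch below assumes that mechanism; the mask index and model dimension are placeholders, and the paper does not specify the implementation.

```python
# Sketch: an embedding whose mask token always maps to the zero vector.
import torch
import torch.nn as nn

NUM_SS8_TOKENS = 10   # 8 DSSP classes + unknown + mask
MASK_IDX = 0          # assumption: mask placed at index 0

class SecondaryStructureEmbedding(nn.Module):
    def __init__(self, d_model: int = 64):
        super().__init__()
        # padding_idx initializes that row to zeros and blocks its gradient,
        # so the mask token stays a 0-vector throughout training.
        self.embed = nn.Embedding(NUM_SS8_TOKENS, d_model, padding_idx=MASK_IDX)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.embed(tokens)

emb = SecondaryStructureEmbedding()
out = emb(torch.tensor([MASK_IDX, 3, 7]))
assert torch.all(out[0] == 0)  # mask token contributes nothing
```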

SASA The continuous values representing SASA are tokenized by discretization into a fixed set of 16 bins. SASA bin boundaries were chosen by computing SASA on 100 random structures and selecting boundaries such that an equal number of residues falls into each bin. Unknown and mask tokens bring the total to 18 tokens. The mask token is forced to be the 0-vector during embedding.
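
Equal-frequency binning of this kind is just a quantile computation. Here is a sketch under stated assumptions (numpy, synthetic stand-in data; the paper derived its boundaries from SASA values computed on 100 random structures):

```python
# Sketch of equal-frequency SASA binning with numpy.
import numpy as np

def sasa_bin_boundaries(sasa_values: np.ndarray, n_bins: int = 16) -> np.ndarray:
    """Quantile boundaries so that each bin holds ~equal residue counts."""
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]   # 15 interior cut points
    return np.quantile(sasa_values, qs)

def tokenize_sasa(values: np.ndarray, boundaries: np.ndarray) -> np.ndarray:
    """Discretize continuous SASA values into bin indices 0..15."""
    return np.digitize(values, boundaries)

rng = np.random.default_rng(0)
train_sasa = rng.gamma(shape=2.0, scale=30.0, size=10_000)  # stand-in data
bounds = sasa_bin_boundaries(train_sasa)
tokens = tokenize_sasa(rng.gamma(2.0, 30.0, size=50), bounds)
assert tokens.min() >= 0 and tokens.max() <= 15
```

The unknown and mask tokens would occupy two further ids on top of the 16 bin indices, giving the stated total of 18.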

Function annotations We tokenize function annotations as bags of keywords, as described in Appendix A.1.8. Keywords are quantized using LSH into 8 tokens per residue, each of which can take one of 255 values. There are three special tokens: empty set, no-annotation, and mask. Again, the mask token is forced to be the 0-vector during embedding.
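
The LSH itself is specified in Appendix A.1.8; the sketch below only illustrates the per-residue output shape (8 tokens, each one of 255 values) using a generic random-hyperplane hash as a stand-in, which is not the paper's construction. The keyword feature dimension is a placeholder.

```python
# Illustrative stand-in for per-residue LSH quantization of keywords.
import numpy as np

N_HASHES = 8        # tokens per residue
N_BUCKETS = 255     # possible values per token
D_KEYWORD = 64      # assumed dimension of the keyword feature vector

rng = np.random.default_rng(0)
# 8 independent random-hyperplane hash functions, 8 bits each.
hyperplanes = rng.normal(size=(N_HASHES, 8, D_KEYWORD))

def lsh_tokens(keyword_vec: np.ndarray) -> np.ndarray:
    bits = (hyperplanes @ keyword_vec > 0).astype(np.uint8)   # (8, 8) sign bits
    vals = (bits * (1 << np.arange(8))).sum(axis=1)           # (8,) in 0..255
    return np.minimum(vals, N_BUCKETS - 1)                    # clip to 255 values

tok = lsh_tokens(rng.normal(size=D_KEYWORD))
assert tok.shape == (N_HASHES,) and tok.max() < N_BUCKETS
```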

Residue annotations InterPro annotations are tokenized as a multi-hot feature vector (1478 dimensions) over possible InterPro labels (38). Input annotations are limited to a maximum of 16. When annotations are not present, the 0-vector is used.
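
A minimal sketch of the multi-hot feature, assuming a fixed label-to-index mapping (the cap of 16 and the 1478-dimensional layout follow the text; everything else is a placeholder):

```python
# Sketch of the 1478-dim multi-hot residue-annotation feature.
import numpy as np

N_INTERPRO = 1478
MAX_ANNOTATIONS = 16

def multihot_residue_annotations(label_indices: list[int]) -> np.ndarray:
    """Multi-hot vector over InterPro labels; the 0-vector when absent."""
    vec = np.zeros(N_INTERPRO, dtype=np.float32)
    for idx in label_indices[:MAX_ANNOTATIONS]:   # cap inputs at 16 annotations
        vec[idx] = 1.0
    return vec

assert multihot_residue_annotations([]).sum() == 0.0   # no annotations -> 0-vector
assert multihot_residue_annotations([3, 500]).sum() == 2.0
```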
