
\section*{A.1.3. Tokenization}

During tokenization, special beginning-of-sequence (BOS) and end-of-sequence (EOS) tokens are prepended and appended to mark the real start and end of sequences. When sequences are cropped due to length, the BOS and EOS tokens are dropped to indicate that the input is a protein fragment. In all cases, one token per track is used for each amino acid.

Sequence Protein sequences are tokenized as the 20 canonical amino acids, plus BOS, EOS, mask, pad, and unknown. We keep four non-standard amino acids as in Lin et al. (5): B (asparagine), U (selenocysteine), Z (glutamic acid), and O (ornithine). This totals 29 tokens.
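
To make the bookkeeping concrete, here is a minimal sketch of such a vocabulary and tokenizer in Python. The token names, their ordering, and the tokenize helper are illustrative assumptions, not ESM3's actual implementation.

```python
# A minimal sketch of the 29-token sequence vocabulary described above.
# Token names and ordering are illustrative assumptions.
CANONICAL = list("ACDEFGHIKLMNPQRSTVWY")    # 20 canonical amino acids
NONSTANDARD = list("BUZO")                  # 4 non-standard amino acids
SPECIALS = ["<pad>", "<bos>", "<eos>", "<mask>", "<unk>"]

VOCAB = SPECIALS + CANONICAL + NONSTANDARD  # 5 + 20 + 4 = 29 tokens
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(seq: str, cropped: bool = False) -> list[int]:
    """Map a protein sequence to token ids, one per residue.

    BOS/EOS mark the true start and end; for a cropped fragment they
    are omitted, signaling that the sequence is not a full protein.
    """
    ids = [TOKEN_TO_ID.get(aa, TOKEN_TO_ID["<unk>"]) for aa in seq]
    if not cropped:
        ids = [TOKEN_TO_ID["<bos>"]] + ids + [TOKEN_TO_ID["<eos>"]]
    return ids

print(tokenize("MKTAYIAK"))            # full sequence: BOS ... EOS
print(tokenize("TAYI", cropped=True))  # fragment: no BOS/EOS
```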

Structure Structure tokenization is described in Appendix A.1.7.1. ESM3 uses a codebook size of 4096 with 4 special tokens: BOS, EOS, mask, and pad.
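
A codebook of 4096 entries means structure tokenization is a form of vector quantization: a continuous per-residue representation is replaced by the index of its nearest codebook vector. The sketch below illustrates only that lookup step; the embedding dimension, random codebook, and nearest-neighbor rule are generic assumptions, and the real tokenizer is defined in Appendix A.1.7.1.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(4096, 128))  # 4096 codes, 128-dim (assumed)
residues = rng.normal(size=(250, 128))   # one vector per residue (assumed)

# Squared Euclidean distance via ||x||^2 - 2 x.c + ||c||^2, shape (250, 4096).
d2 = ((residues ** 2).sum(1, keepdims=True)
      - 2.0 * residues @ codebook.T
      + (codebook ** 2).sum(1))
structure_tokens = d2.argmin(axis=1)     # one token id in [0, 4096) per residue
print(structure_tokens[:10])
```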

Secondary Structure Secondary structure is tokenized with the canonical 8-class labels (60), plus unknown and mask, for a total of 10 tokens. The mask token is forced to be the 0-vector during embedding.
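
Forcing a token to embed as the 0-vector is straightforward to implement; in PyTorch, for instance, the padding_idx argument of nn.Embedding pins one row of the embedding table to zeros. The index layout below is assumed for illustration.

```python
import torch
import torch.nn as nn

# 10 secondary-structure tokens: 8 classes + unknown + mask (layout assumed).
MASK_IDX = 9
emb = nn.Embedding(num_embeddings=10, embedding_dim=64, padding_idx=MASK_IDX)

# padding_idx initializes that row to zeros and excludes it from gradient
# updates -- one way to "force" the mask token to embed as the 0-vector.
tokens = torch.tensor([0, 3, MASK_IDX, 7])
out = emb(tokens)
print(out[2].abs().sum())  # tensor(0.) -- the mask row is all zeros
```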

SASA The continuous values representing SASA are tokenized by discretization into a fixed set of 16 bins. SASA bin boundaries were chosen by computing SASA on 100 random structures and ensuring that an equal number of residues falls into each bin. Unknown and mask tokens are added, for a total of 18 tokens. The mask token is forced to be the 0-vector during embedding.
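
Equal-occupancy bin boundaries are simply empirical quantiles. A sketch, with synthetic values standing in for the SASA computed on 100 random structures:

```python
import numpy as np

rng = np.random.default_rng(0)
sasa = rng.gamma(shape=2.0, scale=40.0, size=20_000)  # synthetic SASA values

# 15 interior quantiles split the data into 16 equal-occupancy bins.
boundaries = np.quantile(sasa, np.linspace(0, 1, 17)[1:-1])

# np.digitize maps each continuous value to a bin index in [0, 16).
bins = np.digitize(sasa, boundaries)
print(np.bincount(bins))  # roughly equal counts per bin
```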

Function annotations We tokenize function annotations as bags of keywords, as described in Appendix A.1.8. Keywords are quantized using LSH into 8 tokens per residue, each of which can take one of 255 values. There are three special tokens: empty set, no-annotation, and mask. Again, the mask token is forced to be the 0-vector during embedding.
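
As a rough illustration of the LSH idea (the actual scheme is specified in Appendix A.1.8), the sketch below hashes a keyword-bag feature vector with random hyperplanes into 8 small integer tokens, so similar bags tend to collide into similar token patterns. The feature dimension, the hash family, and the bits-per-token choice are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512      # assumed keyword-feature dimension
N_TOKENS = 8   # 8 LSH tokens per residue (from the text)
BITS = 8       # assumed bits per token (~256 values; the paper says 255)

# One set of random hyperplanes per token; sign patterns become bit strings.
planes = rng.normal(size=(N_TOKENS, BITS, DIM))

def lsh_tokens(keyword_vec: np.ndarray) -> np.ndarray:
    """Hash a keyword-bag feature vector into N_TOKENS small integers."""
    bits = (planes @ keyword_vec) > 0   # (N_TOKENS, BITS) sign bits
    weights = 1 << np.arange(BITS)      # 1, 2, 4, ..., 128
    return (bits * weights).sum(axis=1)

vec = rng.normal(size=DIM)              # a stand-in feature vector
print(lsh_tokens(vec))                  # 8 token values per residue
```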

Residue annotations InterPro annotations are tokenized as a multi-hot feature vector (1478 dimensions) over possible InterPro labels (38). Input annotations are limited to a maximum of 16. When no annotations are present, the 0-vector is used.
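
A multi-hot encoding is just an indicator vector over the label set; a minimal sketch, assuming labels arrive as integer ids in [0, 1478):

```python
import numpy as np

N_LABELS = 1478   # number of possible InterPro labels
MAX_ANNOTS = 16   # inputs are limited to at most 16 annotations

def multi_hot(label_ids: list[int]) -> np.ndarray:
    """Multi-hot feature vector over InterPro labels for one residue.

    With no annotations present, the all-zeros vector is used, matching
    the text above. Truncation order for >16 labels is an assumption.
    """
    vec = np.zeros(N_LABELS, dtype=np.float32)
    vec[np.asarray(label_ids[:MAX_ANNOTS], dtype=int)] = 1.0
    return vec

print(multi_hot([5, 42, 1300]).sum())  # 3.0
print(multi_hot([]).sum())             # 0.0 -- the 0-vector when absent
```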

Please explain this in detail to a beginning grad student in machine learning.

Response: Explanation of Section A.1.3: Tokenization

In this section, we discuss how protein data is converted into a format suitable for machine learning models through a process called tokenization. Tokenization involves breaking down complex data into discrete units called tokens that models can process. Here's a detailed explanation of how different types of protein-related data are tokenized.


1. General Tokenization Process

Every input track uses exactly one token per amino acid, so a protein of length L becomes L aligned tokens on each track. Two special tokens, BOS (beginning of sequence) and EOS (end of sequence), mark the true start and end of a protein; when a sequence must be cropped to fit the model's context, BOS and EOS are dropped so the model can tell it is seeing a fragment rather than a whole protein.


2. Protein Sequence Tokenization

The amino acid sequence uses a small, fixed vocabulary: the 20 canonical amino acids, four non-standard codes (B, U, Z, O) kept for compatibility with Lin et al. (5), and five special tokens (BOS, EOS, mask, pad, unknown), for 29 tokens in total. Tokenizing a sequence is then a simple per-residue dictionary lookup, as in the sketch above.


3. Structure Tokenization

Three-dimensional structure is compressed into discrete tokens drawn from a learned codebook of 4096 entries, plus 4 special tokens (BOS, EOS, mask, pad). Roughly, each residue's local structural environment is mapped to the index of its nearest codebook entry. Detailed methods for learning this tokenizer are provided in Appendix A.1.7.1 (not included here).


4. Secondary Structure Tokenization

Secondary structure (helices, strands, turns, and so on) is labeled with the canonical 8-class scheme (60), plus unknown and mask, for 10 tokens. Forcing the mask token to embed as the 0-vector means a masked position contributes no secondary-structure information to the model's input.


5. Solvent Accessible Surface Area (SASA) Tokenization

SASA is a continuous quantity, so it is discretized into 16 bins whose boundaries are empirical quantiles: SASA was computed on 100 random structures and the boundaries placed so that each bin captures an equal number of residues. With unknown and mask added, this track has 18 tokens, and again the mask token embeds as the 0-vector.


6. Function Annotations Tokenization

Function annotations are treated as a bag of keywords per residue (Appendix A.1.8). The bag is compressed with locality-sensitive hashing (LSH) into 8 tokens per residue, each taking one of 255 values, so that similar keyword bags tend to map to similar token patterns. Three special tokens (empty set, no-annotation, mask) cover the degenerate cases, and the mask token embeds as the 0-vector.


7. Residue Annotations Tokenization

InterPro residue annotations (38) bypass the usual token lookup: each residue is represented by a 1478-dimensional multi-hot vector with a 1 for every InterPro label present, capped at 16 annotations per residue. A residue with no annotations gets the all-zeros vector.


Key Concepts Explained

- Special tokens (BOS, EOS, mask, pad, unknown) carry bookkeeping information: sequence boundaries, hidden positions, padding, and out-of-vocabulary symbols.
- Discretization (binning) turns a continuous value like SASA into a small fixed vocabulary that a transformer can consume.
- A codebook (vector quantization) plays the same role for structure: a learned, finite set of prototype vectors whose indices serve as tokens.
- Locality-sensitive hashing (LSH) maps similar inputs to the same or nearby hash values, squeezing an open-ended keyword vocabulary into a few small integers.
- Forcing the mask token's embedding to the 0-vector guarantees that a masked track contributes no information at that position.


Importance in Machine Learning for Proteins

Expressing every modality (sequence, structure, secondary structure, SASA, function) as aligned per-residue tokens gives the model one uniform interface: a single transformer can attend across all tracks at once, and any subset of tracks can be masked, provided as conditioning, or predicted.


Note to the Reader: the full details of the structure tokenizer and of function-annotation processing live in Appendices A.1.7.1 and A.1.8 of the paper and are not reproduced here.


Summary

Tokenization is a crucial step in preparing protein data for machine learning models. By converting sequences, structures, and annotations into tokens, models can process and learn from complex biological data. Special tokens and embedding strategies ensure that the data is represented accurately while accommodating the needs of the learning algorithms.

Feel free to ask if you have any questions or need further clarification on any of the topics covered!