\section*{A.1.3. Tokenization}
During tokenization, special beginning-of-sequence (BOS) and end-of-sequence (EOS) tokens are prepended and appended to mark the real start and end of sequences. When sequences are cropped due to length, the BOS and EOS tokens are cropped out to indicate protein fragments. In all cases, one token per track is used for each amino acid.
\paragraph{Sequence} Protein sequences are tokenized as the 20 canonical amino acids, plus BOS, EOS, mask, pad, and unknown. We keep four non-standard amino acids as in Lin et al. (5): B (Asparagine), U (Selenocysteine), Z (Glutamic acid), and O (Ornithine). This totals 29 tokens.
\paragraph{Structure} Structure tokenization is described in Appendix A.1.7.1. ESM3 uses a codebook size of 4096 with 4 special tokens: EOS, BOS, mask, and pad.
\paragraph{Secondary Structure} Secondary structure is taken to be the canonical 8-class tokens (60), with unknown and mask, for a total of 10 tokens. The mask token is forced to be the 0-vector during embedding.
\paragraph{SASA} The continuous values representing SASA are tokenized by discretization into a fixed set of 16 bins. SASA bin boundaries were chosen by computing SASA on 100 random structures and ensuring that an equal number of residues belongs to each bin. Unknown and mask tokens are also used, for a total of 18 tokens. The mask token is forced to be the 0-vector during embedding.
\paragraph{Function annotations} We tokenize function annotations as bags of keywords, described in Appendix A.1.8. Keywords are quantized using LSH into 8 tokens per residue, each of which can be one of 255 tokens. There are three special tokens: empty set, no-annotation, and mask. Again, the mask token is forced to be the 0-vector during embedding.
\paragraph{Residue annotations} InterPro annotations are tokenized as a multi-hot feature vector (1478 dimensions) over possible InterPro labels (38). Input annotations are limited to a maximum of 16. When annotations are not present, we enforce that the 0-vector is added.
Please explain this in detail to someone who is a quant on Wall Street and likes science.
Response: Certainly! Let's dive into the details of the tokenization process described in the text, and I'll explain it in a way that aligns with your analytical background and interest in science.
In computational biology and bioinformatics, especially when applying machine learning models to protein data, it's essential to convert biological information into a numerical format that algorithms can process. This process is known as tokenization. It involves representing protein sequences and their associated properties using discrete tokens, much like how words are tokenized in natural language processing (NLP).
The text outlines how various aspects of proteins—such as their sequences, structures, and functional annotations—are tokenized for input into models like transformers, which require fixed-size token embeddings.
BOS (Beginning of Sequence) and EOS (End of Sequence):
These tokens mark the start and end of a protein sequence.
They help the model understand the boundaries of the sequence.
When sequences are cropped due to length constraints, BOS and EOS are removed to indicate that the sequence is a fragment, not a full protein.
Mask Token:
Used during training, especially in masked language modeling tasks.
Represents positions in the sequence where the model is asked to predict the missing token.
Pad Token:
Used to pad sequences to a uniform length for batch processing.
Ensures that sequences in a batch are the same length without adding meaningful information.
Unknown Token:
Represents tokens that are unrecognized or not among the predefined set.
Canonical Amino Acids:
Proteins are made up of 20 standard amino acids.
Each amino acid is represented by a unique token.
Non-Standard Amino Acids:
In addition to the 20 standard amino acids, four non-standard amino acids are included, following Lin et al.: B (Asparagine), U (Selenocysteine), Z (Glutamic acid), and O (Ornithine).
Total Tokens for Sequences:
29 tokens in total: 20 canonical amino acids, 4 non-standard amino acids, and 5 special tokens (BOS, EOS, mask, pad, unknown).
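To make the vocabulary concrete, here is a minimal sketch of a sequence tokenizer with the 29-token vocabulary described above. The exact token ordering, special-token names, and the `tokenize` helper are illustrative assumptions, not ESM3's actual implementation.

```python
# Minimal sketch of the 29-token sequence vocabulary (illustrative ordering).
CANONICAL = list("ACDEFGHIKLMNPQRSTVWY")                  # 20 canonical amino acids
NON_STANDARD = ["B", "U", "Z", "O"]                       # 4 non-standard amino acids
SPECIAL = ["<bos>", "<eos>", "<mask>", "<pad>", "<unk>"]  # 5 special tokens

VOCAB = SPECIAL + CANONICAL + NON_STANDARD                # 29 tokens total
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(sequence: str, cropped: bool = False) -> list[int]:
    """Map a protein sequence to token ids.

    BOS/EOS are added only for full sequences; cropped fragments omit them,
    mirroring the convention described in the text.
    """
    ids = [TOKEN_TO_ID.get(aa, TOKEN_TO_ID["<unk>"]) for aa in sequence]
    if not cropped:
        ids = [TOKEN_TO_ID["<bos>"]] + ids + [TOKEN_TO_ID["<eos>"]]
    return ids

print(len(VOCAB))                          # 29
print(tokenize("MKTAYIAK"))                # full sequence: BOS ... EOS
print(tokenize("MKTAYIAK", cropped=True))  # fragment: no BOS/EOS
```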
Structure Tokenization:
Protein structures are represented using tokens from a codebook.
Codebook Size: 4096 tokens.
Special Tokens: EOS, BOS, mask, pad (same as before).
Purpose:
Captures the 3D conformation of the protein.
Essential for understanding protein function and interactions.
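The structure track assigns each residue a discrete token by looking up the nearest entry in a learned codebook (details are in Appendix A.1.7.1). The sketch below shows only the generic nearest-neighbor quantization step with a 4096-entry codebook; the encoder that produces per-residue features, the codebook values, and all names and dimensions here are assumptions for illustration.

```python
import torch

CODEBOOK_SIZE = 4096   # number of discrete structure tokens
EMBED_DIM = 128        # illustrative feature dimension (not the real value)

# Stand-in for a learned codebook; in practice this comes from training.
codebook = torch.randn(CODEBOOK_SIZE, EMBED_DIM)

def quantize(residue_features: torch.Tensor) -> torch.Tensor:
    """Map per-residue feature vectors (L, D) to nearest-codebook indices (L,)."""
    dists = torch.cdist(residue_features, codebook)   # (L, 4096) pairwise distances
    return dists.argmin(dim=-1)                       # index of the closest code

features = torch.randn(120, EMBED_DIM)   # features for a 120-residue protein
structure_tokens = quantize(features)    # 120 integers in [0, 4095]
# The special tokens (BOS, EOS, mask, pad) occupy 4 additional ids beyond the codebook.
```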
Secondary Structure Elements:
Proteins have local structural motifs like alpha-helices and beta-sheets.
These are categorized into 8 classes, following the DSSP (Dictionary of Protein Secondary Structure) classification: alpha-helix (H), 3-10 helix (G), pi-helix (I), extended beta-strand (E), isolated beta-bridge (B), turn (T), bend (S), and loop/coil (C).
Tokenization:
Each class is assigned a unique token.
Total Tokens:
10 tokens: the 8 secondary-structure classes plus unknown and mask. The mask token is forced to be the 0-vector during embedding.
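Several tracks above note that the mask token is "forced to be the 0-vector during embedding." One common way to get this behavior in PyTorch is the `padding_idx` argument of `nn.Embedding`, which pins one row of the embedding table to zero and excludes it from updates. The sketch below applies it to the 10-token secondary-structure vocabulary; the index choice and dimensions are illustrative assumptions, not the actual ESM3 configuration.

```python
import torch
import torch.nn as nn

# 10 secondary-structure tokens: 8 DSSP classes + unknown + mask.
MASK_IDX = 0  # illustrative choice; the row at padding_idx is fixed to zeros
ss8_embedding = nn.Embedding(num_embeddings=10, embedding_dim=64, padding_idx=MASK_IDX)

tokens = torch.tensor([MASK_IDX, 3, 7])   # mask, then two real SS8 classes
vectors = ss8_embedding(tokens)

print(vectors[0].abs().sum())  # sums to 0 -- the mask embedding is the 0-vector
```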
Solvent-Accessible Surface Area (SASA):
Definition:
SASA measures how much of each amino acid residue is exposed to the solvent (usually water).
Important for understanding protein folding and interactions.
Tokenization:
The continuous SASA value for each residue is discretized into one of a fixed set of bins.
Discretization:
16 bins are used; bin boundaries were chosen by computing SASA on 100 random structures and picking equal-frequency boundaries, so each bin holds the same number of residues.
Total Tokens:
18 tokens: 16 bins plus unknown and mask. The mask token is forced to be the 0-vector during embedding.
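Equal-frequency binning of this kind can be sketched with NumPy: compute quantile boundaries from a reference pool of SASA values, then digitize new values into 16 bins. The reference data and function names below are placeholders; only the binning scheme (16 equal-occupancy bins fit on a sample of structures) follows the text.

```python
import numpy as np

N_BINS = 16

def fit_bin_edges(reference_sasa: np.ndarray) -> np.ndarray:
    """Choose 15 interior boundaries so each of the 16 bins holds an equal share
    of residues from the reference set (e.g. SASA values pooled from ~100 structures)."""
    quantiles = np.linspace(0, 1, N_BINS + 1)[1:-1]   # 15 interior quantiles
    return np.quantile(reference_sasa, quantiles)

def tokenize_sasa(sasa: np.ndarray, edges: np.ndarray) -> np.ndarray:
    """Map continuous SASA values to bin indices 0..15.
    Unknown and mask would occupy two extra ids, for 18 tokens total."""
    return np.digitize(sasa, edges)

reference = np.random.gamma(shape=2.0, scale=40.0, size=50_000)  # fake SASA values
edges = fit_bin_edges(reference)
print(tokenize_sasa(np.array([0.0, 35.0, 250.0]), edges))
```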
Function Annotations:
Definition:
Functional annotations provide information about the role of each residue in the protein, such as binding sites or active sites.
Tokenization:
Keywords as Bags of Words:
Each protein's functional description is treated as a bag of keywords (see Appendix A.1.8) rather than a single categorical label.
Quantization using LSH (Locality-Sensitive Hashing):
The keyword bag is hashed into 8 tokens per residue, each of which can take one of 255 values.
Special Tokens:
Empty set, no-annotation, and mask; the mask token is forced to be the 0-vector during embedding.
Purpose:
Efficiently captures functional information at the residue level.
Handles high-dimensional annotation data in a scalable way.
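Locality-sensitive hashing of a keyword bag can be illustrated with random hyperplanes: embed the bag as a vector, take sign patterns against several groups of random hyperplanes, and read each group off as one integer token. This is a generic random-hyperplane LSH sketch under assumed dimensions (8 tokens per residue, 8 sign bits each, i.e. up to 256 buckets, versus the 255-token vocabulary stated in the text); ESM3's actual keyword featurization and hashing are described in Appendix A.1.8 and may differ.

```python
import numpy as np

FEATURE_DIM = 512     # assumed dimension of the keyword-bag embedding
N_HASHES = 8          # 8 LSH tokens per residue (as in the text)
BITS_PER_HASH = 8     # 8 sign bits -> up to 256 buckets (text: vocabulary of 255)

rng = np.random.default_rng(0)
hyperplanes = rng.standard_normal((N_HASHES, BITS_PER_HASH, FEATURE_DIM))

def lsh_tokens(keyword_vector: np.ndarray) -> np.ndarray:
    """Hash one keyword-bag embedding into 8 integer tokens via random hyperplanes.

    Each hash takes the sign pattern of the vector against 8 random hyperplanes
    and packs the bits into one integer bucket id.
    """
    signs = (hyperplanes @ keyword_vector) > 0   # (8, 8) boolean sign bits
    weights = 2 ** np.arange(BITS_PER_HASH)      # bit -> integer packing
    return (signs * weights).sum(axis=1)         # 8 token ids per residue

bag = rng.standard_normal(FEATURE_DIM)   # stand-in for a keyword-bag embedding
print(lsh_tokens(bag))                   # e.g. [ 23 201  88 ...] -- 8 token ids
```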
InterPro Database:
A resource that provides functional analysis of proteins by classifying them into families and predicting domains and important sites.
Tokenization:
Multi-Hot Feature Vector:
Each residue's annotations are encoded as a 1478-dimensional multi-hot vector over possible InterPro labels.
Input Annotations Limit:
At most 16 annotations are used as input.
Zero Vector Enforcement:
When no annotations are present, the 0-vector is used for that residue.
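Unlike the other tracks, residue annotations are not quantized into a token id; each residue gets a fixed-length multi-hot indicator vector. Below is a minimal sketch assuming a simple label-to-index mapping and the 1478-dimensional, at-most-16-labels convention from the text; the helper name and input format are illustrative.

```python
import numpy as np

N_INTERPRO_LABELS = 1478   # dimensionality of the multi-hot vector
MAX_ANNOTATIONS = 16       # at most 16 input annotations are kept

def residue_annotation_vector(label_indices: list[int]) -> np.ndarray:
    """Build one residue's multi-hot InterPro feature vector.

    `label_indices` are integer ids of the residue's InterPro labels.
    An empty list yields the 0-vector, matching the convention in the text.
    """
    vec = np.zeros(N_INTERPRO_LABELS, dtype=np.float32)
    for idx in label_indices[:MAX_ANNOTATIONS]:   # cap at 16 annotations
        vec[idx] = 1.0
    return vec

print(residue_annotation_vector([12, 407, 1300]).sum())  # 3.0 -> three labels set
print(residue_annotation_vector([]).sum())               # 0.0 -> no annotations
```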
Integrating Multiple Protein Features:
By tokenizing different aspects of proteins—sequence, structure, functional annotations—the model can learn rich representations that capture various biological properties.
It's akin to how in NLP, models learn from words, syntax, semantics, and context.
Machine Learning Models:
These tokens are used as inputs to models like ESM3, a transformer-based architecture tailored for proteins.
The codebook for structure with 4096 tokens is particularly useful for capturing complex 3D conformations in a discrete manner.
Discretization Techniques:
Equal-Frequency Binning for SASA ensures that the model receives a balanced representation across the range of SASA values.
LSH Quantization for function annotations allows the handling of high-dimensional data efficiently.
Handling Ambiguity and Rare Events:
Including non-standard amino acids and ambiguous tokens (B and Z) ensures that the model can handle real-world protein data, which often includes such anomalies.
The unknown token acts as a catch-all for any out-of-vocabulary items.
Analogies to Financial Modeling:
Just as tokenization converts complex biological data into numerical formats for modeling, in finance, similar preprocessing steps are crucial for handling diverse datasets (e.g., time-series data, categorical data).
Techniques like discretization and embedding are also used in quantitative finance for features like volatility regimes or market states.
Data Complexity and Modeling Techniques:
The intricate methods of representing protein features mirror the complexity of financial instruments and markets.
Understanding these methods can provide insights into advanced modeling techniques applicable across domains.
Innovation Through Cross-Disciplinary Knowledge:
Appreciating how computational biology tackles data representation challenges can inspire novel approaches in quantitative finance, such as using transformer architectures for sequential market data.
The tokenization process detailed above is a sophisticated method for preparing protein data for computational models. By converting various protein characteristics into tokenized forms, the models can effectively process and learn from the data, enabling tasks like protein structure prediction, function annotation, and understanding protein interactions.
In essence, this approach bridges biology and data science, utilizing advanced computational techniques to unravel the complexities of biological systems. For someone with a quantitative background, exploring these methods offers a glimpse into how similar modeling challenges are addressed in different fields, potentially sparking ideas for innovative solutions in your own domain.
I hope this detailed explanation helps you understand the tokenization process in protein modeling and how it's applied in computational biology!