\section*{A.1.3. Tokenization}
During tokenization, special beginning-of-sequence (BOS) and end-of-sequence (EOS) tokens are prepended and appended to mark the real start and end of sequences. When a sequence is cropped due to length, the BOS and EOS tokens are removed to indicate that it is a protein fragment. In all cases, one token per track is used for each amino acid.
Sequence Protein sequences are tokenized as the 20 canonical amino acids, plus BOS, EOS, mask, pad, and unknown. We keep four non-standard amino acids as in Lin et al. (5): B (asparagine), U (selenocysteine), Z (glutamic acid), and O (ornithine). This gives a total of 29 tokens.
Structure Structure tokenization is described in Appendix A.1.7.1. ESM3 uses a codebook size of 4096 with 4 special tokens: BOS, EOS, mask, and pad.
Secondary Structure Secondary structure is taken to be the canonical 8-class tokens (60), with unknown and mask, for a total of 10 tokens. The mask token is forced to be the 0-vector during embedding.
SASA The continuous values representing SASA are tokenized by discretization into a fixed set of 16 bins. SASA bin boundaries were chosen by computing SASA on 100 random structures and ensuring that an equal number of residues falls into each bin. Unknown and mask are used for a total of 18 tokens. The mask token is forced to be the 0-vector during embedding.
Function annotations We tokenize function annotations as bags of keywords, as described in Appendix A.1.8. Keywords are quantized using LSH into 8 tokens per residue, each of which can take one of 255 values. There are three special tokens: empty set, no-annotation, and mask. Again, the mask token is forced to be the 0-vector during embedding.
Residue annotations InterPro annotations are tokenized as a multi-hot feature vector (1478 dimensions) over possible InterPro labels (38). Input annotations are limited to a maximum of 16. When annotations are not present, we enforce that the 0-vector is used.
Please explain this in detail to a beginning grad student in machine learning.
Response: Explanation of Section A.1.3: Tokenization
In this section, we discuss how protein data is converted into a format suitable for machine learning models through a process called tokenization. Tokenization involves breaking down complex data into discrete units called tokens that models can process. Here's a detailed explanation of how different types of protein-related data are tokenized.
Special Tokens:
BOS (Beginning-of-Sequence): Marks the start of a sequence.
EOS (End-of-Sequence): Marks the end of a sequence.
Mask: Used during training for tasks like masked language modeling.
Pad: Used to pad sequences to a uniform length.
Unknown: Represents unknown or rare elements.
Sequence Cropping:
When sequences exceed a maximum length, they are cropped.
BOS and EOS tokens are removed in cropped sequences to indicate that these are fragments, not complete sequences.
Amino Acid Representation:
20 Canonical Amino Acids: The standard amino acids found in proteins.
Non-Standard Amino Acids: Four additional tokens retained from Lin et al.: B (asparagine), U (selenocysteine), Z (glutamic acid), and O (ornithine).
Total Tokens for Protein Sequences:
24 Amino Acid Tokens: 20 canonical + 4 non-standard.
Special Tokens: BOS, EOS, mask, pad, unknown.
Total: 29 tokens for representing protein sequences.
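To make the counting concrete, here is a minimal Python sketch of how such a 29-token vocabulary could be assembled. The token strings, their ordering, and the prefix-cropping behavior are illustrative assumptions, not the actual ESM3 tokenizer.

```python
# Minimal sketch of a 29-token sequence vocabulary (token names and ordering
# are hypothetical; the real ESM3 vocabulary may differ).
CANONICAL = list("ACDEFGHIKLMNPQRSTVWY")                  # 20 canonical amino acids
NONSTANDARD = ["B", "U", "Z", "O"]                        # 4 non-standard amino acids
SPECIAL = ["<bos>", "<eos>", "<mask>", "<pad>", "<unk>"]  # 5 special tokens

VOCAB = SPECIAL + CANONICAL + NONSTANDARD                 # 29 tokens total
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(seq, max_len=None):
    """Map a protein sequence to token ids, adding BOS/EOS.

    If the sequence must be cropped to max_len, BOS and EOS are dropped to
    signal a fragment. (Cropping a prefix here is a simplification.)
    """
    ids = [TOKEN_TO_ID.get(aa, TOKEN_TO_ID["<unk>"]) for aa in seq]
    if max_len is not None and len(ids) > max_len:
        return ids[:max_len]                              # fragment: no BOS/EOS
    return [TOKEN_TO_ID["<bos>"]] + ids + [TOKEN_TO_ID["<eos>"]]

assert len(VOCAB) == 29
print(tokenize("MKTFFV"))              # full protein: wrapped in BOS/EOS
print(tokenize("MKTFFV", max_len=4))   # cropped fragment: no BOS/EOS
```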
Structure Tokens:
Note: The structure tokenizer itself is described in Appendix A.1.7.1 (not included here). ESM3 uses a codebook of 4096 structure tokens, plus 4 special tokens: BOS, EOS, mask, and pad.
Secondary Structure Elements:
8 Classes: According to the DSSP (Define Secondary Structure of Proteins) classification: H (alpha-helix), G (3-10 helix), I (pi-helix), E (beta-strand), B (isolated beta-bridge), T (turn), S (bend), and loop/coil.
Additional Tokens:
Unknown: For residues where the secondary structure is not known.
Mask: Used during model training; its embedding is forced to be the 0-vector.
Total Tokens: 8 classes + unknown + mask = 10 tokens.
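A tiny sketch of this 10-token vocabulary follows. The DSSP letter codes and the choice to put mask at index 0 (so it can be zeroed at embedding time, as in the embedding sketch further below) are illustrative assumptions.

```python
# Hypothetical index assignment for the 10 secondary-structure tokens.
SS8_CLASSES = ["H", "G", "I", "E", "B", "T", "S", "C"]   # DSSP 8-state codes
SS_VOCAB = ["<mask>"] + SS8_CLASSES + ["<unk>"]          # mask placed at index 0
SS_TO_ID = {tok: i for i, tok in enumerate(SS_VOCAB)}
assert len(SS_VOCAB) == 10
```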
SASA (Solvent-Accessible Surface Area) Tokens:
Definition: SASA measures the surface area of a protein that is accessible to solvent, indicating how exposed each amino acid residue is.
Discretization:
Continuous to Discrete: SASA values are continuous and are converted into discrete tokens by binning.
16 Bins: SASA values are divided into 16 bins whose boundaries were chosen by computing SASA on 100 random structures and requiring an equal number of residues in each bin (equal-frequency binning).
Additional Tokens:
Unknown: For residues where SASA cannot be determined.
Mask: Represented by the zero vector during embedding.
Total Tokens: 16 bins + unknown + mask = 18 tokens.
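Equal-frequency binning like this can be sketched with quantiles. The sample values, their range, and the quantile-based fitting below are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def fit_sasa_boundaries(sasa_values, n_bins=16):
    """Pick 15 interior quantile boundaries so each of the 16 bins gets ~equal counts."""
    interior = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(sasa_values, interior)

def tokenize_sasa(sasa, boundaries):
    """Map continuous SASA values to bin indices 0..15."""
    return np.digitize(sasa, boundaries)

# Placeholder per-residue SASA values standing in for the sampled structures.
sample = np.random.default_rng(0).uniform(0.0, 250.0, size=10_000)
bounds = fit_sasa_boundaries(sample)                  # 15 boundaries
print(tokenize_sasa(np.array([0.0, 42.0, 249.0]), bounds))
```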
Function Annotation Tokens:
Purpose: To represent the functional information associated with each amino acid residue.
Representation:
Bags of Keywords: Each residue is associated with a set of function keywords.
Quantization:
Locality Sensitive Hashing (LSH): The keyword bag for each residue is hashed with LSH, so residues with similar keyword sets tend to receive similar token values.
8 Tokens per Residue: Each residue's keyword bag is quantized into 8 LSH tokens.
Token Space:
255 Possible Tokens: Each of the 8 tokens can be one of 255 possible values.
Special Tokens:
Empty Set: Indicates that there are no function annotations.
No-Annotation: Distinguishes between genuinely lacking annotations and missing data.
Mask: Represented by the zero vector in embeddings.
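The sketch below shows a generic random-hyperplane LSH that turns one residue's keyword feature vector into 8 discrete tokens. The feature dimension, hash construction, and use of a full 8-bit bucket space are assumptions; the actual ESM3 procedure is the one described in Appendix A.1.8.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 512                    # assumed dimension of the keyword feature vector
N_HASHES, BITS = 8, 8      # 8 tokens per residue; 8 bits -> up to 256 buckets
                           # (the paper's 255-value space is approximated here)
HYPERPLANES = rng.standard_normal((N_HASHES, BITS, D))

def lsh_tokens(keyword_vec):
    """Hash one residue's keyword feature vector into 8 integer tokens."""
    tokens = []
    for planes in HYPERPLANES:                          # one hyperplane group per token
        bits = (planes @ keyword_vec > 0).astype(int)   # 8-bit sign pattern
        tokens.append(int(bits @ (1 << np.arange(BITS))))
    return tokens

print(lsh_tokens(rng.standard_normal(D)))               # 8 tokens, each in [0, 255]
```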
Residue Annotation Tokens (InterPro):
InterPro: A database that integrates various protein annotation resources, providing functional analysis of protein sequences.
Representation:
Multi-Hot Encoding: Each residue's annotations are encoded as a 1478-dimensional multi-hot feature vector over the possible InterPro labels, with a 1 at each label that applies.
Limitations:
Maximum of 16 Annotations: For computational efficiency, input annotations per residue are limited to 16.
Handling Missing Annotations:
Residues without annotations are represented by the zero vector.
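A short sketch of such a multi-hot encoding, assuming a hypothetical label-to-index mapping and the 16-annotation cap described above.

```python
import numpy as np

N_INTERPRO = 1478        # number of possible InterPro labels
MAX_ANNOTATIONS = 16     # cap on input annotations

def encode_residue_annotations(label_ids):
    """Multi-hot vector over InterPro labels; an empty list gives the 0-vector."""
    vec = np.zeros(N_INTERPRO, dtype=np.float32)
    for idx in label_ids[:MAX_ANNOTATIONS]:   # label_ids are hypothetical indices
        vec[idx] = 1.0
    return vec

print(encode_residue_annotations([]).sum())              # 0.0 -> no annotations
print(encode_residue_annotations([3, 100, 100]).sum())   # 2.0 (duplicates collapse)
```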
Key Concepts:
Tokenization: The process of converting raw data into discrete units (tokens) that a model can process.
Special Tokens: Tokens that have a specific role in sequence processing, such as indicating the start or end of a sequence, or masking parts of the data during training.
One Token per Track:
Track: A type of data or feature associated with amino acid residues.
Example: Amino acid identity, secondary structure class, SASA bin.
Each residue has one token for each track, capturing different aspects of its properties.
Embedding Zero Vector:
When a token is represented by the zero vector, it contributes no information to the model.
Used for mask tokens to prevent them from influencing the model's predictions during training.
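One common way to get this behavior, shown as a sketch below, is PyTorch's padding_idx argument to nn.Embedding, which zero-initializes that row and keeps its gradient at zero. Treating the mask token as index 0 is an illustrative assumption.

```python
import torch
import torch.nn as nn

# nn.Embedding with padding_idx zero-initializes that row and never updates it,
# so the token at that index always contributes a 0-vector.
MASK_ID, VOCAB_SIZE, DIM = 0, 10, 16   # e.g. the 10 secondary-structure tokens
embed = nn.Embedding(VOCAB_SIZE, DIM, padding_idx=MASK_ID)

tokens = torch.tensor([MASK_ID, 3, 7])
out = embed(tokens)
print(out[0].abs().sum())              # tensor(0., ...) -> mask adds nothing
```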
Discretization: Converting continuous data into discrete categories or bins, which allows models designed for categorical data to process continuous variables.
Locality Sensitive Hashing (LSH):
A method for hashing high-dimensional data so that similar items are more likely to collide (hash to the same bucket).
Useful for reducing dimensionality while preserving similarity.
Multi-Hot Encoding:
Unlike one-hot encoding (where only one position is '1'), multi-hot encoding allows multiple positions to be '1', representing multiple labels or features simultaneously.
Why Tokenization Matters:
Uniform Representation:
Tokenization enables the diverse types of protein data (sequences, structures, annotations) to be represented uniformly as sequences of tokens.
Facilitates the use of sequence-based models like Transformers.
Handling Variable-Length Sequences:
Special tokens and padding allow models to handle sequences of different lengths in batch processing.
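For instance, variable-length token sequences can be padded to a common length before batching; the pad id below is an arbitrary illustrative choice.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 3  # hypothetical id of the pad token
seqs = [torch.tensor([5, 9, 12]), torch.tensor([7, 7, 7, 7, 7])]
batch = pad_sequence(seqs, batch_first=True, padding_value=PAD_ID)
print(batch.shape)  # torch.Size([2, 5]); the shorter sequence is padded with PAD_ID
```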
Incorporating Multiple Data Types:
By assigning tokens to different features (tracks), models can learn from multiple aspects of protein data simultaneously.
Efficient Computation:
Discretization and quantization reduce computational complexity by converting high-dimensional or continuous data into manageable discrete tokens.
Preserving Biological Information:
Careful tokenization preserves important biological characteristics that are crucial for tasks like protein structure prediction, function annotation, and interaction analysis.
Summary
Tokenization is a crucial step in preparing protein data for machine learning models. By converting sequences, structures, and annotations into tokens, models can process and learn from complex biological data. Special tokens and embedding strategies ensure that the data is represented accurately while accommodating the needs of the learning algorithms.
Feel free to ask if you have any questions or need further clarification on any of the topics covered!