\section*{A.1.3. Tokenization}
During tokenization, special beginning-of-sequence (BOS) and end-of-sequence (EOS) tokens are prepended and appended to mark the real start and end of sequences. When a sequence is cropped due to length, the BOS and EOS tokens are removed to indicate that it is a protein fragment. In all cases, one token per track is used for each amino acid.
Sequence Protein sequences are tokenized as the 20 canonical amino acids, plus BOS, EOS, mask, pad, and unknown. We keep four non-standard amino acids as in Lin et al. (5): B (asparagine), U (selenocysteine), Z (glutamic acid), and O (ornithine). This gives a total of 29 tokens.
Structure Structure tokenization is described in Appendix A.1.7.1. ESM3 uses a codebook size of 4096 with 4 special tokens: BOS, EOS, mask, and pad.
Secondary Structure Secondary structure is taken to be the canonical 8-class tokens (60), with unknown and mask, for a total of 10 tokens. The mask token is forced to be the 0-vector during embedding.
SASA The continuous values representing SASA are tokenized by discretization into a fixed set of 16 bins. SASA bin boundaries were chosen by computing SASA on 100 random structures and ensuring that an equal number of residues falls into each bin. Unknown and mask are used for a total of 18 tokens. The mask token is forced to be the 0-vector during embedding.
Function annotations We tokenize function annotations as bags of keywords, as described in Appendix A.1.8. Keywords are quantized using LSH into 8 tokens per residue, each of which can take one of 255 values. There are three special tokens: empty set, no-annotation, and mask. Again, the mask token is forced to be the 0-vector during embedding.
Residue annotations InterPro annotations are tokenized as a multi-hot feature vector (1478 dimensions) over possible InterPro labels (38). Input annotations are limited to a maximum of 16. When annotations are not present, we enforce that the 0-vector is used.
Please explain this in detail to a beginning grad student in machine learning.
Response: Explanation of Section A.1.3: Tokenization
In this section, we discuss how protein data is converted into a format suitable for machine learning models through a process called tokenization. Tokenization involves breaking down complex data into discrete units called tokens that models can process. Here's a detailed explanation of how different types of protein-related data are tokenized.
Special Tokens:
BOS (Beginning-of-Sequence): Marks the start of a sequence.
EOS (End-of-Sequence): Marks the end of a sequence.
Mask: Used during training for tasks like masked language modeling.
Pad: Used to pad sequences to a uniform length.
Unknown: Represents unknown or rare elements.
Sequence Cropping:
When sequences exceed a maximum length, they are cropped.
BOS and EOS tokens are removed in cropped sequences to indicate that these are fragments, not complete sequences.
Amino Acid Representation:
20 Canonical Amino Acids: The standard amino acids found in proteins.
Non-Standard Amino Acids: Four additional tokens retained from Lin et al.: B (asparagine), U (selenocysteine), Z (glutamic acid), and O (ornithine).
Total Tokens for Protein Sequences:
24 Amino Acid Tokens: 20 canonical + 4 non-standard.
Special Tokens: BOS, EOS, mask, pad, unknown.
Total: 29 tokens for representing protein sequences.
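To make the counting concrete, here is a minimal Python sketch of how such a 29-token vocabulary could be assembled. The token strings, their ordering, and the prefix-cropping behavior are illustrative assumptions, not the actual ESM3 tokenizer.

```python
# Minimal sketch of a 29-token sequence vocabulary (token names and ordering
# are hypothetical; the real ESM3 vocabulary may differ).
CANONICAL = list("ACDEFGHIKLMNPQRSTVWY")                  # 20 canonical amino acids
NONSTANDARD = ["B", "U", "Z", "O"]                        # 4 non-standard amino acids
SPECIAL = ["<bos>", "<eos>", "<mask>", "<pad>", "<unk>"]  # 5 special tokens

VOCAB = SPECIAL + CANONICAL + NONSTANDARD                 # 29 tokens total
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(seq, max_len=None):
    """Map a protein sequence to token ids, adding BOS/EOS.

    If the sequence must be cropped to max_len, BOS and EOS are dropped to
    signal a fragment. (Cropping a prefix here is a simplification.)
    """
    ids = [TOKEN_TO_ID.get(aa, TOKEN_TO_ID["<unk>"]) for aa in seq]
    if max_len is not None and len(ids) > max_len:
        return ids[:max_len]                              # fragment: no BOS/EOS
    return [TOKEN_TO_ID["<bos>"]] + ids + [TOKEN_TO_ID["<eos>"]]

assert len(VOCAB) == 29
print(tokenize("MKTFFV"))              # full protein: wrapped in BOS/EOS
print(tokenize("MKTFFV", max_len=4))   # cropped fragment: no BOS/EOS
```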
Structure Tokens:
Note: The structure tokenizer itself is described in Appendix A.1.7.1 (not included here). ESM3 uses a codebook of 4096 structure tokens, plus 4 special tokens: BOS, EOS, mask, and pad.
Secondary Structure Elements:
8 Classes: According to the DSSP (Define Secondary Structure of Proteins) classification: H (alpha-helix), G (3-10 helix), I (pi-helix), E (beta-strand), B (isolated beta-bridge), T (turn), S (bend), and loop/coil.
Additional Tokens:
Unknown: For residues where the secondary structure is not known.
Mask: Used during model training; its embedding is forced to be the 0-vector.
Total Tokens: 8 classes + unknown + mask = 10 tokens.
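A tiny sketch of this 10-token vocabulary follows. The DSSP letter codes and the choice to put mask at index 0 (so it can be zeroed at embedding time, as in the embedding sketch further below) are illustrative assumptions.

```python
# Hypothetical index assignment for the 10 secondary-structure tokens.
SS8_CLASSES = ["H", "G", "I", "E", "B", "T", "S", "C"]   # DSSP 8-state codes
SS_VOCAB = ["<mask>"] + SS8_CLASSES + ["<unk>"]          # mask placed at index 0
SS_TO_ID = {tok: i for i, tok in enumerate(SS_VOCAB)}
assert len(SS_VOCAB) == 10
```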
SASA (Solvent-Accessible Surface Area) Tokens:
Definition: SASA measures the surface area of a protein that is accessible to solvent, indicating how exposed each amino acid residue is.
Discretization:
Continuous to Discrete: SASA values are continuous and are converted into discrete tokens by binning.
16 Bins: SASA values are divided into 16 bins whose boundaries were chosen by computing SASA on 100 random structures and requiring an equal number of residues in each bin (equal-frequency binning).
Additional Tokens:
Unknown: For residues where SASA cannot be determined.
Mask: Represented by the zero vector during embedding.
Total Tokens: 16 bins + unknown + mask = 18 tokens.
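Equal-frequency binning like this can be sketched with quantiles. The sample values, their range, and the quantile-based fitting below are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def fit_sasa_boundaries(sasa_values, n_bins=16):
    """Pick 15 interior quantile boundaries so each of the 16 bins gets ~equal counts."""
    interior = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(sasa_values, interior)

def tokenize_sasa(sasa, boundaries):
    """Map continuous SASA values to bin indices 0..15."""
    return np.digitize(sasa, boundaries)

# Placeholder per-residue SASA values standing in for the sampled structures.
sample = np.random.default_rng(0).uniform(0.0, 250.0, size=10_000)
bounds = fit_sasa_boundaries(sample)                  # 15 boundaries
print(tokenize_sasa(np.array([0.0, 42.0, 249.0]), bounds))
```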
Function Annotation Tokens:
Purpose: To represent the functional information associated with each amino acid residue.
Representation:
Bags of Keywords: Each residue is associated with a set of function keywords.
Quantization:
Locality Sensitive Hashing (LSH): The keyword bag for each residue is hashed with LSH, so residues with similar keyword sets tend to receive similar token values.
8 Tokens per Residue: Each residue's keyword bag is quantized into 8 LSH tokens.
Token Space:
255 Possible Tokens: Each of the 8 tokens can be one of 255 possible values.
Special Tokens:
Empty Set: Indicates that there are no function annotations.
No-Annotation: Distinguishes between genuinely lacking annotations and missing data.
Mask: Represented by the zero vector in embeddings.
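The sketch below shows a generic random-hyperplane LSH that turns one residue's keyword feature vector into 8 discrete tokens. The feature dimension, hash construction, and use of a full 8-bit bucket space are assumptions; the actual ESM3 procedure is the one described in Appendix A.1.8.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 512                    # assumed dimension of the keyword feature vector
N_HASHES, BITS = 8, 8      # 8 tokens per residue; 8 bits -> up to 256 buckets
                           # (the paper's 255-value space is approximated here)
HYPERPLANES = rng.standard_normal((N_HASHES, BITS, D))

def lsh_tokens(keyword_vec):
    """Hash one residue's keyword feature vector into 8 integer tokens."""
    tokens = []
    for planes in HYPERPLANES:                          # one hyperplane group per token
        bits = (planes @ keyword_vec > 0).astype(int)   # 8-bit sign pattern
        tokens.append(int(bits @ (1 << np.arange(BITS))))
    return tokens

print(lsh_tokens(rng.standard_normal(D)))               # 8 tokens, each in [0, 255]
```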
Residue Annotation Tokens (InterPro):
InterPro: A database that integrates various protein annotation resources, providing functional analysis of protein sequences.
Representation:
Multi-Hot Encoding: Each residue's annotations are encoded as a 1478-dimensional multi-hot feature vector over the possible InterPro labels, with a 1 at each label that applies.
Limitations:
Maximum of 16 Annotations: For computational efficiency, input annotations per residue are limited to 16.
Handling Missing Annotations:
Residues without annotations are represented by the zero vector.
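A short sketch of such a multi-hot encoding, assuming a hypothetical label-to-index mapping and the 16-annotation cap described above.

```python
import numpy as np

N_INTERPRO = 1478        # number of possible InterPro labels
MAX_ANNOTATIONS = 16     # cap on input annotations

def encode_residue_annotations(label_ids):
    """Multi-hot vector over InterPro labels; an empty list gives the 0-vector."""
    vec = np.zeros(N_INTERPRO, dtype=np.float32)
    for idx in label_ids[:MAX_ANNOTATIONS]:   # label_ids are hypothetical indices
        vec[idx] = 1.0
    return vec

print(encode_residue_annotations([]).sum())              # 0.0 -> no annotations
print(encode_residue_annotations([3, 100, 100]).sum())   # 2.0 (duplicates collapse)
```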
Key Concepts:
Tokenization: The process of converting raw data into discrete units (tokens) that a model can process.
Special Tokens: Tokens that have a specific role in sequence processing, such as indicating the start or end of a sequence, or masking parts of the data during training.
One Token per Track:
Track: A type of data or feature associated with amino acid residues.
Example: Amino acid identity, secondary structure class, SASA bin.
Each residue has one token for each track, capturing different aspects of its properties.
Embedding Zero Vector:
When a token is represented by the zero vector, it contributes no information to the model.
Used for mask tokens to prevent them from influencing the model's predictions during training.
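One common way to get this behavior, shown as a sketch below, is PyTorch's padding_idx argument to nn.Embedding, which zero-initializes that row and keeps its gradient at zero. Treating the mask token as index 0 is an illustrative assumption.

```python
import torch
import torch.nn as nn

# nn.Embedding with padding_idx zero-initializes that row and never updates it,
# so the token at that index always contributes a 0-vector.
MASK_ID, VOCAB_SIZE, DIM = 0, 10, 16   # e.g. the 10 secondary-structure tokens
embed = nn.Embedding(VOCAB_SIZE, DIM, padding_idx=MASK_ID)

tokens = torch.tensor([MASK_ID, 3, 7])
out = embed(tokens)
print(out[0].abs().sum())              # tensor(0., ...) -> mask adds nothing
```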
Discretization: Converting continuous data into discrete categories or bins, which allows models designed for categorical data to process continuous variables.
Locality Sensitive Hashing (LSH):
A method for hashing high-dimensional data so that similar items are more likely to collide (hash to the same bucket).
Useful for reducing dimensionality while preserving similarity.
Multi-Hot Encoding:
Unlike one-hot encoding (where only one position is '1'), multi-hot encoding allows multiple positions to be '1', representing multiple labels or features simultaneously.
Why Tokenization Matters:
Uniform Representation:
Tokenization enables the diverse types of protein data (sequences, structures, annotations) to be represented uniformly as sequences of tokens.
Facilitates the use of sequence-based models like Transformers.
Handling Variable-Length Sequences:
Special tokens and padding allow models to handle sequences of different lengths in batch processing.
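For instance, variable-length token sequences can be padded to a common length before batching; the pad id below is an arbitrary illustrative choice.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 3  # hypothetical id of the pad token
seqs = [torch.tensor([5, 9, 12]), torch.tensor([7, 7, 7, 7, 7])]
batch = pad_sequence(seqs, batch_first=True, padding_value=PAD_ID)
print(batch.shape)  # torch.Size([2, 5]); the shorter sequence is padded with PAD_ID
```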
Incorporating Multiple Data Types:
By assigning tokens to different features (tracks), models can learn from multiple aspects of protein data simultaneously.
Efficient Computation:
Discretization and quantization reduce computational complexity by converting high-dimensional or continuous data into manageable discrete tokens.
Preserving Biological Information:
Careful tokenization preserves important biological characteristics that are crucial for tasks like protein structure prediction, function annotation, and interaction analysis.
Summary
Tokenization is a crucial step in preparing protein data for machine learning models. By converting sequences, structures, and annotations into tokens, models can process and learn from complex biological data. Special tokens and embedding strategies ensure that the data is represented accurately while accommodating the needs of the learning algorithms.
Feel free to ask if you have any questions or need further clarification on any of the topics covered!