
\section*{A.1.7. Structure Tokenizer}

Each residue is associated with one of 4,096 structure tokens (plus 4 special tokens), designed to provide a rich, learned representation of its local neighborhood. The tokens are generated with a VQ-VAE encoder; a corresponding decoder enables decoding generated tokens back to 3D coordinates.

\section*{A.1.7.1. ENCODER}

The VQ-VAE encoder $f_{\text{enc}}$ consists of two geometric attention blocks (Transformer blocks, but with self-attention replaced by geometric_mha) with an embedding width of 1024 and 128 geometric heads per geometric attention layer.

The VQ-VAE encoder reasons over the backbone frames and the relative sequence positions of residues in the local structure. Relative sequence positions are encoded through a learned positional embedding. Sequence positions are determined relative to the query residue (i.e., if the query residue has residue index 56, then the residue at index 58 has a +2 relative sequence position). Relative sequence positions are clamped to $\pm 32$ before encoding, meaning long-range contacts share sequence positional embeddings. The relative sequence positional embeddings define the initial encoder state $N$, which has shape $L \times 16 \times d$ (Algorithm 7, line 4). Note that this means the input to the VQ-VAE encoder is purely structural: no sequence (amino acid), function, or other information is used here. Furthermore, each neighborhood is processed completely independently; for each residue, the encoder only uses the information of its 16 nearest neighbors.

Geometric attention blocks operate similarly to Transformer blocks in that they transform a state according to an attention operation (geometric_mha) and a feedforward network (SwiGLU MLP). As such, the output has the same shape as the input, which in this case means the encoder outputs 16 latents per residue. However, we want to learn a single token, i.e., a single latent per residue, so we take the embedding corresponding to the query residue position, $N_{:,0,:}$.

The process of generating structure tokens (Algorithm 7) from the full 3D coordinates of the protein is then as follows:

  1. Local Neighborhood: For each residue, obtain the indices $N_{\text{idx}} \in \{0..L-1\}^{L \times 16}$ of the 16 nearest residues (as measured by $C_{\alpha}$ distance). The first of the 16 neighbors is always the residue itself. We also obtain the frames for each residue in a local neighborhood as $T_{\text{knn}}$.

  2. Embed Neighbors: Embed the relative distance in sequence space for each neighbor, $\Delta i = \operatorname{clamp}\left(N_{\mathrm{idx}} - i, -32, 32\right)$, to form $N \in \mathbb{R}^{L \times 16 \times d}$.

  3. Encode: Pass $N$ through a shallow encoder $f_{\text{enc}}$ consisting of 2 Transformer blocks, with regular multi-head self-attention swapped for geometric_mha. The attention is unmasked, all-to-all over the entire neighborhood.

  4. Quantize: Extract the first element $N_{:,0,:}$ from the neighborhood, which corresponds to the residue itself. Project it linearly, and quantize by replacing it with the nearest vector in a codebook. This yields the structure token per residue.

Algorithm 7 structure_encode

Input: $x_{C_{\alpha}} \in \mathbb{R}^{L \times 3}, T \in SE(3)^{L}$
    1: $N_{\mathrm{idx}} = \operatorname{knn}\left(x_{C_{\alpha}}\right) \quad \triangleright \{0..L-1\}^{L \times 16}$
    2: $T_{\mathrm{knn}} = T\left[N_{\mathrm{idx}}\right] \quad \triangleright SE(3)^{L \times 16}$
    3: $\Delta i = \operatorname{clamp}\left(N_{\mathrm{idx}} - i, -32, 32\right)$
    4: $N = \operatorname{embed}(\Delta i) \quad \triangleright \mathbb{R}^{L \times 16 \times d}$
    5: $N = f_{\text{enc}}\left(N, T_{\mathrm{knn}}\right) \quad \triangleright \mathbb{R}^{L \times 16 \times d}$
    6: $z = \operatorname{Linear}\left(N_{:,0,:}\right) \quad \triangleright \mathbb{R}^{L \times d^{\prime}}$
    7: $z = \operatorname{quantize}(z) \quad \triangleright \{0..4095\}^{L}$
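
To make the data flow of Algorithm 7 concrete, here is a minimal PyTorch sketch of the same pipeline. The two geometric attention blocks are reduced to a placeholder (their internals, geometric_mha over the frames $T_{\mathrm{knn}}$, are not specified above), the codebook is random rather than learned, and sizes such as $d^{\prime} = 128$ are illustrative assumptions, not values from the paper.

    import torch
    import torch.nn as nn

    def knn_indices(x_ca, k=16):
        # Pairwise C-alpha distances; the nearest "neighbor" of a residue is itself.
        return torch.cdist(x_ca, x_ca).topk(k, largest=False).indices   # (L, k)

    class StructureEncoderSketch(nn.Module):
        # Illustrative stand-in for f_enc plus quantization; not the real implementation.
        def __init__(self, d=1024, d_code=128, n_codes=4096, k=16):
            super().__init__()
            self.k = k
            self.rel_pos_embed = nn.Embedding(65, d)      # clamped offsets -32..+32
            self.encoder = nn.Identity()                  # placeholder for the 2 geometric attention blocks
            self.project = nn.Linear(d, d_code)
            self.codebook = nn.Parameter(torch.randn(n_codes, d_code))

        def forward(self, x_ca):                          # x_ca: (L, 3) C-alpha coordinates
            n_idx = knn_indices(x_ca, self.k)             # (L, 16) neighbor indices
            i = torch.arange(x_ca.shape[0]).unsqueeze(1)  # (L, 1) residue indices
            delta = (n_idx - i).clamp(-32, 32) + 32       # (L, 16), shifted to 0..64
            n = self.rel_pos_embed(delta)                 # (L, 16, d) initial state N
            n = self.encoder(n)                           # real encoder also consumes frames T_knn
            z = self.project(n[:, 0, :])                  # (L, d_code): query-residue latent
            return torch.cdist(z, self.codebook).argmin(dim=-1)   # (L,) structure tokens

    tokens = StructureEncoderSketch()(torch.randn(100, 3))        # toy protein with L = 100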

Please explain this in detail to a beginning grad student in machine learning.

Response: Understanding the Structure Tokenizer and Its Encoder

Note: This explanation is aimed at a beginning graduate student in machine learning.


Introduction

In computational biology and bioinformatics, representing protein structures in a way that machines can efficiently process is crucial. Proteins are composed of amino acid residues, each with its own position in 3D space. To simplify the complex 3D structure for computational models, we convert these structures into a sequence of tokens, similar to words in a sentence.

The Structure Tokenizer is a method to encode each amino acid residue's local 3D neighborhood into a discrete token. Specifically, each residue is associated with one of 4,096 possible tokens (plus 4 special tokens). These tokens capture important structural information about the residue's surroundings.

To generate these tokens, we use a specialized neural network called a Vector Quantized Variational Autoencoder (VQ-VAE). The VQ-VAE has an encoder that processes the local structure around a residue to produce a latent representation, which is then quantized (assigned to the nearest code in a predefined codebook of tokens). There's also a decoder (not detailed here) that can reconstruct the 3D coordinates from the tokens.


Understanding the Components

1. Vector Quantized Variational Autoencoder (VQ-VAE)

In the context of the structure tokenizer, the VQ-VAE encoder compresses each residue's local neighborhood (its 16 nearest residues) into a single continuous latent vector, which is then quantized against a learned codebook of 4,096 entries so that every residue receives a discrete token. A paired decoder (not detailed in this section) maps tokens back to 3D coordinates, which is what makes the tokens useful for generation.

2. Geometric Attention Blocks

The encoder is built from two geometric attention blocks: Transformer blocks in which the usual multi-head self-attention is replaced by geometric_mha, an attention operation that also uses the residues' backbone frames. Each block has an embedding width of 1024, 128 geometric heads per geometric attention layer, and a SwiGLU feedforward network.


Detailed Step-by-Step Explanation

Let's walk through how the structure tokenizer works, focusing on the encoder part.

Algorithm Overview

Step 1: Local Neighborhood

For each residue in the protein:

  1. Find Nearest Neighbors: We compute the indices of the 16 nearest residues (including itself) based on the distance between their C-alpha atoms. This gives us N_idx, an array of shape (L x 16), where L is the number of residues in the protein.

  2. Retrieve Frames: We extract the frames (T_knn) of these neighboring residues using the indices. This results in T_knn of shape (L x 16).
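
As a concrete (toy) illustration of this step, the snippet below builds random C-alpha coordinates, finds each residue's 16 nearest residues, and gathers their frames by indexing, mirroring T_knn = T[N_idx]. Representing a frame as a rotation matrix plus a translation vector is an assumption of this sketch.

    import torch

    L, k = 100, 16
    x_ca = torch.randn(L, 3)                   # toy C-alpha coordinates
    R = torch.eye(3).expand(L, 3, 3).clone()   # toy per-residue rotations (all identity here)
    t = x_ca.clone()                           # toy per-residue translations

    # Step 1a: indices of the 16 nearest residues per residue (self included, at position 0).
    n_idx = torch.cdist(x_ca, x_ca).topk(k, largest=False).indices   # (L, 16)

    # Step 1b: gather the neighbors' frames, i.e. T_knn = T[N_idx].
    R_knn = R[n_idx]                           # (L, 16, 3, 3)
    t_knn = t[n_idx]                           # (L, 16, 3)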

Step 2: Embed Neighbors

  1. Relative Sequence Positions (Δi): For each neighbor, we calculate its position relative to the query residue in the sequence. If a neighbor is two residues ahead in the sequence, Δi = +2. We clamp values to the range [-32, +32] to limit the positional encoding size.

  2. Positional Embedding: We convert these relative positions into embeddings using a learned embedding function. This produces N, an array of shape (L x 16 x d), where d is the embedding dimension (e.g., d = 1024).
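
The snippet below illustrates this for a single query residue: the offsets are clamped, shifted to non-negative indices, and looked up in a learned embedding table. Because of the clamp, neighbors +40 and +70 residues away receive identical embeddings. The table size of 65 (offsets -32..+32) and the +32 shift are assumptions of this sketch, not details given in the text.

    import torch
    import torch.nn as nn

    d = 1024
    embed = nn.Embedding(65, d)                        # one learned vector per clamped offset -32..+32

    query_idx = 56
    neighbor_idx = torch.tensor([56, 58, 96, 126])     # self, +2, +40, and +70 in sequence
    delta = (neighbor_idx - query_idx).clamp(-32, 32)  # tensor([ 0,  2, 32, 32])

    vectors = embed(delta + 32)                        # (4, d), indices shifted to 0..64
    print(torch.allclose(vectors[2], vectors[3]))      # True: the two long-range contacts share an embedding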

Step 3: Encode with Geometric Attention

We process the embeddings N through the VQ-VAE encoder f_enc, consisting of two geometric attention blocks. Here's what happens:

  1. Geometric Attention: Each block applies attention mechanisms that consider both the embeddings N and the geometric information from T_knn.

  2. Transformer-like Processing: Though similar to a Transformer, the self-attention is replaced with geometric_mha to account for 3D relationships.

  3. Output: The encoder outputs updated embeddings of the same shape (L x 16 x d).
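
The text does not spell out geometric_mha itself, so the sketch below only captures the block structure described in Step 3: a Transformer-style block whose attention step receives both the neighbor embeddings and the frames, followed by a SwiGLU feedforward. The attention is left as a pass-through placeholder, and the pre-norm layout and hidden size are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SwiGLU(nn.Module):
        # Gated feedforward used in place of the usual MLP (hidden size is an assumption).
        def __init__(self, d, hidden):
            super().__init__()
            self.w_gate, self.w_up, self.w_down = nn.Linear(d, hidden), nn.Linear(d, hidden), nn.Linear(hidden, d)

        def forward(self, x):
            return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

    class GeometricBlockSketch(nn.Module):
        # Transformer-style block with self-attention swapped for a geometric attention placeholder.
        def __init__(self, d=1024, hidden=4096):
            super().__init__()
            self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
            self.geometric_mha = lambda x, frames: x   # placeholder; the real operation attends using the frames
            self.ffn = SwiGLU(d, hidden)

        def forward(self, x, frames):                  # x: (L, 16, d), frames: T_knn
            x = x + self.geometric_mha(self.norm1(x), frames)   # unmasked attention over the 16 neighbors
            x = x + self.ffn(self.norm2(x))                     # position-wise feedforward
            return x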

Step 4: Extract and Quantize

  1. Extract Query Residue Embedding: From the 16 embeddings for each residue, we select the one corresponding to the residue itself (position 0). This gives us a tensor N[:, 0, :] of shape (L x d).

  2. Linear Projection: We apply a linear transformation to adjust the dimension if necessary, resulting in z of shape (L x d').

  3. Quantization: We replace each continuous embedding in z with the nearest vector (token) from a codebook containing 4,096 embeddings. This step converts continuous representations into discrete tokens.
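
Concretely, quantization is a nearest-neighbor lookup into the codebook: the token is the index of the closest codebook vector, and that vector is what a VQ-VAE decoder would consume. The sizes and the random codebook below are purely illustrative.

    import torch

    L, d_proj, n_codes = 100, 128, 4096            # illustrative sizes; d' is not specified in the text
    z = torch.randn(L, d_proj)                     # projected query-residue latents
    codebook = torch.randn(n_codes, d_proj)        # learned codebook vectors (random here)

    # Nearest codebook entry per residue; these indices are the structure tokens.
    tokens = torch.cdist(z, codebook).argmin(dim=-1)   # (L,), values in 0..4095

    # The quantized continuous vectors, i.e. what the decoder side of a VQ-VAE would receive.
    z_q = codebook[tokens]                         # (L, d_proj)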


Why This Approach?

Two properties of the design are worth emphasizing. First, the encoder's input is purely structural: only backbone frames and relative sequence positions are used, with no amino acid identities or functional annotations, so the tokens describe geometry alone. Second, each residue's neighborhood is processed completely independently, using only its 16 nearest neighbors, so a token is a local description of structure rather than a global one. Finally, because the tokens are discrete and come with a decoder, generated token sequences can be mapped back to 3D coordinates.


Key Concepts Explained

Special Euclidean Group SE(3)

SE(3) is the set of rigid-body transformations of 3D space: every element combines a rotation and a translation. Each residue's backbone frame is such a transformation, describing how that residue's local coordinate system is oriented and positioned, and the notation SE(3)^L in the algorithm simply means one frame per residue.
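
A frame can be written as a 3x3 rotation matrix together with a translation vector; applying it maps points from the residue's local coordinate system into global coordinates. The specific frame below is made up for illustration.

    import numpy as np

    # A frame: rotation R (90 degrees about the z-axis) and translation t.
    R = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    t = np.array([10.0, 0.0, 5.0])

    local_point = np.array([1.0, 0.0, 0.0])   # a point expressed in the residue's local frame
    global_point = R @ local_point + t        # array([10., 1., 5.])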

K-Nearest Neighbors (KNN)

Here KNN is used purely geometrically: for each residue, the 16 residues closest in 3D space (measured between C-alpha atoms) form its neighborhood, and the residue itself is always the first of the 16.

Clamping Sequence Positions

Relative sequence offsets are limited to the range [-32, +32] before embedding. A neighbor 70 positions away in the sequence therefore gets the same positional embedding as one 33 positions away: all long-range contacts collapse onto the +32 (or -32) embedding, and only nearby sequence context is distinguished precisely.

Positional Embedding

A learned lookup table that maps each clamped relative offset to a d-dimensional vector. These vectors form the initial encoder state N of shape (L x 16 x d); together with the frames, they are the only inputs the encoder sees.

Geometric Multi-Head Attention (geometric_mha)

The attention operation that replaces standard self-attention inside the encoder blocks. Besides the neighbor embeddings, it takes the backbone frames T_knn as input, which is how 3D geometry enters the computation; the attention is unmasked and all-to-all over the 16-residue neighborhood.

Quantization

Replacing each residue's continuous latent vector with the index of the nearest vector in the codebook. This converts a real-valued representation into a discrete token, at the cost of some information loss.

Codebook

The learned set of 4,096 reference vectors used for quantization (the 4 special tokens are added on top of this vocabulary). A residue's structure token is simply the index of the codebook entry closest to its projected latent.


Why 4,096 Tokens?

The codebook size is a trade-off common to vector-quantized models: it must be large enough for distinct local backbone geometries to map to distinct tokens, yet small enough that the tokens form a manageable discrete vocabulary and each codebook entry is used often enough during training to be learned well. 4,096 (that is, 2^12) codes, plus the 4 special tokens, sits in that range.


Applications

Because every residue's local structure is reduced to a single discrete token, a protein backbone can be treated as a sequence of tokens, the same kind of object that sequence models already handle. The paired decoder closes the loop: token sequences produced by a model can be decoded back into 3D coordinates, as noted at the start of this section.


Summary

For each residue, the tokenizer gathers the 16 nearest residues by C-alpha distance, embeds their clamped relative sequence positions, and processes the neighborhood with two geometric attention blocks that also see the backbone frames. The embedding at the query residue's own position is projected and snapped to the nearest of 4,096 codebook vectors, and the index of that vector is the residue's structure token. The encoder uses structural information only, and each neighborhood is encoded independently of the rest of the protein.


Final Thoughts

Understanding the structure tokenizer involves combining knowledge from machine learning, computational geometry, and bioinformatics. By representing protein structures as sequences of tokens, we open up new possibilities for applying advanced machine learning techniques to biological data, potentially leading to insights in protein function, interaction, and design.