
\section*{A.1.7. Structure Tokenizer}

Each residue is associated with one of 4,096 structure tokens (plus 4 special tokens), designed to provide a rich, learned representation of its local neighborhood. The tokens are generated with a VQ-VAE encoder; a corresponding decoder enables decoding generated tokens back to 3D coordinates.

\section*{A.1.7.1. ENCODER}

The VQ-VAE encoder $f_{\text{enc}}$ consists of two geometric attention blocks (Transformer blocks, but with self-attention replaced by geometric_mha) with an embedding width of 1024 and 128 geometric heads per geometric attention layer.

The VQ-VAE encoder reasons over the backbone frames and the relative sequence positions of residues in the local structure. Relative sequence positions are encoded through a learned positional embedding. Sequence positions are determined relative to the query residue (i.e., if the query residue has residue index 56, then the residue at index 58 has a +2 relative sequence position). Relative sequence positions are clamped to $\pm 32$ before encoding, meaning long-range contacts share sequence positional embeddings. The relative sequence positional embeddings define the initial encoder state $N$, which has shape $L \times 16 \times d$ (Algorithm 7, line 4). Note that this means the input to the VQ-VAE encoder is purely structural: no sequence (amino acid), function, or other information is used here. Furthermore, each neighborhood is processed completely independently; for each residue, the encoder only uses the information of its 16 nearest neighbors.

Geometric attention blocks operate similarly to Transformer blocks in that they transform a state according to an attention operation (geometric_mha) and a feedforward network (SwiGLU MLP). As such, the output has the same shape as the input, which in this case means the encoder outputs 16 latents per residue. However, we want to learn a single token, i.e., a single latent per residue, so we take the embedding corresponding to the query residue position, $N_{:,0,:}$.

The process of generating structure tokens (Algorithm 7) from the full 3D coordinates of the protein is then as follows:

  1. Local Neighborhood: For each residue, obtain the indices $N_{\text{idx}} \in \{0..L-1\}^{L \times 16}$ of the 16 nearest residues (as measured by $C_{\alpha}$ distance). The first of the 16 neighbors is always the residue itself. We also obtain the frames for each residue in a local neighborhood as $T_{\text{knn}}$.

  2. Embed Neighbors: Embed the relative distance in sequence space for each neighbor, $\Delta i = \operatorname{clamp}\left(N_{\mathrm{idx}} - i, -32, 32\right)$, to form $N \in \mathbb{R}^{L \times 16 \times d}$.

  3. Encode: Pass $N$ through a shallow encoder $f_{\text{enc}}$ consisting of 2 Transformer blocks, with regular multi-head self-attention swapped for geometric_mha. The attention is unmasked, all-to-all over the entire neighborhood.

  4. Quantize: Extract the first element $N_{:,0,:}$ from the neighborhood, which corresponds to the residue itself. Project it linearly, and quantize by replacing it with the nearest vector in a codebook. This yields the structure token per residue.

Algorithm 7 structure_encode

Input: $x_{C_{\alpha}} \in \mathbb{R}^{L \times 3}, T \in SE(3)^{L}$
    1: $N_{\mathrm{idx}} = \operatorname{knn}\left(x_{C_{\alpha}}\right) \quad \triangleright \{0..L-1\}^{L \times 16}$
    2: $T_{\mathrm{knn}} = T\left[N_{\mathrm{idx}}\right] \quad \triangleright SE(3)^{L \times 16}$
    3: $\Delta i = \operatorname{clamp}\left(N_{\mathrm{idx}} - i, -32, 32\right)$
    4: $N = \operatorname{embed}(\Delta i) \quad \triangleright \mathbb{R}^{L \times 16 \times d}$
    5: $N = f_{\text{enc}}\left(N, T_{\mathrm{knn}}\right) \quad \triangleright \mathbb{R}^{L \times 16 \times d}$
    6: $z = \operatorname{Linear}\left(N_{:,0,:}\right) \quad \triangleright \mathbb{R}^{L \times d^{\prime}}$
    7: $z = \operatorname{quantize}(z) \quad \triangleright \{0..4095\}^{L}$
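
To make the data flow of Algorithm 7 concrete, here is a minimal PyTorch sketch of the same pipeline. The two geometric attention blocks are reduced to a placeholder (their internals, geometric_mha over the frames $T_{\mathrm{knn}}$, are not specified above), the codebook is random rather than learned, and sizes such as $d^{\prime} = 128$ are illustrative assumptions, not values from the paper.

    import torch
    import torch.nn as nn

    def knn_indices(x_ca, k=16):
        # Pairwise C-alpha distances; the nearest "neighbor" of a residue is itself.
        return torch.cdist(x_ca, x_ca).topk(k, largest=False).indices   # (L, k)

    class StructureEncoderSketch(nn.Module):
        # Illustrative stand-in for f_enc plus quantization; not the real implementation.
        def __init__(self, d=1024, d_code=128, n_codes=4096, k=16):
            super().__init__()
            self.k = k
            self.rel_pos_embed = nn.Embedding(65, d)      # clamped offsets -32..+32
            self.encoder = nn.Identity()                  # placeholder for the 2 geometric attention blocks
            self.project = nn.Linear(d, d_code)
            self.codebook = nn.Parameter(torch.randn(n_codes, d_code))

        def forward(self, x_ca):                          # x_ca: (L, 3) C-alpha coordinates
            n_idx = knn_indices(x_ca, self.k)             # (L, 16) neighbor indices
            i = torch.arange(x_ca.shape[0]).unsqueeze(1)  # (L, 1) residue indices
            delta = (n_idx - i).clamp(-32, 32) + 32       # (L, 16), shifted to 0..64
            n = self.rel_pos_embed(delta)                 # (L, 16, d) initial state N
            n = self.encoder(n)                           # real encoder also consumes frames T_knn
            z = self.project(n[:, 0, :])                  # (L, d_code): query-residue latent
            return torch.cdist(z, self.codebook).argmin(dim=-1)   # (L,) structure tokens

    tokens = StructureEncoderSketch()(torch.randn(100, 3))        # toy protein with L = 100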

Please explain this in detail to a beginning grad student in machine learning.

Response: Understanding the Structure Tokenizer and Its Encoder

Note: This explanation is aimed at a beginning graduate student in machine learning.


Introduction

In computational biology and bioinformatics, representing protein structures in a way that machines can efficiently process is crucial. Proteins are composed of amino acid residues, each with its own position in 3D space. To simplify the complex 3D structure for computational models, we convert these structures into a sequence of tokens, similar to words in a sentence.

The Structure Tokenizer is a method to encode each amino acid residue's local 3D neighborhood into a discrete token. Specifically, each residue is associated with one of 4,096 possible tokens (plus 4 special tokens). These tokens capture important structural information about the residue's surroundings.

To generate these tokens, we use a specialized neural network called a Vector Quantized Variational Autoencoder (VQ-VAE). The VQ-VAE has an encoder that processes the local structure around a residue to produce a latent representation, which is then quantized (assigned to the nearest code in a predefined codebook of tokens). There's also a decoder (not detailed here) that can reconstruct the 3D coordinates from the tokens.


Understanding the Components

1. Vector Quantized Variational Autoencoder (VQ-VAE)

In the context of the structure tokenizer, the VQ-VAE encoder compresses each residue's local neighborhood (its 16 nearest residues) into a single continuous latent vector, which is then quantized against a learned codebook of 4,096 entries so that every residue receives a discrete token. A paired decoder (not detailed in this section) maps tokens back to 3D coordinates, which is what makes the tokens useful for generation.

2. Geometric Attention Blocks

The encoder is built from two geometric attention blocks: Transformer blocks in which the usual multi-head self-attention is replaced by geometric_mha, an attention operation that also uses the residues' backbone frames. Each block has an embedding width of 1024, 128 geometric heads per geometric attention layer, and a SwiGLU feedforward network.


Detailed Step-by-Step Explanation

Let's walk through how the structure tokenizer works, focusing on the encoder part.

Algorithm Overview

Step 1: Local Neighborhood

For each residue in the protein:

  1. Find Nearest Neighbors: We compute the indices of the 16 nearest residues (including itself) based on the distance between their C-alpha atoms. This gives us N_idx, an array of shape (L x 16), where L is the number of residues in the protein.

  2. Retrieve Frames: We extract the frames (T_knn) of these neighboring residues using the indices. This results in T_knn of shape (L x 16).
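
As a concrete (toy) illustration of this step, the snippet below builds random C-alpha coordinates, finds each residue's 16 nearest residues, and gathers their frames by indexing, mirroring T_knn = T[N_idx]. Representing a frame as a rotation matrix plus a translation vector is an assumption of this sketch.

    import torch

    L, k = 100, 16
    x_ca = torch.randn(L, 3)                   # toy C-alpha coordinates
    R = torch.eye(3).expand(L, 3, 3).clone()   # toy per-residue rotations (all identity here)
    t = x_ca.clone()                           # toy per-residue translations

    # Step 1a: indices of the 16 nearest residues per residue (self included, at position 0).
    n_idx = torch.cdist(x_ca, x_ca).topk(k, largest=False).indices   # (L, 16)

    # Step 1b: gather the neighbors' frames, i.e. T_knn = T[N_idx].
    R_knn = R[n_idx]                           # (L, 16, 3, 3)
    t_knn = t[n_idx]                           # (L, 16, 3)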

Step 2: Embed Neighbors

  1. Relative Sequence Positions (Δi): For each neighbor, we calculate its position relative to the query residue in the sequence. If a neighbor is two residues ahead in the sequence, Δi = +2. We clamp values to the range [-32, +32] to limit the positional encoding size.

  2. Positional Embedding: We convert these relative positions into embeddings using a learned embedding function. This produces N, an array of shape (L x 16 x d), where d is the embedding dimension (e.g., d = 1024).
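
The snippet below illustrates this for a single query residue: the offsets are clamped, shifted to non-negative indices, and looked up in a learned embedding table. Because of the clamp, neighbors +40 and +70 residues away receive identical embeddings. The table size of 65 (offsets -32..+32) and the +32 shift are assumptions of this sketch, not details given in the text.

    import torch
    import torch.nn as nn

    d = 1024
    embed = nn.Embedding(65, d)                        # one learned vector per clamped offset -32..+32

    query_idx = 56
    neighbor_idx = torch.tensor([56, 58, 96, 126])     # self, +2, +40, and +70 in sequence
    delta = (neighbor_idx - query_idx).clamp(-32, 32)  # tensor([ 0,  2, 32, 32])

    vectors = embed(delta + 32)                        # (4, d), indices shifted to 0..64
    print(torch.allclose(vectors[2], vectors[3]))      # True: the two long-range contacts share an embedding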

Step 3: Encode with Geometric Attention

We process the embeddings N through the VQ-VAE encoder f_enc, consisting of two geometric attention blocks. Here's what happens:

  1. Geometric Attention: Each block applies attention mechanisms that consider both the embeddings N and the geometric information from T_knn.

  2. Transformer-like Processing: Though similar to a Transformer, the self-attention is replaced with geometric_mha to account for 3D relationships.

  3. Output: The encoder outputs updated embeddings of the same shape (L x 16 x d).
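
The text does not spell out geometric_mha itself, so the sketch below only captures the block structure described in Step 3: a Transformer-style block whose attention step receives both the neighbor embeddings and the frames, followed by a SwiGLU feedforward. The attention is left as a pass-through placeholder, and the pre-norm layout and hidden size are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SwiGLU(nn.Module):
        # Gated feedforward used in place of the usual MLP (hidden size is an assumption).
        def __init__(self, d, hidden):
            super().__init__()
            self.w_gate, self.w_up, self.w_down = nn.Linear(d, hidden), nn.Linear(d, hidden), nn.Linear(hidden, d)

        def forward(self, x):
            return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

    class GeometricBlockSketch(nn.Module):
        # Transformer-style block with self-attention swapped for a geometric attention placeholder.
        def __init__(self, d=1024, hidden=4096):
            super().__init__()
            self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
            self.geometric_mha = lambda x, frames: x   # placeholder; the real operation attends using the frames
            self.ffn = SwiGLU(d, hidden)

        def forward(self, x, frames):                  # x: (L, 16, d), frames: T_knn
            x = x + self.geometric_mha(self.norm1(x), frames)   # unmasked attention over the 16 neighbors
            x = x + self.ffn(self.norm2(x))                     # position-wise feedforward
            return x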

Step 4: Extract and Quantize

  1. Extract Query Residue Embedding: From the 16 embeddings for each residue, we select the one corresponding to the residue itself (position 0). This gives us a tensor N[:, 0, :] of shape (L x d).

  2. Linear Projection: We apply a linear transformation to adjust the dimension if necessary, resulting in z of shape (L x d').

  3. Quantization: We replace each continuous embedding in z with the nearest vector (token) from a codebook containing 4,096 embeddings. This step converts continuous representations into discrete tokens.
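
Concretely, quantization is a nearest-neighbor lookup into the codebook: the token is the index of the closest codebook vector, and that vector is what a VQ-VAE decoder would consume. The sizes and the random codebook below are purely illustrative.

    import torch

    L, d_proj, n_codes = 100, 128, 4096            # illustrative sizes; d' is not specified in the text
    z = torch.randn(L, d_proj)                     # projected query-residue latents
    codebook = torch.randn(n_codes, d_proj)        # learned codebook vectors (random here)

    # Nearest codebook entry per residue; these indices are the structure tokens.
    tokens = torch.cdist(z, codebook).argmin(dim=-1)   # (L,), values in 0..4095

    # The quantized continuous vectors, i.e. what the decoder side of a VQ-VAE would receive.
    z_q = codebook[tokens]                         # (L, d_proj)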


Why This Approach?

Two properties of the design are worth emphasizing. First, the encoder's input is purely structural: only backbone frames and relative sequence positions are used, with no amino acid identities or functional annotations, so the tokens describe geometry alone. Second, each residue's neighborhood is processed completely independently, using only its 16 nearest neighbors, so a token is a local description of structure rather than a global one. Finally, because the tokens are discrete and come with a decoder, generated token sequences can be mapped back to 3D coordinates.


Key Concepts Explained

Special Euclidean Group SE(3)

SE(3) is the set of rigid-body transformations of 3D space: every element combines a rotation and a translation. Each residue's backbone frame is such a transformation, describing how that residue's local coordinate system is oriented and positioned, and the notation SE(3)^L in the algorithm simply means one frame per residue.
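
A frame can be written as a 3x3 rotation matrix together with a translation vector; applying it maps points from the residue's local coordinate system into global coordinates. The specific frame below is made up for illustration.

    import numpy as np

    # A frame: rotation R (90 degrees about the z-axis) and translation t.
    R = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    t = np.array([10.0, 0.0, 5.0])

    local_point = np.array([1.0, 0.0, 0.0])   # a point expressed in the residue's local frame
    global_point = R @ local_point + t        # array([10., 1., 5.])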

K-Nearest Neighbors (KNN)

Here KNN is used purely geometrically: for each residue, the 16 residues closest in 3D space (measured between C-alpha atoms) form its neighborhood, and the residue itself is always the first of the 16.

Clamping Sequence Positions

Relative sequence offsets are limited to the range [-32, +32] before embedding. A neighbor 70 positions away in the sequence therefore gets the same positional embedding as one 33 positions away: all long-range contacts collapse onto the +32 (or -32) embedding, and only nearby sequence context is distinguished precisely.

Positional Embedding

A learned lookup table that maps each clamped relative offset to a d-dimensional vector. These vectors form the initial encoder state N of shape (L x 16 x d); together with the frames, they are the only inputs the encoder sees.

Geometric Multi-Head Attention (geometric_mha)

The attention operation that replaces standard self-attention inside the encoder blocks. Besides the neighbor embeddings, it takes the backbone frames T_knn as input, which is how 3D geometry enters the computation; the attention is unmasked and all-to-all over the 16-residue neighborhood.

Quantization

Replacing each residue's continuous latent vector with the index of the nearest vector in the codebook. This converts a real-valued representation into a discrete token, at the cost of some information loss.

Codebook

The learned set of 4,096 reference vectors used for quantization (the 4 special tokens are added on top of this vocabulary). A residue's structure token is simply the index of the codebook entry closest to its projected latent.


Why 4,096 Tokens?

The codebook size is a trade-off common to vector-quantized models: it must be large enough for distinct local backbone geometries to map to distinct tokens, yet small enough that the tokens form a manageable discrete vocabulary and each codebook entry is used often enough during training to be learned well. 4,096 (that is, 2^12) codes, plus the 4 special tokens, sits in that range.


Applications

Because every residue's local structure is reduced to a single discrete token, a protein backbone can be treated as a sequence of tokens, the same kind of object that sequence models already handle. The paired decoder closes the loop: token sequences produced by a model can be decoded back into 3D coordinates, as noted at the start of this section.


Summary

For each residue, the tokenizer gathers the 16 nearest residues by C-alpha distance, embeds their clamped relative sequence positions, and processes the neighborhood with two geometric attention blocks that also see the backbone frames. The embedding at the query residue's own position is projected and snapped to the nearest of 4,096 codebook vectors, and the index of that vector is the residue's structure token. The encoder uses structural information only, and each neighborhood is encoded independently of the rest of the protein.


Final Thoughts

Understanding the structure tokenizer involves combining knowledge from machine learning, computational geometry, and bioinformatics. By representing protein structures as sequences of tokens, we open up new possibilities for applying advanced machine learning techniques to biological data, potentially leading to insights in protein function, interaction, and design.