\section*{A.1.7.1.1. Codebook Learning}
The quantize operation transforms the $L$ latents into $L$ discrete tokens. Since the VQ-VAE was initially proposed (67), numerous approaches and tricks have been developed to address issues with poor codebook utilization and unstable training. We chose to learn the codebook as an exponential moving average of encoder outputs (67-69). To improve codebook utilization, unused codes are re-initialized to encoder outputs.
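To make this concrete, the sketch below shows one common way to implement an EMA codebook update with dead-code re-initialization in PyTorch. It only covers the codebook update (not the straight-through quantization or the reconstruction loss), and the tensor shapes, threshold, and function names are illustrative assumptions rather than the implementation used here.

```python
import torch

def ema_codebook_update(codebook, ema_counts, ema_sums, encoder_out, codes,
                        decay=0.99, eps=1e-5, reinit_threshold=1.0):
    """One EMA update of a VQ codebook (illustrative sketch, not the paper's code).

    codebook:    (K, d) current code vectors
    ema_counts:  (K,)   running count of assignments per code
    ema_sums:    (K, d) running sum of encoder outputs assigned to each code
    encoder_out: (N, d) flattened encoder outputs in this batch
    codes:       (N,)   index of the nearest code for each encoder output
    """
    K, d = codebook.shape
    one_hot = torch.nn.functional.one_hot(codes, K).to(encoder_out.dtype)  # (N, K)

    # Exponential moving averages of assignment counts and assigned vectors.
    ema_counts.mul_(decay).add_(one_hot.sum(0), alpha=1 - decay)
    ema_sums.mul_(decay).add_(one_hot.t() @ encoder_out, alpha=1 - decay)

    # Laplace-smoothed normalization, then update the codebook in place.
    n = ema_counts.sum()
    counts = (ema_counts + eps) / (n + K * eps) * n
    codebook.copy_(ema_sums / counts.unsqueeze(1))

    # Re-initialize rarely used ("dead") codes to random encoder outputs
    # so that codebook utilization stays high.
    dead = ema_counts < reinit_threshold
    if dead.any():
        rand_idx = torch.randint(0, encoder_out.shape[0], (int(dead.sum()),))
        codebook[dead] = encoder_out[rand_idx]
    return codebook
```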
\section*{A.1.7.1.2. Parallel Encoding}
To improve training and inference efficiency, we encode all local structure graphs within a protein in parallel. In practice, this means that given a batch of $B$ proteins with average sequence length $L$, the inputs to the structure encoder have shape $BL \times 16 \times d$.
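For intuition, the reshaping can be sketched as follows; the concrete sizes and variable names are illustrative.

```python
import torch

B, L, d = 8, 256, 128                      # proteins per batch, residues per protein, feature dim (illustrative)
local_graphs = torch.randn(B, L, 16, d)    # a 16-residue local neighborhood per residue

# Fold the batch and sequence dimensions together so every local structure graph
# is encoded independently and in parallel: (B, L, 16, d) -> (B*L, 16, d).
encoder_input = local_graphs.reshape(B * L, 16, d)
print(encoder_input.shape)                 # torch.Size([2048, 16, 128])
```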
\section*{A.1.7.2. DECODER}
While the encoder independently processes all local structures in parallel, the decoder $f_{\text{dec}}$ attends over the entire set of $L$ tokens to reconstruct the full structure. It is composed of a stack of bidirectional Transformer blocks with regular self-attention.
As discussed in Appendix A.1.7.3, the VQ-VAE is trained in two stages. In the first stage, a smaller decoder trunk consisting of 8 Transformer blocks with width 1024, rotary positional embeddings, and MLPs is trained to predict only backbone coordinates. In the second stage, the decoder weights are re-initialized and the network size is expanded to 30 layers, each with an embedding dimension of 1280 ($\sim 600\mathrm{M}$ parameters), to predict all-atom coordinates.
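A minimal sketch of such a trunk using stock PyTorch layers is shown below. It follows the second-stage depth and width quoted above, but omits rotary positional embeddings, and the class name, head count, and MLP ratio are illustrative assumptions rather than the paper's implementation.

```python
import torch.nn as nn

class DecoderTrunk(nn.Module):
    """Bidirectional (non-causal) Transformer stack: standard self-attention + MLP
    sublayers. Depth and width follow the second-stage description; everything
    else here (head count, MLP ratio, class name) is an illustrative assumption."""
    def __init__(self, d_model=1280, n_layers=30, n_heads=20):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):      # x: (B, L, d_model) embedded structure tokens
        return self.blocks(x)  # (B, L, d_model)
```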
The exact steps to convert structure tokens back to 3D all-atom coordinates using the decoder are provided in Algorithm 8 and detailed as follows:
Transformer: We embed the structure tokens and pass them through a stack of Transformer blocks $f_{\text{dec}}$ (regular self-attention + MLP sublayers, no geometric attention).
Projection Head: We use a projection head to regress three 3-D vectors per residue: a translation vector $\vec{t}$, and two vectors $-\vec{x}$ and $\vec{y}$ that define the $N-C_{\alpha}-C$ plane per residue after it has been rotated into position. This head also predicts the unnormalized sine and cosine components of up to 7 sidechain torsion angles.
Calculate $T$: We use gram_schmidt to convert $\vec{t}$, $\vec{x}$, and $\vec{y}$ into frames $T \in SE(3)^{L}$.
Calculate $T_{\text{local}}$: We normalize the sine and cosine components and convert them to frames $T_{\text{local}} \in SE(3)^{L \times 7}$ corresponding to rotations around the previous element on the sidechain.
Compose Frames: We compose each element of $T_{\text{local}}$ with its predecessors on a tree rooted at $T$ to form $T_{\text{global}} \in SE(3)^{L \times 14}$, corresponding to the transformations needed for each heavy atom per residue in atom14 representation.
Apply Frames: We then apply the frames to the reference coordinates $\vec{X}_{\text{ref}} \in \mathbb{R}^{L \times 14 \times 3}$ to rotate and translate each residue into its final position.
Algorithm 8 structure_decode
Input: $z \in \{0..4099\}^{L \times 16}$
1: $z = \operatorname{embed}(z)$ $\quad\triangleright \mathbb{R}^{L \times d}$
2: $z = f_{\text{dec}}(z)$ $\quad\triangleright \mathbb{R}^{L \times d}$
3: $\vec{t}, \vec{x}, \vec{y}, \overline{\sin \theta}, \overline{\cos \theta} = \operatorname{proj}(z)$ $\quad\triangleright (\mathbb{R}^{L \times 3})_{\times 3}, (\mathbb{R}^{L \times 7})_{\times 2}$
4: $T = \operatorname{gram\_schmidt}(\vec{t}, -\vec{x}, \vec{y})$ $\quad\triangleright SE(3)^{L}$
5: $\sin \theta = \frac{\overline{\sin \theta}}{\sqrt{\overline{\sin \theta}^{2} + \overline{\cos \theta}^{2}}}$ $\quad\triangleright [-1,1]^{L \times 7}$
6: $\cos \theta = \frac{\overline{\cos \theta}}{\sqrt{\overline{\sin \theta}^{2} + \overline{\cos \theta}^{2}}}$ $\quad\triangleright [-1,1]^{L \times 7}$
7: $T_{\text{local}} = \operatorname{rot\_frames}(\sin \theta, \cos \theta)$ $\quad\triangleright SE(3)^{L \times 7}$
8: $T_{\text{global}} = \operatorname{compose}(T_{\text{local}}, T)$ $\quad\triangleright SE(3)^{L \times 14}$
9: $\vec{X} = T_{\text{global}}(\vec{X}_{\text{ref}})$ $\quad\triangleright \mathbb{R}^{L \times 14 \times 3}$
Please explain this in detail to a beginning grad student in machine learning.
Response:

Introduction

In this explanation, we'll break down the steps involved in the decoder of a Vector Quantized Variational Autoencoder (VQ-VAE) as applied to protein structures. We'll focus on understanding Algorithm 8 (structure_decode), which details how to convert discrete latent tokens back into 3D all-atom coordinates of a protein. We'll assume you're familiar with basic machine learning concepts but new to this specific application.
Background
Proteins are composed of amino acids that fold into complex three-dimensional structures critical for their function. Modeling protein structures computationally is essential in bioinformatics and drug discovery.
A VQ-VAE is a type of autoencoder that learns discrete latent representations (codes) of input data. Instead of mapping inputs to a continuous latent space as in traditional autoencoders, VQ-VAE maps inputs to the nearest embedding vectors in a codebook of discrete codes. This is particularly useful in modeling data where discrete representations are meaningful.
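To make the quantization step concrete, here is a generic nearest-code lookup in PyTorch. It is a sketch of standard VQ-VAE quantization, not necessarily the exact code used in the paper.

```python
import torch

def quantize(latents, codebook):
    """latents: (N, d) encoder outputs; codebook: (K, d) code vectors.
    Returns the nearest-code index for each latent and the quantized vectors."""
    dists = torch.cdist(latents, codebook)   # (N, K) pairwise Euclidean distances
    codes = dists.argmin(dim=-1)             # (N,)  index of the closest code
    quantized = codebook[codes]              # (N, d) latents snapped to codebook entries
    return codes, quantized
```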
Overview of the Process
Encoding: The protein structure is broken down into local structures and encoded into discrete tokens using the VQ-VAE encoder.
Quantization: The continuous latent representations are quantized to the nearest code in the codebook, resulting in discrete tokens.
Decoding: The decoder takes these tokens and reconstructs the full 3D protein structure.
Focus on the Decoder (structure_decode)
The decoder's role is to reconstruct the protein's 3D structure from the discrete tokens. Let's go through Algorithm 8 step by step.
Algorithm 8: structure_decode
Inputs: z ∈ {0..4099}^{L × 16} — the discrete structure tokens produced by the encoder for a protein of length L.
Step-by-Step Explanation
Step 1: Embedding the Tokens
1: z = embed(z) # Results in z ∈ ℝ^{L × d}
Step 2: Passing through the Transformer Decoder
2: z = f_dec(z) # z still ∈ ℝ^{L × d}
Step 3: Projection to Geometric Parameters
3: t_vec, x_vec, y_vec, sin_theta_bar, cos_theta_bar = proj(z)
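As a rough picture of what proj could look like, here is a sketch of a projection head in PyTorch. Only the output shapes follow the paper; the layer sizes, class name, and layout are assumptions for illustration.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Illustrative sketch: regress three 3-D vectors (t, x, y) and
    7 unnormalized (sin, cos) torsion pairs per residue."""
    def __init__(self, d_model=1280):
        super().__init__()
        self.geom = nn.Linear(d_model, 3 * 3)      # t, x, y vectors
        self.torsions = nn.Linear(d_model, 7 * 2)  # unnormalized sin/cos per angle

    def forward(self, z):                          # z: (L, d_model)
        t, x, y = self.geom(z).chunk(3, dim=-1)    # each (L, 3)
        sin_raw, cos_raw = self.torsions(z).reshape(-1, 7, 2).unbind(-1)  # each (L, 7)
        return t, x, y, sin_raw, cos_raw
```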
Step 4: Constructing the Reference Frames
4: T = gram_schmidt(t_vec, -x_vec, y_vec) # T ∈ SE(3)^{L}
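A minimal sketch of how such a Gram-Schmidt frame construction can be implemented is below. The orthonormalization order and the sign/column conventions are assumptions; the paper's gram_schmidt may differ in those details.

```python
import torch

def gram_schmidt_frames(t, x, y, eps=1e-8):
    """Build SE(3) frames from a translation t and two direction vectors x, y.
    t, x, y: (L, 3). Returns rotations R: (L, 3, 3) and translations t: (L, 3)."""
    e1 = x / (x.norm(dim=-1, keepdim=True) + eps)
    y_perp = y - (e1 * y).sum(-1, keepdim=True) * e1        # remove the component along e1
    e2 = y_perp / (y_perp.norm(dim=-1, keepdim=True) + eps)
    e3 = torch.cross(e1, e2, dim=-1)                        # completes a right-handed basis
    R = torch.stack([e1, e2, e3], dim=-1)                   # (L, 3, 3), basis vectors as columns
    return R, t
```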
Step 5: Normalizing the Angles
5: sin_theta = sin_theta_bar / sqrt(sin_theta_bar^2 + cos_theta_bar^2) # sin_theta ∈ [-1, 1]^{L × 7}
Step 6: Normalizing the Cosine of the Angles
6: cos_theta = cos_theta_bar / sqrt(sin_theta_bar^2 + cos_theta_bar^2) # cos_theta ∈ [-1, 1]^{L × 7}
Step 7: Calculating Local Transformation Frames
7: T_local = rot_frames(sin_theta, cos_theta) # T_local ∈ SE(3)^{L × 7}
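One way to realize rot_frames is to build, for each torsion angle, a rotation about a fixed axis of the parent frame; rotating about the local x-axis is a common convention in all-atom reconstruction pipelines. The axis choice and layout below are assumptions for illustration.

```python
import torch

def rot_frames(sin_t, cos_t):
    """sin_t, cos_t: (L, 7) normalized torsion components.
    Returns (L, 7, 3, 3) rotation matrices about the local x-axis (assumed convention)."""
    L, K = sin_t.shape
    R = torch.zeros(L, K, 3, 3)
    R[..., 0, 0] = 1.0
    R[..., 1, 1], R[..., 1, 2] = cos_t, -sin_t
    R[..., 2, 1], R[..., 2, 2] = sin_t, cos_t
    return R
```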
Step 8: Composing Global Transformation Frames
8: T_global = compose(T_local, T) # T_global ∈ SE(3)^{L × 14}
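Composing two rigid transforms means applying one after the other; along the sidechain tree, each local frame is composed with its already-composed parent, starting from the backbone frame at the root. A minimal sketch of the composition rule (with rotation/translation pairs; the tree bookkeeping in the comment uses hypothetical names):

```python
import torch

def compose(R_a, t_a, R_b, t_b):
    """Compose rigid transforms (R_a, t_a) ∘ (R_b, t_b): apply b first, then a.
    R_*: (..., 3, 3), t_*: (..., 3)."""
    R = R_a @ R_b
    t = (R_a @ t_b.unsqueeze(-1)).squeeze(-1) + t_a
    return R, t

# Accumulating along a (hypothetical) parent tree, root = backbone frame T:
# for i in topological_order:
#     R_global[i], t_global[i] = compose(R_global[parent[i]], t_global[parent[i]],
#                                        R_local[i], t_local[i])
```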
Step 9: Applying Transformations to Reference Coordinates
9: X = T_global(X_ref) # X ∈ ℝ^{L × 14 × 3}
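Applying a frame to a point is just "rotate, then translate." A minimal sketch, assuming the global frames are stored as one rotation/translation pair per atom14 slot:

```python
import torch

def apply_frames(R, t, x_ref):
    """R: (L, 14, 3, 3), t: (L, 14, 3), x_ref: (L, 14, 3) idealized reference
    coordinates. Returns the final atom positions, shape (L, 14, 3)."""
    return (R @ x_ref.unsqueeze(-1)).squeeze(-1) + t
```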
Summary of the Steps

The decoder embeds the discrete tokens, contextualizes them with a Transformer, and projects the result to geometric quantities: a backbone frame per residue (via Gram-Schmidt) and up to 7 sidechain torsion angles (via normalized sine/cosine pairs). The torsions are converted to local frames, composed along the sidechain tree rooted at the backbone frame, and finally applied to idealized reference coordinates to place the 14 heavy-atom slots of each residue.

Key Concepts Explained

SE(3) frames: rigid-body transformations (a rotation plus a translation) in 3D; each residue backbone and each sidechain element gets one.
Torsion angles: rotations around chemical bonds that determine sidechain conformation; predicting (sin, cos) pairs and normalizing them avoids angle-wraparound issues.
atom14 representation: a compact per-residue layout with 14 slots for heavy atoms, so every residue type fits a fixed-size tensor.
Conclusion
The decoder in this VQ-VAE model translates discrete tokens back into a full 3D protein structure by predicting geometric parameters and applying a series of transformations. This process leverages both advanced neural network architectures (Transformers) and geometric principles (rotations and coordinate frames) to accurately reconstruct complex molecular structures.
By understanding each step in structure_decode, a beginning graduate student can appreciate how machine learning models can integrate deep learning with domain-specific knowledge (such as protein chemistry) to solve complex problems.
Feel free to ask if you have any questions or need further clarification on any of the steps or concepts!