\section*{A.1.7.3.2. Stage 2.}
In the second stage of VQ-VAE training, the encoder and codebook are frozen and a new, deeper, decoder is trained. This second stage of training has multiple purposes. First, a larger decoder improves reconstruction quality. Second, augmented structure tokens from ESM3 are added to enable learning pAE and pLDDT heads. Third, we add sequence conditioning and train with all-atom geometric losses to be able to decode all-atom protein structures. Fourth, we extend the context length of the decoder to be able to decode multimers and larger single chain proteins.
Training data for stage 2 consists of predicted structures in AFDB and ESMAtlas, as well as single chain, multimer, and antibody-antigen complexes from the PDB. Sequence conditioning was added to the decoder via learned embeddings which are summed with structure token embeddings at the input to the decoder stack.
The structure token decoder was trained in three stages: $2 \mathrm{~A}$, 2B, 2C detailed in Table S2. The purpose of stage 2A is to efficiently learn decoding of all-atom structures. Enhanced training efficiency is achieved by keeping a short context length and omitting the pAE and pLDDT losses, which are both memory-consuming and can be in competition with strong reconstruction quality. In stage $2 \mathrm{~B}$, we add the pAE and pLDDT losses. These structure confidence heads cannot be well-calibrated unless structure tokens are augmented such that ESM3-predicted structure tokens are within the training distribution. To this end, for stages $2 \mathrm{~B}$ and $2 \mathrm{C}$ we replace ground truth structure tokens with ESM3-predicted structure tokens $50 \%$ of the time. In stage $2 \mathrm{C}$, we extend context length to 2048 and upsample experimental structures relative to predicted structures.
All-atom Distance Loss: We generalize the Backbone Distance Loss to all atoms by computing a pairwise $L_{2}$ distance matrix for all 14 atoms in the atom14 representation of each residue. This results in $D_{\text{pred}}, D \in \mathbb{R}^{14L \times 14L}$. The rest of the computation follows as before: $\left(D_{\text{pred}}-D\right)^{2}$, clamping to $(5 \AA)^{2}$, and taking the mean, while masking invalid pairs (where any atom14 representations are "empty").
All-atom Direction Loss: We extend the Backbone Direction Loss to all heavy atoms:
(a) Compute a pairwise distance matrix per residue from the 3D coordinates of each atom in atom14 representation, resulting in $\mathbb{R}^{L \times 14 \times 14}$.
(b) Mark atoms less than $2 \AA$ apart (excluding self) as covalent bonds.
(c) Filter to keep atoms with at least 2 covalent bonds, keeping only the first 2 bonds per atom, with ordering determined by the atom14 representation.
(d) For each selected atom, compute a normal (z-axis) vector to the plane spanned by its two covalent bonds, resulting in three vectors per selected atom.
(e) Randomly subsample to 10,000 vectors per protein if the number exceeds 10,000, ensuring the same vectors are sampled in both predicted and ground truth structures.
(f) Compute all-to-all pairwise dot products, forming $D_{\text{pred}}, D \in \mathbb{R}^{n \times n}$. Compute $\left(D_{\text{pred}}-D\right)^{2}$, clamp the max to 20, and take the mean.
- pLDDT Head: Uses a Regression Head with 50 output classes (each capturing 0.02 lDDT units on a 0-to-1 scale). Predicted structures are compared to ground truth to calculate per-residue lDDT values, which are supervised with cross-entropy loss.
- pAE Head: Uses a Pairwise Projection Head to produce 64 logits per residue pair $\in \mathbb{R}^{L \times L \times d}$, converting to probabilities $p$ via softmax. Each probability corresponds to a bin representing $0.5 \AA$ of positional error, with centers $[0.25, 0.75, \ldots, 31.25, 31.75]$.
Please explain this in detail to a beginning grad student in machine learning.
Response:
### Introduction to Stage 2 of VQ-VAE Training for Protein Structures
#### Overview
In this explanation, we'll delve into the second stage of training a Vector Quantized Variational AutoEncoder (VQ-VAE) designed for protein structures. By the end, you'll understand:
- The objectives of Stage 2 training.
- The enhancements made to the decoder.
- Introduction of new losses and why they're important.
- Details about the pLDDT and pAE heads.
Let's start by setting the context.
#### Background Concepts
##### VQ-VAE in Protein Modeling
- VQ-VAE: A type of autoencoder that uses a discrete latent space, quantizing the encoder outputs into a finite set of "codebook" vectors.
- Protein Structures: Proteins are complex molecules made up of amino acids. Understanding their 3D structures is crucial for many biological applications.
- Structure Tokens: Discrete representations (codes) of protein structures obtained from the encoder and quantized via the codebook.
##### Why VQ-VAE for Proteins?
- By representing protein structures as sequences of discrete tokens, we can leverage sequence modeling techniques (like transformers) for structure prediction and generation.
#### Goals of Stage 2 Training
In Stage 2, we aim to enhance the VQ-VAE to better handle complex protein structures and improve prediction confidence. The main goals are:
- Improve Reconstruction Quality: By using a larger decoder, we can capture more complex patterns in protein structures.
- Enable Confidence Predictions: Incorporate pLDDT (Predicted Local Distance Difference Test) and pAE (Predicted Aligned Error) heads to estimate the confidence of our structure predictions.
- All-Atom Modeling: Move from modeling just the backbone atoms to all atoms in the protein, providing a more detailed representation.
- Handle Larger Proteins and Complexes: Extend the context length to model larger proteins, including multimers (complexes of multiple protein chains).
#### Training Data for Stage 2
- Predicted Structures: From databases like AlphaFold Database (AFDB) and ESMAtlas, containing protein structures predicted by deep learning methods.
- Experimental Structures: Directly from the Protein Data Bank (PDB), including single chains, multimers, and complexes like antibody-antigen interactions.
#### Enhancements in Stage 2
##### 1. Freezing the Encoder and Codebook
- Why Freeze?: We assume the encoder and codebook have already learned a good quantization of protein structures.
- Focus on Decoder: By freezing them, we can focus on improving the decoder without changing the latent space.
##### 2. Larger Decoder
- Deeper Architecture: A bigger decoder can model more complex dependencies in the data.
- Sequence Conditioning: Introduce sequence embeddings (learned representations of amino acid sequences) that are combined with structure tokens to improve decoding.
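To make this conditioning concrete, here is a minimal PyTorch sketch of summing learned sequence embeddings with structure-token embeddings at the decoder input. The class name, vocabulary sizes, and model dimension are illustrative placeholders rather than the paper's actual values:

```python
import torch
import torch.nn as nn

class DecoderInputEmbedding(nn.Module):
    """Sum of structure-token and amino-acid (sequence) embeddings."""

    def __init__(self, n_structure_tokens=4096, n_sequence_tokens=32, d_model=1024):
        super().__init__()
        self.structure_embed = nn.Embedding(n_structure_tokens, d_model)
        self.sequence_embed = nn.Embedding(n_sequence_tokens, d_model)

    def forward(self, structure_tokens, sequence_tokens):
        # Both inputs are [batch, length] integer token ids; the summed
        # embedding is what the decoder transformer stack consumes.
        return self.structure_embed(structure_tokens) + self.sequence_embed(sequence_tokens)
```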
##### 3. Structure Confidence Heads
- pLDDT Head: Predicts the confidence (per residue) of the predicted structure.
- pAE Head: Predicts positional errors between predicted and true structures.
##### 4. All-Atom Modeling
- From Backbone to All-Atom: Previously, we might have focused only on the protein backbone (main chain). Now, we include all side-chain atoms for detailed modeling.
##### 5. Longer Context Length
- Context Length: The number of tokens the model can process at once.
- Increased to 2048: Allows the model to handle longer protein sequences and complexes.
#### Training Stages Within Stage 2
The training is split into three sub-stages: 2A, 2B, and 2C.
##### Stage 2A
- Objective: Efficiently learn to decode all-atom structures.
- Approach:
- Short Context Length: Keeps computational requirements manageable.
- Exclude pAE and pLDDT Losses: These are computationally heavy and might interfere with learning good reconstructions early on.
##### Stage 2B
- Objective: Introduce confidence predictions.
- Approach:
- Add pAE and pLDDT Losses.
- Augment Structure Tokens: Replace ground truth tokens with predicted ones from ESM3 (a protein structure prediction model) 50% of the time. This helps the model learn to handle variability in predictions.
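As a rough sketch, this augmentation might look like the following, assuming the 50% replacement is applied per training example (the paper does not specify the exact granularity); names are illustrative:

```python
import torch

def maybe_use_predicted_tokens(gt_tokens, esm3_tokens, p_replace=0.5):
    """Return ESM3-predicted structure tokens instead of ground-truth tokens
    with probability p_replace, so the decoder and its confidence heads see
    the kind of tokens they will receive at inference time."""
    if torch.rand(()) < p_replace:
        return esm3_tokens
    return gt_tokens
```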
##### Stage 2C
- Objective: Prepare the model for larger proteins and complexes.
- Approach:
- Extend Context Length to 2048.
- Upsample Experimental Structures: Give more weight to real structures from PDB to improve accuracy on experimental data.
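One common way to implement this kind of upsampling is a weighted sampler. The sketch below is only an assumption about the mechanism (the paper does not say how upsampling is implemented), and the 3x weight is a placeholder, not the actual ratio:

```python
import torch
from torch.utils.data import WeightedRandomSampler

def make_upsampling_sampler(sources, experimental_weight=3.0):
    """sources: one label per training example, e.g. "pdb", "afdb", "esmatlas".
    Experimental (PDB) examples are drawn more often than predicted ones."""
    weights = torch.tensor(
        [experimental_weight if source == "pdb" else 1.0 for source in sources]
    )
    return WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
```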
#### Loss Functions and Model Heads
Now, let's dive into the specific losses and model components introduced in Stage 2.
##### 1. All-Atom Distance Loss
- Purpose: Encourage the model to accurately predict the positions of all atoms, not just the backbone.
- Atom14 Representation: Each residue is represented by up to 14 atom positions (backbone plus side-chain atoms); slots not used by a given residue type are left empty.
- Computation:
- Pairwise Distances: Compute the L2 (Euclidean) distances between all pairs of atoms.
- D_pred: Predicted distance matrix.
- D: Ground truth distance matrix.
- Size: $\mathbb{R}^{14L \times 14L}$, where $L$ is the sequence length.
- Loss: Compute the squared difference $(D_{\text{pred}} - D)^2$, clamp values above $(5\,\text{Å})^2$ to limit the impact of large errors, and take the mean.
- Masking: Ignore invalid pairs (e.g., missing atoms in certain residues).
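To make this concrete, here is a minimal PyTorch sketch of the All-Atom Distance Loss under the assumptions above. The function and argument names are ours (the paper does not provide code), and details such as batching may differ from the actual implementation:

```python
import torch

def all_atom_distance_loss(pred_xyz, true_xyz, atom_mask, clamp_dist=5.0):
    """pred_xyz, true_xyz: [L, 14, 3] atom14 coordinates; atom_mask: [L, 14] bool."""
    L = pred_xyz.shape[0]
    pred = pred_xyz.reshape(L * 14, 3)
    true = true_xyz.reshape(L * 14, 3)
    mask = atom_mask.reshape(L * 14)

    # Pairwise L2 distance matrices of shape [14L, 14L].
    d_pred = torch.cdist(pred, pred)
    d_true = torch.cdist(true, true)

    # Squared error, clamped at (5 Å)^2 so large deviations do not dominate.
    sq_err = ((d_pred - d_true) ** 2).clamp(max=clamp_dist**2)

    # Keep only pairs where both atom14 slots are actually occupied.
    pair_mask = mask[:, None] & mask[None, :]
    return sq_err[pair_mask].mean()
```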
##### 2. All-Atom Direction Loss
- Purpose: Encourage correct orientation and geometry of the atoms relative to each other.
- Steps:
- Pairwise Distances per Residue:
- Compute the distance matrix between atoms within each residue ($\mathbb{R}^{L \times 14 \times 14}$).
- Identify Covalent Bonds:
- Atoms less than 2 Å apart (excluding self) are considered covalently bonded.
- Filter Atoms:
- Keep atoms with at least two covalent bonds (necessary for defining a plane).
- Keep only the first two bonds per atom for consistency.
- Compute Normal Vectors:
- For each selected atom, compute the normal vector (z-axis) to the plane defined by its two bonds.
- This captures the orientation of the atom in 3D space.
- Subsampling:
- If there are more than 10,000 vectors (which can happen in large proteins), randomly select 10,000 to reduce computational load.
- Ensure the same vectors are used for both predicted and ground truth structures for consistency.
- Compute Pairwise Dot Products:
- Form matrices $D_{\text{pred}}, D \in \mathbb{R}^{n \times n}$, where $n$ is the number of vectors (up to 10,000).
- Each element is the dot product between pairs of normal vectors, measuring relative orientations.
- Compute Loss:
- Calculate $(D_{\text{pred}} - D)^2$, clamp values above 20 to limit large errors, and take the mean.
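The PyTorch sketch below follows the steps above. It assumes that the bond topology is derived from the ground-truth coordinates (so prediction and ground truth contribute exactly the same set of vectors) and that the vectors are normalized to unit length before taking dot products; these details, and all names, are our assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def _bond_partners(xyz, atom_mask, cutoff=2.0):
    """Per atom, find the first two covalently bonded partners (atom14 ordering)."""
    d = torch.cdist(xyz, xyz)                                    # [L, 14, 14] per-residue distances
    eye = torch.eye(14, dtype=torch.bool, device=xyz.device).unsqueeze(0)
    bonded = (d < cutoff) & atom_mask[:, :, None] & atom_mask[:, None, :] & ~eye
    keep = bonded.sum(dim=-1) >= 2                               # atoms with at least 2 bonds
    idx = torch.arange(14, device=xyz.device).view(1, 1, 14).expand_as(bonded)
    partners = torch.where(bonded, idx, torch.full_like(idx, 14))
    # First two partners in atom14 order; clamp gives a dummy index for dropped atoms.
    first_two = partners.sort(dim=-1).values[..., :2].clamp(max=13)
    return keep, first_two

def _direction_vectors(xyz, keep, first_two):
    """Two bond vectors plus their plane normal for every selected atom -> [3n, 3]."""
    gathered = torch.gather(
        xyz.unsqueeze(1).expand(-1, 14, -1, -1), 2,
        first_two.unsqueeze(-1).expand(-1, -1, -1, 3),
    )                                                            # [L, 14, 2, 3]
    v1 = gathered[:, :, 0] - xyz
    v2 = gathered[:, :, 1] - xyz
    normal = torch.cross(v1, v2, dim=-1)                         # z-axis of the bond plane
    vecs = F.normalize(torch.stack([v1, v2, normal], dim=2), dim=-1)
    return vecs[keep].reshape(-1, 3)

def all_atom_direction_loss(pred_xyz, true_xyz, atom_mask, max_vecs=10_000, clamp_val=20.0):
    """pred_xyz, true_xyz: [L, 14, 3]; atom_mask: [L, 14] bool."""
    keep, first_two = _bond_partners(true_xyz, atom_mask)
    v_pred = _direction_vectors(pred_xyz, keep, first_two)
    v_true = _direction_vectors(true_xyz, keep, first_two)

    if v_true.shape[0] > max_vecs:                               # shared random subsample
        sel = torch.randperm(v_true.shape[0])[:max_vecs]
        v_pred, v_true = v_pred[sel], v_true[sel]

    d_pred = v_pred @ v_pred.T                                   # all-to-all dot products
    d_true = v_true @ v_true.T
    return ((d_pred - d_true) ** 2).clamp(max=clamp_val).mean()
```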
##### 3. pLDDT Head (Predicted Local Distance Difference Test)
- Purpose: Provide a per-residue confidence score for the predicted structure, similar to how AlphaFold predicts confidence.
- Implementation:
- Regression Head: Outputs a score for each residue.
- 50 Output Classes: Each class covers 1/50 of the lDDT range, i.e., a bin of width 0.02 on a 0-to-1 scale (equivalently, 2 units on the conventional 0-to-100 scale).
- Supervision: Compare the predicted structure to the ground truth to compute the actual per-residue lDDT values, then supervise the head with a cross-entropy loss over the bins.
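A minimal sketch of the pLDDT supervision, assuming lDDT targets on a 0-to-1 scale and omitting the lDDT computation itself; names are illustrative:

```python
import torch
import torch.nn.functional as F

def plddt_loss(plddt_logits, lddt_targets):
    """plddt_logits: [L, 50]; lddt_targets: [L] true per-residue lDDT in [0, 1]."""
    # 50 bins of width 0.02 over [0, 1]; clamp so lDDT == 1.0 falls in bin 49.
    target_bins = (lddt_targets * 50).long().clamp(max=49)
    return F.cross_entropy(plddt_logits, target_bins)
```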
##### 4. pAE Head (Predicted Aligned Error)
- Purpose: Predict the expected positional error between the predicted structure and the true structure for each pair of residues.
- Implementation:
- Pairwise Projection Head: For each pair of residues, output logits representing different bins of positional error.
- 64 Logits Per Residue Pair: Corresponds to bins of 0.5 Å from 0 Å to 32 Å.
- Bin centers are at 0.25 Å, 0.75 Å, …, up to 31.75 Å.
- Softmax over Logits: Converts logits into probabilities $p$.
- Supervision: Use the ground truth errors to compute the target bin and apply a cross-entropy loss.
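And a matching sketch for the pAE head: binning the ground-truth pairwise errors for the cross-entropy loss, plus the common trick of reading out an expected error from the bin probabilities (the latter is standard practice for binned error heads and an assumption here, not something the paper spells out):

```python
import torch
import torch.nn.functional as F

def pae_loss(pae_logits, pairwise_error, bin_width=0.5, n_bins=64):
    """pae_logits: [L, L, 64]; pairwise_error: [L, L] ground-truth aligned error in Å."""
    # Bin index of each error; errors beyond the last bin fall into bin 63.
    target_bins = (pairwise_error / bin_width).long().clamp(max=n_bins - 1)
    return F.cross_entropy(pae_logits.reshape(-1, n_bins), target_bins.reshape(-1))

def expected_pae(pae_logits, bin_width=0.5):
    """Expected aligned error per residue pair (in Å) from the bin probabilities."""
    n_bins = pae_logits.shape[-1]
    centers = torch.arange(n_bins, device=pae_logits.device) * bin_width + bin_width / 2
    probs = pae_logits.softmax(dim=-1)      # [L, L, 64]
    return (probs * centers).sum(dim=-1)    # [L, L]
```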
#### Why These Losses and Heads?
- All-Atom Losses: Capturing detailed atomic interactions improves the physical plausibility of the predicted structures.
- Direction Loss: Ensures that the geometry and orientation of side chains and other structural features are accurate.
- pLDDT and pAE: Providing confidence measures is crucial in applications where understanding the reliability of predictions is important (e.g., drug design).
#### Handling ESM3-Predicted Structure Tokens
- Why Use Predicted Tokens?: Incorporating structures obtained from another model (ESM3) helps the decoder learn to handle variability and errors present in predicted structures, not just perfect ground truth data.
- 50% Replacement: By replacing ground truth tokens with ESM3 predictions half of the time, the model becomes robust to discrepancies between predicted and true structures.
- Calibrated Confidence Predictions: This data augmentation is essential for the pLDDT and pAE heads to make meaningful confidence estimates that reflect real-world scenarios where predictions are not perfect.
#### Extending Context Length
- From Short to Long Contexts:
- Initial Training (Stage 2A): Shorter sequences allow for faster and more efficient training.
- Final Training (Stage 2C): Increasing the context length to 2048 tokens enables the model to process and generate longer protein sequences and complexes.
- Upsampling Experimental Structures:
- Why Upsample?: Experimental structures from the PDB are of high quality and provide valuable information that might not be present in predicted structures.
- Effect: The model learns to produce high-quality outputs that are consistent with experimentally determined structures.
#### Summary
In Stage 2 of training the VQ-VAE for protein structures, we make significant enhancements to the decoder to improve its ability to reconstruct accurate and detailed protein structures. By introducing new losses that focus on all-atom distances and orientations, as well as confidence prediction heads (pLDDT and pAE), we enable the model to produce not just accurate structures but also estimates of how reliable those structures are.
Data augmentation with ESM3-predicted tokens helps the model generalize to the types of errors present in real-world predictions, leading to better calibration of the confidence measures. Extending the context length and focusing on experimental structures ensure that the model is capable of handling the complexity of large proteins and protein complexes.
#### Key Takeaways for a Machine Learning Graduate Student
- Model Freezing: Fixing certain parts of the model (encoder and codebook) allows focused improvements on other parts (decoder) without disrupting learned representations.
- Data Augmentation: Introducing variability in the input data (using ESM3 predictions) helps the model generalize and improves robustness.
- Multi-Stage Training: Breaking down training into sub-stages with specific objectives can help manage computational resources and improve learning efficiency.
- Loss Function Design: Custom losses that reflect the properties of the data (e.g., geometric constraints in proteins) are crucial for domain-specific applications.
- Confidence Estimation: Predicting not just outputs but also confidence in those outputs is important for practical applications where uncertainty matters.
#### Further Reading
- Vector Quantized Variational Autoencoders (VQ-VAE): Understanding the basics of this model architecture will provide deeper insights into how discrete latent spaces are learned.
- Protein Structure Prediction: Familiarize yourself with the methods used in predicting protein structures, including models like AlphaFold and ESM.
- Transformer Models: Since the decoder likely uses a transformer architecture, reviewing attention mechanisms and how transformers handle sequence data will be beneficial.
- Loss Functions in Deep Learning: Exploring different loss functions and when to use them can help in designing models for specific tasks.
I hope this detailed explanation clarifies the concepts and methodologies employed in Stage 2 of training the VQ-VAE for protein structures. If you have any questions or need further elaboration on any of the points, feel free to ask!