\section*{A.1.7.4. EVALUATION}
We evaluate the reconstruction quality of the structure tokenizer after stage 1 and stage 2 of training using a set of CAMEO, CASP14, and CASP15 proteins taken after the training cutoff date (Fig. S3). Both decoders consistently reach RMSD $<1$ \AA{} and LDDT-CA $>0.98$. The retraining of the structure token decoder results in substantial improvements in reconstruction quality across all test sets. The stage 2 decoder, trained with an all-atom reconstruction loss and a sequence input, achieves strong all-atom reconstruction as well (Fig. S3C). We also visualize a random sample of backbone reconstructions on the CAMEO test set (Fig. S4A). Looking at the proteins with worse reconstruction quality, we find that long regions with few tertiary contacts, disordered regions, and unresolved coordinates can lead to inaccurate global orientation of structural elements, while local structure reconstruction remains largely error-free (Fig. S4B). This behavior can be explained by the fact that the tokenizer relies on tertiary contacts to resolve the global orientation of a residue.

\begin{tabular}{llccccl}
\hline
Stage & Steps & All-atom geometric losses & pAE and pLDDT losses & With ESM3-predicted tokens & Length & Data mixture \\
\hline
2A & $90\mathrm{k}$ & $\checkmark$ & $\times$ & $\times$ & 512 & Roughly uniform sampling of predicted and experimental structures \\
2B & $20\mathrm{k}$ & $\checkmark$ & $\checkmark$ & $\checkmark$ & 512 & Roughly uniform sampling of predicted and experimental structures \\
2C & $30\mathrm{k}$ & $\checkmark$ & $\checkmark$ & & 2048 & Upsampling experimental structures \\
\hline
\end{tabular}

Table S2. Training details for stage 2 training of an all-atom structure token decoder.
We also investigate the vocabulary learned by the structure tokenizer by visualizing the local neighborhoods which map to the same learned structure token. We find that many structure tokens encode semantically coherent sets of local neighborhoods (Fig. S5A). However, a number of tokens appear to represent multiple local neighborhoods (Fig. S5B). While the meaning of a single token may be ambiguous, the high-fidelity reconstruction quality from the decoder suggests that it is able to disambiguate given surrounding context in the full set of structure tokens.
We assess the calibration of the structure confidence heads on the CAMEO test set using structure tokens predicted by ESM3 7B; Fig. S6 indicates that pLDDT and pTM are well calibrated. Most pLDDT predictions lie along the diagonal, though with a small bias towards overconfidence. Because pTM is a pessimistic estimator of the TM-score, it is biased downwards. Anecdotally, we also find that pLDDT can be poorly calibrated for some generated sequences, particularly in alpha-helical regions, where it can be an overestimate.
Please explain this in detail to a beginning grad student in machine learning.
Response: Understanding the Evaluation of a Structure Tokenizer in Protein Modeling
Introduction
As a beginning graduate student in machine learning, you're likely familiar with the basics of deep learning models and their applications. In computational biology, one exciting application area is protein structure prediction and reconstruction. This domain leverages machine learning to predict the three-dimensional structures of proteins based on their amino acid sequences.
In the provided text, the authors are evaluating the performance of a structure tokenizer—a model that encodes protein structures into discrete tokens, much like how words are tokenized in natural language processing. These tokens can then be used for various tasks, such as generating new protein sequences or reconstructing protein structures.
This explanation will break down the key components of the evaluation, the results, and their implications, making the content accessible and clear.
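To ground the tokenization analogy before diving in, here is a minimal, hypothetical sketch of the core operation of a vector-quantized (VQ) tokenizer: continuous per-residue embeddings (which in the real model would come from a learned geometric encoder) are snapped to the nearest entry of a learned codebook, yielding one discrete token per residue. All sizes and names below are illustrative assumptions, not ESM3's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4096 structure tokens, 128-dim codes, a 350-residue protein.
codebook = rng.normal(size=(4096, 128))
residue_embeddings = rng.normal(size=(350, 128))  # would come from a learned encoder

# Tokenize: assign each residue the index of its nearest codebook vector.
# ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, computed without a (350, 4096, 128) tensor.
sq_dists = (
    (residue_embeddings ** 2).sum(axis=1, keepdims=True)
    - 2.0 * residue_embeddings @ codebook.T
    + (codebook ** 2).sum(axis=1)
)
structure_tokens = sq_dists.argmin(axis=1)  # shape (350,): one discrete token per residue
print(structure_tokens[:10])
```

The decoder's job is then the inverse mapping: given the token sequence, reconstruct the 3D coordinates.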
1. Overview of the Evaluation
The authors assess the reconstruction quality of their structure tokenizer after both Stage 1 and Stage 2 of training.
They use test sets comprising proteins from the CAMEO, CASP14, and CASP15 datasets, selected from after the training cutoff date so that none were part of the training data, ensuring an unbiased evaluation.
Key Metrics Used:
Root Mean Square Deviation (RMSD): The root-mean-square distance between corresponding atoms after the two structures are optimally superimposed. A lower RMSD indicates higher structural similarity between the reconstructed and original structures; less than 1 Ångström (Å) is considered highly accurate.
Local Distance Difference Test on C-alpha atoms (LDDT-CA): A superposition-free score from 0 to 1 that measures how well local inter-residue distances are preserved between two structures. A score above 0.98 indicates excellent local structural accuracy. (Both metrics are sketched in code after this list.)
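As promised above, here is a minimal sketch of both metrics, assuming numpy arrays of C-alpha coordinates. The LDDT shown is a simplified global variant of the published definition, not the official implementation.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) C-alpha coordinate arrays after optimal
    rigid-body superposition (Kabsch algorithm)."""
    P = P - P.mean(axis=0)                     # center both structures
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return np.sqrt(((P @ R.T - Q) ** 2).sum(axis=1).mean())

def lddt_ca(ref, model, cutoff=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified global LDDT-CA: the fraction of reference inter-residue
    distances under `cutoff` that the model preserves, averaged over the
    standard tolerance thresholds. Superposition-free by construction."""
    d_ref = np.linalg.norm(ref[:, None] - ref[None, :], axis=-1)
    d_mod = np.linalg.norm(model[:, None] - model[None, :], axis=-1)
    mask = (d_ref < cutoff) & ~np.eye(len(ref), dtype=bool)
    diffs = np.abs(d_ref - d_mod)[mask]
    return float(np.mean([(diffs < t).mean() for t in thresholds]))

# Toy check: a structure perturbed by small noise scores near-perfectly.
rng = np.random.default_rng(0)
native = rng.normal(size=(100, 3)) * 10.0
model = native + rng.normal(scale=0.3, size=native.shape)
print(kabsch_rmsd(native, model), lddt_ca(native, model))
```

Note why both are reported: RMSD is dominated by global placement errors, while LDDT-CA ignores global superposition and scores only local geometry, which is exactly the distinction the error analysis below turns on.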
2. Results of the Evaluation
High Reconstruction Accuracy:
Stage 1 and Stage 2 Performance: Both the initial and retrained decoders achieve high reconstruction fidelity, with RMSD values below 1 Å and LDDT-CA scores above 0.98. This demonstrates that the tokens preserve enough information to rebuild protein backbones accurately.
Improvement After Retraining: After Stage 2 retraining, there's a significant improvement in reconstruction quality across all test sets. The retrained decoder can reconstruct not only the backbone atoms but also the side-chain atoms (all-atom reconstruction), indicating a more precise and detailed structural recovery.
Visualizing Reconstructions:
The authors visualize random samples of reconstructed protein backbones from the CAMEO test set (referred to as Fig. S4A in the original text).
Observing Errors: For proteins with lower reconstruction quality, errors are often localized to specific regions rather than the entire structure.
3. Analysis of Reconstruction Errors
Common Sources of Errors:
Long Regions with Few Tertiary Contacts: Regions whose residues make few spatial contacts with the rest of the protein are hard to position accurately within the global structure (a heuristic for counting such contacts is sketched after this list).
Disordered Regions: Flexible or unstructured regions in proteins that lack a fixed conformation can lead to inaccuracies during reconstruction.
Unresolved Coordinates: Parts of the protein structure that are not well-defined in experimental data can introduce errors when reconstructing.
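To make "few tertiary contacts" concrete, a common heuristic counts, for each residue, spatial neighbors that are distant in sequence. The 8 Å and sequence-separation thresholds below are conventional choices, not taken from the paper.

```python
import numpy as np

def tertiary_contacts_per_residue(ca, dist_cutoff=8.0, min_seq_sep=12):
    """For each residue, count C-alpha neighbors within `dist_cutoff`
    Angstroms that are at least `min_seq_sep` positions away in sequence.
    Low counts flag regions whose global placement is weakly constrained."""
    d = np.linalg.norm(ca[:, None] - ca[None, :], axis=-1)
    i, j = np.indices(d.shape)
    contacts = (d < dist_cutoff) & (np.abs(i - j) >= min_seq_sep)
    return contacts.sum(axis=1)

# Usage: counts = tertiary_contacts_per_residue(ca_coords)  # ca_coords: (N, 3)
```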
Impact on Global Orientation: Errors in these regions tend to misplace whole structural elements relative to one another, while the local geometry within each element remains largely correct.
Explanation: The tokenizer relies on tertiary contacts to resolve the global orientation of a residue. Where such contacts are sparse or missing, the relative placement of structural elements is underdetermined, even though each local neighborhood is encoded faithfully.
4. Investigating the Learned Vocabulary
Understanding Tokens: Each structure token encodes the local structural neighborhood around a single residue, so the learned vocabulary can be inspected by collecting all neighborhoods that map to the same token.
Semantic Coherence: Many tokens correspond to semantically coherent sets of local neighborhoods (Fig. S5A), suggesting the vocabulary captures reusable, interpretable structural motifs.
Ambiguity in Tokens:
Multiple Representations: Some tokens are associated with multiple different local neighborhoods, indicating that the token is ambiguous.
Decoder's Role: Despite this ambiguity, the decoder effectively reconstructs accurate structures by leveraging the context provided by surrounding tokens. This is similar to how the meaning of a word in a sentence can be clarified by its context.
5. Calibration of Confidence Measures
Importance of Calibration: A confidence score is well-calibrated when predicted confidence matches observed accuracy; without calibration, users cannot tell which predictions to trust.
pLDDT and pTM Metrics:
pLDDT (Predicted Local Distance Difference Test): Provides per-residue confidence scores, indicating how accurate the model expects each part of the structure to be.
pTM (Predicted Template Modeling score): Offers a global confidence score, estimating the overall accuracy of the predicted protein structure compared to the true structure.
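For reference, the TM-score that pTM estimates has a closed form (Zhang and Skolnick, 2004). The sketch below computes the scoring term for a single superposition; the true TM-score maximizes this over superpositions, and a pTM head evaluates an expectation of it under the model's predicted pairwise errors.

```python
import numpy as np

def tm_term(d, L):
    """TM-score term for per-residue deviations `d` (in Angstroms) under a
    single superposition of an L-residue target (assumes L > 15). The true
    TM-score is the maximum of this quantity over all superpositions."""
    d0 = max(1.24 * (L - 15) ** (1.0 / 3.0) - 1.8, 0.5)  # length-dependent scale
    return float(np.mean(1.0 / (1.0 + (np.asarray(d) / d0) ** 2)))
```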
Assessment Findings:
Good Calibration of pLDDT:
When plotting predicted pLDDT scores against actual accuracy, most data points lie along the diagonal, indicating that predicted confidence matches actual accuracy.
Slight Overconfidence: There's a small bias where the model predicts slightly higher confidence than warranted.
Pessimistic Estimation of pTM:
pTM tends to underestimate the true TM score, displaying a downward bias. This means the model is more conservative in its global confidence estimates.
Specific Observations:
Alpha Helical Regions:
In some generated sequences, particularly in regions forming alpha-helices, pLDDT can be poorly calibrated.
Overestimation: The model may assign higher confidence than appropriate, suggesting overconfidence in these structured regions.
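The calibration check described in this section can be reproduced with simple binning. A minimal sketch, assuming numpy arrays of per-residue confidences and observed scores, both scaled to [0, 1]:

```python
import numpy as np

def calibration_curve(predicted, actual, n_bins=10):
    """Bin predicted confidences (e.g. pLDDT scaled to [0, 1]) and compare
    the mean prediction in each bin with the mean observed score. Bins
    where prediction exceeds observation indicate overconfidence."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(predicted, edges) - 1, 0, n_bins - 1)
    pairs = [
        (predicted[bins == b].mean(), actual[bins == b].mean())
        for b in range(n_bins) if (bins == b).any()
    ]
    return np.array(pairs)  # rows of (mean predicted, mean observed)
```

Plotting these pairs against the diagonal is exactly the "points lie along the diagonal" check the authors describe.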
6. Training Details for Stage 2 (Table S2)
Table S2 summarizes the training process during Stage 2:
Stage 2A:
Training Steps: 90,000 steps.
Losses Used: All-atom geometric losses only; the pAE and pLDDT confidence losses are not yet applied.
Data: Roughly uniform sampling of predicted and experimental structures, at length 512.
Stage 2B:
Training Steps: 20,000 steps.
Losses Used: All-atom geometric losses plus the pAE and pLDDT confidence losses.
Additional Inputs: Training now includes ESM3-predicted structure tokens, so the confidence heads see the kind of tokens the decoder will receive at inference time.
Data: Roughly uniform sampling of predicted and experimental structures, at length 512.
Stage 2C:
Training Steps: 30,000 steps.
Losses Used: All-atom geometric losses plus the pAE and pLDDT confidence losses.
Data: Upsampling of experimental structures, with the length increased to 2048 to cover longer proteins.
Purpose of Different Stages: Read together, the stages form a curriculum: Stage 2A establishes accurate all-atom geometry, Stage 2B adds confidence estimation on realistic (ESM3-predicted) token inputs, and Stage 2C adapts the decoder to longer, experimentally determined structures. A sketch of this schedule follows.
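A hypothetical configuration mirroring Table S2; the field names are invented for illustration and this is not ESM3's actual training code.

```python
# Hypothetical stage 2 schedule mirroring Table S2; names are illustrative.
STAGE2_SCHEDULE = [
    dict(stage="2A", steps=90_000, all_atom_geometric_losses=True,
         pae_plddt_losses=False, esm3_predicted_tokens=False, length=512,
         data="roughly uniform sampling of predicted and experimental structures"),
    dict(stage="2B", steps=20_000, all_atom_geometric_losses=True,
         pae_plddt_losses=True, esm3_predicted_tokens=True, length=512,
         data="roughly uniform sampling of predicted and experimental structures"),
    dict(stage="2C", steps=30_000, all_atom_geometric_losses=True,
         pae_plddt_losses=True, esm3_predicted_tokens=None,  # not specified in Table S2
         length=2048, data="upsampling of experimental structures"),
]
```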
7. Implications and Conclusions
Model Capabilities:
The structure tokenizer demonstrates strong performance in accurately reconstructing protein structures from tokens, especially after Stage 2 retraining.
The ability to reconstruct all-atom structures indicates that the model captures detailed structural information beyond the protein backbone.
Challenges and Considerations:
Global Orientation Issues: The reliance on tertiary contacts means that regions lacking these contacts pose challenges for accurate global positioning.
Token Ambiguity: While some tokens are ambiguous in isolation, the decoder resolves this by considering surrounding tokens, highlighting the importance of modeling context across the whole token sequence when capturing structural dependencies.
Confidence Calibration: Properly calibrated confidence scores are vital, but overconfidence in certain regions (like alpha helices) suggests areas for further model refinement.
Relevance to Machine Learning Practice:
Multistage Training: Shows the effectiveness of refining models in stages, incorporating different losses and data mixtures to improve performance.
Error Analysis: Emphasizes the importance of understanding where and why models err, guiding future improvements.
Model Interpretability: Investigating the learned vocabulary helps in interpreting the model's internal representations, a valuable aspect in machine learning research.
Calibration Techniques: Highlights the need for ensuring that a model's predicted confidence levels align with actual performance, which is crucial for deploying models in critical applications.
Concluding Remarks
This evaluation provides valuable insights into the performance and behavior of a structure tokenizer in protein modeling. For a graduate student in machine learning, it underscores key concepts such as model training techniques, evaluation metrics, error analysis, and the importance of confidence calibration. Understanding these principles is essential for developing robust and reliable machine learning models across various domains.