
\section*{A.1.7.4. EVALUATION}

We evaluate the reconstruction quality of the structure tokenizer after stage 1 and stage 2 of training using a set of CAMEO, CASP14, and CASP15 proteins taken after the training cutoff date (Fig. S3). Both decoders consistently reach RMSD $<1\,\AA$ and LDDT-CA $>0.98$. Retraining the structure token decoder results in substantial improvements in reconstruction quality across all test sets. The stage 2 decoder, trained with an all-atom reconstruction loss and a sequence input, achieves strong all-atom reconstruction as well (Fig. S3C). We also visualize a random sample of backbone reconstructions on the CAMEO test set (Fig. S4A). Looking at the proteins with worse reconstruction quality, we find that long regions with few tertiary contacts, disordered regions, and unresolved coordinates can lead to inaccurate global orientation of structural elements, while local structure reconstruction remains largely error-free (Fig. S4B). This behavior can be explained by the fact that the tokenizer relies on tertiary contacts to resolve the global orientation of a residue.

\begin{tabular}{llccccl}
\hline
Stage & Steps & All-atom geometric losses & pAE and pLDDT losses & With ESM3-predicted tokens & Length & Data mixture \\
\hline
2A & $90\mathrm{k}$ & $\checkmark$ & $\times$ & $\times$ & 512 & Roughly uniform sampling of predicted and experimental structures \\
2B & $20\mathrm{k}$ & $\checkmark$ & $\checkmark$ & $\checkmark$ & 512 & Roughly uniform sampling of predicted and experimental structures \\
2C & $30\mathrm{k}$ & $\checkmark$ & $\checkmark$ & & 2048 & Upsampling experimental structures \\
\hline
\end{tabular}

Table S2. Training details for stage 2 training of an all-atom structure token decoder.

We also investigate the vocabulary learned by the structure tokenizer by visualizing the local neighborhoods which map to the same learned structure token. We find that many structure tokens encode semantically coherent sets of local neighborhoods (Fig. S5A). However, a number of tokens appear to represent multiple local neighborhoods (Fig. S5B). While the meaning of a single token may be ambiguous, the high-fidelity reconstruction quality from the decoder suggests that it is able to disambiguate given surrounding context in the full set of structure tokens.

We assess the calibration of the structure confidence heads on the CAMEO test set using structure tokens predicted by ESM3 7B; Fig. S6 indicates that pLDDT and pTM are well calibrated. Most pLDDT predictions lie along the diagonal, though there is a small bias towards more confident predictions. As pTM is a pessimistic estimator of the TM-score, we find that pTM is biased downwards. Anecdotally, we also find that pLDDT can be poorly calibrated for some generated sequences, particularly in alpha-helical regions where it can be an overestimate.

Please explain this in detail to a beginning grad student in machine learning.

Response: Understanding the Evaluation of a Structure Tokenizer in Protein Modeling

Introduction

As a beginning graduate student in machine learning, you're likely familiar with the basics of deep learning models and their applications. In computational biology, one exciting application area is protein structure prediction and reconstruction. This domain leverages machine learning to predict the three-dimensional structures of proteins based on their amino acid sequences.

In the provided text, the authors are evaluating the performance of a structure tokenizer—a model that encodes protein structures into discrete tokens, much like how words are tokenized in natural language processing. These tokens can then be used for various tasks, such as generating new protein sequences or reconstructing protein structures.

This explanation will break down the key components of the evaluation, the results, and their implications, making the content accessible and clear.


1. Overview of the Evaluation

The authors assess the reconstruction quality of their structure tokenizer after two stages of training: the stage 1 decoder, and the stage 2 decoder retrained with an all-atom reconstruction loss and a sequence input (see Table S2).

They use test sets comprising proteins from the CAMEO, CASP14, and CASP15 datasets, restricted to structures released after the training cutoff date so that none were part of the training data, which keeps the evaluation unbiased.

Key Metrics Used:

- RMSD (root mean square deviation): the average distance, in Ångströms, between corresponding atoms of the reconstruction and the experimental structure after optimal superposition. Lower is better; < 1 Å is near-atomic accuracy.
- LDDT-CA (local distance difference test on C-alpha atoms): a superposition-free score in [0, 1] that measures how well local inter-residue distances are preserved. Values > 0.98 mean local geometry is reproduced almost perfectly. A minimal sketch of both metrics follows below.

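To make the two metrics concrete, here is a minimal NumPy sketch of both. This is our illustration, not the authors' evaluation code, and the LDDT version is simplified (the official score averages per residue; this one averages over all local pairs).

```python
import numpy as np

def kabsch_rmsd(pred: np.ndarray, ref: np.ndarray) -> float:
    """RMSD between (N, 3) C-alpha coordinates after optimal rigid superposition."""
    p = pred - pred.mean(axis=0)          # center both structures
    r = ref - ref.mean(axis=0)
    u, _, vt = np.linalg.svd(p.T @ r)     # Kabsch: SVD of the 3x3 covariance
    d = np.sign(np.linalg.det(u @ vt))    # guard against reflections
    rot = u @ np.diag([1.0, 1.0, d]) @ vt
    return float(np.sqrt(np.mean(np.sum((p @ rot - r) ** 2, axis=-1))))

def lddt_ca(pred: np.ndarray, ref: np.ndarray, cutoff: float = 15.0) -> float:
    """Simplified LDDT-CA: fraction of local reference distances (< cutoff)
    preserved to within 0.5 / 1 / 2 / 4 Angstrom tolerances."""
    dref = np.linalg.norm(ref[:, None] - ref[None, :], axis=-1)
    dpred = np.linalg.norm(pred[:, None] - pred[None, :], axis=-1)
    mask = (dref < cutoff) & ~np.eye(len(ref), dtype=bool)   # local pairs only
    diff = np.abs(dref - dpred)[mask]
    return float(np.mean([(diff < t).mean() for t in (0.5, 1.0, 2.0, 4.0)]))
```

Note the design difference: RMSD is superposition-dependent and global, while LDDT is superposition-free and local. This is why the error analysis below can show high LDDT alongside misplaced structural elements.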

2. Results of the Evaluation

High Reconstruction Accuracy:

- Both the stage 1 and stage 2 decoders consistently reach RMSD < 1 Å and LDDT-CA > 0.98 on the CAMEO, CASP14, and CASP15 test sets (Fig. S3), so reconstructions are nearly indistinguishable from the experimental structures.
- Retraining the decoder in stage 2 substantially improves reconstruction quality across all test sets, and because stage 2 uses an all-atom reconstruction loss and a sequence input, the decoder also achieves strong all-atom (side-chain-level) reconstruction (Fig. S3C).

Visualizing Reconstructions:

- A random sample of backbone reconstructions on the CAMEO test set (Fig. S4A) confirms the quantitative results: most structures are reproduced faithfully.


3. Analysis of Reconstruction Errors

Common Sources of Errors:

- Examining the worst-reconstructed proteins, three patterns recur: long regions with few tertiary contacts, disordered regions, and residues whose coordinates are unresolved in the experimental structure.

Impact on Global Orientation:

- These problem regions can lead to inaccurate global orientation of structural elements, while local structure reconstruction remains largely error-free (Fig. S4B): individual helices and strands look right, but their placement relative to one another can be wrong.

Explanation:

- The tokenizer relies on tertiary contacts, i.e., residues that are close in 3D space but far apart in sequence, to resolve the global orientation of a residue. Where such contacts are absent, the tokens carry too little information to pin down the global frame; one way to quantify this is sketched below.

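A plausible way to count "tertiary contacts" per residue is sketched below. The 8 Å distance cutoff and 12-residue sequence separation are common conventions in contact analysis, not values given in the paper.

```python
import numpy as np

def tertiary_contact_counts(ca: np.ndarray, dist_cutoff: float = 8.0,
                            seq_sep: int = 12) -> np.ndarray:
    """Per-residue count of long-range contacts from (N, 3) C-alpha coords:
    pairs close in space (< dist_cutoff) but far apart in sequence (>= seq_sep)."""
    dist = np.linalg.norm(ca[:, None] - ca[None, :], axis=-1)
    idx = np.arange(len(ca))
    long_range = np.abs(idx[:, None] - idx[None, :]) >= seq_sep
    return ((dist < dist_cutoff) & long_range).sum(axis=1)
```

Residues where this count is near zero (long loops, terminal tails, disordered segments) are exactly the regions the authors flag as prone to global-orientation errors.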

4. Investigating the Learned Vocabulary

Understanding Tokens:

- Each structure token encodes the local structural neighborhood of a residue. The authors visualize all neighborhoods that map to the same token, analogous to inspecting which substrings map to the same word-piece in an NLP vocabulary.

Semantic Coherence:

- Many structure tokens encode semantically coherent sets of local neighborhoods (Fig. S5A), suggesting the learned vocabulary captures meaningful structural motifs.

Ambiguity in Tokens:

- Some tokens represent multiple distinct local neighborhoods (Fig. S5B), so a single token can be ambiguous in isolation. The decoder's high-fidelity reconstructions nonetheless suggest it disambiguates from the surrounding context in the full set of structure tokens, much as a language model resolves an ambiguous word from its sentence; see the vector-quantization sketch below.

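The quantization step at the heart of such a tokenizer is standard vector quantization (VQ): the encoder's per-residue embedding is snapped to the nearest vector in a learned codebook, and that codebook index is the token. The sketch below is illustrative only; the codebook size (4096) and embedding width (128) are our assumptions, not values stated in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(4096, 128))    # stand-in for a learned codebook
embeddings = rng.normal(size=(250, 128))   # encoder output: one vector per residue

# Squared Euclidean distance from every embedding to every code,
# expanded as |e|^2 - 2 e.c + |c|^2 to avoid a huge temporary array.
d2 = ((embeddings ** 2).sum(axis=1, keepdims=True)
      - 2.0 * embeddings @ codebook.T
      + (codebook ** 2).sum(axis=1))
tokens = d2.argmin(axis=1)       # (250,) discrete structure tokens
quantized = codebook[tokens]     # decoding starts from these snapped vectors
```

Because many different embeddings can snap to the same code, a token is inherently lossy, which is exactly the ambiguity seen in Fig. S5B; the decoder recovers the lost detail from the other tokens in the sequence.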

5. Calibration of Confidence Measures

Importance of Calibration:

- A confidence score is calibrated when it matches the accuracy actually observed: among residues predicted with pLDDT ≈ 0.9, the realized LDDT should average ≈ 0.9. Calibrated confidence lets users decide which parts of a predicted structure to trust.

pLDDT and pTM Metrics:

- pLDDT is the model's per-residue prediction of the LDDT score its own structure will achieve (local confidence); pTM is its prediction of the TM-score (global, fold-level confidence). Both are produced by the structure confidence heads.

Assessment Findings:

- On the CAMEO test set, using structure tokens predicted by ESM3 7B, most pLDDT predictions lie along the diagonal of the calibration plot (Fig. S6), with a small bias towards overconfidence. pTM is biased downwards, as expected for a deliberately pessimistic estimator of the TM-score.

Specific Observations:

- Anecdotally, pLDDT can be poorly calibrated for some generated (as opposed to natural) sequences, particularly in alpha-helical regions, where it tends to overestimate quality. Calibration measured on natural test sets need not transfer to generated samples; a generic calibration check is sketched below.

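A calibration check like Fig. S6 can be reproduced in a few lines: bin predictions by confidence and compare each bin's mean predicted score with the mean score actually achieved. This is a generic reliability-diagram sketch, not the authors' code.

```python
import numpy as np

def reliability_curve(pred_conf, true_score, n_bins: int = 10):
    """Mean predicted vs. mean observed score per confidence bin (both in [0, 1])."""
    pred_conf, true_score = np.asarray(pred_conf), np.asarray(true_score)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(pred_conf, edges) - 1, 0, n_bins - 1)
    xs, ys = zip(*[(pred_conf[bins == b].mean(), true_score[bins == b].mean())
                   for b in range(n_bins) if (bins == b).any()])
    return np.array(xs), np.array(ys)

# Synthetic demo: a slightly overconfident predictor sits just below y = x,
# mirroring the small pLDDT bias reported above.
rng = np.random.default_rng(0)
conf = rng.uniform(0.4, 1.0, 5000)
true = np.clip(conf - 0.05 + 0.05 * rng.normal(size=5000), 0.0, 1.0)
xs, ys = reliability_curve(conf, true)   # plot xs vs. ys against the diagonal
```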

6. Training Details for Stage 2 (Table S2)

Table S2 breaks stage 2 training of the all-atom structure token decoder into three sub-stages, each changing the losses, the data mixture, or the crop length:

Purpose of Different Stages:

- 2A (90k steps, length 512): trains with all-atom geometric losses only, sampling predicted and experimental structures roughly uniformly, to establish basic all-atom reconstruction.
- 2B (20k steps, length 512): adds the pAE and pLDDT confidence losses and trains on ESM3-predicted tokens, so the decoder learns calibrated confidence outputs and sees the kind of tokens it will receive at inference time.
- 2C (30k steps, length 2048): extends training to longer crops and upsamples experimental structures, emphasizing real data. A config-style restatement follows below.

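For intuition, Table S2 can be read as a staged training schedule, the way such a curriculum might be expressed in a training script. The restatement below is hypothetical: the field names are ours, and the blank ESM3-predicted-tokens cell for stage 2C is read here as "off".

```python
# Hypothetical config-style restatement of Table S2; not the authors' code.
STAGE2_SCHEDULE = [
    dict(stage="2A", steps=90_000, geometric_losses=True,
         pae_plddt_losses=False, esm3_predicted_tokens=False,
         crop_length=512, data="roughly uniform: predicted + experimental"),
    dict(stage="2B", steps=20_000, geometric_losses=True,
         pae_plddt_losses=True, esm3_predicted_tokens=True,
         crop_length=512, data="roughly uniform: predicted + experimental"),
    dict(stage="2C", steps=30_000, geometric_losses=True,
         pae_plddt_losses=False,  # cell is blank in Table S2; read as off
         esm3_predicted_tokens=False,
         crop_length=2048, data="upsample experimental structures"),
]
```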

7. Implications and Conclusions

Model Capabilities:

- The structure tokenizer compresses 3D structure into a discrete vocabulary faithfully enough for near-perfect backbone and strong all-atom reconstruction, making discrete tokens a viable interface between protein structure and sequence-model architectures.

Challenges and Considerations:

- Regions lacking tertiary contacts remain hard to orient globally; some tokens are individually ambiguous; and the confidence estimates, while broadly calibrated, show systematic biases (slight pLDDT overconfidence, pessimistic pTM) and can degrade on generated sequences.

Relevance to Machine Learning Practice:

- The evaluation illustrates transferable habits: test sets chosen by temporal cutoff to prevent leakage, complementary global (RMSD) and local (LDDT) metrics, qualitative error analysis alongside aggregate numbers, staged training curricula, and explicit calibration checks on confidence outputs.


Concluding Remarks

This evaluation provides valuable insights into the performance and behavior of a structure tokenizer in protein modeling. For a graduate student in machine learning, it underscores key concepts such as model training techniques, evaluation metrics, error analysis, and the importance of confidence calibration. Understanding these principles is essential for developing robust and reliable machine learning models across various domains.