esm3.a3.full14

==============================

ESM3 is both a generative model and a representation learning model that can be adapted for predictive tasks. In this section, we present benchmarking results for both capabilities.

ESM3 is a versatile model that can be used for both generative and predictive tasks. It is capable of generating new data based on the patterns it has learned from the input data, making it a generative model. Additionally, it can also learn useful representations of the input data that can be used for predictive tasks, making it a representation learning model.

To evaluate the performance of ESM3 in these two areas, benchmarking results have been presented. These results demonstrate the effectiveness of ESM3 in generating new data and learning useful representations for predictive tasks. Overall, ESM3 is a powerful tool that can be adapted for a wide range of applications in machine learning and data analysis.

ESM3 models are trained at three scales (1.4B, 7B, and 98B parameters) on approximately 75B, 560B, and 1.8T training tokens, respectively.

The ESM3 models are trained at three scales, with varying numbers of parameters and training tokens. The smallest model has 1.4 billion parameters and is trained on approximately 75 billion tokens. The medium model has 7 billion parameters and is trained on approximately 560 billion tokens. The largest model has 98 billion parameters and is trained on approximately 1.8 trillion tokens. These are protein models, used here for the generative and predictive tasks benchmarked in this section, such as contact prediction, structure prediction, and conditional generation. The different scales allow a model to be chosen according to the available computational resources.

The ESM3 1.4B model, trained on 75B tokens and noted for its small size and speed, allows rapid iteration both during training and at inference. Optimal model size and number of training tokens are studied by extrapolating from a series of smaller runs, given a training compute budget, model architecture, and dataset characteristics $(19,21)$. After determining compute optimality for training, a variety of factors such as release frequency, amount of inference, ease of use, and usage patterns are also taken into account to determine the ideal number of tokens on which to train the model. To enable efficient inference for the benefit of the research community, we have trained two additional versions of ESM3 1.4B, named 1.4B Overtrained and 1.4B Open, which are trained on 300B tokens, far beyond their compute optimality for training.

The ESM3 1.4B model is a protein language model that has been trained on 75 billion tokens and is noted for its small size and speed. It allows rapid iteration during both training and inference. To determine the optimal model size and number of training tokens, results from a series of smaller runs are extrapolated, given the training compute budget, model architecture, and dataset characteristics. After compute optimality for training is determined, other factors such as release frequency, amount of inference, ease of use, and usage patterns are considered to decide the number of tokens on which to train the model. To enable efficient inference for the research community, two additional versions of ESM3 1.4B have been trained, named 1.4B Overtrained and 1.4B Open. Both are trained on 300 billion tokens, far beyond their compute optimality for training.
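As a rough illustration of how the parameter and token counts above translate into training compute, the sketch below applies the common rule of thumb of about 6 FLOPs per parameter per token. This rule is an assumption introduced here for illustration and is not the accounting used by the authors; for the 1.4B model it gives about $6.3 \times 10^{20}$, in the same range as the $6.72 \times 10^{20}$ FLOPS reported in this section. The names `approx_training_flops`, `SCALES`, and `ESTIMATES` are hypothetical.

```python
# Rule-of-thumb estimate: ~6 FLOPs per parameter per training token.
# This is an illustrative assumption, not the paper's own FLOP accounting.
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

# (parameters, training tokens) as stated in the text.
SCALES = {
    "1.4B": (1.4e9, 75e9),
    "7B": (7e9, 560e9),
    "98B": (98e9, 1.8e12),
    "1.4B Overtrained/Open": (1.4e9, 300e9),
}

ESTIMATES = {name: approx_training_flops(p, t) for name, (p, t) in SCALES.items()}

for name, flops in ESTIMATES.items():
    print(f"{name}: ~{flops:.2e} FLOPs")
```

Under this approximation, the overtrained 1.4B variants use four times the training compute of the base 1.4B model, since only the token count changes.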


In the following benchmarks for this section, unless otherwise noted, models are evaluated on a test set of 902 proteins whose structures are temporally held out from the ESM3 training set. The proteins were sourced from the Continuous Automated Model EvaluatiOn (CAMEO) targets released from May 1, 2020 through Aug 1, 2023 (86).

The benchmarks in this section evaluate models on a test set of 902 proteins whose structures are held out from the ESM3 training set by release date. These proteins were sourced from the Continuous Automated Model EvaluatiOn (CAMEO) targets released between May 1, 2020 and Aug 1, 2023.


For contact and structure prediction evaluations, we also evaluate on the CASP14 (71 proteins) and CASP15 (70 proteins) structure prediction benchmarks $(87,88)$. The CASP14 and CASP15 sets are obtained directly from the organizers.

The CASP14 and CASP15 sets are collections of protein structures that have been used as benchmarks for evaluating the accuracy of contact and structure prediction methods. These sets were obtained directly from the organizers of the Critical Assessment of Protein Structure Prediction (CASP) experiment, which is a biennial competition that aims to assess the state-of-the-art in protein structure prediction. The CASP14 set contains 71 protein structures, while the CASP15 set contains 70 protein structures. These sets are widely used in the field of protein structure prediction to evaluate the performance of different methods and to compare them against each other.


The contact prediction model is a multilayer perceptron (MLP) head that operates independently over the representations of each amino acid pair, outputting the probability of contact between them. We use LoRA (89) for finetuning, which is a common alternative to full weight finetuning that uses much less memory while attaining strong performance. LoRA is applied to the base model for finetuning, and the MLP along with the LoRA weights are trained end-to-end using the cross-entropy loss with respect to the ground truth contact prediction map. For the ground truth, all residues at least 6 positions apart in the sequence and within an $8 \AA$ $\mathrm{C}\alpha$-$\mathrm{C}\alpha$ distance are labeled as a contact. All models are trained with LoRA rank 4, batch size 64, and a learning rate of 1e-3 for 10k steps on a mix of sequence and structure data from PDB, AlphaFold-DB, ESMAtlas, and OAS Predicted Structures. Data are sampled in a ratio of 1:3:3:0.03 from these datasets.

The contact prediction model predicts, for each pair of amino acids, the probability that the two residues are in contact. It is a multilayer perceptron (MLP) head that operates independently over the representations of each amino acid pair. The model is finetuned using LoRA, which is a common alternative to full weight finetuning that uses much less memory while attaining strong performance. LoRA is applied to the base model, and the MLP along with the LoRA weights are trained end-to-end using the cross-entropy loss with respect to the ground truth contact map. In the ground truth, two residues are labeled as a contact if they are at least 6 positions apart in the sequence and within an $8 \AA$ $\mathrm{C}\alpha$-$\mathrm{C}\alpha$ distance. The models are trained with LoRA rank 4, batch size 64, and a learning rate of 1e-3 for 10k steps on a mix of sequence and structure data from PDB, AlphaFold-DB, ESMAtlas, and OAS Predicted Structures. Data are sampled in a ratio of 1:3:3:0.03 from these datasets.
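The ground truth labeling rule described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `contact_map`, the strict `<` comparison at exactly 8 Å, and the toy coordinates are all assumptions.

```python
import numpy as np

def contact_map(ca_coords: np.ndarray, min_seq_sep: int = 6, max_dist: float = 8.0) -> np.ndarray:
    """Boolean (L, L) map: True where two residues are at least `min_seq_sep`
    positions apart in sequence and their C-alpha atoms are closer than `max_dist` angstroms."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)             # pairwise C-alpha distances
    idx = np.arange(len(ca_coords))
    sep = np.abs(idx[:, None] - idx[None, :])        # sequence separation |i - j|
    return (sep >= min_seq_sep) & (dist < max_dist)  # boundary handling is an assumption

# Toy example: 10 residues on a straight line, 1.2 angstroms apart.
coords = np.stack([1.2 * np.arange(10), np.zeros(10), np.zeros(10)], axis=1)
labels = contact_map(coords)
```

In this toy chain only pairs exactly 6 positions apart (7.2 Å) count as contacts: closer pairs fail the sequence separation filter and more distant pairs exceed the distance cutoff.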


Table S7 shows the performance on each structural test set through the metric of precision at L (P@L), which evaluates the precision of the top-L most confident predictions, where L is the length of the protein. The smallest ESM3 model, with 1.4B parameters, achieves a P@L of $0.76 \pm 0.02$ on the CAMEO test set, which is higher than the 3B-parameter ESM2 model ($0.75 \pm 0.02$). Furthermore, it trains on an order of magnitude less compute during pre-training ($6.72 \times 10^{20}$ FLOPS vs. $1.8 \times 10^{22}$ FLOPS), demonstrating the benefits of multimodal pre-training.

The performance of the ESM3 model on each structural test set is evaluated using the metric of precision at L (P@L), which measures the precision of the top-L most confident predictions, where L is the length of the protein. The smallest ESM3 model, with 1.4B parameters, achieved a P@L of $0.76 \pm 0.02$ on the CAMEO test set, which is higher than the 3B-parameter ESM2 model ($0.75 \pm 0.02$). Additionally, the ESM3 model required an order of magnitude less compute during pre-training ($6.72 \times 10^{20}$ FLOPS vs. $1.8 \times 10^{22}$ FLOPS), highlighting the benefits of multimodal pre-training.
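To make the P@L metric concrete, the following sketch ranks residue pairs by predicted contact probability and measures the fraction of true contacts among the top L. It is an illustration under stated assumptions: the function name `precision_at_l` is hypothetical, and restricting the candidates to pairs at least `min_seq_sep` positions apart mirrors the labeling rule in the text but may differ from the exact evaluation protocol.

```python
import numpy as np

def precision_at_l(probs: np.ndarray, labels: np.ndarray, min_seq_sep: int = 6) -> float:
    """Precision of the top-L most confident contact predictions for a protein of length L."""
    L = probs.shape[0]
    i, j = np.triu_indices(L, k=min_seq_sep)  # each eligible pair (i < j) exactly once
    order = np.argsort(-probs[i, j])[:L]      # indices of the L highest-probability pairs
    return float(labels[i[order], j[order]].mean())

# Toy example with L = 4 and no sequence separation filter beyond i != j.
probs = np.zeros((4, 4))
labels = np.zeros((4, 4), dtype=bool)
toy = {(0, 1): (0.9, True), (0, 2): (0.8, True), (0, 3): (0.7, False),
       (1, 2): (0.6, True), (1, 3): (0.1, True), (2, 3): (0.05, True)}
for (a, b), (p, y) in toy.items():
    probs[a, b], labels[a, b] = p, y

score = precision_at_l(probs, labels, min_seq_sep=1)
print(score)  # 3 of the 4 most confident pairs are true contacts
```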


ESM3 can directly predict protein structures without additional finetuning by first predicting structure tokens, then decoding these tokens into coordinates. When predicting structure tokens, we follow the strategy outlined in Appendix A.1.10 and test both argmax decoding and full iterative decoding.

ESM3 is a protein structure prediction model that can directly predict protein structures without the need for additional fine-tuning. This is achieved by first predicting structure tokens, which are then decoded into coordinates. The process of predicting structure tokens involves following the strategy outlined in Appendix A.1.10, which includes testing both argmax decoding and full iterative decoding. This approach allows ESM3 to accurately predict protein structures without the need for additional training or fine-tuning.


For more difficult datasets, such as CASP14 and CASP15, iterative decoding has an outsized impact (see Table S8), whereas for easier datasets like CAMEO, argmax prediction is sufficient. On both the CAMEO and CASP15 datasets, argmax prediction for the 7B model is comparable to ESMFold, and iterative decoding with ESM3 98B closes the gap between ESMFold and AlphaFold2. Structure prediction scaling curves, as a function of training compute, are provided in Fig. S10.

The impact of iterative decoding on more difficult datasets, such as CASP14 and CASP15, is significant, as shown in Table S8. However, for easier datasets like CAMEO, argmax prediction is sufficient. The argmax prediction for the 7B model on both the CAMEO and CASP15 datasets is comparable to ESMFold, and iterative decoding with ESM3 98B helps to close the gap between ESMFold and AlphaFold2. Additionally, structure prediction scaling curves as a function of training compute are provided in Fig. S10.


The conditional likelihood of an output given a prompt serves as a proxy for the generative capabilities of a model. Fig. S11 and Table S9 evaluate the performance of ESM3 as a conditional generative model, using its negative log likelihood (NLL) on the test set. For each track - sequence, structure, function, SASA, and secondary structure - NLL is evaluated both unconditionally and conditioned on each of the other tracks.

The conditional likelihood of an output given a prompt is a measure of how well a model can generate new data based on a given input. In this case, the model being evaluated is ESM3, and the performance is measured using its negative log likelihood (NLL) on the test set. The evaluation is done for five different tracks: sequence, structure, function, SASA, and secondary structure. The NLL is calculated both unconditionally and conditioned on each of the other tracks. This helps to determine how well the model can generate data for each track given the information from the other tracks. The results of this evaluation are presented in Fig. S11 and Table S9.


Unlike, for example, an autoregressive model, ESM3 is a generative model over masking patterns, so it is trained to predict tokens given any masking pattern. The NLL of a sample under ESM3 is given by $-\frac{1}{L!} \sum_{o \in \mathbb{O}} \frac{1}{L} \sum_{i=1}^{L} \log p\left(x_{o_{i}} \mid x_{o_{1}}, \ldots, x_{o_{i-1}}\right)$, where $\mathbb{O}$ is the set of all decoding orders with normalization constant $Z=\frac{1}{L!}$. This computation is intractable (the number of decoding orders, $L!$, grows super-exponentially with the length of a protein), but it can be approximated by sampling a single decoding order $o$ for each $x$ in our dataset. At each step, teacher forcing is used to replace the masked token with the ground truth token, and we report the mean NLL over the output tokens.

ESM3 is a generative model that predicts tokens given any masking pattern, unlike an autoregressive model. The negative log-likelihood (NLL) of a sample under ESM3 is calculated by summing the log probabilities of each token given the previous tokens in the sequence, over all possible decoding orders. However, this computation is intractable due to the exponential number of decoding orders. To approximate the NLL, a single decoding order is sampled for each sequence in the dataset, and teacher forcing is used to replace the masked tokens with the ground truth tokens. The mean NLL over the output tokens is then reported.
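The single-order approximation described above can be sketched as follows. This is a minimal illustration, not ESM3's evaluation code: `predict_log_probs` is a hypothetical stand-in for the model (it takes a partially masked token array and returns per-position log-probabilities), and the `MASK` id and toy uniform model are assumptions.

```python
import numpy as np

MASK = -1  # hypothetical mask token id

def sampled_order_nll(tokens, predict_log_probs, rng) -> float:
    """Estimate the NLL of `tokens` from one random decoding order with teacher forcing.

    `predict_log_probs(partial)` must return an (L, V) array of log-probabilities
    given the partially unmasked sequence `partial`.
    """
    tokens = np.asarray(tokens)
    L = len(tokens)
    order = rng.permutation(L)          # one sampled decoding order o
    partial = np.full(L, MASK)          # start from a fully masked sequence
    total = 0.0
    for pos in order:
        log_probs = predict_log_probs(partial)
        total += log_probs[pos, tokens[pos]]  # log p(x_{o_i} | x_{o_1}, ..., x_{o_{i-1}})
        partial[pos] = tokens[pos]            # teacher forcing: reveal the ground truth token
    return -total / L                   # mean NLL over the output tokens

# Toy stand-in model: uniform distribution over a vocabulary of 20 tokens.
def uniform_model(partial, vocab_size=20):
    return np.full((len(partial), vocab_size), -np.log(vocab_size))

rng = np.random.default_rng(0)
nll = sampled_order_nll([3, 7, 1, 19, 0], uniform_model, rng)
```

For the uniform toy model the estimate equals $\log 20$ regardless of the sampled order, which makes it a convenient sanity check; with a real model, different orders give different estimates and the dataset-level mean averages over them.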


There are many straightforward relationships in this data. For example, the unconditional NLL (Fig. S11, black lines) is always higher than the conditional NLL, and conditioning on full 3D structure reduces the loss on secondary structure prediction to nearly zero (1.4B: 0.24, 7B: 0.19, 98B: 0.16).

The data show clear and direct relationships between the quantities being analyzed. Specifically, the unconditional NLL (negative log-likelihood) is consistently higher than the conditional NLL, indicating that the model predicts a track better when it is given information from another track. Additionally, when the full 3D structure is used as conditioning, the loss on secondary structure prediction is close to zero, with the NLL decreasing from 0.24 to 0.19 to 0.16 as model scale increases from 1.4B to 7B to 98B parameters. This is expected, since secondary structure is largely determined by the 3D structure.


Other trends may be more surprising. Conditioning on sequence results in a lower structure prediction loss than conditioning on secondary structure (98B; sequence: 3.13, secondary structure: 3.37). There are some diminishing returns to scale for the prediction of structure, function, SASA, and secondary structure. However, this diminishing is not observed for sequences, where we observe a clear log-linear relationship between pre-training FLOPS and NLL, regardless of conditioning.

Different trends are observed across the tracks. Conditioning on sequence results in a lower structure prediction loss than conditioning on secondary structure. In other words, at the 98B scale, the sequence is a more informative prompt for structure prediction than the secondary structure is.

Furthermore, there are diminishing returns to scale for the prediction of structure, function, SASA, and secondary structure. This means that as model scale and pre-training compute increase, the improvement in NLL on these tracks becomes smaller. However, this trend is not observed for the sequence track, where a clear log-linear relationship between pre-training FLOPS and NLL is observed, regardless of conditioning. This suggests that sequence modeling continues to benefit steadily from additional compute across the scales studied.


To assess the model's unconditional generation capability, we sampled 100 protein lengths randomly from the PDB and generated 1,024 sequences for each using ESM3 98B with a constant temperature of 0.7. The sampled length distribution is shown in Fig. S13A. Structures for each sequence were predicted using ESM3 7B, and the distributions of pTM and pLDDT are shown in Fig. S13B. ESM3 generates more high-quality structures than ESM2, which was trained using a simple MLM objective over sequence only with a fixed mask rate. Sequence similarity to the training set was computed using mmseqs2 (73) with the following parameters: --cov-mode 2 -c 0.8 -s 6.0. Proteins generated unconditionally are similar, but not identical, to proteins found in the training set (Fig. S15) and have high coverage of the training set (Fig. 1E), demonstrating that the model has properly fit the training distribution and does not exhibit mode collapse. We observe a cluster of generations with very high sequence identity to the training set; these correspond to antibody sequences, with the framework regions accounting for the high sequence identity.

Here, pTM is the predicted TM-score, a model-estimated measure of confidence in the global fold of a predicted structure, and pLDDT is a per-residue measure of local confidence. Higher values indicate more confident predictions. Neither score is computed against a native structure, so both reflect the confidence of the structure predictor rather than measured accuracy.

Figure S12. Distribution of pTM and pLDDT. Measured on natural (left) and generated (right) sequences under ESM3 7B structure prediction. Generated sequences show a clearly lower correlation (Pearson r 0.79 vs. 0.85) as well as a mode of sequences with high pLDDT but low pTM. Natural sequences are from the test set (Appendix A.3.2); generations are unconditional generations from ESM3 98B.

The figure shows the distribution of two metrics, pTM and pLDDT, for natural and generated sequences under the ESM3 7B structure prediction model. The natural sequences are from the test set, while the generated sequences are unconditional generations from ESM3 98B. The generated sequences have a lower correlation and a mode of sequences with high pLDDT but low pTM compared to the natural sequences. The ESM3 model generates more high-quality structures than ESM2, which was trained using a simple MLM objective over sequence only with a fixed mask rate. The generated sequences are similar but not identical to proteins found in the training set and have high coverage of the training set, indicating that the model has properly fit the training distribution and does not exhibit mode collapse. A cluster of generations with very high sequence identity to the training set corresponds to antibody sequences, with the framework regions accounting for the high sequence identity.


We use pTM for evaluating structure predictions from ESM3 instead of pLDDT. This is because pLDDT can be miscalibrated for generated structures and can overestimate the confidence of a prediction. pLDDT is biased towards local structural confidence, which can result in pathologies such as very long alpha helices with high pLDDT at all positions. pTM is a more global measure of structural confidence, and is more robust to these pathologies. Fig. S12 shows that the correlation between pTM and pLDDT drops for generated sequences (Pearson r: natural = 0.85, generation = 0.79), and a clear pattern of high pLDDT ($>0.8$) but low pTM ($<0.6$) emerges.

The use of pTM instead of pLDDT for evaluating structure predictions from ESM3 is due to the potential miscalibration of pLDDT for generated structures, which can lead to overestimation of prediction confidence. pLDDT is biased towards local structural confidence, which can result in pathologies such as very long alpha helices with high pLDDT at all positions. On the other hand, pTM is a more global measure of structural confidence and is more robust to these pathologies. Figure S12 shows that the correlation between pTM and pLDDT decreases for generated sequences, with a clear pattern of high pLDDT (greater than 0.8) but low pTM (less than 0.6) emerging.


To visualize the distribution of unconditional generations, we compute sequence embeddings by extracting the final layer outputs produced by running ESM3 7B with sequence inputs only. Protein-level embeddings are computed by averaging over all positions in the sequence to produce a 2560-dim embedding. We then project these embeddings into two dimensions using a UMAP projection (90) fit on a background distribution of 50,000 randomly sampled sequences from UniProt with minimum distance 0.1 and number of neighbors 25. Examples are selected by computing structural clusters with Foldseek-cluster (using default parameters) and sampling the example with highest ESM3 pTM from each cluster. A subset of these cluster representatives is shown in Fig. 1E.

To create a visual representation of the distribution of unconditional generations, we first extract the final layer outputs produced by running ESM3 7B with sequence inputs only. This generates a sequence embedding for each input sequence. We then compute protein-level embeddings by averaging over all positions in the sequence to produce a 2560-dimensional embedding.

Next, we use a UMAP projection (90) to project these embeddings into two dimensions. The UMAP projection is fit on a background distribution of 50,000 randomly sampled sequences from UniProt with a minimum distance of 0.1 and 25 neighbors.

To select examples for visualization, we compute structural clusters with Foldseek-cluster using default parameters. We then sample the example with the highest ESM3 pTM from each cluster. A subset of these cluster representatives are shown in Fig. 1E.
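The protein-level pooling step can be sketched as below. This is an illustration only: the function name `mean_pool`, the batched layout with padding, and the toy array are assumptions, and the per-residue embeddings would in practice come from the final layer of ESM3 7B.

```python
import numpy as np

def mean_pool(per_residue: np.ndarray, lengths: np.ndarray) -> np.ndarray:
    """Average (B, Lmax, D) per-residue embeddings over the real positions of each protein.

    Positions at or beyond `lengths[b]` are treated as padding and ignored
    (the padded batch layout is an assumption made for this sketch).
    """
    _, l_max, _ = per_residue.shape
    mask = np.arange(l_max)[None, :] < lengths[:, None]      # (B, Lmax), True on real residues
    summed = (per_residue * mask[:, :, None]).sum(axis=1)    # (B, D)
    return summed / lengths[:, None]

# Toy batch: two proteins, embedding dimension 2 (2560 in the text), second one padded.
emb = np.array([[[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
                [[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
lengths = np.array([3, 2])
pooled = mean_pool(emb, lengths)

# With the umap-learn package installed, a 2-D projection with the stated settings
# could then be fit on background embeddings, for example:
#   reducer = umap.UMAP(n_neighbors=25, min_dist=0.1, n_components=2).fit(background)
#   points = reducer.transform(pooled_generations)
```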


To assess whether ESM3 is biased towards particular secondary structures, we use DSSP to predict the three-class secondary structure of the high-confidence (pTM $>0.8$, mean pLDDT $>0.8$) generations and measure the percentage of residues that form alpha helices and beta sheets. When compared to a background distribution computed over the PDB, we find that ESM3 closely matches the secondary structure distribution of known proteins (Fig. S13D), unlike other methods which preferentially generate helical structures (14, 23, 25). Finally, to confirm that the structures predicted with high confidence by ESM3 are designable, we inverse folded and re-folded each using ESM3 7B. The majority of generations successfully re-folded with TM-score of greater than 0.8 to the hallucinated structures, demonstrating that ESM3 has high self-consistency for its own high-confidence designs (Fig. S13C).

To determine if ESM3 is biased towards certain secondary structures, we used DSSP to predict the three-class secondary structure of high-confidence generations (with pTM greater than 0.8 and mean pLDDT greater than 0.8). We then compared the percentage of residues that form alpha helices and beta sheets to a background distribution computed over the PDB. Our results showed that ESM3 closely matches the secondary structure distribution of known proteins, unlike other methods that tend to generate more helical structures. Additionally, we confirmed that the structures predicted with high confidence by ESM3 are designable by inverse folding and re-folding each using ESM3 7B. The majority of generations successfully re-folded with a TM-score greater than 0.8 to the hallucinated structures, demonstrating that ESM3 has high self-consistency for its own high-confidence designs.
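Measuring the helix and sheet percentages amounts to counting classes in a per-residue secondary structure string. The sketch below assumes DSSP's eight-class letters as input and uses one common reduction to three classes (H, G, I to helix; E, B to strand; everything else to coil); the source does not state which reduction was used, so this mapping and the names `DSSP8_TO_3` and `ss3_fractions` are assumptions.

```python
# One common 8-to-3 class reduction; the exact mapping used in the paper is not stated.
DSSP8_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}

def ss3_fractions(dssp8: str) -> dict:
    """Fraction of residues in helix (H), strand (E), and coil (C) for a DSSP string."""
    ss3 = [DSSP8_TO_3.get(c, "C") for c in dssp8]  # unmapped letters count as coil
    n = len(ss3)
    return {k: ss3.count(k) / n for k in "HEC"}

fracs = ss3_fractions("HHHHGG--EEEEBTTS")
print(fracs)  # {'H': 0.375, 'E': 0.3125, 'C': 0.3125}
```

Aggregating these fractions over all high-confidence generations, and over a PDB background set, gives the two distributions being compared.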


To explore alternative ways of generating proteins, we assess the quality of proteins generated by a chain-of-thought (CoT) procedure in which ESM3 7B generates the secondary structure (SS8 tokens), then the 3-D backbone coordinates (structure tokens), followed by the amino acid sequence (sequence tokens) (Fig. S14). We compare the quality of amino acid sequences generated from this CoT procedure with the above method of directly generating amino acid sequences unconditionally. We find that the CoT procedure generates sequences that have higher-confidence ESM3-predicted structures than the directly generated sequences, as measured by pTM and mean pLDDT (Fig. S14A). Compared to high-confidence (pTM $>0.8$, mean pLDDT $>0.8$) directly generated sequences, the high-confidence subset of CoT-generated sequences is also more designable: the CoT-generated sequences have predicted structures whose inverse folded, then re-folded structures have higher TM-score to the originally predicted structure (Fig. S14C). The CoT-generated sequences show a small bias towards higher alpha and beta proportion compared to those generated directly (Fig. S14D).

The study explores alternative ways of generating proteins by assessing the quality of proteins generated through a chain-of-thought (CoT) procedure. The CoT procedure involves generating secondary structure, 3-D backbone coordinates, and amino acid sequence tokens. The quality of amino acid sequences generated through the CoT procedure is compared to those generated directly. The results show that the CoT procedure generates sequences with higher confidence ESM3 predicted structures and more designable structures. The CoT-generated sequences also have a small bias towards higher alpha and beta proportion.


To evaluate ESM3's ability to follow prompts, we use a set of held-out proteins as described in Appendix A.3.2. The test set is further filtered to remove proteins with length greater than 1024, which removes 7 proteins from the test set. To construct prompts for the structure coordinate, secondary structure, and SASA tracks, we sample a random span of length 15% of the original protein length. The model is then shown the corresponding track for the randomly sampled span, and is tasked with generating the sequence for the entire protein. For example, for the structure track, for a protein of length 100, we may sample a random span of 15 residues from residue 20-35. The model would then have to generate a protein sequence of length 100 conditioned on structure coordinates from residues 20-35 derived from the original test protein. This same procedure is applied for the secondary structure and SASA tracks. For the function track, we form the prompt by tokenizing the keywords from the InterProScan annotations associated with each sequence. The ESM3 7B model is used for all generations with a temperature of 0.7 and $L$ decoding steps (where $L$ is the length of the sequence). The model generates 64 sequences per prompt, which we use to compute pass64.
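The span-based prompt construction can be sketched as follows. This is a hedged illustration: the helper names `sample_prompt_span` and `mask_outside_span`, the rounding rule for the span length, the half-open span convention, and the use of `None` as a mask value are all assumptions rather than details given in the text.

```python
import random

def sample_prompt_span(length, frac=0.15, rng=None):
    """Return a half-open span [start, end) covering about `frac` of the protein."""
    rng = rng or random.Random()
    span_len = max(1, round(frac * length))          # rounding rule is an assumption
    start = rng.randrange(0, length - span_len + 1)  # span must fit inside the protein
    return start, start + span_len

def mask_outside_span(track, start, end, mask=None):
    """Keep track values inside the span and replace everything else with `mask`."""
    return [v if start <= i < end else mask for i, v in enumerate(track)]

rng = random.Random(0)
start, end = sample_prompt_span(100, rng=rng)
# Toy secondary structure track for a protein of length 100.
prompt = mask_outside_span(list("H" * 100), start, end)
```

The resulting `prompt` exposes 15 positions of the chosen track and masks the remaining 85, and the model is then asked to generate the full-length sequence.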


To evaluate the generations, we use ESMFold to fold the sequences generated by ESM3. For the structure coordinate, secondary structure, and SASA tracks, the relevant alignment metrics (backbone cRMSD, 3-class secondary structure accuracy, and SASA Spearman $\rho$) can be calculated on the relevant span in the ESMFold-predicted structure and the original template protein. Continuing the previous example for the structure track, we would compute the RMSD between residues 20-35 in the ESMFold-predicted structure of the ESM3-generated sequence and residues 20-35 of the original test protein. For the function annotation track, we run InterProScan (38) on each generated sequence and extract function keywords from the emitted annotations. We report function keyword recovery at the protein level, computing the proportion of all function keywords in the prompt which appear anywhere in the function keywords from the InterProScan annotations of the generation.

Each of these metrics is computed only over the prompted span. For the structure coordinate track, a lower backbone cRMSD to the reference indicates closer adherence to the prompt. For the secondary structure track, accuracy is the fraction of span positions whose three-class secondary structure matches the reference. For the SASA track, Spearman $\rho$ measures the rank correlation between the per-residue solvent accessible surface area of the generation and that of the reference.

Figure S13. Unconditional generation of high-quality and diverse proteins using ESM3. (A) Distribution of sequence length in the unconditional generation dataset. (B) Mean pLDDT and pTM of unconditional generations from ESM3 compared to sequences designed using the 3B-parameter ESM2 model. (C) Round-trip success rate of high-confidence generations using ESM3. Predicted structures were inverse folded to predict a new sequence and then re-folded to produce a new structure. Success was measured by a TM-score of greater than 0.8 between the original and refolded designs. (D) Secondary structure composition of unconditional generations relative to the distribution of proteins in the PDB, which is shown in gray.

Figure S13 shows the results of using the ESM3 model to generate proteins without any prompt or conditioning. The figure consists of four panels:

Panel A shows the distribution of sequence lengths in the dataset of unconditional generations. These lengths were sampled randomly from the PDB.

Panel B compares the mean predicted local distance difference test (pLDDT) and predicted TM-score (pTM) of the unconditional generations from ESM3 to those of sequences designed using the 3B-parameter ESM2 model. ESM3 produces more generations with high pLDDT and pTM, indicating more confidently predicted structures.

Panel C shows the round-trip success rate of high-confidence generations using ESM3. Success was measured by the TM-score between the original and refolded designs, with a TM-score greater than 0.8 indicating success. A high round-trip success rate indicates that the high-confidence generations are self-consistent and designable.

Panel D shows the secondary structure composition of the unconditional generations relative to the distribution of proteins in the Protein Data Bank (PDB). The generations have a secondary structure composition similar to that of proteins in the PDB, indicating that ESM3 is not strongly biased towards a particular secondary structure.

Figure S14. Generation of sequences using chain-of-thought. SS8 tokens are generated first, followed by structure tokens, then amino acid sequence with the ESM3 7B model. (A) Distribution of mean pLDDT and pTM of sequences generated by chain-of-thought ("ss8 first") compared to directly generating the sequence ("sequence only"). (B) Sample generations of SS8 tokens and the predicted structure of its corresponding CoT sequence. (C) TM-score between predicted structures of high-confidence (pTM $>0.8$, mean pLDDT $>0.8$) generated sequences and their corresponding inverse folded, then re-folded structures. (D) Comparison of the secondary structure composition of high-confidence generated sequences to the distribution of proteins in the PDB.

Figure S14 shows the process of generating sequences using chain-of-thought. The first step is to generate SS8 tokens, followed by structure tokens, and then amino acid sequences using the ESM3 7B model. The distribution of mean pLDDT and pTM of sequences generated by chain-of-thought is compared to directly generating the sequence. The predicted structure of the SS8 tokens and their corresponding CoT sequence is also shown. The TM-score between predicted structures of high-confidence generated sequences and their corresponding inverse folded, then re-folded structures is calculated. Finally, the secondary structure composition of high-confidence generated sequences is compared to the distribution of proteins in the PDB.

To evaluate the accuracy of the generated sequences, we calculate various metrics such as backbone cRMSD, 3-class secondary structure accuracy, and SASA Spearman $\rho$ on the relevant span in the ESMFold-predicted structure and the original template protein. For the function annotation track, we run InterProScan on each generated sequence and extract function keywords from the emitted annotations. We report function keyword recovery at the protein level, computing the proportion of all function keywords in the prompt which appear anywhere in the function keywords from the InterProScan annotations of the generation.
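The protein-level function keyword recovery described above can be sketched as a set computation. This is a minimal illustration rather than the authors' implementation: the function name `keyword_recovery` and the example keywords are hypothetical, and the keywords are assumed to have already been extracted from the InterProScan annotations.

```python
def keyword_recovery(prompt_keywords, generated_keywords):
    """Return the fraction of prompted keywords present in the generation.

    Both arguments are iterables of keyword strings. Extracting keywords
    from InterProScan output is assumed to happen upstream of this function.
    """
    prompted = set(prompt_keywords)
    if not prompted:
        raise ValueError("the prompt must contain at least one function keyword")
    recovered = prompted & set(generated_keywords)
    return len(recovered) / len(prompted)


# Hypothetical keyword sets, for illustration only.
prompt_keywords = ["kinase", "atp binding", "transferase"]
generated_keywords = ["kinase", "transferase", "membrane"]
recovery = keyword_recovery(prompt_keywords, generated_keywords)
print(f"keyword recovery: {recovery:.2f}")
```

Using sets means a keyword counts once regardless of how many annotations emit it, which matches the "appear anywhere" wording of the metric.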


To test the ability of ESM3 to generalize beyond its training distribution under prompting, we evaluate two prompting scenarios. First, we identify proteins which were deposited in the PDB after our training cutoff (December 2020) and choose eight with $\mathrm{TM}<0.7$ to any structure in our training dataset (PDB IDs: 2JVN chain A, 2KAF chain A, 2L8K chain A, 2MJM chain A, 7ZUO chain A, 8EXF chain B). Using DSSP, we compute the residue-level SS8 and SASA for each of these proteins to prompt ESM3, masking all other tracks. We show in Fig. S15A that the generated proteins are diverse, globular, and closely follow the SS8 and SASA prompts while having no close sequence or structure neighbors in the training set. Interestingly, these proteins are not folded with high confidence or accuracy by ESMFold (mean pTM 0.44, mean TM-score to reference 0.33), suggesting that these are challenging proteins to fold. The ESM3-generated sequences have a similar confidence (mean pTM 0.45) but much higher accuracy (mean TM-score 0.64).

The authors of the study evaluated the ability of ESM3 to generalize beyond its training distribution under prompting by identifying proteins that were deposited in the PDB after their training cutoff and choosing eight with TM<0.7 to any structure in their training dataset. They then used DSSP to compute the residue-level SS8 and SASA for each of these proteins to prompt ESM3, masking all other tracks. The generated proteins were found to be diverse, globular, and closely followed the SS8 and SASA prompts while having no close sequence or structure neighbors in the training set. Interestingly, these proteins were not folded with high confidence or accuracy by ESMFold, suggesting that they are challenging proteins to fold. However, the ESM3-generated sequences had a similar confidence but much higher accuracy.


Second, we classify the residue-level secondary structure for a set of eight symmetric protein backbones using DSSP. These proteins were previously designed using ESMFold $(5,91)$ and have varying secondary structure (alpha and beta) and varying symmetries (5-fold and 8-fold). Again, ESM3 is able to design these proteins successfully with high confidence (pTM $>0.8$, pLDDT $>0.8$) and low sequence similarity to the training set (Fig. S15B). The structural similarity is moderate for these designs due to the high structural conservation of the protomer units in each design. All designs are generated using a constant temperature of 0.7 with $L/2$ decoding steps, where $L$ is the protein length. We sample 256 sequences for each prompt and filter generations by pTM ($>0.8$), pLDDT ($>0.8$), and accuracy in satisfying the SS8 prompts ($>0.8$). Final examples were selected from these filtered designs by visual inspection. Sequence similarity to the training set was computed using the same procedure as the unconditional generations, and structure similarity was computed using Foldseek (39) in TM-score mode (alignment-type 1) with sensitivity -s 7.5.

The study used DSSP to classify the residue-level secondary structure of eight symmetric protein backbones that were previously designed using ESMFold. These proteins have varying secondary structure and symmetries. ESM3 was able to design these proteins successfully with high confidence and low sequence similarity to the training set. The structural similarity is moderate due to the high structural conservation of the protomer units in each design. The designs were generated using a constant temperature of 0.7 with $L/2$ decoding steps, and 256 sequences were sampled for each prompt and filtered by pTM, pLDDT, and SS8 prompt accuracy (each $>0.8$). The final examples were selected by visual inspection. Sequence similarity was computed using the same procedure as for the unconditional generations, and structure similarity was computed using Foldseek in TM-score mode.
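The sampling and filtering settings above can be summarized in a short sketch. It assumes the three metrics have already been computed for each generation; the `Generation` container, the function names, and the example values are hypothetical, and flooring $L/2$ for odd lengths is an assumption, since the text does not say how the step count is rounded.

```python
from dataclasses import dataclass


@dataclass
class Generation:
    sequence: str
    ptm: float           # predicted TM-score
    plddt: float         # mean pLDDT on a 0-1 scale
    ss8_accuracy: float  # fraction of prompted SS8 positions satisfied


def decoding_steps(length):
    """L/2 decoding steps; flooring for odd L is an assumption."""
    return max(1, length // 2)


def filter_generations(generations, threshold=0.8):
    """Keep generations that exceed the threshold on all three metrics."""
    return [
        g for g in generations
        if g.ptm > threshold and g.plddt > threshold and g.ss8_accuracy > threshold
    ]


# Hypothetical metric values, for illustration only.
candidates = [
    Generation("SEQ_A", ptm=0.85, plddt=0.88, ss8_accuracy=0.91),
    Generation("SEQ_B", ptm=0.79, plddt=0.90, ss8_accuracy=0.95),
    Generation("SEQ_C", ptm=0.90, plddt=0.83, ss8_accuracy=0.80),
]
kept = filter_generations(candidates)
print([g.sequence for g in kept], decoding_steps(200))
```

The comparisons are strict, following the "$>0.8$" wording, so a generation that sits exactly on a threshold is rejected. Final selection by visual inspection is not modeled here.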


ESM3 is able to compose multimodal prompts across its input tracks (sequence, structure, SS8, SASA, and function keywords) to generate proteins with novel characteristics. To demonstrate this, we augment the standard functional motif scaffolding task (i.e., partial structure and sequence prompts) with additional conditioning to specify the type of scaffold for ESM3 to design. The functional sites comprise a combination of ligand binding sites coordinated by residues remote in sequence and those defined by short local motifs. For each motif, the coordinates and amino acid identities of all residues from the reference PDB structures are input to the model, with random shuffling and augmentation of the gaps between each active site. See Appendix A.4.5 for a description of this augmentation procedure and the specifications of the ligand-binding sites chosen. In addition to these sites, we also create a set of 12 partial sequence and structure prompts derived from conserved functional motifs (Table S10). These motifs are defined using a combination of the benchmark dataset in Watson et al. (23) and conserved sequence patterns from the Prosite database (92).

ESM3 is a tool that can generate proteins with unique characteristics by combining various input tracks such as sequence, structure, SS8, SASA, and function keywords. This is achieved by creating multimodal prompts that allow for the creation of novel proteins. To demonstrate this, the tool is used to augment the standard functional motif scaffolding task by adding additional conditioning to specify the type of scaffold for ESM3 to design. The functional sites are made up of a combination of ligand binding sites coordinated by residues remote in sequence and those defined by short local motifs. The coordinates and amino acid identities of all residues from the reference PDB structures are input into the model, with random shuffling and augmentation of the gaps between each active site. Additionally, a set of 12 partial sequence and structure prompts derived from conserved functional motifs are created. These motifs are defined using a combination of the benchmark dataset in Watson et al. (23) and conserved sequence patterns from the Prosite database (92).


The scaffold conditioning is defined using either SS8 tokens (to specify secondary structure composition) or function keywords defined by InterPro accession numbers (to specify a particular fold). For each combination of functional site and scaffold prompt, we sample between 256 and 2048 times to generate proteins with diverse and novel characteristics. All designs were generated with the 7B-parameter model, a constant temperature of 0.7, and $L / 2$ decoding steps for a protein of length $L$.

The scaffold conditioning is a process that involves specifying the secondary structure composition or fold of a protein using SS8 tokens or InterPro accession numbers, respectively. This is done to generate proteins with diverse and novel characteristics. The process involves sampling between 256 and 2048 times for each combination of functional site and scaffold prompt. The designs are generated using the 7B-parameter model, a constant temperature of 0.7, and $L / 2$ decoding steps for a protein of length $L$.


Secondary structure prompting. We generated proteins under four main classes of secondary structure composition: mostly alpha helices, mostly beta sheets, and mixed alpha-beta proteins (split into alpha/beta, alpha/beta/alpha, and beta/alpha/beta topologies). For each generation, we prompt the model with a random set of SS8 spans up to a total length $L$, with mask tokens in between. For example, an all-alpha SS8 prompt for a protein of length $L=20$ might look like __HHHH_HHHHH_HH and a beta-alpha-beta prompt might look like _EEEHHHHHEE_, where H is a residue within an alpha helix and E is a residue in a beta strand. We then combine this with the augmented partial structure and sequence tracks given by a functional site motif. To increase the diversity of the scaffolds and maximize the probability of generating physically realizable prompt combinations, we generate between 256 and 1024 designs for each combination of SS8 and functional site motif. For each generation, we uniformly sample a random length $L$ between 150 and 400. Then, we produce a set of secondary structure spans with length 5-20 residues, each separated

In summary, secondary structure prompts are built from random SS8 spans of 5-20 residues separated by masked positions, and each prompt is combined with the augmented partial structure and sequence tracks of a functional site motif. For each generation, a length $L$ is sampled uniformly between 150 and 400, and between 256 and 1024 designs are generated for each combination of SS8 prompt and functional site motif. Sampling many designs increases the diversity of the scaffolds and the chance of obtaining physically realizable prompt combinations.


Figure S15. Prompting ESM3 to generalize beyond its training distribution. (A) Proteins designed using SS8 and SASA prompts derived from recent structures in the PDB with low structural similarity to the training set. Prompts along the protein length are visualized above each generation; secondary structure is shown using three-class (alpha $=$ blue, beta $=$ orange, coil $=$ gray) and SASA is shown as a line plot colored by residue index to match the cartoon below. (B) Symmetric proteins designed using SS8 prompting. Histograms show the similarity to the nearest training set protein by structure (TM-score) and sequence (sequence identity) compared to unconditional generation.

Figure S15 shows ESM3 generalizing beyond its training distribution under two prompting scenarios. SS8 prompts specify the per-residue secondary structure, and SASA prompts specify the per-residue solvent accessible surface area.

In panel A, proteins are designed from SS8 and SASA prompts derived from recent structures in the PDB (Protein Data Bank) that have low structural similarity to the training set. The prompts are visualized along the protein length above each generation: secondary structure is shown in three classes (alpha, beta, coil), and SASA is shown as a line plot colored by residue index to match the cartoon below.

In panel B, symmetric proteins are designed using SS8 prompting. Histograms compare the similarity to the nearest training set protein, by structure (TM-score) and by sequence (sequence identity), against unconditional generation.


\begin{tabular}{rccc}
\hline
Motif & PDB ID & Chain ID & PDB Residue Identifiers \\
\hline
ACE2 binding & 6vw1 & A & 19-89, 319-366 \\
Ferredoxin & 6e6r & A & 1-44 \\
Barstar binding & 7mrx & B & 25-47 \\
P53 binding & 1ycr & B & 19-28 \\
PD-1 binding & 5ius & A & 63-83, 119-141 \\
DNA-binding helix-turn-helix & 1lcc & A & 1-52 \\
P-loop & 5ze9 & A & 229-243 \\
Double EF-hand & 1a2x & A & 103-115, 139-152 \\
Lactate dehydrogenase & 1ldb & A & 186-206 \\
Renal dipeptidase & 1itu & A & 124-147 \\
Ubiquitin-activating enzyme E1C binding & 1yov & B & 213-223 \\
DNA topoisomerase & 1a41 & A & 248-280 \\
\hline
\end{tabular}

Table S10. Functional motif definitions for conserved regions.

by a gap of 3-10 residues, such that the total length adds up to $L$. Finally, to avoid incompatibility between the partial structure and secondary structure constraints, we also mask the SS8 tokens at positions where structure is specified by the functional site prompt. Secondary structure-prompted designs were assessed by running DSSP on the designed sequence and measuring the fraction of prompted residues which were assigned the correct secondary structure. Success was determined by a pTM $>0.8$, all-atom cRMSD $<1.5 \AA$ for the functional site, and SS8 accuracy $>0.8$.

Table S10 lists the 12 conserved functional motifs used as partial sequence and structure prompts, giving for each the source PDB ID, chain ID, and residue ranges. The motifs cover functional regions such as binding interfaces and conserved enzymatic sites.

The accompanying text completes the description of secondary structure prompting. SS8 tokens are masked wherever the functional site prompt already specifies structure, so that the two constraints cannot conflict. A design counts as a success when it has pTM $>0.8$, all-atom cRMSD $<1.5 \AA$ for the functional site, and SS8 accuracy $>0.8$, where SS8 accuracy is the fraction of prompted residues that DSSP assigns the prompted secondary structure.
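The secondary structure prompt construction and its accuracy metric can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the function names are hypothetical, the prompt is assumed to start with a gap, the span labels simply cycle through a pattern string such as `"EHE"`, and the prompt is truncated to length $L$ (which may shorten the final span), whereas the text only says that the spans and gaps add up to $L$.

```python
import random

MASK = "_"


def sample_ss8_prompt(pattern, length, rng):
    """Build an SS8 prompt of spans (5-20 residues) split by gaps (3-10 residues).

    `pattern` gives the span labels to cycle through, e.g. "H" for all-alpha
    or "EHE" for beta/alpha/beta. Truncating to `length` is a simplification.
    """
    tokens = []
    element = 0
    while len(tokens) < length:
        tokens.extend(MASK * rng.randint(3, 10))   # masked gap
        label = pattern[element % len(pattern)]
        tokens.extend(label * rng.randint(5, 20))  # secondary structure span
        element += 1
    return "".join(tokens[:length])


def mask_motif_positions(prompt, motif_positions):
    """Mask SS8 tokens where the functional site prompt specifies structure."""
    tokens = list(prompt)
    for position in motif_positions:
        tokens[position] = MASK
    return "".join(tokens)


def ss8_accuracy(prompt, assigned):
    """Fraction of prompted (unmasked) positions with the prompted assignment."""
    pairs = [(p, a) for p, a in zip(prompt, assigned) if p != MASK]
    if not pairs:
        raise ValueError("prompt has no unmasked SS8 positions")
    return sum(p == a for p, a in pairs) / len(pairs)


rng = random.Random(0)
length = rng.randint(150, 400)  # uniform protein length L
prompt = sample_ss8_prompt("EHE", length, rng)
# Hypothetical motif location, 0-based positions 40-59.
prompt = mask_motif_positions(prompt, motif_positions=range(40, 60))
print(length, prompt)
```

In practice `assigned` would be the DSSP assignment of the design, aligned position by position with the prompt.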


Keyword prompting. To prompt the model to generate proteins with a specific fold, we extracted the set of InterPro tags associated with a set of proteins from the CAMEO test set for which ESM3 achieved keyword recovery of greater than $80 \%$ (Fig. 2A). These tags were then converted into keywords and used to prompt the model in combination with the partial sequence and structure constraints. The list of prompts and function tags is given in Table S11. Keyword-prompted designs were assessed using a self-consistency evaluation, i.e. whether the model successfully predicts any of the prompted InterPro accessions for the designed sequence. Success was determined by a pTM $>0.8$, all-atom cRMSD $<2.0 \AA$, and number of InterPro accessions recovered $>0$.

Keyword prompting is a technique used to generate proteins with a specific fold. It involves extracting a set of InterPro tags associated with a set of proteins that have achieved a high level of keyword recovery using the ESM3 model. These tags are then converted into keywords and used to prompt the model in combination with partial sequence and structure constraints. The resulting designs are assessed using a self-consistency evaluation, which determines whether the model successfully predicts any of the prompted InterPro accessions for the designed sequence. Success is determined by a pTM $>0.8$, all-atom cRMSD $<2.0 \AA$, and number of InterPro accessions recovered $>0$.
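The success criterion for keyword-prompted designs combines three checks, which can be written as a single predicate. This is a sketch with a hypothetical function name and example values; it assumes the pTM, the functional site cRMSD, and the model's predicted InterPro accessions are already available.

```python
def keyword_design_success(ptm, motif_crmsd, prompted_accessions, predicted_accessions):
    """Apply the three success criteria for a keyword-prompted design."""
    recovered = set(prompted_accessions) & set(predicted_accessions)
    return ptm > 0.8 and motif_crmsd < 2.0 and len(recovered) > 0


# Hypothetical values; the accessions are taken from the TIM barrel row of Table S11.
ok = keyword_design_success(
    ptm=0.86,
    motif_crmsd=1.2,
    prompted_accessions=["IPR013785", "IPR000652"],
    predicted_accessions=["IPR013785"],
)
print(ok)
```

Recovering a single prompted accession is enough to satisfy the third criterion, which mirrors the "any of the prompted InterPro accessions" wording.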


We assess the novelty of each motif-scaffold combination by measuring the TM-score between the generated scaffold and the chain from which the motif is derived (Table S12). This confirms that the model is not retrieving the original motif scaffold, particularly for secondary structure-prompted scaffolds where we do not provide any explicit instructions to produce diverse designs. For the motifs derived from ligand binding residues (magnesium, serotonin, calcium, zinc, protease inhibitor 017, and Mcl-1 inhibitor YLT), we additionally use Foldseek to search the PDB for any other proteins which share that motif (as defined by BioLiP (93)), as a more stringent evaluation of novelty. For all but zinc-binding and magnesium-binding motifs, Foldseek finds no significant hits at an E-value threshold of 1.0. The hits discovered for zinc and magnesium have only modest TM-score (0.76 and 0.64), demonstrating that the model still finds novel scaffolding solutions for these ligands. To assess whether the generated scaffolds are likely to be designable, we measure a self-consistency TM-score under orthogonal computational models by inverse-folding the designed structure with ESM-IF (94) (using a temperature of 0.5) and re-folding with ESMFold (5). We report the best scTM over 8 inverse folding designs in Table S12.

The novelty of each motif-scaffold combination is assessed by measuring the TM-score between the generated scaffold and the chain from which the motif is derived. This ensures that the model is not simply retrieving the original motif scaffold, particularly for secondary structure-prompted scaffolds where no explicit instructions are provided to produce diverse designs. For motifs derived from ligand binding residues, Foldseek is used to search the PDB for any other proteins that share the same motif, as a more stringent evaluation of novelty. The generated scaffolds are also assessed for their designability by measuring a self-consistency TM-score under orthogonal computational models. The best scTM over 8 inverse folding designs is reported in Table S12.


First, we describe the procedure for generating the protein compression example shown in Fig. 2D. A series of prompts of length 150 were constructed. The sequence and structure of the catalytic triad of trypsin (PDB 1Y3V) (H57, D102, S195) were placed in the prompt using the following procedure: three random residue numbers between 20 and 130 were sampled such that the minimum pairwise difference in position between each of the residues was no less than 20. Then, H57 from the template trypsin was placed at the lowest sampled number, D102 at the second lowest, and S195 at the largest number, thus respecting the left-to-right ordering of the catalytic triad in the template trypsin. 128 prompts were generated by this procedure. Each of these prompts was combined with a function keyword prompt derived from the template protein, specifically InterPro (38) tags IPR001254 (serine proteases, trypsin domain) and IPR009003 (peptidase S1, PA clan), to arrive at a final set of 128 prompts. The base ESM3 7B model was then prompted to generate the sequence of the remaining 147 residues of the protein conditioned on the randomly placed catalytic triad sequence and structure coordinates and function keywords. $L=150$ decoding steps were used with a temperature of 0.7, with 32 generations per prompt. Generations were then filtered by active site cRMSD, ESM3 pTM, and InterProScan keyword outputs, with the generation shown in Fig. 2D selected finally by visual inspection.

The procedure for generating the protein compression example shown in Fig. 2D involves constructing a series of prompts of length 150. The sequence and structure of the catalytic triad of trypsin were placed in the prompt using a specific procedure. Three random residue numbers between 20 and 130 were sampled, and H57 from the template trypsin was placed at the lowest sampled number, D102 at the second lowest, and S195 at the largest number. This respected the left-to-right ordering of the catalytic triad in the template trypsin. 128 prompts were generated by this procedure, and each of these prompts was combined with a function keyword prompt derived from the template protein. The final set of 128 prompts was then used to prompt the base ESM3 7B model to generate the sequence of the remaining 147 residues of the protein. $L=150$ decoding steps were used with a temperature of 0.7, with 32 generations per prompt. Generations were then filtered by active site cRMSD, ESM3 pTM, and InterProScan keyword outputs, with the generation shown in Fig. 2D selected finally by visual inspection.
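The placement of the catalytic triad can be sketched with rejection sampling. Several details here are assumptions rather than statements from the text: the bounds 20 and 130 are treated as inclusive, sampled numbers are used directly as 0-based positions, only the sequence track is shown (the structure coordinates and function keywords are omitted), and rejection sampling is just one simple way to enforce the minimum spacing.

```python
import random

MASK = "_"
TRIAD = "HDS"  # H57, D102, S195 in template order


def sample_triad_positions(rng, low=20, high=130, min_gap=20):
    """Sample three sorted positions whose pairwise distances are all >= min_gap."""
    while True:
        positions = sorted(rng.sample(range(low, high + 1), 3))
        # With sorted positions, checking adjacent gaps covers every pair.
        if positions[1] - positions[0] >= min_gap and positions[2] - positions[1] >= min_gap:
            return positions


def build_triad_prompt(rng, length=150):
    """Return a sequence prompt with the triad placed left to right, plus positions."""
    positions = sample_triad_positions(rng)
    tokens = [MASK] * length
    for position, residue in zip(positions, TRIAD):
        tokens[position] = residue
    return "".join(tokens), positions


rng = random.Random(0)
prompts = [build_triad_prompt(rng) for _ in range(128)]
print(prompts[0][1], prompts[0][0])
```

Each prompt leaves 147 masked positions for the model to fill, matching the "remaining 147 residues" in the description.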


Generation quality was measured using ESMFold (5) pTM of the generated sequence, in addition to self-consistency. For self-consistency, we inverse fold the ESM3-predicted structure of the generation with ESM-IF1 (94) 8 times and re-fold with ESMFold, reporting the mean and std of the TM-scores between the 8 ESMFold-predicted structures and the ESM3-predicted structure. To perform a BLAST search of the sequence, we use a standard Protein BLAST search (51). We set the max target sequences parameter to 5000 and sort results by sequence length and sequence identity, selecting the first sequence that is a serine protease. This yields the reference WP_260327207 which is 164 residues long and shares $33 \%$ sequence identity with the generation.

The quality of the generated sequence was evaluated using ESMFold, a protein structure prediction tool, and a self-consistency check. The self-consistency check involved inverse folding the ESM3-predicted structure of the generated sequence with ESM-IF1 and re-folding it with ESMFold. The mean and standard deviation of the TM-scores between the 8 ESMFold-predicted structures and the ESM3-predicted structure were reported. Additionally, a Protein Blast search was performed to identify a reference sequence that shares sequence identity with the generated sequence. The reference sequence, WP_260327207, is a serine protease that is 164 residues long and shares 33% sequence identity with the generated sequence.
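The self-consistency summary is a small aggregation over the eight TM-scores. The sketch below assumes the scores have already been computed by the inverse folding and re-folding pipeline; the function name and the example values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator is reported. The maximum is included because Table S12 reports the best scTM over 8 designs.

```python
import statistics


def summarize_sctm(tm_scores):
    """Summarize self-consistency TM-scores over the inverse folding samples."""
    return {
        "mean": statistics.mean(tm_scores),
        "std": statistics.pstdev(tm_scores),  # population std is an assumption
        "best": max(tm_scores),
    }


# Hypothetical TM-scores for 8 inverse folded and re-folded sequences.
tm_scores = [0.80, 0.82, 0.84, 0.86, 0.78, 0.80, 0.84, 0.82]
summary = summarize_sctm(tm_scores)
print(summary)
```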


We showcase two further examples of protein editing. First, ESM3 is prompted to expose a buried helix in a protein with an alternating alpha-beta sandwich fold. The prompt is constructed as follows. The prompt has the same length as the template protein (PDB 1LBS). We identify a buried helix (mean SASA $0.32 \AA^{2}$) between residues 106-116 of the template protein. Structure coordinates from this region are placed in the prompt at the same residue indices, to prompt ESM3 to generate the same helix. This is composed with a SASA prompt of 40.0 for each of the 11 helix residues, prompting ESM3 to place this helix on the surface of the protein. Finally, we prompt with the secondary structure of 5 central beta strands surrounding the buried helix, residues 33-36, 62-65, 99-103, 125-130, and 179-182. ESM3 7B is then used to generate 512 protein sequences conditioned on this prompt using $\frac{L}{2}$ decoding steps and a temperature of 0.7. Designs are filtered by ESM3 pTM and adherence

In this first editing example, the prompt combines three tracks: structure coordinates for the helix at residues 106-116 of the template (PDB 1LBS), a SASA value of 40.0 at each of the 11 helix residues, and the secondary structure of the five central beta strands surrounding the helix. The structure track asks ESM3 to keep the helix, while the SASA track asks it to move the helix to the protein surface. ESM3 7B generates 512 sequences from this prompt using $\frac{L}{2}$ decoding steps and a temperature of 0.7, and the designs are filtered by ESM3 pTM and adherence to the SASA prompt.
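The composition of the helix-exposure prompt across tracks can be sketched with plain Python lists. This is only an illustration: the function name and track representation are hypothetical and do not reflect an actual ESM3 interface, residue numbers are used directly as list indices (a real implementation would map PDB numbering to positions), the strand label `"E"` follows the beta strand code used in the SS8 examples earlier, and `template_length=300` is a placeholder rather than the length of the template chain.

```python
MASK = "_"

HELIX = (106, 116)
STRANDS = ((33, 36), (62, 65), (99, 103), (125, 130), (179, 182))


def build_helix_exposure_prompt(template_length, sasa_value=40.0):
    """Compose SASA, SS8, and structure-position tracks for the editing prompt."""
    sasa = [None] * template_length   # None marks a masked SASA position
    ss8 = [MASK] * template_length
    structure_positions = list(range(HELIX[0], HELIX[1] + 1))
    for position in structure_positions:
        sasa[position] = sasa_value   # ask for an exposed helix
    for start, end in STRANDS:
        for position in range(start, end + 1):
            ss8[position] = "E"       # beta strand residues
    return {"sasa": sasa, "ss8": "".join(ss8), "structure_positions": structure_positions}


# Placeholder length, not taken from the source.
tracks = build_helix_exposure_prompt(template_length=300)
print(len(tracks["structure_positions"]), tracks["ss8"].count("E"))
```

The inclusive range 106-116 yields 11 positions, consistent with the 11 helix residues named in the description.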


\begin{tabular}{|c|c|c|c|}
\hline
Scaffold & Reference & InterPro tags & Total Length \\
\hline
Beta propeller & 8sinA & \begin{tabular}{l} IPR001680 (1-350) \\ IPR036322 (1-350) \\ IPR015943 (1-350) \end{tabular} & 353 \\
\hline
TIM barrel & 7rpnA & \begin{tabular}{l} IPR000652 (0-248) \\ IPR020861 (164-175) \\ IPR035990 (0-249) \\ IPR013785 (0-251) \\ IPR000652 (2-249) \\ IPR022896 (1-249) \end{tabular} & 252 \\
\hline
MFS transporter & 4ikvA & \begin{tabular}{l} IPR011701 (1-380) \\ IPR020846 (1-380) \\ IPR036259 (1-380) \end{tabular} & 380 \\
\hline
Immunoglobulin & 7sbdH & \begin{tabular}{l} IPR036179 (0-116; 124-199) \\ IPR013783 (0-206) \\ IPR003597 (124-202) \\ IPR007110 (0-115; 121-207) \\ IPR003599 (6-115) \\ IPR013106 (11-114) \end{tabular} & 209 \\
\hline
Histidine kinase & 8dvqA & \begin{tabular}{l} IPR003594 (47-156) \\ IPR003594 (47-158) \\ IPR004358 (118-137) \\ IPR004358 (141-155) \\ IPR004358 (101-112) \\ IPR005467 (0-158) \\ IPR036890 (4-159) \\ IPR036890 (3-156) \end{tabular} & 166 \\
\hline
Alpha/beta hydrolase & 7yiiA & \begin{tabular}{l} IPR029058 (0-274) \\ IPR000073 (26-265) \end{tabular} & 276 \\
\hline
\end{tabular}

Table S11. InterPro tags extracted from CAMEO test set proteins for prompting with fold specification.

Table S11 lists the InterPro tags used to prompt ESM3 with a fold specification. Each row gives the scaffold type, the CAMEO test set reference protein from which the tags were extracted, the InterPro accessions together with the residue ranges they annotate, and the total length. InterPro tags identify functional domains and families; here they are converted into function keywords that condition generation on a particular fold.


\begin{tabular}{rrcc}
\hline
Site & Scaffold & Novelty (TM to original) & Designability (scTM) \\
\hline
017 & beta & 0.264 & 0.967 \\
ACE2 & alpha & 0.606 & 0.871 \\
CA & Immunoglobulin & 0.441 & 0.781 \\
MG & ab-hydrolase & 0.293 & 0.969 \\
 & TIM-barrel & 0.328 & 0.980 \\
Renal-dipeptidase & alpha-beta-alpha & 0.644 & 0.933 \\
SRO & mfs-transporter & 0.345 & 0.992 \\
Topoisomerase & histidine-kinase & 0.269 & 0.948 \\
YLT & alpha-beta & 0.229 & 0.899 \\
ZN & alpha & 0.567 & 0.996 \\
\hline
\end{tabular}

Table S12. Novelty and designability metrics. Metrics are shown for motif scaffolds shown in Fig. 2C. Novelty is measured by computing the TM-score to the original scaffold from which the motif is derived. Designability is measured by self-consistency TM-score over eight samples by inverse folding with ESM-IF and refolding with ESMFold. All designs are distinct from their original scaffolds while retaining high designability.

to the SASA prompt. The final generation is chosen by visual inspection. The generation is evaluated as described above (ESMFold pTM 0.71, scTM mean 0.82, std 0.045). Examining the generation, ESM3 is able to satisfy the input constraints: the generated protein maintains the structure of the helix (cRMSD $0.18 \AA$) and the alternating alpha-beta fold (both the generation and the template have 7 strands alternating with helices), while exposing the helix motif to the surface (mean SASA $28.35 \AA^{2}$). Furthermore, the generation is structurally distinct: a Foldseek search (39) of AlphaFold-DB, ESMAtlas, and PDB in TM-align mode reveals no hit with TM-score greater than 0.76.

The table shows the novelty and designability metrics for various motif scaffolds. Novelty is measured by computing the TM-score to the original scaffold, while designability is measured by self-consistency TM-score over eight samples by inverse folding with ESM-IF and refolding with ESMFold. The final generation is chosen by visual inspection and evaluated using ESMFold pTM and scTM mean and std. The example given shows how ESM3 is able to satisfy input constraints and generate a structurally distinct protein with a mean SASA of 28.35 $\AA^{2}$.


We also use ESM3 to generate an idealized TIM Barrel with 11-fold symmetry. This generation is undertaken in two steps. First, we derive a secondary structure and function keyword prompt from a reference TIM Barrel (PDB 5EKY). The secondary structure of the reference protein is computed using DSSP and then idealized to construct a prompt for ESM3. To construct the secondary structure prompt, the length of each helix and strand is fixed at 7 residues. Each helix and strand region is then separated by 3 mask tokens, with a mask token appended to the N and C termini of the prompt as well. This yields a secondary structure prompt of total length 159, which is combined with a function keyword prompt derived from the reference protein: keywords are derived from IPR013785 (aldolase-type TIM barrel) and IPR000887 (KDPG/KHG aldolase). ESM3 7B is then used to generate 256 samples with $L$ decoding steps and a temperature of 0.7. The design shown is chosen by filtering by ESM3 pTM and visual inspection. In the second step, the secondary structure prompt from the first step is expanded to contain 11 helix-strand subunits, for a total prompt length of 225 residues (4 mask tokens are now appended to the N and C termini, rather than just 1). ESM3 7B is then used to generate 256 samples with $L$ decoding steps and a temperature of 0.7, with generations filtered by ESM3 pTM and visual inspection. The generation is evaluated as described above (ESMFold pTM 0.69, scTM mean 0.97, std 0.011). The generation is structurally distinct: a Foldseek search (39) of AlphaFold-DB, ESMAtlas, and PDB in TM-align mode reveals no hit with TM-score greater than 0.61.

The process of generating an idealized TIM Barrel with 11-fold symmetry involves two steps. The first step derives a secondary structure and function keyword prompt from a reference TIM Barrel (PDB 5EKY), with the secondary structure computed using DSSP and then idealized. The secondary structure prompt is constructed by fixing the length of each helix and strand at 7 residues and separating them with 3 mask tokens, with one mask token at each terminus. The total length of the prompt is 159, which is combined with a function keyword prompt derived from the reference protein. ESM3 7B is then used to generate 256 samples with $L$ decoding steps and a temperature of 0.7, and the design is chosen by filtering with ESM3 pTM and visual inspection.

In the second step, the secondary structure prompt is expanded to contain 11 helix-strand subunits, with 4 mask tokens at each terminus, for a total prompt length of 225 residues. ESM3 7B is used to generate 256 samples with $L$ decoding steps and a temperature of 0.7, and the generation is filtered by ESM3 pTM and visual inspection. The resulting generation is structurally distinct, as revealed by a Foldseek search of AlphaFold-DB, ESMAtlas, and PDB in TM-align mode, which yielded no hit with TM-score greater than 0.61.
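The stated prompt lengths can be checked by constructing the idealized secondary structure prompts directly. This sketch rests on two assumptions not spelled out in the text: the first-step prompt is taken to contain 8 helix-strand subunits, which is the count that reproduces the stated length of 159, and each subunit is written as a strand followed by a helix. The function name is hypothetical.

```python
MASK = "_"


def idealized_barrel_prompt(n_subunits, element_len=7, gap=3, terminal_mask=1):
    """Build an idealized SS8 prompt of alternating strands (E) and helices (H)."""
    elements = []
    for _ in range(n_subunits):
        elements.append("E" * element_len)  # strand; strand-first order is assumed
        elements.append("H" * element_len)  # helix
    body = (MASK * gap).join(elements)      # 3 mask tokens between elements
    return MASK * terminal_mask + body + MASK * terminal_mask


step_one = idealized_barrel_prompt(n_subunits=8, terminal_mask=1)
step_two = idealized_barrel_prompt(n_subunits=11, terminal_mask=4)
print(len(step_one), len(step_two))
```

The arithmetic behind the two lengths is $16 \times 7 + 15 \times 3 + 2 = 159$ for the first step and $22 \times 7 + 21 \times 3 + 8 = 225$ for the second, so both totals in the text are consistent with this construction.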