doi.bio/esm3/esm3.a3.out14

ESM3 can be used for both generative and predictive tasks. As a generative model, it produces new proteins based on the patterns it has learned from its training data; as a representation learning model, it learns representations of proteins that are useful for downstream predictive tasks.

To evaluate performance in both areas, benchmarking results are presented below, covering the quality of generated proteins and the usefulness of the learned representations for predictive tasks.

The ESM3 models are trained at three scales, with varying numbers of parameters and training tokens. The smallest model has 1.4 billion parameters and is trained on approximately 75 billion tokens; the medium model has 7 billion parameters and is trained on 560 billion tokens; and the largest model has 98 billion parameters and is trained on 1.8 trillion tokens. The different scales allow the appropriate model to be chosen for a given protein modeling task, depending on the available data and computational resources.

The ESM3 1.4B model, trained on 75 billion tokens, is the smallest and fastest model in the family, allowing rapid iteration both during training and at inference time. The optimal model size and number of training tokens are estimated by extrapolating from smaller runs, given the training compute budget, model architecture, and dataset characteristics. After determining compute-optimal training, other factors such as release frequency, amount of inference, ease of use, and usage patterns are considered when choosing the number of tokens to train on. To benefit the research community, two additional versions of ESM3 1.4B were trained, named 1.4B Overtrained and 1.4B Open, which are trained on 300 billion tokens, far beyond their compute-optimal token count, to enable efficient inference.

The benchmarks in this section use a test set of 902 proteins held out from the ESM3 training set. These proteins were sourced from Continuous Automated Model EvaluatiOn (CAMEO) targets released between May 1, 2020 and Aug 1, 2023.

The CASP14 and CASP15 sets are collections of protein structures used as benchmarks for evaluating contact and structure prediction methods. They were obtained directly from the organizers of the Critical Assessment of Protein Structure Prediction (CASP), a biennial experiment that assesses the state of the art in protein structure prediction. The CASP14 set contains 71 protein structures and the CASP15 set contains 70. Both sets are widely used to evaluate and compare structure prediction methods.

The contact prediction model is a multilayer perceptron (MLP) head that operates independently over the representation of each pair of residues, predicting the probability that the two residues are in contact. The base model is adapted with LoRA, a common alternative to full-weight finetuning that uses much less memory while attaining strong performance. LoRA is applied to the base model, and the MLP together with the LoRA weights are trained end-to-end using a cross-entropy loss against the ground-truth contact map. In the ground truth, all residue pairs at least 6 positions apart in the sequence and within an $8 \AA$ $\mathrm{C}\alpha$-$\mathrm{C}\alpha$ distance are labeled as contacts. Models are trained with LoRA rank 4, batch size 64, and a learning rate of $1\mathrm{e}{-3}$ for 10k steps on a mix of sequence and structure data from the PDB, AlphaFold-DB, ESMAtlas, and OAS Predicted Structures, sampled in a ratio of 1:3:3:0.03.
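As a rough illustration of the head described above, the following is a minimal PyTorch sketch (not the authors' code): a pairwise MLP over concatenated residue representations, with ground-truth contacts defined as residue pairs at least 6 positions apart in sequence and within 8 Å Cα-Cα distance. The pairwise featurization by concatenation and the hidden size are assumptions.

```python
# Minimal sketch of a pairwise contact-prediction head; illustrative only.
import torch
import torch.nn as nn

class ContactHead(nn.Module):
    def __init__(self, d_model: int, d_hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (L, d_model) per-residue representations for one protein
        L, d = x.shape
        pair = torch.cat(
            [x.unsqueeze(1).expand(L, L, d), x.unsqueeze(0).expand(L, L, d)], dim=-1
        )                                   # (L, L, 2 * d_model)
        return self.mlp(pair).squeeze(-1)   # (L, L) contact logits

def contact_labels(ca_coords: torch.Tensor, min_sep: int = 6, cutoff: float = 8.0):
    """Ground truth: residues >= 6 apart in sequence and within 8 A Ca-Ca distance."""
    L = ca_coords.shape[0]
    dist = torch.cdist(ca_coords, ca_coords)                    # (L, L) Ca-Ca distances
    idx = torch.arange(L)
    sep = torch.abs(idx[:, None] - idx[None, :])                # sequence separation
    labels = (dist < cutoff) & (sep >= min_sep)
    return labels.float(), (sep >= min_sep)                     # labels and valid-pair mask

# Training would apply a binary cross-entropy loss over the valid pairs, e.g.
# loss = F.binary_cross_entropy_with_logits(logits[mask], labels[mask]),
# with LoRA adapters on the base model providing the trainable weights
# alongside this head.
```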

Performance on each structural test set is measured by precision at L (P@L), the precision of the top-L most confident contact predictions, where L is the protein length. The smallest ESM3 model, with 1.4B parameters, achieves a P@L of $0.76 \pm 0.02$ on the CAMEO test set, exceeding the 3B-parameter ESM2 model ($0.75 \pm 0.02$) while requiring an order of magnitude less pre-training compute ($6.72 \times 10^{20}$ FLOPS vs. $1.8 \times 10^{22}$ FLOPS), highlighting the benefits of multimodal pre-training.
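For concreteness, the P@L metric described above can be computed as in the following sketch, which takes a matrix of predicted contact probabilities and a boolean ground-truth contact map; the separation filter and tie handling are assumptions.

```python
# Minimal sketch of precision-at-L: fraction of the top-L most confident
# predicted contacts that are true contacts, with L the protein length.
import numpy as np

def precision_at_L(pred_probs: np.ndarray, true_contacts: np.ndarray, min_sep: int = 6) -> float:
    L = pred_probs.shape[0]
    i, j = np.triu_indices(L, k=min_sep)          # unique pairs >= min_sep apart in sequence
    order = np.argsort(-pred_probs[i, j])[:L]     # indices of the L most confident pairs
    return float(true_contacts[i[order], j[order]].mean())
```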

ESM3 can predict protein structures directly, without additional finetuning, by first predicting structure tokens and then decoding them into coordinates. Structure tokens are predicted following the strategy in Appendix A.1.10, testing both argmax decoding and full iterative decoding.

Iterative decoding has a significant impact on the more difficult datasets, CASP14 and CASP15, as shown in Table S8, whereas argmax prediction is sufficient for the easier CAMEO set. Argmax prediction with the 7B model is comparable to ESMFold on both CAMEO and CASP15, and iterative decoding with ESM3 98B helps close the gap between ESMFold and AlphaFold2. Structure prediction scaling curves as a function of training compute are provided in Fig. S10.

The conditional likelihood of an output given a prompt measures how well the model can generate one track given another. ESM3's performance is reported as its negative log-likelihood (NLL) on the test set for each of five tracks: sequence, structure, function, SASA, and secondary structure. The NLL is computed both unconditionally and conditioned on each of the other tracks, quantifying how well the model can generate each track given information from the others. Results are presented in Fig. S11 and Table S9.

Unlike an autoregressive model, ESM3 predicts tokens given any masking pattern. The NLL of a sample under ESM3 is therefore an expectation over all possible decoding orders of the sum of negative log-probabilities of each token given the previously decoded tokens, which is intractable to compute exactly because the number of decoding orders grows exponentially with sequence length. To approximate the NLL, a single decoding order is sampled for each sequence in the dataset, and teacher forcing replaces the masked tokens with the ground-truth tokens as decoding proceeds. The mean NLL over the output tokens is reported.
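A minimal sketch of this approximation, assuming a `model(tokens)` interface that returns per-position logits for a fully specified token tensor (this interface is an assumption, not the ESM3 API):

```python
# Sketch: sample one decoding order, reveal ground-truth tokens as decoding
# proceeds (teacher forcing), and average the per-token negative log-probabilities.
import torch
import torch.nn.functional as F

@torch.no_grad()
def approx_nll(model, tokens: torch.Tensor, mask_id: int) -> float:
    L = tokens.shape[0]
    order = torch.randperm(L)                   # one sampled decoding order
    current = torch.full_like(tokens, mask_id)  # start fully masked
    nlls = []
    for pos in order:                           # decode one position at a time
        logits = model(current.unsqueeze(0))[0]             # (L, vocab)
        logp = F.log_softmax(logits[pos], dim=-1)
        nlls.append(-logp[tokens[pos]].item())
        current[pos] = tokens[pos]              # teacher forcing with the ground truth
    return sum(nlls) / len(nlls)                # mean NLL over output tokens
```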

The data show clear trends. The unconditional NLL is consistently higher than the conditional NLL, indicating that conditioning improves prediction. When the full 3D structure is provided as conditioning, the loss on secondary structure prediction is greatly reduced, with the NLL decreasing from 0.24 to 0.19 to 0.16 across the three model scales. This suggests that incorporating 3D structure information substantially improves secondary structure prediction.

Different tracks show different trends. Conditioning on sequence results in a lower structure prediction loss than conditioning on secondary structure, meaning sequence is the more informative conditioning signal for structure.

There are diminishing returns to scale for prediction of structure, function, SASA, and secondary structure: as pre-training compute increases, the improvement in these losses shrinks. This saturation is not observed for sequences, where a clear log-linear relationship between pre-training FLOPS and NLL holds regardless of conditioning.

The distribution of pTM scores for the predicted structures is shown in Fig. S13B. pTM is the model's predicted TM-score, a global confidence estimate for the predicted structure; higher values indicate higher-confidence predictions. The results show that ESM3 generates high-quality protein structures even at lengths not seen during training, demonstrating its ability to generalize and generate novel proteins.

The figure shows the distributions of pTM and pLDDT under the ESM3 7B structure prediction model for natural and generated sequences. Natural sequences come from the test set, while the generated sequences are unconditional generations from ESM3 98B. Compared to natural sequences, the generated sequences show a weaker correlation between the two metrics and a mode of sequences with high pLDDT but low pTM. ESM3 generates more high-quality structures than ESM2, which was trained with a simple masked language modeling objective over sequence only at a fixed mask rate. The generated sequences are similar but not identical to proteins found in the training set and cover the training set well, indicating that the model has fit the training distribution without mode collapse. A cluster of generations with very high sequence identity to the training set corresponds to antibody sequences, where the framework regions account for the high identity.

pTM is used instead of pLDDT for evaluating structure predictions from ESM3 because pLDDT can be miscalibrated for generated structures, leading to overestimated confidence. pLDDT is biased toward local structural confidence, which can produce pathologies such as very long alpha helices with high pLDDT at every position. pTM is a more global measure of structural confidence and is more robust to these pathologies. Fig. S12 shows that the correlation between pTM and pLDDT decreases for generated sequences, with a clear pattern of high pLDDT (greater than 0.8) but low pTM (less than 0.6) emerging.

To visualize the distribution of unconditional generations, we first extract the final-layer outputs produced by running ESM3 7B with sequence inputs only, then compute a protein-level embedding by averaging over all positions in the sequence, yielding a 2560-dimensional embedding per protein.

Next, we use a UMAP projection (90) to project these embeddings into two dimensions. The UMAP projection is fit on a background distribution of 50,000 randomly sampled sequences from UniProt with a minimum distance of 0.1 and 25 neighbors.
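A minimal sketch of this embedding and projection step using the umap-learn package; `background_reprs` and `generated_reprs`, lists of per-residue embedding arrays of shape (L, 2560) from an ESM3 forward pass, are placeholders rather than real API outputs:

```python
# Sketch: mean-pool per-residue embeddings, fit UMAP on the UniProt background,
# then project the generated proteins into the same 2D space.
import numpy as np
import umap

def protein_embeddings(per_residue_reprs: list) -> np.ndarray:
    # Average the (L, 2560) per-residue outputs into one 2560-d vector per protein.
    return np.stack([h.mean(axis=0) for h in per_residue_reprs])

def project(background_reprs: list, generated_reprs: list) -> np.ndarray:
    # Fit on the 50,000-sequence UniProt background with the stated parameters.
    reducer = umap.UMAP(n_neighbors=25, min_dist=0.1, n_components=2)
    reducer.fit(protein_embeddings(background_reprs))
    return reducer.transform(protein_embeddings(generated_reprs))
```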

To select examples for visualization, we compute structural clusters with Foldseek-cluster using default parameters and sample the example with the highest ESM3 pTM from each cluster. A subset of these cluster representatives is shown in Fig. 1E.

To determine whether ESM3 is biased toward certain secondary structures, we use DSSP to compute the three-class secondary structure of high-confidence generations (pTM greater than 0.8 and mean pLDDT greater than 0.8) and compare the percentage of residues forming alpha helices and beta sheets to a background distribution computed over the PDB. ESM3 closely matches the secondary structure distribution of known proteins, unlike other methods that tend to generate more helical structures. We also confirm that structures predicted with high confidence by ESM3 are designable, by inverse folding and re-folding each with ESM3 7B. The majority of generations successfully re-fold with a TM-score greater than 0.8 to the hallucinated structures, demonstrating that ESM3 has high self-consistency for its own high-confidence designs.
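The secondary structure composition check can be sketched as follows using Biopython's DSSP wrapper; the 8-to-3 class mapping shown is the common convention and is an assumption about the exact mapping used here:

```python
# Sketch: run DSSP on a predicted structure, collapse 8 DSSP classes to 3,
# and report helix/strand/coil fractions for comparison against the PDB background.
from Bio.PDB import PDBParser
from Bio.PDB.DSSP import DSSP

SS3 = {"H": "helix", "G": "helix", "I": "helix", "E": "strand", "B": "strand"}

def ss_composition(pdb_path: str) -> dict:
    structure = PDBParser(QUIET=True).get_structure("gen", pdb_path)
    dssp = DSSP(structure[0], pdb_path)   # requires the dssp/mkdssp binary on PATH
    classes = [SS3.get(dssp[key][2], "coil") for key in dssp.keys()]
    n = len(classes)
    return {name: classes.count(name) / n for name in ("helix", "strand", "coil")}
```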

We also explore chain-of-thought (CoT) generation, in which the model first generates secondary structure tokens, then 3D backbone coordinates, and finally amino acid sequence tokens, and compare the quality of sequences generated this way to sequences generated directly. Sequences generated via the CoT procedure have higher-confidence ESM3-predicted structures and are more designable, with a small bias toward higher alpha and beta proportion.

To evaluate ESM3's ability to follow prompts, we use the set of held-out proteins described in Appendix A.3.2, further filtered to remove proteins longer than 1024 residues (removing 7 proteins). To construct prompts for the structure coordinate, secondary structure, and SASA tracks, we sample a random span covering 15% of the original protein length, show the model the corresponding track over that span, and task it with generating the sequence for the entire protein. For example, for the structure track and a protein of length 100, we might sample a random 15-residue span, say residues 20-34; the model must then generate a protein sequence of length 100 conditioned on structure coordinates for residues 20-34 taken from the original test protein. The same procedure is applied for the secondary structure and SASA tracks. For the function track, the prompt is formed by tokenizing the keywords from the InterProScan annotations associated with each sequence. The ESM3 7B model is used for all generations, with a temperature of 0.7 and L decoding steps (where L is the length of the sequence). The model generates 64 sequences per prompt, which we use to compute pass64.
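A minimal sketch of the span-based prompt construction, assuming a per-residue track array and a generic mask value (both illustrative):

```python
# Sketch: reveal a random contiguous span covering 15% of the protein on one
# conditioning track and mask everything else.
import numpy as np

def make_span_prompt(track: np.ndarray, frac: float = 0.15, mask_value=None, rng=None):
    rng = rng or np.random.default_rng()
    L = len(track)
    span_len = max(1, int(round(frac * L)))
    start = int(rng.integers(0, L - span_len + 1))
    prompt = np.full(L, mask_value, dtype=object)
    prompt[start:start + span_len] = track[start:start + span_len]
    return prompt, (start, start + span_len)   # prompt plus the revealed interval
```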

An alignment is used to map the coordinates of the generated structures onto the reference structure. The structure coordinate track reports the RMSD of the generated structures to the reference structure; the secondary structure track reports the fraction of positions at which the generated structures match the reference secondary structure; and the SASA track reports the solvent accessible surface area of the generated structures at each position.

Figure S13 shows the results of using ESM3 to generate high-quality and diverse proteins unconditionally. The figure consists of four panels:

Panel A shows the distribution of sequence lengths in the dataset of unconditional generations. The length of the sequences ranges from 50 to 500 amino acids, with a peak at around 150 amino acids.

Panel B compares the mean predicted local distance difference test (pLDDT) and predicted TM-score (pTM) of unconditional generations from ESM3 with sequences designed using the 3B-parameter ESM2 model. The ESM3 generations have higher pLDDT and pTM values, indicating higher-quality predicted structures.

Panel C shows the round-trip success rate of high-confidence generations from ESM3, where success is defined as a TM-score greater than 0.8 between the original and refolded designs. The success rate is high, indicating that ESM3's high-confidence generations are designable.

Panel D shows the secondary structure composition of the unconditional generations relative to the distribution of proteins in the Protein Data Bank (PDB). The generations have a similar secondary structure composition to PDB proteins, indicating that they are structurally representative of natural proteins.

Figure S14 shows chain-of-thought generation: SS8 tokens are generated first, then structure tokens, then amino acid sequences, all with the ESM3 7B model. The figure compares the distribution of mean pLDDT and pTM of chain-of-thought generations against directly generated sequences, shows the predicted structure of the SS8 tokens and the corresponding CoT sequence, reports the TM-score between predicted structures of high-confidence generated sequences and their inverse-folded, re-folded counterparts, and compares the secondary structure composition of high-confidence generated sequences to the distribution of proteins in the PDB.

To evaluate the generated sequences, we compute backbone cRMSD, 3-class secondary structure accuracy, and SASA Spearman $\rho$ over the relevant span, comparing the ESMFold-predicted structure of the generation with the original template protein. For the function annotation track, we run InterProScan on each generated sequence and extract function keywords from the emitted annotations. We report function keyword recovery at the protein level: the proportion of all function keywords in the prompt that appear anywhere among the function keywords from the InterProScan annotations of the generation.
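The protein-level keyword recovery metric reduces to a simple set computation, sketched here with illustrative names:

```python
# Sketch: fraction of prompt keywords that appear anywhere among the keywords
# extracted from InterProScan annotations of the generation.
def keyword_recovery(prompt_keywords: set, generation_keywords: set) -> float:
    if not prompt_keywords:
        return 0.0
    return len(prompt_keywords & generation_keywords) / len(prompt_keywords)
```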

To evaluate ESM3's ability to generalize beyond its training distribution under prompting, proteins deposited in the PDB after the training cutoff were identified, and eight with TM-score < 0.7 to any structure in the training dataset were chosen. DSSP was used to compute residue-level SS8 and SASA for each of these proteins to prompt ESM3, masking all other tracks. The generated proteins are diverse and globular, closely follow the SS8 and SASA prompts, and have no close sequence or structure neighbors in the training set. Interestingly, ESMFold does not fold these proteins with high confidence or accuracy, suggesting they are challenging folding targets, whereas the ESM3-generated sequences are predicted with similar confidence but much higher accuracy.

DSSP was also used to classify the residue-level secondary structure of eight symmetric protein backbones previously designed using ESMFold, spanning a range of secondary structures and symmetries. ESM3 designs these proteins successfully, with high confidence and low sequence similarity to the training set; structural similarity is moderate, reflecting the high structural conservation of the protomer units in each design. Designs are generated at a constant temperature of 0.7 with L/2 decoding steps, sampling 256 sequences per prompt. Final examples are selected by visual inspection, and sequence and structure similarity are computed using the same procedure as for the unconditional generations.

ESM3 can generate proteins with particular characteristics by combining input tracks (sequence, structure, SS8, SASA, and function keywords) into multimodal prompts. To demonstrate this, the standard functional motif scaffolding task is augmented with additional conditioning that specifies the type of scaffold ESM3 should design. The functional sites comprise a combination of ligand-binding sites coordinated by residues remote in sequence and sites defined by short local motifs. The coordinates and amino acid identities of all site residues from the reference PDB structures are input to the model, with random shuffling and augmentation of the gaps between each active site. In addition, a set of 12 partial sequence and structure prompts derived from conserved functional motifs is created, defined using a combination of the benchmark dataset in Watson et al. (23) and conserved sequence patterns from the Prosite database (92).

For scaffold conditioning, the secondary structure composition or the fold of the protein is specified using SS8 tokens or InterPro accession numbers, respectively, to generate diverse and novel scaffolds. Between 256 and 2048 samples are drawn for each combination of functional site and scaffold prompt. Designs are generated with the 7B-parameter model at a constant temperature of 0.7, using $L/2$ decoding steps for a protein of length $L$.

Each span is separated by a mask token, and these spans are combined with the functional site motif to create the full prompt. The Rosetta software suite is used to generate protein structures satisfying the given secondary structure and functional site constraints, and the quality of the generated structures is evaluated with the Rosetta energy function, which accounts for factors such as fold stability, side-chain packing, and interactions between the protein and the functional site. Additional metrics, such as RMSD to the native structure and the fraction of native contacts, are also used to assess quality. Overall, this approach yields a diverse set of protein structures that satisfy specific secondary structure and functional site constraints, which is useful for protein engineering and design.

Figure S15 shows the results of prompting ESM3 to generate protein sequences that differ from its training distribution, using two types of prompts: SS8 prompts, based on the secondary structure of the protein, and SASA prompts, based on its solvent accessible surface area.

In panel A, SS8 and SASA prompts are derived from recent PDB structures with low structural similarity to the training set. The prompts are visualized along the protein length, with secondary structure shown in three classes (alpha, beta, coil) and SASA shown as a line plot colored by residue index to match the cartoon below. ESM3 generates protein sequences distinct from its training distribution, and the SS8 and SASA prompts effectively guide the model toward these new sequences.

In panel B, SS8 prompts are used to generate symmetric proteins. The similarity of the generated proteins to the nearest training set protein is measured by structure (TM-score) and sequence (sequence identity) and compared to unconditional generation. The SS8 prompts are effective at generating symmetric proteins distinct from the training set, with lower similarity to the training set than unconditionally generated proteins.

The table lists protein motifs with their corresponding PDB IDs, chain IDs, and residue identifiers. These motifs are functional regions with specific roles in protein function, such as binding other molecules or catalyzing chemical reactions. The table also includes the length of each motif and any gaps between residues.

Functional site prompts specify the location and identity of a particular motif within a protein and are used to guide the design process. Design success is measured by several metrics, including the accuracy of the predicted secondary structure and the similarity of the designed protein to the target structure.

Keyword prompting is used to generate proteins with a specified fold. A set of InterPro tags associated with proteins that achieve high keyword recovery under ESM3 is extracted, converted into keywords, and used to prompt the model in combination with partial sequence and structure constraints. The resulting designs are assessed by a self-consistency evaluation: whether the model predicts any of the prompted InterPro accessions for the designed sequence. Success is defined as pTM > 0.8, all-atom cRMSD < 2.0, and number of InterPro accessions recovered > 0.
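The success criterion can be expressed as a simple filter; field names here are illustrative:

```python
# Sketch of the stated keyword-prompted design success criterion:
# pTM > 0.8, all-atom cRMSD < 2.0, and at least one prompted InterPro
# accession recovered on the designed sequence.
def design_success(ptm: float, all_atom_crmsd: float,
                   prompted_accessions: set, predicted_accessions: set) -> bool:
    recovered = len(prompted_accessions & predicted_accessions)
    return ptm > 0.8 and all_atom_crmsd < 2.0 and recovered > 0
```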

The novelty of each motif-scaffold combination is assessed by measuring the TM-score between the generated scaffold and the chain from which the motif is derived, ensuring that the model is not simply retrieving the original motif scaffold, particularly for secondary-structure-prompted scaffolds where no explicit instruction to produce diverse designs is given. For motifs derived from ligand-binding residues, Foldseek is used to search the PDB for other proteins sharing the same motif, as a more stringent evaluation of novelty. The generated scaffolds are also assessed for designability by measuring a self-consistency TM-score under orthogonal computational models; the best scTM over 8 inverse-folding designs is reported in Table S12.

The protein compression example shown in Fig. 2D is generated by constructing a series of prompts of length 150. The sequence and structure of the catalytic triad of trypsin are placed in the prompt as follows: three random residue numbers between 20 and 130 are sampled, and H57 from the template trypsin is placed at the lowest sampled number, D102 at the second lowest, and S195 at the largest, respecting the left-to-right ordering of the catalytic triad in the template. 128 prompts are generated by this procedure, and each is combined with a function keyword prompt derived from the template protein. The final set of 128 prompts is used to prompt the base ESM3 7B model to generate the sequence of the remaining 147 residues of the protein, using $L=150$ decoding steps and a temperature of 0.7, with 32 generations per prompt. Generations are then filtered by active site cRMSD, ESM3 pTM, and InterProScan keyword outputs, with the generation shown in Fig. 2D selected by visual inspection.
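A minimal sketch of the triad-placement step, assuming a mask-token placeholder string; sampling the three positions without replacement is an assumption:

```python
# Sketch: sample three positions in [20, 130], sort them, and place H57, D102,
# and S195 from the template trypsin at the lowest, middle, and highest
# positions of a length-150 prompt, preserving their left-to-right order.
import random

def place_catalytic_triad(prompt_len: int = 150, lo: int = 20, hi: int = 130, seed=None):
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(lo, hi + 1), 3))   # three distinct positions, ascending
    triad = ["H57", "D102", "S195"]                         # template residues, in sequence order
    prompt = ["<mask>"] * prompt_len
    for pos, residue in zip(positions, triad):
        prompt[pos] = residue
    return prompt, dict(zip(triad, positions))
```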

The quality of the generated sequence is evaluated with ESMFold and a self-consistency check: the ESM3-predicted structure of the generated sequence is inverse folded with ESM-IF1 and re-folded with ESMFold, and the mean and standard deviation of the TM-scores between the 8 ESMFold-predicted structures and the ESM3-predicted structure are reported. A Protein BLAST search identifies a related reference sequence, WP_260327207, a serine protease of 164 residues that shares 33% sequence identity with the generated sequence.
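The self-consistency check can be sketched as the following control flow, with the structure prediction, inverse folding, refolding, and TM-score tools passed in as callables (stand-ins for ESM3, ESM-IF1, ESMFold, and a TM-score calculator; not their actual APIs):

```python
# Sketch: predict the ESM3 structure once, inverse fold it 8 times, refold each
# redesigned sequence, and summarize TM-scores against the ESM3 structure.
import statistics
from typing import Callable

def self_consistency(sequence: str,
                     predict_structure: Callable,   # e.g. ESM3 structure prediction
                     inverse_fold: Callable,        # e.g. ESM-IF1 (stochastic sampling)
                     refold: Callable,              # e.g. ESMFold
                     tm_score: Callable,
                     n_samples: int = 8):
    reference = predict_structure(sequence)
    scores = []
    for _ in range(n_samples):
        redesigned = inverse_fold(reference)            # sample one inverse-folded sequence
        scores.append(tm_score(refold(redesigned), reference))
    return statistics.mean(scores), statistics.stdev(scores)
```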

Designs are filtered by ESM3 pTM and adherence to the prompt, and the top 10 designs are selected for further analysis.

The second example involves designing a protein with a novel fold. The same approach is used as in the first example, but ESM3 is prompted with a 200-residue prompt that does not correspond to any known protein structure and contains a mix of helical and sheet-like secondary structure elements. ESM3 7B then generates 512 protein sequences conditioned on this prompt using $L/2$ decoding steps and a temperature of 0.7. Designs are filtered by ESM3 pTM and adherence to the prompt, and the top 10 designs are selected for further analysis.

In both examples, ESM3 pTM is used to filter out low-confidence designs. ESM3 is then used to predict structures of the top 10 designs, and their structural properties are analyzed with standard bioinformatics tools.

The table lists the scaffold, reference, and InterPro tags for each protein, along with the total protein length. The InterPro tags identify functional domains and motifs in the protein sequence, and the fold specification groups proteins by structural similarity. Together these provide an overview of the proteins and their associated functional and structural information.

The table reports novelty and designability metrics for the motif scaffolds. Novelty is measured by the TM-score to the original scaffold, and designability by the self-consistency TM-score over eight samples obtained by inverse folding with ESM-IF and refolding with ESMFold. The final generation is chosen by visual inspection and evaluated by ESMFold pTM and scTM (mean and standard deviation). The example shows that ESM3 can satisfy the input constraints while generating a structurally distinct protein with a mean SASA of 28.35 $\AA^{2}$.

Generating an idealized TIM barrel with 11-fold symmetry involves two steps. First, a secondary structure and function keyword prompt is derived from a reference TIM barrel (PDB 5EKY) using DSSP and ESM3. The secondary structure prompt is constructed by fixing the length of each helix and strand at 7 residues and separating them with 3 mask tokens, giving a total prompt length of 159, which is combined with a function keyword prompt derived from the reference protein. ESM3 7B then generates 256 samples with $L$ decoding steps and a temperature of 0.7, and the design is chosen by filtering with ESM3 pTM and visual inspection.

Second, the secondary structure prompt is expanded to contain 11 helix-strand subunits, for a total prompt length of 225 residues. ESM3 7B again generates 256 samples with $L$ decoding steps and a temperature of 0.7, filtered by ESM3 pTM and visual inspection. The resulting generation is structurally distinct: a Foldseek search of AlphaFold-DB, ESMAtlas, and the PDB in TM-align mode yields no hit with TM-score greater than 0.61.
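A rough sketch of the alternating helix-strand SS8 prompt construction is below; note that 8 and 11 repeats of a 7+3+7+3 unit give 160 and 220 tokens rather than the stated 159 and 225, so the exact terminal and gap handling is an assumption and the token strings are illustrative.

```python
# Sketch: build an SS8 prompt of alternating 7-residue helix and strand blocks
# separated by 3 mask tokens, repeated for the desired number of subunits.
def tim_barrel_ss8_prompt(n_subunits: int, helix_len: int = 7, strand_len: int = 7, gap: int = 3):
    unit = ["H"] * helix_len + ["<mask>"] * gap + ["E"] * strand_len + ["<mask>"] * gap
    return unit * n_subunits

prompt_reference_like = tim_barrel_ss8_prompt(8)   # reference-style barrel prompt
prompt_11_fold = tim_barrel_ss8_prompt(11)         # expanded 11-fold symmetric prompt
```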









