Biological Alignment
==============================

While we have observed meaningful increases in performance in the base models with scale, larger models could have even greater latent capabilities that we do not observe. The base ESM3 models can be prompted to perform difficult tasks such as atomic coordination and composition of prompts, despite the fact that the models have not been explicitly optimized for these objectives. Likewise, the properties on which we evaluate generative outputs, such as high $\mathrm{pTM}$, low cRMSD, and adherence to multimodal prompting, are only seen by the model indirectly during pre-training. Aligning the model directly to these tasks with finetuning could elicit even greater capability differences with larger models.

We study how the base models can be aligned (40) to generate proteins that satisfy challenging prompts. To do this, for each model we construct a dataset of partial structure prompts, generate multiple protein sequences for each prompt, and then fold and score each of the sequences using ESM3 for consistency with the prompt (cRMSD) and foldability (pTM). High quality samples are paired with low quality samples for the same prompt to construct a preference dataset (Appendix A.4). ESM3 is then tuned to optimize a preference tuning loss, which incentivizes the model to put higher likelihood on the high quality samples compared to low quality samples (Appendix A.4) (41, 42).
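
The pipeline can be sketched in a few lines. The sketch below shows one way to pair scored generations into preference pairs and a DPO-style preference loss, which is one common form of the preference tuning objective cited here; the exact construction and loss are given in Appendix A.4, and all names, thresholds, and the pairing strategy below are illustrative rather than taken from the ESM3 codebase.

```python
import torch.nn.functional as F

def build_preference_pairs(samples, crmsd_max=1.5, ptm_min=0.8):
    """Pair high- and low-quality generations for one prompt.

    `samples` is a list of (sequence, crmsd, ptm) tuples scored with
    ESM3. Thresholds and pairing strategy are illustrative.
    """
    good = [s for s in samples if s[1] < crmsd_max and s[2] > ptm_min]
    bad = [s for s in samples if s[1] >= crmsd_max or s[2] <= ptm_min]
    # Pair each high-quality sample with a low-quality one.
    return [(w[0], l[0]) for w, l in zip(good, bad)]

def preference_loss(policy_logp_w, policy_logp_l,
                    ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style preference loss: increase the tuned model's relative
    likelihood of the high-quality ("winning") sample over the
    low-quality ("losing") one, measured against a frozen reference
    copy of the base model. Inputs are tensors of summed per-token
    log-likelihoods; beta is an illustrative temperature.
    """
    win = policy_logp_w - ref_logp_w
    lose = policy_logp_l - ref_logp_l
    return -F.logsigmoid(beta * (win - lose)).mean()
```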

After aligning the ESM3 1.4B, 7B, and 98B base models, we evaluate both their absolute performance and the shift in the distribution of generations. To measure consistency of a generation with a prompt, the generated sequence is folded, and success is measured using structural metrics (backbone cRMSD $<1.5 \AA$) and foldability (pTM $>0.8$). To ensure that the model used for evaluation is orthogonal to the one used for creating the preference dataset, we conduct these evaluations using ESMFold.
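
As a concrete reading of this criterion, the minimal helper below (names illustrative) computes the fraction of generations for a prompt that clear both thresholds, given cRMSD and pTM values obtained by folding each generated sequence with ESMFold:

```python
def success_rate(scores, crmsd_max=1.5, ptm_min=0.8):
    """Fraction of generations meeting both criteria.

    `scores` holds one (backbone_cRMSD_angstroms, pTM) pair per
    generated sequence, computed by folding with ESMFold rather than
    ESM3 so that evaluation stays orthogonal to the model used to
    build the preference dataset.
    """
    hits = sum(1 for crmsd, ptm in scores
               if crmsd < crmsd_max and ptm > ptm_min)
    return hits / len(scores) if scores else 0.0
```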

We examine the ability of the model to generate high-quality scaffolds using challenging tertiary motif scaffolding prompts. We prompt ESM3 with the amino acid identities and atomic coordinates of residues derived from a dataset of 46 ligand binding motifs in a set of temporally held-out proteins (Appendix A.4.5). For each motif task, we create 1024 prompts by permuting the order of the residues, varying their position in the sequence, and varying the length of the sequence. A single protein is generated per prompt. We evaluate success using the percentage of tasks solved (backbone cRMSD $<1.5 \AA$, pTM $>0.8$) after 128 generations (Appendix A.4.5).
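
A sketch of the prompt construction for one motif task is shown below, assuming the motif is given as a list of (amino acid, atomic coordinates) residues. The scaffold-length bounds are illustrative placeholders, not values from Appendix A.4.5.

```python
import random

def make_motif_prompts(motif_residues, n_prompts=1024,
                       min_len=50, max_len=300, seed=0):
    """Build prompts for one motif scaffolding task.

    `motif_residues` is a list of (amino_acid, atom_coords) entries;
    min_len/max_len are illustrative scaffold-length bounds.
    """
    rng = random.Random(seed)
    lo = max(min_len, len(motif_residues))
    prompts = []
    for _ in range(n_prompts):
        residues = list(motif_residues)
        rng.shuffle(residues)                          # permute residue order
        total_len = rng.randint(lo, max(lo, max_len))  # vary sequence length
        # Vary the motif's placement: pick distinct sequence indices.
        positions = sorted(rng.sample(range(total_len), len(residues)))
        prompts.append({
            "length": total_len,
            # (sequence index, (amino acid, atomic coordinates))
            "motif": list(zip(positions, residues)),
        })
    return prompts
```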
