esm.doi.bio/esm3/esm3.intro.full10
==============================
In parallel experiments conducted over geological time, nature creates random mutations and applies selection, filtering proteins by their myriad sequences, structures, and functions.
As a result, the patterns in the proteins we observe reflect the action of the deep hidden variables of the biology that have shaped their evolution across time.
Gene sequencing surveys of Earth's natural diversity are cataloging the sequences (1-3) and structures (4, 5) of proteins, containing billions of sequences and hundreds of millions of structures that illuminate patterns of variation across life.
A consensus is building that underlying these sequences is a fundamental language of protein biology that can be understood using large language models (6-10).
A number of language models of protein sequences have now been developed and evaluated (9, 11-14).
It has been found that the representations that emerge within language models reflect the biological structure and function of proteins (6, 15, 16), and are learned without any supervision on those properties, improving with scale (5, 17, 18).
In artificial intelligence, scaling laws have been found that predict the growth in capabilities with increasing scale, describing a frontier in compute, parameters and data (19-21).
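As an illustrative sketch, such scaling laws are typically written as a power law relating loss to parameters and data; the functional form below is the commonly used one, but the symbols E, A, B, α, β are generic placeholder constants fit per domain, not values taken from the cited works:

```latex
% Illustrative power-law scaling form (generic constants, not from the
% cited works): loss L falls predictably as parameter count N and
% training tokens D grow, down to an irreducible term E.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```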
We present ESM3, a frontier multimodal generative model that reasons over the sequences, structures, and functions of proteins.
ESM3 is trained as a generative masked language model over discrete tokens for each modality.
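The masked language modeling objective can be sketched as follows. This is toy code, not ESM3's implementation: the amino-acid sequence, masking rate, and mask symbol are illustrative assumptions; during training, a model would be supervised with cross-entropy to recover the original tokens at the masked positions only.

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, rate=0.3, rng=None):
    """BERT-style corruption: replace a random fraction of tokens with a
    mask symbol. The model is trained to predict the original token at
    each masked position; unmasked positions are ignored by the loss."""
    rng = rng or random.Random()
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < rate:
            corrupted.append(MASK)
            targets.append(tok)   # supervised position
        else:
            corrupted.append(tok)
            targets.append(None)  # ignored by the loss
    return corrupted, targets

seq = list("MKTAYIAKQR")  # a toy amino-acid sequence
corrupted, targets = mask_tokens(seq, rate=0.3, rng=random.Random(0))
```

Because every modality (sequence, structure, function) is reduced to discrete tokens, the same objective applies uniformly across them.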
Structural reasoning is achieved by encoding three-dimensional atomic structure as discrete tokens rather than with the complex architecture and diffusion in three-dimensional space employed in recent predictive (22) and generative models (14, 23-25) of proteins.
All-to-all modeling of discrete tokens is scalable, and allows ESM3 to be prompted with any combination of its modalities, enabling controllable generation of new proteins that respect combinations of prompts.
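A minimal sketch of the idea of discretizing continuous structure: map each local 3-D geometric feature vector to the index of its nearest entry in a codebook, yielding one discrete token per position. The codebook and feature vectors here are toy stand-ins; ESM3's actual structure tokenizer is a learned model and is not reproduced here.

```python
import math

# Toy codebook of reference vectors; a real tokenizer learns these.
codebook = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (0.0, 0.0, 1.0),
]

def quantize(vec):
    """Return the index of the nearest codebook vector (Euclidean)."""
    return min(range(len(codebook)),
               key=lambda i: math.dist(vec, codebook[i]))

# One feature vector per residue position becomes one discrete token.
features = [(0.9, 0.1, 0.0), (0.1, 0.1, 0.9), (0.0, 0.0, 0.1)]
tokens = [quantize(f) for f in features]
```

Once structure is a token stream like sequence and function, a single transformer can attend across all modalities at once, which is what makes the all-to-all modeling scalable.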
ESM3 at its largest scale was trained with $1.07 \times 10^{24}$ FLOPs on 2.78 billion proteins and 771 billion unique tokens, and has 98 billion parameters.
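As a hedged back-of-envelope check, the reported budget can be related to tokens processed using the standard dense-transformer approximation C ≈ 6·N·D (FLOPs ≈ 6 × parameters × training tokens). The constant 6 and the inference that training made multiple passes over the unique tokens are assumptions of this sketch, not statements from the text.

```python
# Reported figures from the text:
C = 1.07e24        # training FLOPs
N = 98e9           # parameters
unique = 771e9     # unique training tokens

# Assumption: C ~ 6 * N * D for a dense transformer, so the implied
# number of tokens processed during training is D ~ C / (6 * N).
tokens_processed = C / (6 * N)
epochs = tokens_processed / unique   # implied passes over unique data
print(f"~{tokens_processed:.2e} tokens processed, ~{epochs:.1f} passes")
```

Under this approximation the model would have seen roughly 1.8 trillion tokens, i.e. a little over two passes through the 771 billion unique tokens.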
Scaling ESM3 to this 98 billion parameter size results in improvements in the representation of sequence, structure, and function, as well as on generative evaluations.
We find that ESM3 is highly responsive to prompts, and finds creative solutions to complex combinations of prompts, including solutions for which we can find no matching structure in nature.
We find that models at all scales can be aligned to better follow prompts.
Larger models are far more responsive to alignment, and show greater capability to solve the hardest prompts after alignment.
We report the generation of a new green fluorescent protein (GFP) with ESM3.
GFP's structure is an eleven-stranded beta barrel with a helix running through its center, which forms the light-emitting chromophore out of the protein's own atoms.
This autocatalytic formation of a fluorophore, without help from any external substrate, is unique to the GFP family, suggesting that fluorescence is difficult even for nature to achieve.
Our new protein, which we have named esmGFP, has 36% sequence identity to Aequorea victoria GFP, and 58% sequence identity to the most similar known fluorescent protein.
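Sequence identity here means the fraction of aligned positions at which two sequences share the same residue. A minimal sketch of the computation, assuming the sequences have already been aligned (equal length, gaps written as '-'); real pipelines align first with a tool such as BLAST or MMseqs2, and the toy sequences below are hypothetical, not esmGFP:

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of aligned, non-gap positions with the same residue.
    Assumes `a` and `b` are pre-aligned to equal length with '-' gaps."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    aligned = sum(x != "-" and y != "-" for x, y in zip(a, b))
    return 100.0 * matches / aligned

# Toy example: 7 matches over 9 aligned (non-gap) positions.
identity = percent_identity("MKTAYI-KQR", "MKSAYIAKHR")
```

By this measure, esmGFP shares only about a third of its residues with Aequorea victoria GFP.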
Despite intense focus on GFP as a target for protein engineering over several decades, as far as we are aware, proteins this distant have only been found through the discovery of new GFPs in nature.
Similar amounts of diversification among natural GFPs have occurred over predictable timescales.
Understood in these terms, the generation of a new fluorescent protein at this distance from existing proteins appears to be equivalent to simulating over 500 million years of evolution.