esm.doi.bio/esm3/esm3.intro.full10
==============================
In parallel experiments conducted over geological time, nature creates random mutations and applies selection, filtering proteins by their myriad sequences, structures, and functions.
As a result, the patterns in the proteins we observe reflect the action of the deep hidden variables of the biology that have shaped their evolution across time.
Gene sequencing surveys of Earth's natural diversity are cataloging the sequences (1-3) and structures (4, 5) of proteins, containing billions of sequences and hundreds of millions of structures that illuminate patterns of variation across life.
A consensus is building that underlying these sequences is a fundamental language of protein biology that can be understood using large language models (6-10).
A number of language models of protein sequences have now been developed and evaluated (9, 11-14).
It has been found that the representations that emerge within language models reflect the biological structure and function of proteins (6, 15, 16), and are learned without any supervision on those properties, improving with scale (5, 17, 18).
In artificial intelligence, scaling laws have been found that predict the growth in capabilities with increasing scale, describing a frontier in compute, parameters and data (19-21).
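As an illustrative sketch, such scaling laws are typically written as a power law relating loss to parameters and data; the functional form below is the commonly used one, but the symbols E, A, B, α, β are generic placeholder constants fit per domain, not values taken from the cited works:

```latex
% Illustrative power-law scaling form (generic constants, not from the
% cited works): loss L falls predictably as parameter count N and
% training tokens D grow, down to an irreducible term E.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```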
We present ESM3, a frontier multimodal generative model that reasons over the sequences, structures, and functions of proteins.
ESM3 is trained as a generative masked language model over discrete tokens for each modality.
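The masked language modeling objective can be sketched as follows. This is toy code, not ESM3's implementation: the amino-acid sequence, masking rate, and mask symbol are illustrative assumptions; during training, a model would be supervised with cross-entropy to recover the original tokens at the masked positions only.

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, rate=0.3, rng=None):
    """BERT-style corruption: replace a random fraction of tokens with a
    mask symbol. The model is trained to predict the original token at
    each masked position; unmasked positions are ignored by the loss."""
    rng = rng or random.Random()
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < rate:
            corrupted.append(MASK)
            targets.append(tok)   # supervised position
        else:
            corrupted.append(tok)
            targets.append(None)  # ignored by the loss
    return corrupted, targets

seq = list("MKTAYIAKQR")  # a toy amino-acid sequence
corrupted, targets = mask_tokens(seq, rate=0.3, rng=random.Random(0))
```

Because every modality (sequence, structure, function) is reduced to discrete tokens, the same objective applies uniformly across them.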
Structural reasoning is achieved by encoding three-dimensional atomic structure as discrete tokens rather than with the complex architecture and diffusion in three-dimensional space employed in recent predictive (22) and generative models (14, 23-25) of proteins.
All-to-all modeling of discrete tokens is scalable, and allows ESM3 to be prompted with any combination of its modalities, enabling controllable generation of new proteins that respect combinations of prompts.
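A minimal sketch of the idea of discretizing continuous structure: map each local 3-D geometric feature vector to the index of its nearest entry in a codebook, yielding one discrete token per position. The codebook and feature vectors here are toy stand-ins; ESM3's actual structure tokenizer is a learned model and is not reproduced here.

```python
import math

# Toy codebook of reference vectors; a real tokenizer learns these.
codebook = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (0.0, 0.0, 1.0),
]

def quantize(vec):
    """Return the index of the nearest codebook vector (Euclidean)."""
    return min(range(len(codebook)),
               key=lambda i: math.dist(vec, codebook[i]))

# One feature vector per residue position becomes one discrete token.
features = [(0.9, 0.1, 0.0), (0.1, 0.1, 0.9), (0.0, 0.0, 0.1)]
tokens = [quantize(f) for f in features]
```

Once structure is a token stream like sequence and function, a single transformer can attend across all modalities at once, which is what makes the all-to-all modeling scalable.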
ESM3 at its largest scale was trained with $1.07 \times 10^{24}$ FLOPs on 2.78 billion proteins and 771 billion unique tokens, and has 98 billion parameters.
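As a hedged back-of-envelope check, the reported budget can be related to tokens processed using the standard dense-transformer approximation C ≈ 6·N·D (FLOPs ≈ 6 × parameters × training tokens). The constant 6 and the inference that training made multiple passes over the unique tokens are assumptions of this sketch, not statements from the text.

```python
# Reported figures from the text:
C = 1.07e24        # training FLOPs
N = 98e9           # parameters
unique = 771e9     # unique training tokens

# Assumption: C ~ 6 * N * D for a dense transformer, so the implied
# number of tokens processed during training is D ~ C / (6 * N).
tokens_processed = C / (6 * N)
epochs = tokens_processed / unique   # implied passes over unique data
print(f"~{tokens_processed:.2e} tokens processed, ~{epochs:.1f} passes")
```

Under this approximation the model would have seen roughly 1.8 trillion tokens, i.e. a little over two passes through the 771 billion unique tokens.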
Scaling ESM3 to this 98 billion parameter size results in improvements in the representation of sequence, structure, and function, as well as on generative evaluations.
We find that ESM3 is highly responsive to prompts, and finds creative solutions to complex combinations of prompts, including solutions for which we can find no matching structure in nature.
We find that models at all scales can be aligned to better follow prompts.
Larger models are far more responsive to alignment, and show greater capability to solve the hardest prompts after alignment.
We report the generation of a new green fluorescent protein (GFP) with ESM3.
GFP's structure is an eleven-stranded beta barrel with a helix running through its center, which forms the light-emitting chromophore out of the protein's own atoms.
This autocatalytic formation of a fluorophore, without help from any external substrate, is unique to the GFP family, suggesting that fluorescence is difficult even for nature to achieve.
Our new protein, which we have named esmGFP, has 36% sequence identity to Aequorea victoria GFP, and 58% sequence identity to the most similar known fluorescent protein.
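Sequence identity here means the fraction of aligned positions at which two sequences share the same residue. A minimal sketch of the computation, assuming the sequences have already been aligned (equal length, gaps written as '-'); real pipelines align first with a tool such as BLAST or MMseqs2, and the toy sequences below are hypothetical, not esmGFP:

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of aligned, non-gap positions with the same residue.
    Assumes `a` and `b` are pre-aligned to equal length with '-' gaps."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    aligned = sum(x != "-" and y != "-" for x, y in zip(a, b))
    return 100.0 * matches / aligned

# Toy example: 7 matches over 9 aligned (non-gap) positions.
identity = percent_identity("MKTAYI-KQR", "MKSAYIAKHR")
```

By this measure, esmGFP shares only about a third of its residues with Aequorea victoria GFP.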
Despite intense focus on GFP as a target for protein engineering over several decades, as far as we are aware, proteins this distant have only been found through the discovery of new GFPs in nature.
Similar amounts of diversification among natural GFPs have occurred over predictable timescales.
Understood in these terms, the generation of a new fluorescent protein at this distance from existing proteins appears to be equivalent to simulating over 500 million years of evolution.