doi.bio/esm3/esm3.discussion.full12

==============================

We have found that language models can reach a design space of proteins that is distant from the space explored by natural evolution, and generate functional proteins that would take evolution hundreds of millions of years to discover.

- Language models can generate functional proteins that would take evolution hundreds of millions of years to discover.

Protein language models do not explicitly work within the physical constraints of evolution, but instead can implicitly construct a model of the multitude of potential paths evolution could have followed.

- Protein language models can implicitly construct a model of the multitude of potential paths evolution could have followed.

Proteins can be seen as existing within an organized space where each protein is neighbored by every other that is one mutational event away (54).

- Proteins exist in an organized space where each protein is neighbored by every other that is one mutational event away.
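The neighborhood structure described above can be made concrete with a small sketch. It assumes, for illustration only, that a "mutational event" is a single amino-acid substitution over the 20 standard residues (insertions and deletions are ignored); the function name `single_substitution_neighbors` is hypothetical and not from the source.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids (assumed alphabet)


def single_substitution_neighbors(seq):
    """Yield every sequence that differs from `seq` by exactly one substitution."""
    for i, original in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != original:
                yield seq[:i] + aa + seq[i + 1:]


# A toy 3-residue sequence has 19 alternatives at each of its 3 positions.
neighbors = list(single_substitution_neighbors("MKV"))
print(len(neighbors))  # 57
```

Under this simplification a protein of length L has 19 × L immediate neighbors, so the space grows very quickly with each additional step away from a starting sequence.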

The structure of evolution appears as a network within this space, connecting all proteins by the paths that evolution can take between them.

- Evolution can take different paths between proteins in this network.

The paths that evolution can follow are the ones by which each protein transforms into the next without the collective loss of function of the system it is a part of.

- Evolution can only follow paths along which each protein changes into the next without the system it belongs to losing function.

It is in this space that a language model sees proteins.

- A language model sees proteins within this space.

It sees the data of proteins as filling this space, densely in some regions, and sparsely in others, revealing the parts that are accessible to evolution.

- This space reveals the parts accessible to evolution.

Since the next token is generated by evolution, it follows that to solve the training task of predicting the next token, a language model must predict how evolution moves through the space of possible proteins.

- Solving the training task of predicting the next token involves understanding the process of evolution in generating proteins.

To do so it will need to learn what determines whether a path is feasible for evolution.

- To predict how evolution moves, the model must learn what determines whether a path is feasible for evolution.

Simulations are computational representations of reality.

- A simulation is a computational representation of reality.

In that sense a language model which can predict possible outcomes of evolution can be said to be a simulator of it.

- This ability to predict outcomes makes the language model a simulator of evolution.

ESM3 is an emergent simulator that has been learned from solving a token prediction task on data generated by evolution.

- ESM3 learned to simulate evolution by solving a token prediction task on data generated by evolution.

It has been theorized that neural networks discover the underlying structure of the data they are trained to predict (55, 56).

- Neural networks are theorized to discover the underlying structure of the data they are trained to predict.

In this way, solving the token prediction task would require the model to learn the deep structure that determines which steps evolution can take, i.e. the fundamental biology of proteins.

- Solving the token prediction task would require the model to learn the fundamental biology of proteins.

In ESM3's generation of a new fluorescent protein, it is the first chain of thought to B8 that is the most intriguing.

- The first chain of thought to B8 is the most intriguing part of ESM3's generation of a new fluorescent protein.

At 96 mutations to B8's closest neighbor there are $\binom{229}{96} \times 19^{96}$ possible proteins, an astronomical number out of which only a vanishingly small fraction can have function, since fluorescence falls off sharply even after just a few random mutations.

- Fluorescence falls off sharply even after just a few random mutations.
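The size of $\binom{229}{96} \times 19^{96}$ can be checked with exact integer arithmetic. The sketch below reads the formula as choosing which 96 of the 229 positions differ and then one of 19 alternative residues at each chosen position; that reading, and the helper name `count_variants`, are assumptions made for illustration.

```python
import math


def count_variants(length, n_mutations, alphabet_size=20):
    """Count sequences differing from a reference at exactly `n_mutations` positions."""
    # Ways to choose which positions are mutated.
    position_choices = math.comb(length, n_mutations)
    # Each mutated position can take any residue except the original one.
    residue_choices = (alphabet_size - 1) ** n_mutations
    return position_choices * residue_choices


total = count_variants(229, 96)
# Python integers are arbitrary precision, so the digit count is exact.
print(f"roughly 10^{len(str(total)) - 1} candidate sequences")
```

The same function gives 57 for a 3-residue sequence with one mutation, which matches counting 19 substitutions at each of 3 positions by hand.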

The existence of C10 and other bright designs in the neighborhood of B8 confirms that in the first chain of thought to B8, ESM3 found a new part of the space of proteins that, although unexplored by nature, is dense with fluorescent proteins.










