esm3.discussion

We have found that language models can reach a design space of proteins that is distant from the space explored by natural evolution, and generate functional proteins that would take evolution hundreds of millions of years to discover.

Protein language models do not explicitly work within the physical constraints of evolution, but instead can implicitly construct a model of the multitude of potential paths evolution could have followed. Proteins can be seen as existing within an organized space where each protein is neighbored by every other that is one mutational event away (54). The structure of evolution appears as a network within this space, connecting all proteins by the paths that evolution can take between them.

The paths that evolution can follow are the ones by which each protein transforms into the next without the collective loss of function of the system it is a part of. It is in this space that a language model sees proteins. It sees the data of proteins as filling this space, densely in some regions, and sparsely in others, revealing the parts that are accessible to evolution.

Since the next token is generated by evolution, it follows that to solve the training task of predicting the next token, a language model must predict how evolution moves through the space of possible proteins. To do so it will need to learn what determines whether a path is feasible for evolution.

Simulations are computational representations of reality. In that sense a language model which can predict possible outcomes of evolution can be said to be a simulator of it. ESM3 is an emergent simulator that has been learned from solving a token prediction task on data generated by evolution. It has been theorized that neural networks discover the underlying structure of the data they are trained to predict (55, 56).

In this way, solving the token prediction task would require the model to learn the deep structure that determines which steps evolution can take, i.e. the fundamental biology of proteins.

In ESM3's generation of a new fluorescent protein, it is the first chain of thought to B8 that is the most intriguing. At 96 mutations to B8's closest neighbor there are $\binom{229}{96} \times 19^{96}$ possible proteins, an astronomical number out of which only a vanishingly small fraction can have function, since fluorescence falls off sharply even after just a few random mutations.
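To give a sense of scale, the count above can be evaluated directly. The following is a minimal sketch that uses only the quantities stated in the text (a sequence of 229 positions, 96 substitutions, and 19 alternative amino acids at each substituted position); the function name `count_variants` and the constant names are illustrative and not part of any ESM3 code.

```python
import math

# Quantities taken from the expression in the text; names are illustrative.
SEQ_LEN = 229   # positions available for mutation
N_MUT = 96      # number of substituted positions
N_ALT = 19      # alternative amino acids per substituted position


def count_variants(seq_len: int, n_mut: int, n_alt: int = N_ALT) -> int:
    """Count sequences that differ from a reference at exactly n_mut positions.

    Choose which positions change (binomial coefficient), then choose one of
    n_alt replacement residues independently at each chosen position.
    """
    return math.comb(seq_len, n_mut) * n_alt ** n_mut


total = count_variants(SEQ_LEN, N_MUT)

# Python integers are arbitrary precision, so the count is exact;
# math.log10 accepts large integers and reports the order of magnitude.
print(f"roughly 10^{math.log10(total):.1f} candidate sequences")
```

Because the count is computed with exact integer arithmetic, the same function can be reused for other mutation distances to compare how quickly the space grows with each additional substitution.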

The existence of C10 and other bright designs in the neighborhood of B8 confirms that in the first chain of thought to B8, ESM3 found a new part of the space of proteins that, although unexplored by nature, is dense with fluorescent proteins.