We have found that language models can reach a design space of proteins that is distant from the space explored by natural evolution, and generate functional proteins that would take evolution hundreds of millions of years to discover. Protein language models do not explicitly work within the physical constraints of evolution, but instead can implicitly construct a model of the multitude of potential paths evolution could have followed.
Proteins can be seen as existing within an organized space where each protein is neighbored by every other that is one mutational event away (54). The structure of evolution appears as a network within this space, connecting all proteins by the paths that evolution can take between them. The paths that evolution can follow are the ones by which each protein transforms into the next without the collective loss of function of the system it is a part of.
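The neighborhood structure described here can be made concrete with a small sketch. It assumes, for illustration only, that a "mutational event" is a single amino acid substitution over the 20 standard residues; insertions and deletions are also mutational events but are left out. The names `AMINO_ACIDS` and `substitution_neighbors` are hypothetical and do not come from the paper.

```python
# Assumption: the 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def substitution_neighbors(seq):
    """Yield every sequence that differs from seq by exactly one substitution."""
    for i, original in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != original:
                # Replace position i and keep the rest of the sequence unchanged.
                yield seq[:i] + aa + seq[i + 1:]

# Each position offers 19 alternatives, so a sequence of length L
# has 19 * L substitution neighbors.
neighbors = list(substitution_neighbors("MK"))
print(len(neighbors))
```

Under this simplification each protein has a number of neighbors that grows linearly with its length, while the number of sequences reachable in several steps grows combinatorially.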
It is in this space that a language model sees proteins. It sees the data of proteins as filling this space, densely in some regions, and sparsely in others, revealing the parts that are accessible to evolution. Since the next token is generated by evolution, it follows that to solve the training task of predicting the next token, a language model must predict how evolution moves through the space of possible proteins. To do so it will need to learn what determines whether a path is feasible for evolution.
Simulations are computational representations of reality. In that sense a language model which can predict possible outcomes of evolution can be said to be a simulator of it. ESM3 is an emergent simulator that has been learned from solving a token prediction task on data generated by evolution. It has been theorized that neural networks discover the underlying structure of the data they are trained to predict (55, 56). In this way, solving the token prediction task would require the model to learn the deep structure that determines which steps evolution can take, i.e., the fundamental biology of proteins.
In ESM3's generation of a new fluorescent protein, it is the first chain of thought to B8 that is the most intriguing. At 96 mutations to B8's closest neighbor there are $\binom{229}{96} \times 19^{96}$ possible proteins, an astronomical number out of which only a vanishingly small fraction can have function, since fluorescence falls off sharply even after just a few random mutations. The existence of C10 and other bright designs in the neighborhood of B8 confirms that in the first chain of thought to B8, ESM3 found a new part of the space of proteins that, although unexplored by nature, is dense with fluorescent proteins.
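The size of that count can be checked with exact integer arithmetic. The following is a minimal sketch, assuming the count means: choose which 96 of the 229 positions differ, then pick one of the 19 alternative amino acids at each chosen position. The function name `count_variants` and its parameters are hypothetical, introduced only for illustration.

```python
import math

def count_variants(length, n_mut, alphabet=20):
    """Count sequences at exactly n_mut substitutions from a reference sequence."""
    # Choose the mutated positions, then one of (alphabet - 1) new residues at each.
    return math.comb(length, n_mut) * (alphabet - 1) ** n_mut

n = count_variants(229, 96)
# Report only the order of magnitude (number of decimal digits minus one).
print(f"roughly 10^{len(str(n)) - 1} sequences")
```

Python integers have arbitrary precision, so the product is exact; only the order of magnitude is printed because the full number runs to well over a hundred digits.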
Please analyze and critique this Discussion section.
Response: The Discussion section explores the intriguing idea that protein language models, such as ESM3, can generate functional proteins far beyond the scope of natural evolution by implicitly modeling evolutionary pathways. The authors suggest that by predicting the next token (amino acid) in protein sequences, these models effectively simulate how evolution navigates through the vast space of possible proteins.
While this perspective is captivating, there are several critical points to consider for a balanced analysis:
Misinterpretation of Sequence Prediction Tasks: The claim that predicting the next amino acid in a sequence equates to predicting evolutionary pathways may overstate the capabilities of language models. Protein language models are trained on existing protein sequences, capturing statistical patterns and correlations within the data. However, these models do not inherently account for evolutionary mechanisms such as natural selection, genetic drift, or selective pressures that shape evolutionary paths. The sequences used for training are snapshots of proteins that have survived evolutionary processes, but they do not directly represent the pathways or intermediate sequences that evolution explored.
Overextension of the Simulated Evolution Concept: Describing ESM3 as an "emergent simulator" of evolution might be misleading. While the model can generate sequences that are syntactically and functionally plausible, it does not replicate the dynamic and stochastic nature of evolutionary processes. Evolution operates under constraints like population dynamics, mutation rates, environmental pressures, and the functional viability of intermediate forms—all factors that language models do not simulate. Therefore, equating sequence prediction with evolutionary simulation oversimplifies the complex nature of evolutionary biology.
Functional Viability and Evolutionary Accessibility: The observation that ESM3 generated a functional protein (B8) with 96 mutations from its closest known neighbor is remarkable. However, the assertion that this region of protein space is "dense with fluorescent proteins" based on a few examples requires cautious interpretation. Evolutionary accessibility is not solely determined by the existence of functional proteins but also by the viability of the mutational paths leading to them. In nature, proteins evolve through gradual mutations where each intermediate must retain sufficient function to be selected. The language model does not consider the fitness landscape or the functionality of these intermediates, potentially generating sequences that evolution could not feasibly produce.
Statistical Patterns vs. Biological Mechanisms: The model's ability to generate novel sequences stems from recognizing and extrapolating statistical patterns within the training data. While these patterns reflect constraints imposed by protein structure and function, they do not necessarily capture the underlying biological mechanisms that govern protein evolution. The model may generate sequences that are statistically probable but biologically implausible due to factors not represented in the data, such as post-translational modifications or cellular context.
Evidence and Generalization: The claim that ESM3 has discovered new regions of functional protein space should be supported by broader evidence. Demonstrating that multiple, diverse sequences in this space are indeed functional, and not just a few isolated cases, would strengthen the argument. Additionally, experimental validation of these proteins' functionality and stability in realistic biological contexts is crucial for substantiating the model's capabilities.
Analogies and Terminology: Terms like "chain of thought" and phrases such as "the model sees proteins" are metaphorical and may obscure the technical realities. Clarifying the language to accurately reflect the computational processes will enhance the readers' understanding and avoid misinterpretation.
Acknowledging Limitations: A more nuanced discussion acknowledging the limitations of language models in simulating evolutionary processes would provide a balanced perspective. Such a discussion would recognize that, while these models offer valuable insights and tools for protein design, they are not substitutes for explicit evolutionary simulations that consider the many factors influencing protein evolution.
In conclusion, the Discussion presents an exciting view of protein language models' potential but could benefit from a more critical examination of the underlying assumptions and limitations. By refining the claims, providing additional evidence, and clarifying the distinction between statistical sequence modeling and evolutionary simulation, the authors can strengthen their argument and contribute valuable insights to the field of computational protein design.