We have found that language models can reach a design space of proteins that is distant from the space explored by natural evolution, and generate functional proteins that would take evolution hundreds of millions of years to discover. Protein language models do not explicitly work within the physical constraints of evolution, but instead can implicitly construct a model of the multitude of potential paths evolution could have followed.
Proteins can be seen as existing within an organized space where each protein is neighbored by every other that is one mutational event away (54). The structure of evolution appears as a network within this space, connecting all proteins by the paths that evolution can take between them. The paths that evolution can follow are the ones by which each protein transforms into the next without the collective loss of function of the system it is a part of.
It is in this space that a language model sees proteins. It sees the data of proteins as filling this space, densely in some regions, and sparsely in others, revealing the parts that are accessible to evolution. Since the next token is generated by evolution, it follows that to solve the training task of predicting the next token, a language model must predict how evolution moves through the space of possible proteins. To do so it will need to learn what determines whether a path is feasible for evolution.
Simulations are computational representations of reality. In that sense a language model which can predict possible outcomes of evolution can be said to be a simulator of it. ESM3 is an emergent simulator that has been learned from solving a token prediction task on data generated by evolution. It has been theorized that neural networks discover the underlying structure of the data they are trained to predict (55, 56). In this way, solving the token prediction task would require the model to learn the deep structure that determines which steps evolution can take, i.e. the fundamental biology of proteins.
In ESM3's generation of a new fluorescent protein, it is the first chain of thought to B8 that is the most intriguing. At 96 mutations to B8's closest neighbor there are $\binom{229}{96} \times 19^{96}$ possible proteins, an astronomical number out of which only a vanishingly small fraction can have function, since fluorescence falls off sharply even after just a few random mutations. The existence of C10 and other bright designs in the neighborhood of B8 confirms that in the first chain of thought to B8, ESM3 found a new part of the space of proteins that, although unexplored by nature, is dense with fluorescent proteins.
Please analyze and critique this Discussion section. Please compare to AlphaFold2.
Response: Analysis and Critique of the Discussion Section
The provided discussion section highlights the potential of protein language models, specifically ESM3, to explore uncharted regions of protein sequence space and generate functional proteins that are distant from those found in nature. The authors argue that these models can implicitly model evolutionary pathways and simulate the fundamental biology underlying protein function.
Exploration Beyond Natural Evolution
The claim that language models can reach design spaces distant from those explored by natural evolution is both intriguing and significant. By generating functional proteins like B8, which is 96 mutations away from its closest natural neighbor, ESM3 demonstrates an ability to identify sequences that are not readily accessible through conventional evolutionary processes. This suggests that language models can traverse the vast protein sequence space more efficiently than natural evolution, potentially uncovering novel functionalities.
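For scale, the distance claim can be translated into sequence identity with a quick back-of-the-envelope check. This is a minimal Python sketch; the helper name and toy sequences are illustrative, not from the paper, and it assumes an ungapped, equal-length alignment:

```python
# Illustrative arithmetic only: 96 substitutions across a 229-residue
# protein leave roughly 58% sequence identity to the closest neighbor.
def percent_identity(a: str, b: str) -> float:
    """Fraction of aligned positions sharing the same residue."""
    assert len(a) == len(b), "assumes an ungapped, equal-length alignment"
    return sum(x == y for x, y in zip(a, b)) / len(a)

length, mutations = 229, 96
print(f"{(length - mutations) / length:.0%} identity")  # -> 58%
print(percent_identity("MSKGE", "MSKGA"))               # -> 0.8 on a toy fragment
```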
Protein Sequence Space and Evolutionary Paths
The conceptualization of proteins existing within an organized sequence space, where each protein is neighbored by others that are one mutational event away, is a well-accepted framework in molecular biology. The analogy of evolution navigating this space via a network of feasible paths—where each step maintains the overall function—is a valid representation of evolutionary processes.
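As a concrete illustration of this framework, the one-substitution neighborhood of a sequence can be enumerated directly. This is a minimal sketch; insertions and deletions, which are also single mutational events, are omitted here:

```python
# Minimal sketch of the one-substitution neighborhood: every sequence of
# length L has 19 * L substitution neighbors (insertions/deletions omitted).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def substitution_neighbors(seq: str):
    """Yield every sequence exactly one amino-acid substitution away."""
    for i, original in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != original:
                yield seq[:i] + aa + seq[i + 1:]

fragment = "MSKGEELFTG"  # a short illustrative fragment
assert sum(1 for _ in substitution_neighbors(fragment)) == 19 * len(fragment)
```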
However, the assertion that a language model "sees" proteins within this space in the same way evolution does may be an oversimplification. While the model is trained on existing protein sequences (data generated by evolution), it learns statistical patterns in the data rather than explicitly modeling the evolutionary pathways themselves. The model captures correlations and structural motifs that are common in functional proteins, which allows it to generate plausible sequences.
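One way to make "statistical patterns" concrete: a masked language model can score a sequence by pseudo-log-likelihood, with no explicit representation of evolutionary paths at all. The `masked_lm` argument below is a hypothetical stand-in for any protein language model, not the ESM3 API; the uniform model exists only so the sketch runs:

```python
import math

def pseudo_log_likelihood(seq: str, masked_lm) -> float:
    """Sum of log P(residue_i | rest of sequence), masking one position at a time."""
    total = 0.0
    for i, residue in enumerate(seq):
        masked = seq[:i] + "?" + seq[i + 1:]  # mask position i
        log_probs = masked_lm(masked, i)      # residue -> log-probability
        total += log_probs[residue]
    return total

def uniform_lm(masked_seq: str, position: int) -> dict:
    # Trivial stand-in so the sketch runs; a trained protein LM would return
    # context-dependent probabilities reflecting learned sequence patterns.
    return {aa: math.log(1 / 20) for aa in "ACDEFGHIKLMNPQRSTVWY"}

print(pseudo_log_likelihood("MSKGEELFTG", uniform_lm))  # -> 10 * log(1/20)
```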
Language Models as Simulators of Evolution
The discussion posits that since the language model predicts the next token based on data generated by evolution, it effectively simulates evolutionary processes. This is a compelling idea but may not fully capture the distinction between statistical modeling and mechanistic simulation. Language models like ESM3 are excellent at recognizing and reproducing patterns in the data they are trained on, but they do not necessarily simulate the underlying biological mechanisms, such as selection pressures, genetic drift, or epistasis, that drive evolution.
Moreover, the language model's capacity to generate functional proteins in sparsely populated regions of sequence space suggests that it can extrapolate beyond the data it has seen. This ability is impressive but results from the model's generalization of learned patterns rather than an explicit simulation of evolution's exploratory mechanisms.
Learning Fundamental Biology Through Token Prediction
The argument that solving the token prediction task requires the model to learn the deep structure of protein biology is partially valid. By predicting amino acid sequences that result in stable and functional proteins, the model implicitly learns aspects of protein folding, stability, and function. However, this learning is indirect and based on statistical associations rather than an understanding of the physical and chemical principles governing proteins.
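To ground this point, the token prediction objective itself is just cross-entropy over the next residue; anything the model absorbs about folding or stability must come through this single loss. A hedged PyTorch sketch follows; the tiny recurrent model and the next-token form of the loss are placeholders for illustration, not ESM3's actual architecture or training setup:

```python
import torch
import torch.nn as nn

VOCAB = "ACDEFGHIKLMNPQRSTVWY"
stoi = {aa: i for i, aa in enumerate(VOCAB)}

seq = "MSKGEELFTGVVPILVELDGDVNG"  # illustrative fragment
tokens = torch.tensor([stoi[aa] for aa in seq])

embed = nn.Embedding(len(VOCAB), 32)
rnn = nn.LSTM(32, 32, batch_first=True)  # placeholder for a large transformer
head = nn.Linear(32, len(VOCAB))

x = embed(tokens[:-1]).unsqueeze(0)      # inputs: all positions except the last
hidden, _ = rnn(x)
logits = head(hidden).squeeze(0)         # one predicted distribution per position

# The entire training signal: predict the actual next residue.
loss = nn.functional.cross_entropy(logits, tokens[1:])
print(loss.item())
```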
The Case of ESM3's B8 Protein
The generation of the fluorescent protein B8, which is significantly mutated from known proteins yet remains functional, showcases the potential of language models to discover novel proteins. The fact that B8 resides in a region of sequence space dense with other functional proteins like C10 supports the idea that ESM3 can identify new "islands" of functionality that natural evolution has not explored.
However, it's important to recognize that the actual functional landscape of proteins is complex, and the model’s success may not generalize to all protein families or functions. Additionally, while the number of possible proteins at such distances is astronomically large, the language model benefits from the biases and constraints present in its training data, which guide it toward sequences that are more likely to be functional.
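The "astronomically large" figure can be checked directly from the expression quoted in the discussion. This is a one-off sanity check; the count covers substitution-only variants:

```python
from math import comb, log10

# Choose 96 of 229 positions to mutate, then one of 19 alternative residues
# at each chosen position: binom(229, 96) * 19**96 substitution variants.
n_positions, n_mutations, n_alternatives = 229, 96, 19
count = comb(n_positions, n_mutations) * n_alternatives ** n_mutations
print(f"~10^{log10(count):.0f} sequences")  # on the order of 10^189
```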
Comparison to AlphaFold2
AlphaFold2, developed by DeepMind, represents another transformative advance in computational biology, but it approaches the challenge from a different angle than ESM3.
Objective and Functionality:
AlphaFold2: Predicts the three-dimensional structure of proteins from their amino acid sequences with high accuracy.
ESM3: Generates novel protein sequences by predicting the next amino acid in a sequence, utilizing learned patterns from existing protein data.
Methodology:
AlphaFold2: Employs deep learning techniques and leverages evolutionary information from multiple sequence alignments (MSAs) to predict protein folding and structure.
ESM3: Uses language modeling approaches akin to those in natural language processing to model protein sequences and generate new ones (a minimal generation sketch follows this comparison).
Insights into Biology:
AlphaFold2: Provides structural insights that are crucial for understanding protein function, interactions, and mechanisms at a molecular level.
ESM3: Suggests that sequence alone, without explicit structural modeling, can inform the generation of functional proteins, indicating that sequence patterns encode significant functional information.
Role in Evolutionary Understanding:
AlphaFold2: While not explicitly modeling evolution, it relies on evolutionary relationships captured in MSAs to improve structural predictions.
ESM3: Claims to simulate evolutionary possibilities by exploring sequence space beyond natural evolution, though this is facilitated by statistical learning rather than modeling evolutionary dynamics.
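For concreteness, the generation style attributed to ESM3 in this comparison can be sketched as sampling one residue at a time from a model's predicted distribution. The `next_token_probs` function below is a hypothetical stand-in, not the actual ESM3 interface, and the uniform distribution exists only so the sketch runs:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def next_token_probs(prefix: str) -> dict:
    # Stand-in returning a uniform distribution; a trained model would
    # condition on the prefix (and, for ESM3, on structure and function
    # tokens as well) to produce context-dependent probabilities.
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

def sample_sequence(length: int, prefix: str = "M") -> str:
    """Grow a sequence by repeatedly sampling from the model's distribution."""
    seq = prefix
    while len(seq) < length:
        probs = next_token_probs(seq)
        residues, weights = zip(*probs.items())
        seq += random.choices(residues, weights=weights)[0]
    return seq

print(sample_sequence(30))
```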
Critique
While the discussion highlights the impressive capabilities of protein language models, several aspects warrant critical examination:
Statistical learning vs. mechanistic simulation: Calling the model a "simulator" of evolution conflates pattern recognition over evolutionary data with explicit modeling of mechanisms such as selection pressure, genetic drift, and epistasis.
Indirect biological understanding: What the model learns about folding, stability, and function is inferred from statistical associations in sequence data rather than from physical or chemical first principles.
Uncertain generalization: Success with fluorescent proteins such as B8 may not transfer to other protein families or functions, given the complexity of real functional landscapes.
Training-data constraints: Even when exploring regions "unexplored by nature," the model is guided by the biases and constraints embedded in its evolutionary training data.
Conclusion
The discussion presents an optimistic view of protein language models' capabilities to innovate beyond natural evolutionary pathways. While their potential is undeniable, it's important to temper expectations by acknowledging the models' reliance on statistical learning and the absence of explicit evolutionary simulation. The comparison with AlphaFold2 highlights the diverse ways in which deep learning can advance our understanding of proteins, each with its unique approach and contributions to the field of computational biology.