==============================
Language models can help us explore a wider range of proteins than has been discovered through natural evolution. They can generate functional proteins that evolution would take a long time to find. These models are not bound by the same physical constraints as evolution, yet they can still build a model of the many possible paths evolution could have taken.
Proteins can be thought of as occupying a space in which each protein is surrounded by all the others that are a single mutation away from it. Protein evolution can be visualized as a network within this space, connecting all proteins through the paths evolution can take between them. These paths are the ways in which one protein can transform into another without loss of function in the larger system it is part of.
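To make this picture concrete, here is a minimal sketch of such a "protein space" as a graph: nodes are sequences, and edges connect sequences that differ by a single amino acid substitution. The sequences and helper names are hypothetical illustrations, not real proteins or any particular library's API.

```python
# Toy model of protein space: sequences connected by single substitutions.
# Sequences here are hypothetical examples for illustration only.
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def build_mutation_graph(sequences):
    """Adjacency list linking sequences that are one substitution apart."""
    graph = {seq: set() for seq in sequences}
    for a, b in combinations(sequences, 2):
        if len(a) == len(b) and hamming_distance(a, b) == 1:
            graph[a].add(b)
            graph[b].add(a)
    return graph

# Hypothetical short sequences, not real proteins.
seqs = ["MKTAY", "MKTAW", "MKSAW", "MKSAY"]
graph = build_mutation_graph(seqs)
for seq, neighbors in graph.items():
    print(seq, "->", sorted(neighbors))
```

In this toy graph, each edge corresponds to a single evolutionary step, and a path between two sequences corresponds to one of the routes evolution could take between them.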
It is in this space that a language model sees proteins. It sees the data of proteins as filling this space, densely in some regions and sparsely in others, revealing the parts that are accessible to evolution. Since the next token is generated by evolution, to solve the training task of predicting the next token a language model must learn to predict how evolution moves through the space of possible proteins. To do so, it must learn what determines whether a path is feasible for evolution.
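As a rough illustration of what next-token prediction means for protein sequences, the sketch below fits a toy bigram model that predicts a distribution over the next amino acid given the previous one. This is only a stand-in for the idea of the training task under stated assumptions: a real protein language model is a neural network trained on far more data, and the training sequences here are hypothetical.

```python
# Toy next-token model over amino acids: a bigram counter standing in for
# the training objective, not an actual protein language model.
from collections import defaultdict

def train_bigram_model(sequences):
    """Estimate P(next amino acid | current amino acid) from counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    model = {}
    for current, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        model[current] = {aa: c / total for aa, c in nxt_counts.items()}
    return model

def next_token_distribution(model, context):
    """Predict a distribution over the next amino acid from the last one seen."""
    return model.get(context[-1], {})

# Hypothetical training sequences for illustration only.
train_seqs = ["MKTAYIAKQR", "MKTAWIAKQR", "MKSAYIAKNR"]
model = train_bigram_model(train_seqs)
print(next_token_distribution(model, "MKTA"))  # distribution over what may follow 'A'
```

The training task is the same in spirit: given the sequence so far, assign probabilities to what comes next, so that sequences evolution actually produces receive high probability.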