esm3.generating_a_new_fluorescent_protein.full14

==============================

We sought to understand if the base pre-trained ESM3 model has sufficient biological fidelity to generate functional proteins.

The researchers asked whether ESM3, used as a base pre-trained model with no further training, captures enough biology to generate proteins that actually function. This is an important question because, if it does, the model could accelerate protein design and engineering.

We set out to create a functional green fluorescent protein (GFP) with low sequence similarity to existing ones.

We aimed to create a green fluorescent protein (GFP) that fluoresces yet has minimal sequence similarity to existing GFPs. A functional design far from known sequences is a demanding test of whether the model can generate working proteins.

We chose the functionality of fluorescence because it is difficult to achieve, easy to measure, and one of the most beautiful mechanisms in nature.

The choice of fluorescence as a functionality is based on three main reasons. Firstly, it is a challenging mechanism to achieve, which makes it an interesting and worthwhile pursuit for experts in the field. Secondly, it is easy to measure, which means that progress and success can be accurately tracked and evaluated. Finally, fluorescence is considered one of the most beautiful mechanisms in nature, which adds an aesthetic appeal to the research and may inspire further exploration and discovery.

Responsible for the fluorescence of jellyfish and the vivid colors of coral (43), proteins in the GFP family are unique in their ability to form a fluorescent chromophore without cofactors or substrates (27).

The GFP family of proteins is responsible for the fluorescence of jellyfish and the vivid colors of coral. These proteins have a unique ability to form a fluorescent chromophore without the need for cofactors or substrates. This means that they can emit light on their own, without requiring any additional molecules to help them do so. This property makes GFP proteins very useful in a variety of scientific applications, such as imaging and tracking cells and molecules in living organisms.

This property allows the GFP sequence to be inserted into the genomes of other organisms to visibly label molecules, cellular structures, or processes, providing a foundational toolkit that has been broadly applied across the biosciences.

This property refers to the ability of the GFP (Green Fluorescent Protein) sequence to be incorporated into the genetic material of other organisms, allowing for the labeling of molecules, cellular structures, or processes with a visible fluorescent marker. This has proven to be an incredibly useful tool in the field of biosciences, as it enables researchers to track and observe various biological processes and structures in real-time. The GFP sequence has been widely adopted and applied across many different areas of bioscience research, making it a foundational toolkit for many scientists.

The GFP family has been the subject of decades of protein engineering efforts, but still the vast majority of functional variants have come from prospecting the natural world.

The GFP (green fluorescent protein) family has been extensively studied and modified by scientists for many years. Despite these efforts, most functional variants of GFP have been discovered by exploring the natural world. In other words, scientists have mostly found GFP variants with the desired properties in nature, rather than creating them through protein engineering.

Rational design and machine learning-assisted high-throughput screening have yielded GFP sequences with improved properties, such as higher brightness or stability, or differently colored variants, that incorporated small numbers of mutations (typically 5 to 15, out of the total 238 amino acid coding sequence) from the originating sequence.

Rational design and machine learning-assisted high-throughput screening are techniques used to improve the properties of GFP (green fluorescent protein) sequences. These techniques involve making small changes to the amino acid sequence of GFP, typically between 5 and 15 mutations out of the total 238 amino acid coding sequence. The changes are made in a rational and systematic manner, with the goal of improving the brightness or stability of the protein, or creating differently colored variants.

The high-throughput screening process involves testing large numbers of GFP variants to identify those with the desired properties. Machine learning algorithms are used to analyze the data and identify patterns that can be used to predict the properties of new GFP variants.

Overall, these techniques have yielded GFP sequences with improved properties, which have a wide range of applications in biological research and biotechnology.

Studies have shown that only a few random mutations reduce fluorescence to zero (44-46).

In other words, even a small number of random mutations can completely eliminate fluorescence. This finding is significant because it implies that GFP function is highly sensitive to changes in its sequence, and even minor alterations can have a major impact. This information may be useful for experts in fields such as genetics, biochemistry, or biotechnology, who are interested in understanding the mechanisms underlying fluorescence and how it can be manipulated or controlled.

In rare cases, however, scientists leveraging high-throughput experimentation have been able to introduce up to 40-50 mutations, i.e. a 20% difference in total sequence identity (44, 47, 48), while retaining GFP fluorescence.

In certain instances, scientists have been able to utilize high throughput experimentation to introduce a significant number of mutations, up to 40-50, while still maintaining GFP fluorescence. This represents a 20% difference in total sequence identity, as reported in studies 44, 47, and 48.
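As a rough check on the figures quoted above, the fraction of positions changed follows directly from the mutation count and the 238-residue length. The sketch below is illustrative only: the function names are hypothetical, and it assumes substitutions only (no insertions or deletions), so that identity is simply the fraction of unchanged positions.

```python
def identity_after_substitutions(num_mutations: int, length: int = 238) -> float:
    """Fraction of positions left unchanged after `num_mutations` substitutions."""
    if not 0 <= num_mutations <= length:
        raise ValueError("num_mutations must be between 0 and length")
    return 1.0 - num_mutations / length


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent identity of two pre-aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    # Mutation counts mentioned in the text, against a 238-residue sequence.
    for n in (5, 15, 40, 50):
        changed = 100.0 * (1.0 - identity_after_substitutions(n))
        print(f"{n:>2} mutations of 238 -> {changed:.1f}% of positions changed")
```

Under these assumptions, 40 to 50 substitutions change roughly 17% to 21% of positions, which is consistent with the approximately 20% figure in the text.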

Generating a new GFP would require materialization of the complex biochemistry and physics that underlie its fluorescence.

To create a new green fluorescent protein (GFP), one would need to understand and replicate the intricate biochemical and physical processes that enable its fluorescence. This would involve a deep understanding of the molecular structure and interactions of GFP, as well as the mechanisms by which it absorbs and emits light. It would also require expertise in genetic engineering and protein synthesis, as well as advanced laboratory techniques for manipulating and analyzing biological molecules. Overall, generating a new GFP would be a complex and challenging task that would require a multidisciplinary approach and a high level of scientific expertise.

In all GFPs, an autocatalytic process forms the chromophore from three key amino acids in the core of the protein.

In all green fluorescent proteins (GFPs), the chromophore is formed through an autocatalytic process that involves three key amino acids located in the core of the protein. This process is essential for the protein to fluoresce. In wild-type Aequorea victoria GFP these residues are Ser65, Tyr66, and Gly67; many engineered variants carry Thr65 instead. During the autocatalytic process, the backbone amide nitrogen of Gly67 attacks the carbonyl carbon of residue 65 to form a cyclized intermediate. This intermediate is then dehydrated and oxidized at Tyr66 to yield the mature chromophore. The formation of the chromophore is a critical step in the maturation of GFPs and is necessary for the protein to fluoresce.

The unique structure of GFP, a kinked central alpha helix surrounded by an eleven stranded beta barrel

GFP, or green fluorescent protein, has a distinctive structure: a kinked central alpha helix surrounded by an eleven-stranded beta barrel. The chromophore forms from residues on the central helix, and the beta barrel encloses it, providing a stable environment in which it can fluoresce brightly and efficiently. This structure has made GFP a valuable tool in many areas of research, including cell biology, genetics, and biochemistry.

Figure 4. Generating a new fluorescent protein with a chain of thought.

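In the figure, "chain of thought" refers to ESM3's multi-step generation procedure, which is described in panel A and in the main text. For context, a conventional laboratory workflow for developing a new fluorescent protein involves several steps: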

Identify the need for a new fluorescent protein: This could be due to limitations of existing fluorescent proteins, such as brightness, stability, or color range.
Determine the desired properties of the new fluorescent protein: Based on the specific application, decide on the optimal characteristics for the new protein, such as excitation and emission wavelengths, brightness, and stability.
Design the amino acid sequence: Using knowledge of the structure and function of existing fluorescent proteins, design a new amino acid sequence that is likely to exhibit the desired properties.
Synthesize the gene: Once the amino acid sequence has been designed, synthesize the corresponding gene using standard molecular biology techniques.
Express the protein: Introduce the synthesized gene into a host cell, such as E. coli, and allow the cell to produce the new fluorescent protein.
Purify and characterize the protein: Isolate the new fluorescent protein from the host cell and analyze its properties, such as brightness, stability, and spectral characteristics, to determine if it meets the desired specifications.
Optimize the protein: If necessary, make further modifications to the amino acid sequence or expression conditions to improve the performance of the new fluorescent protein.
Validate the protein: Test the new fluorescent protein in the intended application to ensure that it performs as expected and provides the desired results.

By following this workflow, researchers can generate new fluorescent proteins with tailored properties for a wide range of applications in biotechnology and biomedical research.

Figure 4 (A) We prompt ESM3 with the sequence and structure of residues required for forming and catalyzing the chromophore reaction, as well as the structure of part of the central alpha helix from a natural fluorescent protein (left). Through a chain of thought, ESM3 generates design candidates (right).

The prompt given to ESM3 includes the necessary sequence and structure of residues required for forming and catalyzing the chromophore reaction, as well as the structure of part of the central alpha helix from a natural fluorescent protein. ESM3 then uses this information to generate design candidates through a chain of thought. Essentially, ESM3 is being given the necessary information to create new fluorescent proteins that can be used in various applications.

(B) ESM3 found a bright GFP distant from other known GFPs in two experiments. We measured fluorescence in E. coli lysate. Top row, photograph of plates. Bottom row, plate reader fluorescence quantification. Positive controls of known GFPs are marked with purple circles, negative controls with no GFP sequence or no E. coli are marked with red circles. In the first experiment (left) we expressed designs with a range of sequence identities. A notable design with low sequence identity to known fluorescent proteins appears in the well labeled B8 (highlighted in a black circle bottom, white circle top). We continue the chain of thought from the protein in B8 for the second experiment (right). A bright design appears in the well labeled C10 (black circle bottom, white circle top) which we designate esmGFP.

In the first experiment, the researchers expressed designs with a range of sequence identities and found a notable design with low sequence identity to known fluorescent proteins in well B8. In the second experiment, they continued the chain of thought from the protein in B8 and found a bright design in well C10, which they designated esmGFP. Fluorescence was measured in E. coli lysate; the top row shows photographs of the plates and the bottom row shows plate reader fluorescence quantification. Positive controls of known GFPs are marked with purple circles, while negative controls with no GFP sequence or no E. coli are marked with red circles.

(C) esmGFP exhibits fluorescence intensity similar to common GFPs. Normalized fluorescence is shown for a subset of proteins in experiment 2.

The statement "esmGFP exhibits fluorescence intensity similar to common GFPs" means that the fluorescence intensity of esmGFP is comparable to that of other commonly used green fluorescent proteins. This is supported by the data from experiment 2, where the normalized fluorescence of esmGFP is presented alongside that of other GFPs.

(D) Excitation and emission spectra for esmGFP overlaid on the spectra of EGFP.

The excitation and emission spectra for esmGFP are being compared to the spectra of EGFP. This is likely being done to determine the similarities and differences between the two fluorescent proteins. The excitation spectrum shows the wavelengths of light that can excite the protein and cause it to emit light, while the emission spectrum shows the wavelengths of light that are emitted by the protein. By overlaying the spectra, it is possible to see how the excitation and emission properties of esmGFP compare to those of EGFP. This information can be useful for selecting the appropriate excitation and emission filters for imaging experiments, as well as for understanding the behavior of these fluorescent proteins in different experimental conditions.

(E) Two cutout views of the central alpha helix and the inside of the beta barrel of a predicted structure of esmGFP. The 96 mutations esmGFP has relative to its nearest neighbor, tagRFP, are shown in blue.

This panel shows the predicted structure of esmGFP, the green fluorescent protein generated by ESM3. The two cutout views depict the central alpha helix and the interior of the beta barrel, which are key structural elements of GFP. The 96 mutations that differentiate esmGFP from its closest relative, tagRFP, are highlighted in blue. These mutations likely contribute to differences in the protein's stability, folding, and fluorescence properties. Understanding the structure and mutations of esmGFP can aid in the development of improved fluorescent probes for biological imaging and other applications.

(F) Cumulative density of sequence identity between fluorescent proteins across taxa. esmGFP has the level of similarity to all other FPs that is typically found when comparing sequences across orders, but within the same class.

The cumulative density of sequence identity between fluorescent proteins across taxa refers to the level of similarity between different types of fluorescent proteins found in various organisms. The statement suggests that esmGFP, a specific type of fluorescent protein, has a level of similarity to other fluorescent proteins that is comparable to the level of similarity found when comparing sequences across different orders within the same class. This indicates that esmGFP is a unique and distinct type of fluorescent protein.

(G) Evolutionary distance by time in millions of years (MY) and sequence identities for three example anthozoa GFPs and esmGFP.

The evolutionary distance by time in millions of years (MY) and sequence identities for three example anthozoa GFPs and esmGFP refer to the genetic differences and similarities between these fluorescent proteins. The evolutionary distance is a measure of how long ago the proteins diverged from a common ancestor, while the sequence identities indicate how much of the amino acid sequence is shared between the proteins.

In this case, the three examples are green fluorescent proteins from species of Anthozoa, the class of marine animals that includes corals and sea anemones. esmGFP, by contrast, is not a natural protein: it was generated by ESM3.

By comparing the evolutionary distance and sequence identities between these proteins, researchers can gain insights into the evolutionary history of GFPs and how they have evolved to perform different functions in different organisms. This information can also be useful for designing new variants of GFP with specific properties for use in research and biotechnology.

(H) Estimator of evolutionary distance by time (MY) from GFP sequence identity.

Panel H shows an estimator that converts GFP sequence identity into evolutionary distance in time (MY). It underlies the estimate given in the caption's closing sentence.

We estimate esmGFP is over 500 million years of natural evolution removed from the closest known protein.

The GFP fold described above, a kinked central alpha helix inside an eleven-stranded beta barrel with inward-facing coordinating residues, enables the chromophore-forming reaction (49).

The caption's closing estimate, that esmGFP is over 500 million years of natural evolution removed from the closest known protein, expresses how far its sequence has diverged: the sequence identity between esmGFP and its nearest known relative is comparable to that between natural proteins separated by that much evolutionary time. esmGFP itself was generated by the model rather than evolved. The sentence about the fold completes the structural description that the figure interrupted: residues on the inside of the beta barrel point toward the chromophore-forming amino acids and are positioned to coordinate the reaction.

Once formed, the chromophore must not just absorb light but also emit it in order to be fluorescent.

In order for a molecule to be fluorescent, it must first absorb light at a specific wavelength. This absorption causes the molecule to become excited and enter a higher energy state. However, the molecule cannot remain in this excited state indefinitely. It must release the excess energy it has absorbed in order to return to its ground state. When that energy is released as light, we observe fluorescence.

The process of fluorescence occurs when the excited molecule undergoes a transition from its excited state to its ground state, emitting a photon of light in the process. The wavelength of this emitted light is typically longer than the wavelength of the absorbed light, which is why fluorescent molecules often appear to glow in a different color than the light they were exposed to.

In order for a molecule to be fluorescent, it must have a specific structure that allows it to both absorb and emit light. This structure is typically a conjugated system of double bonds, which allows for the delocalization of electrons and the formation of a chromophore. The chromophore is the part of the molecule responsible for absorbing and emitting light, and it must be able to efficiently transfer energy between its excited and ground states in order to fluoresce.

Overall, the process of fluorescence is a complex interplay between the structure of a molecule, its ability to absorb and emit light, and the energy levels of its excited and ground states. Understanding these factors is crucial for designing and utilizing fluorescent molecules in a variety of applications, from biological imaging to materials science.
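The relationship between wavelength and photon energy described above can be made concrete with E = hc/λ. The following is a minimal sketch: the 485 nm value is the excitation wavelength used in this section's plate assay, while the 510 nm emission wavelength is a hypothetical value chosen purely for illustration and is not taken from the text.

```python
PLANCK_J_S = 6.62607015e-34      # Planck constant, J*s
LIGHT_SPEED_M_S = 299_792_458.0  # speed of light, m/s
JOULES_PER_EV = 1.602176634e-19  # joules per electronvolt


def photon_energy_ev(wavelength_nm: float) -> float:
    """Photon energy in electronvolts, from E = h * c / wavelength."""
    if wavelength_nm <= 0:
        raise ValueError("wavelength must be positive")
    wavelength_m = wavelength_nm * 1e-9
    return PLANCK_J_S * LIGHT_SPEED_M_S / wavelength_m / JOULES_PER_EV


def stokes_shift_nm(excitation_nm: float, emission_nm: float) -> float:
    """Difference between emission and excitation wavelengths, in nm."""
    return emission_nm - excitation_nm


if __name__ == "__main__":
    excitation_nm = 485.0  # excitation wavelength used in the assay
    emission_nm = 510.0    # hypothetical emission wavelength (illustration only)
    print(f"absorbed photon: {photon_energy_ev(excitation_nm):.3f} eV")
    print(f"emitted photon:  {photon_energy_ev(emission_nm):.3f} eV")
    print(f"Stokes shift:    {stokes_shift_nm(excitation_nm, emission_nm):.0f} nm")
```

Because the emitted photon has a longer wavelength, it carries less energy than the absorbed one; the difference is lost to relaxation within the excited state.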

Light emission is highly sensitive to the local electronic environment of the chromophore.

Light emission is the process by which an object emits light. The intensity and color of the emitted light depend on the electronic environment of the chromophore, which is a molecule or a part of a molecule that absorbs and emits light. The electronic environment of the chromophore can be influenced by various factors such as temperature, pH, solvent, and the presence of other molecules. Therefore, even small changes in the local electronic environment of the chromophore can have a significant impact on the light emission properties of the object. This makes light emission a highly sensitive tool for studying the electronic properties of materials and biological systems.

For these reasons, obtaining a new functional GFP would require precise configuration of both the active site and the surrounding long range tertiary interactions throughout the beta barrel.

To obtain a new functional green fluorescent protein (GFP), it is necessary to ensure that the active site and the surrounding long range tertiary interactions throughout the beta barrel are precisely configured. This is because the beta barrel structure of GFP plays a crucial role in its fluorescence properties, and any changes to this structure can affect its function. Therefore, careful consideration must be given to both the active site and the surrounding tertiary interactions when designing a new functional GFP.

In an effort to generate new GFP sequences, we directly prompt the base pretrained 7B parameter ESM3 to generate a 229 residue protein conditioned on the positions Thr62, Thr65, Tyr66, Gly67, Arg96, Glu222, which are critical residues for forming and catalyzing the chromophore reaction (Fig. 4A).

To generate new GFP sequences, we directly prompt the base pre-trained ESM3 model, which has 7 billion parameters. The model is asked to generate a protein 229 residues long, conditioned on the critical residues involved in forming and catalyzing the chromophore reaction: Thr62, Thr65, Tyr66, Gly67, Arg96, and Glu222 (Figure 4A). Fixing these residues supplies the catalytic core while leaving the rest of the protein for the model to generate.

We additionally condition on the structure of residues 58 through 71 from the experimental structure in 1QY3, which are known to be structurally important for the energetic favorability of chromophore formation (50).

The prompt is additionally conditioned on the structure of residues 58 through 71, taken from the experimental structure 1QY3. These residues are known to be structurally important for making chromophore formation energetically favorable, so providing their structure steers the model toward backbones in which the chromophore can form.

Specifically, sequence tokens, structure tokens, and atomic coordinates of the backbone are provided at the input and generation begins from a nearly completely masked array of tokens corresponding to 229 residues, except for the token positions used for conditioning.

The input consists of sequence tokens, structure tokens, and atomic coordinates of the backbone, provided only at the positions used for conditioning. Every other position in the 229-residue array starts out masked, so the model begins from an almost entirely blank protein. Generation then fills in the masked positions, guided by the few positions that were provided.
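To make the shape of such a prompt concrete, the sketch below builds a mostly masked 229-position sequence with the six conditioned residues filled in, plus a boolean mask marking the structure-conditioned span. It does not use any ESM3 API. The mask symbol `_`, the helper names, and the assumption that prompt positions follow the same 1-based numbering as the residue labels are all illustrative choices, not details confirmed by the text.

```python
from typing import Dict, List, Tuple

PROTEIN_LENGTH = 229
MASK = "_"  # assumed mask symbol for this sketch

# 1-based positions and one-letter residue codes for the residues named in the text:
# Thr62, Thr65, Tyr66, Gly67, Arg96, Glu222.
CONDITIONED_RESIDUES: Dict[int, str] = {62: "T", 65: "T", 66: "Y", 67: "G", 96: "R", 222: "E"}

# Residues 58 through 71 (inclusive) are conditioned on structure.
STRUCTURE_SPAN: Tuple[int, int] = (58, 71)


def build_sequence_prompt(length: int = PROTEIN_LENGTH,
                          fixed: Dict[int, str] = CONDITIONED_RESIDUES) -> str:
    """Return a sequence prompt that is masked everywhere except the fixed positions."""
    tokens = [MASK] * length
    for position, residue in fixed.items():
        if not 1 <= position <= length:
            raise ValueError(f"position {position} is outside 1..{length}")
        tokens[position - 1] = residue  # convert 1-based position to 0-based index
    return "".join(tokens)


def build_structure_mask(length: int = PROTEIN_LENGTH,
                         span: Tuple[int, int] = STRUCTURE_SPAN) -> List[bool]:
    """Return True at positions whose structure is provided, False where it is masked."""
    start, end = span
    return [start <= position <= end for position in range(1, length + 1)]


if __name__ == "__main__":
    prompt = build_sequence_prompt()
    mask = build_structure_mask()
    print(prompt)
    print(f"{prompt.count(MASK)} of {len(prompt)} sequence positions masked")
    print(f"{sum(mask)} of {len(mask)} positions carry structure conditioning")
```

With these settings only 6 sequence positions and 14 structure positions are specified, which illustrates how little of the protein the prompt pins down.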

We generate designs using a chain-of-thought procedure as follows.

Here, the chain-of-thought procedure means that the model reaches a final design through a series of intermediate generations rather than in a single step. The output of each step is added to the prompt for the next step, and intermediate results are checked against filters before the chain continues. The individual steps are described in turn below.

The model first generates structure tokens, effectively creating a protein backbone.

The model first generates structure tokens, which are discrete tokens that encode the protein's three-dimensional structure. Generating them amounts to proposing a protein backbone, i.e. the path of the main chain through space, before the amino acid sequence has been chosen at the unconditioned positions. Designing the backbone first lets the later steps select a sequence that is compatible with it.

Backbones that have sufficiently good atomic coordination of the active site but differentiated overall structure from the 1QY3 backbone pass through a filter to the next step of the chain.

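Generated backbones are filtered on two criteria. First, the atoms of the active site must remain well positioned, so that the chromophore-forming residues keep the geometry they need. Second, the overall backbone must differ from that of the template structure 1QY3, so that the design is not simply a copy of the known protein. Only backbones that meet both criteria continue to the next step of the chain.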
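A filter of this kind can be sketched as two structural comparisons against the template: one restricted to the active-site residues, one over the whole backbone. The example below is an illustration under stated assumptions, not the procedure from Appendix A.5.1. It swaps in distance-matrix RMSD (dRMSD), an alignment-free measure, in place of whatever metrics the authors used; the thresholds and function names are hypothetical, and it assumes the design and template have one coordinate per residue and equal length. Note that dRMSD cannot distinguish a structure from its mirror image.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def distance_rmsd(coords_a: Sequence[Coord], coords_b: Sequence[Coord]) -> float:
    """RMSD between the pairwise-distance matrices of two equal-length point sets."""
    if len(coords_a) != len(coords_b) or len(coords_a) < 2:
        raise ValueError("need two equal-length sets of at least two points")
    total, pairs = 0.0, 0
    for i in range(len(coords_a)):
        for j in range(i + 1, len(coords_a)):
            d_a = math.dist(coords_a[i], coords_a[j])
            d_b = math.dist(coords_b[i], coords_b[j])
            total += (d_a - d_b) ** 2
            pairs += 1
    return math.sqrt(total / pairs)


def passes_backbone_filter(design: Sequence[Coord],
                           template: Sequence[Coord],
                           active_site_indices: List[int],
                           max_site_drmsd: float = 1.0,     # hypothetical threshold
                           min_global_drmsd: float = 2.0    # hypothetical threshold
                           ) -> bool:
    """Keep designs whose active site matches the template but whose overall fold differs."""
    site_design = [design[i] for i in active_site_indices]
    site_template = [template[i] for i in active_site_indices]
    site_preserved = distance_rmsd(site_design, site_template) <= max_site_drmsd
    globally_different = distance_rmsd(design, template) >= min_global_drmsd
    return site_preserved and globally_different


if __name__ == "__main__":
    # Toy data: a straight 10-residue template and a design bent after residue 4.
    template = [(i * 3.8, 0.0, 0.0) for i in range(10)]
    bent = template[:5] + [(4 * 3.8, (i - 4) * 3.8, 0.0) for i in range(5, 10)]
    site = [2, 3, 4]  # 0-based indices of the toy "active site"
    print("bent design passes:", passes_backbone_filter(bent, template, site))
    print("exact copy passes: ", passes_backbone_filter(template, template, site))
```

In the toy data the bent design keeps its active-site geometry while changing the overall shape, so it passes; an exact copy of the template is rejected for lack of novelty.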

We add the generated structure to the original prompt to generate a sequence conditioned on the new prompt.

In this context, the "prompt" is the set of inputs the model is conditioned on. The backbone generated in the previous step is added to the original prompt (the critical residues and the structure of residues 58 through 71), and the model then generates an amino acid sequence conditioned on this expanded prompt. The output of one step thus becomes part of the input to the next.

We then perform an iterative joint optimization, alternating between optimizing the sequence and the structure.

The design is then refined iteratively by alternating between two steps: optimizing the sequence and optimizing the structure. Repeating the two steps lets the sequence and the structure improve together, rather than fixing one and fitting the other to it. The details of each step are given in Appendix A.5.1.

We reject chains-of-thought that lose atomic coordination of the active site (Appendix A.5.1).

Throughout the optimization, each chain of thought is checked for whether the active-site atoms remain properly positioned. Any chain in which this atomic coordination is lost is discarded, since such a design would be unlikely to form the chromophore. The criteria are given in Appendix A.5.1.
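The control flow of the alternating optimization with rejection can be sketched as a short loop. This is a schematic only: `optimize_sequence`, `optimize_structure`, and `active_site_ok` are hypothetical callables standing in for the model-driven steps and the coordination check, the number of rounds is arbitrary, and the order of the two updates within a round is an assumption. The loop records every intermediate point, since the text draws candidates from intermediate as well as final points.

```python
from typing import Any, Callable, List, Optional, Tuple

Design = Tuple[str, Any]  # (sequence, structure); the structure type is left abstract


def run_chain(sequence: str,
              structure: Any,
              optimize_sequence: Callable[[str, Any], str],
              optimize_structure: Callable[[str, Any], Any],
              active_site_ok: Callable[[str, Any], bool],
              num_rounds: int = 4) -> Optional[List[Design]]:
    """Alternate sequence and structure updates.

    Returns every (sequence, structure) point visited, or None if the chain
    is rejected because the active-site check fails at any round.
    """
    visited: List[Design] = []
    for _ in range(num_rounds):
        sequence = optimize_sequence(sequence, structure)
        structure = optimize_structure(sequence, structure)
        if not active_site_ok(sequence, structure):
            return None  # reject the whole chain
        visited.append((sequence, structure))
    return visited


if __name__ == "__main__":
    # Toy stand-ins: the "sequence step" appends a residue, and the
    # "structure step" just records the current sequence length.
    grow = lambda seq, struct: seq + "A"
    measure = lambda seq, struct: len(seq)
    always_ok = lambda seq, struct: True

    print(run_chain("X", 0, grow, measure, always_ok, num_rounds=3))
```

Returning `None` for a rejected chain keeps its intermediate points out of the candidate pool, which matches the rule that chains losing active-site coordination are discarded.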

We draw a computational pool of tens of thousands of candidate GFP designs from the intermediate and final points in the iterative joint optimization stage of the generation protocol.

The iterative joint optimization stage of the generation protocol yields a large number of candidate GFP designs. Rather than keeping only the final output of each chain, we also collect designs from intermediate points of the optimization, which gives a computational pool of tens of thousands of candidates. This pool is the starting point for the filtering and ranking that selects designs for experimental testing.

We then bucket the designs by sequence similarity to known fluorescent proteins and filter and rank designs using a variety of metrics (details in Appendix A.5.1.5).

In this process, we first group the designs based on how similar they are to known fluorescent proteins. We then use a set of criteria to filter and rank the designs within each group. The specific criteria used for filtering and ranking are outlined in Appendix A.5.1.5. This approach allows us to identify the most promising designs for further study and development.
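The bucket-then-rank selection described above can be sketched in a few lines. Everything specific here is an assumption made for illustration: the 10-point bucket width, the single `score` field (standing in for the metrics in Appendix A.5.1.5, which are not reproduced here), and the number of designs kept per bucket.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Candidate(NamedTuple):
    name: str
    identity_to_known: float  # percent identity to the nearest known fluorescent protein
    score: float              # hypothetical ranking score; higher is better


def bucket_label(identity: float, width: int = 10) -> str:
    """Label the identity bucket, e.g. 57.0 -> '50-60%'. 100% falls in the top bucket."""
    lower = min(int(identity // width) * width, 100 - width)
    return f"{lower}-{lower + width}%"


def top_per_bucket(candidates: Iterable[Candidate],
                   per_bucket: int = 2,
                   width: int = 10) -> Dict[str, List[Candidate]]:
    """Group candidates by identity bucket and keep the best-scoring ones in each."""
    buckets: Dict[str, List[Candidate]] = defaultdict(list)
    for candidate in candidates:
        buckets[bucket_label(candidate.identity_to_known, width)].append(candidate)
    return {
        label: sorted(members, key=lambda c: c.score, reverse=True)[:per_bucket]
        for label, members in sorted(buckets.items())
    }


if __name__ == "__main__":
    pool = [
        Candidate("design_a", 57.0, 0.2),
        Candidate("design_b", 52.0, 0.9),
        Candidate("design_c", 55.0, 0.5),
        Candidate("design_d", 91.0, 0.1),
    ]
    for label, members in top_per_bucket(pool).items():
        print(label, [m.name for m in members])
```

Selecting per bucket, rather than globally, ensures that low-identity designs are represented even when higher-identity designs score better.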

We performed a first experiment with 88 designs on a 96 well plate, with the top generations in each sequence similarity bucket.

In this experiment, 88 different designs were tested on a 96 well plate. The designs were selected based on their sequence similarity, with the top generations in each bucket being chosen for testing. This means that the designs were grouped based on how similar their sequences were, and the best designs from each group were selected for further analysis. The use of a 96 well plate allowed for high-throughput testing of the designs, with each design being tested in a separate well. Overall, this experiment aimed to identify the best designs based on their sequence similarity and performance in the 96 well plate.
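For readers unfamiliar with plate notation, the well labels used in this section (B8, C10) follow the standard 96-well layout of 8 rows (A to H) by 12 columns. The helper below converts between a label and a 0-based index; the row-major ordering and the function names are assumptions for illustration, and the text does not say how designs were assigned to wells.

```python
ROWS = "ABCDEFGH"  # 8 rows on a standard 96-well plate
COLUMNS = 12       # 12 columns per row


def well_label(index: int) -> str:
    """Convert a 0-based, row-major well index to a label such as 'B8'."""
    if not 0 <= index < len(ROWS) * COLUMNS:
        raise ValueError("index must be in 0..95")
    row, col = divmod(index, COLUMNS)
    return f"{ROWS[row]}{col + 1}"


def well_index(label: str) -> int:
    """Convert a label such as 'C10' to its 0-based, row-major index."""
    row = ROWS.index(label[0].upper())  # raises ValueError for an unknown row letter
    col = int(label[1:])
    if not 1 <= col <= COLUMNS:
        raise ValueError("column must be in 1..12")
    return row * COLUMNS + (col - 1)


if __name__ == "__main__":
    for label in ("B8", "C10"):
        print(label, "->", well_index(label))
    print("wells not used by the 88 designs:", len(ROWS) * COLUMNS - 88)
```

The 88 designs leave 8 wells free, presumably for the positive and negative controls mentioned in the Figure 4B caption.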

Each generated protein was synthesized, expressed in E. coli, and measured for fluorescence activity at an excitation wavelength of 485 nm (Fig. 4B, left).

Each generated protein was synthesized and expressed in E. coli. It was then measured for fluorescence activity at an excitation wavelength of 485 nm. The results of this measurement are shown in Fig. 4B (left).

We measured brightness similar to positive controls from a number of designs that have higher sequence identity with naturally occurring GFPs.

Several designs with relatively high sequence identity to naturally occurring GFPs were about as bright as the positive controls. This shows that the procedure can produce functional proteins, although these particular designs remain comparatively close to known sequences.

We also identify a design in well B8 (highlighted in a black circle) with only 36% sequence identity to the 1QY3 sequence and 57% sequence identity to the nearest existing fluorescent protein, tagRFP.

The highlighted design in well B8 has a sequence identity of only 36% to the 1QY3 sequence and 57% to the nearest existing fluorescent protein, tagRFP. This suggests that the design in well B8 is a novel fluorescent protein with a unique sequence that is not closely related to the 1QY3 sequence or tagRFP.
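
The identity figures quoted above are percentages of matching positions between two aligned sequences. As a minimal sketch, assuming the two sequences have already been aligned to equal length with `-` as the gap character, the calculation might look like the following. The function name and the choice of denominator (all columns that are not gaps in both sequences) are assumptions for illustration; tools such as BLAST apply their own conventions, so values can differ slightly.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of compared alignment columns in which both residues match.

    Columns where both sequences have a gap are skipped. A column with a gap
    in only one sequence counts as a mismatch (an assumed convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue  # column carries no information about this pair
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# Toy sequences for illustration only, not real fluorescent proteins.
print(percent_identity("ACDEFG", "ACDQFG"))
print(percent_identity("AC-E", "ACDE"))
```

Applied to a real alignment of the B8 design against 1QY3 or tagRFP, a function of this kind would yield figures of the sort reported above, subject to the alignment and gap conventions used.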

This design was 50x less bright than natural GFPs and its chromophore matured over the course of a week, instead of in under a day, but it presents a signal of function in a new portion of sequence space that to our knowledge has not been found in nature or through protein engineering.

The B8 design is about 50 times dimmer than natural GFPs, and its chromophore takes about a week to mature rather than less than a day. Even so, it shows measurable function in a region of sequence space that, to the authors' knowledge, has not been reached either in nature or through protein engineering.

We continue the chain of thought starting from the sequence of the design in well B8 to generate a protein with improved brightness, using the same iterative joint optimization and ranking procedure as above.

Starting from the sequence of the B8 design, the same iterative joint optimization and ranking procedure described earlier is applied again, this time with the goal of improving brightness. In other words, B8 serves as the seed for a further round of generation: candidate sequences are produced from it and ranked, and the top candidates are carried forward for synthesis and measurement.

We create a second 96 well plate of designs, and using the same plate reader assay we find that several designs in this cohort have a brightness in the range of GFPs found in nature.

We have generated a second 96 well plate of designs and employed the same plate reader assay to identify several designs that exhibit a brightness level comparable to that of GFPs found in nature.

The best design, located in well C10 of the second plate (Fig. 4B right), we designate esmGFP.

The best design on the second plate, in well C10 (Fig. 4B right), is named esmGFP. This is the protein characterized in the rest of the section.

We find esmGFP exhibits brightness in the distribution of natural GFPs.

In other words, the brightness of esmGFP falls within the range observed for naturally occurring GFPs.

We evaluated the fluorescence intensity at 0, 2, and 7 days of chromophore maturation, and plot these measurements for esmGFP, a replicate of B8, a chromophore knockout of B8, along with three natural GFPs: avGFP, cgreGFP, and ppluGFP (Fig. 4C).

Fluorescence intensity was measured at 0, 2, and 7 days of chromophore maturation for six proteins: esmGFP, a replicate of B8, a chromophore knockout of B8, and the three natural GFPs avGFP, cgreGFP, and ppluGFP. The measurements are plotted in Fig. 4C.

esmGFP takes longer to mature than the known GFPs that we measured, but achieves a comparable brightness after two days.

The chromophore of esmGFP matures more slowly than those of the natural GFPs that were measured, but by the two-day time point its brightness is comparable to theirs.

To validate that fluorescence was mediated by the intended Thr65 and Tyr66, we show that B8 and esmGFP variants where these residues were mutated to glycine lost fluorescence activity (Fig. S21).

The statement suggests that the researchers conducted an experiment to confirm that the fluorescence observed in their study was due to the presence of specific amino acid residues, Thr65 and Tyr66. To do this, they created variants of the B8 and esmGFP proteins by mutating these residues to glycine. The results showed that these variants lost their fluorescence activity, indicating that Thr65 and Tyr66 are indeed responsible for the observed fluorescence. This finding supports the hypothesis that these residues play a crucial role in the fluorescence mechanism of the proteins.
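
A knockout of this kind is a simple point substitution at the sequence level. The sketch below shows one way to construct such a variant in code, assuming 1-based residue numbering in which Thr65 and Tyr66 are positions 65 and 66 of the sequence being edited. The function name and the toy sequence are hypothetical; the actual B8 and esmGFP sequences are not given in this section, and their numbering may be offset relative to this convention.

```python
def mutate_to_glycine(sequence: str, positions: list[int]) -> str:
    """Return a copy of `sequence` with the given 1-based positions set to 'G'."""
    residues = list(sequence)
    for pos in positions:
        if not 1 <= pos <= len(residues):
            raise IndexError(f"position {pos} is outside the sequence")
        residues[pos - 1] = "G"  # convert 1-based position to 0-based index
    return "".join(residues)


# Toy placeholder sequence (NOT a real protein): alanines with T and Y
# placed at positions 65 and 66 to stand in for Thr65 and Tyr66.
toy_sequence = "A" * 64 + "TY" + "A" * 10
knockout = mutate_to_glycine(toy_sequence, [65, 66])
print(knockout[62:68])
```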

Analysis of the excitation and emission spectra of esmGFP reveals that its peak excitation occurs at $496 \mathrm{~nm}$, which is shifted $7 \mathrm{~nm}$ relative to the $489 \mathrm{~nm}$ peak for EGFP, while both proteins emit at a peak of $512 \mathrm{~nm}$ (Fig. 4D).

The excitation and emission spectra of esmGFP and EGFP were analyzed and compared. The results showed that the peak excitation of esmGFP occurs at 496 nm, which is shifted by 7 nm compared to the peak excitation of EGFP at 489 nm. However, both proteins emit at a peak of 512 nm. This information can be useful for researchers who want to use these fluorescent proteins in their experiments and need to know the optimal excitation wavelengths for each protein.

The shapes of the spectra indicated a narrower full-width-half-maximum (FWHM) for the excitation spectrum of esmGFP ($39 \mathrm{~nm}$ for esmGFP vs $56 \mathrm{~nm}$ for EGFP), whereas the FWHM of their emission spectra were highly comparable ( $35 \mathrm{~nm}$ and $39 \mathrm{~nm}$, respectively).

The excitation spectrum of esmGFP is narrower than that of EGFP, with a full-width-half-maximum (FWHM) of 39 nm versus 56 nm. The emission spectra, by contrast, have highly comparable widths of 35 nm and 39 nm, respectively. esmGFP is therefore efficiently excited over a narrower band of wavelengths than EGFP, while the shape of its emission is similar.
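
For readers unfamiliar with the FWHM measure, the following is a minimal sketch of how it can be computed from a sampled, single-peaked spectrum by linear interpolation at the half-maximum crossings. It is an illustration under stated assumptions, not the analysis used for Fig. 4D: the function name is hypothetical, the baseline is assumed to be zero, and the example spectrum is a synthetic Gaussian rather than measured data.

```python
import math


def fwhm(wavelengths: list[float], intensities: list[float]) -> float:
    """Full width at half maximum of a single-peaked spectrum.

    Assumes a zero baseline and uses linear interpolation between the two
    samples that bracket the half-maximum level on each side of the peak.
    """
    peak = max(intensities)
    half = peak / 2.0
    i_peak = intensities.index(peak)

    # Walk outward from the peak to the first sample at or below half maximum.
    left = i_peak
    while left > 0 and intensities[left] > half:
        left -= 1
    right = i_peak
    while right < len(intensities) - 1 and intensities[right] > half:
        right += 1
    if intensities[left] > half or intensities[right] > half:
        raise ValueError("spectrum does not fall to half maximum on both sides")

    def crossing(i: int, j: int) -> float:
        # Wavelength at which the line through samples i and j equals `half`.
        x0, x1 = wavelengths[i], wavelengths[j]
        y0, y1 = intensities[i], intensities[j]
        return x0 + (half - y0) * (x1 - x0) / (y1 - y0)

    return crossing(right - 1, right) - crossing(left, left + 1)


# Synthetic Gaussian peak (illustrative only): centre 500 nm, sigma 10 nm.
# For a Gaussian the exact FWHM is 2 * sqrt(2 * ln 2) * sigma.
xs = [float(w) for w in range(400, 601)]
ys = [math.exp(-((w - 500.0) ** 2) / (2 * 10.0 ** 2)) for w in xs]
print(fwhm(xs, ys), 2 * math.sqrt(2 * math.log(2)) * 10.0)
```

Real spectra usually need baseline subtraction and may have shoulders or multiple peaks, which this sketch does not handle.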

Overall esmGFP exhibits spectral properties consistent with known GFPs.

In other words, the excitation and emission characteristics of esmGFP fall within the range seen for previously characterized GFPs: its emission peak matches that of EGFP, its excitation peak is shifted by only 7 nm, and its spectral widths are of the same order.

We next sought to understand how the sequence and structure of esmGFP compares to known proteins.

Having established that esmGFP is functional, the analysis turns to how far it is from known proteins. The comparisons that follow are based on sequence searches and alignments against existing fluorescent proteins, together with the locations of the sequence differences in the structure.

A BLAST (51) search against the non-redundant protein sequences database and an MMseqs (52) search of ESM3's training set report the same top hit, tagRFP (which was also the nearest neighbor to B8), with $58 \%$ sequence identity, representing 96 mutations throughout the sequence.

Two independent searches were run with esmGFP as the query. The BLAST search was run against the non-redundant protein sequences database, a large collection of protein sequences from many organisms. The MMseqs search was run against ESM3's own training set, which checks whether the model had seen a closer sequence during training.

Both searches return the same top hit, tagRFP, at 58% sequence identity, which corresponds to 96 differing positions spread throughout the sequence. tagRFP was also the nearest neighbor of the earlier B8 design (57% identity), so esmGFP remains about as distant from known fluorescent proteins as B8 was while being much brighter.

tagRFP is a designed variant, and the closest wildtype sequence to esmGFP from the natural world is eqFP578, a red fluorescent protein, which differs from esmGFP by 107 sequence positions ( $53 \%$ identity).

tagRFP is itself a designed variant rather than a natural protein. The closest wild-type sequence to esmGFP is eqFP578, a red fluorescent protein, which differs from esmGFP at 107 sequence positions (53% identity).
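
The reported mutation counts and identity percentages can be cross-checked with simple arithmetic. The sketch below is a consistency check only: the alignment length is not stated in this section, and the value of 229 positions used here is an assumption chosen because it is consistent with both reported pairs of numbers (96 differences at 58% and 107 differences at 53%).

```python
def identity_from_differences(n_differences: int, aligned_length: int) -> float:
    """Percent identity implied by a count of differing positions."""
    if aligned_length <= 0 or not 0 <= n_differences <= aligned_length:
        raise ValueError("invalid difference count or length")
    return 100.0 * (aligned_length - n_differences) / aligned_length


# ASSUMPTION: alignment length of 229 positions (not given in the text).
ASSUMED_LENGTH = 229

# Difference counts taken from the text above.
for name, n_diff in [("tagRFP", 96), ("eqFP578", 107)]:
    pct = identity_from_differences(n_diff, ASSUMED_LENGTH)
    print(f"{name}: {n_diff} differences -> {pct:.1f}% identity")
```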

Sequence differences between esmGFP and tagRFP occur throughout the structure (Fig. 4E) with 22 mutations occurring in the protein's interior, which is known to be intensely sensitive to mutations due to chromophore proximity and a high density of interactions (46).

The differences between esmGFP and tagRFP are not confined to one region but occur throughout the structure. Of these, 22 fall in the protein's interior, which is highly sensitive to mutation because of its proximity to the chromophore and its high density of interactions.

Examination of a sequence alignment of 648 natural and designed GFP-like fluorescent proteins revealed that esmGFP has the level of similarity to all other FPs that is typically found when comparing sequences across taxonomic orders, but within the same taxonomic class (Fig. 4F).

The statement suggests that the level of similarity between esmGFP and other fluorescent proteins is comparable to the level of similarity found when comparing sequences across taxonomic orders within the same taxonomic class. This means that esmGFP is not very similar to other fluorescent proteins, but it is still within the same category of proteins. The statement is based on an analysis of a sequence alignment of 648 natural and designed GFP-like fluorescent proteins.

For example, the difference of esmGFP to other FPs is similar to the level of difference between FPs belonging to the orders of scleractinia (stony corals) and actiniaria (sea anemones), both of which belong to the larger class anthozoa of marine invertebrates (Fig. 4G).

The difference between esmGFP and other FPs is comparable to the level of difference observed between FPs belonging to the orders of scleractinia (stony corals) and actiniaria (sea anemones), which are both part of the larger class anthozoa of marine invertebrates. This is depicted in Figure 4G.

The closest FPs to esmGFP come from the anthozoa class (corals and anemones), average sequence identity $51.4 \%$, but esmGFP also shares some sequence identity with FPs from the hydrozoa (jellyfish) where the famous avGFP was discovered, average sequence identity $33.4 \%$ (Fig. S22).

Among known fluorescent proteins, those closest to esmGFP come from the anthozoa class (corals and anemones), with an average sequence identity of 51.4%. esmGFP also shares some sequence identity with FPs from the hydrozoa (jellyfish), the class in which the well-known avGFP was discovered, but the average there is lower, at 33.4%. The full comparison is shown in Fig. S22.

We can draw insight from evolutionary biology on the amount of time it would take for a protein with similar sequence identity to arise through natural evolution.

Evolutionary biology offers a way to put the sequence distance of esmGFP in context. Sequences diverge as species diverge, so the sequence identity between homologous proteins from two species can be related to the time since those species shared a common ancestor. The relationship is approximate, because substitution rates vary with factors such as selective pressure, population size, and mutation rate, so any such estimate should be read as a rough calibration rather than a precise date.

In Fig. 4G we show esmGFP alongside three Anthozoan GFPs.

Fig. 4G shows esmGFP alongside three GFPs from Anthozoan species, which serve as natural reference points for the comparison that follows.

We use a recent time-calibrated phylogenetic analysis of the Anthozoans (53) that estimated the millions of years ago (MYA) to last common ancestors to estimate evolutionary time between each pair of these species.

This describes how evolutionary time between species is estimated. Anthozoans are the class of marine invertebrates that includes corals and sea anemones. A phylogenetic analysis reconstructs the evolutionary relationships between species, and a time-calibrated one additionally assigns dates to the branch points, typically using fossil or other dated evidence. From the cited analysis (53), the time to the last common ancestor, in millions of years ago (MYA), can be read off for each pair of species, giving a measure of how long ago the two lineages diverged.

Using a larger dataset of six Anthozoan GFPs and species for which we have accurate MYA to last common ancestors and GFP sequence identities, we construct a simple estimator that correlates sequence identity between FPs to MY of evolutionary time between the species (Fig. 4H) to calibrate against natural evolution.

The calibration uses six Anthozoan GFPs from species for which both quantities are known: the time to the last common ancestor of each pair of species, and the sequence identity between the corresponding GFPs. From these pairs a simple estimator is built that relates sequence identity between FPs to millions of years (MY) of evolutionary time between the species. The estimator is shown in Fig. 4H.
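
The form of the estimator is not specified in this section. As a rough sketch of the idea, assuming a straight-line relationship fitted by ordinary least squares, the calibration could be written as follows. The function names are hypothetical, and the data points are made-up placeholders chosen to lie on a line; they are NOT the values from Fig. 4H and should not be read as results.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_my(identity: float, slope: float, intercept: float) -> float:
    """Estimated millions of years of divergence for a given sequence identity."""
    return slope * identity + intercept


# HYPOTHETICAL placeholder data: (sequence identity as a fraction,
# MY between species). Invented for illustration, not taken from the paper.
placeholder_identity = [0.90, 0.80, 0.70, 0.60]
placeholder_my = [50.0, 150.0, 250.0, 350.0]

slope, intercept = fit_line(placeholder_identity, placeholder_my)
print(slope, intercept)
```

With real species pairs in place of the placeholders, `estimate_my` would convert an observed sequence identity into an equivalent span of evolutionary time, which is the kind of calibration the text describes.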

Based on this analysis we estimate esmGFP represents an equivalent of over 500 million years of evolution from the closest protein that has been found in nature.

Applying this estimator to the sequence identity between esmGFP and its closest natural relative gives an equivalent of over 500 million years of evolution. This does not mean that esmGFP itself evolved, since it was generated by the model. The figure expresses how long natural evolution would be expected to take, under the calibration in Fig. 4H, to produce a protein this distant from the closest protein found in nature.
