esm3.generating_a_new_fluorescent_protein.full10

==============================

We sought to understand if the base pre-trained ESM3 model has sufficient biological fidelity to generate functional proteins.

This sentence states the goal of the study: to find out whether ESM3, a computer model pre-trained on a large amount of protein data, captures enough biology to generate proteins that actually function. The researchers wanted to know if the base model, without further specialization, could be used to design new proteins that work properly when made in the lab.

We set out to create a functional green fluorescent protein (GFP) with low sequence similarity to existing ones.

We wanted to make a new type of green fluorescent protein (GFP) that works well but is different from the ones that already exist.

We chose the functionality of fluorescence because it is difficult to achieve, easy to measure, and one of the most beautiful mechanisms in nature.

The sentence means that the person or group who made the choice of using fluorescence did so because it is a challenging process, can be easily observed, and is considered one of the most visually appealing natural phenomena. Fluorescence is a type of light emission that occurs when certain materials absorb and then release light energy. It is often used in scientific research and technology because it can provide valuable information about the properties of materials and biological systems.

Responsible for the fluorescence of jellyfish and the vivid colors of coral (43), proteins in the GFP family are unique in their ability to form a fluorescent chromophore without cofactors or substrates (27).

This sentence is talking about a special type of protein called GFP, which is found in jellyfish and coral. GFP is unique because it can create bright colors without needing any other molecules to help it. This makes GFP very useful for scientists who want to study cells and other tiny things, because they can use GFP to make those things glow and be easier to see.

This property allows the GFP sequence to be inserted into the genomes of other organisms to visibly label molecules, cellular structures, or processes, providing a foundational toolkit that has been broadly applied across the biosciences.

This sentence is about a special property of a molecule called GFP (green fluorescent protein). This property allows scientists to add GFP to the DNA of other living things, like bacteria or animals, to make them glow green under certain conditions. This glowing effect can be used to help scientists see and study different parts of cells or organisms. The sentence is saying that this ability to use GFP in this way has been very useful in many different areas of science.

The GFP family has been the subject of decades of protein engineering efforts, but still the vast majority of functional variants have come from prospecting the natural world.

This sentence is about a group of proteins called GFP (Green Fluorescent Protein) that have been studied and modified for many years. Despite all the work done to change these proteins, most of the useful versions have been found in nature.

Rational design and machine learning-assisted high-throughput screening have yielded GFP sequences with improved properties (such as higher brightness or stability, or differently colored variants) that incorporated small numbers of mutations (typically 5 to 15, out of the total 238 amino acid coding sequence) from the originating sequence.

This sentence summarizes earlier protein engineering work on GFP (green fluorescent protein). Using rational design and machine learning-assisted high-throughput screening, researchers have made GFP brighter, more stable, or differently colored. Those variants differ from their starting sequence at only a few positions, typically 5 to 15 of the 238 amino acids (the building blocks of proteins).

Studies have shown that only a few random mutations reduce fluorescence to zero (44-46).

This sentence is saying that research has found that just a small number of changes that happen by chance can completely stop something from glowing. The numbers 44-46 are likely references to specific studies or sources that support this finding.

In rare cases, however, leveraging high-throughput experimentation, scientists have been able to introduce up to $40-50$ mutations, i.e. a $20 \%$ difference in total sequence identity $(44,47,48)$, while retaining GFP fluorescence.

In rare cases, scientists have been able to make 40-50 changes to the amino acid sequence of GFP and still obtain a protein that fluoresces. Out of 238 amino acids, that is a difference of about 20% in sequence identity, far more than the handful of random mutations that usually abolish fluorescence.

Generating a new GFP would require materialization of the complex biochemistry and physics that underlie its fluorescence.

This sentence means that a newly generated GFP (green fluorescent protein) would have to physically realize the complex chemistry and physics that make the protein glow. In other words, the design must get many intricate details right at once, which makes it a difficult task.

In all GFPs, an autocatalytic process forms the chromophore from three key amino acids in the core of the protein.

This sentence is about a type of protein called GFPs (Green Fluorescent Proteins). It explains that in all GFPs, a special process happens where three important parts of the protein work together to create a part of the protein called the chromophore. This process is called autocatalytic because it happens automatically without needing any outside help.

The unique structure of GFP, a kinked central alpha helix surrounded by an eleven stranded beta barrel

GFP stands for Green Fluorescent Protein, which is a protein that emits green light when it is exposed to certain types of light. The sentence is describing the unique shape of GFP, which has a central alpha helix that is bent or kinked, and is surrounded by an eleven stranded beta barrel. This specific structure is what allows GFP to emit green light.

Figure 4. Generating a new fluorescent protein with a chain of thought.

The figure title refers to how the new protein was produced. ESM3 generates the design through a "chain of thought": a series of intermediate generation steps in which each step is conditioned on the output of the previous one. The panels that follow show the prompt, the experiments, and the properties of the resulting protein.

Figure 4 (A) We prompt ESM3 with the sequence and structure of residues required for forming and catalyzing the chromophore reaction, as well as the structure of part of the central alpha helix from a natural fluorescent protein (left). Through a chain of thought, ESM3 generates design candidates (right).

The sentence describes how ESM3 is prompted to design new fluorescent proteins. The model is given the sequence and structure of the residues that form and catalyze the chromophore reaction, plus the structure of part of the central alpha helix from a natural fluorescent protein. Starting from this partial information, ESM3 generates complete design candidates through a chain of thought.

(B) ESM3 found a bright GFP distant from other known GFPs in two experiments. We measured fluorescence in E. coli lysate. Top row, photograph of plates. Bottom row, plate reader fluorescence quantification. Positive controls of known GFPs are marked with purple circles; negative controls with no GFP sequence or no E. coli are marked with red circles. In the first experiment (left) we expressed designs with a range of sequence identities. A notable design with low sequence identity to known fluorescent proteins appears in the well labeled B8 (highlighted in a black circle bottom, white circle top). We continue the chain of thought from the protein in B8 for the second experiment (right). A bright design appears in the well labeled C10 (black circle bottom, white circle top) which we designate esmGFP.

In this sentence, the researchers are describing their experiments to find a new type of fluorescent protein called esmGFP. They used E. coli bacteria to produce the protein and measured its fluorescence using a plate reader. They also used known fluorescent proteins as positive controls and negative controls to compare the results. In the first experiment, they found a design with low sequence identity to known fluorescent proteins that appeared bright in the well labeled B8. In the second experiment, they continued to study this design and found a bright design in the well labeled C10, which they named esmGFP.

(C) esmGFP exhibits fluorescence intensity similar to common GFPs. Normalized fluorescence is shown for a subset of proteins in experiment 2.

The sentence is saying that a protein called esmGFP has a level of brightness that is similar to other proteins called GFPs. The brightness is measured by how much light the protein gives off when it is exposed to a certain type of light. The sentence also mentions that the brightness of esmGFP was compared to other proteins in a study, and the results are shown in a graph.

(D) Excitation and emission spectra for esmGFP overlaid on the spectra of EGFP.

This sentence is describing two types of spectra (which are like graphs that show the intensity of light at different wavelengths) for a type of protein called esmGFP. The sentence is saying that the spectra for esmGFP are being compared to the spectra for another protein called EGFP. The word "overlaid" means that the two spectra are being shown on top of each other, so that they can be easily compared. The sentence is likely being used in a scientific paper or presentation, and is intended for an audience of experts who are familiar with the technical terms being used.

(E) Two cutout views of the central alpha helix and the inside of the beta barrel of a predicted structure of esmGFP. The 96 mutations esmGFP has relative to its nearest neighbor, tagRFP, are shown in blue.

The sentence describes two cutout views of a predicted structure of a protein called esmGFP, showing the central alpha helix and the inside of the beta barrel. It also notes that esmGFP has 96 mutations compared to its nearest known neighbor, a protein called tagRFP, and that these mutations are shown in blue. The views make it visible how widely the differences between the two proteins are spread through the structure.

(F) Cumulative density of sequence identity between fluorescent proteins across taxa. esmGFP has the level of similarity to all other FPs that is typically found when comparing sequences across orders, but within the same class.

This sentence is describing the level of similarity between a specific fluorescent protein called esmGFP and other fluorescent proteins found in different organisms. The sentence is saying that the level of similarity between esmGFP and other fluorescent proteins is similar to what is usually found when comparing sequences of proteins across different groups of organisms, but within the same broad category of organisms.

(G) Evolutionary distance by time in millions of years (MY) and sequence identities for three example anthozoa GFPs and esmGFP.

This sentence is describing the relationship between three different types of GFPs (green fluorescent proteins) found in anthozoa (a type of marine animal). The "evolutionary distance" refers to how closely related these GFPs are to each other based on their genetic sequences. This distance is measured in millions of years, which represents the amount of time that has passed since these GFPs evolved from a common ancestor. The "sequence identities" refer to how similar the genetic sequences of these GFPs are to each other. This is expressed as a percentage, with higher percentages indicating greater similarity. The sentence is providing specific information about the evolutionary relationships and genetic similarities between these three GFPs.

(H) Estimator of evolutionary distance by time (MY) from GFP sequence identity.

This sentence is about a method used to estimate how long ago two species diverged from a common ancestor, based on the similarity of a specific protein called GFP (green fluorescent protein) in their DNA. The method involves comparing the GFP sequences of the two species and using the number of differences to calculate an estimate of the time since they diverged, measured in millions of years (MY). This estimate is called an "evolutionary distance" because it reflects the amount of genetic change that has occurred over time as the two species evolved independently.

We estimate esmGFP is over 500 million years of natural evolution removed from the closest known protein.

The unique structure of GFP, a kinked central alpha helix surrounded by an eleven stranded beta barrel with inward facing coordinating residues, enables this reaction (49).

These lines cover two separate points. The first closes the caption of Figure 4H: the researchers estimate that esmGFP is as different from the closest known protein as if the two were separated by over 500 million years of natural evolution. The second completes the sentence that was interrupted by the figure: the residues of the beta barrel that face inward coordinate the reaction that forms the chromophore.

Once formed, the chromophore must not just absorb light but also emit it in order to be fluorescent.

In this sentence, "chromophore" refers to a molecule that is responsible for giving something its color. For example, the chromophore in a red apple is what makes it appear red.

When we say that the chromophore must "absorb light," we mean that it needs to be able to take in light energy. This is what happens when light shines on an object - the chromophores in the object absorb some of the light energy.

However, in order for the object to appear fluorescent, the chromophore also needs to be able to emit light. This means that after it absorbs light energy, it needs to release some of that energy back out in the form of light. This is what makes fluorescent objects glow under certain conditions.

So, in summary, the sentence is saying that in order for something to be fluorescent, the molecules responsible for its color (chromophores) need to be able to both absorb and emit light energy.

Light emission is highly sensitive to the local electronic environment of the chromophore.

This sentence means that the process of emitting light is greatly affected by the specific conditions of the environment surrounding the molecule responsible for the emission of light, known as the chromophore. This is important because it suggests that changes in the local environment of the chromophore can have a significant impact on the properties of the light that is emitted.

For these reasons, obtaining a new functional GFP would require precise configuration of both the active site and the surrounding long range tertiary interactions throughout the beta barrel.

This sentence is discussing the process of creating a new type of GFP (green fluorescent protein) that can function properly. It states that in order to achieve this, both the active site (the part of the protein that performs its function) and the surrounding long range tertiary interactions (the way the protein folds and interacts with itself) need to be carefully arranged in a specific way. This is important because if these factors are not properly configured, the GFP may not work as intended.

In an effort to generate new GFP sequences, we directly prompt the base pretrained 7B parameter ESM3 to generate a 229 residue protein conditioned on the positions Thr62, Thr65, Tyr66, Gly67, Arg96, Glu222, which are critical residues for forming and catalyzing the chromophore reaction (Fig. 4A).

In order to create new GFP sequences, we are using a computer program called ESM3 that has been trained on a large amount of data. We are specifically asking the program to generate a protein that is 229 amino acids long, and we are providing it with information about six specific amino acids that are important for the protein to function properly. These six amino acids are located at positions 62, 65, 66, 67, 96, and 222 in the protein sequence, and they are critical for forming the chromophore and catalyzing the reaction that creates it. Everything else about the protein is left for the program to generate.

We additionally condition on the structure of residues 58 through 71 from the experimental structure in 1QY3, which are known to be structurally important for the energetic favorability of chromophore formation (50).

This sentence describes a second piece of conditioning information given to the model. 1QY3 is the identifier of an experimentally determined structure. The researchers take the structure of residues 58 through 71 from it, a stretch known to be important for making chromophore formation energetically favorable, and include that structure in the prompt. "Conditioning" on this structure means that the model must generate a protein consistent with this fixed fragment.

Specifically, sequence tokens, structure tokens, and atomic coordinates of the backbone are provided at the input and generation begins from a nearly completely masked array of tokens corresponding to 229 residues, except for the token positions used for conditioning.

This sentence describes exactly what the model is given and where generation starts. The prompt covers 229 positions, one per residue of the protein to be generated. At the positions used for conditioning, the model receives sequence tokens (the amino acid identity), structure tokens, and the atomic coordinates of the backbone. Every other position is masked, meaning it is left blank. Generation then consists of the model filling in the masked tokens in a way that is consistent with the conditioned positions.
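
As a rough illustration of the prompt layout described above, the sketch below builds a 229-position sequence track that is masked everywhere except the six conditioning residues, and a parallel track of flags marking which positions carry structure information. Following the Figure 4A caption, the six residues are flagged for structure conditioning along with residues 58 through 71. The mask symbol, the function name, and the use of plain Python lists are assumptions made for illustration; this is not the ESM3 interface, and real structure tokens and coordinates are not modeled.

```python
# Hypothetical sketch of the prompt layout; this is not the ESM3 API.
MASK = "_"    # assumed placeholder for a masked token
LENGTH = 229  # number of residues to generate, from the text

# Conditioning residues named in the text (1-indexed position -> amino acid):
# Thr62, Thr65, Tyr66, Gly67, Arg96, Glu222.
SEQUENCE_CONDITIONING = {62: "T", 65: "T", 66: "Y", 67: "G", 96: "R", 222: "E"}
# Residues whose structure is taken from 1QY3 (inclusive, 1-indexed).
STRUCTURE_RANGE = (58, 71)


def build_prompt(length=LENGTH):
    """Return (sequence_track, structure_flags) for a mostly masked prompt."""
    sequence = [MASK] * length
    for position, residue in SEQUENCE_CONDITIONING.items():
        sequence[position - 1] = residue  # convert to 0-indexed
    start, end = STRUCTURE_RANGE
    structure_flags = [
        (start <= i + 1 <= end) or (i + 1 in SEQUENCE_CONDITIONING)
        for i in range(length)
    ]
    return sequence, structure_flags


sequence, structure_flags = build_prompt()
print("".join(sequence[55:72]))  # residues 56 through 72
print(sum(tok != MASK for tok in sequence), "sequence positions fixed")
print(sum(structure_flags), "structure positions fixed")
```

Only a small fraction of the 229 positions is fixed in this layout, which matches the description of a "nearly completely masked" starting point.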

We generate designs using a chain-of-thought procedure as follows.

The sentence "We generate designs using a chain-of-thought procedure as follows" introduces the procedure used to create designs. "Chain of thought" here means that the model generates a design in several stages, with each stage conditioned on the output of the previous one. The phrase "as follows" signals that the individual steps are described next.

The model first generates structure tokens, effectively creating a protein backbone.

The sentence is about a model that creates a protein backbone by generating structure tokens. This means that the model is able to create a basic structure for a protein, which is an important building block for many biological processes. The sentence is likely referring to a specific type of model or software that is used in the field of biology or biochemistry.

Backbones that have sufficiently good atomic coordination of the active site but differentiated overall structure from the 1QY3 backbone pass through a filter to the next step of the chain.

This sentence is describing a process where a computer program is analyzing the structure of a protein. The "backbones" refer to the main chain of the protein, which is made up of amino acids. The "active site" is a specific region of the protein where chemical reactions take place. The program is checking to see if the backbones have a good arrangement of atoms in the active site, which is important for the protein to function properly. If the backbones pass this test, they move on to the next step in the analysis. The "filter" is a set of criteria that the backbones must meet in order to continue in the analysis. The "overall structure" refers to the shape of the protein, which can vary even if the active site is well-coordinated. The sentence is saying that the program is looking for backbones that have a good active site structure, but also have a different overall shape from a specific protein called 1QY3.
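
The filter logic in this step can be sketched as two distance checks: one on the active-site atoms and one on the whole backbone. The thresholds, the function names, and the choice of RMSD as the distance measure are assumptions for illustration; the actual criteria are given in the paper's appendix and are not reproduced here. The sketch also assumes the two backbones are already superimposed.

```python
import math

# Hypothetical thresholds in angstroms; not taken from the paper.
MAX_ACTIVE_SITE_RMSD = 1.5
MIN_GLOBAL_RMSD = 3.0


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z).

    Assumes the coordinates are already superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def passes_filter(candidate, template, active_site_indices):
    """Keep backbones that preserve the active site but differ globally."""
    site_candidate = [candidate[i] for i in active_site_indices]
    site_template = [template[i] for i in active_site_indices]
    active_site_ok = rmsd(site_candidate, site_template) <= MAX_ACTIVE_SITE_RMSD
    globally_different = rmsd(candidate, template) >= MIN_GLOBAL_RMSD
    return active_site_ok and globally_different


# Toy coordinates (made up): ten points on a line stand in for a backbone.
template = [(float(i), 0.0, 0.0) for i in range(10)]
shifted = [p if i < 2 else (p[0], 5.0, 0.0) for i, p in enumerate(template)]
print(passes_filter(shifted, template, active_site_indices=[0, 1]))
print(passes_filter(template, template, active_site_indices=[0, 1]))
```

The two conditions pull in opposite directions on purpose: the first keeps the chemistry intact, the second pushes the design away from the template backbone.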

We add the generated structure to the original prompt to generate a sequence conditioned on the new prompt.

This sentence describes the second step of the chain. The backbone structure that the model generated, and that passed the filter, is added to the original prompt. The model is then asked to generate an amino acid sequence consistent with this extended prompt, so the sequence is designed to fit the new backbone.

We then perform an iterative joint optimization, alternating between optimizing the sequence and the structure.

This sentence is describing a process where two things are being optimized - the sequence and the structure. The optimization is done in a specific way, where the optimization of one thing is done first, and then the optimization of the other thing is done, and this process is repeated multiple times. This is called an iterative joint optimization.

We reject chains-of-thought that lose atomic coordination of the active site (Appendix A.5.1).

This sentence describes a quality check applied during generation. A "chain of thought" here is one run of the step-by-step generation procedure, not a human idea or theory. If at any point the atoms of the active site (the part of the protein where the chromophore forms) are no longer arranged correctly, that run is discarded. The details of this check are given in Appendix A.5.1.
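
The steps above (generate a backbone, generate a sequence for it, alternate between optimizing the two, and reject chains that lose the active site) can be sketched as a simple loop. Every function passed in below is a hypothetical stand-in supplied by the caller, the number of rounds is an assumption, and the backbone filter is collapsed into a single predicate; the sketch only shows the control flow, not how ESM3 performs each step.

```python
def run_chain(prompt, generate_structure, generate_sequence,
              optimize_sequence, optimize_structure,
              active_site_ok, rounds=3):
    """Run one chain; return its (sequence, structure) points, or [] if rejected."""
    structure = generate_structure(prompt)
    if not active_site_ok(structure):
        return []
    # The generated structure joins the prompt for sequence generation.
    sequence = generate_sequence(prompt, structure)
    candidates = [(sequence, structure)]
    for _ in range(rounds):
        # Alternate: improve the sequence, then the structure.
        sequence = optimize_sequence(sequence, structure)
        structure = optimize_structure(sequence, structure)
        if not active_site_ok(structure):
            return []  # the whole chain is rejected
        candidates.append((sequence, structure))  # intermediate and final points
    return candidates


# Toy stand-ins: the "structure" is an integer and the "sequence" a string.
stubs = dict(
    generate_structure=lambda prompt: 0,
    generate_sequence=lambda prompt, structure: "seq0",
    optimize_sequence=lambda sequence, structure: sequence + "+",
    optimize_structure=lambda sequence, structure: structure + 1,
)
chain = run_chain("prompt", active_site_ok=lambda s: s < 10, **stubs)
rejected = run_chain("prompt", active_site_ok=lambda s: s < 2, **stubs)
print(len(chain), len(rejected))
```

Keeping the intermediate points, rather than only the final one, mirrors the text's description of drawing candidates from both intermediate and final stages of the optimization.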

We draw a computational pool of tens of thousands of candidate GFP designs from the intermediate and final points in the iterative joint optimization stage of the generation protocol.

In this sentence, the speaker is describing a process where they are creating a large number of potential designs for a protein called GFP. They are using a computer program to generate these designs, and they are doing it by looking at the results of a previous stage in the process. The "iterative joint optimization stage" refers to a part of the process where they are trying to improve the designs by making small changes and testing them. The "intermediate and final points" refer to the results of these tests at different stages of the optimization process. By looking at these results, they are able to generate a large number of potential designs for GFP. The "computational pool" refers to the collection of all these designs, which they are using to find the best possible design for GFP.

We then bucket the designs by sequence similarity to known fluorescent proteins and filter and rank designs using a variety of metrics (details in Appendix A.5.1.5).

This sentence is about a process of analyzing and organizing data related to fluorescent proteins. The process involves grouping similar designs together based on their sequence similarity to known fluorescent proteins. Then, the designs are filtered and ranked using various metrics, which are explained in more detail in a section called Appendix A.5.1.5. This process helps researchers to better understand and compare the different designs of fluorescent proteins.
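
Bucketing and ranking of this kind can be sketched with a dictionary keyed by identity range. The bucket width, the single `score` field, and the number of designs kept per bucket are assumptions for illustration; the paper ranks with a variety of metrics described in its appendix, which this sketch does not attempt to reproduce.

```python
from collections import defaultdict


def bucket_and_rank(designs, bucket_width=10, top_k=2):
    """Group designs by identity bucket, then keep the top_k by score.

    Each design is a dict with 'name', 'identity' (percent identity to the
    nearest known fluorescent protein, 0-100), and 'score' (higher is better).
    Returns {bucket_lower_bound: [designs, best first]}.
    """
    buckets = defaultdict(list)
    for design in designs:
        lower = int(design["identity"] // bucket_width) * bucket_width
        buckets[lower].append(design)
    return {
        lower: sorted(members, key=lambda d: d["score"], reverse=True)[:top_k]
        for lower, members in sorted(buckets.items())
    }


# Made-up designs for demonstration; these values are not from the paper.
designs = [
    {"name": "a", "identity": 57.0, "score": 0.2},
    {"name": "b", "identity": 52.0, "score": 0.9},
    {"name": "c", "identity": 81.0, "score": 0.5},
    {"name": "d", "identity": 55.0, "score": 0.6},
]
ranked = bucket_and_rank(designs)
for lower, members in ranked.items():
    print(lower, [d["name"] for d in members])
```

Selecting the best designs per bucket, instead of the best overall, keeps low-identity designs in the experiment even when high-identity designs score better.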

We performed a first experiment with 88 designs on a 96 well plate, with the top generations in each sequence similarity bucket.

In this sentence, the researchers describe their first laboratory experiment. They tested 88 designs on a plate with 96 wells. The designs had been grouped into buckets according to how similar their sequences are to known fluorescent proteins, and the top-ranked designs from each bucket were chosen for testing.

Each generated protein was synthesized, expressed in E. coli, and measured for fluorescence activity at an excitation wavelength of $485 \mathrm{~nm}$ (Fig. 4B left).

This sentence is describing a scientific experiment where a protein was created and tested for its ability to fluoresce (glow) when exposed to a specific type of light. The protein was made using a process called synthesis, which involves combining different molecules to create a new substance. The protein was then introduced into a type of bacteria called E. coli, which acted as a host for the protein. Finally, the protein was tested for its fluorescence activity by shining a light with a wavelength of 485 nanometers on it and measuring how much it glowed. The results of this experiment are shown in a figure labeled "Fig. 4B left."

We measured brightness similar to positive controls from a number of designs that have higher sequence identity with naturally occurring GFPs.

In this sentence, the researchers report that a number of their designs were about as bright as the positive controls, which are known GFPs included as a reference. These bright designs were the ones with higher sequence identity to naturally occurring GFPs, meaning their amino acid sequences are more similar to those of natural proteins. GFP stands for green fluorescent protein, which is a protein that glows green under certain conditions.

We also identify a design in well B8 (highlighted in a black circle) with only $36 \%$ sequence identity to the 1QY3 sequence and $57 \%$ sequence identity to the nearest existing fluorescent protein, tagRFP.

The sentence is discussing the identification of a specific design in a well labeled B8. This design has a low level of similarity to a previously known sequence called 1QY3, with only 36% sequence identity. Additionally, it has a slightly higher level of similarity to the nearest existing fluorescent protein, tagRFP, with 57% sequence identity. This information is important for understanding the uniqueness and potential usefulness of this design.

This design was 50x less bright than natural GFPs and its chromophore matured over the course of a week, instead of in under a day, but it presents a signal of function in a new portion of sequence space that to our knowledge has not been found in nature or through protein engineering.

This sentence describes the design found in well B8. It was about 50 times dimmer than natural GFPs, and its chromophore took about a week to mature rather than less than a day. Even so, it fluoresces, which shows that a functional protein exists in a region of sequence space that, to the authors' knowledge, had not been reached by nature or by protein engineering.

We continue the chain of thought starting from the sequence of the design in well B8 to generate a protein with improved brightness, using the same iterative joint optimization and ranking procedure as above.

This sentence describes how the researchers improved on the B8 design. They continued the chain of thought from the sequence of the protein in well B8 and applied the same iterative joint optimization and ranking procedure as before, this time aiming for a brighter protein.

We create a second 96 well plate of designs, and using the same plate reader assay we find that several designs in this cohort have a brightness in the range of GFPs found in nature.

We made a new plate with 96 different designs and tested them using a special machine. We found that some of the designs were as bright as GFPs that occur naturally.

The best design, located in well C10 of the second plate (Fig. 4B right), we designate esmGFP.

The sentence is describing a design that was found in a specific location (well C10) on a plate (the second plate). The design is considered the best and has been given the name "esmGFP". The sentence is likely referring to a scientific experiment or study where different designs were tested and evaluated.

We find esmGFP exhibits brightness in the distribution of natural GFPs.

The sentence is saying that a protein called esmGFP has a level of brightness that is similar to the brightness of other naturally occurring proteins called GFPs.

We evaluated the fluorescence intensity at 0, 2, and 7 days of chromophore maturation, and plot these measurements for esmGFP, a replicate of B8, a chromophore knockout of B8, along with three natural GFPs: avGFP, cgreGFP, and ppluGFP (Fig. 4C).

We conducted an experiment to measure the brightness of the proteins at different stages of chromophore maturation. We did this by measuring the intensity of the fluorescence at three time points: 0 days, 2 days, and 7 days. We then plotted the measurements, which shows how the brightness of each protein changes over time. The plot covers esmGFP, a replicate of B8, a version of B8 with its chromophore knocked out, and three natural GFPs (avGFP, cgreGFP, and ppluGFP). The graph of our results is shown in Figure 4C.

esmGFP takes longer to mature than the known GFPs that we measured, but achieves a comparable brightness after two days.

The sentence is saying that a type of protein called "esmGFP" takes a longer time to fully develop compared to other types of GFPs that have been studied. However, after two days, esmGFP becomes just as bright as the other GFPs.

To validate that fluorescence was mediated by the intended Thr65 and Tyr66, we show that B8 and esmGFP variants where these residues were mutated to glycine lost fluorescence activity (Fig. S21).

The sentence is saying that they wanted to make sure that the fluorescence they observed was caused by two specific amino acids, Thr65 and Tyr66. To do this, they created two different versions of a protein called B8 and a fluorescent protein called esmGFP, where Thr65 and Tyr66 were replaced with glycine. They found that these modified versions of B8 and esmGFP did not show any fluorescence, which confirms that Thr65 and Tyr66 are necessary for fluorescence to occur. The results of this experiment are shown in a figure called Fig. S21.

Analysis of the excitation and emission spectra of esmGFP reveals that its peak excitation occurs at $496 \mathrm{~nm}$, which is shifted $7 \mathrm{~nm}$ relative to the $489 \mathrm{~nm}$ peak for EGFP, while both proteins emit at a peak of $512 \mathrm{~nm}$ (Fig. 4D).

This sentence is discussing the properties of two types of proteins, esmGFP and EGFP, and how they interact with light. The peak excitation of esmGFP occurs at a wavelength of 496 nanometers, which is slightly different from the peak excitation of EGFP at 489 nanometers. However, both proteins emit light at the same peak wavelength of 512 nanometers. This information is shown in a graph in Figure 4D.

The shapes of the spectra indicated a narrower full-width-half-maximum (FWHM) for the excitation spectrum of esmGFP ( $39 \mathrm{~nm}$ for esmGFP vs $56 \mathrm{~nm}$ for EGFP), whereas the FWHM of their emission spectra were highly comparable ( $35 \mathrm{~nm}$ and $39 \mathrm{~nm}$, respectively).

This sentence is describing the results of an experiment that compared two fluorescent proteins, esmGFP and EGFP. The researchers measured the full width at half maximum (FWHM) of each spectrum, which is the width of the spectral peak at half of its maximum height and describes how broad a range of wavelengths the peak covers. They found that the excitation spectrum (the wavelengths of light the protein absorbs) was narrower for esmGFP than for EGFP, at 39 nm versus 56 nm. The emission spectra (the wavelengths of light the protein gives off) had similar widths, at 35 nm and 39 nm. This suggests that esmGFP is excited by a narrower range of wavelengths, while both proteins emit over a similar range.
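The FWHM comparison can be illustrated with a short sketch. It assumes a single-peaked, baseline-subtracted spectrum sampled on an increasing wavelength grid, and the function name `fwhm` is hypothetical; the source does not say how the widths were actually computed. The spectrum used here is a synthetic Gaussian, not measured data.

```python
import numpy as np

def fwhm(wavelengths_nm, intensities):
    """Full width at half maximum of a single-peaked spectrum.

    Assumes the baseline is zero and the peak lies strictly inside the
    sampled range. Half-maximum crossings are linearly interpolated.
    """
    x = np.asarray(wavelengths_nm, dtype=float)
    y = np.asarray(intensities, dtype=float)
    half = y.max() / 2.0
    peak = int(np.argmax(y))

    # Walk left from the peak to the first sample below half maximum.
    i = peak
    while i > 0 and y[i] >= half:
        i -= 1
    # Walk right from the peak to the first sample below half maximum.
    j = peak
    while j < len(y) - 1 and y[j] >= half:
        j += 1
    if y[i] >= half or y[j] >= half:
        raise ValueError("spectrum does not fall below half maximum on both sides")

    # Interpolate the wavelength at which each side crosses half maximum.
    left = x[i] + (half - y[i]) * (x[i + 1] - x[i]) / (y[i + 1] - y[i])
    right = x[j - 1] + (half - y[j - 1]) * (x[j] - x[j - 1]) / (y[j] - y[j - 1])
    return right - left

# Synthetic Gaussian peak centred at 496 nm with a 39 nm FWHM (illustrative only).
sigma = 39.0 / (2.0 * np.sqrt(2.0 * np.log(2.0)))
grid = np.arange(400.0, 600.5, 0.5)
spectrum = np.exp(-((grid - 496.0) ** 2) / (2.0 * sigma ** 2))
width = fwhm(grid, spectrum)
```

Real spectra are often asymmetric or have shoulders, so a measured width depends on baseline handling and on how the half-maximum crossings are located.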

Overall esmGFP exhibits spectral properties consistent with known GFPs.

This sentence means that the protein called esmGFP has similar characteristics to other proteins in the same family, called GFPs. These characteristics are related to how the protein absorbs and emits light, which is important for its use in scientific research.

We next sought to understand how the sequence and structure of esmGFP compares to known proteins.

The sentence means that the researchers wanted to find out how the order and shape of a protein called esmGFP compares to other proteins that are already known.

A BLAST (51) search against the non-redundant protein sequences database and an MMseqs (52) search of ESM3's training set report the same top hit-tagRFP, which was also the nearest neighbor to B8-with $58 \%$ sequence identity, representing 96 mutations throughout the sequence.

This sentence describes a search for the known proteins most similar to esmGFP. The search was done in two ways: a BLAST search of the non-redundant protein sequences database and an MMseqs search of ESM3's training set. Both returned the same top hit, a protein called tagRFP, which was also the nearest neighbor to B8. esmGFP and tagRFP share 58% sequence identity, which corresponds to 96 differing positions (mutations) spread throughout the sequence. This shows that esmGFP is substantially different from its closest known relative.

tagRFP is a designed variant, and the closest wildtype sequence to esmGFP from the natural world is eqFP578, a red fluorescent protein, which differs from esmGFP by 107 sequence positions ( $53 \%$ identity).

This sentence points out that tagRFP, the top hit, is itself a variant designed by scientists rather than a natural protein. The closest naturally occurring (wildtype) protein to esmGFP is eqFP578, a red fluorescent protein. eqFP578 differs from esmGFP at 107 positions in its amino acid sequence, which means the two are only 53% identical.
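The identity percentages and difference counts quoted above can be related with a minimal sketch. It assumes two sequences that have already been aligned to the same length with `-` as the gap character, skips columns where both are gaps, and counts a gap against a residue as a difference. The function name `identity_stats` and the toy sequences are hypothetical; BLAST and MMseqs may use a different denominator.

```python
def identity_stats(aligned_a, aligned_b, gap="-"):
    """Return (matches, differences, percent_identity) for two aligned sequences.

    Assumption: identity = matches / compared columns, where a column is
    compared unless both sequences have a gap there.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue  # column carries no information for this pair
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    differences = compared - matches
    return matches, differences, 100.0 * matches / compared

# Toy aligned fragments, not the real esmGFP or tagRFP sequences.
matches, differences, pct = identity_stats("MSKGEELF", "MSKGAELF")
```

Under this definition, 96 differences at 58% identity would correspond to roughly 96 / 0.42 ≈ 229 compared positions.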

Sequence differences between esmGFP and tagRFP occur throughout the structure (Fig. 4E) with 22 mutations occurring in the protein's interior, which is known to be intensely sensitive to mutations due to chromophore proximity and a high density of interactions (46).

This sentence is discussing the differences between two types of proteins, esmGFP and tagRFP. The differences are found throughout the structure of the proteins, and there are 22 mutations that occur in the interior of the protein. This is important because the interior of the protein is very sensitive to mutations due to the presence of a chromophore (a molecule that gives the protein its color) and a high density of interactions between different parts of the protein. The sentence is saying that these mutations could have a significant impact on the function of the protein.

Examination of a sequence alignment of 648 natural and designed GFP-like fluorescent proteins revealed that esmGFP has the level of similarity to all other FPs that is typically found when comparing sequences across taxonomic orders, but within the same taxonomic class (Fig. 4F).

The sentence is discussing an alignment of 648 natural and designed fluorescent proteins (FPs) that was used to place esmGFP in context. When the researchers compared the sequence of esmGFP to the other FPs, the level of similarity was about what is typically seen between FPs from organisms in different taxonomic orders that still belong to the same taxonomic class. In other words, esmGFP is about as far from known FPs as natural FPs from different orders of the same class are from each other.

For example, the difference of esmGFP to other FPs is similar to the level of difference between FPs belonging to the orders of scleractinia (stony corals) and actiniaria (sea anemones), both of which belong to the larger class anthozoa of marine invertebrates (Fig. 4G).

The sentence is comparing the difference between a type of fluorescent protein called esmGFP and other fluorescent proteins to the level of difference between two types of marine invertebrates called scleractinia and actiniaria, which belong to the same larger group called anthozoa. The comparison is shown in a figure labeled 4G.

The closest FPs to esmGFP come from the anthozoa class (corals and anemones), average sequence identity $51.4 \%$, but esmGFP also shares some sequence identity with FPs from the hydrozoa (jellyfish) where the famous avGFP was discovered, average sequence identity $33.4 \%$ (Fig. S22).

This sentence is discussing how similar esmGFP is to fluorescent proteins (FPs) from different groups of marine organisms. The closest FPs to esmGFP come from corals and anemones (the class anthozoa), with an average amino acid sequence identity of 51.4%. esmGFP also shares some sequence identity with FPs from jellyfish (the hydrozoa), the group in which the famous avGFP was discovered, with an average sequence identity of 33.4%. The details are shown in Fig. S22.

We can draw insight from evolutionary biology on the amount of time it would take for a protein with similar sequence identity to arise through natural evolution.

This sentence is saying that we can use what we know about how living things change over time (evolutionary biology) to estimate how long it would take for a protein this different in sequence to arise through natural evolution.

In Fig. 4G we show esmGFP alongside three Anthozoan GFPs.

In Fig. 4G, we are presenting a comparison of a protein called esmGFP with three other proteins called GFPs that come from a group of animals called Anthozoans. This comparison is being shown in a figure or diagram labeled as Fig. 4G.

We use a recent time-calibrated phylogenetic analysis of the Anthozoans (53) that estimated the millions of years ago (MYA) to last common ancestors to estimate evolutionary time between each pair of these species.

This sentence is discussing a study that used a special type of analysis to estimate how long ago different species of Anthozoans (a type of marine animal) shared a common ancestor. The analysis was done using a method called "time-calibrated phylogenetic analysis," which involves looking at the genetic differences between different species and using that information to estimate how long ago they diverged from a common ancestor. The study estimated the time of divergence in "millions of years ago" (MYA) for each pair of species.

Using a larger dataset of six Anthozoan GFPs and species for which we have accurate MYA to last common ancestors and GFP sequence identities, we construct a simple estimator that correlates sequence identity between FPs to MY of evolutionary time between the species (Fig. $4 \mathrm{H}$ ) to calibrate against natural evolution.

This sentence describes how the researchers related sequence similarity to evolutionary time. They took six Anthozoan GFPs from species for which both the time to the last common ancestor (in MYA) and the GFP sequence identities are accurately known. From these they built a simple estimator that relates the sequence identity between two FPs to the millions of years (MY) of evolutionary time separating the species they come from. This gives a calibration against natural evolution: a given sequence identity can be translated into an approximate amount of evolutionary time. The estimator is shown in Figure 4H.
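One way such an estimator could look is a straight-line fit of evolutionary time against sequence identity. This is only a sketch under that assumption: the source says the estimator is "simple" but does not give its functional form, the function names are hypothetical, and the calibration points below are toy numbers, not the values from Fig. 4H.

```python
import numpy as np

def fit_identity_to_time(identities_pct, times_my):
    """Least-squares line: time_my ~ slope * identity_pct + intercept.

    Assumption: a linear relationship; the paper's estimator may differ.
    """
    x = np.asarray(identities_pct, dtype=float)
    y = np.asarray(times_my, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(slope), float(intercept)

def estimate_time_my(identity_pct, slope, intercept):
    """Evolutionary time (MY) predicted for a given percent identity."""
    return slope * identity_pct + intercept

# Toy calibration pairs (percent identity, MY between species).
# These are NOT values from the paper.
toy_identities = [90.0, 80.0, 70.0]
toy_times = [100.0, 200.0, 300.0]
slope, intercept = fit_identity_to_time(toy_identities, toy_times)
toy_estimate = estimate_time_my(60.0, slope, intercept)
```

Applying a fit like this to an identity below the calibrated range is an extrapolation, so the result is best read as an order-of-magnitude equivalence and not a precise date.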

Based on this analysis we estimate esmGFP represents an equivalent of over 500 million years of evolution from the closest protein that has been found in nature.

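The sentence means that the sequence difference between esmGFP and the closest protein found in nature is as large as the difference that natural evolution would be expected to produce over more than 500 million years. esmGFP itself was generated by the model and did not evolve; the figure expresses how far it is from known natural proteins, based on the calibration described above.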
