doi.bio/esm3/esm3.generating_a_new_fluorescent_protein.full10
==============================
We sought to understand if the base pre-trained ESM3 model has sufficient biological fidelity to generate functional proteins.
This sentence is about a research study that aimed to find out whether a computer model called ESM3, which was pre-trained on a large amount of data, captures biology accurately enough to generate proteins that actually work. The researchers wanted to know if the base model could be used to design new functional proteins.
We set out to create a functional green fluorescent protein (GFP) with low sequence similarity to existing ones.
We wanted to make a new type of green fluorescent protein (GFP) that works well but is different from the ones that already exist.
We chose the functionality of fluorescence because it is difficult to achieve, easy to measure, and one of the most beautiful mechanisms in nature.
This sentence is talking about a special type of protein called GFP, which is found in jellyfish and coral. GFP is unique because it can create bright colors without needing any other molecules to help it. This makes GFP very useful for scientists who want to study cells and other tiny things, because they can use GFP to make those things glow and be easier to see.
This sentence is about a special property of a molecule called GFP (green fluorescent protein). This property allows scientists to add GFP to the DNA of other living things, like bacteria or animals, to make them glow green under certain conditions. This glowing effect can be used to help scientists see and study different parts of cells or organisms. The sentence is saying that this ability to use GFP in this way has been very useful in many different areas of science.
The GFP family has been the subject of decades of protein engineering efforts, but still the vast majority of functional variants have come from prospecting the natural world.
Rational design and machine learning-assisted high-throughput screening have yielded GFP sequences with improved properties (such as higher brightness or stability, or differently colored variants) that incorporated small numbers of mutations (typically 5 to 15, out of the total 238 amino acid coding sequence) from the originating sequence.
This sentence is about how rational design and machine learning-assisted screening have been used to create new versions of a protein called GFP (green fluorescent protein). Researchers were able to make GFP brighter, more stable, or differently colored by making small changes to its sequence. These changes typically altered only 5 to 15 of the protein's 238 amino acids (the building blocks of proteins), so the new versions stay very close to the original GFP sequence.
Studies have shown that only a few random mutations reduce fluorescence to zero (44-46).
This sentence is saying that research has found that just a small number of changes that happen by chance can completely stop something from glowing. The numbers 44-46 are likely references to specific studies or sources that support this finding.
In rare cases, however, leveraging high-throughput experimentation, scientists have been able to introduce up to 40 to 50 mutations, i.e. a $20 \%$ difference in total sequence identity $(44,47,48)$, while retaining GFP fluorescence.
In rare cases, scientists have been able to make 40 to 50 changes to GFP while keeping it fluorescent. This is a substantial change, as it represents roughly a 20% difference in the protein's amino acid sequence. Reaching it required high-throughput experimentation, in which very large numbers of variants are tested.
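To make the scale of these numbers concrete, the arithmetic can be sketched as follows. This is a minimal illustration that assumes the percent difference is simply the number of substituted positions divided by the 238-residue length quoted above; the function name is hypothetical and not from the source.

```python
GFP_LENGTH = 238  # amino acids, as quoted in the text above


def percent_difference(n_mutations: int, length: int = GFP_LENGTH) -> float:
    """Percent of positions changed, assuming substitutions only (no insertions or deletions)."""
    if not 0 <= n_mutations <= length:
        raise ValueError("n_mutations must be between 0 and length")
    return 100.0 * n_mutations / length


# Typical engineered variants (5 to 15 mutations) versus the rare extreme cases (40 to 50).
for n in (5, 15, 40, 50):
    print(f"{n:>2} mutations -> {percent_difference(n):.1f}% of positions changed")
```

Under this assumption, 40 to 50 mutations correspond to roughly 17% to 21% of the sequence, which is where the quoted figure of about 20% comes from.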
Generating a new GFP would require materialization of the complex biochemistry and physics that underlie its fluorescence.
This sentence means that creating a new type of GFP (green fluorescent protein) would require a deep understanding of the complex chemical and physical processes that make it glow. In other words, it would be a difficult task that requires a lot of scientific knowledge.
This sentence is about a type of protein called GFPs (Green Fluorescent Proteins). It explains that in all GFPs, a special process happens where three important parts of the protein work together to create a part of the protein called the chromophore. This process is called autocatalytic because it happens automatically without needing any outside help.
The unique structure of GFP, a kinked central alpha helix surrounded by an eleven-stranded beta barrel with inward-facing coordinating residues, enables this reaction (49).
GFP stands for Green Fluorescent Protein, a protein that emits green light when it is exposed to certain types of light. The sentence describes the characteristic shape of GFP: a central alpha helix that is bent, or kinked, surrounded by a barrel made of eleven beta strands. Residues pointing into the barrel hold the key amino acids in position, and this arrangement is what allows the chromophore-forming reaction to take place.
The sentence is describing a process where a computer program called ESM3 is being used to design new fluorescent proteins. The program is being given information about the structure and sequence of certain parts of the protein that are important for its function. Based on this information, ESM3 is able to generate new design candidates for fluorescent proteins.
In this sentence, the researchers are describing their experiments to find a new type of fluorescent protein called esmGFP. They used E. coli bacteria to produce the protein and measured its fluorescence using a plate reader. They also included known fluorescent proteins as positive controls, along with negative controls, to compare the results. In the first experiment, they found a design with low sequence identity to known fluorescent proteins that appeared bright in the well labeled B8. In the second experiment, they continued from this design and found a bright design in the well labeled C10, which they named esmGFP.
(C) esmGFP exhibits fluorescence intensity similar to common GFPs. Normalized fluorescence is shown for a subset of proteins in experiment 2.
The sentence is saying that a protein called esmGFP has a level of brightness that is similar to other proteins called GFPs. The brightness is measured by how much light the protein gives off when it is exposed to a certain type of light. The sentence also mentions that the brightness of esmGFP was compared to other proteins in the second experiment, and the results are shown in a graph.
(D) Excitation and emission spectra for esmGFP overlaid on the spectra of EGFP.
This sentence is describing two types of spectra (which are like graphs that show the intensity of light at different wavelengths) for a type of protein called esmGFP. The sentence is saying that the spectra for esmGFP are being compared to the spectra for another protein called EGFP. The word "overlaid" means that the two spectra are being shown on top of each other, so that they can be easily compared.
(E) Two cutout views of the central alpha helix and the inside of the beta barrel of a predicted structure of esmGFP. The 96 mutations esmGFP has relative to its nearest neighbor, tagRFP, are shown in blue.
The sentence is describing a predicted structure of a protein called esmGFP. The views show two parts of the structure: the central alpha helix and the inside of the beta barrel. The sentence also mentions that esmGFP has 96 mutations compared to another protein called tagRFP, and these mutations are shown in blue. This information is important for understanding where the two proteins differ.
(F) Cumulative density of sequence identity between fluorescent proteins across taxa. esmGFP has the level of similarity to all other FPs that is typically found when comparing sequences across orders, but within the same class.
This sentence is describing the level of similarity between a specific fluorescent protein called esmGFP and other fluorescent proteins found in different organisms. The sentence is saying that the level of similarity between esmGFP and other fluorescent proteins is similar to what is usually found when comparing sequences of proteins across different groups of organisms, but within the same broad category of organisms.
(G) Evolutionary distance by time in millions of years (MY) and sequence identities for three example anthozoa GFPs and esmGFP.
(H) Estimator of evolutionary distance by time (MY) from GFP sequence identity.
This sentence is about a method used to estimate how long ago two species diverged from a common ancestor, based on the similarity of their versions of a specific protein called GFP (green fluorescent protein). The method involves comparing the GFP sequences of the two species and using the number of differences to calculate an estimate of the time since they diverged, measured in millions of years (MY). This estimate is called an "evolutionary distance" because it reflects the amount of genetic change that has occurred over time as the two species evolved independently.
We estimate esmGFP is over 500 million years of natural evolution removed from the closest known protein.
This sentence gives the researchers' estimate of how different esmGFP is from the closest known protein: the gap between them corresponds to more than 500 million years of natural evolution. esmGFP itself was generated by ESM3 and did not evolve, so the figure expresses sequence distance in units of evolutionary time.
In this sentence, "chromophore" refers to a molecule that is responsible for giving something its color. For example, the chromophore in a red apple is what makes it appear red.
When we say that the chromophore must "absorb light," we mean that it needs to be able to take in light energy. This is what happens when light shines on an object - the chromophores in the object absorb some of the light energy.
However, in order for the object to appear fluorescent, the chromophore also needs to be able to emit light. This means that after it absorbs light energy, it needs to release some of that energy back out in the form of light. This is what makes fluorescent objects glow under certain conditions.
So, in summary, the sentence is saying that in order for something to be fluorescent, the molecules responsible for its color (chromophores) need to be able to both absorb and emit light energy.
Light emission is highly sensitive to the local electronic environment of the chromophore.
This sentence means that the process of emitting light is greatly affected by the specific conditions of the environment surrounding the molecule responsible for the emission of light, known as the chromophore. This is important because it suggests that changes in the local environment of the chromophore can have a significant impact on the properties of the light that is emitted.
For these reasons, obtaining a new functional GFP would require precise configuration of both the active site and the surrounding long range tertiary interactions throughout the beta barrel.
We generate designs using a chain-of-thought procedure as follows.
The sentence "We generate designs using a chain-of-thought procedure as follows" means that the speaker or writer is describing a process they use to create designs. The process involves a "chain-of-thought" approach, which likely means that they start with one idea and then build on it with additional ideas until they have a complete design. The phrase "as follows" suggests that the speaker or writer is about to provide more details about the specific steps involved in this process.
The model first generates structure tokens, effectively creating a protein backbone.
Backbones that have sufficiently good atomic coordination of the active site but differentiated overall structure from the 1QY3 backbone pass through a filter to the next step of the chain.
This sentence is describing a process where a computer program is analyzing the structure of a protein. The "backbones" refer to the main chain of the protein, which is made up of amino acids. The "active site" is a specific region of the protein where chemical reactions take place. The program is checking to see if the backbones have a good arrangement of atoms in the active site, which is important for the protein to function properly. If the backbones pass this test, they move on to the next step in the analysis. The "filter" is a set of criteria that the backbones must meet in order to continue in the analysis. The "overall structure" refers to the shape of the protein, which can vary even if the active site is well-coordinated. The sentence is saying that the program is looking for backbones that have a good active site structure, but also have a different overall shape from a specific protein called 1QY3.
We add the generated structure to the original prompt to generate a sequence conditioned on the new prompt.
This sentence describes the next step of the procedure. The "generated structure" is the protein backbone that the model produced in the previous step. It is added to the original prompt, and the model then generates an amino acid sequence "conditioned on the new prompt", meaning that the sequence is chosen to fit both the original prompt and the newly generated backbone.
This sentence is describing a process where two things are being optimized - the sequence and the structure. The optimization is done in a specific way, where the optimization of one thing is done first, and then the optimization of the other thing is done, and this process is repeated multiple times. This is called an iterative joint optimization.
We reject chains-of-thought that lose atomic coordination of the active site (Appendix A.5.1).
This sentence says that the researchers discard any design attempt (any "chain of thought") in which, during the optimization, the atoms of the active site drift out of the arrangement needed for function. The active site is the specific part of the protein responsible for its activity, here the region that forms the chromophore. The sentence also refers to Appendix A.5.1, a section at the end of the paper that provides additional details.
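The steps described so far (generate a backbone, filter it, generate a sequence conditioned on it, then alternate optimization while checking the active site) can be sketched as a small control-flow skeleton. This is only an illustration of the order of operations: every function name and parameter below is hypothetical, the model calls are passed in as plain callables, and nothing here reflects the actual ESM3 API or the authors' filters.

```python
from typing import Callable, Optional, Tuple


def run_chain(
    prompt: dict,
    generate_structure: Callable[[dict], list],
    generate_sequence: Callable[[dict], str],
    refine: Callable[[str, list], Tuple[str, list]],
    site_ok: Callable[[list], bool],
    is_novel: Callable[[list], bool],
    n_rounds: int = 3,  # assumed value; the source does not give a round count
) -> Optional[Tuple[str, list]]:
    """Return (sequence, structure), or None if the chain is rejected."""
    # Step 1: generate a backbone from the prompt.
    structure = generate_structure(prompt)
    # Filter: keep backbones with a well-coordinated active site
    # that also differ from the template backbone.
    if not (site_ok(structure) and is_novel(structure)):
        return None
    # Step 2: add the generated structure to the prompt and generate a sequence.
    new_prompt = {**prompt, "structure": structure}
    sequence = generate_sequence(new_prompt)
    # Step 3: alternate sequence and structure optimization,
    # rejecting the chain if active-site coordination is lost.
    for _ in range(n_rounds):
        sequence, structure = refine(sequence, structure)
        if not site_ok(structure):
            return None
    return sequence, structure


# Toy stand-ins so the sketch runs; they do not model proteins in any way.
result = run_chain(
    prompt={"motif": "placeholder"},
    generate_structure=lambda p: [0.0, 1.0, 2.0],
    generate_sequence=lambda p: "ACDEFGHIKL",
    refine=lambda seq, st: (seq, [x + 0.1 for x in st]),
    site_ok=lambda st: len(st) == 3,
    is_novel=lambda st: True,
)
print(result)
```

Passing the model and the filters in as callables keeps the skeleton independent of any particular library, which is why it can run here with trivial placeholders.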
We then bucket the designs by sequence similarity to known fluorescent proteins and filter and rank designs using a variety of metrics (details in Appendix A.5.1.5).
This sentence is about a process of analyzing and organizing data related to fluorescent proteins. The process involves grouping similar designs together based on their sequence similarity to known fluorescent proteins. Then, the designs are filtered and ranked using various metrics, which are explained in more detail in a section called Appendix A.5.1.5. This process helps researchers to better understand and compare the different designs of fluorescent proteins.
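The bucketing and ranking step can be illustrated with a short sketch. It assumes each design has already been reduced to a percent identity to its nearest known fluorescent protein and a single score where higher is better; the actual metrics are in Appendix A.5.1.5 and are not reproduced here. The bucket width, function name, and example designs are all made up for illustration.

```python
from collections import defaultdict


def bucket_and_rank(designs, bucket_width=10):
    """Group designs into identity buckets, then sort each bucket by score, best first.

    designs: iterable of (name, percent_identity, score) tuples.
    """
    buckets = defaultdict(list)
    for name, identity, score in designs:
        low = int(identity // bucket_width) * bucket_width
        buckets[(low, low + bucket_width)].append((name, identity, score))
    return {
        key: sorted(members, key=lambda d: d[2], reverse=True)
        for key, members in sorted(buckets.items())
    }


# Placeholder designs: names, identities and scores are invented, not measured.
designs = [("d1", 36.0, 0.4), ("d2", 38.5, 0.9), ("d3", 72.0, 0.7)]
ranked = bucket_and_rank(designs)
for bucket, members in ranked.items():
    print(bucket, [name for name, _, _ in members])
```

Ranking within buckets rather than globally keeps low-identity designs in the candidate pool even when designs closer to known proteins score better.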
Each generated protein was synthesized, expressed in E. coli, and measured for fluorescence activity at an excitation wavelength of $485 \mathrm{~nm}$ (Fig. 4B left).
We measured brightness similar to positive controls from a number of designs that have higher sequence identity with naturally occurring GFPs.
In this sentence, the researchers are saying that they measured the brightness of their designs and compared it to positive controls. A positive control is a sample that is known to work well and is used as a reference. The designs that matched the controls were those with higher sequence identity to naturally occurring GFPs. GFP stands for green fluorescent protein, which is a protein that glows green under certain conditions. Sequence identity refers to how similar the amino acid sequence of a design is to that of the naturally occurring GFPs. Overall, the sentence says that several designs resembling natural GFPs were about as bright as the known reference proteins.
We also identify a design in well B8 (highlighted in a black circle) with only $36 \%$ sequence identity to the 1QY3 sequence and $57 \%$ sequence identity to the nearest existing fluorescent protein, tagRFP.
This design was 50x less bright than natural GFPs and its chromophore matured over the course of a week, instead of in under a day, but it presents a signal of function in a new portion of sequence space that to our knowledge has not been found in nature or through protein engineering.
This sentence describes the design found in well B8, which is only distantly related to known fluorescent proteins. It was about 50 times dimmer than natural GFPs and its chromophore took a week to mature instead of less than a day, but it does fluoresce. That matters because it shows a working protein in a region of sequence space that, as far as the researchers know, has not been reached by nature or by protein engineering.
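Percent sequence identity, the measure used throughout this section, can be computed from a pairwise alignment. The sketch below is a minimal version that assumes the two sequences are already aligned to equal length and that gap columns are skipped; several conventions exist for the denominator, and the source does not say which one was used. The function name and example strings are illustrative only.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of aligned, non-gap columns in which both sequences have the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gap columns (one of several possible conventions)
        compared += 1
        matches += a == b
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# Toy aligned fragments, not real fluorescent protein sequences.
print(f"{percent_identity('MKT-AYG', 'MRTLAYG'):.1f}% identity")
```

In the toy example, six columns are compared and five match, so a single substitution in a short fragment already lowers identity noticeably; over a full-length protein each substitution changes the figure by well under one percentage point.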
We continue the chain of thought starting from the sequence of the design in well B8 to generate a protein with improved brightness, using the same iterative joint optimization and ranking procedure as above.
This sentence is about improving the brightness of the protein. Starting from the sequence of the design in well B8, the researchers continue the chain of thought, applying the same iterative joint optimization and ranking procedure as before, to generate a brighter protein.
We create a second 96-well plate of designs, and using the same plate reader assay we find that several designs in this cohort have a brightness in the range of GFPs found in nature.
We made a new plate with 96 different designs and tested them using a special machine. We found that some of the designs were as bright as GFPs that occur naturally.
The best design, located in well C10 of the second plate (Fig. 4B right), we designate esmGFP.
The sentence is describing a design that was found in a specific location (well C10) on a plate (the second plate). The design is considered the best and has been given the name "esmGFP". The sentence is likely referring to a scientific experiment or study where different designs were tested and evaluated.
We find esmGFP exhibits brightness in the distribution of natural GFPs.
The sentence is saying that a protein called esmGFP has a level of brightness that is similar to the brightness of other naturally occurring proteins called GFPs.
We evaluated the fluorescence intensity at 0, 2, and 7 days of chromophore maturation, and plot these measurements for esmGFP, a replicate of B8, a chromophore knockout of B8, along with three natural GFPs: avGFP, cgreGFP, and ppluGFP (Fig. 4C).
We conducted an experiment to measure the brightness of a certain type of protein called GFP (green fluorescent protein) at different stages of its development. We did this by measuring the intensity of the fluorescence emitted by the protein at three different time points: 0 days, 2 days, and 7 days. We then plotted the results of our measurements on a graph, which shows how the brightness of the protein changes over time. We compared the results of our experiment with a few other types of GFP, including avGFP, cgreGFP, and ppluGFP. The graph of our results is shown in Figure 4C.
esmGFP takes longer to mature than the known GFPs that we measured, but achieves a comparable brightness after two days.
The sentence is saying that a type of protein called "esmGFP" takes a longer time to fully develop compared to other types of GFPs that have been studied. However, after two days, esmGFP becomes just as bright as the other GFPs.
Analysis of the excitation and emission spectra of esmGFP reveals that its peak excitation occurs at $496 \mathrm{~nm}$, which is shifted $7 \mathrm{~nm}$ relative to the $489 \mathrm{~nm}$ peak for EGFP, while both proteins emit at a peak of $512 \mathrm{~nm}$ (Fig. 4D).
This sentence is discussing the properties of two types of proteins, esmGFP and EGFP, and how they interact with light. The peak excitation of esmGFP occurs at a wavelength of 496 nanometers, which is slightly different from the peak excitation of EGFP at 489 nanometers. However, both proteins emit light at the same peak wavelength of 512 nanometers. This information is shown in a graph in Figure 4D.
The shapes of the spectra indicated a narrower full-width-half-maximum (FWHM) for the excitation spectrum of esmGFP ($39 \mathrm{~nm}$ for esmGFP vs $56 \mathrm{~nm}$ for EGFP), whereas the FWHM of their emission spectra were highly comparable ($35 \mathrm{~nm}$ and $39 \mathrm{~nm}$, respectively).
This sentence is describing the results of an experiment that compared two types of fluorescent proteins, esmGFP and EGFP. The researchers measured the width of the spectra, which is a way of describing the range of colors that the proteins emit. They found that the width of the excitation spectrum (which measures the colors of light that the proteins can absorb) was narrower for esmGFP than for EGFP. However, the width of the emission spectrum (which measures the colors of light that the proteins emit) was similar for both proteins. This suggests that esmGFP may be more selective in the colors of light that it can absorb, but both proteins emit a similar range of colors.
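The FWHM of a spectrum is the width of the peak measured at half of its maximum height. The sketch below shows one simple way to compute it from sampled data, using linear interpolation between samples and assuming a single peak. The spectrum here is a synthetic Gaussian with made-up parameters, not the measured esmGFP or EGFP data, and the function name is illustrative.

```python
import math


def fwhm(xs, ys):
    """Full width at half maximum of a single-peaked curve, by linear interpolation."""
    peak = max(ys)
    half = peak / 2.0
    i_peak = ys.index(peak)

    def crossing(i, j):
        # x position where the curve crosses `half` between samples i and j.
        x0, y0, x1, y1 = xs[i], ys[i], xs[j], ys[j]
        return x0 + (half - y0) * (x1 - x0) / (y1 - y0)

    # Walk left from the peak until the curve drops below half maximum.
    i = i_peak
    while i > 0 and ys[i] >= half:
        i -= 1
    # Walk right from the peak in the same way.
    j = i_peak
    while j < len(ys) - 1 and ys[j] >= half:
        j += 1
    if ys[i] >= half or ys[j] >= half:
        raise ValueError("curve does not fall below half maximum on both sides")
    return crossing(j - 1, j) - crossing(i, i + 1)


# Synthetic Gaussian "spectrum": center and sigma are arbitrary placeholders.
center, sigma = 500.0, 20.0
wavelengths = [float(w) for w in range(400, 601)]
intensities = [math.exp(-((w - center) ** 2) / (2 * sigma**2)) for w in wavelengths]
print(f"FWHM = {fwhm(wavelengths, intensities):.1f} nm")
```

For a Gaussian the result can be checked against the closed form $2\sqrt{2 \ln 2}\,\sigma \approx 2.355\,\sigma$, which makes a synthetic peak a convenient test case before applying the function to real spectra.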
Overall esmGFP exhibits spectral properties consistent with known GFPs.
We next sought to understand how the sequence and structure of esmGFP compare to known proteins.
The sentence means that the researchers wanted to find out how the order and shape of a protein called esmGFP compares to other proteins that are already known.
A BLAST (51) search against the non-redundant protein sequences database and an MMseqs (52) search of ESM3's training set report the same top hit, tagRFP (which was also the nearest neighbor to B8), with $58 \%$ sequence identity, representing 96 mutations throughout the sequence.
This sentence describes a search for known proteins that resemble esmGFP. The search was done with two different tools: BLAST, run against a large public database of protein sequences, and MMseqs, run against the data ESM3 was trained on. Both found the same top result, a protein called tagRFP, which was also the closest match to the earlier design B8. esmGFP and tagRFP share 58% sequence identity, which leaves 96 positions where their amino acids differ. This shows that esmGFP is far from any known protein, including those the model saw during training.
tagRFP is a designed variant, and the closest wildtype sequence to esmGFP from the natural world is eqFP578, a red fluorescent protein, which differs from esmGFP by 107 sequence positions ( $53 \%$ identity).
This sentence points out that tagRFP, the closest match, is itself a variant designed by scientists rather than a protein found in nature. The closest naturally occurring protein to esmGFP is eqFP578, a red fluorescent protein. eqFP578 differs from esmGFP at 107 positions in its amino acid sequence, so the two are only 53% identical.
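The difference counts and identity percentages quoted for tagRFP and eqFP578 can be cross-checked against each other. The sketch below assumes identity = 1 - differences / aligned length, which is one common convention and not necessarily the one the authors used; the function name is illustrative.

```python
def implied_aligned_length(n_differences: int, percent_identity: float) -> float:
    """Aligned length implied by a difference count and a percent identity."""
    if not 0 <= percent_identity < 100:
        raise ValueError("percent_identity must be in [0, 100)")
    return n_differences / (1.0 - percent_identity / 100.0)


# Figures quoted in the text: (nearest protein, differing positions, percent identity).
for name, diffs, ident in (("tagRFP", 96, 58.0), ("eqFP578", 107, 53.0)):
    length = implied_aligned_length(diffs, ident)
    print(f"{name}: about {length:.0f} aligned positions")
```

Under this assumption both pairs imply an alignment of roughly 228 to 229 positions, so the two sets of figures are consistent with each other. Because the quoted percentages are rounded, the implied lengths are approximate.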
Sequence differences between esmGFP and tagRFP occur throughout the structure (Fig. 4E) with 22 mutations occurring in the protein's interior, which is known to be intensely sensitive to mutations due to chromophore proximity and a high density of interactions (46).
This sentence is discussing the differences between two types of proteins, esmGFP and tagRFP. The differences are found throughout the structure of the proteins, and there are 22 mutations that occur in the interior of the protein. This is important because the interior of the protein is very sensitive to mutations due to the presence of a chromophore (a molecule that gives the protein its color) and a high density of interactions between different parts of the protein. The sentence is saying that these mutations could have a significant impact on the function of the protein.
Examination of a sequence alignment of 648 natural and designed GFP-like fluorescent proteins revealed that esmGFP has the level of similarity to all other FPs that is typically found when comparing sequences across orders, but within the same class (Fig. 4F).
The sentence is discussing a comparison across a group of 648 fluorescent proteins. When the researchers compared the sequence of esmGFP to the sequences of the other fluorescent proteins, its level of similarity matched what is usually seen between proteins from organisms in different taxonomic orders that still belong to the same taxonomic class. In other words, esmGFP is about as different from known fluorescent proteins as the proteins of fairly distantly related organisms are from each other.
For example, the difference of esmGFP to other FPs is similar to the level of difference between FPs belonging to the orders of scleractinia (stony corals) and actiniaria (sea anemones), both of which belong to the larger class anthozoa of marine invertebrates (Fig. 4G).
The sentence compares the difference between esmGFP and other fluorescent proteins to the level of difference between fluorescent proteins from two orders of marine invertebrates, scleractinia (stony corals) and actiniaria (sea anemones), which belong to the same larger class called anthozoa. The comparison is shown in Figure 4G.
The closest FPs to esmGFP come from the anthozoa class (corals and anemones; average sequence identity $51.4 \%$), but esmGFP also shares some sequence identity with FPs from the hydrozoa (jellyfish), where the famous avGFP was discovered (average sequence identity $33.4 \%$) (Fig. S22).
We can draw insight from evolutionary biology on the amount of time it would take for a protein with similar sequence identity to arise through natural evolution.
In Fig. 4G we show esmGFP alongside three Anthozoan GFPs.
We use a recent time-calibrated phylogenetic analysis of the Anthozoans (53) that estimated the millions of years ago (MYA) to last common ancestors to estimate evolutionary time between each pair of these species.
This sentence is discussing a study that used a special type of analysis to estimate how long ago different species of Anthozoans (a type of marine animal) shared a common ancestor. The analysis was done using a method called "time-calibrated phylogenetic analysis," which involves looking at the genetic differences between different species and using that information to estimate how long ago they diverged from a common ancestor. The study estimated the time of divergence in "millions of years ago" (MYA) for each pair of species.
Using a larger dataset of six Anthozoan GFPs and species for which we have accurate MYA to last common ancestors and GFP sequence identities, we construct a simple estimator that correlates sequence identity between FPs to MY of evolutionary time between the species (Fig. 4H) to calibrate against natural evolution.
This sentence is discussing a method for estimating the amount of time that has passed since two species of Anthozoan GFPs (green fluorescent proteins) shared a common ancestor. The researchers used a dataset of six Anthozoan GFPs and information about the species' last common ancestors and GFP sequence identities to create a simple estimator. This estimator relates the sequence identity between the GFPs to the amount of time that has passed since the species diverged from each other. The researchers used this estimator to calibrate against natural evolution, which means they compared their results to what is known about the evolutionary history of these species. The results of this study are shown in Figure 4H.
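The idea of calibrating sequence identity against evolutionary time can be illustrated with an ordinary least-squares line. This is a sketch only: the source describes a "simple estimator" without giving its functional form, so the linear fit is an assumption, and the calibration points below are invented placeholders chosen to lie on a straight line, not values from the paper or from Fig. 4H.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Placeholder calibration points (percent identity, MY apart): NOT data from the paper.
identities = [90.0, 80.0, 70.0, 60.0]
my_apart = [100.0, 200.0, 300.0, 400.0]
slope, intercept = fit_line(identities, my_apart)


def estimate_my(percent_identity: float) -> float:
    """Evolutionary time (MY) predicted by the fitted line for a given identity."""
    return slope * percent_identity + intercept


print(f"slope = {slope:.1f} MY per identity point, intercept = {intercept:.1f} MY")
```

With real calibration pairs in place of the placeholders, `estimate_my` would map a measured sequence identity to an equivalent evolutionary time. Any estimate for identities below the calibration range is an extrapolation and should be read with that in mind.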
Based on this analysis we estimate esmGFP represents an equivalent of over 500 million years of evolution from the closest protein that has been found in nature.
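The sentence means that esmGFP is so different from the closest protein found in nature that, by the researchers' estimate, natural evolution would need more than 500 million years to produce a comparable amount of change. esmGFP itself was generated by the model, so the figure is a way of expressing sequence distance in units of evolutionary time.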
sness@sness.net