esm.doi.bio/esm3/esm3.generating_a_new_fluorescent_protein.full12
==============================
We sought to understand if the base pre-trained ESM3 model has sufficient biological fidelity to generate functional proteins.
- The goal was to determine if the base pre-trained ESM3 model has enough biological accuracy to create functional proteins.
- ESM3, a language model, was used to generate protein sequences.
- The generated proteins were tested experimentally for function.
We set out to create a functional green fluorescent protein (GFP) with low sequence similarity to existing ones.
- The goal was to create a new green fluorescent protein (GFP) with low sequence similarity to existing ones.
- The designs were generated by prompting ESM3, not by rational design or directed evolution.
- The resulting protein, designated esmGFP, has 58% sequence identity to the nearest known fluorescent protein, tagRFP.
- esmGFP reaches a brightness comparable to that of natural GFPs.
We chose the functionality of fluorescence because it is difficult to achieve, easy to measure, and one of the most beautiful mechanisms in nature.
- Fluorescence is a difficult mechanism to achieve.
- Fluorescence is easy to measure.
- Fluorescence is one of the most beautiful mechanisms in nature.
- Proteins in the GFP family are responsible for the fluorescence of jellyfish and the vivid colors of coral.
- The GFP sequence can be inserted into the genomes of other organisms.
- This allows for the visible labeling of molecules, cellular structures, or processes.
- The GFP sequence has been broadly applied across the biosciences.
The GFP family has been the subject of decades of protein engineering efforts, but still the vast majority of functional variants have come from prospecting the natural world.
- The GFP family has been extensively studied through protein engineering.
- Most functional variants of GFP have been discovered in nature.
Rational design and machine learning-assisted high-throughput screening have yielded GFP sequences with improved properties (such as higher brightness or stability, or differently colored variants) that incorporated small numbers of mutations (typically 5 to 15, out of the total 238 amino acid coding sequence) from the originating sequence.
- Rational design and machine learning-assisted high-throughput screening have been used to improve GFP sequences.
- These improvements include higher brightness, stability, and differently colored variants.
- The improved sequences typically incorporate small numbers of mutations (5 to 15) out of the total 238 amino acid coding sequence.
Studies have shown that only a few random mutations reduce fluorescence to zero (44-46).
- Studies have shown that only a few random mutations can reduce fluorescence to zero.
- References: 44-46.
In rare cases, however, leveraging high-throughput experimentation, scientists have been able to introduce up to $40-50$ mutations, i.e. a $20 \%$ difference in total sequence identity $(44,47,48)$, while retaining GFP fluorescence.
- Scientists have been able to introduce up to 40-50 mutations in GFP while retaining fluorescence.
- This has been achieved through high throughput experimentation.
- Retaining fluorescence with a 20% difference in total sequence identity is rare.
- References: 44, 47, 48.
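The mutation counts above can be related to the 238-residue coding sequence with simple arithmetic. The sketch below is illustrative only: the helper name `percent_changed` is hypothetical, and it assumes every mutation is a single-residue substitution with no insertions or deletions.

```python
GFP_LENGTH = 238  # residues in the GFP coding sequence, as stated in the text


def percent_changed(num_mutations: int, length: int = GFP_LENGTH) -> float:
    """Return the percentage of sequence positions altered by substitutions.

    Assumes each mutation is a substitution at a distinct position.
    """
    if not 0 <= num_mutations <= length:
        raise ValueError("num_mutations must be between 0 and length")
    return 100.0 * num_mutations / length


if __name__ == "__main__":
    # Typical engineered variants (5 to 15) and the rare extreme cases (40 to 50).
    for n in (5, 15, 40, 50):
        print(f"{n:>2} mutations -> {percent_changed(n):.1f}% of positions changed")
```

Under these assumptions, 40 to 50 substitutions correspond to roughly 17% to 21% of positions, consistent with the 20% figure quoted above.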
Generating a new GFP would require materialization of the complex biochemistry and physics that underlie its fluorescence.
- Generating a new GFP would require materialization of the complex biochemistry and physics that underlie its fluorescence.
- GFP is a protein composed of 238 amino acids (26.9 kDa) that exhibits bright green fluorescence when exposed to light in the blue to ultraviolet range.
- The GFP gene can be inserted into other organisms, causing them to produce the protein and glow under UV light.
- GFP has become an important tool in biotechnology, allowing researchers to track the expression of genes and the location of proteins within cells.
- The discovery of GFP and its applications in biotechnology led to the awarding of the Nobel Prize in Chemistry in 2008 to Osamu Shimomura, Martin Chalfie, and Roger Tsien.
- GFP is not the only fluorescent protein found in nature, and other colors such as blue, cyan, yellow, and red have been discovered and utilized in research.
- The use of fluorescent proteins has revolutionized the field of microscopy, allowing for the visualization of previously unseen structures and processes within cells and organisms.
- GFPs have a unique autocatalytic process that forms the chromophore from three key amino acids in the core of the protein.
- This process is responsible for the green fluorescence emitted by GFPs.
- The chromophore is formed through a series of chemical reactions involving the amino acids Ser65, Tyr66, and Gly67.
- The autocatalytic process begins with cyclization of the peptide backbone across these residues, which forms an intermediate.
- This intermediate is then dehydrated and oxidized by molecular oxygen to give the mature chromophore.
- The formation of the chromophore is essential for the fluorescence of GFPs.
The unique structure of GFP is a kinked central alpha helix surrounded by an eleven-stranded beta barrel.
- GFP has a unique structure consisting of a kinked central alpha helix surrounded by an eleven-stranded beta barrel.
- GFP is a protein composed of 238 amino acids.
- GFP emits bright green light when exposed to ultraviolet or blue light.
- GFP was first isolated from the jellyfish Aequorea victoria in 1962.
- GFP has become an important tool in molecular biology and biotechnology due to its ability to fluoresce.
- GFP can be used as a marker to track gene expression and protein localization in living cells.
- GFP has been genetically modified to produce different colors, including blue, cyan, yellow, and red.
- GFP has been used to study various biological processes, such as protein trafficking, cell division, and signal transduction.
- GFP has also been used in medical research, such as tracking cancer cells and monitoring gene therapy.
- The discovery and development of GFP was recognized with the Nobel Prize in Chemistry in 2008.
- A new fluorescent protein is generated with a chain of thought.
- ESM3 is prompted with the sequence and structure of the residues required for forming and catalyzing the chromophore reaction.
- ESM3 is also prompted with the structure of part of the central alpha helix from a natural fluorescent protein.
- From this prompt, ESM3 generates design candidates through a chain of thought.
- ESM3 found a bright GFP distant from other known GFPs in two experiments.
- Fluorescence was measured in E. coli lysate.
- Positive controls of known GFPs are marked with purple circles; negative controls with no GFP sequence or no E. coli are marked with red circles.
- In the first experiment, a notable design with low sequence identity to known fluorescent proteins appears in the well labeled B8.
- In the second experiment, a bright design appears in the well labeled C10, which is designated esmGFP.
(C) esmGFP exhibits fluorescence intensity similar to common GFPs. Normalized fluorescence is shown for a subset of proteins in experiment 2.
- esmGFP exhibits fluorescence intensity similar to common GFPs.
- Normalized fluorescence is shown for a subset of proteins in experiment 2.
(D) Excitation and emission spectra for esmGFP overlaid on the spectra of EGFP.
- The excitation and emission spectra for esmGFP were overlaid on the spectra of EGFP.
(E) Two cutout views of the central alpha helix and the inside of the beta barrel of a predicted structure of esmGFP. The 96 mutations esmGFP has relative to its nearest neighbor, tagRFP, are shown in blue.
- There is a predicted structure of esmGFP.
- The structure includes a central alpha helix and a beta barrel.
- There are 96 mutations in esmGFP compared to its nearest neighbor, tagRFP.
- The mutations are shown in blue.
(F) Cumulative density of sequence identity between fluorescent proteins across taxa. esmGFP has the level of similarity to all other FPs that is typically found when comparing sequences across orders, but within the same class.
- Cumulative density of sequence identity between fluorescent proteins across taxa.
- esmGFP has the level of similarity to all other FPs that is typically found when comparing sequences across orders, but within the same class.
(G) Evolutionary distance by time in millions of years (MY) and sequence identities for three example anthozoa GFPs and esmGFP.
- Evolutionary distance by time in millions of years (MY) and sequence identities for three example anthozoa GFPs and esmGFP.
(H) Estimator of evolutionary distance by time (MY) from GFP sequence identity.
- Estimator of evolutionary distance by time (MY) from GFP sequence identity.
We estimate esmGFP is over 500 million years of natural evolution removed from the closest known protein.
The inward-facing coordinating residues of the beta barrel enable the chromophore-forming reaction (49).
- esmGFP is estimated to be over 500 million years of natural evolution removed from the closest known protein.
- The inward-facing coordinating residues of the beta barrel enable the chromophore reaction.
- The chromophore must absorb light to be fluorescent.
- The chromophore must also emit light to be fluorescent.
Light emission is highly sensitive to the local electronic environment of the chromophore.
- Light emission is highly sensitive to the local electronic environment of the chromophore.
For these reasons, obtaining a new functional GFP would require precise configuration of both the active site and the surrounding long range tertiary interactions throughout the beta barrel.
- Obtaining a new functional GFP requires precise configuration of both the active site and the surrounding long range tertiary interactions throughout the beta barrel.
- The beta barrel structure of GFP is important for its function.
- GFP is a protein commonly used as a fluorescent marker in biological research.
- The active site of GFP is responsible for its fluorescence.
- Long range tertiary interactions in GFP are important for maintaining its structure and function.
- A new method for generating GFP sequences involves directly prompting a pretrained ESM3 model to generate a 229 residue protein based on critical residues for chromophore reaction.
- The critical residues for chromophore reaction are Thr62, Thr65, Tyr66, Gly67, Arg96, and Glu222.
- The method involves using a base pretrained 7B parameter ESM3 model.
- The goal is to generate new GFP sequences.
- The prompt also includes the structure of residues 58 through 71 from the experimental structure 1QY3.
- These residues are known to be structurally important for the energetic favorability of chromophore formation.
- Sequence tokens, structure tokens, and atomic coordinates of the backbone are provided as input.
- Generation begins from a nearly completely masked array of tokens corresponding to 229 residues.
- The token positions used for conditioning are not masked.
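The prompt described above can be pictured as a mostly masked array with a few conditioned positions. The following is a minimal sketch, not ESM3's actual input format: the `MASK` placeholder, the `build_prompt` helper, the one-letter residue codes, and the mapping of residue number to array index (`position - 1`) are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

MASK: Optional[str] = None   # hypothetical stand-in for a masked token
PROTEIN_LENGTH = 229         # length of the generated protein, from the text

# Critical residues for the chromophore reaction, keyed by residue number.
CRITICAL_RESIDUES = {62: "T", 65: "T", 66: "Y", 67: "G", 96: "R", 222: "E"}
# Residues 58 through 71 (inclusive) whose structure is taken from 1QY3.
HELIX_STRUCTURE_RANGE = range(58, 72)


def build_prompt(length: int = PROTEIN_LENGTH) -> Tuple[List[Optional[str]], List[bool]]:
    """Return (sequence_tokens, structure_conditioned) for a masked prompt.

    sequence_tokens holds a residue letter where the sequence is given and
    MASK elsewhere; structure_conditioned marks positions whose structure
    (tokens and backbone coordinates) would be supplied.
    """
    sequence: List[Optional[str]] = [MASK] * length
    structure_conditioned = [False] * length
    for position, residue in CRITICAL_RESIDUES.items():
        sequence[position - 1] = residue           # assumed index mapping
        structure_conditioned[position - 1] = True
    for position in HELIX_STRUCTURE_RANGE:
        structure_conditioned[position - 1] = True
    return sequence, structure_conditioned
```

In this sketch only 6 of the 229 sequence positions are unmasked, which is what "nearly completely masked" amounts to.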
We generate designs using a chain-of-thought procedure as follows.
- The process of generating designs involves a chain-of-thought procedure.
The model first generates structure tokens, effectively creating a protein backbone.
- The model first generates structure tokens, effectively creating a protein backbone.
Backbones that have sufficiently good atomic coordination of the active site but differentiated overall structure from the 1QY3 backbone pass through a filter to the next step of the chain.
- Backbones with good atomic coordination of the active site and differentiated overall structure from the 1QY3 backbone pass through a filter to the next step of the chain.
We add the generated structure to the original prompt to generate a sequence conditioned on the new prompt.
- Iterative joint optimization is performed by alternating between optimizing the sequence and the structure.
We reject chains of thought that lose atomic coordination of the active site (Appendix A.5.1).
- Chains of thought that lose atomic coordination of the active site are rejected.
- The computational pool consists of thousands of candidate GFP designs.
- The designs are drawn from intermediate and final points in the iterative joint optimization stage of the generation protocol.
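The chain-of-thought procedure described above can be summarized as control flow. This is a sketch under stated assumptions rather than the authors' implementation: every callable passed in (`generate_structure`, `passes_backbone_filter`, `generate_sequence`, `optimize_step`, `keeps_active_site`) is a hypothetical stand-in for a model call or filter detailed in Appendix A.5.1, and the number of optimization rounds is arbitrary.

```python
from typing import Any, Callable, List, Tuple

Design = Tuple[Any, Any]  # (sequence, structure); concrete types are left open


def chain_of_thought(
    prompt: Any,
    generate_structure: Callable[[Any], Any],
    passes_backbone_filter: Callable[[Any], bool],
    generate_sequence: Callable[[Any, Any], Any],
    optimize_step: Callable[[Design], Design],
    keeps_active_site: Callable[[Design], bool],
    n_rounds: int = 3,
) -> List[Design]:
    """Run one chain and return the candidate designs it contributes."""
    # Step 1: generate structure tokens, i.e. a protein backbone.
    structure = generate_structure(prompt)
    # Step 2: only backbones that pass the filter continue down the chain.
    if not passes_backbone_filter(structure):
        return []
    # Step 3: generate a sequence conditioned on the prompt plus the structure.
    design: Design = (generate_sequence(prompt, structure), structure)
    # Step 4: iterative joint optimization of sequence and structure.
    candidates: List[Design] = []
    for _ in range(n_rounds):
        design = optimize_step(design)
        # A chain that loses active-site coordination is rejected entirely.
        if not keeps_active_site(design):
            return []
        # Intermediate and final points both enter the candidate pool.
        candidates.append(design)
    return candidates
```

Passing the steps in as callables keeps the sketch independent of any particular model interface; running many such chains and concatenating their outputs would give the pool of candidates.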
We then bucket the designs by sequence similarity to known fluorescent proteins and filter and rank designs using a variety of metrics (details in Appendix A.5.1.5).
- Designs are then bucketed by sequence similarity to known fluorescent proteins.
- Filtering and ranking of designs is done using various metrics, which are detailed in Appendix A.5.1.5.
- A first experiment was conducted with 88 designs on a 96 well plate.
- The top generations in each sequence similarity bucket were considered.
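Bucketing by sequence identity and taking the top designs per bucket can be sketched as follows. The field names `identity_pct` and `score`, the 10-point bucket width, and the single combined score are all assumptions for illustration; the actual filtering and ranking metrics are those in Appendix A.5.1.5.

```python
from collections import defaultdict
from typing import Dict, List


def bucket_and_rank(
    designs: List[dict], bucket_width: int = 10, top_k: int = 1
) -> Dict[int, List[dict]]:
    """Group designs by sequence identity and keep the best of each bucket.

    Each design is a dict with 'identity_pct' (percent identity to the
    nearest known fluorescent protein) and 'score' (higher is better).
    Keys of the result are the lower bound of each identity bucket.
    """
    buckets: Dict[int, List[dict]] = defaultdict(list)
    for design in designs:
        lower_bound = int(design["identity_pct"] // bucket_width) * bucket_width
        buckets[lower_bound].append(design)
    return {
        lower_bound: sorted(members, key=lambda d: d["score"], reverse=True)[:top_k]
        for lower_bound, members in sorted(buckets.items())
    }
```

Selecting per bucket, rather than globally, keeps low-identity designs in the experiment even when higher-identity designs score better.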
Each generated protein was synthesized, expressed in E. coli, and measured for fluorescence activity at an excitation wavelength of $485 \mathrm{~nm}$ (Fig. 4B left).
- Generated proteins were synthesized
- Expressed in E. coli
- Measured for fluorescence activity
- Excitation wavelength of $485 \mathrm{~nm}$
- Results shown in Fig. 4B left
We measured brightness similar to positive controls from a number of designs that have higher sequence identity with naturally occurring GFPs.
- A number of designs with higher sequence identity to naturally occurring GFPs showed brightness similar to the positive controls.
We also identify a design in well B8 (highlighted in a black circle) with only $36 \%$ sequence identity to the 1QY3 sequence and $57 \%$ sequence identity to the nearest existing fluorescent protein, tagRFP.
- There is a design in well B8 with low sequence identity to the 1QY3 sequence and the nearest existing fluorescent protein, tagRFP.
- The design in well B8 has 36% sequence identity to the 1QY3 sequence.
- The design in well B8 has 57% sequence identity to the nearest existing fluorescent protein, tagRFP.
This design was 50x less bright than natural GFPs and its chromophore matured over the course of a week, instead of in under a day, but it presents a signal of function in a new portion of sequence space that to our knowledge has not been found in nature or through protein engineering.
- The design was 50x less bright than natural GFPs.
- The chromophore matured over the course of a week instead of in under a day.
- It presents a signal of function in a new portion of sequence space that has not been found in nature or through protein engineering.
We continue the chain of thought starting from the sequence of the design in well B8 to generate a protein with improved brightness, using the same iterative joint optimization and ranking procedure as above.
- The design in well B8 was used as the starting point to generate a protein with improved brightness.
- The same iterative joint optimization and ranking procedure was used as in the previous step.
We create a second 96 well plate of designs, and using the same plate reader assay we find that several designs in this cohort have a brightness in the range of GFPs found in nature.
- A second 96 well plate of designs was created.
- The same plate reader assay was used.
- Several designs in the new cohort had a brightness in the range of GFPs found in nature.
The best design, located in well C10 of the second plate (Fig. 4B right), we designate esmGFP.
- The best design is located in well C10 of the second plate.
- The design is designated as esmGFP.
We find esmGFP exhibits brightness in the distribution of natural GFPs.
- esmGFP exhibits brightness in the distribution of natural GFPs.
We evaluated the fluorescence intensity at 0, 2, and 7 days of chromophore maturation, and plot these measurements for esmGFP, a replicate of B8, a chromophore knockout of B8, along with three natural GFPs: avGFP, cgreGFP, and ppluGFP (Fig. 4C).
- Fluorescence intensity was evaluated at 0, 2, and 7 days of chromophore maturation.
- Measurements were plotted for esmGFP, a replicate of B8, a chromophore knockout of B8, along with three natural GFPs avGFP, cgreGFP, ppluGFP.
esmGFP takes longer to mature than the known GFPs that we measured, but achieves a comparable brightness after two days.
- esmGFP takes longer to mature than the known GFPs that were measured.
- esmGFP achieves a comparable brightness after two days.
- Fluorescence was mediated by Thr65 and Tyr66.
- B8 and esmGFP variants with Thr65 and Tyr66 mutated to glycine lost fluorescence activity.
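A chromophore knockout of this kind is a pair of point substitutions. The sketch below shows the operation on a toy string; the helper `mutate_to_glycine` is hypothetical, and how residue numbers such as 65 and 66 map onto positions in a particular design's sequence is an assumption that would have to be checked against an alignment.

```python
from typing import Iterable


def mutate_to_glycine(sequence: str, positions: Iterable[int]) -> str:
    """Return a copy of `sequence` with the given 1-based positions set to Gly."""
    residues = list(sequence)
    for position in positions:
        if not 1 <= position <= len(residues):
            raise ValueError(f"position {position} is outside the sequence")
        residues[position - 1] = "G"
    return "".join(residues)


if __name__ == "__main__":
    # Toy sequence, not a real fluorescent protein: knock out the T and Y.
    toy = "MSKTYGV"
    print(mutate_to_glycine(toy, [4, 5]))
```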
Analysis of the excitation and emission spectra of esmGFP reveals that its peak excitation occurs at $496 \mathrm{~nm}$, which is shifted $7 \mathrm{~nm}$ relative to the $489 \mathrm{~nm}$ peak for EGFP, while both proteins emit at a peak of $512 \mathrm{~nm}$ (Fig. 4D).
- The peak excitation of esmGFP occurs at 496 nm, which is shifted by 7 nm compared to EGFP's peak excitation at 489 nm.
- Both esmGFP and EGFP emit at a peak of 512 nm.
The shapes of the spectra indicated a narrower full-width-half-maximum (FWHM) for the excitation spectrum of esmGFP ($39 \mathrm{~nm}$ for esmGFP vs $56 \mathrm{~nm}$ for EGFP), whereas the FWHM of their emission spectra were highly comparable ($35 \mathrm{~nm}$ and $39 \mathrm{~nm}$, respectively).
- The excitation spectrum of esmGFP has a narrower full-width-half-maximum (FWHM) than that of EGFP (39 nm for esmGFP vs 56 nm for EGFP).
- The FWHM of the emission spectra of esmGFP and EGFP are highly comparable (35 nm and 39 nm, respectively).
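For readers unfamiliar with the metric, FWHM is the width of a peak measured at half of its maximum height. The following sketch estimates it from a sampled spectrum by linear interpolation; it assumes a single-peaked spectrum sampled at increasing wavelengths that drops below half-maximum on both sides, and it is not the analysis code used for Fig. 4D.

```python
from typing import Sequence


def _crossing(x0: float, y0: float, x1: float, y1: float, level: float) -> float:
    """Linearly interpolate the x at which the segment crosses `level`."""
    return x0 + (level - y0) * (x1 - x0) / (y1 - y0)


def fwhm(wavelengths: Sequence[float], intensities: Sequence[float]) -> float:
    """Estimate the full width at half maximum of a single-peaked spectrum."""
    if len(wavelengths) != len(intensities) or len(wavelengths) < 3:
        raise ValueError("need matching arrays with at least three samples")
    peak = max(intensities)
    half = peak / 2.0
    peak_index = list(intensities).index(peak)

    # Walk left from the peak until the next sample is below half maximum.
    i = peak_index
    while i > 0 and intensities[i - 1] >= half:
        i -= 1
    # Walk right from the peak in the same way.
    j = peak_index
    while j < len(intensities) - 1 and intensities[j + 1] >= half:
        j += 1
    if i == 0 or j == len(intensities) - 1:
        raise ValueError("spectrum does not fall below half maximum on both sides")

    left = _crossing(wavelengths[i - 1], intensities[i - 1],
                     wavelengths[i], intensities[i], half)
    right = _crossing(wavelengths[j], intensities[j],
                      wavelengths[j + 1], intensities[j + 1], half)
    return right - left
```

Applied to measured excitation or emission curves, a function like this would yield widths comparable to the values quoted above, up to sampling resolution.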
Overall esmGFP exhibits spectral properties consistent with known GFPs.
- esmGFP exhibits spectral properties consistent with known GFPs.
- Its excitation peak is at 496 nm and its emission peak is at 512 nm.
We next sought to understand how the sequence and structure of esmGFP compares to known proteins.
- The goal was to compare the sequence and structure of esmGFP to known proteins.
- The comparison draws on sequence searches (BLAST and MMseqs) and on a predicted structure of esmGFP.
A BLAST (51) search against the non-redundant protein sequences database and an MMseqs (52) search of ESM3's training set report the same top hit, tagRFP (which was also the nearest neighbor to B8), with $58 \%$ sequence identity, representing 96 mutations throughout the sequence.
- A BLAST search was conducted against the non-redundant protein sequences database.
- An MMseqs search of ESM3's training set was also performed.
- The top hit for both searches was tagRFP.
- tagRFP was also the nearest neighbor to B8.
- esmGFP has 58% sequence identity to tagRFP, representing 96 mutations throughout the sequence.
tagRFP is a designed variant, and the closest wildtype sequence to esmGFP from the natural world is eqFP578, a red fluorescent protein, which differs from esmGFP by 107 sequence positions ( $53 \%$ identity).
- tagRFP is a designed variant.
- The closest wildtype sequence to esmGFP from the natural world is eqFP578, a red fluorescent protein.
- eqFP578 differs from esmGFP by 107 sequence positions (53% identity).
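The mutation counts and identity percentages above can be cross-checked against each other. The sketch below assumes the comparison spans 229 aligned positions (the generated length stated earlier) with no gaps; a real alignment may differ, and the helper name `percent_identity` is hypothetical.

```python
def percent_identity(num_differences: int, aligned_length: int) -> float:
    """Percent of aligned positions that are identical."""
    if aligned_length <= 0 or not 0 <= num_differences <= aligned_length:
        raise ValueError("invalid alignment counts")
    return 100.0 * (aligned_length - num_differences) / aligned_length


ALIGNED_LENGTH = 229  # assumption: gap-free comparison over the full design

if __name__ == "__main__":
    for name, differences in (("tagRFP", 96), ("eqFP578", 107)):
        identity = percent_identity(differences, ALIGNED_LENGTH)
        print(f"{name}: {differences} differences -> {identity:.1f}% identity")
```

Under this assumption, 96 and 107 differing positions give about 58% and 53% identity respectively, matching the reported figures.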
Sequence differences between esmGFP and tagRFP occur throughout the structure (Fig. 4E) with 22 mutations occurring in the protein's interior, which is known to be intensely sensitive to mutations due to chromophore proximity and a high density of interactions (46).
- There are sequence differences between esmGFP and tagRFP throughout the structure.
- There are 22 mutations occurring in the protein's interior.
- The protein's interior is intensely sensitive to mutations due to chromophore proximity and a high density of interactions.
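Examination of a sequence alignment of 648 natural and designed GFP-like fluorescent proteins revealed that esmGFP has the level of similarity to all other FPs that is typically found when comparing sequences across taxonomic orders, but within the same taxonomic class (Fig. 4F).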
- A sequence alignment of 648 natural and designed GFP-like fluorescent proteins was examined.
- EsmGFP has a level of similarity to all other FPs that is typically found when comparing sequences across taxonomic orders, but within the same taxonomic class.
- The examination of the sequence alignment revealed this similarity.
For example, the difference between esmGFP and other FPs is similar to the level of difference between FPs belonging to the orders scleractinia (stony corals) and actiniaria (sea anemones), both of which belong to the larger class anthozoa of marine invertebrates (Fig. 4G).
- The difference between esmGFP and other FPs is comparable to the level of difference between FPs found in stony corals and sea anemones, both of which belong to the anthozoa class of marine invertebrates.
The closest FPs to esmGFP come from the anthozoa class (corals and anemones), average sequence identity $51.4 \%$, but esmGFP also shares some sequence identity with FPs from the hydrozoa (jellyfish) where the famous avGFP was discovered, average sequence identity $33.4 \%$ (Fig. S22).
- The closest FPs to esmGFP come from the anthozoa class (corals and anemones), with an average sequence identity of 51.4%.
- esmGFP also shares some sequence identity with FPs from the hydrozoa (jellyfish), where the famous avGFP was discovered, with an average sequence identity of 33.4%.
We can draw insight from evolutionary biology on the amount of time it would take for a protein with similar sequence identity to arise through natural evolution.
- Evolutionary biology can provide insight into the time it would take for a protein with similar sequence identity to arise through natural evolution.
In Fig. 4G we show esmGFP alongside three Anthozoan GFPs.
- Fig. 4G displays esmGFP along with three Anthozoan GFPs.
We use a recent time-calibrated phylogenetic analysis of the Anthozoans (53), which estimated the time in millions of years ago (MYA) to last common ancestors, to estimate the evolutionary time between each pair of these species.
- A recent time-calibrated phylogenetic analysis of the Anthozoans was used to estimate the evolutionary time between each pair of species.
- The analysis estimated the millions of years ago (MYA) to last common ancestors.
Using a larger dataset of six Anthozoan GFPs and species for which we have accurate MYA to last common ancestors and GFP sequence identities, we construct a simple estimator that correlates sequence identity between FPs to MY of evolutionary time between the species (Fig. 4H) to calibrate against natural evolution.
- A simple estimator was constructed to correlate sequence identity between FPs to MY of evolutionary time between species.
- The estimator was calibrated against natural evolution using a larger dataset of six Anthozoan GFPs and species with accurate MYA and GFP sequence identities.
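One way such an estimator could be built is an ordinary least-squares line through (sequence identity, evolutionary time) pairs. The text does not state the functional form, so the linear fit here is an assumption, and the helpers `fit_line` and `predict` are hypothetical; no data from the figure is reproduced.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(slope: float, intercept: float, x: float) -> float:
    """Evaluate the fitted line at x."""
    return slope * x + intercept
```

To use it, `xs` would hold pairwise GFP sequence identities and `ys` the corresponding millions of years between species; `predict` would then be evaluated at esmGFP's identity to its closest natural relative.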
Based on this analysis we estimate esmGFP represents an equivalent of over 500 million years of evolution from the closest protein that has been found in nature.
- esmGFP is estimated to represent the equivalent of over 500 million years of evolution from the closest protein that has been found in nature.
- This estimate is based on the estimator calibrated against natural evolution.