esm3.a5

\begin{tabular}{|c|c|c|}
\hline
PDB ID & Coordinating Residues & Ligand ID \\ \hline
7map & D25 G27 A28 D29 D30 G48 G49 V50 & 017 \\ \hline
7n3u & I305 F310 V313 A326 K328 N376 C379 G382 D386 F433 & 05J \\ \hline
7exd & D103 I104 C107 T108 I174 H176 T182 W306 F309 E313 Y337 & 05X \\ \hline
8gxp & W317 C320 A321 H323 V376 F377 L396 I400 H479 Y502 & 06L \\ \hline
7n4z & M66 C67 R124 L130 C134 Y135 D152 F155 & 08N \\ \hline
7vrd & A40 S41 H161 Q169 E170 E213 D248 D324 K349 H377 R378 S379 K400 & 2PG \\ \hline
7zyk & V53 V66 V116 H160 N161 I174 D175 & ADP \\ \hline
6yj7 & K23 V24 A25 Y45 T46 A47 F115 I128 & AMP \\ \hline
8ppb & H185 F198 K209 Q249 D250 L251 D262 K336 I415 D416 & ATP \\ \hline
7knv & E33 F94 E95 D125 & CA \\ \hline
7xer & Y466 L505 T525 & CLR \\ \hline
7tj6 & F366 G367 T378 R418 & CMP \\ \hline
6xm7 & H167 H218 H284 H476 & CO \\ \hline
7bfr & Q62 X126 H248 & CO3 \\ \hline
6xlr & X272 Y495 H496 H581 & CU \\ \hline
6tnh & N40 A41 S127 T128 Q187 L191 C201 T202 V236 & DGP \\ \hline
7ndr & F73 S101 F102 D103 R106 & EDO \\ \hline
8axy & H68 H109 E144 & FE \\ \hline
7o6c & E62 E107 Q141 & FE2 \\ \hline
8aul & P31 M32 T33 Q106 H185 R237 S319 G320 G321 G342 R343 F369 Y370 & FMN \\ \hline
7vcp & N37 D38 Q54 F97 S98 R159 D160 E214 Y276 W297 & FRU \\ \hline
7b7f & G167 T168 G189 W195 & FUC \\ \hline
8d0w & F73 L136 E137 F329 & GAL \\ \hline
7yua & T13 T14 I15 D40 H85 S86 D87 D110 N290 & GDP \\ \hline
7w1a & L44 Y88 L91 I212 & GMP \\ \hline
71jn & G71 S72 D91 K236 S253 V254 D309 R310 & GTP \\ \hline
6s4f & Y84 N87 K88 V131 Q132 L133 D155 F157 I276 P309 G310 G313 P314 V317 & KUN \\ \hline
7mg7 & Y12 G98 L99 Y100 A207 D208 G227 R228 & MAN \\ \hline
7qow & D12 T118 E268 & MG \\ \hline
7dmm & E181 E217 D245 D287 & MN \\ \hline
7qoz & G11 G12 I13 Y34 D35 V36 A86 G87 V126 T127 N128 H185 M235 & NAD \\ \hline
7v2r & G89 F93 K98 F101 E121 Y204 E209 F229 & NAI \\ \hline
7a7b & F51 Y128 K165 N166 S167 Y186 R187 I248 G249 A299 & NAP \\ \hline
7pae & M20 L22 L38 V49 I53 C56 K57 R61 Q78 V80 W90 I109 M117 I129 L147 Y149 & O7T \\ \hline
8egy & H82 K83 S186 G230 S231 N232 E345 S368 G369 & PLP \\ \hline
7qow & S65 R129 D273 H465 & PO4 \\ \hline
7wmk & E77 L124 R129 S174 T189 Q191 W241 D304 E306 K349 D410 W411 Y486 & PQQ \\ \hline
7pl9 & D607 A608 Y637 M638 Y705 G706 M735 K736 & RET \\ \hline
7yf2 & G153 E174 L175 L209 N210 L211 Y295 & SAH \\ \hline
7v6j & G207 D230 L231 D250 M251 K264 & SAM \\ \hline
7ys6 & D106 C110 N288 & SRO \\ \hline
6w8m & A22 A23 G70 S110 T111 G112 V113 Y114 & TJY \\ \hline
8g27 & S258 D294 K435 R717 & UDP \\ \hline
7xyk & R24 C170 R190 S191 D193 N201 H231 Y233 & UMP \\ \hline
8g3s & H224 F228 V249 M250 V253 R263 T266 L267 F270 & YLT \\ \hline
8it9 & T92 P93 R96 Y108 L109 K216 V228 S229 H231 H232 & ZL6 \\ \hline
\end{tabular}
\footnotetext{Table S13. Atomic coordination dataset. Selected PDBs and coordinating residues (along with binding ligand) for each protein sample in the atomic coordination dataset.}

Figure S17. Alignment improves model generations. pTM, cRMSD distributions of generations from the 98B base model and aligned model for all ligands in the atomic coordination dataset. Each ligand/model pair has 1024 generations.

Figure S18. Randomly selected successful generations from the base model and finetuned model. A random sample of ligands is selected and visualized with the ground truth PDB chain from which the ligand was taken. Solutions produced by ESM3 are diverse, and the finetuned model gives significantly more successes (out of 1024 total samples).

ESM3 generates a dim distant GFP B8 and a bright distant protein esmGFP. Details are provided below on computational methods, experimental protocols, results, and post-experiment analyses.

The base ESM3 7B model generates candidate GFP designs for laboratory testing using a single prompt and a chain of thought over sequence and structure tokens. Candidates are filtered and ranked by metrics at several steps in the process. Experiment 1 tests candidates across a range of sequence identity to a template, yielding multiple GFPs including dim hit B8. Experiment 2 consists of designs starting a chain of thought from the sequence of B8, yielding numerous bright GFPs including C10 which we term esmGFP. This section details the computational protocol that generated and selected candidate GFP designs for Experiments 1 and 2, shown in Fig. 4B. Protocols, metrics, and selection conventions are separately introduced and then synthesized in descriptions of the two experiments, at the end of the section.

All candidate GFP designs were created using the base ESM3 7B model with no finetuning. Throughout generation, the model is prevented from decoding cysteine residues.

All candidate GFP designs in Experiment 1 are produced with a chain of thought beginning from a single prompt. The goal of the prompt is to capture essential residue identities and structural features needed for chromophore formation and fluorescence, leaving other degrees of freedom open for the model to generate diverse designs.

Template To this end, we prompt ESM3 with a minimal set of sequence and structure information from 16 residues near the chromophore formation site from a template protein. We select a pre-cyclized intermediate crystal structure from (50), PDB ID 1QY3, as our template. We reverse the chromophore maturation slowing mutation R96A in 1QY3 so the prompt contains Arg96. We subsequently refer to the full sequence and structure of 1QY3 with mutation A96R as 1QY3 A96R or the template.

Sequence prompt The sequence portion of our prompt consists of 7 template residues: Met1, Thr62, Thr65, Tyr66, Gly67, Arg96, and Glu222. Residues 65-67 form the chromophore. Met1 ensures proper start codon placement. Residues 62, 96, and 222 are described in (50) and other works to have key catalytic roles in chromophore formation.

Structure prompt The structure portion of our prompt consists of structure tokens and backbone atomic coordinates taken from 16 template residues at positions 96, 222, and 58-71 (inclusive), which roughly capture the central alpha helix. The unique geometry of the central alpha helix is known to be crucial for chromophore formation (50).

All other positions and tracks in the prompt are masked. The overall prompt length is 229, matching that of the template. Residue indices are contiguous and begin from 1.
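The prompt layout described above can be sketched in a few lines of Python. This is only an illustration: the mask symbol `_`, the names `SEQ_PROMPT` and `STRUCT_POSITIONS`, and the helper functions are hypothetical and are not part of the ESM3 API. Positions are 1-indexed, as in the text.

```python
# Illustrative sketch of the prompt layout; all names here are hypothetical.
MASK = "_"            # assumed placeholder for a masked sequence position
PROMPT_LENGTH = 229   # matches the template length stated in the text

# The 7 sequence-prompt residues (1-indexed position -> amino acid).
SEQ_PROMPT = {1: "M", 62: "T", 65: "T", 66: "Y", 67: "G", 96: "R", 222: "E"}

# The 16 structure-prompt positions: 96, 222, and 58-71 inclusive.
STRUCT_POSITIONS = sorted({96, 222} | set(range(58, 72)))


def build_sequence_prompt():
    """Return a length-229 string with only the 7 prompted residues unmasked."""
    seq = [MASK] * PROMPT_LENGTH
    for pos, aa in SEQ_PROMPT.items():
        seq[pos - 1] = aa  # convert 1-indexed position to 0-indexed
    return "".join(seq)


def build_structure_mask():
    """Return a boolean list: True where structure information is provided."""
    provided = set(STRUCT_POSITIONS)
    return [(i + 1) in provided for i in range(PROMPT_LENGTH)]
```

Every other position, in both the sequence and structure tracks, stays masked so the model is free to fill it in.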

\section*{A.5.1.3. Joint Sequence Structure Optimization}

We employ the following procedure to jointly optimize the sequence and structure of designs throughout our experiments: While annealing temperature linearly from 1 to 0, we perform multiple iterations of first predicting the structure of a designed sequence and subsequently Gibbs sampling each position in the sequence for that predicted structure. In algorithmic form:

Algorithm 15 gibbs_seq_given_struct
Input: ESM3 $f$, sequence $x \in \{0, \ldots, 20\}^{L}$, structure $y$, temperature $t$
    for $i \in \operatorname{shuffle}(\{1, \ldots, L\})$ do
        $x_{i} \sim \exp \left(\log f\left(x_{i} \mid x_{\backslash i}, y\right) / t\right)$
    end for
    return $x$

Algorithm 16 joint_optimize
Input: ESM3 $f$, initial sequence $x_{1}$, iterations $I$, initial temperature $t_{1}$, final temperature $t_{f}$
    for $i=1, \ldots, I$ do
        $t_{i}=\left(t_{f}-t_{1}\right) \cdot(i /(I-1))+t_{1}$
        $y_{i}=\text{generate\_struct}\left(f, x_{i}, \operatorname{len}\left(x_{i}\right), T=0\right)$
        $x_{i+1}=\text{gibbs\_seq\_given\_struct}\left(f, x_{i}, y_{i}, t_{i}\right)$
    end for
    return $x_{I+1}$
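The two algorithms can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the ESM3 implementation: `predict_structure` and `seq_logits` are hypothetical callables standing in for the model, temperature 0 is treated as greedy (argmax) decoding, and the schedule uses a 0-based step `k / (I - 1)` so that the first iteration runs at $t_1$ and the last at $t_f$.

```python
import numpy as np


def gibbs_seq_given_struct(seq_logits, x, y, t, rng):
    """One Gibbs sweep: resample every position once, in random order.

    seq_logits(x, i, y) is a hypothetical callable returning the model's
    log-probabilities over tokens at position i given the rest of x and
    the structure y.
    """
    x = x.copy()
    for i in rng.permutation(len(x)):
        logits = np.asarray(seq_logits(x, i, y), dtype=float)
        if t <= 0:
            # Assumption: temperature 0 means greedy decoding.
            x[i] = int(np.argmax(logits))
            continue
        z = logits / t
        p = np.exp(z - z.max())  # numerically stable softmax
        p /= p.sum()
        x[i] = rng.choice(len(p), p=p)
    return x


def joint_optimize(predict_structure, seq_logits, x1, iters, t1, tf, seed=0):
    """Alternate structure prediction and Gibbs sampling while annealing."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x1).copy()
    for k in range(iters):
        # Linear schedule from t1 (k = 0) to tf (k = iters - 1); assumption.
        frac = k / (iters - 1) if iters > 1 else 1.0
        t = t1 + (tf - t1) * frac
        y = predict_structure(x)  # hypothetical stand-in for generate_struct
        x = gibbs_seq_given_struct(seq_logits, x, y, t, rng)
    return x
```

Passing the model in as callables keeps the sketch independent of any particular model interface; the variants described next would only change how `seq_logits` is computed or whether a position is resampled.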

Three variants of gibbs_seq_given_struct in joint_optimize were employed for Experiments 1 and 2. Joint optimization occasionally produces repetitive spans of amino acids when temperature is annealed to low values. Variants 1 and 2 are intended to address this, in differing ways. Variant 3 is an experiment in biasing the logits with a PSSM of known natural GFPs. Half of the candidates in Experiment 2 were produced using Variant 3. This half did not include esmGFP.

Variant 1: Negative Local Sequence Guidance We bias the logits of the model away from those produced just from a highly local span of the sequence. Specifically, we use classifier-free guidance (99): $$ \text{logits}^{\prime}=\text{weight} * \left(\text{logits}_{\text{cond}}-\text{logits}_{\text{uncond}}\right)+\text{logits}_{\text{uncond}} $$ but push away from the logits produced by inputting only the 7 residues centered on the position being sampled, with all other sequence positions and all other model inputs left blank. With weight 2, this gives: $$ \text{logits}^{\prime}=2 * \left(\text{logits}_{\text{cond}}-\text{logits}_{\text{local\_seq}}\right)+\text{logits}_{\text{local\_seq}} $$
Variant 2: Max Decoding Entropy Threshold We optionally skip resampling of sequence during Gibbs sampling at positions whose entropy over sequence tokens exceeds a user-specified threshold.
Variant 3: PSSM Bias In Experiment 2 only, we experiment with both including and excluding a PSSM-based bias during Gibbs sequence sampling. Specifically, we add a PSSM constructed from 71 natural GFPs (see Appendix A.5.1.4 for details) directly to the sequence output logits of the model, with a user-specified weight. esmGFP did not use this option; it was produced with weight 0.
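The three variants amount to small transformations of the per-position logits. The sketch below illustrates them on plain NumPy arrays. The function names are hypothetical, and the entropy is computed in nats, which is an assumption since the text does not state the unit.

```python
import numpy as np


def local_seq_guidance(logits_cond, logits_local_seq, weight=2.0):
    """Variant 1: push logits away from those of the local 7-residue window."""
    return weight * (logits_cond - logits_local_seq) + logits_local_seq


def token_entropy(logits):
    """Shannon entropy (in nats, an assumption) of softmax(logits)."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())


def should_resample(logits, max_entropy):
    """Variant 2: skip positions whose entropy exceeds the threshold."""
    return token_entropy(logits) <= max_entropy


def add_pssm_bias(logits, pssm_row, weight):
    """Variant 3: add a weighted PSSM row to the logits (weight 0 = off)."""
    return np.asarray(logits, dtype=float) + weight * np.asarray(pssm_row, dtype=float)
```

With `weight=1.0` the guidance formula reduces to the conditional logits, and with a PSSM weight of 0 the logits are unchanged, which is the setting the text reports for esmGFP.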

GFP designs are produced and scored by a number of ESM3-derived and independent metrics. Unless otherwise noted, designed structures are predicted using ESM3 with only sequence as input, using iterative decoding of structure tokens with temperature 0 and subsequent decoding of backbone coordinates with an older version of the structure token decoder.

The following is an exhaustive list of metrics used. An exact breakdown of where and how specific metrics are used can be found in Appendix A.5.1.5, Appendix A.5.1.6 and Appendix A.5.1.7.

Template Chromophore Site RMSD is calculated via an optimal alignment (100) of N, C, CA, and inferred $\mathrm{CB}$ atoms at positions 62, 65, 66, 67, 96, and 222 in the predicted structure of a design and the template (crystal) structure.

Template Helix RMSD is calculated in the same way, but for N, C, CA atoms only, at design and template positions 58-71 (inclusive).

1EMA Helix RMSD is a metric proposed in (101). An RMSD is calculated between alpha helix residues in the predicted designed structure and a specific crystal structure of avGFP, PDB ID 1EMA. Our calculation differs slightly from (101). We calculate RMSD for $\mathrm{N}, \mathrm{C}, \mathrm{CA}$ and inferred $\mathrm{O}$ atoms, and consider only positions 60-64 and 68-74 (both ranges inclusive) to exclude chromophore positions 65-67.
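The RMSD metrics above all superimpose two matched sets of atoms before measuring deviation. The sketch below assumes the optimal alignment is a Kabsch-style rigid superposition computed with an SVD; that choice, and the function name `kabsch_rmsd`, are assumptions made for illustration. Inputs are two `(n_atoms, 3)` coordinate arrays with atoms in matching order (for example, the N, C, CA atoms at positions 58-71 of a design and of the template).

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between matched point sets P and Q after optimal rigid alignment."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centering both point sets.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Covariance matrix and its SVD.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so the result is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    # Rotate P onto Q and measure the remaining deviation.
    diff = (R @ Pc.T).T - Qc
    return float(np.sqrt((diff ** 2).sum() / len(P)))
```

Selecting which atoms and positions go into `P` and `Q` is what distinguishes the three metrics; the superposition step is the same.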

Sequence Pseudo-perplexity is calculated as defined in (102). Given a protein sequence, positions are masked one at a time, negative log-likelihoods of input tokens at masked positions are averaged across all positions in the sequence, and the result is exponentiated.
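As a small illustration of this definition, the sketch below computes pseudo-perplexity from a hypothetical callable `masked_log_prob(seq, i)`, which is assumed to return the model's log-probability of the true token at position `i` when that position alone is masked. The callable stands in for a model forward pass and is not an ESM3 function.

```python
import math


def pseudo_perplexity(seq, masked_log_prob):
    """exp of the mean negative log-likelihood over one-at-a-time masking."""
    nll = [-masked_log_prob(seq, i) for i in range(len(seq))]
    return math.exp(sum(nll) / len(nll))
```

As a sanity check, a model that assigns uniform probability over 20 amino acids at every position yields a pseudo-perplexity of 20.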

Round-trip Perplexity is calculated for a designed sequence via predicting its structure with ESM3, and then evaluating the perplexity of the sequence given that predicted structure under a single forward pass of ESM3.

$\mathbf{N}$-gram Score is calculated as the $E_{\text{ngram}}$ term defined in (10). This score assesses the divergence between the N-gram frequencies of residues in the designed sequence and those found in a background distribution, derived from UniRef50 2018_03. Specifically, for a function $\text{ngram}_{i}$ that takes in a sequence $x$ and an N-gram order $i$, and a precomputed distribution of background N-gram frequencies $\text{ngram}_{i, bg}$, the score is calculated as:
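For intuition, a divergence of this kind can be sketched as follows. The specific choices here are assumptions made for illustration and are not taken from the text: the divergence is a KL divergence, it is summed over N-gram orders 1 to 3, and a small constant `eps` guards against N-grams missing from the background. The `background` argument is a hypothetical dictionary mapping each order to a frequency table.

```python
from collections import Counter
import math


def ngram_freqs(seq, n):
    """Relative frequencies of all length-n substrings of seq."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def ngram_score(seq, background, orders=(1, 2, 3), eps=1e-9):
    """Sum over orders of KL(ngram_i(seq) || background[i]); assumed form."""
    score = 0.0
    for n in orders:
        p = ngram_freqs(seq, n)
        q = background[n]
        score += sum(pv * math.log(pv / (q.get(gram, 0.0) + eps))
                     for gram, pv in p.items())
    return score
```

Under these assumptions a sequence whose N-gram statistics match the background scores near 0, and repetitive sequences score higher, which is consistent with using an upper threshold on this metric as a filter.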

PSSM A position-specific scoring matrix (PSSM) is constructed from an MSA of 71 natural GFPs (103). Specifically, at positions aligned to our template, frequencies for the 20 canonical amino acids (excluding gaps) are transformed to log odds by dividing by the uniform background ($p(aa)=0.05$), adding an epsilon of $1\mathrm{e}{-9}$, and applying $\log$ base 2. This produces a matrix of scores of size $229 \times 20$.

PSSM score We extract from the PSSM values at (position, amino acid) pairs occurring in an input sequence. These are averaged to produce a score.
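The PSSM construction and the PSSM score can be sketched together. The sketch assumes the MSA has already been restricted to columns aligned to the template, so every row has the same length; the amino acid ordering in `AA` and the function names are illustrative choices, not taken from the source.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order for the 20 canonical residues


def build_pssm(msa, background=0.05, eps=1e-9):
    """log2(frequency / background + eps) per (position, amino acid)."""
    length = len(msa[0])
    counts = np.zeros((length, len(AA)))
    for row in msa:
        for pos, aa in enumerate(row):
            j = AA.find(aa)
            if j >= 0:  # gaps and non-canonical symbols are not counted
                counts[pos, j] += 1
    totals = counts.sum(axis=1, keepdims=True)
    freqs = counts / np.maximum(totals, 1)  # avoid division by zero
    return np.log2(freqs / background + eps)


def pssm_score(pssm, seq):
    """Mean PSSM value over the (position, amino acid) pairs in seq."""
    return float(np.mean([pssm[i, AA.index(aa)] for i, aa in enumerate(seq)]))
```

For a 229-column alignment `build_pssm` returns a $229 \times 20$ matrix. A residue never observed at a position receives roughly $\log_2(10^{-9}) \approx -29.9$, so unseen residues are penalized heavily.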

N-terminus Coil Count is a metric intended to measure structural disorder at the N-terminus of a design. We observed that predicted structures have various levels of disorder in this region. To quantify it for possible filtering, we apply mkdssp (76) to the ESM3-predicted structure of a design, and record how many of the first 12 positions are reported as having SS8 labels in $\{\mathrm{S}, \mathrm{T}, \mathrm{C}\}$.
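Given an SS8 string for a predicted structure, the count itself is a one-liner. The sketch assumes the secondary structure labels have already been extracted from the mkdssp output into a plain string with one character per residue; running and parsing mkdssp is not shown.

```python
def n_terminus_coil_count(ss8, window=12, coil_labels=frozenset("STC")):
    """Number of the first `window` residues labeled S, T, or C."""
    return sum(1 for label in ss8[:window] if label in coil_labels)
```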

In Experiments 1 and 2, designs are selected for testing by first applying a set of filters and then selecting the top-$N$ designs according to a score-based ranking. Scores are calculated by summing the values of several metrics, each normalized across designs to have zero mean and unit variance and negated when appropriate so that lower values are always better.

Common Filters: The following filters are applied in both Experiments 1 and 2.

Template Chromophore Site RMSD $<1.5 \AA$
Template Helix RMSD $<1.5 \AA$
N-gram Score $<5$

Common Score Terms: The following score terms are used in both Experiments 1 and 2.

Sequence Pseudo-perplexity
Round-trip Perplexity
ESM3 pTM
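The filter-then-rank convention can be sketched as below. Designs are represented as plain dictionaries, and the key names (`site_rmsd`, `helix_rmsd`, `ngram`, and so on) are hypothetical. Each score term carries a signed weight: positive when lower is better (perplexities), negative when higher is better (pTM), so the summed score is always "lower is better". A multiplier such as 15 on a term can be folded into the same weight.

```python
import numpy as np


def passes_common_filters(design, ngram_max=5.0):
    """Common Filters: both template RMSDs < 1.5 and N-gram Score < ngram_max."""
    return (design["site_rmsd"] < 1.5
            and design["helix_rmsd"] < 1.5
            and design["ngram"] < ngram_max)


def rank_designs(designs, terms):
    """Sort designs by the sum of signed, z-scored metrics (best first).

    terms maps metric name -> signed weight (+ if lower is better).
    """
    total = np.zeros(len(designs))
    for name, weight in terms.items():
        values = np.array([d[name] for d in designs], dtype=float)
        std = values.std()
        z = (values - values.mean()) / std if std > 0 else np.zeros_like(values)
        total += weight * z
    order = np.argsort(total, kind="stable")
    return [designs[i] for i in order]


# Common Score Terms with their assumed signs.
COMMON_SCORE_TERMS = {"pseudo_ppl": 1.0, "round_trip_ppl": 1.0, "ptm": -1.0}
```

Because normalization is done across the pool being ranked, the same design can receive a different score when the surrounding pool changes.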

In Experiment 1, we generate a set of GFP designs for experimental testing with a range of sequence identities to our template. Designs are generated by a chain of thought: From the prompt, ESM3 decodes all masked structure tokens, then all masked sequence tokens. Lastly, sequence and structure tokens are jointly optimized.

Initial Generation: Starting from the prompt, we first generate $38 \mathrm{k}$ structures by decoding masked structure tokens one at a time using a fixed temperature sampled uniformly from the range $(0,1.25)$ for each generation. To focus compute on the most promising structures, we filter according to Template Chromophore Site RMSD $<1 \AA$, yielding $24 \mathrm{k}$ selected structures. We next generate $\approx 4$ sequences for each structure with a temperature uniformly sampled from the range $(0,0.6)$, yielding $92 \mathrm{k}$ total sequences.

Selection: We select a subset of promising initial generations for further optimization by applying Common Filters with the N-gram Score threshold modified to $<5.5$, ranking designs according to \{Common Score Terms, mean ESM3 pLDDT, mean ESMFold pLDDT, ESMFold pTM\}, and selecting the best 40 designs in each interval of 0.1 sequence identity to the template sequence in $[0.2, 1.0]$, 320 in total.

Joint Sequence Structure Optimization: We then jointly optimize the sequence and structure of designs. Using 30 iterations in each case, we run 5 seeds of optimization with max decoding entropy threshold $=1.5$ and 2 seeds of optimization with negative local sequence guidance $=2.0$, yielding $67 \mathrm{k}$ total designs. Designs from every iteration are included in this pool.

Selection: To select a set of designs for laboratory testing, we apply \{Common Filters, N-terminus Coil Count $<6$\}, rank designs according to \{Common Score Terms, ESMFold pTM, 15 * PSSM Score\}, and select the best 88 designs across 8 buckets of sequence identity to our template, using intervals of width 0.1 in the range $[0.2, 1]$.
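Both selection steps in Experiment 1 pick the best designs within buckets of sequence identity to the template. A minimal sketch of that bucketing is given below. It assumes designs arrive already sorted best-first and carry a hypothetical `identity` field; how the 88 final designs were split across the 8 buckets is not stated above, so the per-bucket quota is left as a parameter.

```python
def select_per_bucket(ranked_designs, per_bucket, lo=0.2, hi=1.0, width=0.1):
    """Keep the first `per_bucket` designs in each identity interval."""
    n_buckets = round((hi - lo) / width)
    buckets = [[] for _ in range(n_buckets)]
    for design in ranked_designs:  # assumed sorted best-first
        identity = design["identity"]
        if not (lo <= identity <= hi):
            continue
        # Small offset guards against floating point error at bucket edges;
        # identity == hi is placed in the last bucket.
        index = min(int((identity - lo) / width + 1e-9), n_buckets - 1)
        if len(buckets[index]) < per_bucket:
            buckets[index].append(design)
    return [d for bucket in buckets for d in bucket]
```

With `per_bucket=40` and 8 buckets this yields at most 320 designs, matching the count given for the first selection step.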

In Experiment 2, we perform further refinement of the dim, distant GFP found in Experiment 1, B8. To produce a diversity of designs, we sweep over a number of settings: two variations of refinement are performed, and two selection protocols are used.

Local Joint Optimization: Starting from our dim GFP design, B8, we perform joint_optimize using a full grid sweep of the following sets of settings: Initial temperatures $\{0.001, 0.01, 0.05, 0.1, 0.5\}$, PSSM bias weights $\{0, 0.01, 0.05, 0.1, 0.5\}$, Max decoding entropy thresholds $\{0.8, 1, 1.25, 1.5, 2.0\}$. For each unique settings combination, we use 20 iterations of optimization with 3 seeds, continuing the final step of Gibbs sampling until convergence. After accounting for some distributed system machine failures, this yields $6.3 \mathrm{k}$ total candidate designs.
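The size of this sweep is easy to enumerate: 5 initial temperatures, 5 PSSM weights, and 5 entropy thresholds give 125 settings combinations, or 375 optimization runs with 3 seeds each. The sketch below only builds the run configurations; the dictionary keys are hypothetical names.

```python
from itertools import product

INITIAL_TEMPERATURES = [0.001, 0.01, 0.05, 0.1, 0.5]
PSSM_BIAS_WEIGHTS = [0, 0.01, 0.05, 0.1, 0.5]
MAX_ENTROPY_THRESHOLDS = [0.8, 1, 1.25, 1.5, 2.0]
SEEDS = range(3)
ITERATIONS = 20

# One configuration per (settings combination, seed).
runs = [
    dict(t1=t, pssm_weight=w, max_entropy=e, seed=s, iterations=ITERATIONS)
    for t, w, e, s in product(
        INITIAL_TEMPERATURES, PSSM_BIAS_WEIGHTS, MAX_ENTROPY_THRESHOLDS, SEEDS
    )
]
```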

Selection: We select two sets of 45 designs for laboratory testing via two filters and a shared set of ranking criteria.

Set 1: We filter according to \{PSSM Bias $\neq 0$, Common Filters, RMSD to starting structure $<1 \AA$, Identity to starting sequence in $(0.7, 1.0)$\}.
Set 2: We filter according to \{PSSM Bias $=0$ (no bias), Common Filters, RMSD to starting structure $<1 \AA$, Identity to starting sequence in $(0.9, 1.0)$\}. esmGFP comes from this pool.

For each set, we rank according to \{Common Score Terms, 8 * PSSM Score, 15 * 1EMA Helix RMSD\} and select 45 designs each for testing.

We designed a custom bacterial expression vector containing an Ampicillin-resistance gene, the BBa_R0040 TetR promoter, the BBa_B0015 terminator, and a Bsa-I golden gate site between the promoter and terminator. GFP designs were codon optimized for E. coli expression and ordered from IDT (Integrated DNA Technologies Inc.) containing compatible golden gate overhangs. They were then cloned by golden gate assembly into the vector. We evaluated our GFP designs in the E. coli host Mach1.

To evaluate the fluorescence of our GFP designs, we transformed our designs into Mach1 cells. For each of two replicates of a design, a colony was seeded into a $1 \mathrm{~mL}$ TB culture containing $50 \mu \mathrm{g} / \mathrm{mL}$ carbenicillin. Cultures were grown in 96 deep well blocks at $37^{\circ} \mathrm{C}$ in an Infors HT Multitron Shaker with a shaking speed of 1000 RPM for 24 hours. After 24 hours, $1 \mu \mathrm{L}$ of the cultures were diluted in $200 \mu \mathrm{l}$ of $0.2 \mu \mathrm{m}$ filtered DPBS.

Fluorescence intensity of the samples was then quantified at the single cell level using a NovoCyte Quanteon Flow Cytometer (Fig. S19).

The remaining cultures were spun down at $4000 \mathrm{~g}$ for 10 minutes, resuspended and lysed with $300 \mu \mathrm{L}$ lysis buffer (1x bugbuster, $500 \mathrm{mM} \mathrm{NaCl}, 20 \mathrm{mM}$ Tris-HCl pH 8, 10\% glycerol, cOmplete ${ }^{\mathrm{TM}}$, EDTA-free Protease Inhibitor Cocktail), incubated at room temperature on a Belly Dancer Orbital Shaker for 10 minutes, and lysate clarified by centrifugation at $4000 \mathrm{~g}$ for 20 minutes. $100-120 \mu \mathrm{l}$ lysate was transferred to a 96 well black clear-bottom plate, and GFP fluorescence was measured using a Tecan Spark Reader. Fluorescence emission was captured at $515 \mathrm{~nm}$ with a $10 \mathrm{~nm}$ bandwidth and excited with $485 \mathrm{~nm}$ with a $10 \mathrm{~nm}$ bandwidth. Absorbance was captured at $280 \mathrm{~nm}$ with a $3.5 \mathrm{~nm}$ bandwidth to assess total protein content per well. For longer time points, plates containing lysate were sealed and incubated at $37^{\circ} \mathrm{C}$ for up to 7 days prior to measuring fluorescence. GFP fluorescence values were first ratio normalized within a well by their absorbance at $280 \mathrm{~nm}$, and then further ratio normalized across wells using the measured values from a negative control E. coli containing vector without GFP. Data from two replicates was then averaged for (Fig. 4B bottom) and (Fig. 4C).
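The two-step ratio normalization described above can be written out explicitly. This is a sketch of the arithmetic only, with hypothetical function and argument names: each well's fluorescence is divided by its own 280 nm absorbance, the result is divided by the same ratio for the negative control, and replicates are then averaged.

```python
def normalize_fluorescence(fluorescence, a280, control_fluorescence, control_a280):
    """Fluorescence per unit A280, relative to the negative control well."""
    per_protein = fluorescence / a280
    control_per_protein = control_fluorescence / control_a280
    return per_protein / control_per_protein


def average_replicates(values):
    """Mean of normalized values across replicates of one design."""
    return sum(values) / len(values)
```

A normalized value of 1 therefore means a design is indistinguishable from the vector-only control.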

Overview photos of the plates (Fig. 4B top) were taken with an iPhone 12 mini under blue light illumination from an Invitrogen Safe Imager 2.0 Blue Light Transilluminator.

For excitation spectra, emission was captured at $570 \mathrm{~nm}$ with a $50 \mathrm{~nm}$ bandwidth, while the excitation wavelength was varied from 350 to $520 \mathrm{~nm}$ with a $10 \mathrm{~nm}$ bandwidth. For emission spectra, an excitation wavelength of $430 \mathrm{~nm}$ was used with a $50 \mathrm{~nm}$ bandwidth, while emission was captured at varying wavelengths from 480 to $650 \mathrm{~nm}$ with a $10 \mathrm{~nm}$ bandwidth. Excitation and emission spectra were normalized by their maximum values (Fig. 4C).

Plate overview photographs (Fig. 4B top) were taken over two weeks since the initial lysate was created and over one week after the final plate reader quantification was done, and so possibly show additional brightness from slow chromophore maturing designs. We observed some low level contamination of wells $\mathrm{H} 11$ (vector with no GFP or designs) and H12 (lysis buffer only) in the photograph of Experiment 1 (Fig. 4B top left). Some of this contamination is already visible in well $\mathrm{H} 12$ during the initial plate reader quantification (Fig. 4B bottom left). To address potential contamination concerns we performed an additional replication of B8 and observed a similar level of brightness to Experiment 1 (50x less bright than natural GFPs) (Fig. S20).

Chromophore knockout versions of 1QY3 A96R and esmGFP were created through additional T65G and Y66G mutations. These variants, along with 1QY3 and esmGFP, were synthesized and measured as part of an independent replicate performed by Genscript following the E. coli-based fluorescent plate reader assay described above. Normalization was performed with an OD600 measurement of the cells prior to lysis. Analysis otherwise proceeded as above. Two replicates were performed for each design and results were averaged. Chromophore knockout reduced fluorescence to background levels (Fig. S21).

BLAST nr search: esmGFP's sequence was searched with BLAST's online server using the non-redundant sequences database $\mathrm{nr}$ with all default settings. tagRFP's sequence was taken from the top hit. The exact top hit found was TagRFP Cloning vector pLX-B2-TagRFP-T, Sequence ID ASG92118.1 and is shown in its entirety in Table S14.

Train set search: MMseqs2 (73), version 15.6f452, was used to search all datasets that ESM3 was trained on at the maximum available expansion level; for cluster resampling datasets all cluster members are searched, not just cluster centers. The goal is to search against every possible sequence that ESM3 may have seen during pre-training. Settings are selected for conducting a high sensitivity search: -s 6 -a --max-seqs 10000.

To calculate sequence identities involving the two highlighted GFP designs (B8, esmGFP) and select reference proteins, the following procedure is used. MAFFT (104) v7.525 is applied with all default settings to the sequences of B8, esmGFP, the top tagRFP sequence found by BLAST, eqFP578 (from FPBase (105)), the template (PDB ID 1QY3, with mutation A96R), and avGFP (from FPBase). Identities between two sequences are calculated as the number of matching non-gap residues at aligned positions divided by the minimum non-gapped length of the query and target protein. This is the same sequence identity formula used in Appendix A.5.4. Aligned sequences and identities and mutation counts to esmGFP are provided in Table S14.
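The identity formula above operates on two rows of the same alignment. A minimal sketch follows, assuming `-` is the gap character and that both rows come from one MAFFT alignment and therefore have equal length.

```python
def sequence_identity(aligned_a, aligned_b, gap="-"):
    """Matching non-gap residues divided by the shorter ungapped length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    min_length = min(
        len(aligned_a.replace(gap, "")), len(aligned_b.replace(gap, ""))
    )
    return matches / min_length
```

Dividing by the shorter ungapped length means a short protein fully contained in a longer one can still reach an identity of 1.0.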

Figure S19. Flow cytometry data confirms cells expressing esmGFP can be detected at the single cell level. Forward Scatter-Area (FSC-A), a measure of cell size, vs Fluorescein Isothiocyanate-Area (FITC-A), a measure of GFP-like fluorescent signal, for cells expressing 1QY3 A96R, esmGFP, and a negative control that does not express any GFP. A gate was set at the $99.9 \%$ quantile for the negative control data, and the fraction of cells passing the gate was quantified for each sample.
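The gating step in the caption above can be sketched with NumPy. Inputs are arrays of per-cell FITC-A values; the function name is hypothetical and the default linear-interpolation quantile of `numpy.quantile` is an assumption about how the gate was computed.

```python
import numpy as np


def fraction_passing_gate(sample_fitc, negative_fitc, quantile=0.999):
    """Fraction of sample cells brighter than the negative-control quantile."""
    gate = float(np.quantile(np.asarray(negative_fitc, dtype=float), quantile))
    passing = np.asarray(sample_fitc, dtype=float) > gate
    return float(passing.mean()), gate
```

By construction roughly 0.1% of negative-control cells pass the gate, which sets the false positive rate against which each sample is compared.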

Figure S20. Replication of design B8 and select controls. Results are averages of eight wells across two plates.

Positions in esmGFP are described as internal if they have SASA $<5$ in their predicted structure. SASA is calculated (as in Appendix A.2.1.6) from the all-atom structure of esmGFP, predicted with ESM3 7B.

Sequences and metadata of natural and designed fluorescent proteins were obtained from FPBase (105). An initial set of 1000 proteins was filtered to proteins that contained the following metadata: a specified parent organism, an amino acid sequence between 200 and 300 residues long, a specified emission maximum, and no cofactors. The NCBI taxonomy database was used to obtain taxonomic information about each species. These sequences were further filtered to keep those that had species found by NCBI and were Eukaryotic but not from Chlorophyta (to exclude Channelrhodopsin-like proteins). The 648 sequences that passed these criteria, along with the sequence for esmGFP, were aligned into a multiple sequence alignment using MAFFT, and sequence identity was computed between each pair of sequences as described above. All pairs within and across taxa were considered for (Fig. 4F). All designed sequences were considered to belong to the species annotated as their parent organism.

All 648 used sequences belonged to the Leptocardii (e.g. laGFP), Hexanauplia (e.g. ppluGFP), Hydrozoa (e.g. avGFP), or Anthozoa (e.g. efasGFP) classes. The sequence identity of esmGFP was computed to each protein in these classes (Fig. S22). esmGFP was found to be closest to Anthozoan GFPs (average sequence identity $51.4 \%$) but also shares some sequence identity with Hydrozoan GFPs (average sequence identity $33.4 \%$).

To estimate the evolutionary distance in time between esmGFP and known fluorescent proteins, we built an estimator that maps sequence identity between pairs of GFPs to millions of years (MY) apart. We used the following six Anthozoan species: Acropora millepora, Ricordea florida, Montastraea cavernosa, Porites porites, Discosoma sp., and Eusmilia fastigiata, along with the six GFPs amilGFP, rfloGFP, mcavGFP, pporGFP, dis3GFP, and efasGFP, respectively. These species and GFPs were chosen because they were annotated in both a recent time-calibrated phylogenetic analysis of the Anthozoans (53) and a recent study of GFPs (44). Each of these species contains multiple GFP-like sequences, including red and cyan FPs. These particular GFPs were chosen as they were annotated to be the main GFP in each species. The millions of years between each pair of species was estimated as twice the millions of years to the last common ancestor annotated in the time-calibrated phylogenetic analysis. Using statsmodels (106), a line of best fit was fit between MY and sequence identity. The line was required to pass through a sequence identity of 1.0 and 0 MY. The MY to esmGFP was then estimated using this line and the sequence identity of esmGFP to the nearest known protein.
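A line constrained to pass through (identity 1.0, 0 MY) has a single free parameter, so the fit reduces to least squares without an intercept on the variable $1 - \text{identity}$. The sketch below uses NumPy in place of statsmodels to stay self-contained; it illustrates the constraint only, and the function names are hypothetical.

```python
import numpy as np


def fit_slope_through_identity_one(identities, my_apart):
    """Least-squares slope of MY versus (1 - identity), with no intercept."""
    x = 1.0 - np.asarray(identities, dtype=float)
    y = np.asarray(my_apart, dtype=float)
    return float((x * y).sum() / (x * x).sum())


def estimate_my(slope, identity):
    """Predicted millions of years apart for a given sequence identity."""
    return slope * (1.0 - identity)
```

The pairwise MY values fed to the fit would be twice the time to the last common ancestor, as described above; identical sequences map to 0 MY by construction.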