doi.bio/esm3/esm3.a2.full14

==============================

UniRef release 202302 is downloaded and parsed from the official UniRef website (71). MGnify90 version 202302 is downloaded and parsed from MGnify (35). All nonrestricted studies available in JGI on July 31st, 2023 are downloaded and concatenated into the JGI dataset (72). OAS, which includes over a billion antibody sequences from 80 studies, is downloaded and clustered at $95 \%$ sequence identity (36).


In all cases, data is clustered with mmseqs2 (73), with flags --kmer-per-seq 100 --cluster-mode 2 --cov-mode 1 -c 0.8 --min-seq-id <seqid>.

The command "mmseqs2" is used to cluster data with a k-mer size of 100 and a cluster mode of 2. The "--cov-mode 1" flag specifies that the coverage threshold should be set to 1, meaning that all sequences in the cluster must have at least one k-mer in common. The "-c 0.8" flag sets the minimum sequence identity threshold to 80%. Finally, the "--min-seq-id " flag specifies that the minimum sequence identity threshold should be set to the value of "seqid".


In order to do cluster expansion, we separately cluster the dataset at the two levels, and perform a join to determine cluster member and cluster center based on IDs. We first sample a cluster center at the lower level, and then sample a sequence within the cluster at the higher level. As an example, for expansion of UniRef70 at $90\%$, we first cluster UniRef at $70\%$ sequence similarity using mmseqs linclust. Then, we cluster it separately at $90\%$. Since each UniRef90 cluster center is by definition a UniRef70 cluster member, we filter the UniRef70 clustering to the members that appear as UniRef90 cluster centers. We can then drop all cluster centers without any members, which may occur due to the nondeterminism of clustering. This allows us to sample a UniRef70 center, and then a member within that cluster, where center and member are separated at the $90\%$ sequence-similarity level. For ease of dataloading, we additionally limit the number of data points within a cluster to 20.

To perform cluster expansion, the dataset is clustered at two levels of sequence similarity, in this example 70% and 90%, and a join on sequence IDs links each UniRef90 cluster center to the UniRef70 cluster that contains it.

Sampling then proceeds in two steps: first a cluster center is drawn at the lower level (UniRef70), and then a member is drawn from within that cluster at the higher level (UniRef90).

Because every UniRef90 cluster center is also a UniRef70 cluster member, the UniRef70 membership is restricted to sequences that are UniRef90 cluster centers. Any UniRef70 centers left without members, which can happen because clustering is nondeterministic, are dropped.

The result is that sampling a UniRef70 center and then a member of its cluster yields a pair separated at the 90% sequence-similarity level. To simplify data loading, each cluster is limited to 20 members.
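A minimal Python sketch of the two-level join and sampling is shown below. The cluster TSV filenames and the representative-tab-member format are assumptions based on typical MMseqs2 output, not a description of the authors' exact pipeline.

```python
import random
from collections import defaultdict

def load_clusters(tsv_path: str) -> dict:
    """Parse a cluster TSV with one 'representative<TAB>member' pair per line."""
    clusters = defaultdict(list)
    with open(tsv_path) as fh:
        for line in fh:
            center, member = line.rstrip("\n").split("\t")
            clusters[center].append(member)
    return clusters

# Hypothetical file names for the two clustering levels.
uniref70 = load_clusters("uniref70_cluster.tsv")  # 70% identity clusters
uniref90 = load_clusters("uniref90_cluster.tsv")  # 90% identity clusters

# Keep only UniRef70 members that are themselves UniRef90 cluster centers,
# drop UniRef70 centers left without members, and cap clusters at 20.
uniref90_centers = set(uniref90)
expanded = {}
for center, members in uniref70.items():
    kept = [m for m in members if m in uniref90_centers]
    if kept:
        expanded[center] = kept[:20]

def sample_training_sequence() -> str:
    """Two-level sampling: a UniRef70 center, then one of its UniRef90-center members."""
    center = random.choice(list(expanded))
    return random.choice(expanded[center])
```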


As data augmentation, we train a 200M parameter inverse folding model and use it to create additional training examples.

Data augmentation here means enlarging the training set by creating new examples from existing data. A 200M-parameter inverse folding model, i.e. a model that predicts a plausible amino acid sequence for a given backbone structure, is trained and then used to generate alternative sequences for structures already in the training data. This expands the set of sequence-structure pairs without additional manual annotation or data collection, and exposes ESM3 to more diverse sequences that are consistent with each structure.

The inverse folding model uses the same geometric attention layer for structure conditioning and the same output projection head for the sequence logits as ESM3. Unlike ESM3, the transformer stack alternates between blocks with geometric attention and standard attention. The model is trained on the sequence and structure pairs in PDB, AlphaFold-DB, and ESMAtlas, with the single training task of (and loss computed on) predicting sequence at the output given structure at the input. Model architecture and training methodology are otherwise substantially similar to ESM3.

The inverse folding model is thus a structure-conditioned sequence prediction model: it takes backbone coordinates as input, conditions on them through the geometric attention layer, and produces sequence logits through the same projection head used by ESM3.

It is trained on sequence and structure pairs from PDB, AlphaFold-DB, and ESMAtlas with a single objective, predicting the sequence given the structure, and the loss is computed only on that task.

The main architectural difference from ESM3 is that the transformer stack alternates between geometric attention blocks and standard attention blocks; otherwise the architecture and training methodology are substantially the same.


This model is used to generate additional sequences corresponding to each structure in the training data for ESM3 (5 sequences per structure for ESMAtlas and AlphaFold-DB, 64 sequences per structure for the PDB). When training ESM3, with $50\%$ probability the original sequence and structure pair is presented to the model as a training example. The other $50\%$ of the time one of these generated sequences is paired with the structure as the training example seen by ESM3.

This model generates additional sequences for each structure in the ESM3 training data: 5 sequences per structure for ESMAtlas and AlphaFold-DB, and 64 sequences per structure for the PDB. During ESM3 training, there is a 50% chance that the original sequence and structure pair is presented as the training example; the other 50% of the time, one of the generated sequences is paired with the structure instead. This increases the diversity of sequences the model sees for each structure.
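A minimal sketch of the 50/50 selection between the original and an inverse-folded sequence might look like this; the function and argument names are illustrative.

```python
import random

def sample_training_pair(structure, original_seq, generated_seqs):
    """With 50% probability return the original sequence for this structure,
    otherwise pair the structure with one of the inverse-folding generations."""
    if random.random() < 0.5 or not generated_seqs:
        return structure, original_seq
    return structure, random.choice(generated_seqs)
```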


Functional labels are obtained from InterPro (38) and InterProScan (74), both version 95.0. All annotations for UniProtKB were downloaded from the InterPro website via the 'protein2ipr.dat.gz' file. InterProScan was applied to the entirety of MGnify 90 with flags --goterms --iprlookup --pathways --disable-precalc. The resultant values are taken as ground truth functional labels for model training.

Functional labels are obtained from two sources: precomputed InterPro annotations for UniProtKB, downloaded from the InterPro website as 'protein2ipr.dat.gz', and InterProScan run directly on all of MGnify90 with the flags listed above. The resulting annotations are treated as ground-truth functional labels for model training, which is what allows the model to learn to predict protein function.
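The InterProScan step could be invoked roughly as follows; the executable name and output options are assumptions, while the four functional flags come from the text above.

```python
import subprocess

def run_interproscan(fasta_path: str, out_prefix: str) -> None:
    """Run InterProScan on a FASTA file with the flags listed above."""
    subprocess.run(
        [
            "interproscan.sh", "-i", fasta_path, "-b", out_prefix,
            "--goterms",          # map matches to GO terms
            "--iprlookup",        # add InterPro entry accessions
            "--pathways",         # include pathway cross-references
            "--disable-precalc",  # compute matches locally instead of using the lookup service
        ],
        check=True,
    )
```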


We use all PDB chains, clustered by unique PDB ID and entity ID within the PDB structure. We filter to all structures deposited before May 1, 2020, determined by X-ray crystallography, and better than $9 \AA$ resolution. (37)

These criteria define the PDB training subset. Chains are grouped by their unique PDB ID and entity ID, so multiple chains within a single PDB entry are clustered together based on those identifiers.

The set is then restricted to structures deposited before May 1, 2020, excluding anything deposited after that date, and further restricted to structures determined by X-ray crystallography with resolution better than 9 Å; structures solved by other methods or with worse resolution are excluded.

Together these filters yield a consistent set of experimental structures for training.

AlphaFoldDB is downloaded as the v4 version specified on their website (4). We notice that structures with high pLDDT are disproportionately alpha helices. Therefore, we ensure globularity by measuring the number of long range ($>12$ sequence distance) contacts in the chain. If this value is less than $0.5L$ for a protein of length $L$, we omit it from our training set. We also filter out all proteins with pLDDT $<0.7$.

AlphaFoldDB provides protein structure predictions generated by AlphaFold, and version 4 of the database is used. Because structures with high predicted local distance difference test (pLDDT) scores are disproportionately alpha-helical, a globularity filter is applied: the number of long-range contacts (more than 12 positions apart in sequence) in each chain is counted, and chains with fewer than 0.5 times their length are omitted from the training set. Chains with pLDDT below 0.7 are also removed.
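A sketch of this globularity and confidence filter is shown below, assuming per-residue C-alpha coordinates and pLDDT values are available. The 8 Å contact cutoff and the use of mean pLDDT are assumptions; the text specifies only the >12 sequence-separation requirement, the 0.5·L threshold, and the 0.7 pLDDT filter.

```python
import numpy as np

def passes_filters(ca_coords: np.ndarray, plddt: np.ndarray,
                   min_seq_sep: int = 12, contact_cutoff: float = 8.0) -> bool:
    """Illustrative globularity / confidence filter for a predicted chain.

    `ca_coords` is an (L, 3) array of C-alpha coordinates and `plddt` an (L,)
    array of per-residue confidences in [0, 1].
    """
    length = ca_coords.shape[0]
    if plddt.mean() < 0.7:          # drop low-confidence predictions (mean pLDDT assumed)
        return False
    # Count long-range contacts: pairs far apart in sequence but close in space.
    dists = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    i, j = np.triu_indices(length, k=min_seq_sep + 1)   # sequence distance > 12
    long_range_contacts = int((dists[i, j] < contact_cutoff).sum())
    return long_range_contacts >= 0.5 * length
```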


ESMAtlas is downloaded as versions v0 and v2023_02. Similarly, we use a $<0.7$ pLDDT filter. We also use a $0.7$ pTM cutoff to enforce globularity, since high-pTM structures tend to be more compact.

ESMAtlas is a database of predicted protein structures, and two releases are used: v0 and v2023_02.

As with AlphaFoldDB, structures are filtered by pLDDT (predicted local distance difference test, a per-residue confidence measure), keeping only those with pLDDT of 0.7 or higher.

A pTM cutoff of 0.7 is applied as well. pTM is the predicted template modeling score, a global measure of predicted fold accuracy; structures with high pTM tend to be more compact and globular, which is the property this filter enforces.


Structural data also includes any functional labels that exist for the corresponding sequence.

Structural data refers to the three-dimensional structure of a protein or other biological molecule, which can be determined using techniques such as X-ray crystallography or NMR spectroscopy. This information is important for understanding how the molecule functions and interacts with other molecules. Functional labels are annotations or descriptions of the specific functions or activities of the molecule, such as its role in a metabolic pathway or its ability to bind to a particular ligand. These labels are often based on experimental data or computational predictions and can provide valuable insights into the biological processes that the molecule is involved in. When structural data and functional labels are combined, they can provide a more complete understanding of the molecule's structure and function, which can be useful for drug discovery, protein engineering, and other applications.

For solvent accessible surface area, we use the Shrake-Rupley rolling probe algorithm as implemented in biotite (75). This generates a set of real numbers, or a NaN value when structural coordinates are not provided. Similarly, SS8 labels are generated using the mkdssp tool (76) and taken as ground truth labels.

The Shrake-Rupley rolling probe algorithm is a method used to calculate the solvent accessibility surface area of a protein structure. This algorithm is implemented in the biotite software package and generates a set of real numbers that represent the solvent accessibility surface area. However, if the structural coordinates are not provided, the algorithm returns a "not a number" (NaN) value.

In addition, SS8 labels are generated using the mkdssp tool, which is a program that assigns secondary structure labels to protein structures. These labels are considered to be the ground truth labels for the protein structure.

Overall, the combination of the Shrake-Rupley rolling probe algorithm and the mkdssp tool allows for the accurate calculation of solvent accessibility surface area and the assignment of secondary structure labels to protein structures.
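For illustration, the per-residue SASA computation with biotite might look like the sketch below (the input path is hypothetical); SS8 labels would come from running mkdssp separately on the same structure.

```python
import numpy as np
import biotite.structure as struc
from biotite.structure.io.pdb import PDBFile

# Load a structure (path is illustrative) and keep the amino-acid atoms.
pdb_file = PDBFile.read("example.pdb")
atoms = pdb_file.get_structure(model=1)
atoms = atoms[struc.filter_amino_acids(atoms)]

# Shrake-Rupley rolling-probe SASA per atom, aggregated to per-residue values.
atom_sasa = struc.sasa(atoms)                              # NaN where coordinates are missing
res_sasa = struc.apply_residue_wise(atoms, atom_sasa, np.nansum)
```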


In both cases, we use the set of high quality predicted structures in AlphaFoldDB and ESMAtlas. We split our datasets into structural and sequence data. Structural data is shown separately in order to weight the ratios of structural data (mostly synthetic) properly with the amount of sequence data (mostly real).

In both cases, we are using the AlphaFoldDB and ESMAtlas databases to obtain high quality predicted structures. These structures are then used to create datasets that are split into two categories: structural data and sequence data. The structural data is separated from the sequence data to ensure that the ratios of structural data (which is mostly synthetic) are properly weighted with the amount of sequence data (which is mostly real). This helps to ensure that the resulting analysis is accurate and reliable.


An oversight was that we did not manage to apply these augmentations to PDB. However, since PDB constituted a relatively small portion of our training data, and these structural conditioning tasks did not depend on precise sidechain positions, we reasoned that high confidence synthetic structures would perform equally well and annotation of PDB was not necessary.

The oversight mentioned was that the team did not apply certain augmentations to PDB, which is a small portion of their training data. However, they believed that high confidence synthetic structures would perform equally well and that annotation of PDB was not necessary since the structural conditioning tasks did not depend on precise sidechain positions.

We keep track of validation set performance on a set of held out sequences from each training set, UniRef, MGnify, and JGI. In order to properly hold out a sufficiently diverse set of validation proteins, we first sample 25000 proteins from each set. Then we use mmseqs easy-search to filter out proteins from this set with a $70\%$ sequence identity threshold. We choose the set of proteins from our training set to be the "query" set, and the set of validation proteins as our "target" set for mmseqs. We use the flags --alignment-mode 3 -c 0.8 --cov-mode 0 --max-seqs 300 --max-accept 3 --start-sens 2 -s 7 --sens-steps 3.

To monitor validation performance, 25,000 proteins are sampled from each training source (UniRef, MGnify, and JGI) as candidate held-out sets. mmseqs easy-search is then used to remove candidates that share 70% or more sequence identity with the training data, with the training sequences as the "query" set and the validation candidates as the "target" set. The search uses the flags --alignment-mode 3 -c 0.8 --cov-mode 0 --max-seqs 300 --max-accept 3 --start-sens 2 -s 7 --sens-steps 3.
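A rough sketch of this purge in Python is shown below; the paths, the easy-search subcommand, and the post-hoc identity filtering on the default tabular output are assumptions, while the search flags mirror those listed above.

```python
import subprocess

def purge_validation_set(train_fasta: str, valid_fasta: str, out_tsv: str, tmp_dir: str) -> set:
    """Search training sequences (query) against validation candidates (target)
    and return validation IDs with hits at >= 70% identity."""
    subprocess.run(
        [
            "mmseqs", "easy-search", train_fasta, valid_fasta, out_tsv, tmp_dir,
            "--alignment-mode", "3", "-c", "0.8", "--cov-mode", "0",
            "--max-seqs", "300", "--max-accept", "3",
            "--start-sens", "2", "-s", "7", "--sens-steps", "3",
        ],
        check=True,
    )
    too_similar = set()
    with open(out_tsv) as fh:
        for line in fh:
            cols = line.split("\t")
            target_id, identity = cols[1], float(cols[2])  # assumes default BLAST-tab columns
            if identity >= 0.7:
                too_similar.add(target_id)
    return too_similar

# Validation candidates returned here are dropped from the held-out set.
```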


This query is designed such that early stopping in mmseqs will not affect whether we find a hit in the "query" training set.

In other words, the search direction is chosen so that even if mmseqs terminates a search early (for example, after accepting a fixed number of hits per query), matches between validation candidates and the training set are still identified. The purge therefore does not depend on running an exhaustive search, which keeps the filtering computationally tractable.


Train purges are run to generate a list of blacklisted UniRef, MGnify, and JGI IDs, which are removed from the training set.

Train purges are a decontamination step: they produce a list of blacklisted UniRef, MGnify, and JGI IDs, which are then removed from the training set. Removing these sequences prevents overlap between the training data and the held-out validation proteins, so that validation metrics reflect generalization rather than memorization.


The dataset counts in Table S3 are computed after limiting large clusters to 20 members. The number of tokens is computed by multiplying the number of sequences by the average sequence length of the dataset.

The counts in Table S3 therefore reflect the capped clusters: clusters with more than 20 sequences contribute only 20 of them, rather than being excluded entirely. The token count is then the number of sequences multiplied by the average sequence length, giving a measure of the total amount of data in each dataset.


In order to compute the approximate number of sequences and tokens seen during training, we first compute the number of times the dataset is repeated at the cluster level. Given the number of repeats, we know the expected number of unique samples seen when sampling with replacement is $n\left(1-\left(1-\frac{1}{n}\right)^{k}\right)$ for a cluster of size $n$ and $k$ items selected. Computing this on the size of each cluster and the number of dataset repeats results in the approximate number of tokens presented in Table S4. Our largest model is trained on all of this data, while our smaller models use a portion of it depending on the model's token budget.

To estimate the number of sequences and tokens seen during training, we first need to determine how many times the dataset is repeated at the cluster level. Once we have this information, we can calculate the expected number of unique samples seen when sampling with replacement. This is done using the formula $n\left(1-\left(1-\frac{1}{n}\right)^{k}\right)$, where $n$ is the size of the cluster and $k$ is the number of items selected. By applying this formula to each cluster and the number of dataset repeats, we can obtain an approximate number of tokens presented during training. This information is presented in Table S4. Our largest model is trained on all of the available data, while our smaller models use a portion of it based on their token budget.
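The expected-unique-samples formula is easy to evaluate directly; the cluster size and draw count below are made-up numbers for illustration.

```python
def expected_unique(n: int, k: int) -> float:
    """Expected number of unique cluster members seen after k draws with
    replacement from a cluster of size n: n * (1 - (1 - 1/n)**k)."""
    return n * (1.0 - (1.0 - 1.0 / n) ** k)

# Example: a cluster of 20 members sampled 40 times is expected to
# contribute about 17.4 unique sequences.
print(round(expected_unique(20, 40), 1))  # 17.4
```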


In the masked generative framework, corruption is applied to each input to the model. To enable generation, the amount of noise applied to an input is sampled from a distribution with probability mass on all values between 0 and 1.

In the masked generative framework, each input to the model is corrupted by masking, and the fraction of positions masked is itself sampled from a distribution with support on all values between 0 and 1. Training across this full range of mask rates is what enables generation: the model learns to denoise inputs at every corruption level, including heavily masked ones, so it can iteratively fill in tokens at inference time.

We select various noise schedules for different tracks with several goals in mind. First, ESM3 should see all combinations of tracks as input and output, enabling it to generate and predict based on arbitrary inputs. Second, ESM3 should maintain a balance of strong representation learning and high quality generations. Third, the type of inputs provided should be representative of what users would like to prompt the model with.

In initial experimentation, we found that a fixed $15 \%$ noise schedule led to poor generation results, while a linear noise schedule where probability of each mask rate was constant led to good generation but poor representation learning results. We find a good trade-off between representation learning and generation by sampling the noise schedule from a mixture distribution. $80 \%$ of the time, the mask rate is sampled from a $\beta(3,9)$ distribution with mean mask rate $25 \%$. $20 \%$ of the time, the mask rate is sampled from a uniform distribution, resulting in an average overall mask rate of $30 \%$.

The goal of selecting various noise schedules for different tracks is to ensure that the model, ESM3, is able to generate and predict based on arbitrary inputs. This is achieved by exposing the model to all combinations of tracks as input and output. Additionally, the noise schedules are designed to maintain a balance between strong representation learning and high quality generations.

In order to achieve this balance, the team experimented with different noise schedules. They found that a fixed 15% noise schedule led to poor generation results, while a linear noise schedule with constant probability of each mask rate led to good generation but poor representation learning results.

To address this issue, the team decided to sample the noise schedule from a mixture distribution. 80% of the time, the mask rate is sampled from a beta distribution with a mean mask rate of 25%. The remaining 20% of the time, the mask rate is sampled from a uniform distribution, resulting in an average overall mask rate of 30%.

This approach allows the model to maintain a balance between representation learning and generation, while also ensuring that the type of inputs provided are representative of what users would like to prompt the model with.
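A minimal sketch of sampling the mask rate from this mixture is shown below; it also checks the roughly 30% average mask rate empirically.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mask_rate() -> float:
    """Sample a mask rate from the mixture described above:
    80% of the time Beta(3, 9) (mean 0.25), 20% of the time Uniform(0, 1).
    The overall mean is 0.8 * 0.25 + 0.2 * 0.5 = 0.30."""
    if rng.random() < 0.8:
        return float(rng.beta(3.0, 9.0))
    return float(rng.uniform(0.0, 1.0))

# Empirical check of the ~30% average mask rate.
rates = [sample_mask_rate() for _ in range(100_000)]
print(round(float(np.mean(rates)), 3))  # ~0.30
```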


The noise schedules applied to each input are listed in Table S6. For the structure coordinate track, we also modify the noise to be applied as span dropping, as opposed to i.i.d over the sequence with $50 \%$ probability. This ensures that the model sees contiguous regions of masked and provided coordinates, which better mimics the types of inputs users may provide.
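One plausible reading of this scheme is sketched below: with 50% probability the structure-coordinate mask is built from contiguous spans rather than independent positions. The span-length distribution is an assumption; only the 50% split and the span-versus-i.i.d. distinction come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def structure_coordinate_mask(length: int, mask_rate: float) -> np.ndarray:
    """Boolean mask over residue positions (True = coordinates hidden)."""
    if rng.random() < 0.5:
        # i.i.d. masking: each position is dropped independently.
        return rng.random(length) < mask_rate
    # Span dropping: drop random contiguous spans until the target rate is reached,
    # so the model sees contiguous regions of provided and masked coordinates.
    mask = np.zeros(length, dtype=bool)
    while mask.mean() < mask_rate:
        span = int(rng.integers(1, max(2, length // 4)))        # assumed span-length distribution
        start = int(rng.integers(0, max(1, length - span + 1)))
        mask[start:start + span] = True
    return mask
```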


Along with applying noise to each track, we want to ensure ESM3 is able to perform well when some tracks are not provided at all (e.g. to perform structure prediction when no structure is provided as input). We enable this by wholly dropping out some tracks with varying probabilities, listed in Table S6.

In addition to adding noise within each track, whole tracks are dropped out entirely with the probabilities listed in Table S6. This teaches ESM3 to make predictions when some modalities are missing altogether, for example predicting structure when only sequence is provided, which is a common scenario at inference time.


We apply Gaussian noise with standard deviation 0.1 to all coordinates the model takes as input.

Gaussian noise is random noise drawn from a normal distribution. Here, each input coordinate is independently perturbed by a sample from a Gaussian with standard deviation 0.1, so most perturbations are small relative to typical interatomic distances and the overall structure is essentially preserved.

This small amount of randomness prevents the model from relying on exact coordinate values, acting as a regularizer that improves robustness to minor structural variation without materially altering the inputs.


An interesting use case of generative protein models involves conditioning on key structural information, such as an active site, and generating the sequence and structure of a protein that contains this information. It is possible to define an atomic coordination task as 3 residues which are mutually in contact in structure space ( $C \alpha-C \alpha$ distance $<6 \AA$ ), but are distant in sequence space ( $\geq 10$ positions apart) (23). Training on this conditioning may enable the model to better perform the type of atomic coordination required for active site sampling.

This is a technique used in generative protein models where the model is trained to generate protein sequences and structures that contain specific structural information, such as an active site. The model is conditioned on this information, which means that it is given a set of constraints or parameters to work with.

In this case, the model is trained to generate sequences and structures that contain atomic coordination tasks, which are defined as three residues that are in contact in structure space but distant in sequence space. By training the model on this specific task, it can better perform the type of atomic coordination required for active site sampling.

Overall, this technique allows for more efficient and accurate protein modeling, which can have important implications for drug discovery and other applications in the field of biochemistry.
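A brute-force sketch of finding such coordination triplets from C-alpha coordinates is shown below; it is illustrative rather than an optimized or official implementation.

```python
import itertools
import numpy as np

def find_coordination_triplets(ca_coords: np.ndarray,
                               max_dist: float = 6.0,
                               min_seq_sep: int = 10) -> list:
    """Find residue triplets that are mutually in contact in structure space
    (C-alpha distance < 6 angstroms) but mutually distant in sequence
    (>= 10 positions apart), as defined above."""
    dists = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    contacts = dists < max_dist
    n = len(ca_coords)
    triplets = []
    # Brute force over all ordered triplets i < j < k.
    for i, j, k in itertools.combinations(range(n), 3):
        if (j - i >= min_seq_sep and k - j >= min_seq_sep
                and contacts[i, j] and contacts[i, k] and contacts[j, k]):
            triplets.append((i, j, k))
    return triplets
```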


While this task will be sampled with some probability under the standard noise schedules, we also manually sample the task with $5 \%$ probability whenever a structure is available. If the task is sampled and a valid atomic coordination triplet is found, the structure coordinates for that triplet are shown to the model. For each residue in the triplet, the adjacent residues are also independently shown with $50 \%$ probability, which leads to a total size of between 3 and 9 residues. All other structure coordinates are masked. Normal masking is applied to the other tracks.

Although the standard noise schedules will occasionally produce this situation on their own, the atomic coordination task is also sampled explicitly with 5% probability whenever a structure is available.

When the task is sampled and a valid coordination triplet is found, the structure coordinates of those three residues are shown to the model. For each residue in the triplet, its adjacent residues are also revealed independently with 50% probability, so the revealed region spans between 3 and 9 residues in total.

All other structure coordinates are masked, while the remaining tracks receive the normal noise schedule rather than being hidden entirely.


Predicting and generating binding interfaces is another important task for generative protein models. To help with this capability, we add computational data augmentation that simulates the binding interface task.

Predicting and generating binding interfaces is a key capability for generative protein models. To support it, a computational data augmentation is added that simulates the binding interface task: existing single-chain structures are used to construct training examples that resemble two interacting chains, as described below. This expands the training distribution toward interface-style prompts without requiring additional experimental data on complexes.


We define a tertiary interface as one involving a long range contact $(C \alpha-C \alpha$ distance $<8 \AA, \geq 24$ sequence positions). When this task is sampled ( $5 \%$ probability whenever a structure is present), a long range contact is found, then the chain is split into two chains, each containing one side of the contact interface. Suppose the contacting positions are given by the indices $i, j$. Then the first chain will contain residues between [RANDINT $(1, i-3)$, RANDINT $(i+3, j-15)$ ], while the second chain will contain residues between [RANDINT $(i+15, j-3)$, RANDINT $(j+15, L)$ ]. This ensures there is always a residue gap between the two pseudochains. A chainbreak token "-" is inserted to represent the residue gap.

A tertiary interface is a type of protein-protein interaction that involves a long range contact between two amino acid residues, with a distance of less than 8 angstroms and at least 24 sequence positions apart. When this task is sampled, a long range contact is found and the chain is split into two chains, each containing one side of the contact interface. The contacting positions are given by the indices $i, j$, and the first chain contains residues between [RANDINT $(1, i-3)$, RANDINT $(i+3, j-15)$ ], while the second chain contains residues between [RANDINT $(i+15, j-3)$, RANDINT $(j+15, L)$ ]. A chainbreak token "-" is inserted to represent the residue gap between the two pseudochains.
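The chain-splitting step could be sketched as follows, implementing the 1-based inclusive RANDINT ranges exactly as written above; bounds checking and edge cases are omitted, and the helper is illustrative only.

```python
import random

def split_at_interface(sequence: str, i: int, j: int) -> str:
    """Split a chain of length L at a long-range contact (i, j) into two
    pseudochains joined by a chainbreak token, following the index ranges
    given above (1-based, inclusive)."""
    L = len(sequence)
    start1 = random.randint(1, i - 3)
    end1 = random.randint(i + 3, j - 15)
    start2 = random.randint(i + 15, j - 3)
    end2 = random.randint(j + 15, L)
    chain1 = sequence[start1 - 1:end1]   # convert 1-based inclusive ranges to slices
    chain2 = sequence[start2 - 1:end2]
    return chain1 + "-" + chain2         # "-" is the chainbreak token
```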


To encourage the model to learn to represent residue gaps using the chainbreak token, we introduce a task which randomly splits a single chain into multiple subchains.

To strengthen the model's ability to represent residue gaps with the chainbreak token, a task is introduced that randomly splits a single chain into multiple subchains. This exposes the model to inputs with explicit gaps, which commonly arise when only parts of a protein are provided or resolved, so it learns to handle such prompts.


First, the number of chains to sample is drawn from a geometric distribution with probability 0.9, up to a maximum of 9 possible chains. If the number of chains sampled is 1, no additional transformations are applied. A minimum separation of 10 residues between chains is defined. Sequence lengths of the chains, along with the gaps between them, are sampled from a Dirichlet distribution to maintain identically distributed sequence lengths for each chain. This transformation is applied to all samples.

The process begins by sampling the number of chains from a geometric distribution with probability 0.9, capped at 9, so a single chain is by far the most common outcome. If only one chain is sampled, no further transformation is applied. Otherwise, the chain is cut into subchains: a minimum separation of 10 residues is enforced between them, and the subchain lengths together with the gap lengths are drawn from a Dirichlet distribution so that every chain's length follows the same distribution. This transformation is applied to all samples.
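The sketch below shows one plausible reading of this sampling procedure; the way the Dirichlet fractions are assigned alternately to chains and gaps is an assumption, while the geometric distribution, the cap of 9 chains, and the 10-residue minimum gap come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_subchain_layout(total_length: int, min_gap: int = 10) -> list:
    """Sample a pseudo-multimer split of a chain into subchains.

    Number of chains ~ Geometric(0.9), capped at 9; subchain and gap lengths
    are drawn from a flat Dirichlet so chain lengths are identically
    distributed. Returns (start, end) residue index pairs and assumes the
    chain is long enough to accommodate the minimum gaps.
    """
    n_chains = min(int(rng.geometric(0.9)), 9)
    if n_chains == 1:
        return [(0, total_length)]
    n_gaps = n_chains - 1
    budget = total_length - min_gap * n_gaps              # residues left after fixed minimum gaps
    fractions = rng.dirichlet(np.ones(n_chains + n_gaps))
    segments = np.maximum(1, np.floor(fractions * budget).astype(int))
    spans, pos = [], 0
    for idx, seg in enumerate(segments):
        if idx % 2 == 0:                                  # even-indexed segments are chains
            end = min(pos + int(seg), total_length)
            spans.append((pos, end))
            pos = end
        else:                                             # odd-indexed segments are gaps (plus the minimum gap)
            pos += int(seg) + min_gap
    return spans
```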


In the case that multiple chains are provided to the model from either the interface sampling or pseudo-multimer augmentation tasks, we mask the geometric attention layer to prevent the model from attending to cross-chain coordinates. This simulates tasks where the structure of individual chains is known, but the interface is unknown.

In situations where multiple chains are provided to the model, either through interface sampling or pseudo-multimer augmentation, we use a masking technique to prevent the model from attending to coordinates that cross over different chains. This is done to simulate tasks where the structure of individual chains is known, but the interface between them is not. By masking the geometric attention layer, we ensure that the model only attends to coordinates within the same chain, which helps to improve the accuracy of the model's predictions.
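Conceptually, the cross-chain restriction amounts to a block-diagonal attention mask derived from per-residue chain IDs, as in the sketch below; this illustrates the masking idea rather than the model's actual implementation.

```python
import torch

def within_chain_attention_mask(chain_ids: torch.Tensor) -> torch.Tensor:
    """Build a boolean mask that allows geometric attention only within a chain.

    `chain_ids` is a (L,) integer tensor assigning each residue to a chain.
    Returns an (L, L) tensor where True marks allowed (same-chain) pairs;
    cross-chain positions are masked out.
    """
    return chain_ids[:, None] == chain_ids[None, :]

# Example: two pseudochains of lengths 3 and 2.
chain_ids = torch.tensor([0, 0, 0, 1, 1])
mask = within_chain_attention_mask(chain_ids)
# Attention scores would then be set to -inf wherever ~mask before the softmax.
```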


We train all models using the AdamW optimizer (77), with the following hyperparameters: $\beta_1=0.9$, $\beta_2=0.95$. We use a weight decay of 0.01 and gradient clipping of 1.0. We employ 5K to 20K warmup steps until reaching the maximum learning rate, and utilize a cosine decay scheduler to decay LR to $10\%$ of the maximum learning rate by the end of training.

AdamW is a variant of Adam that decouples weight decay from the gradient update, which helps regularize large models. The hyperparameter $\beta_1$ controls the exponential moving average of past gradients (the first moment) and $\beta_2$ controls the moving average of past squared gradients (the second moment). A weight decay of 0.01 penalizes large weights, and gradient clipping at a norm of 1.0 prevents exploding gradients. Warmup steps increase the learning rate gradually from zero to its maximum, which stabilizes early training, and the cosine scheduler then decays the learning rate to 10% of the maximum by the end of training, helping the model settle into a good minimum.
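A PyTorch sketch of this optimizer and schedule is shown below; only the hyperparameter values ($\beta_1$, $\beta_2$, weight decay, clipping, warmup, 10% final LR) come from the text, and the scheduler wiring is illustrative.

```python
import math
import torch

def build_optimizer_and_scheduler(model: torch.nn.Module, max_lr: float,
                                  warmup_steps: int, total_steps: int):
    """AdamW with the hyperparameters listed above, plus linear warmup and a
    cosine decay to 10% of the maximum learning rate."""
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=max_lr, betas=(0.9, 0.95), weight_decay=0.01
    )

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:                       # linear warmup to max_lr
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
        return 0.1 + 0.9 * cosine                     # decay from 1.0 to 0.1 of max_lr

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# In the training loop, gradients are clipped before each optimizer step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```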


Our training codebase uses PyTorch. We use PyTorch's FSDP (78) implementation for data parallelism. We also use custom components from the TransformerEngine (79) library.

The training stack is therefore built on PyTorch, with Fully Sharded Data Parallel (FSDP) handling data-parallel training across devices and custom TransformerEngine components used where needed.

We have made several optimizations to increase the training speed of our models. For multi-head attention, we use the memory efficient implementation from the xformers library (80). We also save activations that are expensive to compute during training when necessary. We employ mixed precision training, utilizing FP8, BF16, and FP32 as needed based on accuracy requirements and kernel availability throughout our network.

We have implemented various techniques to enhance the training speed of our models. For multi-head attention, we utilize the memory-efficient implementation provided by the xformers library (80). Additionally, we cache activations that are computationally expensive to generate during training when required. We also use mixed precision training, which involves the use of FP8, BF16, and FP32 as necessary based on accuracy requirements and kernel availability throughout our network.


Scaling ESM3 to 98 billion parameters with its novel architecture, multi-modal inputs, and low precision computation requirements poses significant training stability challenges. Our model is significantly deeper than its NLP counterparts, and literature has shown that deeper networks are harder to train due to attention collapse (81).

This combination of scale, depth, novel architecture, multi-modal conditioning, and low-precision arithmetic makes stable training non-trivial. Deeper networks are more prone to instabilities such as attention collapse, vanishing or exploding gradients, and divergence during optimization, and these risks grow with parameter count.

Multi-modal inputs add further complexity because the model must integrate information across tracks, and low-precision formats introduce additional numerical noise into the optimization. Together these factors mean that training a 98-billion-parameter ESM3 requires deliberate measures to maintain stability, which are described next.

We observed training instability early in the architectural innovation phase, which we addressed through several changes. We apply layer normalization to the query and key vectors within the attention mechanism (82). We observe longer warm up helps (83). Another source of instability is the masking rate in pre-training tasks. We found that a very high masking rate is more likely to cause training divergences than a lower one, especially early in the training. Choosing a masking schedule biased towards lower mask rates improved both performance and training stability. Interestingly, the introduction of conditioning from other modalities also improves training stability, perhaps suggesting that stability is related to the degree of underspecification of a task.

During the architectural innovation phase, we encountered training instability, which we resolved through various modifications. One of the changes we made was to apply layer normalization to the query and key vectors within the attention mechanism. We also found that a longer warm-up period was beneficial. Additionally, we discovered that the masking rate in pre-training tasks was a source of instability. We found that a high masking rate was more likely to cause training divergences, especially early in the training. Therefore, we chose a masking schedule that favored lower mask rates, which improved both performance and training stability. Interestingly, we found that introducing conditioning from other modalities also improved training stability, which may suggest that stability is related to the degree of underspecification of a task.


An incorrectly set learning rate is another source of instability. To ensure the right balance between learning effectiveness and stability, we optimized the learning rate on smaller models and scaled it according to best practices as outlined in (84, 85). We find empirically that the initialization has a small effect on model stability, and the majority of stabilization can be gained from simply scaling the learning rate at the appropriate rate. By applying the rules in both width-$\mu$P and depth-$\mu$P, we can simply scale the learning rate inversely proportional to the square root of the number of parameters, and find this results in stable training.

The learning rate is a crucial hyperparameter, and setting it incorrectly can destabilize training. It was therefore tuned on smaller models and transferred to larger ones using the $\mu$P (maximal update parametrization) scaling rules for both width and depth. Empirically, initialization had only a small effect on stability; most of the benefit came from scaling the learning rate correctly, namely inversely proportional to the square root of the number of parameters, which yielded stable training.
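The square-root rule can be expressed as a one-line rescaling; the base model size and learning rate below are made-up values for illustration.

```python
import math

def scaled_learning_rate(base_lr: float, base_params: int, target_params: int) -> float:
    """Scale a learning rate tuned on a smaller model to a larger one, keeping
    lr proportional to 1 / sqrt(number of parameters)."""
    return base_lr * math.sqrt(base_params / target_params)

# Example: a rate tuned on a 1.4B-parameter model, rescaled for 98B parameters.
print(scaled_learning_rate(1e-3, 1_400_000_000, 98_000_000_000))  # ~1.2e-4
```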


Following these modifications, we successfully trained our 98-billion-parameter model without any issues related to training instability.

Following these modifications, the 98-billion-parameter model trained without instability issues, indicating that the combination of query/key layer normalization, longer warmup, a masking schedule biased toward lower mask rates, multi-modal conditioning, and appropriately scaled learning rates was sufficient to keep optimization stable at this scale.

We stage training to alter dataset composition, train on longer contexts that would be too expensive for the entire pre-training, or introduce features such as the taxonomy track (A.1.9.2).

