doi.bio/esm3/esm3.a1.out

The forward pass of ESM3 is then defined as

$$F_i = \mathrm{Embed}\left(F_{i-1} \odot M_{i-1}\right) + \mathrm{MLP}\left(F_{i-1} \odot M_{i-1}\right)$$

$$M_i = \sigma\left(W_m F_{i-1} + b_m\right)$$

where $F_0$ is the input feature vector, $M_0$ is the initial mask, $\odot$ is elementwise multiplication, $\sigma$ is the sigmoid function, and $W_m$ and $b_m$ are learned parameters. The mask is initialized to all ones, and the embedding and MLP weights are shared across all tracks.
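As a concrete reading of these equations, here is a minimal PyTorch sketch of one masked update step. The use of `nn.Linear` for Embed, the MLP width, and all shapes are assumptions for illustration, not the actual ESM3 implementation.

```python
import torch
import torch.nn as nn

class MaskedBlock(nn.Module):
    """Sketch of F_i = Embed(F_{i-1} ⊙ M_{i-1}) + MLP(F_{i-1} ⊙ M_{i-1}),
    M_i = sigmoid(W_m F_{i-1} + b_m). Module choices and widths are assumptions."""

    def __init__(self, d_model: int):
        super().__init__()
        self.embed = nn.Linear(d_model, d_model)      # stands in for Embed
        self.mlp = nn.Sequential(                     # stands in for MLP
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.mask_proj = nn.Linear(d_model, d_model)  # W_m, b_m

    def forward(self, f_prev: torch.Tensor, m_prev: torch.Tensor):
        gated = f_prev * m_prev                       # elementwise ⊙
        f_next = self.embed(gated) + self.mlp(gated)
        m_next = torch.sigmoid(self.mask_proj(f_prev))
        return f_next, m_next

# usage: F0 is the input feature tensor, M0 is initialized to all ones
f, m = torch.randn(1, 128, 256), torch.ones(1, 128, 256)
block = MaskedBlock(256)
f, m = block(f, m)
```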

  1. The input to the structure encoder is a set of coordinates $x_{C_{\alpha}} \in \mathbb{R}^{L \times 3}$ and a set of transformations $T \in SE(3)^{L}$.
  2. The first step is to find the nearest neighbors of each residue using a k-nearest neighbor algorithm, which yields a set of indices $N_{\mathrm{idx}}=\operatorname{knn}\left(x_{C_{\alpha}}\right) \quad \triangleright \{0..L-1\}^{L \times 16}$.
  3. The next step is to gather the transformations of the nearest neighbors from the input set, which yields $T_{\mathrm{knn}}=T\left[N_{\mathrm{idx}}\right] \quad \triangleright SE(3)^{L \times 16}$.
  4. The next step is to compute the relative sequence position of each neighbor, clamped to $\pm 32$, which yields a set of offsets $\Delta_{i}=\operatorname{clamp}\left(N_{\mathrm{idx}}-i,-32,32\right)$.
  5. The clamped offsets are then embedded into a higher-dimensional space, which yields a set of embeddings $N=\operatorname{embed}(\Delta_{i})$ (see the sketch after this list).
  6. The embeddings are then passed through a shallow encoder consisting of 2 Transformer blocks, with regular multihead self-attention swapped with geometric_mha. The attention is unmasked, all-to-all over the entire neighborhood.
  7. The first element of the encoded neighborhood, which corresponds to the residue itself, is extracted, projected linearly, and quantized by replacing it with the nearest vector in a codebook, which yields the structure token for that residue.
  8. The codebook is learned as an exponential moving average of encoder outputs, and unused codes are re-initialized to encoder outputs to improve codebook utilization.
  9. To improve training and inference efficiency, all local structure graphs within a protein are encoded in parallel.
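
A minimal PyTorch sketch of steps 2, 4, and 5 above (k-nearest neighbors on the Cα coordinates, clamped relative sequence offsets, and offset embedding). The k = 16 neighborhood and the ±32 clamp come from the text; the embedding width and helper names are assumptions.

```python
import torch
import torch.nn as nn

def knn_neighborhoods(x_ca: torch.Tensor, k: int = 16) -> torch.Tensor:
    """x_ca: [L, 3] Cα coordinates -> [L, k] indices of the k nearest residues
    (each residue has distance 0 to itself, so it appears first in its own row)."""
    dists = torch.cdist(x_ca, x_ca)                  # [L, L] pairwise distances
    return dists.topk(k, largest=False).indices      # [L, k] neighbor indices

L, k, d = 100, 16, 64                                # d is an assumed embedding width
x_ca = torch.randn(L, 3)

n_idx = knn_neighborhoods(x_ca, k)                               # step 2: N_idx
delta = torch.clamp(n_idx - torch.arange(L)[:, None], -32, 32)   # step 4: clamped offsets
offset_embed = nn.Embedding(65, d)                   # offsets in [-32, 32] map to 65 ids
n_embed = offset_embed(delta + 32)                   # step 5: [L, k, d] embeddings
```
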
  1. Apply Frames: The algorithm applies the frames to coordinates in a local reference frame, rotating and translating each residue into its final position.
  2. Structure Decode: The algorithm decodes the structure of the protein using a VQ-VAE model.
  3. Training: The VQ-VAE model is trained in two stages, with a small decoder in the first stage and a larger decoder in the second stage.
  4. Autoencoder: The VQ-VAE model is an autoencoder that learns discrete representations to maximize reconstruction quality (see the sketch after this list).
  5. Encoder and Codebook: The encoder and codebook are learned in the first stage of training.
  6. Decoder: A larger or more computationally expensive decoder is trained in the second stage of training.
  7. Two-Stage Training: The two-stage training approach is commonly used for VQ-VAE models to learn discrete representations that maximize reconstruction quality.
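
A minimal sketch of the vector-quantization step at the heart of this autoencoder: each per-residue encoder output is replaced by its nearest codebook vector, producing a discrete structure token. The codebook size and dimension are assumptions, and the EMA codebook update and re-initialization of unused codes described earlier are omitted for brevity.

```python
import torch

def quantize(z: torch.Tensor, codebook: torch.Tensor):
    """Replace each encoder output z[i] with its nearest codebook vector.

    z:        [L, d] per-residue encoder outputs
    codebook: [K, d] learned code vectors (updated by EMA during training)
    returns:  quantized vectors [L, d] and structure-token indices [L]
    """
    d2 = torch.cdist(z, codebook)          # [L, K] distances to every code
    tokens = d2.argmin(dim=-1)             # structure token per residue
    return codebook[tokens], tokens

codebook = torch.randn(4096, 128)          # K = 4096, d = 128 are assumptions
z = torch.randn(100, 128)
z_q, tokens = quantize(z, codebook)
```
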
3. Sidechain Direction Loss: Compute three vectors for both predicted and ground truth coordinates for each residue:
(a) $C_{\alpha} \rightarrow C$

(b) $C \rightarrow N_{\text {next }}$

(c) $\mathbf{n}_{C}=\left(C_{\text {prev }} \rightarrow N\right) \times\left(N \rightarrow C_{\alpha}\right)$

Compute the pairwise dot products, forming $D_{\text{pred}}, D \in \mathbb{R}^{3L \times 3L}$. Compute $\left(D_{\text{pred}}-D\right)^{2}$, clamp the maximum error to 20, and take the mean.

In algorithm form (with compute_vectors computing the three vectors described above):

Algorithm 12 sidechain_direction_loss

Input: $\hat{X} \in \mathbb{R}^{L \times 3 \times 3}, X \in \mathbb{R}^{L \times 3 \times 3}$

$\hat{V}=\operatorname{compute\_vectors}(\hat{X}) \quad \triangleright \mathbb{R}^{3L \times 3}$
$V=\operatorname{compute\_vectors}(X) \quad \triangleright \mathbb{R}^{3L \times 3}$
$\left[D_{\text{pred}}\right]_{i,j}=[\hat{V}]_{i,:} \cdot [\hat{V}]_{j,:} \quad \triangleright \mathbb{R}^{3L \times 3L}$
$[D]_{i,j}=[V]_{i,:} \cdot [V]_{j,:} \quad \triangleright \mathbb{R}^{3L \times 3L}$
$E=\left(D_{\text{pred}}-D\right)^{2}$
$E=\min(E, 20)$
$l=\operatorname{mean}_{i,j}(E) \quad \triangleright \mathbb{R}$
return $l$
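A hedged PyTorch sketch of Algorithm 12, including a compute_vectors helper for the three vectors listed above. The (N, Cα, C) atom ordering and the wrap-around handling of terminal residues are assumptions of this sketch.

```python
import torch

def compute_vectors(x: torch.Tensor) -> torch.Tensor:
    """x: [L, 3, 3] backbone coordinates, assumed ordered (N, Cα, C) per residue.
    Returns the three per-residue vectors from the list above, stacked into [3L, 3]."""
    n, ca, c = x[:, 0], x[:, 1], x[:, 2]
    v_a = c - ca                                             # (a) Cα -> C
    v_b = torch.roll(n, shifts=-1, dims=0) - c               # (b) C -> N_next
    v_c = torch.cross(n - torch.roll(c, shifts=1, dims=0),   # (c) (C_prev -> N) x (N -> Cα)
                      ca - n, dim=-1)
    # Terminal residues wrap around here; a real implementation would mask them.
    return torch.cat([v_a, v_b, v_c], dim=0)                 # [3L, 3]

def sidechain_direction_loss(x_pred: torch.Tensor, x_true: torch.Tensor) -> torch.Tensor:
    v_pred, v_true = compute_vectors(x_pred), compute_vectors(x_true)
    d_pred = v_pred @ v_pred.T                               # [3L, 3L] pairwise dot products
    d_true = v_true @ v_true.T
    err = (d_pred - d_true) ** 2                             # squared error
    return err.clamp(max=20).mean()                          # clamp at 20, then average

# usage on random coordinates
loss = sidechain_direction_loss(torch.randn(100, 3, 3), torch.randn(100, 3, 3))
```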

4. Ramachandran Plot Loss: Compute the Ramachandran plot for each residue, and compute the dot product between the predicted and ground truth Ramachandran plots.

In algorithm form:

Algorithm 13 ramachandran_plot_loss

Input: $\hat{X} \in \mathbb{R}^{L \times 3 \times 3}, X \in \mathbb{R}^{L \times 3 \times 3}$

$\hat{V}=\operatorname{compute\_vectors}(\hat{X}) \quad \triangleright \mathbb{R}^{6L \times 3}$
$V=\operatorname{compute\_vectors}(X) \quad \triangleright \mathbb{R}^{6L \times 3}$
$\left[D_{\text{pred}}\right]_{i,j}=[\hat{V}]_{i,:} \cdot [\hat{V}]_{j,:} \quad \triangleright \mathbb{R}^{6L \times 6L}$
$[D]_{i,j}=[V]_{i,:} \cdot [V]_{j,:} \quad \triangleright \mathbb{R}^{6L \times 6L}$
$E=\left(D_{\text{pred}}-D\right)^{2}$
$E=\min(E, 20)$
$l=\operatorname{mean}_{i,j}(E) \quad \triangleright \mathbb{R}$
return $l$

5. Secondary Structure Loss: Compute the secondary structure for each residue, and compute the cross entropy loss between the predicted and ground truth secondary structure.

In algorithm form:

Algorithm 14 secondary_structure_loss

Input: $\hat{X} \in \mathbb{R}^{L \times 3 \times 3}, X \in \mathbb{R}^{L \times 3 \times 3}$

$\hat{V}$
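A minimal sketch of the per-residue secondary-structure cross-entropy described in point 5; the 8-state alphabet and the origin of the logits are assumptions.

```python
import torch
import torch.nn.functional as F

L, n_classes = 100, 8                            # 8-state alphabet is an assumption
ss_logits = torch.randn(L, n_classes)            # stand-in for predicted per-residue logits
ss_labels = torch.randint(0, n_classes, (L,))    # stand-in for ground-truth classes

loss = F.cross_entropy(ss_logits, ss_labels)     # per-residue cross entropy, averaged
```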

  1. The process of computing unit vectors from ground truth coordinates involves computing three vectors per residue: $C_{\alpha} \rightarrow C$, $C_{\alpha} \rightarrow N$, and $\mathbf{n}_{C_{\alpha}}=\left(C_{\alpha} \rightarrow C\right) \times\left(C_{\alpha} \rightarrow N\right)$.

  2. The process of computing pairwise dot products between each pair of vectors for all residues involves forming a matrix $D \in[-1,1]^{L \times L \times 6}$ and binning the dot products into 16 evenly spaced bins in $[-1,1]$, forming classification labels $y \in\{0..15\}^{L \times L \times 6}$.

  3. The process of computing pairwise logits involves passing the final layer representations of the decoder $h \in \mathbb{R}^{L \times d}$ through a pairwise_proj_head to obtain logits $z \in \mathbb{R}^{L \times L \times 6 \times 16}$ (see the sketch after this list).

  4. The Distogram Loss involves calculating $C_{\beta}$ from the ground truth $N$, $C_{\alpha}$, and $C$ coordinates, binning the pairwise $C_{\beta}$ distances to form targets, and computing a cross-entropy between these targets and the pairwise logits.
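A hedged sketch of the binning and cross-entropy in points 2-3 above. It treats the pairwise dot-product tensor $D$ as given (forming its six channels follows points 1-2) and uses random stand-ins for $D$ and the pairwise logits.

```python
import torch
import torch.nn.functional as F

L = 100
# D: pairwise dot products in [-1, 1] with 6 channels per residue pair, as in point 2.
# A random stand-in is used here purely for illustration.
D = torch.rand(L, L, 6) * 2 - 1

# Bin into 16 evenly spaced bins over [-1, 1], giving labels y in {0..15}.
edges = torch.linspace(-1, 1, 17)[1:-1]            # 15 interior bin boundaries
y = torch.bucketize(D, edges)                      # [L, L, 6] integer labels

# Pairwise logits, e.g. from a pairwise projection head over decoder states (point 3).
z = torch.randn(L, L, 6, 16)                       # stand-in for real logits
loss = F.cross_entropy(z.reshape(-1, 16), y.reshape(-1))
```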

  1. $z_{0}=\operatorname{argmax}_{z}\, \pi\left(z ; f_{\text{sequence}}(x)\right)$
  2. $x_{0}=z_{0}$
  3. $p_{0}=T_{0}$
  4. $\text{for } i=1 \text{ to } n_{\text{decode}}-1$
  5. $\quad z_{i}=\operatorname{argmax}_{z}\, \pi\left(z ; f_{\text{sequence}}\left(x \mid x_{i-1}\right)\right)$
  6. $\quad x_{i}=z_{i}$
  7. $\quad p_{i}=T_{i}$
  8. $\text{return } x_{n_{\text{decode}}}$

In the above algorithm, $x_{i}$ is the $i$th token generated, $z_{i}$ is the logit vector of the $i$th token, and $p_{i}$ is the temperature used to generate the $i$th token. The prompt $x$ is updated with $x_{i-1}$ for the next token generation.

In the case of entropy decoding, $T_{i}$ is set to $1 / \operatorname{logitmax}\left(f_{\text{sequence}}\left(x \mid x_{i-1}\right)\right)$, where $\operatorname{logitmax}(\cdot)$ is the maximum logit of the input. In the case of max logit decoding, $T_{i}$ is set to $1 / \max \pi\left(z ; f_{\text{sequence}}\left(x \mid x_{i-1}\right)\right)$.
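A hedged Python sketch of the decoding loop and the two temperature schedules above. The loop commits the argmax token at each step, as in the algorithm as written; the interface of f_sequence and the prompt representation are assumptions.

```python
import torch

def iterative_decode(f_sequence, x, n_decode: int, schedule: str = "entropy"):
    """Sketch of the loop above: score the current prompt, derive a temperature
    from the logits, and commit the highest-scoring token at each step.

    f_sequence(x) -> 1-D logits over the vocabulary; its interface is an assumption.
    """
    tokens, temps = [], []
    for i in range(n_decode):
        logits = f_sequence(x)                          # π(·; f_sequence(x | x_{i-1}))
        if schedule == "entropy":
            t = 1.0 / logits.max().item()               # T_i = 1 / logitmax(...)
        else:  # "max_logit"
            t = 1.0 / logits.softmax(-1).max().item()   # T_i = 1 / max π(...)
        z = logits.argmax(-1)                           # greedy pick, as in the loop above
        tokens.append(z)
        temps.append(t)                                 # p_i = T_i
        x = torch.cat([x, z.view(1)])                   # prompt extended with x_i
    return torch.stack(tokens), temps

# usage with a dummy scorer over a 32-token vocabulary
out, temps = iterative_decode(lambda x: torch.randn(32),
                              x=torch.zeros(1, dtype=torch.long), n_decode=5)
```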

In the case of $n_{\text{decode}}=L$, the above algorithm is equivalent to argmax decoding. In the case of $n_{\text{decode}}=1$, the above algorithm is equivalent to iterative decoding with $T=1$.
