
Researchers from Apple have released SimpleFold, a protein structure prediction model built exclusively from standard Transformer layers. The results suggest that SimpleFold is slightly less accurate than methods such as AlphaFold2, but much faster and easier to integrate into standard LLM-like workflows. SimpleFold also shows very good scaling behaviour, in line with other Transformer models like ESM2. So what is powering this seemingly simple development?
What is simple about SimpleFold?
Compared to AlphaFold2, SimpleFold essentially drops two complex architectural pieces. The first, following AlphaFold3 (and described in a previous blog post), is to replace Invariant Point Attention (IPA) blocks with standard attention blocks. As a result, the structure module is no longer guaranteed to be SE(3)-invariant, and all relevant symmetries must be learned by the Transformer (more on that later).
The second, perhaps more radical, simplification is that SimpleFold drops all pairwise processing. All comparisons between tokens must therefore be done with attention alone: every edge in the graph must be computable from inner products of the nodes and aggregated using softmax and AdaNorm. Architecturally, this is a big concession: in the ESMFold ablations, the authors found that removing pairwise processing was even more detrimental than using a trivial structure module. However, dropping it comes with a huge benefit, since standard attention can be computed extremely efficiently.
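To make the distinction concrete, here is a minimal NumPy sketch (illustrative only, not the actual SimpleFold or AlphaFold code) of plain attention versus attention with a pair-representation bias. The explicit pair tensor is exactly the piece SimpleFold drops:

```python
import numpy as np

def plain_attention(q, k, v):
    """Standard scaled dot-product attention: edge scores are computed
    on the fly from inner products of the node representations."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (n, n) logits
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

def pair_biased_attention(q, k, v, pair_bias):
    """AlphaFold-style attention: an explicit (n, n) pair tensor injects
    per-edge information that inner products alone cannot express."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d) + pair_bias     # learned per-edge bias
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v
```

The pair bias has to be stored and updated as an n-by-n tensor, which is what makes pairwise processing expensive; plain attention never materialises per-edge state outside the attention computation itself.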
The illusion of simplicity
SimpleFold is the latest in a series of models which use standard Transformers for structural protein tasks. Other models include AlphaFold3, Proteina, and, arguably, ESM2. But if Transformers are so good at learning the “symmetries in the underlying data generation process”, why have we spent so much effort designing custom models like SE(3)-GNNs and IPA? The answer, as discussed in our recent paper, is that learning to reason in 3D with standard attention is actually very complex.
To begin, consider the problem of 1D positional encoding. Since Transformers do not natively have a notion of token order, tokens must be imbued with a way to distinguish nearby tokens within attention using only inner products. The clever solution proposed in the original Transformer paper is to use sinusoidal functions of the absolute position. There are different ways to understand why this works, but one useful one is to notice that the inner products of these embeddings look like cos(pᵢ − pⱼ), which, when appropriately rescaled, can be approximated by 1 − (pᵢ − pⱼ)²/2. These activations are then fed into a softmax, which contains an exponential, so the attention paid can be approximately a Gaussian function of the relative position between tokens. Usefully, linear maps of these embeddings can be used to generate specific relative offsets, which allows tokens to attend to particular relative positions rather than just nearby ones. The same logic can be used to understand Rotary Positional Encoding (RoPE).
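A small NumPy sketch of this property: for each frequency, sin(a)sin(b) + cos(a)cos(b) = cos(a − b), so the inner product of two sinusoidal embeddings depends only on the relative offset, not the absolute positions.

```python
import numpy as np

def sinusoidal(p, d=64, base=10000.0):
    """Sinusoidal positional embedding from the original Transformer paper."""
    freqs = base ** (-np.arange(0, d, 2) / d)
    return np.concatenate([np.sin(p * freqs), np.cos(p * freqs)])

# Same relative offset (3), very different absolute positions:
same_offset_1 = sinusoidal(5.0) @ sinusoidal(8.0)
same_offset_2 = sinusoidal(105.0) @ sinusoidal(108.0)
# The two inner products agree, and the score decays as the offset grows,
# so the post-softmax attention looks approximately Gaussian in the offset.
```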
When we extend to 3D, we also need to worry about invariance to rotations. For instance, if you apply standard positional encoding independently to the x, y, and z coordinates, you can plot the minimum and maximum attention paid to tokens at the same distance away. As the distance increases, these diverge, which is hugely problematic: protein structure models need reliable ways of determining which amino acids are nearby.

However, if we zoom into just the small distances, the rotational invariance seems to be pretty good. This is because the small-angle cosine approximation holds better, giving a sum of quadratic functions of the per-coordinate offsets, which is itself just a quadratic function of the Euclidean distance.
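The two observations above can be reproduced in a few lines of NumPy. This is an illustrative experiment (not taken from the SimpleFold paper): encode 3D points by applying the 1D sinusoidal encoding per coordinate, then measure how much the attention logit varies across directions at a fixed distance.

```python
import numpy as np

def sinusoidal(p, d=32, base=10000.0):
    freqs = base ** (-np.arange(0, d, 2) / d)
    return np.concatenate([np.sin(p * freqs), np.cos(p * freqs)])

def encode_3d(xyz):
    """Naive 3D encoding: the 1D sinusoidal encoding applied per coordinate."""
    return np.concatenate([sinusoidal(c) for c in xyz])

def score_spread(r, n_dirs=2000, seed=0):
    """Max minus min attention logit over random directions at distance r."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_dirs, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    origin = encode_3d(np.zeros(3))
    scores = np.array([origin @ encode_3d(r * u) for u in dirs])
    return scores.max() - scores.min()
```

At small r the spread is tiny (the logit is nearly a function of distance alone, i.e. approximately rotation-invariant), while at larger r the logit varies wildly with direction.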

So, in order for Transformers to reason in 3D, they need to learn (on their own!) an analog of positional encoding along three spatial dimensions, one that is highly constrained by SE(3)-invariance and must be relatively insensitive to layer normalization. If you are interested in how this happens in the wild, check out our paper, “Transformers trained on proteins can learn to attend to Euclidean distance”.
Structural reasoning is hard
As we have seen, it is already very complicated for Transformers to learn an SE(3)-invariant distance measurement in attention. But models like AlphaFold2 explicitly model protein frames, which allows them to attend using virtual atoms positioned along the sidechain of each amino acid. This capability, which stems from SE(3)-equivariance, is likely even harder for Transformers to represent natively, since frame operations are not always linear. Additionally, functions of relative distance, such as radial basis encodings, are extremely difficult for Transformers to learn. In most protein GNNs these are encoded in the edges, and in AlphaFold2/3 they are likely handled in the pair representation, which SimpleFold drops.
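For reference, here is what a radial basis encoding typically looks like in a protein GNN edge featuriser. This is a generic sketch with illustrative bin centers and widths, not code from any particular model:

```python
import numpy as np

def rbf_encode(dists, d_min=0.0, d_max=20.0, n_bins=16, sigma=None):
    """Radial basis encoding of pairwise distances (in Ångströms), as
    commonly attached to the edges of protein GNNs. Each distance becomes
    a soft one-hot vector over Gaussian bins spanning [d_min, d_max]."""
    centers = np.linspace(d_min, d_max, n_bins)
    if sigma is None:
        sigma = (d_max - d_min) / n_bins
    return np.exp(-((dists[..., None] - centers) ** 2) / (2 * sigma ** 2))
```

A single scalar distance is expanded into a bank of sharply localised features, exactly the kind of nonlinear per-edge function that is easy to write into an edge or pair representation but hard to express through inner products inside attention.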
So is attention all you need? My best guess is that it depends on resolution. Long-range contacts, which can be inferred from evolutionary information or coarse-grained physics, can likely be learned using a mix of linear and 3D positional reasoning. Finer details that require equivariant modelling or particular positional offsets will probably still require custom architectures. Combining the two approaches, to leverage the scalability of Transformers with the expressibility of GNNs, is an exciting area of research.
