Tag Archives: attention

3 approaches to linear-memory Transformers

Transformers are a very popular architecture for processing sequential data, notably text and (our interest) proteins. Transformers learn more complex patterns with larger models on more data, as demonstrated by models like GPT-4 and ESM-2. Transformers work by updating tokens according to an attention value computed as a weighted sum of all other tokens. In standard implentations this requires computing the product of a query and key matrix which requires O(N2d) computations and, problematically, O(N2) memory for a sequence of length N and an embedding size of d. To speed up Transformers, and to analyze longer sequences, several variants have been proposed which require only O(N) memory. Broadly, these can be divided into sparse methods, softmax-approximators, and memory-efficient Transformers.

Continue reading