Tag Archives: attention

Attention Is All You Need – A Moral Case

It turns out that giving neural networks attention yields some pretty amazing results. The attention mechanism allowed neural language models to ingest vast amounts of data in a highly parallelised way, efficiently learning, in a context-aware manner, what to pay the most attention to. This computational breakthrough launched the LLM-powered AI revolution we’re living through. But what if attention isn’t just a computational trick? What if the same principle that allows transformers to focus on what matters from a sea of information also lies at the heart of consciousness, perception, and even morality itself? (Ok, maybe this is a bit of a stretch, but hear me out.)

To understand the connection, we need to look at how perception really works. Modern neuroscience reveals that experience is fundamentally subjective and generative. We’re not passive receivers of objective reality through our senses; we’re active constructors of our own experience. According to predictive processing theory, our minds constantly generate models of reality, and our sensory input is used only to provide an ‘error’ signal against those predictions. But the extraordinary point here is that we never ‘see’ these sensory inputs, only our mind’s best guess of how the world should be, updated by sensory feedback. As consciousness researcher Anil Seth puts it, “Reality is a controlled hallucination… an action-oriented construction, rather than passive registration of an objective external reality”, or in the words of Anaïs Nin, half a century earlier, “We do not see things as they are, we see things as we are.”
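The predict-then-correct loop above can be sketched as a toy simulation. This is a deliberately minimal illustration of the idea (my own assumption of a sensible toy setup, not a model from the post or from Seth’s work): an agent holds a generative “best guess” about a hidden quantity, and the only thing its senses ever contribute is a prediction error that nudges that guess.

```python
import numpy as np

# Toy predictive-processing loop: the agent never "sees" raw reality,
# only the error between its prediction and noisy sensory input.
rng = np.random.default_rng(0)
true_state = 5.0       # the hidden external reality
belief = 0.0           # the mind's generative best guess
learning_rate = 0.1    # how strongly prediction errors revise the model

for _ in range(200):
    observation = true_state + rng.normal(scale=0.5)  # noisy sense data
    prediction_error = observation - belief           # the only signal used
    belief += learning_rate * prediction_error        # update the "hallucination"

print(round(belief, 1))  # the belief converges near the true state
```

The agent ends up tracking reality well, despite never accessing it directly, which is the (loose) intuition behind calling perception a controlled hallucination.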


3 approaches to linear-memory Transformers

Transformers are a very popular architecture for processing sequential data, notably text and (our interest) proteins. Transformers learn more complex patterns with larger models on more data, as demonstrated by models like GPT-4 and ESM-2. Transformers work by updating tokens according to an attention value computed as a weighted sum over all other tokens. In standard implementations this requires computing the product of a query and key matrix, which takes O(N²d) computations and, problematically, O(N²) memory for a sequence of length N and an embedding size of d. To speed up Transformers, and to analyze longer sequences, several variants have been proposed which require only O(N) memory. Broadly, these can be divided into sparse methods, softmax-approximators, and memory-efficient Transformers.
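To make the memory bottleneck concrete, here is a small numpy sketch (function names and shapes are my own illustrative choices, not from any particular library). The first version materialises the full N × N score matrix; the second processes keys in chunks while carrying running softmax statistics, in the spirit of memory-efficient attention, so it never stores more than an N × chunk slice:

```python
import numpy as np

def attention_full(Q, K, V):
    """Standard attention: materialises the full N x N score matrix."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # O(N^2) memory right here
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def attention_chunked(Q, K, V, chunk=64):
    """Chunked attention: keeps running softmax statistics so only an
    N x chunk slice of scores exists at any time."""
    N, d = Q.shape
    out = np.zeros_like(V, dtype=float)
    denom = np.zeros((N, 1))
    running_max = np.full((N, 1), -np.inf)
    for start in range(0, N, chunk):
        Kc, Vc = K[start:start + chunk], V[start:start + chunk]
        s = Q @ Kc.T / np.sqrt(d)             # N x chunk, never N x N
        new_max = np.maximum(running_max, s.max(axis=-1, keepdims=True))
        scale = np.exp(running_max - new_max) # rescale earlier accumulators
        w = np.exp(s - new_max)
        out = out * scale + w @ Vc
        denom = denom * scale + w.sum(axis=-1, keepdims=True)
        running_max = new_max
    return out / denom

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
print(np.allclose(attention_full(Q, K, V), attention_chunked(Q, K, V)))  # True
```

The two functions compute the same result; the chunked version trades a sequential loop over key blocks for the O(N²) score matrix, which is the basic move behind the memory-efficient Transformer variants discussed below.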
