2 Comments

Really clean and effective explanation! I definitely had a weak intuitive understanding of the attention mechanism until now.

Would be interesting to see why specifically causal attention seems to be the norm. From my initial understanding, any token before the end of the input sequence could technically take future tokens into account. Intuitively it seems to make sense too - the context of a given word can be affected by a word that comes after it.

Maybe it's the performance impact of having to recalculate the attention? Even then, it feels like it would be possible to do this intermittently, or to leave it out of some transformer blocks. Pretty sure I'm missing something with that.


Yep, I think you've basically hit the nail on the head there - recalculating all of the query, key, and value vectors on every pass through the network is really computationally expensive if you plan on using the model in a chat context. With the causal attention setup, future tokens can't affect the vector representations of past tokens, so we only need to compute each token's q, k, v vectors once and can then cache them for future use.
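
For concreteness, here's a minimal sketch of that caching idea - a toy single-head attention step in numpy, where the function name (`attend_next`), dimensions, and weight matrices are purely illustrative and not taken from any particular library:

```python
# Minimal sketch of why causal attention enables a KV cache.
# Toy single-head attention; shapes and names are illustrative only.
import numpy as np

d = 8  # embedding / head dimension (toy size)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per generated token


def attend_next(x_new):
    """Compute the attention output for the newest token only.

    Because attention is causal, the k/v vectors of earlier tokens never
    change when a new token arrives, so we reuse the cached ones and only
    compute q, k, v for x_new.
    """
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)
    v_cache.append(x_new @ Wv)
    K = np.stack(k_cache)           # (t, d) - all keys so far
    V = np.stack(v_cache)           # (t, d) - all values so far
    scores = K @ q / np.sqrt(d)     # newest token attends over all past tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V              # (d,) attention output for the new token


# One cheap attention call per generated token -> q, k, v computed once each.
for _ in range(5):
    out = attend_next(rng.standard_normal(d))
```

With non-causal attention you couldn't keep `k_cache` and `v_cache` frozen like this, because every new token would change what the earlier tokens attend to.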

Imagine we're generating a sequence with n tokens. With causal attention, since we only need to compute the q, k, v vectors once for each token, we only have n attention operations to perform. With non-causal attention, the first token's q, k, v vectors would need to be recomputed for every token added to the sequence (that is, n times), the 2nd token's for every token added after it (that is, n-1 times), and so on. This gives us n + (n-1) + ... + 1 = (n^2+n)/2 attention operations.
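
As a quick sanity check on that sum (just toy arithmetic, with an arbitrary n):

```python
# Compare attention-operation counts for causal vs. non-causal generation.
n = 1000

causal_ops = n                          # each token's q, k, v computed once
non_causal_ops = sum(range(1, n + 1))   # n + (n-1) + ... + 1

assert non_causal_ops == (n**2 + n) // 2
print(causal_ops, non_causal_ops)       # 1000 vs. 500500
```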

At the scale that companies like OpenAI and Anthropic are offering inference services, going from O(n) to O(n^2) attention operations would be a massive hit to their bottom line.

For users who don't care as much about inference speed (e.g. a user hosting their own local model), I definitely would be curious to see how causal attention compares to non-causal attention. Definitely possible that it could provide better results!
