Discussion about this post

Alex

Really clean and effective explanation! I definitely had a weak intuitive understanding of the attention mechanism until now.

It would be interesting to see why causal attention specifically seems to be the norm. From my initial understanding, any token before the end of the input sequence could technically take future tokens into account. Intuitively that seems to make sense too: the context of any given word can be affected by a later word.

Maybe it's the performance impact of having to recalculate the attention? Even then, it feels like it would be possible to do this at least intermittently, or to exclude some transformer blocks. I'm pretty sure I'm missing something there.
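To make the question concrete, here is a minimal sketch of what causal masking looks like in scaled dot-product attention. The function name, tensor shapes, and the use of NumPy are illustrative assumptions, not code from the post.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.

    Q, K, V: arrays of shape (seq_len, d_k). Each position's output
    is a weighted sum of values at positions <= its own index.
    """
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)          # (seq_len, seq_len) similarity scores
    # Mask out entries above the diagonal, i.e. attention to future positions
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Softmax over each row; masked positions get zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                        # (seq_len, d_k)
```

Dropping the mask gives bidirectional attention, as used in encoder-style models. The usual motivation for keeping the mask in language models is training: with it, every position can be trained in parallel to predict its next token without seeing that token, so one forward pass yields a loss at every position.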

