Hacker News

Comparing compute cost against FlashAttention-2 doesn't seem very honest to me.

FlashAttention-2 hasn't been used for at least 2 years.

This architecture would have been a massive improvement 3 years ago, but the problem it addresses is ~solved~ by now IMO.