Comparing compute cost against FlashAttention-2 doesn't seem very honest to me.
FlashAttention-2 hasn't been used for at least 2 years.
This architecture would have been a massive improvement 3 years ago, but it is a ~solved~ problem IMO.