DFlash immediately came to mind.

There are already several Mac implementations of it that show Qwen3.5 running more than 2x faster.