With lots of (unified) RAM but middling-to-low TFLOPS and memory bandwidth (GB/s), MoEs are usually the most promising option. My current #1, ranked on (IQ, tok/s @ context depth), on my M2 Max (96 GB) is DeepSeek-V4-Flash REAP25 (<65 GB GGUF) + ds4-server + pi agent. Not better than the cloud APIs, of course, but useful enough to get by with if I have to. E.g. on a 4h flight without Internet, the battery held long enough (the local LLM draws 60 W). The ds4 branch with REAP support is here:
https://github.com/ljubomirj/ds4/tree/reap-compact-support
That DS4F only drops to an unusable <10 tok/s at 784K context (!!) makes a big difference.
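Why MoEs suit high-RAM, low-bandwidth machines: single-stream decode is usually memory-bandwidth bound, so tok/s is capped at roughly bandwidth divided by the bytes of weights read per token. A dense model reads all its weights per token, while an MoE reads only its active experts. Below is a back-of-envelope sketch under that assumption. The 400 GB/s is Apple's spec-sheet figure for M2 Max. The model sizes (100B dense, 120B-total / 10B-active MoE) and the 4-bit quant are hypothetical illustration values, not DeepSeek-V4-Flash's actual numbers.

```python
# Back-of-envelope: decode is assumed memory-bandwidth bound, so
# tokens/s <= bandwidth / (bytes of weights read per generated token).
# Model sizes below are HYPOTHETICAL illustration values, not measurements.

RAM_GB = 96.0       # M2 Max config from the post
BANDWIDTH = 400.0   # GB/s, M2 Max spec-sheet memory bandwidth
BITS = 4.0          # hypothetical ~4-bit quantization


def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Size in GB of params_b billion weights at the given bits per weight."""
    return params_b * bits_per_weight / 8


def decode_tok_s_upper_bound(bandwidth_gb_s: float, active_params_b: float,
                             bits_per_weight: float) -> float:
    """Upper bound on tok/s if every active weight is read once per token."""
    return bandwidth_gb_s / weights_gb(active_params_b, bits_per_weight)


# Hypothetical dense model: all 100B params are active on every token.
dense_total, dense_active = 100.0, 100.0
# Hypothetical MoE: 120B params in RAM, only 10B active per token.
moe_total, moe_active = 120.0, 10.0

dense = decode_tok_s_upper_bound(BANDWIDTH, dense_active, BITS)
moe = decode_tok_s_upper_bound(BANDWIDTH, moe_active, BITS)

for name, total, rate in [("dense", dense_total, dense), ("MoE", moe_total, moe)]:
    size = weights_gb(total, BITS)
    print(f"{name}: {size:.0f} GB weights, fits in RAM: {size <= RAM_GB}, "
          f"<= {rate:.1f} tok/s")
```

Both hypothetical models fit in 96 GB, but the MoE's ceiling is 10x higher because it reads 10x fewer bytes per token. The sketch ignores KV-cache reads and attention compute, which grow with context depth and are one reason tok/s falls off at long contexts.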