I also have an M1 Max 64GB. Qwen 3.6 benefits from MTP, though only after several rounds of parameter optimization. MLX was unstable when I last tried it (not recently): faster at token generation (TG) but slower at prompt processing (PP), so the comparison was inconclusive.
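To make the "faster at TG but slower at PP" trade-off concrete, here is a rough sketch of the arithmetic. The function names and all throughput figures are made up for illustration and are not measurements from either backend; it also assumes end-to-end time is simply prompt time plus generation time.

```python
def total_seconds(prompt_tokens, gen_tokens, pp_tps, tg_tps):
    """End-to-end time, assumed to be prompt processing plus token generation."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps


def break_even_gen_tokens(prompt_tokens, pp_a, tg_a, pp_b, tg_b):
    """Generated-token count at which backends A and B take the same time.

    Assumes A is faster at prompt processing and B is faster at generation.
    """
    extra_pp_seconds = prompt_tokens / pp_b - prompt_tokens / pp_a
    saved_seconds_per_token = 1 / tg_a - 1 / tg_b
    return extra_pp_seconds / saved_seconds_per_token


# Hypothetical numbers only: A = 400 t/s PP, 40 t/s TG; B = 200 t/s PP, 50 t/s TG.
prompt = 8000
n = break_even_gen_tokens(prompt, pp_a=400, tg_a=40, pp_b=200, tg_b=50)
print(f"B only wins once the reply exceeds about {n:.0f} tokens")
```

With a long prompt and short replies the backend with faster PP wins, and with a short prompt and long replies the one with faster TG wins, which is why the result depends so much on workload.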
Yeah. I have not really tinkered much with parameter optimisation for the 35B model with MTP. Would be interested to see what you've found.
I'm using the GGUF too; it currently appears slightly faster in llama.cpp than in the current LM Studio, but it's not clear to me whether that is down to LM Studio having a little more code overhead, an older llama.cpp under the hood, or just parameter differences.