Hacker News

On the MoE versions of these models, the MTP variants bring only marginal benefit. In my trials the speed-up is under 20% (not the ~2x that happens with some other setups/models) and usually more like 10%, i.e. something like 13 -> 15 tokens/s on my device.

I still use the MTP version as it _feels_ slightly better in quality, and because the Unsloth quantizations I can get come in more variety to fit the various systems at hand... but neither reason has to do with MTP itself, unfortunately.

In the article they did get ~2x performance on the 27B (which might be worth retrying, though on my Framework that would take it from 5 -> 10 tokens/s, so probably still "excruciating" speed).
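
For what it's worth, the arithmetic behind those figures as a quick Python sketch. The function name is just something I made up for illustration; the token rates are the rough numbers quoted above, not careful benchmarks.

```python
def speedup_percent(baseline_tps: float, mtp_tps: float) -> float:
    """Relative throughput gain over the baseline, in percent."""
    return (mtp_tps / baseline_tps - 1.0) * 100.0


# MoE case: roughly 13 -> 15 tokens/s
moe_gain = speedup_percent(13, 15)

# 27B case: roughly 5 -> 10 tokens/s (the ~2x scenario)
dense_gain = speedup_percent(5, 10)

print(f"MoE: +{moe_gain:.1f}%  27B: +{dense_gain:.1f}%")
```

So even a full 2x on the 27B only lands at 10 tokens/s, which is still below what the MoE model does without MTP.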

YMMV for sure.