> The benchmark prompt was:
> Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.
> Each benchmark generated about 128 tokens.
Generating 128 tokens is probably not enough for good benchmark results. MTP speedup depends on how often the predicted tokens are accepted. In my experience, the very early output has a higher acceptance rate, so short testing can give false positive speedups.
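To put rough numbers on why acceptance rate matters so much, here is a minimal sketch. It assumes each drafted token is accepted independently with the same probability `p` and that every verification pass still yields one token from the main model. That is a textbook simplification of speculative decoding, not how llama.cpp measures anything, and the function name is made up for illustration.

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumption (not from the article): each of the k drafted tokens is
    accepted independently with probability p, drafting stops at the first
    rejection, and the main model always contributes one token itself.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if p == 1.0:
        # Every draft accepted: k drafted tokens plus the main model's token.
        return float(k + 1)
    # Geometric series: 1 + p + p^2 + ... + p^k
    return (1.0 - p ** (k + 1)) / (1.0 - p)


if __name__ == "__main__":
    # Compare an optimistic early-output acceptance rate with lower ones.
    for p in (0.9, 0.7, 0.5):
        print(f"p={p}: {expected_tokens_per_step(p, k=3):.3f} tokens per pass")
```

With 3 drafted tokens, this model gives about 3.44 tokens per pass at 90% acceptance but only 1.875 at 50%. This counts tokens per pass, not wall-clock speedup, but it shows how a short run that only sees the high-acceptance opening could look much better than a long one.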
llama.cpp includes a tool specifically for benchmarking that will sweep the arguments for you so you don't have to restart the server and send it prompts:
https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
EDIT: Also the section about downloading the models should have mentioned that llama.cpp has a "-hf" argument that will download the models for you. I appreciate the author for sharing their experience, but for beginners this might not be the best guide to use.
> I appreciate the author for sharing their experience, but for beginners this might not be the best guide to use.
Yeah, I didn't write this as a proper developer guide. My screen recording started getting loads of favourites and I started getting messages asking how I set it up, so I just threw up a quick rundown of how I set up this test.
I literally just saw the Unclothe announcement about "Double the speed" and thought "Ha. I wonder if that will get it fast enough that I'd actually be prepared to use it" and had a go at setting it up.
I'd done tests last year with things like Devstral, but they were always so slow and dumb that I didn't want to bother.
This finally hit the "wow, this is useable" level of both speed and intelligence.
I wasn't familiar with Unclothe, so I had to look it up.
Are you sure you did not mean Unsloth?
They likely did, and this autocorrect slip might suggest why OP is using local models :)
Indeed, a clear Freudian slip. The one where you say one thing, but you mean your mother.
Realistically, you need to experiment with any user prompt plus a good amount of system prompt (at least 1000 tokens, and something in the range of 3000 tokens is probably good).
llama.cpp includes tools for that. What you want is a prefill before token generation, so that you measure generation speed properly. Increasingly, measuring token generation speed at longer context (32k or 64k) is important too.
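As a rough sketch of what that could look like, the snippet below only builds a `llama-bench` command line and prints it. The flags are as I understand them from the tool's help output (`-m` model, `-p` prompt tokens, `-n` generated tokens, `-d` context depth, each taking comma-separated values to sweep), so check `llama-bench --help` on your build before relying on them. The model path, the default values and the helper name are placeholders, and whether the tool exercises MTP at all is something to verify separately.

```python
import shlex


def build_bench_command(model_path, prompt_tokens=(3000,),
                        gen_tokens=(128, 1024), depths=(0, 32768)):
    """Build an argv list for a llama-bench sweep (hypothetical helper).

    Assumed flags: -m (model), -p (prompt tokens), -n (generated tokens),
    -d (context depth). Confirm them against your llama.cpp build.
    """
    def csv(values):
        # llama-bench sweeps are written as comma-separated lists.
        return ",".join(str(v) for v in values)

    return [
        "llama-bench",
        "-m", model_path,
        "-p", csv(prompt_tokens),
        "-n", csv(gen_tokens),
        "-d", csv(depths),
    ]


if __name__ == "__main__":
    # "model.gguf" is a placeholder path, not a file from the article.
    print(shlex.join(build_bench_command("model.gguf")))
```

Printing the command instead of running it keeps the sketch harmless; paste the output into a shell once the flags are confirmed.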
At 128 tokens, you’re benchmarking the overture, not the opera.
I thought the same thing when I started using local models, but the reality is that, for a given context depth, the token generation speed doesn't change whether you generate 128 or 8000 tokens; it just lengthens the benchmark run time.
This is akin to saying “it runs on my machine” without actually examining the problem. Sad. You’re absolutely right that 128 tokens is nothing, it’s a little more than a hello response.