Hacker News

egorfine 3 hours ago

I've researched this for quite a while, and so far the fastest runtime is oMLX. But there's a caveat: TTFT (time to first token) on MLX on an M4 Pro is enormous. On the M5 Pro it has been greatly sped up.

regexorcist an hour ago

Curious if you tested llama.cpp and still found oMLX faster? I haven't tried the latter myself, might give it a go.

egorfine an hour ago

Oh yeah, I did test various solutions with different settings and quants.

llama.cpp is about 1/3 slower on Apple Silicon.