I'm so darn confused about local LLMs and M-series inference speed. The perf jump from M2 Max to M4 Max was negligible, 10-20% (both MBPs with 64 GB and max GPU cores).
Does your inference framework target the NPU or just GPU/CPU?
It's linking llama.cpp and using Metal, so I presume GPU/CPU only.
I'm more than a bit overwhelmed with what I've got on my plate and have completely missed the boat on, e.g., understanding what MLX is. Really curious for a thought dump if you have some opinionated experience/thoughts here (e.g., it never crossed my mind until now that you might get better results on the NPU than on the GPU).