oMLX (https://github.com/jundot/omlx) makes running the mlx inference server quite easy for those interested in UI-based hosting. oMLX also supports mtp or dflash drafting.

Agreed (not sure what you mean by UI-based hosting).

oMLX does the caching I need to fit models that are near gross memory, and it handles most of the work in finding usable models. After cobbling together various solutions over months, I now just use oMLX, often from Xcode. I can tell the difference between Gemma-4 (local/free) and Claude (paid) only on the largest tasks.

Whay about of the tons of caches that just pile up until you notice that you must delete them manually?