Right, and that's what I find frustrating. There are so many use cases where a local, purpose-built model that's dependably good at one thing would really make a difference. But no one is going to throw a billion dollars to give us amazing dust removal, flawless scene segmentation, etc.

Instead, you're supposed to upload your image to the cloud and ask a big, multimodal frontier model to maybe please do the thing you want and nothing else.

> There are so many use cases where a local, purpose-built model that's dependably good at one thing would really make a difference. But no one is going to throw a billion dollars to give us amazing dust removal, flawless scene segmentation, etc.

iPhones have models for text extraction and in-painting in the Photos app.

Neither has knobs to tune, but I think they are decent for their intended audience (definitely not flawless, but I don’t think flawless exists anywhere, even if you drop the ‘local’ requirement).

For scene segmentation, iOS has models for detecting people (https://developer.apple.com/documentation/Vision/segmenting-...).

It also has models for detecting faces, face features, body and hand poses, or for picking the ‘best’ selfie from a set.

(And dust removal is fairly niche compared to these, I think. Or am I overlooking some common use case for it that many people want?)

I have the feeling that the cloud-based providers are just using the freely available segmentation models. It's just speculation, but it doesn't seem to be a top priority for them, so they'd just bolt on anything that works.

Another problem is that the cloud solutions need a complex UI to surface segmentation to the user. But your point stands: those models are probably not ready for prime time yet, and surfacing them would reveal that they are not as powerful as the user expects, destroying the illusion that AI can just do anything at will.

You can do all of this locally on a cheap video card. Search for fooocus or automatic1111 for a couple of setups that are fairly low friction to get going. Amuse AI is another one. It's not quite state of the art and also censored, but it's by far the least friction (especially if you have an AMD card) - it's pretty much plug and play. ComfyUI is the advanced do-everything workhorse. However, it's anything but comfy if you don't already have a lot of knowledge about this domain. I'd generally recommend fooocus for a balance between usability and power/flexibility.

The million image gen services online are mostly just making bank off ignorance. People don't realize that their own cheap video cards are more than enough to do everything they're paying a service an orders-of-magnitude markup for.

The highest-return small local model for me has been the built-in OCR that macOS has. It has finally "solved" OCR by making high-quality results accessible to everyone. Yet the state of the art outside the Apple ecosystem seems to be Tesseract (poor results) or extremely heavy VLMs.
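
For anyone who wants to try the Tesseract baseline locally for comparison, here is a minimal sketch in Python. It assumes the `tesseract` command-line tool is installed and on your PATH, and that your version accepts the `--psm` flag (older releases may differ). The function names and the `scan.png` file name are made up for illustration.

```python
import shutil
import subprocess
from typing import List, Optional


def build_tesseract_cmd(image_path: str, lang: str = "eng",
                        psm: Optional[int] = None) -> List[str]:
    """Build the argument list for a Tesseract CLI call.

    Passing "stdout" as the output base makes Tesseract print the
    recognized text instead of writing a .txt file.
    """
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if psm is not None:
        # Page segmentation mode; assumed to be supported by the installed version.
        cmd += ["--psm", str(psm)]
    return cmd


def ocr_image(image_path: str, lang: str = "eng",
              psm: Optional[int] = None) -> str:
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise if Tesseract exits with an error
    )
    return result.stdout.strip()
```

Calling `ocr_image("scan.png")` on a screenshot and comparing the result with what macOS gives you on the same image is a quick way to check the quality gap for yourself.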

PaddleOCR? Qwen3-VL 30B-A3B?