Right, and that's what I find frustrating. There are so many use cases where a local, purpose-built model that's dependably good at one thing would really make a difference. But no one is going to throw a billion dollars to give us amazing dust removal, flawless scene segmentation, etc.
Instead, you're supposed to upload it to the cloud and ask a big, multimodal frontier model to maybe please do the thing you want and nothing else.
> There are so many use cases where a local, purpose-built model that's dependably good at one thing would really make a difference. But no one is going to throw a billion dollars to give us amazing dust removal, flawless scene segmentation, etc.
iPhones have models for text extraction and in-painting in the Photos App.
Neither has knobs to tune, but I think they are decent for their intended audience (definitely not flawless, but I don’t think flawless exists anywhere, even if you drop the ‘local’ requirement).
For scene segmentation, iOS has models for detecting persons (https://developer.apple.com/documentation/Vision/segmenting-...).
It also has models for detecting faces, face features, body and hand poses, or for picking the ‘best’ selfie from a set.
(And dust removal is fairly niche compared to these, I think. Or am I overlooking some common use case for it that many people want?)
I have the feeling that the cloud-based providers are just using the freely available segmentation models. It's just speculation, but it doesn't seem to be a top priority for them, so they'd just bolt on anything that works.
A problem is also that the cloud solutions need a complex UI to surface segmentation to the user. But your point is that those models are probably not ready for prime time yet; surfacing them would reveal that they are not as powerful as the user expects, destroying the illusion that AI can just do anything at will.
You can do all of this locally on a cheap video card. Search for fooocus or automatic1111 for a couple of setups that are fairly low friction to get going. Amuse AI is another one. It's not quite state of the art and also censored, but it's by far the least friction (especially if you have an AMD card) - it's pretty much plug and play. ComfyUI is the advanced do-everything workhorse. However, it's anything but comfy if you don't already have a lot of knowledge about this domain. I'd generally recommend fooocus for a balance between usability and power/flexibility.
The million image gen services online are mostly just making bank off ignorance. People don't realize that their own cheap video cards are more than enough to do everything they're paying a service an orders-of-magnitude markup for.
The highest-return small local model for me has been the built-in OCR that macOS has. It has finally "solved" OCR by making high-quality results accessible to everyone. Yet the state of the art outside the Apple ecosystem seems to be Tesseract (poor results) or extremely heavy VLMs.
PaddleOCR? Qwen3-VL 30B-A3B?
How many times have you edited a photo you took on your phone in the last 7 days?
I think 3? I feel like that's often enough. Sometimes it's nice to do a quick dumb ass gag on a whim. If I am anything I am a man who loves a dumb ass gag.
Good on you. I've laughed at many dumbass gags but I've only been a passive consumer of them.
Become the dumbass change you want to see in the world
I'm not nearly creative enough.
Some smartphones have a feature that detects if you're taking a picture of a menu/letter/etc and will automatically crop and unskew it for you.
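For anyone curious what the "unskew" part might boil down to: once the four corners of the page are known, one common approach is a perspective transform (homography) that maps them onto an upright rectangle. The sketch below is only that, a sketch. It assumes the corners have already been found (the coordinates are made up), it is pure Python with no imaging library, and the names `solve_linear`, `homography`, and `warp_point` are hypothetical helpers, not anything a phone vendor has documented. What actual phones do is not stated here.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build an augmented matrix so the caller's lists are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corner configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography(src, dst):
    """Return the 3x3 matrix mapping four src points onto four dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Two linear equations per point pair, with the bottom-right entry fixed to 1.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(h, x, y):
    """Apply the homography to one point (includes the perspective divide)."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Made-up corners of a skewed page in the photo (clockwise from top-left)
# and the upright 400x600 rectangle they should land on.
corners = [(120, 80), (520, 110), (560, 600), (90, 560)]
target = [(0, 0), (400, 0), (400, 600), (0, 600)]
H = homography(corners, target)
```

To actually produce the straightened image you would loop over the output pixels and sample the photo through the inverse mapping; image libraries typically offer this as a ready-made perspective warp, so the hand-rolled solver is here only to show the idea.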
Half a dozen at least.
(I'm counting only times I used generative editing options in my Galaxy phone - if I were to take your question literally, it would be "at least once every other day", simply due to rotating and cropping.)
Personally, about 9 times. Would be higher if it were even easier and cheaper.