Genuine question: is this solving a real problem?

IME, the bottleneck when using diffusion models isn't storage space or memory, it's generation time. Lots of models will run on 8-12 GB GPUs from the 1080 generation onwards, or on Macs with similar memory, which are probably the bottom end from a GPU power perspective anyway. I also note that these models are marginally slower than the small FLUX.2 model they're based on.

Okay, maybe this allows running a local model on something that has a reasonably powerful GPU and limited memory, like an iPhone, but is that really a common requirement?

We are in an era of extreme demand for GPUs and limited supply. Every inference we push to the edge frees cloud resources for other tasks. Every efficiency gain increases what we can achieve with existing resources. If images can be rendered with half as much compute, we need half as many GPUs.

It's useful progress. Decent-fidelity local-scale inference means that you can create a product that generates throwaway images frequently without worrying about cost. Thus far every product I've seen that generates images is metered, which severely limits the value. I don't know if this is actually at the "decent fidelity" point yet.

I think the value of it is currently more academic than useful in the real world. Everything at the frontier is still only marginally Good Enough (in image generation, most of it is shit even from the best models), so things far behind the frontier in terms of capability (as a tiny 1-bit model necessarily must be) are unusable.

But, getting remarkably higher density of capability per unit of compute is a big thing. It means the frontier can get better and cheaper to operate and less resource hungry, and it means what can be accomplished at the edge, on personal laptops or phones, becomes a broader spectrum of tasks.

And, for privacy, there are a lot of things that should run on-device and not everyone has big dedicated GPUs.

It’s like asking how Memoji generation on iPhone solved a real problem.

It does not need to directly solve any particular problem to be overall good for consumers, by putting pressure on all those subscription-based solutions… at least it’s private and does not require you to provide all your data…

Genuine question: doesn't it blow your mind that there exists a 1 Gigabyte file/program that can generate any image you can think of just from a rough description of it?

Where are you getting the 1 Gigabyte number from?

Their 1-bit quantized Diffusion Transformer is just under 1 GB. You also need the text-encoder (4-bit quantized) and VAE (unquantized) for inference and their combined weight is ~3.42 GB.

TBF, even at that size it's no less mind blowing.

Yeah, it's pretty incredible. And I guess that's mostly what's behind the question: whether this is more of an impressive research/technique demonstrator, or a real product advancement solving a need.

> doesn't it blow your mind that there exists a 1 Gigabyte file/program that can generate any image you can think of just from a rough description of it?

I can make this into a 5-line Python program. I’m not saying the images will match the description, but that isn’t part of your spec ;)

For free users, I guess local generation is going to be faster than waiting in a queue.

Ideally, if ternary models work, the math is extremely easy for computers (addition/subtraction vs. 16-bit multiplication).

Not quite, as I understand it. The ternary approach Bonsai uses relies on an FP16 scaling factor that each ternary value maps to. You're still using 16-bit multiplication; it's just that the weights are far more compressed.
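
To make the disagreement concrete, here is a minimal sketch in plain Python. It assumes one shared scale per group of ternary weights. The group layout and the names `ternary_dot` and `dense_dot` are made up for illustration and are not taken from Bonsai's code. Under that assumption the inner loop needs only additions and subtractions, and the scale costs one multiplication per group rather than one per weight. Whether a real kernel is written this way is a separate question.

```python
def ternary_dot(x, codes, scale):
    """Dot product of activations x with weights codes[i] * scale.

    codes holds only -1, 0 or +1 (hypothetical storage format).
    """
    acc = 0.0
    for xi, c in zip(x, codes):
        if c == 1:
            acc += xi      # +1 weight: add the activation
        elif c == -1:
            acc -= xi      # -1 weight: subtract the activation
        # 0 weight: contributes nothing
    return acc * scale     # the only multiplication, once per group


def dense_dot(x, weights):
    """Reference dot product with one multiplication per weight."""
    return sum(xi * wi for xi, wi in zip(x, weights))


x = [0.5, -1.0, 2.0, 4.0]
codes = [1, 0, -1, 1]
scale = 0.25  # stands in for the FP16 scaling factor

result = ternary_dot(x, codes, scale)
reference = dense_dot(x, [c * scale for c in codes])
print(result, reference)
```

Both functions compute the same value because the scale factors out of the sum: sum(scale * c_i * x_i) equals scale * sum(c_i * x_i).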

Fair, I think I was referring more to the 1.58-bit architecture in general, since the original paper (Figure 3) shows that we replace FP16 multiplication and addition with just INT8 addition. I need to dive deeper into Bonsai overall to see if it differs.

https://arxiv.org/pdf/2402.17764

Yes, it's a huge deal, because these are starting to get bound by memory bandwidth, not compute. Therefore one-bit weights stream way faster, leading to substantially better results. At least that's what I'd guess!
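
A back-of-the-envelope sketch of that guess, in plain Python. The parameter count and the memory bandwidth below are made-up round numbers, not figures for this model or any particular device. It only estimates how long it takes to read the weights once, which matters only if the workload really is memory-bound; it says nothing about compute time.

```python
def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to store n_params weights at the given bit width."""
    return n_params * bits_per_weight / 8


def stream_time_ms(n_bytes, bandwidth_gb_s):
    """Milliseconds to read n_bytes once at bandwidth_gb_s (GB/s, 1e9 bytes)."""
    return n_bytes / (bandwidth_gb_s * 1e9) * 1e3


N_PARAMS = 4e9          # hypothetical parameter count
BANDWIDTH_GB_S = 100.0  # hypothetical memory bandwidth

t_fp16 = stream_time_ms(weight_bytes(N_PARAMS, 16), BANDWIDTH_GB_S)
t_1bit = stream_time_ms(weight_bytes(N_PARAMS, 1), BANDWIDTH_GB_S)

print(f"16-bit weights: {t_fp16:.1f} ms per full read")
print(f"1-bit weights:  {t_1bit:.1f} ms per full read")
print(f"ratio: {t_fp16 / t_1bit:.1f}x")
```

The ratio is just the ratio of bit widths, so it holds for any parameter count and bandwidth. The absolute times depend entirely on the assumed numbers.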