I'm currently testing this out (using the Replicate endpoint: https://replicate.com/black-forest-labs/flux-kontext-pro). Replicate also hosts "apps" with examples using FLUX Kontext for some common image-editing use cases: https://replicate.com/flux-kontext-apps
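
For anyone who wants to script it instead of using the web UI, here's a minimal sketch using Replicate's official Python client. The "prompt" and "input_image" parameter names are my assumptions based on the playground form; check the model page for the exact input schema.

    import replicate

    # Assumes REPLICATE_API_TOKEN is set in the environment, and that the model
    # accepts "prompt" and "input_image" inputs (verify on the model page).
    output = replicate.run(
        "black-forest-labs/flux-kontext-pro",
        input={
            "prompt": "Change the lighting to warm golden-hour sunlight",
            "input_image": open("character.png", "rb"),  # file handles are uploaded by the client
        },
    )

    # The result is typically a URL (or file-like output) for the edited image.
    print(output)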

It's pretty good: the quality of the generated images is similar to that of GPT-4o image generation for simple image-to-image edits. Generation is speedy at roughly 4 seconds per image.

Prompt engineering outside of the examples used on this page is a little fussy, and I suspect it will evolve over time. Changing styles or specific aspects does work, but the more specific you get, the more it tends to ignore the specifics.

It seems more accurate than 4o image generation in terms of preserving original details. If I give it my 3D animal character and ask for a minor change like adjusting the lighting, 4o will completely mangle the face of my character and slightly alter the body and other details. This FLUX model keeps the visible geometry almost perfectly intact even when asked to significantly change the pose or lighting.

Anything is more accurate than the LLMs at generating images. ChatGPT, Google Gemini, all of them... they're not optimized for image generation. It's why Veo is an entirely separate model from Google, for example, and even Veo isn't the best video model either. People dedicated to images and video (such as Black Forest Labs) are simply spending more time here; as a result, those specialized models are better.

gpt-image-1 (aka "4o") is still the most useful general purpose image model, but damn does this come close.

I'm deep in this space and feel really good about FLUX.1 Kontext. It fills a gap that badly needed filling, and it makes sure that OpenAI / Google aren't the runaway victors of images and video.

Prior to gpt-image-1, the biggest problems in images were:

  - prompt adherence
  - generation quality
  - instructiveness (eg. "put the sign above the second door")
  - consistency of styles, characters, settings, etc. 
  - deliberate and exact intentional posing of characters and set pieces
  - compositing different images or layers together
  - relighting

Fine-tunes, LoRAs, and IPAdapters fixed a lot of this, but they were a real pain in the ass. ControlNets solved for pose, but it was still awkward and ugly. ComfyUI orchestrated this layer of hacks and kind of got the job done, but it was unmaintainable glue. It always felt like a fly-by-night solution.

OpenAI's gpt-image-1 solved all of these things with a single multimodal model. You could throw out ComfyUI and all the other pre-AI garbage and work directly with the model itself. It was magic.

Unfortunately, gpt-image-1 is ridiculously slow, insanely expensive, and highly censored (you can't use a lot of copyrighted characters or celebrities, and plenty of totally SFW prompts are blocked). It can't be fine-tuned, so you're stuck with the "ChatGPT style" and what the community calls the "piss filter" (perpetually yellowish images).

And the biggest problem with gpt-image-1 is that, because it puts image and text tokens in the same space to manipulate, it can't retain the exact, pixel-precise structure of reference images. Because of that, it cannot function as an inpainting/outpainting model whatsoever. You can't use it to edit existing images if the original image matters.

Even with those flaws, gpt-image-1 was a million times better than Flux, ComfyUI, and all the other ball-of-wax hacks we've built up. Given the expense of training gpt-image-1, I was worried that nobody else would be able to afford to train a competitor and that OpenAI would win the space forever. We'd be left with only the AI hyperscalers building these models, and it would suck if Google and OpenAI were the only providers of tools for artists.

Black Forest Labs just proved that wrong in a big way! While this model doesn't do everything as well as gpt-image-1, it's within the same order of magnitude. And it's ridiculously fast (10x faster) and cheap (10x cheaper).

Kontext isn't as instructive as gpt-image-1. You can't give it multiple pictures and ask it to copy characters from one image into the pose of another. You can't have it follow complex compositing requests. But it's close, and that makes it immediately useful. It fills a real gap in the space.

Black Forest Labs did the right thing by developing this instead of a video model. We need much more innovation in the image model space, and we need more gaps to be filled:

  - Fast
  - Truly multimodal like gpt-image-1
  - Instructive 
  - Posing built into the model. No ControlNet hacks. 
  - References built into the model. No IPAdapter, no required character/style LoRAs, etc. 
  - Ability to address objects, characters, mannequins, etc. for deletion / insertion. 
  - Ability to pull sources from across multiple images with or without "innovation" / change to their pixels.
  - Fine-tunable (so we can get higher quality and precision) 
 
Something like this that works in real time would change the game forever.

Please build it, Black Forest Labs.

All those feature requests aside, Kontext is a great model. I'm going to be learning it over the next few weeks.

Keep at it, BFL. Don't let OpenAI win. This model rocks.

Now let's hope Kling or Runway (or, better, someone who does open weights -- BFL!) develops a Veo 3 competitor.

I need my AI actors to "Meisner", and so far only Veo 3 comes close.

When I first saw gpt-image-1, I was equally scared that OpenAI had used its resources to push so far ahead that more open models would be left completely in the dust for the foreseeable future.

Glad to see this release. It also puts more pressure onto OpenAI to make their model less lobotomized and to increase its output quality. This is good for everyone.

>Given the expense of training gpt-image-1, I was worried that nobody else would be able to afford to train the competition

OpenAI models are expensive to train because it's beneficial for OpenAI that they be expensive, and there's no incentive to optimize when they're going to run in a server farm anyway.

Probably a bunch of teams never bothered trying to replicate DALL-E 1 and 2 because the training run cost millions, yet SD 1.5 showed us comparable tech can run on a home computer and be trained from scratch for thousands of dollars, or fine-tuned for cents.

Your comment is def why we come to HN :)

Thanks for the detailed info

Thought the SAME thing

This breakdown made my day, thank you!

I'm building a web-based paint/image editor with AI inpainting etc., and this is going to be a great model to use, both price-wise and capability-wise.

Completely agree. So happy it's not one of these big cos controlling the whole space!

What are you building? Ping me if you want a tester of half-finished breaking stuff

Thanks for the detailed post!

Honestly love Replicate for always being up to date. It’s amazing that not only do we live in a time of rapid AI advancement, but that every new research grade model is immediately available via API and can be used in prod, at scale, no questions asked.

There's something to be said for distributors like Replicate that are adding an exponent to the impact of these model releases.

I have no affiliation with either company but from using both a bunch as a customer: Replicate has a competitor at https://fal.ai/models and FAL's generation speed is consistently faster across every model I've tried. They have some sub-100 ms image gen models, too.

Replicate has a much bigger model selection. But for every model that's on both, FAL is pretty much "Replicate but faster". I believe pricing is pretty similar.
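
For comparison, here's a rough sketch of the same kind of call through FAL's Python client (fal-client). The model slug and argument names are my best guesses, not something I've verified against FAL's docs, so double-check them on the model page before relying on this.

    import fal_client

    # Assumes FAL_KEY is set in the environment. The app id and argument names
    # below are assumptions; confirm them on https://fal.ai/models.
    result = fal_client.subscribe(
        "fal-ai/flux-pro/kontext",
        arguments={
            "prompt": "Change the lighting to warm golden-hour sunlight",
            "image_url": "https://example.com/character.png",
        },
    )

    # subscribe() blocks until the job finishes and returns the response payload.
    print(result)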

Founder of Replicate here. We should be on par or faster for all the top models. For example, we have the fastest FLUX[dev]: https://artificialanalysis.ai/text-to-image/model-family/flu...

If something's not as fast let me know and we can fix it. ben@replicate.com

Hey Ben, thanks for participating in this thread. And certainly also for all you and your team have built.

Totally frank and possibly awkward question, you don't have to answer: how do you feel about a16z investing in everyone in this space?

They invested in you.

They're investing in your direct competitors (Fal, et al.)

They're picking your downmarket and upmarket (Krea, et al.)

They're picking consumer (Viggle, et al.), which could lift away the value.

They're picking the foundation models you consume. (Black Forest Labs, Hedra, et al.)

They're even picking the actual consumers themselves. (Promise, et al.)

They're doing this at Series A and beyond.

Do you think they'll try to encourage dog-fooding or consolidation?

The reason I ask is because I'm building adjacent or at a tangent to some of this, and I wonder if a16z is "all full up" or competitive within the portfolio. (If you can answer in private, my email is [my username] at gmail, and I'd be incredibly grateful to hear your thoughts.)

Beyond that, how are you feeling? This is a whirlwind of a sector to be in. There's a new model every week it seems.

Kudos on keeping up the pace! Keep at it!

That feels like the VC equivalent of buying a market-specific fund, so fairly par for the course?

a16z invested in both. It's wild. They've been absolutely flooding the GenAI market for images and videos with investments.

They'll have one of the victors, whoever it is. Maybe multiple.

That's less about the downstream distributors and more about the model developers themselves realizing that making their models easily accessible on Day 1 is important for getting community traction. Locking a model exclusively behind their own API won't work anymore.

Llama 4 was another recent case where they explicitly worked with downstream distributors to get it working Day 1.

In my quick experimentation for image-to-image this feels even better than GPT-4o: 4o tends to heavily weight the colors towards sepia, to the point where it's a bit of an obvious tell that the image was 4o-generated (especially with repeated edits); FLUX.1 Kontext seems to use a much wider, more colorful palette. And FLUX, at least the Max version I'm playing around with on Replicate, nails small details that 4o can miss.

I haven't played around with from-scratch generation, so I'm not sure which is best if you're trying to generate an image just from a prompt. But in terms of image-to-image via a prompt, it feels like FLUX is noticeably better.

> Generation is speedy at roughly 4 seconds per image

May I ask which GPU and how much VRAM?

Edit: oh, unless you just meant through Hugging Face's UI.

The open-weights variant is "coming soon," so the only option right now is hosted.

It is through the Replicate UI listed above, which goes through Black Forest Labs's infra, so you would likely get the same results from their API.
