Thank you! The images are generated with Flux (Schnell, for cost reasons). An LLM (currently Gemini Flash) writes the prompt, Flux generates the image, a rembg Lambda that I open-sourced strips the background, and then a vision LLM (also currently Gemini Flash) grades the result for prompt adherence, background-removal artifacts, etc. A lot get thrown away, but the cost is so low that even a 25-50% success rate is adequate.
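
For anyone curious what that loop looks like, here is a rough sketch of the generate-then-grade flow. Every stage is passed in as a callable, and all of the names (`make_asset`, `write_prompt`, `generate_image`, `remove_background`, `grade`, `Grade`, `max_attempts`) are hypothetical placeholders rather than the real code or any vendor API. It assumes each attempt is independent and that a rejected image is simply discarded and retried.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Grade:
    """Hypothetical verdict returned by the vision-LLM grading step."""
    prompt_adherence: bool
    clean_background: bool

    @property
    def accepted(self) -> bool:
        # Keep an image only if it passes both checks.
        return self.prompt_adherence and self.clean_background


def make_asset(
    subject: str,
    write_prompt: Callable[[str], str],          # LLM: subject -> image prompt
    generate_image: Callable[[str], bytes],      # image model: prompt -> image bytes
    remove_background: Callable[[bytes], bytes], # background removal step
    grade: Callable[[str, bytes], Grade],        # vision LLM: (prompt, image) -> verdict
    max_attempts: int = 8,                       # assumed cap, not from the original
) -> Optional[bytes]:
    """Retry the whole pipeline until the grader accepts an image or attempts run out."""
    for _ in range(max_attempts):
        prompt = write_prompt(subject)
        cutout = remove_background(generate_image(prompt))
        if grade(prompt, cutout).accepted:
            return cutout
    return None  # every attempt was rejected


def expected_attempts(success_rate: float) -> float:
    """Mean attempts per accepted image, assuming independent attempts."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return 1 / success_rate
```

Under that independence assumption, a 25-50% acceptance rate works out to roughly 2 to 4 generations per kept image, which is why a high discard rate is tolerable when each attempt is cheap.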

Background removal lambda if you want to check that out: https://github.com/joshdickson/rembg-lambda