Very interesting discrepancy in the attached example:

- "removing the kettlebell" led to removing the visual representation of the kettlebell as well as the deformation it makes on the pillow

- "removing the hands" removed the child's hands from the tops, but did not then lead to the tops falling over!

Others, like the colliding cars, are in some weird gray area between the two.

One should note that, as these tools proliferate, there is a lot of artistic expression that we are giving up to these imprecise natural language parsing engines.

I think we'll regain the artistic expression/control in the future if the tools become popular.

We saw this with Stable Diffusion. First it was just text-to-image. Now we have ControlNets, which can control the precise placement of objects. We've got depth maps, body positioning, all kinds of things. I can even sketch on my tablet and turn that into an image.

I'm sure we can add options for what this physically affects or doesn't affect.