The example given in the article is acceptance criteria for a login/password entry flow. That is fairly easy to spec out in terms of AC and TDD.

I have been asking these tools to build other types of projects, ones that seem much more difficult to verify without a human in the loop. For example, I asked Codex to build a simulation of the solar system using a Metal renderer. It produced a fun working app quickly.

I asked it to add bloom. It looped for hours, failing. I had to verify manually, because even from screenshots it couldn't tell what was right and what was wrong. It only got it right when I pasted a how-to-write-a-bloom-shader-pass-in-Metal blog post into the context.

Then I noticed that all of the planet textures rotated oddly every time I orbited the camera. Codex got stuck in another endless loop of "Oh, the lookAt matrix is column-major, let me fix that" <proceeds to break everything>, or fixating (incorrectly) on UV coordinates and shader code. Eventually Codex told me that what I was seeing "was expected" and that I just "felt like it was wrong."

When I finally realised the problem was that Codex had drawn the planets with back-facing polygons only, I reported the error, to which Codex replied, "Good hypothesis, but no."

I insisted that it change the culling configuration and then it worked fine.
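For anyone curious, the kind of change involved is roughly this; a sketch only, not the exact diff, and `renderEncoder` is a placeholder for whatever `MTLRenderCommandEncoder` the app actually uses:

```swift
// If the sphere meshes are wound clockwise but the encoder assumes
// counter-clockwise front faces, back-face culling hides the visible
// side of each planet and you end up rendering the inside of the
// geometry instead.
renderEncoder.setFrontFacing(.clockwise) // match the mesh's actual winding
renderEncoder.setCullMode(.back)         // cull only true back faces
```

The symptom (textures that appear to counter-rotate as the camera orbits) is a classic tell for inside-out geometry, which is why the UV and matrix theories went nowhere.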

These tools are fun, and great time savers (at times), but take them out of their comfort zone and it becomes really hard to steer them without domain knowledge and close human review.