The issue is not the models. The issue is that this method was tried before, and humans suck at writing down what they want. Developing in small increments, with feedback along the way, was the answer to that problem.

Even if you made models able to code to a long spec, you would still be left with the hard problem of writing the spec.

An interesting question for me is "can LLMs predict what humans want?"

Like, if you show an LLM a page, can it review the page and spit out something close to what a human would say about it?

Yes, my current nightmare is that I have a very long queue of specs to write, and I need to work with non-technical staff to help them put into words what they actually want.

Software was always that way, though.