That's why you give them the ability to actually execute the code in a sandbox. Then it's not just AI checking AI; you're mixing something deterministic into the loop.
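Roughly the idea is something like this, a minimal sketch assuming a Python snippet and a known expected output (the function name and the stdout comparison are just illustrative, not anyone's actual pipeline):

    # Run the model-generated snippet in a separate interpreter process with a
    # timeout, and accept it only if it exits cleanly and prints the expected output.
    import subprocess
    import sys
    import tempfile

    def run_in_sandbox(generated_code: str, expected_stdout: str, timeout_s: int = 5) -> bool:
        # Write the generated snippet to a temp file so it runs in isolation.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(generated_code)
            path = f.name
        try:
            result = subprocess.run(
                [sys.executable, path],            # fresh interpreter process
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()

    # Accept the snippet only if the sandbox run matches expectations.
    snippet = "print(sum(range(5)))"
    print(run_in_sandbox(snippet, "10"))  # True

The deterministic part is the exit code and output check; the model can still be wrong, but at least the code demonstrably runs.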

That may well increase the agent's ability to get it right, but there will always be cases where the generated code mimics the correct response, i.e. produces the output asked for without actually working as intended, since LLMs tend to want to please as much as to be correct.

However, I think it would remove the case of the bot outright making up non-existent stuff. It could still be plain wrong, but in a more human sort of way: a real support person may be wrong about some precise detail of what they're recommending, but is unlikely to just invent something plausible-sounding.

Not much harm done. The end user sees the response and either spots that it's broken or finds out it's broken when they try to run it.

They take a screenshot and make fun of the rubbish bot on social media.

If that happens rarely, it's still a worthwhile improvement over today. If it happens frequently, then the documentation bot is junk and should be retired.

You're hand-waving away all the other million use cases where returning false information isn't OK.

And the answer the model returns may still not reflect what actually happened in the sandbox.