> You can usually tell when the code isn't right because it doesn't work or doesn't pass a test

Tests (as usually written, in unit-test form) only tell you that the code isn't completely broken; they're not a good indicator that it works well, otherwise "vibecoded slop" wouldn't be a thing. And the tests themselves are usually vibecoded too, which doesn't help much in detecting issues off the happy path.
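To make that concrete, here's a minimal sketch (a hypothetical `average` function and test, not from any real codebase) of the kind of happy-path test I mean, which passes while the code breaks on anything else:

```python
def average(scores):
    # Buggy: crashes on an empty list and silently accepts non-numeric input.
    return sum(scores) / len(scores)

def test_average_happy_path():
    # Typical vibecoded test: exercises only the one obvious input...
    assert average([2, 4, 6]) == 4

# ...and never checks the edges, so the suite is green while
# average([]) raises ZeroDivisionError and average(["a"]) raises TypeError.
```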

> you verify that your AI CEO is giving you the right information or planning its business strategy effectively

The same could be said for human CEOs. A lot of them don't really have good success rates either.

> Tests (as usually written, in unit-test form) only tell you that the code isn't completely broken; they're not a good indicator that it works well, otherwise "vibecoded slop" wouldn't be a thing

You can certainly end up with vibecoded slop that passes all the tests, but it won't pass other forms of evaluation (this is necessarily true; otherwise you couldn't identify it as vibecoded slop in the first place).
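As one example of such an evaluation, a property-based test (here using the Hypothesis library, against the same hypothetical `average` function from above) finds the input the happy-path test skipped almost immediately:

```python
from hypothesis import given, strategies as st

def average(scores):
    # Same buggy implementation as above.
    return sum(scores) / len(scores)

@given(st.lists(st.floats(allow_nan=False, allow_infinity=False)))
def test_average_handles_any_list(scores):
    # Hypothesis generates the empty list as one of its first
    # candidates, so this fails with ZeroDivisionError -- the bug
    # the green happy-path suite never touched.
    average(scores)
```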

> The same could be said for human CEOs. A lot of them don't really have good success rates either.

This is part of my point: the tight feedback loop that lets us judge a model's efficacy in software doesn't exist for the role of CEO.