It's been a lot of trial & error. A quick aside: running these tests/evals/call them what you will at scale has been fascinating to me. Going back and trawling through the logs has been like speed-running through hundreds of usability tests with people, full of the same types of "aha! Of course you'd try and do that, why didn't I think of that already?" moments of insight and inspiration.

Which is also how we've gone about working out how to improve the CLI. It's usually one or more of:

* rethinking the subcommands and hierarchy to something more obvious and aligned to the task

* providing clear documentation upfront (i.e., in the skills file)

* keeping help text concise, but not too concise. You can't assume the reader is already a power user who's simply looking for a reminder/reference, so include usage examples for common use cases

* on errors, where possible, suggesting the command the caller most likely meant

* in general, offering affordances for the likely next steps. This goes for help output, success, and errors
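
To make the last two points concrete, here's a rough Python sketch. It assumes a made-up CLI called `tool`; the subcommand names, the next-step hints, and the 0.6 similarity cutoff are all invented for illustration and not taken from any real CLI.

```python
import difflib

# Hypothetical subcommands for an imaginary CLI called "tool".
COMMANDS = ["deploy", "status", "logs", "rollback"]

# Hypothetical follow-up hints to show after a command succeeds.
NEXT_STEPS = {
    "deploy": ["tool status", "tool logs"],
    "status": ["tool logs"],
}


def unknown_command_message(typed: str) -> str:
    """Build an error that suggests the closest known subcommands, if any."""
    matches = difflib.get_close_matches(typed, COMMANDS, n=2, cutoff=0.6)
    lines = [f"error: unknown command '{typed}'"]
    if matches:
        lines.append("Did you mean: " + ", ".join(matches) + "?")
    lines.append("Run 'tool --help' to list all commands.")
    return "\n".join(lines)


def success_message(command: str) -> str:
    """Report success and, where we have one, a hint about what to run next."""
    lines = [f"{command}: ok"]
    steps = NEXT_STEPS.get(command, [])
    if steps:
        lines.append("Next, you might run: " + ", ".join(steps))
    return "\n".join(lines)


if __name__ == "__main__":
    print(unknown_command_message("deploi"))
    print(success_message("deploy"))
```

The point is that a typo costs one short error plus a suggestion, rather than a full trip through the help text.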

> cli help text is usually massive

That doesn't have to be true.

> could eat a lot of the savings on retries

This doesn't have to be true either. You don't need to give the same full help output on every single error; once they've got it, they've got it. Also, the size of the entire help output for most CLIs is generally insignificant compared to even just a couple of source files in most repos.
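
As a sketch of what I mean by not repeating the full help on every error (Python, with a made-up `tool` CLI and invented help text, so treat the names as assumptions): the error carries one line saying what went wrong and one line pointing at the relevant help, and nothing else.

```python
from typing import Optional

# Hypothetical full help text for an imaginary CLI called "tool".
FULL_HELP = """usage: tool <command> [options]

commands:
  deploy     deploy the current build to an environment
  status     show the state of the latest deployment
  logs       print recent logs for a deployment
  rollback   revert to the previous deployment

Run 'tool <command> --help' for details on a single command.
"""


def usage_error(message: str, command: Optional[str] = None) -> str:
    """Return a two-line error that points at help instead of embedding it."""
    target = f"tool {command} --help" if command else "tool --help"
    return f"error: {message}\nRun '{target}' for usage."


if __name__ == "__main__":
    print(usage_error("missing required option --env", "deploy"))
```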