I’ve recently created a bunch of Claude skills for repeatable tasks (architecture review, performance, magic strings, privacy, SOLID review, documentation review, etc.). The pattern: once I’ve prompted it into the right state and it’s done what I want, I ask it to create a skill from that session. Then I get Codex to check the skill. I can run it independently in another window and feed the results back to adjust it…but you get the idea.
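
For context, a skill is essentially a folder containing a SKILL.md: YAML frontmatter plus the instructions Claude should follow. A rough, illustrative sketch of one (the name and checklist are made up for the example, not copied from an actual skill):

```markdown
---
name: magic-strings-review
description: Scan changed files for hard-coded literals that should be named constants or config values. Use when reviewing a diff or before merging.
---

# Magic strings review

1. List the files changed on this branch.
2. Flag string and numeric literals used in logic or repeated across files.
3. For each flag, propose a named constant and where it should live.
4. Output a table: file, line, literal, suggested replacement.
```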

And almost every time it screws up, we create a test, often for the whole class of problem. More recently it’s been far better behaved. Between Opus, skills, docs, generated Mermaid diagrams, and tests, it’s been a lot better. I’ve also cleaned up so much of the architecture that there’s only one way to do things. That keeps it aligned and fights entropy, and the skills will work better as models improve. Having code, documents, and tests match means it isn’t relying on any one source.
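
As a concrete sketch of a “whole class of problem” test: something like a pytest rule that enforces the one-way architecture by scanning every source file for direct imports of the persistence layer outside the repository package. The module and package names here are hypothetical, not from my codebase:

```python
import ast
import pathlib

import pytest

SRC = pathlib.Path("src")
# Hypothetical layering rule: only the repositories package may import
# the persistence layer directly. Names are illustrative.
FORBIDDEN_PREFIX = "app.persistence"
ALLOWED_PACKAGE = "repositories"


def _imports(path: pathlib.Path) -> set[str]:
    """Collect every module name imported by a source file."""
    tree = ast.parse(path.read_text(), filename=str(path))
    names: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module)
    return names


# One parametrized test covers every file, so the rule catches the whole
# class of violation, not just the spot where it first showed up.
@pytest.mark.parametrize("path", sorted(SRC.rglob("*.py")), ids=str)
def test_only_repositories_touch_persistence(path: pathlib.Path):
    if ALLOWED_PACKAGE in path.parts:
        pytest.skip("repository layer may import persistence")
    offenders = {m for m in _imports(path) if m.startswith(FORBIDDEN_PREFIX)}
    assert not offenders, f"{path} imports persistence directly: {offenders}"
```

The point of scanning the whole tree is that one model mistake becomes a permanent guard against every recurrence of that mistake, anywhere in the codebase.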

Prompts like this seem to work: “What’s the ideal way to do this? Don’t be pragmatic. Tokens are cheaper than me hunting bugs down years later.”

Can you tell me more about how you do tests? What do they look like? What testing tools or frameworks do you use?