This is the core problem right now with developing anything that uses an LLM: it's hard to evaluate how well it works, and nearly impossible to evaluate how well it generalizes, unless the input is constrained so tightly that you might as well not use the LLM. For this I'd probably write a bunch of test tasks and see how the agent performs with and without the skill (a rough sketch of what I mean is below). The tough part is that in certain codebases it might not need the skill at all: the whole environment is an implicit input for coding agents. In my main codebase right now there are tons of Playwright specs that Claude does a great job copying and improving without any special information.
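
To make that concrete, something like the sketch below is the kind of comparison I have in mind. run_agent(), passes(), and the skill file path are placeholders for whatever your actual harness looks like, not a real API:

    # Sketch of a with/without-the-skill comparison over a fixed task set.
    # run_agent() and passes() are placeholders -- swap in however you
    # actually invoke the agent and verify its output.

    TASKS = [
        "add a Playwright spec for the password-reset flow",
        "convert the checkout spec to shared fixtures",
        # ...a dozen or so tasks representative of your codebase
    ]

    def run_agent(task: str, skill: str | None) -> str:
        """Placeholder: run the coding agent on the task, with the skill
        text (if any) added to its context, and return its output."""
        raise NotImplementedError

    def passes(task: str, output: str) -> bool:
        """Placeholder: check the output, e.g. run the generated spec."""
        raise NotImplementedError

    def pass_rate(skill: str | None) -> float:
        results = [passes(task, run_agent(task, skill)) for task in TASKS]
        return sum(results) / len(results)

    if __name__ == "__main__":
        skill_text = open("skills/playwright.md").read()  # hypothetical path
        print(f"without skill: {pass_rate(None):.0%}")
        print(f"with skill:    {pass_rate(skill_text):.0%}")

Even a crude pass/fail check over the same tasks run both ways tells you something, though as noted above the result only holds for that codebase, since the repo itself supplies so much of the context.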

Edit with one more thought: in many ways this mirrors building or adopting dev tooling to help your (human) junior engineers, and that still feels like the right metaphor for working with coding agents. It's extremely context-dependent and murky to evaluate whether a new tool is effective -- you usually just have to try it out.

Also, even if you figure out a good prompt today, you don't know how long it will keep working, because model updates outside your control can change the behavior underneath it.