I'm using Playwright so much right now. All of the good LLMs appear to know the API really well.

Using Claude Code I'll often prompt something like this:

"Start a python -m http.server on port 8003 and then use Playwright Python to exercise this UI, there's a console error when you click the button, click it and then read that error and then fix it and demonstrate the fix"

This works really well even without adding an extra skill.
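
For a sense of what that produces, here's a minimal sketch of the kind of throwaway script the agent ends up writing for that prompt (the port comes from the prompt above; the button's accessible name "Submit" is just a placeholder):

```python
# Rough sketch, not the exact script the agent writes.
# Assumes the page is already served at http://localhost:8003.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    errors = []
    # Collect console errors and uncaught page errors so they can be read back after the click.
    page.on("console", lambda msg: errors.append(msg.text) if msg.type == "error" else None)
    page.on("pageerror", lambda exc: errors.append(str(exc)))

    page.goto("http://localhost:8003")
    page.get_by_role("button", name="Submit").click()  # "Submit" is a placeholder name
    page.wait_for_timeout(1000)  # give the click handler a moment to fire

    print("console errors:", errors)
    browser.close()
```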

I think one of the hardest parts of skill development is figuring out what to put in the skill that produces better results than the model acting alone.

Have you tried iteratively testing the skill - building it up part by part and testing along the way to see if the different sections genuinely help improve the model's performance?

This is the core problem right now with developing anything that uses an LLM. It's hard to evaluate how well it works, and nearly impossible to evaluate how well it generalizes, unless the input is constrained so tightly that you might as well not use the LLM. For this I'd probably write a bunch of test tasks and see how well it performs with and without the skill. But the tough part here is that in certain codebases it might not need the skill; the whole environment is an implicit input for coding agents. In my main codebase right now there are tons of Playwright specs that Claude does a great job copying/improving without any special information.
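
To make the with/without comparison concrete, a harness could be as simple as the sketch below. How the agent gets invoked and how pass/fail gets decided are stubbed out, since both depend entirely on your setup, and the task strings are made up:

```python
# Hypothetical A/B harness sketch: run the same tasks with and without the
# skill installed and compare pass rates. Everything environment-specific is
# stubbed out.
TASKS = [
    "Click the Save button and fix the console error it raises",
    "Add a Playwright spec covering the login form validation",
    # ...more representative tasks from your codebase
]

def run_agent(task: str, with_skill: bool) -> bool:
    # Placeholder: invoke the coding agent here (for Claude Code, something
    # like `claude -p "<task>"` in a scratch checkout), then decide pass/fail
    # by re-running the relevant tests or inspecting the transcript.
    # Returning False keeps the sketch runnable; the real check is up to you.
    return False

def pass_rate(with_skill: bool) -> float:
    results = [run_agent(task, with_skill) for task in TASKS]
    return sum(results) / len(results)

print("with skill:   ", pass_rate(True))
print("without skill:", pass_rate(False))
```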

Edit with one more thought: in many ways this mirrors building/adopting dev tooling to help your (human) junior engineers, and that still feels like the right metaphor for working with coding agents. It's extremely context dependent and murky to evaluate whether a new tool is effective -- you usually just have to try it out.

Also, if you figure out a good prompt today, you don't know how long it will last, because of model updates outside your control.

"there's a console error when you click the button"

Chrome DevTools also has an MCP server that you can connect an LLM to, and it's really good for debugging frontend issues like that.

One place where I see skills having an advantage is when they include scripts for specific tasks where the LLM has a difficult time generating the right code.

They also get around the problem of the LLM being trained on foo tool 1.0 when foo tool is now on version 2.0.

The nice thing is that scripts in a skill are not included in the context, and they're deterministic.
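
As a concrete sketch (the file name and layout here are my own, not from Anthropic's docs), a skill could bundle essentially the same Playwright check from the prompt example above as a frozen script, so the model shells out to it instead of regenerating the boilerplate every session:

```python
# scripts/check_console.py -- hypothetical helper a skill could bundle.
# The model just runs it with a URL (and optionally a selector to click)
# instead of rewriting the same Playwright code, so the behaviour never drifts.
import sys
from playwright.sync_api import sync_playwright

def main(url: str, click_selector: str | None = None) -> int:
    errors = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("console", lambda msg: errors.append(msg.text) if msg.type == "error" else None)
        page.on("pageerror", lambda exc: errors.append(str(exc)))
        page.goto(url)
        if click_selector:
            page.click(click_selector)
        page.wait_for_timeout(1000)
        browser.close()
    for err in errors:
        print(err)
    return 1 if errors else 0  # non-zero exit makes failures easy to script around

if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2] if len(sys.argv) > 2 else None))
```

The model would then just run something like `python scripts/check_console.py http://localhost:8003 "#submit"` and read the output.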

Yeah you can definitely do this with prompts since LLMs know the API really well. I just got tired of retyping the same instructions and wanted to try out the new Skills.

I did test it by comparing transcripts across sessions to refine the workflow, and as I run into new things I'm continuing to do that.

I'm surprised Anthropic didn't release skills with a `skill-creation` skill.

But they did.

Did they!? Damn I missed it.

I was looking into creating one and skimmed the available ones and didn't see it.

EDIT:

Just looked again. In the docs they have this section:

```
Available Skills

Pre-built Agent Skills

The following pre-built Agent Skills are available for immediate use:

    PowerPoint (pptx): Create presentations, edit slides, analyze presentation content
    Excel (xlsx): Create spreadsheets, analyze data, generate reports with charts
    Word (docx): Create documents, edit content, format text
    PDF (pdf): Generate formatted PDF documents and reports

These Skills are available on the Claude API and claude.ai. See the quickstart tutorial to start using them in the API.
```

Is there another list of available skills?

Their repo here: https://github.com/anthropics/skills

This is the skill creation one: https://github.com/anthropics/skills/blob/main/skill-creator...

You can turn on additional skills in the Claude UI from this page: https://claude.ai/settings/capabilities

Nice, thanks!

I get so many LLM death spirals with Playwright.

When it works, it's totally magic, but I find it gets hung up on things like not finding the active Playwright window or not being able to identify elements on the screen.