Are we sure that unrestricted free-form Markdown content is the best configuration format for this kind of thing? I know there is a YAML frontmatter component to this, but doesn't the free-form nature of the "body" part of these configuration files lead to an inevitably unverifiable process? I would like my agents to be inherently evaluable, and free-text instructions do not lend themselves easily to systematic evaluation.

>doesn't the free-form nature of the "body" part of these configuration files lead to an inevitably unverifiable process?

The non-deterministic, statistical nature of LLMs means it's inherently an "inevitably unverifiable process" to begin with, even if you pass it some type-checked, linted skills file or prompt format.

Besides, YAML or JSON or XML or free-form text, for the LLM it's just tokens.

At best you could parse the more structured docs more easily with external tools, but that's about it; it makes little difference to how the LLM consumes them.

The modern state of the art is inherently not verifiable; how you give it input is secondary to that fact. When you can't see the weights or know anything else about the system, any idea of verifiability is an illusion.

Sure. Verifiability is far-fetched. But say I want to produce a statistically significant evaluation result from this – essentially testing a piece of prose. How do I go about this, short of relying on a vague LLM-as-a-judge metric? What are the parameters?

You 100% need to test work done by AI. If it's code, it needs to pass extensive tests; if it's just a question answered, it needs to be the common conclusion of multiple independent agents. You can trust a single AI about as much as an HN or reddit comment, but you can trust a committee of four about as much as a real expert.

More generally, I think testing AI output with its own web search, code execution, and ensembling is the missing ingredient for increased usage. We need to define the counterpart of AI work: what validates it. This is hard, but once it's done you can trust the system, and it becomes cheaper to change.
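To make the "committee" idea concrete, here's a minimal sketch of majority voting over independent runs. `ask_agent` is a placeholder for whatever model client you use (nothing here depends on a specific SDK), and exact-match voting only works when answers are short and normalized; for free-form output you'd need a normalizer or a judge step first.

```python
# Sketch of a "committee of 4": ask n independent agents, take the majority
# answer, and report how strong the agreement was. `ask_agent` is a stand-in
# for however you actually call a model.
from collections import Counter
from typing import Callable

def ask_committee(question: str,
                  ask_agent: Callable[[str], str],
                  n: int = 4) -> tuple[str, float]:
    """Return the most common answer and the fraction of agents agreeing."""
    answers = [ask_agent(question).strip().lower() for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n

# Usage idea: only accept the result when agreement clears a threshold,
# otherwise escalate to a human or to more runs.
# answer, agreement = ask_committee("Is X true? Answer yes or no.", ask_agent)
# if agreement < 0.75:
#     ...  # escalate
```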

How would you evaluate it if the agent were not a fuzzy logic machine?

The issue isn't the LLM; it's that verification is actually the hard part. In any case, this is typically called "evals", and you can probably craft a test harness to evaluate these if you think about it hard enough
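And the "statistically significant" part asked about upthread doesn't need anything fancy: a list of (prompt, programmatic check) pairs, repeated runs, and a confidence interval on the pass rate. A rough sketch below; `run_agent` is a placeholder for whatever actually invokes the agent with the skill loaded, not a real API.

```python
# Bare-bones eval harness: each case is a prompt plus a programmatic check,
# and the output is a pass rate with a 95% confidence interval, i.e. an
# actual statistic rather than a vibe.
import math
from typing import Callable

def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate (better behaved at small n)."""
    if n == 0:
        return 0.0, 1.0
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - margin, center + margin

def run_evals(cases: list[tuple[str, Callable[[str], bool]]],
              run_agent: Callable[[str], str],
              trials: int = 5) -> None:
    passes = total = 0
    for prompt, check in cases:
        for _ in range(trials):          # repeat runs: the model is stochastic
            total += 1
            passes += check(run_agent(prompt))
    lo, hi = wilson_interval(passes, total)
    print(f"pass rate {passes}/{total} = {passes/total:.2f} (95% CI {lo:.2f}-{hi:.2f})")

# cases = [("Summarize RFC 2119 in one sentence.", lambda out: "MUST" in out)]
# run_evals(cases, run_agent)
```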

Would a structured skills file format help you evaluate the results more?

Yes. It would make it much easier to evaluate results if the input contents were parameterized and normalized to some agreed-upon structure.

Not to mention the advantages it would present for iteration and improvement.

"if the input contents were parameterized and normalized to some agreed-upon structure"

Just the format would be. There's no rigid structure that gets any preferential treatment from the LLM, even if it did accept one. In the end it's just instructions, no different in any way from the rest of the prompt text.

And nothing stops you from defining something "parameterized and normalized to some agreed-upon structure" yourself and passing it directly to the LLM as skill content, or parsing it and dumping the result as regular skill text.
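To spell that out: the agreed-upon structure can live entirely on your side and get flattened into the markdown the model actually sees. A rough sketch, where the schema (name/description/steps/constraints) is just something I made up for illustration, not anything the tooling mandates:

```python
# "Agreed-upon structure in, plain skill text out." The spec schema here is
# invented for the example; the point is that any structure you standardize
# on can be rendered down to the text the model reads.
def render_skill(spec: dict) -> str:
    frontmatter = (
        "---\n"
        f"name: {spec['name']}\n"
        f"description: {spec['description']}\n"
        "---\n"
    )
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(spec["steps"], 1))
    constraints = "\n".join(f"- {c}" for c in spec.get("constraints", []))
    return f"{frontmatter}\n## Steps\n{steps}\n\n## Constraints\n{constraints}\n"

spec = {
    "name": "release-notes",
    "description": "Draft release notes from merged PR titles.",
    "steps": ["Collect merged PR titles since the last tag.",
              "Group them by area.",
              "Write one bullet per change in past tense."],
    "constraints": ["Never invent PR numbers.", "Keep it under 200 words."],
}
print(render_skill(spec))
```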

At least MCPs can be unit tested.

With Skills, however, you just selectively append more text to the prompt and pray.
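That's the contrast: an MCP tool bottoms out in an ordinary function with typed inputs and outputs, so a plain unit test pins its behaviour; a skill only changes what text gets appended, so the only "test" is another model run scored by an eval harness. A toy example of the former (a hypothetical handler, not the actual MCP SDK, with the server wiring omitted):

```python
# A generic, deterministic tool handler: no model involved, so an ordinary
# unit test is enough to verify it.
def search_tickets(query: str, limit: int = 5) -> list[dict]:
    """Hypothetical tool handler for illustration only."""
    tickets = [{"id": 1, "title": "Fix login bug"}, {"id": 2, "title": "Add dark mode"}]
    return [t for t in tickets if query.lower() in t["title"].lower()][:limit]

def test_search_tickets_filters_and_limits():
    assert search_tickets("login") == [{"id": 1, "title": "Fix login bug"}]
    assert search_tickets("zzz") == []
```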

The DSPy + GEPA idea mentioned above[1] seems like it could be a reasonable approach for systematic evaluation of skills (though not of agents as a whole). I'm going to give it a bit of a play over the holiday break to sort out a really good jj-vcs skill.

[1]: https://news.ycombinator.com/item?id=46338371

Then rename your markdown skill files to skills.md.yaml.

There you go, you're welcome.