This is interesting for offloading "tiered" workloads / priority queue with coding agents.

If 60% of the work is "edit this file with this content", or "refactor according to this abstraction" then low latency - high token inference seems like a needed improvement.

Recently someone made a Claude plugin to offload low-priority work to the Anthropic Batch API [1].

Also I expect both Nvidia and Google to deploy custom silicon for inference [2]

1: https://github.com/s2-streamstore/claude-batch-toolkit/blob/...

2: https://www.tomshardware.com/tech-industry/semiconductors/nv...

Note that Batch APIs are significantly higher latency than normal AI agent use. They're mostly intended for bulk work where time constraints are not essential. Also, GPT "Codex" models (and most of the "Pro" models also) are currently not available under OpenAI's own batch API. So you would have to use non-agentic models for these tasks and it's not clear how well they would cope.

(Overall, batches do have quite a bit of potential for agentic work as-is but you have to cope with them taking potentially up to 24h for just a single roundtrip with your local agent harness.)

I built something similar using an MCP that allows claude to "outsource" development to GLM 4.7 on Cerebras (or a different model, but GLM is what I use). The tool allows Claude to set the system prompt, instructions, specify the output file to write to and crucially allows it to list which additional files (or subsections of files) should be included as context for the prompt.

Ive had great success with it, and it rapidly speeds up development time at fairly minimal cost.

Why use MCP instead of an agent skill for something like this when MCP is typically context inefficient?

MCP is fine if your tool definition is small. If it's something like a sub-agent harness which is used very often, then in fact it's probably more context efficient because the tools are already loaded in context and the model doesn't have to spend a few turns deciding to load the skill, thinking about it and then invoking another tool/script to invoke the subagent.

Models haven't been trained enough on using skills yet, so they typically ignore them

Is that true? I had tool use working with GPT-4 in 2023, before function calling or structured outputs were even a thing. My tool instructions were only half a page though. Maybe the long prompts are causing problems?

They're talking about "skills" which are not the same thing as tools. Most models haven't been trained on the open SKILL spec, and therefore aren't tuned to invoke them reliable when the need occurs.