While I agree that the harness is part of it, I think it's also a lack of epistemic understanding or awareness for what it means to actually solve a problem vs just get something kinda working; maybe if Claude Code or other harnesses made web search more likely or had a better way to make technical documentation and specs available to models, it would be better solvable there.

I often tell it to stop asking me and just keep going until it accomplishes X task; unfortunately it tends to assume I want something that only just barely works, in the sense that it means it's time to stop once its there, which is I don't think a harness by itself could easily address (ultimately the model itself needs to determine the stopping points unless I literally specify by hand hidden evaluation criteria).

That's why think it's at least partially a training issue where the model gets rewarded for "solving" the problem within a certain amount of context/time without access to grounded knowledge (eg looking up the actual spec for a format) nor adversarially/rigorously evaluated against a reviewer capable of finding all the edge cases/shortcuts preventing something from being a properly generalized solution. I don't want it to ask me for guidance when it's working on a well-specified problem, I want it to either find the right parser and use it, or to completely implement one against the spec, rather than write some half-assed string inserter that eg only works on the specific select statements my examples use right now. My understanding is that the Mythos/Fable models were better trained for this but from my brief foray into using Fable for work I wasn't that impressed. For me they need to get better at agentic search and self-eval still

There are still billion dollar opportunities in the harness/LLM space.

Having a reliable shared memory for hundreds of agentic AI users is something that's 95% snake oil at the moment. There are a few successes on an individual level (I really like Hermes[0]) but nothing scales to a company level easily.

It should be possible to (pre)configure all agentic harnesses used in a company to use a single source for information so that it'd automatically pick up internal libraries, conventions, licensing decisions etc and remember them across sessions.

I've had limited success with this on a personal level, but it's still not ingrained in the model because it would really need a custom harness. Hooks, skills, prompts get you like 80% of the way. I still need to do a "please check that the project matches the conventions defined in ..." regularly to catch any drift - especially on more vague stuff that can't be locked down with unit testing.

[0] https://hermes-agent.nousresearch.com

[flagged]