I like the small-surface-area approach. The question I’d use to evaluate this is how well the harness records/replays tool calls and failure modes, since that is where debugging agent behavior usually gets messy.