I didn't really understand the "long task" thing until I actually experienced it. The problem is finding a task you can set an agent that justifies working for that long. I finally hit one when I tried porting that Python HTML5 parser to JavaScript by pointing Codex CLI at the 9,200 html5lib-tests test suite: https://simonwillison.net/2025/Dec/15/porting-justhtml/
It's pretty amazing to watch tools-in-a-loop crunch away for >4 hours to solve a generally difficult problem through sheer brute-force.
To be clear, this doesn't mean that it takes the AI > 4 hours to do the task. METR measures the difficulty of a task by how long it takes a human to do the same task. This benchmark is saying that Opus 4.5 can now do tasks (related to AI R&D, coding foremost among them) that take human experts > 4 hours, at a 50% reliability level; whether that's actually useful depends, of course, on the cost of failure. It is silent on how long it takes AI systems to do those tasks. In theory an AI system could take longer than that (in practice it's usually significantly shorter).
This is of course quite highly correlated with an AI system being able to churn through a task for a long time. But it's not necessarily the same thing.
Of course the big questions are going to arise if/when we start crossing lines like 8 hours (a whole work day) or 40 hours (a whole work week).
I think you might be misunderstanding the article, actually: this is about AI solving tasks as measured by how long it takes a human to solve them. The AI could potentially solve them much quicker, but the use of "human time to solve" is an attempt to create a metric that reveals long-horizon complexity (as I understand it, anyway).
It's interesting because like the article notes, AI is really smashing benchmarks, but actual usefulness in automation of thought work is proving much more elusive. I think that collective experience of AI just not being that useful, or as useful as benchmarks suggest it should be, is captured in this metric.
I've practiced a healthy skepticism of the recent boom, but I can't see why the long-horizon time wouldn't stretch to 8 hours or a week's worth of effort by next year. After Opus 4.5, governments and organizations should really figure out a path out of this storm, because we're in it now.
Doubling time has been 7 months for a while, so you should expect 8h, not 1 week, next year.
Predictions over historical data in a landscape with fragile priors don't seem like a strong basis to me (they're a useful approximation at best).
It's significantly accelerated to 4 months since the beginning of 2025, which puts 1 week within reach if things stay on trend. But yes 7 months is the more reliable long-term trend.
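To make that concrete, here's a quick back-of-envelope sketch in JavaScript, taking the ~4h-at-50% figure and the two doubling times mentioned in this thread as given (an illustration of the extrapolation, not a forecast, and not METR's own projection):

    // Months until the 50% time horizon reaches a target, assuming exponential
    // growth from roughly 4h today with a fixed doubling time. Both inputs are
    // the figures quoted in this thread.
    function monthsToReach(targetHours, currentHours, doublingMonths) {
      return Math.log2(targetHours / currentHours) * doublingMonths;
    }

    for (const doubling of [7, 4]) {
      console.log(`doubling every ${doubling} months:`);
      console.log(`  8h (work day):   ~${monthsToReach(8, 4, doubling).toFixed(0)} months`);
      console.log(`  40h (work week): ~${monthsToReach(40, 4, doubling).toFixed(0)} months`);
    }

At a 7-month doubling, 8h arrives in about 7 months and a 40-hour week in roughly two years; at a 4-month doubling, the work week is about 13 months out, which is what "within reach if things stay on trend" amounts to.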
Can we attribute the acceleration to something specific that might not keep compounding? For example, agentic coding and reasoning models seem to have made a huge leap in abilities, but a one-off leap wouldn't translate to ongoing exponential growth.
METR is using hours of equivalent human effort, not actual hours the agent itself spends, so by their methodology, your task might qualify as one where it pulls off much more than 4h of human work.
"Human hours equivalent" itself is an interesting metric, because: which human? Or rather, I'm sure they had a coherent definition in mind: presumably a human reasonably competent at whatever the specific task is. But hours the abstract human standard would spend is different from the hours any specific person, say you or I, would spend.
In particular, some of the appeal (and risk!!) of these things is precisely that you can ask for help with things that would be quick work for someone (who knows jq, or a certain corner of the PyPI library ecosystem, or modern CSS, or TypeScript annotations, or something else) but not for you.
The “50% time horizon” feels most actionable when you pair it with an expected-value model. For a given task: EV ≈ (human_time_saved × $/hour) − (p_fail × cost_of_failure) − (iteration/oversight cost). A model crossing 4h-at-50% might be hugely useful for low failure-cost work, and still net-negative for anything where rollback/debug is expensive. The missing piece is how p_fail scales with task length + how recoverable failures are.
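A minimal sketch of that model, with entirely made-up numbers (how p_fail scales with task length is exactly the unknown):

    // Hypothetical expected-value calculation for delegating a task to an agent.
    // Every figure below is invented for illustration.
    function delegationEV({ humanHoursSaved, hourlyRate, pFail, costOfFailure, oversightCost }) {
      return humanHoursSaved * hourlyRate - pFail * costOfFailure - oversightCost;
    }

    // Low failure-cost work: positive even at 50% reliability.
    console.log(delegationEV({ humanHoursSaved: 4, hourlyRate: 100, pFail: 0.5, costOfFailure: 50, oversightCost: 50 }));    // 325

    // Expensive-to-roll-back work: same reliability, net negative.
    console.log(delegationEV({ humanHoursSaved: 4, hourlyRate: 100, pFail: 0.5, costOfFailure: 2000, oversightCost: 100 })); // -700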
Yeah--it's difficult to go from a benchmark involving the model attempting things alone to the effect of it assisting people on real tasks because, well, ideally you'd measure that with real people doing real tasks. Last time METR tried that (in early '25) they found a net slowdown rather than any speedup at all. Go figure!
>which human
The second graph has this under it:
The length of tasks (measured by how long they take human professionals) that generalist frontier model agents can complete autonomously with 50% reliability has been doubling approximately every 7 months for the last 6 years...
Yeah--I wanted a short way to gesture at the subsequent "tasks that are fast for someone but not for you are interesting," and did not mean it as a gotcha on METR, but I should've taken a second longer and pasted what they said rather than doing the "presumably a human competent at the task" handwave that I did.
I agree. After all, benchmarks don't mean much, but I guess they are fine as long as they keep measuring the same thing every time. Also, the context matters. In my case, I see a huge difference between the gains at work vs those at home on a personal project, where I don't have to worry about corporate policies, security, correctness, standards, etc. I can let the LLM fly and not worry about losing my job in record time.
How are you guys even doing long tasks with plain Codex or Claude code?
I use Claude code and I get hit with a permissions prompt every 2 seconds for anything I try to do.
Sure, I can turn off all the permission checks, but honestly it'd probably stop and claim it's finished well before it actually is in most cases, from my experience.
To be fair I haven't tried Codex, so maybe it's better at this, but in my experience almost every model stops at some point and claims victory, or stops and tells me something like "next we'll continue on with XYZ", at which point I have to prompt it to continue.
Codex (at least 5 and 5.1) is bad at asking for permission. Whenever it wants to run pre-commit or platformio, it tries to do that, that fails because of the sandbox, and then Codex decides something is wrong with the cache directory and keeps asking for permission to sudo chown ~/.cache, every time.
I have to specifically tell it to request permission for the command it wants to run, and then it works. Very annoying, and very annoying that it can't persist the permission, like Claude Code can, so it doesn't have to ask again every single time.
You have to use --yolo or --dangerously-skip-permissions options.
Thankfully the cloud versions (Claude Code for web, Codex Cloud) run like that already, and are relatively safe in that if anything goes wrong it happens on someone else's computer.
Quickly looking at the source code, mostly treeBuilder and tokenizer, I do see several possible improvements:
- Use TypeScript instead of JavaScript
- Use perfect hashes instead of ["a", "b", "c"].includes() idioms, string equalities, Sets, etc.
- Use a single perfect hash to match all tags/attribute names and then use enums in the rest of the codebase
- Use a single if (token.kind === Tag.START) instead of repeating that check across 10 consecutive conditionals
- Don't return the "reprocess" constant; use an enum, or perhaps nothing if "reprocess" is the only option
- Try tail recursion instead of a switch over the state in the tokenizer
- Use switches (best after a perfect hash lookup) instead of multiple ifs on characters in the tokenizer
- "treeBuilder.openElements = treeBuilder.open_elements;" can't possibly be good code
Perhaps the agent could find these itself if told to make the code perfect and not just pass tests.
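For the lookup idiom in particular, a minimal sketch of the kind of change meant (hypothetical names, not taken from the justjshtml source):

    // Before: allocates a fresh array and does a linear scan on every call.
    function isHeadingBefore(tagName) {
      return ["h1", "h2", "h3", "h4", "h5", "h6"].includes(tagName);
    }

    // After: one shared Set with O(1) lookups. A perfect hash over all known
    // tag names, mapping to integer constants used as enums elsewhere in the
    // codebase, would be the further step suggested above.
    const HEADING_TAGS = new Set(["h1", "h2", "h3", "h4", "h5", "h6"]);
    function isHeading(tagName) {
      return HEADING_TAGS.has(tagName);
    }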
Thanks for the feedback - I pasted it into a Claude Code session on my phone, here's the resulting PR: https://github.com/simonw/justjshtml/pull/7
I didn't include the TypeScript bit though - it didn't use TypeScript because I don't like adding a build step to my JavaScript projects if I can possibly avoid it. The agent would happily have used TypeScript if I had let it.
I don't like that openElements = open_elements pattern either - it did that because I asked it for a port of a Python library and it decided to support the naming conventions for both Python and JavaScript at once. I told it to remove all of those.
I had it run a micro benchmark too against the before and after - here's the code it used for that: https://github.com/simonw/justjshtml/blob/a9dbe2d7c79522a76f...
After applying your suggestions: it pushed back against the tail recursion suggestion:

> The current implementation uses a switch statement in step(). JavaScript doesn’t have proper tail call optimization (only Safari implements it), so true tail recursion would cause stack overflow on large documents.
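For anyone wondering what that refers to: a loop over a state switch keeps the call stack flat regardless of document size, whereas recursing from one state handler to the next grows the stack on engines without tail-call optimization. A schematic sketch of that shape (not the actual justjshtml tokenizer):

    // Schematic only. An explicit loop over a state switch never deepens the
    // call stack; a recursive "call the next state" style would, on engines
    // without proper tail calls.
    function tokenize(input) {
      let state = "data";
      let pos = 0;
      while (pos < input.length) {
        switch (state) {
          case "data":
            state = input[pos] === "<" ? "tagOpen" : "data";
            pos++;
            break;
          case "tagOpen":
            // ...consume tag name and attributes, emit a token...
            state = "data";
            pos++;
            break;
          default:
            throw new Error(`unknown state: ${state}`);
        }
      }
    }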
My problem with the OpenAI models (GPT-5.2 in particular) recently is an extreme aversion to doing more than the smallest step in a task before asking for user input. Even if I explicitly instruct it to continue without input until the task is complete, it ignores the instruction.
I cannot imagine GPT-5.2 working on a task for more than 2 minutes, let alone 4 hours. I’m curious if you’ve run into this and figured out a way around it?
I've not had that problem at all with GPT-5.2 running in Codex CLI.
I use prompts like this:
I have not tried it in Codex CLI, I’ll give that a shot and see if it changes things.
I find that surprising. GPT-5.2 is the model I've had working the longest. It frequently works more than 4 hours nonstop, while earlier models would stop to ask if they should continue every 10 minutes. 5.1 and earlier ignore it if I ask them to continue until a task is done, but 5.2 will usually finish it.
What agent framework are you using? It can differ from one to the next on the same model.
I am using it in Zed.
You should take into consideration the time it took to make those 9200 tests originally. If you have good test coverage the agent can go much farther ahead.
Heh, I mostly use AI in the opposite direction to write tests because:
1. That’s the part of development work I hate the most, and it never really clicked with me
2. AI, up to this point, seems to be better at writing tests than code
Take this with the grain of salt that:
1. I suck
2. My work is mostly in the realm of infrastructure where testing has always been weird and a little dumb
AI has become very good at writing pointless and bad tests, at least. It remains difficult to compel it to write good tests consistently.
But even if it wrote great tests every time, the trouble is that testing was designed around the idea of "double entry accounting". Even great tests can test the wrong thing. In the old world you would write a test case and then implement something to satisfy it. If both sides of the ledger agree, so to speak, you can be pretty confident that both are correct. In other words, going through the process of implementation gives you an opportunity to make sure the test you wrote isn't ill-conceived or broken itself. If you only write the tests, or only write the implementation, or write none of it, there is no point at which you can validate your work.
If you have already built up an application and are reusing its test suite to reimplement the software in another language, like above, that is one thing, but in greenfield work it remains an outstanding problem of how to validate the work when you start to involve AI agents. Another article posted here recently suggests that we can go back to manual testing to validate the work... But that seems like a non-solution.
Every error is a signal that you need better tests. You can let the LLM create tests for every error it stumbles into, besides all the regular tests it can write on its own. Add all the test scenarios you can think of, since you are not implementing them by hand. A bad test is invalidated by the code, and bad code is invalidated by the tests, so between them the AI agent can become reliable.
Simon, have you got to the point where you just don't read the article?
Others have pointed out that your interpretation of "long task" is not the same as the article's.
Maybe this is the negative effects of excessive LLM usage that are spoken about.
They were right. I hadn't read enough of the article to understand what was meant by multi-hour tasks. I upvoted them for pointing that out.
>> Maybe this is the negative effects of excessive LLM usage that are spoken about.
> I upvoted them for pointing that out.
I'm also curious about what you think about the GP's question. TBH, responding after reading half an article was a common thing for most people pre-LLM anyway.
Yeah, show me a Hacker News user who's never posted a comment on a story without properly reading it (or even without clicking the link). LLMs have nothing to do with it.
If I had piped the article through an LLM first, I wouldn't have made the embarrassing mistake in that comment!
What's more amazing is how fast your account empties when they do that.
it's $200/month for the "unlimited" plan.
It's amazing how fast your account hits usage limits.
I think GP was being sarcastic: they did say that the plans were "unlimited".
I read
and quite differently.