Hacker News

anentropic 14 hours ago [ - ]

> 51.0% on Terminal-Bench 2.0, proving its ability to handle sophisticated, long-horizon tasks with unwavering stability

I don't know anything about TerminalBench, but on the face of it a 51% score on a test metric doesn't sound like it would guarantee 'unwavering stability' on sophisticated long-horizon tasks

networked 12 hours ago [ - ]

51% doesn't tell you much by itself. Benchmarks like this are usually not graded on a curve and aren't calibrated so that 100% is the performance level of a qualified human. You could design a superhuman benchmark where 10% was the human level of performance.

Looking at https://www.tbench.ai/leaderboard/terminal-bench/2.0, I see that the current best score is 75%, meaning 51% is ⅔ SOTA.

andai 11 hours ago [ - ]

This is interesting, TFA lists Opus at 59. Which is the same as Claude Code with opus on the page you linked here. But it has Droid agent with Opus scoring 69. Which means the CC harness harness loses Opus 10 points on this benchmark.

I'm reminded of https://swe-rebench.com/ where Opus actually does better without CC. (Roughly same score but half the cost!)

pitched 13 hours ago [ - ]

That score is on par with Gemini 3 Flash but these scores look much more affected by the agent used than the model, from scrolling through the results.

varispeed 12 hours ago [ - ]

Gemini 3 Flash is pure rubbish. It can easily get into loop mode and spout information no different than Markov chain and repeat it over and over.

YetAnotherNick 12 hours ago [ - ]

TerminalBench is like the worst named benchmark. It has almost nothing to do with terminal, but random tools syntax. Also it's not agentic for most tasks if the model memorized some random tool command line flags.

esafak 10 hours ago [ - ]

What do you mean? It tests whether the model knows the tools and uses them.

YetAnotherNick 9 hours ago [ - ]

Yeah it's a knowledge benchmark not agentic benchmark.

esafak 9 hours ago [ - ]

That's like saying coding benchmarks are about memorizing the language syntax. You have to know what to call when and how. If you get the job done you win.

YetAnotherNick 9 hours ago [ - ]

I am saying the opposite. If a coding benchmark just tests the syntax of a esoteric language, it shouldn't be called coding benchmark.

For a benchmark named terminal bench, I would assume it would require some terminal "interaction", not giving the code and command.