Hacker News

YetAnotherNick 12 hours ago

TerminalBench is like the worst-named benchmark. It has almost nothing to do with the terminal; it mostly tests the syntax of random tools. Also, most tasks aren't agentic if the model has memorized some random tool's command-line flags.

esafak 10 hours ago

What do you mean? It tests whether the model knows the tools and uses them.

YetAnotherNick 9 hours ago

Yeah, it's a knowledge benchmark, not an agentic benchmark.

esafak 9 hours ago

That's like saying coding benchmarks are about memorizing the language syntax. You have to know what to call when and how. If you get the job done you win.

YetAnotherNick 9 hours ago

I am saying the opposite. If a coding benchmark just tests the syntax of an esoteric language, it shouldn't be called a coding benchmark.

For a benchmark named Terminal Bench, I would assume it requires some terminal "interaction", not just giving the code and command.