I'd love to see a benchmark that tests different LLMs for slop, not necessarily limited to code. That might be even more interesting than ARC-AGI.
See the writing benchmarks here: https://eqbench.com/creative_writing_longform.html
Note that this is the same first author.
Not a benchmark per se, but there is a "Not x, but y" Slop Leaderboard:
https://www.reddit.com/r/LocalLLaMA/comments/1lv2t7n/not_x_b...
100% of LLM output is slop. Done.