But how is that testable? If your tests validate rigidity, water resistance, etc., they will all pass even if the underlying material is a bad choice or the glue will degrade in six months.

You can't test whether a codebase will stay extensible or maintainable as requirements change in the future, or whether the abstraction level or architecture is sound - that's down to code quality measures like the ones used here. LLMs are very good at cheating slightly to pass tests even when the implementation is wrong. Introducing subjectivity - the kind of input a human would provide - leads to improved output.

https://senior-swe-bench.snorkel.ai/blog/2026-06-16-how-it-w...

That's why we should simulate changing requirements, for example with an LLM roleplaying as a human who's co-developing with an agent. Simply asking the LLM to add one big feature is not enough. I don't see why we shouldn't be able to build a more advanced benchmark. Attempting to benchmark "taste" is not the way.
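
To make the idea concrete, here is a rough sketch of what such a harness could look like. Everything in it is hypothetical: `Round`, `run_benchmark`, `churn`, and `toy_agent` are names made up for illustration, the codebase is modelled as a plain dict of file contents, and the "human" is just a fixed list of requirement strings rather than an LLM. The point is only the shape: requirements arrive in rounds, earlier checks keep running as regressions, and you record how much had to change each time.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representation: file path -> file contents.
Codebase = Dict[str, str]


@dataclass
class Round:
    requirement: str                           # what the simulated "human" asks for
    checks: List[Callable[[Codebase], bool]]   # must hold from this round onward


def churn(before: Codebase, after: Codebase) -> int:
    """Count files added, removed, or modified between two snapshots."""
    paths = set(before) | set(after)
    return sum(1 for p in paths if before.get(p) != after.get(p))


def run_benchmark(agent, rounds: List[Round], start: Codebase = None) -> List[dict]:
    """Feed requirements one at a time; re-run every earlier check each round."""
    code: Codebase = dict(start or {})
    active_checks: List[Callable[[Codebase], bool]] = []
    report = []
    for r in rounds:
        new_code = agent(dict(code), r.requirement)  # agent works on a copy
        active_checks.extend(r.checks)               # old checks act as regressions
        passed = sum(1 for check in active_checks if check(new_code))
        report.append({
            "requirement": r.requirement,
            "passed": passed,
            "total": len(active_checks),
            "churn": churn(code, new_code),
        })
        code = new_code
    return report


def toy_agent(code: Codebase, requirement: str) -> Codebase:
    # Stand-in for a real coding agent: writes one new file per requirement.
    code[f"feature_{len(code)}.txt"] = requirement
    return code
```

In a real version the requirement list would come from the roleplaying LLM and react to what the agent just produced, and churn per round would be one (crude) proxy for how well the earlier design absorbed the change.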