Why didn't they just make it "Staff SWE-Bench", would be much better smh. /s
But seriously, as an industry we're terrible at assessing engineering levels, I've worked with "senior engineers" who can't code and I've worked with "junior engineers" who could run rings around them.
Benchmarks like this should be much more precise about what they're actually testing, and what axes they're hard on. We also need to rise above prompts like "you are a senior engineer", it's woo, and it's far better to ask for precise outcomes.
Principal-SWE-Bench will take some time to run, because the LLM needs to wait for a crisis to present its solution, having correctly identified that the same solution would have been organizationally impossible to propose until that moment.
As someone who's trying to get better assessments, I'm struggling to come up with objective coding tasks that evaluates all aspects of real life like planning, design choices, problem solving and context usage. From your experience with humans, Do you have any recommendations on what could be effective in measuring it?
I think the source of your issue is in your statement itself, why do you want a task that evaluate things as broad to be only a coding task ? Shouldn't it be a planning task, documentation task, knowledge retrieval task etc. And very certainly not with just an initial prompt but an existing codebase + existing doc + tickets ?