I think DeepSWE is flawed in a different way: the tasks look like someone took a bunch of big, highly technical PRs they found really well done and inverted them into specs for agents to execute to the letter. This is not really how people use agents in practice IMO. And it's why DeepSWE is so generous to OAI models: rigid task execution is the thing they're best at. I think FrontierCode matches the vibes a lot better.