Qwen 3.7 Max: > During my local testing before the full eval harness it was the only non-GPT model that was able to complete the task, was not able to reproduce in the longer runs.
Doesn't that sound like maybe the harness was the problem?
I was using the same harness for each run. The only difference is that the earlier result came from running the harness locally on my machine, before I pushed up the full runs.