The tension here is that what customers actually need is to reproduce this result on their own problem. Measuring that takes extensive evals on private data.

OpenAI simply won’t share the data you’d need to reproduce this the way you’d expect to with an academic paper.

It’s an engineering result, not a scientific one.