I personally agree with your aim to replicate this, because I suspect the outcomes will be surprising to all.

Here’s my sketch of a plan: You’d need controlled environments, impartial judges, time, and well defined experiments.

The controlled environment would be a set of static models run locally or on cloud GPUs; the impartial judge would be static analysis and security tools for various stacks.

Time: Not the obvious, “yes it would take time to do this”. But a good spread of model snapshots that have matures; along with zero days.

Finally: The experiments would be the prompts and tests; choosing contentious, neutral, and favorable (but to whom) groups, and choosing different stacks and problem domains.