They could do it this way: generate 10 reasoning traces in parallel, and every N tokens prune the 9 with the lowest cumulative likelihood, then re-branch from the surviving highest-likelihood trace and continue.
This is a form of task-agnostic test-time search that is more general than multi-agent parallel-prompt harnesses.
10 traces would make sense because ChatGPT 5.2 Pro is 10x more expensive per token.
That's something you can't replicate from the outside without access to the network's output before token sampling, i.e. the per-token likelihoods from the raw logits.
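A minimal sketch of that prune-and-rebranch loop, in Python. Everything here is an assumption for illustration: `step` is a stub standing in for real model sampling, and the constants are guesses, not anything OpenAI has described.

```python
import math
import random

K = 10       # parallel traces (matches the 10x per-token price multiple)
CHUNK = 64   # N: tokens generated between pruning steps
ROUNDS = 8   # prune-and-rebranch cycles to run

def step(trace, n_tokens):
    """Extend one trace by n_tokens; return (new_trace, chunk_logprob).

    Stub: a real implementation would sample from the model and sum the
    log-probability of each sampled token, read off the pre-sampling
    logits -- exactly the access an outside user lacks."""
    new_tokens = [random.randrange(50_000) for _ in range(n_tokens)]
    chunk_logprob = sum(math.log(random.uniform(0.05, 1.0)) for _ in range(n_tokens))
    return trace + new_tokens, chunk_logprob

def prune_and_continue(prompt_tokens):
    trace, total_logprob = list(prompt_tokens), 0.0
    for _ in range(ROUNDS):
        # Branch the surviving trace into K parallel continuations.
        candidates = [step(trace, CHUNK) for _ in range(K)]
        # All candidates share the same prefix and length, so ranking by
        # the chunk's summed log-likelihood alone prunes the bottom 9.
        trace, chunk_lp = max(candidates, key=lambda c: c[1])
        total_logprob += chunk_lp
    return trace, total_logprob

if __name__ == "__main__":
    tokens, lp = prune_and_continue([1, 2, 3])
    print(f"kept {len(tokens)} tokens, cumulative logprob {lp:.1f}")
```

With K = 10 and one survivor per round, the total sampled-token budget is exactly 10x a single trace, which lines up with the pricing intuition above.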