Great post! I wonder why MCTS is not more popular as a test-time compute harness. Did you compare the performance of MCTS (without distillation) against other methods (e.g., best-of-N) under the same compute budget?

I didn't compare it as a harness (I focused on distillation), but the original Tree of Thoughts (ToT) paper has a section on it: https://arxiv.org/abs/2305.10601