Dumb (?) question, but how is Google's approach here different from Mixture of Experts? Instead of training separate experts with different model weights, it just counts on temperature to provide diversity of thought. How much benefit is there in getting that diversity from different runs of the same model versus running a consortium of models with different weights and architectures? Is there a paper contrasting results, given fixed compute, between spending it on multiple runs of the same model vs. on different models?
MoE is just a way to add more parameters/capacity to a model without making it less efficient to run, since not all parameters are used for each token passing through the model. The name is a bit misleading, since the "experts" are just alternate paths through part of the model, without any distinct expertise in the way the name might suggest.
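For intuition, here's a rough sketch of what that routing looks like (toy NumPy, hypothetical shapes and names, standard top-k gating): a small router scores each token's hidden state, and only the top-k "expert" sub-networks actually run for that token.

    import numpy as np

    def softmax(x):
        x = x - x.max()
        e = np.exp(x)
        return e / e.sum()

    def moe_layer(x, gate_w, expert_ws, top_k=2):
        # x:         (d_model,)            one token's hidden state
        # gate_w:    (d_model, n_experts)  router weights
        # expert_ws: list of (d_model, d_model) matrices, one per "expert"
        scores = softmax(x @ gate_w)           # router scores every expert...
        chosen = np.argsort(scores)[-top_k:]   # ...but only the top_k actually run
        out = np.zeros_like(x)
        for i in chosen:
            out += scores[i] * np.maximum(x @ expert_ws[i], 0.0)  # weighted sum of chosen experts
        return out

    # toy usage: 8 "experts", but only 2 of them are touched per token
    d, n = 16, 8
    rng = np.random.default_rng(0)
    y = moe_layer(rng.standard_normal(d), rng.standard_normal((d, n)),
                  [rng.standard_normal((d, d)) for _ in range(n)])

So the "experts" are just parallel weight matrices behind a learned router, not separately trained specialists.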
Just running the model multiple times on the same input and selecting the best response (according to some judgement) seems a bit of a haphazard way of getting much diversity of response, if that is really all it is doing.
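To be concrete about what that pattern amounts to, here's a minimal sketch of best-of-N sampling, with generate() and score() as hypothetical stand-ins for the model call and whatever judge/verifier does the selection:

    def best_of_n(prompt, generate, score, n=8, temperature=1.0):
        # Sample n candidates at nonzero temperature; temperature is the
        # only source of diversity here.
        candidates = [generate(prompt, temperature=temperature) for _ in range(n)]
        return max(candidates, key=score)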
There are multiple alternate approaches to sampling different responses from the model that come to mind, such as:
1) "Tree of thoughts" - generate a partial response (e.g. one token, or one reasoning step), then generate branching continuations of each of those, etc, etc. Compute would go up exponentially according to number of chained steps, unless heavy pruning is done similar to how it is done for MCTS.
2) Separate response planning/brainstorming from response generation by first using a "tree of thoughts"-like process just to generate some shallow (e.g. depth < 3) alternate approaches, then use each of those approaches as additional context to generate one or more actual responses (to then evaluate and choose from), as in the second sketch after this list. Hopefully this would result in some high-level variety of response without the cost of just generating a bunch of responses and hoping that they are usefully diverse.
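A rough sketch of option 1, with beam-style pruning so compute grows roughly with depth * beam * branch rather than branch ** depth (propose() and score() are hypothetical stand-ins for "continue this partial response" and a partial-response judge):

    def tree_of_thoughts(prompt, propose, score, depth=3, branch=3, beam=4):
        frontier = [""]                          # start from an empty partial response
        for _ in range(depth):
            children = []
            for partial in frontier:
                children += propose(prompt, partial, n=branch)       # branch each node
            frontier = sorted(children, key=score, reverse=True)[:beam]  # keep only the best few
        return max(frontier, key=score)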
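And a sketch of option 2, where a shallow brainstorm feeds full response generation (brainstorm(), generate(), and score() are again hypothetical stand-ins):

    def plan_then_generate(prompt, brainstorm, generate, score, n_plans=4):
        plans = brainstorm(prompt, n=n_plans, max_depth=2)      # short, high-level outlines only
        responses = [generate(prompt, plan=p) for p in plans]   # one full answer per outline
        return max(responses, key=score)                        # evaluate and choose

The diversity here comes from the explicitly different plans, rather than from hoping temperature alone produces usefully different full responses.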
Mixture of Experts isn't using multiple models with different specialties; it's more of a sparsity technique, where you massively increase the number of parameters but use only a subset of the weights in each forward pass.