Agents are probabilistic systems. A common way to get a reliable answer out of a system with variable output is to run it several times (ideally in separate, isolated instances) and then have something vote on the best result or take the most common one. Rockets and aviation do this: multiple redundant systems each compute an answer and an orchestrator picks the result.
I've tried doing something similar with AI by running a prompt several times and then having an agent pick the best response. It works fairly well, but it burns a lot of tokens.
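For the "most common result" variant, a minimal sketch of what I mean (`call_llm` is a hypothetical stand-in for whatever client you're actually using):

```python
from collections import Counter

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever client you use
    (OpenAI, Anthropic, a local model, ...)."""
    raise NotImplementedError

def majority_vote(prompt: str, n: int = 5) -> str:
    # Run the same prompt n times in independent calls (no shared
    # context), then return the most common answer.
    answers = [call_llm(prompt).strip() for _ in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```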
An LLM's "wrong" decision is either systematic or biased. LLMs learn "common sense" from human input (i.e. shared datasets, reinforcement learning), so their errors are correlated rather than independent. If a decision is flat-out wrong for you, asking 10 LLMs is unlikely to help.
But then, if an agent picks the best response, how do you know that choice is reliable?
You could get the agents to output something structured and then use a deterministic test if you're worried about that.
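Rough sketch, assuming you ask each agent to reply in JSON; `deterministic_check` stands in for whatever property you can verify without another model (tests passing, a schema, a range check):

```python
import json

def deterministic_check(candidate: dict) -> bool:
    # Hypothetical example: the answer must be a non-negative integer.
    # In practice this is whatever you can verify mechanically.
    return isinstance(candidate.get("answer"), int) and candidate["answer"] >= 0

def first_valid(raw_outputs: list[str]) -> dict | None:
    # Keep the first agent output that both parses as JSON and
    # passes the deterministic test; no judge model involved.
    for raw in raw_outputs:
        try:
            candidate = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if deterministic_check(candidate):
            return candidate
    return None
```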
Obviously you have multiple agents justify why they picked a certain response and then create another agent that picks the solution with the best justification.
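Something like this, reusing the hypothetical `call_llm` from above (the judge prompt is illustrative, not a recipe):

```python
def pick_by_justification(prompt: str, n: int = 3) -> str:
    # Each agent answers and justifies itself; a separate judge
    # agent then picks the answer with the strongest justification.
    candidates = [
        call_llm(prompt + "\n\nGive your answer, then justify it.")
        for _ in range(n)
    ]
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    verdict = call_llm(
        "Candidate answers with justifications:\n\n" + numbered +
        "\n\nReply with only the number of the best-justified answer."
    )
    return candidates[int(verdict.strip().strip("[]"))]
```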
touché