This is a very interesting area of research. I did something similar a couple of years ago using logic and probabilistic logic inference engines to make sure conclusions followed from premises.

I also used agents to synthesize, formalize, and criticize domain knowledge. Obviously, it is not a silver bullet, but it does ensure some degree of correctness.
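
The core check is simple: a conclusion follows from the premises iff the premises together with the negation of the conclusion are unsatisfiable. A toy sketch with Z3 (the solver choice and the propositions are purely illustrative, not what my system actually used):

    # pip install z3-solver
    # Entailment check: premises AND NOT(conclusion) must be unsatisfiable.
    from z3 import Bool, Implies, And, Not, Solver, unsat

    rains, wet = Bool("rains"), Bool("wet")
    premises = And(Implies(rains, wet), rains)   # "if it rains, the ground is wet"; "it rains"
    conclusion = wet                             # claim extracted from a model's answer

    s = Solver()
    s.add(premises, Not(conclusion))
    print("conclusion follows" if s.check() == unsat else "not entailed")

The probabilistic side works the same way in spirit, just with weights on the premises instead of hard constraints.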

I think introducing some degree of symbolic reasoning and agents-as-a-judge is a promising way ahead; see e.g. https://arxiv.org/abs/2410.10934

Yep! I have read your work! Pretty cool! I also worked on a deep research agent for autoformalization this summer at AWS ARChecks, building on similar patterns.

Although that work is not public, you can play with the generally available product here!

[1] https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...

Agent/LLM-as-a-judge is biased and only good for bootstrapping. As capabilities get better, LLM-as-a-judge will artificially cap your performance; you need to graduate to either expert human judges or deterministic oracles.
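
To make "deterministic oracle" concrete: anything that can pass or fail an output mechanically, with no model in the loop. A trivial example (the schema here is purely illustrative):

    # Deterministic oracle: validate structure and hard constraints, no judge involved.
    import json

    def oracle(raw_output: str) -> bool:
        try:
            record = json.loads(raw_output)
        except json.JSONDecodeError:
            return False
        return (
            isinstance(record, dict)
            and isinstance(record.get("amount"), (int, float))
            and record["amount"] >= 0
            and record.get("currency") in {"USD", "EUR", "GBP"}
        )

    print(oracle('{"amount": 12.5, "currency": "USD"}'))      # True
    print(oracle('{"amount": "twelve", "currency": "USD"}'))  # False

Unit tests, type checkers, theorem provers, and exact-match evaluations all fall in the same bucket: they don't saturate the way a model-graded score does.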

LLMs display a form of abductive reasoning, which is not the same as judgement. The only thing in the universe we know of that can exercise judgement is a human. However, many tasks we presume to require human judgement do not, and abductive reasoning will perform as well as a human on them. Used right, this acts as a filter, reducing the tasks that require human judgement to those that can't be automated with similar or better precision and recall. The trick, then, is using LLMs and other techniques to shrink the problem space down to the kernel of quandary that genuinely requires human judgement, and to isolate the salient information so the cognitive load is as low as possible. Many mundane tasks can be automated this way, and many complex tasks can be facilitated so as to greatly magnify the effectiveness of the human-in-the-middle's time.
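
Concretely, that routing can be as simple as a confidence gate; a toy sketch (the llm_score function and the threshold are placeholders for whatever scorer and calibration you actually use):

    # Triage: let the model resolve easy cases, hand only the ambiguous kernel
    # (plus its salient context) to a human reviewer.
    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class Decision:
        route: str              # "auto" or "human"
        answer: Optional[str]   # filled in only when auto-resolved
        salient: str            # condensed context a reviewer would see

    def triage(task: str,
               llm_score: Callable[[str], tuple],   # -> (answer, confidence, salient)
               threshold: float = 0.9) -> Decision:
        answer, confidence, salient = llm_score(task)
        if confidence >= threshold:
            return Decision("auto", answer, salient)   # automated path handles it
        return Decision("human", None, salient)        # only the hard kernel reaches a person

    # decisions = [triage(t, llm_score) for t in tasks]
    # human_queue = [d for d in decisions if d.route == "human"]

The threshold is where the precision/recall comparison comes in: you only lower it (automating more) once you've measured that the automated path matches or beats the human on the cases it keeps.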

Why does this have to be true? For example, if the LLM doing the judging is different from the one being judged, then their biases could at least be different. Also, as their reasoning abilities improve, wouldn't LLM judges approach the abilities of human judges?

LLM judges have positional, response-length, and hedge-word biases (and that's just what's rigorously demonstrated in papers) that wash out the differences between high-performing answers as you approach the limit of your objective. Imagine trying to optimize a function when the measurement function emits biased random noise: at some point you can no longer accurately identify the impact of your changes.
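
To put numbers on it, here's a toy simulation (all values invented) of a judge comparing two answers whose true quality differs by one point, with a two-point position bias and some noise:

    # Toy model of a biased, noisy judge washing out a real quality difference.
    import random
    random.seed(0)

    TRUE_A, TRUE_B = 90.0, 91.0   # B is genuinely better by 1 point
    POSITION_BIAS = 2.0           # judge favors whichever answer is shown first
    NOISE = 3.0

    def judge(true_quality, shown_first):
        bias = POSITION_BIAS if shown_first else 0.0
        return true_quality + bias + random.gauss(0, NOISE)

    # A is always presented first; count how often the judge prefers B.
    wins_for_b = sum(
        judge(TRUE_B, shown_first=False) > judge(TRUE_A, shown_first=True)
        for _ in range(10_000)
    )
    print(f"judge prefers the truly better answer {wins_for_b / 10_000:.0%} of the time")

With these numbers the judge picks the genuinely better answer well under half the time: the position bias plus noise completely swamps the real one-point gap, which is exactly the regime you hit once all your candidates are high-performing.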

Indeed, human judges suck on average. And you can prompt an LLM judge to look for particular kinds of problems, then throw an ensemble of such judges at an output to nitpick. (Essentially, bake in a diversity of biases through a collection of prompts.)
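
For example, something like this (call_llm stands in for whatever completion client you use; the prompts are just examples of failure modes worth targeting):

    # Ensemble of nitpicking judges: each prompt bakes in a different bias on purpose.
    from typing import Callable

    JUDGE_PROMPTS = {
        "citations": "List any claims in the answer that lack a supporting citation.",
        "arithmetic": "Recheck every numeric calculation in the answer and flag mistakes.",
        "overclaiming": "Flag confident wording that is not backed by the evidence given.",
    }

    def ensemble_review(answer: str, call_llm: Callable[[str], str]) -> dict:
        """Run each specialized judge over the same output and pool their findings."""
        return {
            name: call_llm(f"{instruction}\n\nAnswer under review:\n{answer}")
            for name, instruction in JUDGE_PROMPTS.items()
        }

Where the judges disagree is usually where a human should look first.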