LLM-as-a-judge setups for agents usually suffer from context overload: even with a really good evaluation prompt, the LLM hallucinates because there is just too much information to ingest. So we created an agentic pipeline that runs the evaluation against rubrics, which gives better results and doesn't miss intricacies the way a single overloaded context does.
I'm reading this as: the difference is that it's an agent-as-a-judge rather than an LLM-as-a-judge, paired with more structured judging parameters. Is that right? Is the agent just a loop over each criterion, or is it also reflecting on its own judgments in some way?
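
To make the question concrete, here's roughly what I mean by "just a loop over each criterion". This is only a sketch of my guess, not your pipeline: every name in it (`Criterion`, `select_context`, `judge_rubric`, `stub_judge`) is made up, the keyword filter is a stand-in for whatever narrows the context, and `stub_judge` is a placeholder where a model call would go.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    name: str
    question: str
    keywords: tuple  # hypothetical: used to pick the relevant transcript steps


@dataclass
class Verdict:
    criterion: str
    passed: bool
    evidence: list


def select_context(transcript: list, criterion: Criterion) -> list:
    """Keep only the steps that mention one of the criterion's keywords."""
    return [
        step for step in transcript
        if any(keyword in step.lower() for keyword in criterion.keywords)
    ]


def judge_rubric(
    transcript: list,
    rubric: list,
    judge: Callable[[str, list], bool],
) -> list:
    """Judge each criterion on its own, with a narrowed context per criterion."""
    verdicts = []
    for criterion in rubric:
        evidence = select_context(transcript, criterion)
        passed = judge(criterion.question, evidence)
        verdicts.append(Verdict(criterion.name, passed, evidence))
    return verdicts


def stub_judge(question: str, evidence: list) -> bool:
    """Placeholder for a model call: passes whenever any evidence was selected."""
    return bool(evidence)


if __name__ == "__main__":
    transcript = [
        "Called search tool for flights",
        "Booked hotel",
        "Replied to user politely",
    ]
    rubric = [
        Criterion("tool_use", "Did the agent call a tool?", ("tool",)),
        Criterion("refund", "Did the agent handle the refund?", ("refund",)),
    ]
    for verdict in judge_rubric(transcript, rubric, stub_judge):
        print(verdict.criterion, verdict.passed, verdict.evidence)
```

If that's the whole story, the gain would come from each judgment seeing only a small slice of the trace. If there's also a reflection step, I'd expect a second pass that takes the `Verdict` list back in and revises it, which this sketch doesn't do.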