I've tried Grok, Gemini, and ChatGPT. There have been two times now where Gemini and ChatGPT confidently gave me an incorrect answer whereas Grok was correct. I'm now paying for Grok Lite, or whatever the $10 plan is called.
The first question was around setting up timers for a Fox ESS battery in Home Assistant and disconnecting Fox ESS from the cloud. The second was around cornering speed in Sunnypilot and Frogpilot.
Somewhat niche, but if an AI is confidently telling you something wrong, it's hard to work with.
>if an AI is confidently telling you something wrong it's hard to work with.
But they all do that. It just comes with the territory. Grok will absolutely do the same thing another time you try it.
> Grok will absolutely do the same thing another time you try it.
True; it just hasn't happened yet. It will at some point, though. With the Sunnypilot example, Grok outright told me it isn't possible on that fork, which I appreciated. The others all seem to hallucinate some setting.
It is really, really genuinely concerning how many people think there are profound measurable differences between these things.
Like yeah tonally I guess there are. But with regard to references and information? You’re literally just using three different slot machines and claiming one is hot.
I suppose though I shouldn’t be that surprised then since Vegas and every other casino on Earth has been built on duping people in that exact way.
> You’re literally just using three different slot machines and claiming one is hot.
It's a fair point. I haven't tested many queries across all of them and checked their answers, but if I want to ask one of them a question, right now it's Grok, just because I trust its answers more.
It's not a methodology problem, it's a testability problem. LLMs are not deterministic: ask the same LLM the same question five times and you'll likely get at least three different answers.
Again. Slot machine.
You can meaningfully test whether one slot machine hits the jackpot more often than another; it's just that the methodology has to involve a large number of repeats rather than a few anecdotes. Some LLM leaderboard sites do exactly this with blind comparisons.
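To make that concrete, here's a rough sketch (Python with scipy, and entirely made-up counts) of what "a large number of repeats" looks like in practice: grade each model's answers over the same prompt set, then check whether the gap in correct-answer rates is bigger than chance.

    # Hypothetical counts: grade each model's answers over many repeated runs
    # of the same prompts, ideally blind so the grader doesn't know which model answered.
    from scipy.stats import fisher_exact

    a_correct, a_total = 41, 50   # "model A" (made-up numbers)
    b_correct, b_total = 33, 50   # "model B" (made-up numbers)

    # 2x2 table of correct vs. incorrect answers per model
    table = [
        [a_correct, a_total - a_correct],
        [b_correct, b_total - b_correct],
    ]

    # Fisher's exact test: is the difference in accuracy likely to be noise?
    _, p_value = fisher_exact(table, alternative="two-sided")
    print(f"p-value: {p_value:.3f}")

With a few dozen graded answers per model this gives you an actual signal; with one or two anecdotes per model it tells you nothing, which is the whole point.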
humans make poor scientists. most people have already made a decision before they run any tests.
the smartest among them just make the tests complicated and biased; the less intelligent just cherry pick.
of course, would you really expect anyone to do real research in this economy?