This is software development, not sales. We rely on our tooling.

If I’m using a calculator to verify my math, I don’t want to use a second calculator to verify the first one.

I am sorry to be the one to tell you, but you could never trust LLMs to solve all your problems 100% of the time.

It was always random. This is no different from any other randomness that already exists in LLMs.

If you are concerned, just run benchmarks and see whether it is still valuable for your use case.
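
By "benchmarks" I mean something roughly like the sketch below: run each prompt several times, because the output is random, and look at the pass rate instead of a single answer. Everything here is a made-up stand-in: `make_flaky_model`, `CASES`, and the exact-match scoring are assumptions for illustration, and you would swap in your own model call and your own notion of a correct answer.

```python
import random
from typing import Callable

# Hypothetical test cases: (prompt, expected answer). Replace with your own.
CASES = [("2+2", "4"), ("3*3", "9")]


def make_flaky_model(accuracy: float, seed: int = 0) -> Callable[[str], str]:
    """Stand-in for a real LLM call: answers correctly with probability `accuracy`."""
    rng = random.Random(seed)
    answers = dict(CASES)

    def model(prompt: str) -> str:
        # rng.random() is in [0, 1), so accuracy=1.0 is always right, 0.0 never.
        return answers[prompt] if rng.random() < accuracy else "unsure"

    return model


def pass_rate(model: Callable[[str], str],
              cases: list[tuple[str, str]],
              trials: int = 20) -> float:
    """Fraction of all (case, trial) runs whose output exactly matches the expected answer."""
    if not cases or trials < 1:
        raise ValueError("need at least one case and one trial")
    passed = 0
    for prompt, expected in cases:
        for _ in range(trials):
            if model(prompt).strip() == expected:
                passed += 1
    return passed / (len(cases) * trials)
```

Calling `pass_rate(make_flaky_model(0.9), CASES)` gives you one number to compare against whatever threshold makes the tool worth it for you. Exact string matching is the crudest possible scorer, so treat it as a placeholder.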