Yep. The paper was written last year with GPT-4o. Things have become a lot better since then with newer models.

E.g., in Table 1 of https://arxiv.org/pdf/2505.20047 we compare performance on text-only vs. SMT-only reasoning. o3-mini does pretty well at mirroring its text reasoning in its SMT output, compared to Gemini Flash 2.0.

An illustration of this can be seen in Figs. 14 and 15 on page 29.
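
To make "mirroring text reasoning in SMT" concrete, here's a toy sketch of my own (not from the paper) using the z3-solver Python bindings. A one-line text argument is encoded as constraints, and entailment is checked the standard way: assert the premises plus the negated conclusion and ask whether that is unsatisfiable.

    from z3 import Bool, Implies, And, Not, Solver, unsat

    # Text argument: "Every premium member gets free shipping. Alice is a
    # premium member. Therefore Alice gets free shipping."
    premium_alice = Bool("premium_alice")      # "Alice is a premium member"
    free_ship_alice = Bool("free_ship_alice")  # "Alice gets free shipping"

    premises = And(
        Implies(premium_alice, free_ship_alice),  # the rule from the text
        premium_alice,                            # the stated fact
    )

    # The conclusion is entailed iff (premises AND NOT conclusion) is UNSAT.
    s = Solver()
    s.add(premises, Not(free_ship_alice))
    print("entailed" if s.check() == unsat else "not entailed")  # -> entailed

The paper's question is essentially whether the model's SMT encoding stays faithful to its own text reasoning at this level.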

In commercially available products like AWS Automated Reasoning Checks, you build a formal model of your domain (e.g. from a PDF policy document), cross-verify it for correctness, and then at answer-generation time check whether the Q/A pairs coming out of the LLM comply with the policy, using a solver that provides guarantees.

This means they can give you a 99%+ soundness guarantee: if the service says a Q/A pair is valid w.r.t. the policy, it is right more than 99% of the time.
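
For intuition, here's a hedged sketch of that answer-time check, again in Python with z3-solver. This is not AWS's actual API or policy format; the rule, variable names, and the Q/A extraction step are all invented for illustration. The point is just that the final compliance check is an ordinary solver query, which is what a formal guarantee can be attached to.

    from z3 import Int, Bool, Implies, Not, Solver, unsat

    # Policy model, built offline from the policy document (toy rule):
    # "Refunds are only allowed within 30 days of purchase."
    days_since_purchase = Int("days_since_purchase")
    refund_allowed = Bool("refund_allowed")
    policy = Implies(days_since_purchase > 30, Not(refund_allowed))

    # Facts and claim extracted from one LLM Q/A pair:
    # Q: "I bought this 45 days ago, can I get a refund?"  A: "Yes."
    facts = days_since_purchase == 45
    claim = refund_allowed

    # The answer is flagged if it is inconsistent with policy + facts.
    s = Solver()
    s.add(policy, facts, claim)
    verdict = "complies" if s.check() != unsat else "violates policy"
    print(verdict)  # -> violates policy (45 > 30, so no refund)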

https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...

Re: "99% of the time" ... this is an ambiguous sample space. Soundness of results clearly depends on the questions being asked. For what set of questions does the 99% guarantee hold?

Who makes the rules?