I've used general purpose LLM AI (e.g. run-of-the-mill Claude, GPT etc) heavily to draft legal documents. The biggest trap is the hallucinated citation. It will easily insert an absolutely authentic sounding quotation from another case that perfectly proves the point you are trying to make, then it'll make up an authentic name for it, e.g. United States v. Shenzhou Electronics Inc or whatever. You can get really comfortable after checking its output a few times and getting no false citations, and then BAM, it'll put three in the next motion it writes.

Any lawyer who isn't using LLMs for research is behind the curve, though. They are unbelievable at finding niche cases you would never have found on your own. Previously it was a lot of exact search term matching, which is inherently useless for a lot of legal research. I need something that can search on vaguer terms, which AI can do incredibly well. Just check the results. The LLMs from LexisNexis/Westlaw are probably better than the general purpose ones.

LLMs make fantastic paralegals. If you're doing any legal work, you should be using one, even if it's just to shoot ideas at. Have it play devil's advocate. My friend always has it play the other party's lawyer to see what all the counter-arguments are going to be.

Just like you would with software development. If you care about what you are creating, CHECK THE OUTPUT.

> The biggest trap is the hallucinated citation. It will easily insert an absolutely authentic sounding quotation from another case that perfectly proves the point you are trying to make, then it'll make up an authentic name for it, e.g. United States v. Shenzhou Electronics Inc or whatever.

Naive question from an outsider: aren't there searchable databases of cases (with complete text) so that citations could be checked automatically, either by the same or an independent agent?

So, all of these cases are public records. The federal level stuff is all available quite openly on the web. The state stuff is a mixed nightmare of fifty different systems at the appellate level (which is the stuff that is usually cited). At trial court level you have (literally) 3000 different systems, most of which are not accessible for LLMs.

But yes, 100% LLMs should be able to check themselves. Another poster below brought up the other issue: you can check the citation and it's 100% correct, but it doesn't legally apply to what you are writing, and/or it doesn't mean what the LLM thinks it means in the limited context it has taken it from.
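
For what it's worth, the mechanical half of that check is small. Here's a rough sketch, assuming you already have some index of real decisions to look names up in. `KNOWN_CASES` is a hypothetical stand-in for a real case law database, and the regex only handles plain "Party v. Party" names (it will also swallow a capitalized word right before the name), so it's an illustration and not a working citator. It only catches invented case names; it says nothing about whether a real case supports the point.

```python
import re

# Hypothetical stand-in for a lookup against a real case law database.
KNOWN_CASES = {
    "marbury v. madison",
    "miranda v. arizona",
}

# Matches simple "Party v. Party" names; real citation formats are far messier.
CASE_NAME = re.compile(
    r"\b((?:[A-Z][\w.&'-]*\s)+v\.\s(?:[A-Z][\w.&'-]*)(?:\s[A-Z][\w.&'-]*)*)"
)


def normalize_name(name: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing punctuation."""
    return " ".join(name.lower().split()).rstrip(".,;")


def find_unverified_cases(draft: str, known: set[str]) -> list[str]:
    """Return case names cited in the draft that are not in the known set."""
    cited = [m.group(1) for m in CASE_NAME.finditer(draft)]
    return [c for c in cited if normalize_name(c) not in known]


draft = (
    "As held in Miranda v. Arizona, the rule applies. "
    "The court in United States v. Shenzhou Electronics Inc reached the same result."
)
unverified = find_unverified_cases(draft, KNOWN_CASES)
print(unverified)  # the made-up case is the only one flagged
```

Anything this flags still needs a human to pull the case, and anything it passes still needs a human to read it.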

It depends on the jurisdiction. I'm based in France and all cases here are now freely available online to people and agents [1], but it's very recent for lower courts. However, I recently had to work on Texas case law and we had to purchase access to a (very expensive [2]) database since most of it wasn't public.

[1] https://www.legifrance.gouv.fr/

[2] https://legal.thomsonreuters.com/en/westlaw/plans-and-pricin...

US in a nutshell

It’s a band-aid solution because the model can get stuck in a refutation loop, where it argues a point by pulling up a contradicting source ad infinitum. The holy grail, which has not yet been reached, is figuring out how to dynamically align the model to be consistent with all the sources in the first place (and this is a problem of provenance rather than model design).

I’ve been doing AI legal research via a case law API with Claude Code for at least a year and I’ve never seen that happen.

>The biggest trap is the hallucinated citation

The "biggest problem" being the one thing that is trivial to verify against concrete databases is a bit convenient, don't you think?

I think it's more likely that it makes mistakes evenly but the one thing that you are able to check with certainty is the only place you discover the errors.

I've had the same experience with programming AI. It is very convenient, but convenient doesn't mean unlikely. The universe appears to have given us a convenient thing here.

Just because the citation exists doesn't mean that what the LLM says it stands for and what it actually stands for are the same.

For testing, I've asked (admittedly last-gen) LLMs to generate legal opinions regarding issues in commercial English civil litigation, and I received back cases where the citation is real, but the area of law (family law) is not relevant as family courts apply a very different set of procedural rules.

(If you squint a bit, they sometimes might be relevant... and could be useful for a particularly creative litigator to make a novel argument on behalf of a very risk tolerant client. But you would very much want to go read those cases and think quite hard about them.)

Right, I know what you mean. If the parties are only breezing over the motion then it looks great and 95% of the time you'll get away with it, even though really it's ethically dubious. And that's a super hard one for a human to catch when reviewing LLM output. Especially because (certainly for me) you tend to get lazier and lazier reviewing the LLM output as they get "smarter."

I'm assuming you've just used some off-the-shelf ones like Claude or GPT? All the lawyers I know are just using those. I'd love to know what Lexis and Westlaw and other companies are serving that might mitigate some of these issues with better custom tuning or a better harness.

I think the paralegal analogy is right, but with one important difference: a human paralegal usually knows when they are unsure, or at least can be trained to flag uncertainty.

Right. And a paralegal can stop, they're not usually sycophantic to a ridiculous level where they are trying to find a solution at all costs, regardless of how much they have to bend things to fit.

ChatGPT regularly hallucinates entire cases out of whole cloth or fabricates an entirely different fact pattern for a given case. Perplexity does much better at citing its sources and providing accurate quotes, at least in my experience.
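
Quote accuracy is at least checkable by machine once you have the opinion text in hand. A minimal sketch, assuming the full text comes from whatever database you trust; the `opinion` string and both function names are made up for illustration. It only tests whether the quoted words appear verbatim after whitespace and quote-mark normalization, not whether they mean what the draft claims.

```python
import re


def normalize_text(text: str) -> str:
    """Lowercase, unify curly quotes, and collapse whitespace for comparison."""
    text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_appears_in(quote: str, opinion_text: str) -> bool:
    """True only if the quoted passage occurs verbatim after normalization."""
    return normalize_text(quote) in normalize_text(opinion_text)


# Hypothetical opinion text; in practice this would come from a case law database.
opinion = """The contract was void for lack of consideration,
and the claim for damages must therefore fail."""

print(quote_appears_in("void for lack  of consideration", opinion))  # real passage
print(quote_appears_in("void for lack of mutual assent", opinion))   # fabricated
```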

Seems like companies such as Thomson Reuters and other legal services providers have an incentive to build an LLM with RAG over the text of legal cases, plus robust hallucination detection on the references.
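
At its simplest, "hallucination detection on references" could just mean the answer may only cite documents that were actually retrieved. A toy sketch under heavy assumptions: `CORPUS`, the `[case:ID]` marker format, and both functions are hypothetical, and the keyword-overlap retrieval stands in for real search. I have no idea how the commercial products actually do it.

```python
import re

# Hypothetical mini-corpus standing in for a real case law index.
CORPUS = {
    "A-101": "landlord duty to repair residential lease habitability",
    "B-202": "employer liability for contractor negligence",
    "C-303": "lease termination notice period residential tenant",
}


def retrieve(query: str, k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval; a real system would use proper search."""
    terms = set(query.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda doc_id: len(terms & set(CORPUS[doc_id].split())),
        reverse=True,
    )
    return ranked[:k]


def ungrounded_citations(answer: str, retrieved: list[str]) -> list[str]:
    """Return cited IDs (written as [case:ID]) that were never retrieved."""
    cited = re.findall(r"\[case:([A-Z]-\d+)\]", answer)
    return [c for c in cited if c not in retrieved]


retrieved = retrieve("residential lease repair duty")
answer = "The landlord must repair [case:A-101], as confirmed in [case:Z-999]."
print(retrieved)
print(ungrounded_citations(answer, retrieved))  # IDs the model cited without a source
```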

A legal professional can be personally liable for not finding the most recent case-law.

The knowledge cut off gap means the models sometimes don't know about the most recent case-law, in a given situation.

I've seen this happen multiple times now. Accountants and legal professionals advising clients based on outdated information assembled through ChatGPT, Claude and Copilot.

Professionals drafting letters and missing recent case-law which handles their exact case. It's unreliable. So it can save you some work, but it can't save you all of the work. And in some cases its mistakes really force you to redo all the work, and more, to be thorough and have confidence in the result.

"The knowledge cut off gap means the models sometimes don't know about the most recent case-law, in a given situation."

But they can perform live websearches or go directly to a DB specified.

You definitely want your AI to search legal databases, and not draw from "memory". This is where AI offerings from Thomson or Lexis could shine, especially in jurisdictions where case law is not freely available online.
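
One cheap sanity check on the "memory" problem: compare decision dates from a live search against the model's training cutoff, so you can see which authorities the model could only have learned about through the search. A sketch with invented data; `MODEL_CUTOFF` is an assumed date and `results` is a hypothetical search response, not any vendor's actual format.

```python
from datetime import date

# Assumed training cutoff for illustration; check the actual model's documentation.
MODEL_CUTOFF = date(2024, 6, 1)

# Hypothetical search results from a live case law database.
results = [
    {"name": "Case A", "decided": date(2023, 3, 14)},
    {"name": "Case B", "decided": date(2025, 1, 9)},
]


def decided_after_cutoff(cases: list[dict], cutoff: date) -> list[str]:
    """Names of cases the model cannot know from its training data alone."""
    return [c["name"] for c in cases if c["decided"] > cutoff]


print(decided_after_cutoff(results, MODEL_CUTOFF))
```

If that list is non-empty and the draft doesn't mention any of those cases, that's a good hint the model answered from memory.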

Or you can just have Claude Code search Westlaw / vLex / CourtListener