As a software engineer I have some intuition for what the risks are of letting agents do some tasks vs others.
I don't have a similar intuition calibrated for what could go wrong when asking AI to draft a legal document. Some things seem harmless, i.e. drafting a will, but I don't really know; our legal system is notoriously rife with footguns.
I've used general purpose LLM AI (e.g. run-of-the-mill Claude, GPT etc) heavily to draft legal documents. The biggest trap is the hallucinated citation. It will easily insert an absolutely authentic sounding quotation from another case that perfectly proves the point you are trying to make, then it'll make up an authentic name for it, e.g. United States v. Shenzhou Electronics Inc or whatever. You can get really comfortable after checking its output a few times and getting no false citations, and then BAM, it'll put three in the next motion it writes.
Any lawyer who isn't using LLMs for research is behind the curve, though. They are unbelievable at finding niche cases you would never have found on your own. Previously it was a lot of exact search term matching, which is inherently useless for a lot of legal research. I need something that can search on vaguer terms, which AI can do incredibly well. Just check the results. I'm sure the LLMs from Lexis Nexis/Westlaw are probably better than the general purpose ones.
LLMs make fantastic paralegals. If you're doing any legal work, you should be using it, even if it's just to shoot ideas at. Have it play devil's advocate. My friend always has it play the other party's lawyer to see what all the counter-arguments are going to be.
Just like you would with software development. If you care about what you are creating, CHECK THE OUTPUT.
> The biggest trap is the hallucinated citation. It will easily insert an absolutely authentic sounding quotation from another case that perfectly proves the point you are trying to make, then it'll make up an authentic name for it, e.g. United States v. Shenzhou Electronics Inc or whatever.
Naive question from an outsider: aren't there searchable databases of cases (with complete text) so that citations could be checked automatically, either by the same or an independent agent?
It depends on the jurisdiction. I'm based in France and all cases here are now freely available online to people and agents [1], but it's very recent for lower courts. However, I recently had to work on Texas case law and we had to purchase access to a (very expensive [2]) database since most of it wasn't public.
[1] https://www.legifrance.gouv.fr/
[2] https://legal.thomsonreuters.com/en/westlaw/plans-and-pricin...
US in a nutshell
It’s a band-aid solution because the model can get stuck in a refutation loop, where it argues a point by pulling up a contradicting source ad infinitum. The holy grail, which has not yet been reached, is figuring out how to dynamically align the model to be consistent with all the sources in the first place (and this is a problem of provenance rather than model design).
I’ve been doing AI legal research via a case law API with Claude Code for at least a year and I’ve never seen that happen.
>The biggest trap is the hallucinated citation
The "biggest problem" being the one thing that is trivial to verify against concrete databases is a bit convenient don't you think?
I think it's more likely that it makes mistakes evenly but the one thing that you are able to check with certainty is the only place you discover the errors.
I've had the same experience with programming AI. It is very convenient, but convenient doesn't mean unlikely. The universe appears to have given us a convenient thing here.
Just because the citation exists doesn't mean that what the LLM says it stands for and what it actually stands for are the same.
For testing, I've asked (admittedly last-gen) LLMs to generate legal opinions regarding issues in commercial English civil litigation, and I received back cases where the citation is real, but the area of law (family law) is not relevant as family courts apply a very different set of procedural rules.
(If you squint a bit, they sometimes might be relevant... and could be useful for a particularly creative litigator to make a novel argument on behalf of a very risk tolerant client. But you would very much want to go read those cases and think quite hard about them.)
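A rough sketch of how that "real citation, wrong area of law" failure could at least be flagged mechanically, assuming each cited case comes with jurisdiction and area-of-law metadata. The `CaseRecord` fields, the function name, and the placeholder case labels are all invented for illustration, and a real database may not expose anything like them. A case passing this check says nothing about whether it actually stands for the proposition cited; it only narrows down which ones to read first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseRecord:
    """Hypothetical metadata for a cited case (field names are assumptions)."""
    citation: str
    jurisdiction: str
    area: str


def flag_for_close_reading(cited, matter_jurisdiction, matter_area):
    """Return (citation, reasons) for cases whose metadata does not match the matter."""
    flagged = []
    for case in cited:
        reasons = []
        if case.jurisdiction != matter_jurisdiction:
            reasons.append("different jurisdiction")
        if case.area != matter_area:
            reasons.append("different area of law")
        if reasons:
            flagged.append((case.citation, reasons))
    return flagged


# Placeholder labels, not real cases.
cited_cases = [
    CaseRecord("Case A", "England and Wales", "commercial"),
    CaseRecord("Case B", "England and Wales", "family"),
]

for citation, reasons in flag_for_close_reading(cited_cases, "England and Wales", "commercial"):
    print(citation, "->", ", ".join(reasons))
```

A flagged case is not necessarily useless (see the creative litigator point above), it just should not go into a filing unread.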
Right, I know what you mean. If the parties are only breezing over the motion then it looks great and 95% of the time you'll get away with it, even though really it's ethically dubious. And that's a super hard one for a human to catch when reviewing LLM output. Especially because (certainly for me) you tend to get lazier and lazier reviewing the LLM output as they get "smarter."
I'm assuming you've just used some off-the-shelf ones like Claude or GPT? All the lawyers I know are just using those. I'd love to know what Lexis and Westlaw and other companies are serving that might mitigate some of these issues with better custom tuning or a better harness.
I think the paralegal analogy is right, but with one important difference: a human paralegal usually knows when they are unsure, or at least can be trained to flag uncertainty
It seems companies like Thomson Reuters and other legal service providers have an incentive to build LLMs with RAG over the text of legal cases and robust hallucination detection on references.
A legal professional can be personally liable for not finding the most recent case-law.
The knowledge cut off gap means the models sometimes don't know about the most recent case-law, in a given situation.
I've seen this happen multiple times now. Accountants and legal professionals advising clients based on outdated information assembled through ChatGPT, Claude and Copilot.
Professionals drafting letters and missing recent case-law which handles their exact case. It's unreliable. So it can save you some work, but it can't save you all of the work. And in some cases its mistakes really force you to redo all the work, and more, to be thorough and have confidence in the result.
"The knowledge cut off gap means the models sometimes don't know about the most recent case-law, in a given situation."
But they can perform live websearches or go directly to a DB specified.
You definitely want your AI to search legal databases, and not draw from "memory". This is where AI offerings from Thomson or Lexis could shine, especially in jurisdictions where case law is not freely available online.
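A minimal sketch of the cutoff problem, assuming you already have results from a live database search with decision dates attached. The cutoff date, the function name, and the placeholder case labels are made up for illustration; they are not any real model's cutoff or any real database's response format. The idea is just to list the decisions the model cannot know from "memory", so a human checks the draft against them.

```python
from datetime import date


def decided_after_cutoff(search_results, model_cutoff):
    """Return (citation, decision_date) pairs decided after the cutoff, newest first.

    search_results: iterable of (citation, decision_date) from a live search.
    model_cutoff: assumed training cutoff date of the model that wrote the draft.
    """
    newer = [(citation, decided) for citation, decided in search_results if decided > model_cutoff]
    return sorted(newer, key=lambda item: item[1], reverse=True)


# Placeholder results and an assumed cutoff, not real data.
results = [
    ("Case A", date(2021, 5, 1)),
    ("Case B", date(2025, 2, 10)),
    ("Case C", date(2024, 11, 3)),
]
assumed_cutoff = date(2024, 6, 1)

for citation, decided in decided_after_cutoff(results, assumed_cutoff):
    print(f"{citation} ({decided.isoformat()}): decided after cutoff, check the draft against it")
```

This only helps if the search itself is complete, which is exactly the part that depends on database access in your jurisdiction.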
Or you can just have Claude Code search Westlaw / vLex / CourtListener
ChatGPT regularly hallucinates entire cases out of whole cloth or fabricates an entirely different fact pattern for a given case. Perplexity does much better at citing its sources and providing accurate quotes, at least in my experience.
I think this is probably true for most skilled professions. AI is best used in the hands of folks already knowledgeable in the skills/professions they are using it for.
I liken it to me googling things as a sysadmin vs. Jane from accounting doing it. The non-tech end user is far more likely to make the problem worse, or install something sketchy from the ad-riddled results, than I am, or than one of my help desk employees is.
I wouldn't trust myself to draft an important legal document using AI without the advice of a lawyer, much like I wouldn't really want to rely on my lawyer to use AI to write code for me.
I find those that are best and make the greatest use are the ones who remain skeptical but also use the tool. The same people who were already nuanced and picky before AI. The same people who already doubted and questioned their own work, and used that suspicion to help prevent them from having over confidence in their own work. If you weren't willing to just "lgtm" with your own code, it's difficult to do that with AI.
(To be clear, I'm not saying perfectionists. Some might call them that because the picky people have higher standards, but a good expert has to also understand that perfection doesn't exist. That's often a driving force in the suspicion! This also tends to cause them to continually improve)
I would agree with this point and as I explained in a comment replying to the GP comment above, that atrophy is far more dangerous in the legal field than it is with code because legal documents do not benefit from the structural safeguards available for code, like automated testing, static typing, static analysis tools, etc. IME with legal LLMs so far, they are easily in that most dangerous valley where they can lull you into a false sense of security while still introducing extremely dangerous mistakes that are frequently difficult to detect without very careful reading.
The danger of those mistakes creeping in also grows exponentially the farther a lawyer strays from their core legal expertise. There are a few statutes I know inside and out, and I can spot LLM analytical errors related to them in a split second, but once I venture out into domains where I am not an expert (but where I am nevertheless reasonably qualified to practice), it becomes much harder to spot drafting mistakes because I have not refreshed my own understanding of the law by reviewing the relevant cases or statutes as I would when drafting the analysis myself from scratch.
> I agree, BUT I also find that it's easy for experts to atrophy quickly. When the AI is right 80/90% of the time it lulls you into over confidence
Thinking the AI is right 80/90% of the time is already a sign of being lulled into overconfidence. The actual percentage is much lower in my experience. I'm willing to grant the AI is "somewhat right" that often but is that really what we settle for?
Am I secretly the only person who ever actually cared about being very accurate? Is AI just an excuse everyone else is using so they can stop pretending? This is so incredibly frustrating.
> If you weren't willing to just "lgtm" with your own code, it's difficult to do that with AI.
If you are willing to do that with your own code you should probably not be trusted to work on software
> I wouldn't really want to rely on my lawyer to use AI to write code for me.
Yet that is exactly what a lot of C-Suiters (many of whom are lawyers), are doing.
Vice versa, there are also a lot of irresponsible programmers doing stupid things with AI. Irresponsible people stay irresponsible; AI just makes them more productive at being irresponsible.
The problem is the low levels have no influence whatsoever. The higher ups force crap down and none ever comes back.
Corporations are DEMANDING legal ai because it is so much more efficient.
Lawyers creating legal stuff via LLMs is OK. Programmers creating software through LLMs is OK.
Mixing them, is, not, in my experience, OK. In the future, I am sure that LLMs will reach the point, where their output will be beyond reproach, but we're not there, yet.
That means that someone who knows the context and content needs to vet the output before sending it on.
> In the future, I am sure that LLMs will reach the point, where their output will be beyond reproach, but we're not there, yet.
I have no doubt that you're right, but will it be because they are close to infallible or because we have let ourselves become lazy and reliant?
My money is on lazy and reliant based on the trends I'm actually seeing
> sysadmin
Another domain where LLMs are very effective at confidently leading people down a messy path. I have a roommate using LLMs to guide him through setting up some ollama stuff in my WSL (I happen to have the half-decent GPU here) and after multiple rounds of the bot trying to get him to do things that were redundant if not in the wrong direction entirely (and vaguely insulting as a matter of course), I had to write "ground truths" along these lines, and probably more as I find them:
[Yes, it did that] [roommate + bot spent 45 minutes on trying to configure their way through NAT when not having to do that is almost the entire point of tailscale. It was just (essentially) like, "You're absolutely right. We have tailscale set up, so we don't need to be able to ssh to that other interface at all. Not troubleshooting that would have saved 45 whole minutes. Oh well, now what?"]
Maybe it's just me, but I'm not inclined to trust the judgment of something that can't keep this kind of thing straight, which I know is to some degree a matter of having all the needed info in the context window. But maybe it would be able to do that if it didn't waste tokens telling me to cd into the same directory that I'm already in every 2 minutes, or chmod .ssh/ again, or (when it really needs to burn some tokens) blow away the .venv and pull a bunch of modules again just to "start clean".
im not so sure
i think devs overestimate their own role and underestimate others
i am seeing lawyers and doctors roll out their own software with AI
but we dont have their training and experience
So a software engineer could diagnose an illness with AI. Even if they happen to be right, that doesn't really prove much about how bad of an idea it could be in a long-tail scenario.
Also worth remembering that LLMs have jagged intelligence. They are probably better software developers than anything. Is there a complement to Gell-Mann Amnesia, where you assume it’s good at other jobs because it’s good at yours?
Did you not read the article? Where does it talk about software development/engineering?
It's like that in engineering, for sure. My background is in aerospace and there are lots of things that a reasonably technically-inclined random can probably do passably. It takes an engineer to know which tasks those are, though.
I would imagine it's similar in law, in that it takes a lawyer or judge to know where the foot guns lie.
Agreed, and it's the same in software. Probably the biggest time-sink right now as a tech lead is people going from idea to fully-fleshed-out PR, and then having to go back to have a discussion of "was this the right thing to do". It causes frustration all around (having to be the one who says "no" much more often, and having someone tell you your finished work isn't valuable).
IME so far (as both a lawyer and a software engineer), LLM error rates when drafting code and legal documents are reasonably comparable, but it's more problematic in the legal context because legal documents do not benefit from many of the structural safeguards available for code. For legal documents, there are no automated tests, no static typing, no test environments, no logging/observability instrumentation, no sandboxing.
The time lag between drafting and "deployment" also makes for much less effective, much more expensive debugging loops. You can deploy your code to prod in seconds, see an error pop up in the logs, and immediately start debugging. But it will take at a minimum days and frequently as long as several years before an error in a contract or a court filing will be detected, and often the error is beyond correction at that point. Thus, the errors are both more difficult to detect and to resolve.
And the consequences of error are often much greater, both because they are not correctable and because a legal error may risk someone's life, liberty, or substantial property. Although that's not categorically the case, obviously bugs in certain safety critical systems can be as bad or even worse than legal mistakes. But in general, most software is lower stakes than most legal writing.
On the flip side, LLMs do seem to do a better job with basic style and structure for legal documents compared to code. Things like following IRAC format, citing assertions of law (although hallucination remains an issue), and writing comprehensible sentences. These would be the equivalents in code to best practices like good comments, cohesion, consistent use of design patterns, test coverage, clear variable names, DRY, etc. Although the better performance on those more qualitative metrics may just be because even the longest legal documents are typically simpler in structure and have fewer lines of text than a large, complex codebase. Or maybe it's because LLMs are trained on natural language text more than on code. Or because natural language is more forgiving than code, in that minor variation in diction or grammar is unlikely to have any significant effect on how the document is interpreted, whereas even single character errors in code can have enormous effects.
There is also one thing I would like to add, and you can correct me if you disagree: coding benefits much more from thorough planning. Now, I exclusively work by first writing a plan that has well-defined steps and goals, which can of course change over time.
It seems to me like it would be more difficult to achieve with legal documents and, in my experience at least, writing a concrete plan has been the decisive factor that makes my AI coding robust (plus all that you mentioned).
This is a very good comment. But notice how even in software engineering there is still disagreement about these structural safeguards.
So yes, we can say the LLM created bad code when it does not compile or fails prewritten tests.
But experts might disagree what good comments, good cohesion, appropriate use of design patterns, appropriate test coverage or clear variable names are.
So what are we supposed to train the LLMs towards? Somebody still has to decide what "good" is.
Hidden gem of a comment, thanks for writing
Well, this is largely the fault of law itself, especially English-style law. Without a legal, parseable code, and with every single tiny municipality (some less than 1 square mile) having its own set of rules and laws, not all published or available, but which citizens are of course expected to abide by, how could we expect AI to do well, rather than some typical TV southern lawyer who knows the judge?
I could not agree more. A simple example: it boggles my mind how every state organizes their statutes in entirely dissimilar ways. I'm not sure there's a need for every state to have slightly different wording for a murder statute in the first place, but even assuming there is, why do they all have to be scattered around in different code sections instead of every state just following some consistent convention like always putting the murder statute at Title V, Section 1.4 (or whatever the case may be, that's just a random invented example).
For murder that's not such a huge deal because the statutes are typically easy to track down and don't really differ all that much substantively, but once you get really into the weeds on something like commercial contracts it can be a huge pain to do cross-jurisdictional research.
And that's just a tiny, super obvious example of how impenetrable statutory law is, which isn't even the really pernicious problem. Case law is infinitely worse. It makes me absolutely furious how difficult legal research still is. The Westlaw/LexisNexis duopoly is a moral crime and wildly destructive to the quality of government in this country. Every single written court opinion should be publicly available for free on the internet in an easily searched format. It would cost practically nothing to achieve. We're talking about less text than Wikipedia hosts. Yet still many states make it almost impossible to access case law. Even though these cases are law. Binding law that we are supposed to follow, yet we cannot even easily access. It's insane, and largely perpetuated by the complacency of lawyers who can charge others for what should be free, the lobbying of the duopoly, and the incompetence of politicians.
If all of the laws were consistently available and stored in reasonable, consistent citation formats (I would settle for hyperlinking as a replacement for the rat's nest of wildly varying jurisdiction-specific citation systems), it would even be possible to introduce a form of unit testing for legal drafting that would allow us to automatically verify if the LLM hallucinated a citation.
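A minimal sketch of what that kind of "unit test" could look like, assuming a consistent public database existed. Here `KNOWN_OPINIONS` is a hypothetical in-memory stand-in for it, the citations and opinion text are invented, and the regex only handles a simplified "volume reporter page" form for a few reporters; it is nowhere near a real citation parser.

```python
import re

# Hypothetical stand-in for a public case-law database: citation -> opinion text.
# Both the citation and the text are invented for illustration.
KNOWN_OPINIONS = {
    "123 F.3d 456": "The court held that the notice period begins on receipt.",
}

# Simplified pattern: volume, one of a few reporter abbreviations, first page.
CITATION_RE = re.compile(r"\b\d{1,4} (?:U\.S\.|F\.[23]d|F\.4th) \d{1,5}\b")


def find_unverified_citations(draft, known):
    """Return citations found in the draft that are absent from the database."""
    cited = [match.group(0) for match in CITATION_RE.finditer(draft)]
    return [citation for citation in cited if citation not in known]


def quote_is_verbatim(citation, quote, known):
    """Check that a quotation appears in the cited opinion, ignoring case and spacing."""
    def normalize(text):
        return " ".join(text.split()).lower()

    opinion = known.get(citation, "")
    return bool(quote.strip()) and normalize(quote) in normalize(opinion)


# Invented draft text; neither case name refers to a real decision.
draft = (
    "See Smith v. Jones, 123 F.3d 456 (notice runs from receipt); "
    "United States v. Shenzhou Electronics Inc., 999 U.S. 111."
)

print(find_unverified_citations(draft, KNOWN_OPINIONS))
print(quote_is_verbatim("123 F.3d 456", "notice period begins on receipt", KNOWN_OPINIONS))
```

Something like this catches the invented case and the invented quotation, which are the cheap failures. It does nothing for a real case cited for a proposition it does not support.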
It also doesn't help that we (for what were at the time very good reasons) moved away from the system of legal writs that used to provide fairly standardized, almost "cut and paste" templates for legal filings. So now every legal document (filings, memos, contracts, court opinions, statutes) is drafted like a bespoke, artisanal creation with few strict structural or stylistic conventions. That makes automated interpretation much harder than it needs to be.
> Some things seem harmless, i.e. drafting a will
Absolutely not harmless if you're the executor of an estate forced to deal with a screwed up AI will. I just handled my dad's estate this spring. It's a frustrating and confusing process even with the simplest of estates.
I recently had to file to become an estate admin with no will at all. And it was literally cheaper for me to fly 3000 miles to do it in person than it was to pay a lawyer. Because lawyers are frankly greedy scumbags half the time. They don't offer an appropriate cost for the service. Instead the conversation immediately goes to "how much" money is in the accounts and suddenly they want a percentage of your father's estate for filing two pieces of paper.
And in my experience if you do actually pay a lawyer for something they will act like you're not worth their time and will literally roll their eyes at you when you're trying to explain the minor details of a case because they are too lazy to listen and zone in like I would when doing my job.
Most people don't have anything that could even be called an "estate".
Judging from reported figures, roughly 80-90% of households in the US [1] have a household net worth of at least $0. That means that most people do in fact have an estate.
Median household net worth is in fact somewhere in the $100k-200k range, which is definitely something that could be meaningfully called an "estate." (Most of this tends to be the house, the median net equity in which is about $190k as of 2022).
Source: https://www2.census.gov/library/publications/2024/demo/p70br...
[1] This doesn't mean "homeowners," rather it's a recognition that assets for married or cohabitating couples are usually commingled.
It’s just the legal term. If you have a relative die with a bit of stuff and an ancient car, they have an estate and someone needs to execute it even if the total value is less than most lawyers care about.
Everyone has an estate. Only thing is that you have to die first.
Ummm, not quite.
An "estate" is a legal term for property, assets, and liabilities a person leaves behind upon their death. A family member is a top practitioner in the field of estate planning and resolution, and some of the messiest estates they have handled are pro-bono cases of exactly the type of people you would put in italicized "most people": poor, not really able to upkeep a house they inherited from a relative which hadn't had title properly transferred on a previous death because they didn't have money for an attny, now can't get a loan to fix the roof...
Yeah, if you are homeless, carless, and have only the clothes on your back and a shopping cart of stuff, you don't have an estate. Everyone in the middle class in the US has an estate. Much of the time it passes automatically to their spouse on death, but it's still an estate.
And if you are concerned about where it goes, get a GOOD attny. There are many bad ones hanging out their shingle as "Trust & Estate" attnys, and some of the next messiest cases are fixing problems made by those not-so-good attnys.
And NO, AI is not good enough.
I wouldn't consider drafting a will to be harmless. If it's done poorly the next of kin could have to deal with a huge headache and potentially months or years of probate proceedings.
I had a very well crafted will from my parents, one of whom was a very good lawyer hiring other good lawyers. It was still a pain in the ass for many of the reasons they were trying to make it easy for us.
One thing I learned, just bite the bullet and re-write the whole fucking will instead of making riders.
Piecing the will together from riders was terrible. All the clauses fell away as everyone got older. The final will could have been 8 pretty clear pages.
The other part that is hard is just knowing all of the things that happen with assets and a passing. Luckily we had another lawyer and financial folks to advise us. It was still a lot and not that easy to find details. This was pre-AI, which would have helped walk through this shit.
I would think that LLMs would be better at avoiding foot-guns. That’s a situation where you have a list of well known rules and potential pitfalls, and the work of the lawyer is to apply those to a fact pattern. That’s something that has been hard to automate programmatically, because the fact patterns are similar but different. LLMs, however, seem to excel at applying general principles to differing fact patterns.
Instead, the LLMs create entirely new foot guns like citing non-existent cases. You can't go more than a week without encountering another news report of a lawyer submitting an AI-generated legal brief rife with bogus case citations, which even includes briefs submitted to state supreme courts.
e.g., https://www.npr.org/2026/04/03/nx-s1-5761454/penalties-stack...
I would categorize this in the "expertise that people internalize but never figure out how to verbalize" department, and that is a department we have no way to teach an LLM because if nobody is writing out those unspoken, subconscious rules then the LLM has nothing to read about them in its training data.
This is often called tacit knowledge. https://en.wikipedia.org/wiki/Tacit_knowledge
My favorite example of this is knowing how to untangle a big pile of cables. There are robots now which can untie a single knotted cable, but I don't think any can do a pile of cables yet. https://www.youtube.com/watch?v=vp-94rsherE
> and that is a department we have no way to teach an LLM because if nobody is writing out those unspoken, subconscious rules then the LLM has nothing to read about them in its training data.
I think on the contrary, LLM providers accumulate huge logs of interaction with their users, which elicit that tacit knowledge and mine it and humans cooperate willingly in order to solve their tasks. Just imagine the corpus of sessions for scientific research, education or software development, it is probably the largest such collection ever to exist. Trillions of HITL tokens per day flow into those logs, carrying our perspectives, choices, original ideas and tacit knowledge. I call this the "human-AI experience flywheel". It's the new stackoverflow, next model generation is based on interaction data from previous one.
Good point. Same probably applies to code as well: coders must tell us why they wrote the code the way they did. And if they have comments in their code, those are highly untrustworthy because nobody fixes comments if the code works.
I don't know the source off hand, but I've seen LLMs hallucinating case citations in order to "prove" their premises.
can't get more foot gun than "well according to [fiction] it is a well established practice (that the defendant is guilty)"
But can an LLM come up with questions like what the definition of is is? Seems to me there's a lot of "depends on how you read it" type of stuff where lawyers excel at finding novel interpretations. So what coders think of as rules are much less straightforward to understand when it comes to laws
I think that’s a different task than the one OP is referring to. To your example, I’m not familiar with the capability of LLMs in that regard. I have struggled with using the AI features of Westlaw when it comes to that sort of argument. (Basically, making an argument that strays from the typical route, because that’s the position you happen to find yourself representing.)
I'd only be guessing, but I'd imagine that trying to simulate being a lawyer for someone trying to do something shady would really push an LLM. Imagine being a lawyer for Trump. Could it ever come up with the arguments that his lawyers have? God help us all if they do
As someone who's been sued frivolously...
Believe it or not...
A lot can go wrong if you have real life human lawyers draft a legal document.
I think that's actually a perfect analogy to AI writing code. Drafting a will seems like not a big deal, until that will is accepted as "good enough" and is then in court and under fire.
> drafting a will
Such a document may not make a difference to the person that eventually will have died, but it can make or break the life of generations to come in countries that are so heavily optimized for dynasty building like the US.
I think that's the right intuition. Legal AI feels especially dangerous because the output can look competent while hiding jurisdiction-specific footguns
This is why I can’t see how college grads are going to survive the AI apocalypse. domain experts driving LLMs are super powerful because they can spot where they make mistakes. Juniors don’t have that insight and the LLMs then cost them productivity.
> domain experts driving LLMs are super powerful because they can spot where they make mistakes
I don’t know if that’ll be true for long. I just had my colleague who’s a very competent engineer IMO hand me a frontier model vibed PR to review (after reviewing it himself, he claims) which contained random variable assignments, conditionals that do nothing, etc. He’d never do such a thing before. People become too comfortable and get confirmation bias as well.
There will still need to be a lawyer in the loop to review and stamp and take accountability.
However, the good news is that a whole bunch of lawyer positions in drafting docs and research will be able to be eliminated due to AI.
I'm afraid: since Claude cheats on benchmarks, what will it do with law?
Hmm, what’s the law equivalent of using docker to bypass sudo?
can you make really convincing but flawed arguments that are historically able to win despite competent opposition?
Cheat.
Or worse, use historical data to determine the laws of today.
The same in every other domain. It’s happening now, not in a future tense
> drafting a will
Tell me you've never been the executor of an estate in the United States without telling me.
I think going through this process has made me uniquely qualified to write one.
there’s really no limit to how many times and ways you can review something with AI, except dollars.
cannot IMAGINE letting ai write my will rn.
I imagine it's really hard to spot a comma in the wrong place, or a missing sentence in a 10 page contract unless you wrote it yourself, or you assembled it from some battle tested templates.
To give you an example of what can happen if you use AI in a legal battle, you can look at the Valve v. Rothschild case [1].
TL;DR: It's never a good idea and it will bite you.
1. https://finance.yahoo.com/news/valve-wins-trial-against-pate...