It's more complicated than that. Quite a bit more.
Commercial use counts _against_ a fair use defense, but is not dispositive: it's not accurate at all to say it "generally does not cover" commercial use. This is the "purpose and character" test, one of four in contemporary (United States) fair use doctrine.
Purpose and character also includes the degree to which a use is _transformative_. It's clear that the degree to which a training run mulching texts "transforms" them is very high. This counts toward a fair use finding for purpose and character.
> is dependent on the amount of the original content present in the derived work, which I would contend in this case is “all of it”
The "amount and substantiality" test. Your case for "all of it" can't possibly be sustained: the models aren't big enough. It's amount _and_ substantiality: this has come up in the publication of concordances, where a relatively large amount of a copyrighted work appears, but it's chopped up and ordered in a way which is no longer substantially the same. Courts have ruled that this kind of text is fair use, pretty consistently. It's not an LLM, of course, but those have yet to be ruled on.
Also worth knowing that courts have never accepted reading or studying a work as incorporation, and are unlikely to change course on the question. It's taken for granted that anyone is allowed to read a copyrighted work in as much detail as they wish, in the course of producing another one. Model training isn't reading either, but the question is to what degree it resembles study. I'd say, more than not.
Specifically:
> it’s impossible to make a useful model without the whole book and all of the artistry that went into it
Courts have never once accepted "it would be impossible for defendant to write his biography without reading plaintiff's" as valid, and it's been tried. The standard for plagiarism is higher than that.
"Effect upon the work's value" is probably the most interesting one. For some works the effect is extreme, for others negligible. I suspect this is the one courts are going to spend the most time on as all of these questions are litigated.
Ultimately, model training is highly out-of-distribution for the common law questions involving fair use. It was not anticipated by statute, to put it mildly. The best solution to that kind of dilemma is more statute, and we'll probably see that, but I don't think you'll be happy with the result, given what I'm replying to. Just a guess on my part.
It is of course true that it is unsettled law, and that fair use is more complicated than my offhand comment suggested.
> Courts have never once accepted "it would be impossible for defendant to write his biography without reading plaintiff's" as valid, and it's been tried. The standard for plagiarism is higher than that.
This I think misses the thrust of my argument, though. It's hard to find an exact human analogy, because neither the technology nor the scale at which it operates is remotely human.
I see it less as “writing his biography without reading the plaintiff’s” and more as “using the same style and metaphors to make thousands of copies of very similar biographies, with certain bits tweaked,” like turning an existing work into a Mad Lib.
I don’t know how the courts will eventually rule on it, but it certainly feels like theft to me.
It's fascinating how intuitions differ. To me, it doesn't feel like theft at all. For one thing, theft is depriving another of something, and has therefore never been a good metaphor for infringement; hackers used to be the most insistent about this principle, and it's weird to see a doctrine which was cooked up in a literal AI lab get thrown out the window for literal AI.
But pretending you said "infringement", for me it comes all the way back to the Constitution: "To promote the Progress of Science and useful Arts". I cannot possibly twist the development of large language models into something which violates the spirit of that purpose. I don't see how anyone can.
Your point about the scale is valid, and the alienness of it, sure. But you haven't made the case that the vastness of the scale should affect the conclusion.
Something I left out in the first post is that copyright is meant to protect expression, and not ideas: this is the deciding factor in the 'nature of the copyrighted work' test for fair use. More expression, more protection: more ideas, less.
I think the visual arts have a strong case that image generators directly infringe expression: I'm not convinced that authors do, and I think software should never have been protected under copyright because the ideas-to-expression ratio is all wrong for the legal structure. There's clearly no scale case to be made for ideas: "but what if it's _all_ the ideas" fails, because the ideas are not protected at all. Nor should they be, that's what patents are for, and why patents are very different from copyright.
LLMs are remarkably good at 'the facts of the matter', hallucination notwithstanding. They're very poor at authorial 'voice transfer', something image generators are far too good at. It's when I start asking myself "well what even _is_ this 'expression' thing anyway?" that I conclude that we're out over our skis on the LLMs-and-IP question: precedent can't tell us enough, and that leaves legislation.