Is it actually possible to determine how much the weights were influenced by each work?
I vaguely recall reading an interpretability paper years ago that trained a special model that could attribute each answer to a part of the corpus (like Wikipedia, arXiv, or "Blogs"), but it had a non-zero effect on performance and wasn't nearly as straightforward as "weights go in, attribution comes out."
It’s very possible to determine similar works that existed earlier, and from that, recover attribution.
The “downside” is you may attribute similar works that weren’t inspirations, just coincidences. But I think that’s an upside: when someone discovers something novel and great but their work fails because of bad luck or non-novel details, and the discovery is finally recognized in a later work, I think they should still be attributed.
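To make that concrete, here is a minimal sketch of what I mean by "find similar earlier works." Everything in it is an assumption for illustration: the `attribute` function, the toy corpus, the bag-of-words cosine similarity, and the 0.3 threshold are all made up, and a real system would need something far better than word counts. It only shows the shape of the idea: compare against works that predate the output and credit the close ones, whether or not they were actually in the training data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased alphanumeric word tokens; deliberately crude.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two word-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def attribute(output, output_year, corpus, threshold=0.3):
    """Return (title, score) for earlier works similar to `output`.

    `corpus` is a list of (title, year, text) tuples. The threshold is an
    arbitrary illustrative value, not a tuned one.
    """
    out_vec = Counter(tokenize(output))
    hits = []
    for title, year, text in corpus:
        if year >= output_year:
            continue  # only works that existed earlier can be credited
        score = cosine(out_vec, Counter(tokenize(text)))
        if score >= threshold:
            hits.append((title, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Hypothetical toy corpus.
corpus = [
    ("quicksort notes", 2009,
     "sort the list by picking a pivot and partitioning around the pivot"),
    ("bread recipe", 2012, "mix flour water yeast and salt then bake"),
    ("later blog", 2025, "picking a pivot and partitioning the list around it"),
]

result = attribute(
    "pick a pivot, then partition the list around the pivot", 2024, corpus
)
print(result)
```

Note that this never looks at the weights at all, which is exactly why it can credit coincidentally similar works.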
>Is it actually possible to determine how much the weights were influenced by each work?
It will be very possible once they become the owners of the intellectual property being infringed. Think about how it was "impossible" to implement DRM on music and movies in the early days of YouTube. Now Google owns the content and the platform, and suddenly their "rolling cipher", which involves no encryption at all, is supposedly enforceable DRM.
The Silicon Valley tech bros play the same game every time. They violate the law, say it's just too darn difficult to obey the law without stifling progress, and then they get away with it until they kill all the competition. At which point, the law is once again applicable to anyone that might try to challenge them.
Remember how Amazon destroyed all the other retailers when they had a decade of collecting no sales tax while brick & mortar had to collect it? "Calculating sales tax for 50 different states?! That's impossible!!!" What a load of shit...
Now, knowing that they're going to run this playbook again, how do you think it's going to play out? We've already seen it. Anthropic steals your copyrighted code, puts together their Claude Code project, the code for that project leaks, but now THEY own it! They sent DMCA takedowns on that AI-generated code. AI-generated code enjoys no copyright protection, so it cannot be DMCAed under the law. But Anthropic claims there is a copyright, GitHub will obey the takedown, and nobody has the money to step up and stop them.
See where this is going? Once they achieve market dominance, they will claim that all the code generated by Claude belongs to Anthropic. Your prompts belong to you, but THEIR machine generated THEIR code and you only purchase a license to it with your tokens. A limited license. It might be revocable, it might expire, maybe you need to pay an annual fee to keep using THEIR code that Claude generated for you. And if you actually just write code on your own, without Claude? Well, prepare to be sued like a network printer is sued by the RIAA, because that's going to happen too. They will have their robot scour your code for "fair use" training and discover that it's just too similar to something their machine generated a year earlier. Sorry, open source programmer, here's your legalese nastygram. It appears you owe Anthropic some money.
I do not defend the current state of things where a select few companies get to shamelessly violate the law with the entire legal framework bending around the weight of the money trapped in this speculative bubble.
I believe LLMs are, at the very least, an under-researched technology or, less charitably, an ongoing effort to strip intellectual workers of their rights and privileges.
What I am saying is that the reasonable demand for attribution runs counter to the nature of these systems as we know them. There is no magical "release the attribution" button Anthropic could press if they wanted to. Unlike per-state taxes, this is a problem actual PhDs are working on, at universities and private labs, because transparency has been the public's number one demand since day one, and yet all that exists after 4 years of funding are the first incomplete steps.
The most likely outcome of imposing this obligation is commercial LLM providers quickly folding, finding a loophole or displaying false attribution, or settling for notably worse performance. That is of course not counting how these companies would be on the hook for a civilizational amount of licensing fees.
(Per the DRM point, I believe we can agree that the goal of simultaneously displaying a piece of media in the physical world and somehow preventing the viewer from storing it is, at least in its strict form, effectively impossible without hiring a trusted guard to hold the viewer at gunpoint if they dare touch the trusted viewing apparatus or pull out their phone.)
I am personally okay with shutting down an industry that cannot legally exist in its current form, especially one so openly hostile to every field of human endeavor. But no matter your position on that, we must keep in mind no "ethical" or "legal" AI industry can exist without making either adjective meaningless.