I'd argue we don't need a 10-star system. The single bit we have now is enough. And the question is also pretty clear: did $company steal other people's work?

The answer is also known. So the reason one would want an open-source model (read: a reproducible model) would be ethics.

We use pop-cultural references to communicate all the time these days. Those references don't necessarily come from only the best-known parts of these works, so an AI would need the full work (or a functional transformation of it) to hit the theoretical maximum of its ability to decode and reason about such references. To exclude copyrighted works from the training set is to expect it to decode, from the outside, what amounts to humanity's own in-group jokes.

That's my formal argument. The less formal one is that copyright protection is something smaller artists deserve more than rich conglomerates do, and even then, durations shouldn't be "eternity and a day". A huge chunk of what is being "stolen" should be in the commons anyway.

"Your honor, if I hadn't robbed that bank I wouldn't have gotten all that money!"

I truthfully cannot think of a single model that satisfies your criteria.

And if we wait for the internet to be wholly eaten by AI, if we accept perfect as the enemy of good, then we'll have nothing left to cling to.

> And the question is also pretty clear: did $company steal other people's work?

Who the hell cares? By the time this is settled - and I'd argue you'll never get a definitive agreement - the internet will already have been won by the hyperscalers.

Accept corporate gifts of AI, and keep pushing them forward. Commoditize. Let there be no moat.

There will be infinite synthetic data available to us in the future anyway. And none of this bickering will have even mattered.