I think you're trying to make it look simpler than it is. Put the amount of data next to every entry in that list of yours.

Most of those items map to a job description.

If you think the data story isn't a complicated beast, then consider:

If you wanted an "open" dataset, would you want it before or after it was processed? There are a lot of cleaning, categorizing, and feature-extraction steps. The data typically undergoes a lot of analysis, extra annotation, bucketing, and transformation along the way.
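
To make that concrete, here's a tiny, hypothetical slice of what "processed" can mean. The function names and thresholds below are invented for illustration, not anyone's actual pipeline:

```python
import hashlib
import re

def clean_document(text: str):
    """One illustrative cleaning pass: normalize a raw document or drop it."""
    text = re.sub(r"\s+", " ", text).strip()          # collapse whitespace
    if len(text) < 200:                               # drop near-empty pages (threshold is arbitrary)
        return None
    if sum(c.isalpha() for c in text) / len(text) < 0.6:
        return None                                   # drop mostly non-text content (markup, junk)
    return text

def dedupe(docs):
    """Exact-hash dedup; real pipelines also do fuzzy near-duplicate detection."""
    seen = set()
    for doc in docs:
        h = hashlib.sha256(doc.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            yield doc
```

And that's two of the easy steps; the annotation, bucketing, and quality-scoring passes each have their own versions of this.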

If the pre-training was done in stages, and the overall training process was complicated, how much hand-holding do you need to replicate that process?
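
As a rough illustration of why "here's the data" isn't enough, imagine a staged schedule like this (every name and number below is made up):

```python
# A hypothetical staged pre-training schedule. The point is that replication
# needs the exact stage order, data mixtures, and token budgets, not just the corpus.
STAGES = [
    {"name": "stage1_broad_web", "tokens": 2_000_000_000_000, "mix": {"web": 0.9, "code": 0.1}},
    {"name": "stage2_curated",   "tokens":   500_000_000_000, "mix": {"books": 0.4, "code": 0.3, "web": 0.3}},
    {"name": "stage3_annealing", "tokens":    50_000_000_000, "mix": {"high_quality": 1.0}},
]

for stage in STAGES:
    print(f'{stage["name"]}: {stage["tokens"]:,} tokens, mixture {stage["mix"]}')
```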

Do you need all of the scripts that assist with these processes? All of the infra and MLOps pieces? There's a lot of infrastructure just to move the data around and poke at it.
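
For a flavor of the "move it around" part, a sketch like this (purely illustrative, not any real team's tooling) gets written and rewritten just so downstream jobs can read the corpus in parallel:

```python
import gzip
import json
from pathlib import Path

def write_shards(docs, out_dir: str, docs_per_shard: int = 100_000):
    """Split a document stream into gzipped JSONL shards for parallel readers."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shard, buf = 0, []
    for doc in docs:
        buf.append(json.dumps({"text": doc}))
        if len(buf) >= docs_per_shard:
            (out / f"shard-{shard:05d}.jsonl.gz").write_bytes(gzip.compress("\n".join(buf).encode()))
            shard, buf = shard + 1, []
    if buf:
        (out / f"shard-{shard:05d}.jsonl.gz").write_bytes(gzip.compress("\n".join(buf).encode()))
```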

Where are you going to host those terabytes or petabytes of data? Who is going to download it? How often? Do you expect it to be downloaded as frequently as the Linux kernel sources?

Did you scrub it of PII? Are you sure?
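
That second question is the hard one. A naive, hypothetical scrub pass looks like this; rules of this kind catch the obvious cases and quietly miss the rest:

```python
import re

# Illustrative only. Real pipelines combine rules, NER models, and manual audits,
# and they still miss things -- which is exactly why "Are you sure?" stings.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

print(scrub("Reach me at jane.doe@example.com or +1 (555) 123-4567."))
```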

And to clarify, we're not even talking about trained models at this point.