They’ll never reveal the data, because that would reveal this is all built on stolen work.
Some of the models DO reveal the data, and it's still built on "stolen work" in that it's unlicensed scrapes of the Web. Here's an example:
https://huggingface.co/allenai/OLMo-2-0325-32B
Here's one of their training mixes: https://huggingface.co/datasets/allenai/dolma3_pool - which includes 8 trillion tokens from Common Crawl.
Some of the models DO reveal the data, and it's still built on "stolen work" in that it's unlicensed scrapes of the Web. Here's an example:
https://huggingface.co/allenai/OLMo-2-0325-32B
Here's one of their training mixes: https://huggingface.co/datasets/allenai/dolma3_pool - which includes 8 trillion tokens from Common Crawl.