> No sign of what source material it was trained on though right?
out of curiosity, does anyone do anything "useful" with that knowledge? it's not like people can just randomly train models..
When you're truly open source, you can build things like this:
Today we introduce OLMoTrace, a one-of-a-kind feature in the Ai2 Playground that lets you trace the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace is a manifestation of Ai2’s commitment to an open ecosystem – open models, open data, and beyond.
https://allenai.org/blog/olmotrace
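Once the full training corpus is published, a feature like this stops being magic. Here's a minimal sketch of the underlying idea only (not Ai2's actual implementation; the n-gram length, corpus, and function names are made up for illustration): index the training text by token spans, then look up verbatim spans of a model's output in that index.

```python
# Toy sketch: index training documents by word n-grams, then flag spans of a
# model's output that appear verbatim in the corpus. OLMoTrace does this at
# multi-trillion-token scale with a real index; this only illustrates the idea.
from collections import defaultdict

N = 8  # span length in tokens; arbitrary choice for this sketch


def ngrams(tokens, n=N):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(training_docs):
    """Map each n-gram to the ids of training docs that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(training_docs):
        for gram in ngrams(text.split()):
            index[gram].add(doc_id)
    return index


def trace_output(model_output, index):
    """Return (span, matching doc ids) for output spans found in the corpus."""
    hits = []
    for gram in ngrams(model_output.split()):
        if gram in index:
            hits.append((" ".join(gram), sorted(index[gram])))
    return hits


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
    index = build_index(corpus)
    output = "he said the quick brown fox jumps over the lazy dog every day"
    for span, docs in trace_output(output, index):
        print(f'matched training docs {docs}: "{span}"')
```

The point is just that none of this is possible without access to the data itself, which is exactly what most labs don't release.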
You could do the same, except you would need to be a pirate website. It would even be better, except illegal. But it would be better.
That is why the others can't provide stuff like this, e.g. RAG / hallucination checks. I just wish the Allen.AI models had a bigger context window; 4k is too small nowadays.
It would be useful for answering "is this novel or was it in the training data?", but that's not typically the point of open source.
If labs provided the corpus and source code for training their tokenizers, it would be a lot easier to produce results about tokenizers. As it is, they provide neither, so it is impossible to compare different algorithms running on the same data if you also want to include the vocabs that are commonly used.
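To make that concrete, here's a small sketch of the kind of comparison that becomes possible once you have the corpus. It assumes local files corpus.txt and heldout.txt (hypothetical names) and uses the Hugging Face tokenizers library to train BPE and Unigram tokenizers on the same data, then compares segmentation and vocab overlap; this is just one way you might set it up, not a standard benchmark.

```python
# Train two tokenizer algorithms on the same corpus and compare them.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

FILES = ["corpus.txt"]        # hypothetical training corpus
VOCAB_SIZE = 8000             # arbitrary size for the comparison

# BPE tokenizer
bpe = Tokenizer(models.BPE(unk_token="[UNK]"))
bpe.pre_tokenizer = pre_tokenizers.Whitespace()
bpe.train(files=FILES, trainer=trainers.BpeTrainer(
    vocab_size=VOCAB_SIZE, special_tokens=["[UNK]"]))

# Unigram tokenizer, trained on exactly the same data
uni = Tokenizer(models.Unigram())
uni.pre_tokenizer = pre_tokenizers.Whitespace()
uni.train(files=FILES, trainer=trainers.UnigramTrainer(
    vocab_size=VOCAB_SIZE, special_tokens=["[UNK]"], unk_token="[UNK]"))

# Compare on held-out text: fewer tokens for the same text = better compression.
sample = open("heldout.txt", encoding="utf-8").read()
for name, tok in [("BPE", bpe), ("Unigram", uni)]:
    print(name, "tokens on held-out text:", len(tok.encode(sample).tokens))

# How much of the learned vocabulary do the two algorithms share?
bpe_vocab, uni_vocab = set(bpe.get_vocab()), set(uni.get_vocab())
print("vocab overlap:", len(bpe_vocab & uni_vocab) / len(bpe_vocab | uni_vocab))
```

Without the original corpus you can't run this kind of controlled comparison against the vocabs labs actually ship, which is the whole complaint.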
Many are speculating it was trained on o1/o3 outputs for some of the initial reasoning data.
Are there any widely used models that publish this? If not, then no I guess.
Depending on how you use "randomly", they absolutely can..?