> No sign of what source material it was trained on though right?
out of curiosity, does anyone do anything "useful" with that knowledge? it's not like people can just randomly train models..
When you're truly open source, you can build things like this:
Today we introduce OLMoTrace, a one-of-a-kind feature in the Ai2 Playground that lets you trace the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace is a manifestation of Ai2’s commitment to an open ecosystem – open models, open data, and beyond.
https://allenai.org/blog/olmotrace
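Once the full training corpus is published, a feature like this stops being magic. Here's a minimal sketch of the underlying idea only (not Ai2's actual implementation; the n-gram length, corpus, and function names are made up for illustration): index the training text by token spans, then look up verbatim spans of a model's output in that index.

```python
# Toy sketch: index training documents by word n-grams, then flag spans of a
# model's output that appear verbatim in the corpus. OLMoTrace does this at
# multi-trillion-token scale with a real index; this only illustrates the idea.
from collections import defaultdict

N = 8  # span length in tokens; arbitrary choice for this sketch


def ngrams(tokens, n=N):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(training_docs):
    """Map each n-gram to the ids of training docs that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(training_docs):
        for gram in ngrams(text.split()):
            index[gram].add(doc_id)
    return index


def trace_output(model_output, index):
    """Return (span, matching doc ids) for output spans found in the corpus."""
    hits = []
    for gram in ngrams(model_output.split()):
        if gram in index:
            hits.append((" ".join(gram), sorted(index[gram])))
    return hits


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
    index = build_index(corpus)
    output = "he said the quick brown fox jumps over the lazy dog every day"
    for span, docs in trace_output(output, index):
        print(f'matched training docs {docs}: "{span}"')
```

The point is just that none of this is possible without access to the data itself, which is exactly what most labs don't release.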
You could do the same, except you would need to be a pirate website. It would even be better, except illegal. But it would be better.
That is why the others can't provide stuff like this, e.g. RAG / hallucination checks. I just wish the Allen.AI models had a bigger context window; 4k is too small nowadays.
It would be useful for answering "is this novel or was it in the training data?", but that's not typically the point of open source.
If labs provided the corpus and source code for training their tokenizers, it would be a lot easier to produce results about tokenizers. As it is, they provide neither, so it is impossible to compare different algorithms running on the same data if you also want to include the vocabs that are commonly used.
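To make that concrete, here's a small sketch of the kind of comparison that becomes possible once you have the corpus. It assumes local files corpus.txt and heldout.txt (hypothetical names) and uses the Hugging Face tokenizers library to train BPE and Unigram tokenizers on the same data, then compares segmentation and vocab overlap; this is just one way you might set it up, not a standard benchmark.

```python
# Train two tokenizer algorithms on the same corpus and compare them.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

FILES = ["corpus.txt"]        # hypothetical training corpus
VOCAB_SIZE = 8000             # arbitrary size for the comparison

# BPE tokenizer
bpe = Tokenizer(models.BPE(unk_token="[UNK]"))
bpe.pre_tokenizer = pre_tokenizers.Whitespace()
bpe.train(files=FILES, trainer=trainers.BpeTrainer(
    vocab_size=VOCAB_SIZE, special_tokens=["[UNK]"]))

# Unigram tokenizer, trained on exactly the same data
uni = Tokenizer(models.Unigram())
uni.pre_tokenizer = pre_tokenizers.Whitespace()
uni.train(files=FILES, trainer=trainers.UnigramTrainer(
    vocab_size=VOCAB_SIZE, special_tokens=["[UNK]"], unk_token="[UNK]"))

# Compare on held-out text: fewer tokens for the same text = better compression.
sample = open("heldout.txt", encoding="utf-8").read()
for name, tok in [("BPE", bpe), ("Unigram", uni)]:
    print(name, "tokens on held-out text:", len(tok.encode(sample).tokens))

# How much of the learned vocabulary do the two algorithms share?
bpe_vocab, uni_vocab = set(bpe.get_vocab()), set(uni.get_vocab())
print("vocab overlap:", len(bpe_vocab & uni_vocab) / len(bpe_vocab | uni_vocab))
```

Without the original corpus you can't run this kind of controlled comparison against the vocabs labs actually ship, which is the whole complaint.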
Many are speculating it was trained on o1/o3 outputs for some of the initial reasoning data.
Are there any widely used models that publish this? If not, then no I guess.
Depending on how you use "randomly", they absolutely can..?