I very much appreciate that the authors not only published their code (https://github.com/llm-random/llm-random) but also included the dataset they used (available on Hugging Face - https://huggingface.co/datasets/c4), along with the training process and hyperparameters, so others can replicate and build on their work. The only thing really missing is the weights, which would be nice to have on Hugging Face as well.

It's very confusing to me that you are praising the authors of a published scientific paper for almost making their work reproducible.

If we had proper data version control, where the git commit hash was tied directly to the output data hash and hosted on IPFS (and the make system checked IPFS for cached artifacts the way it checks local files), then it would be absolutely reproducible.

And the wonderful thing is, every person that used git clone on this repo and ran it would be serving the NN weights.

But alas, this hasn't been done yet.
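
To make that concrete, here's a rough sketch of the caching step (assuming a local ipfs daemon; the lockfile name and the train.py entry point are made up for illustration, not taken from the llm-random repo):

    # Hypothetical sketch: key the trained weights to the current git commit,
    # check IPFS for a cached copy first, and publish the CID if we retrain.
    import pathlib
    import subprocess

    commit = subprocess.run(["git", "rev-parse", "--short", "HEAD"],
                            capture_output=True, text=True, check=True).stdout.strip()
    weights = pathlib.Path(f"checkpoints/model-{commit}.pt")
    lock = pathlib.Path(f"ipfs-{commit}.lock")  # maps this commit to an IPFS CID
    weights.parent.mkdir(exist_ok=True)

    if lock.exists() and subprocess.run(
            ["ipfs", "get", lock.read_text().strip(), "-o", str(weights)]).returncode == 0:
        print(f"fetched cached weights for {commit} from IPFS")
    else:
        # Cache miss: retrain from the versioned data, then publish the result.
        subprocess.run(["python", "train.py", "--output", str(weights)], check=True)
        cid = subprocess.run(["ipfs", "add", "-q", str(weights)],
                            capture_output=True, text=True, check=True).stdout.strip()
        lock.write_text(cid + "\n")  # anyone on the same commit can now fetch it

Anyone who fetches or re-adds the file through their own node ends up helping to serve it, which is where the "everyone who clones the repo serves the weights" part comes from.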

That's not what confusing means.

Feigned confusion

The weights aren't needed to make it reproducible. The code and training data are needed. Hopefully, if you used those, you'd ultimately reach the same result.

Even in the days when this was standard, that was not entirely the case.

There is a whole other world between "released code" and "getting the results as seen in the paper".

Unfortunately, the reproducibility crisis is very much alive and well! :'( There's much more to go into, but it is a deep rabbit hole, indeedy. :'((((

I guess I'm saying that if there are reproducibility problems without the weights, then there's still a reproducibility problem with them. A paper whose weights magically work, while training on the same data with the same algorithm doesn't, is a paper that isn't reproducible.

IMO, having the weights available sometimes just papers over a deeper issue.

Training, especially on large GPU clusters, is inherently non-deterministic, even if all seeds are fixed.

This boils down to framework implementations, timing issues, and the extra cost of trying to ensure determinism (with no guarantees even then).
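
For what it's worth, "fixing all the seeds" in PyTorch usually looks something like the sketch below (these are real PyTorch/CUDA settings, but the rest of the training setup is assumed), and even with all of it in place, multi-GPU timing and a few CUDA kernels can still make runs diverge:

    # The usual determinism knobs in PyTorch. Even with all of them set,
    # run-to-run bitwise identity is not guaranteed on large GPU clusters.
    import os
    import random

    import numpy as np
    import torch

    # Must be set before CUDA is initialized for deterministic cuBLAS behavior.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    SEED = 1234
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)

    # Prefer deterministic kernels; this is slower and raises an error for
    # ops that simply have no deterministic implementation.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False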

Random initialization would keep you from producing the exact same results.

Yes, but there's a difference between exact results and reproducible results. I should get similar performance; otherwise there is an issue.

It's a sad world where our standards are that low. But they are that low for good reasons.

If anything, CS papers are far more reproducible than most papers. Maybe that is sad, but I think most scientists and researchers are trying their best.

I understand where you're coming from but what they provided DOES make their work reproducible. You can use the data, source code, and recipe to train the model and get the weights.

It would be nice if they provided the weights so the model could be USABLE without the effort or knowledge required to train it.

We (I think) would all like to see more _truly_ open models (not just the source code) that enable collaboration in the community.

Only if they also include the random seed they used for the initial weights; otherwise you may be able to reproduce similar performance, but you're unlikely to obtain the same weights.

But that's a lot like saying that my recipe for muffins isn't reproducible because it doesn't say exactly which batch from which field my flour comes from. I mean, of course you won't get the same muffins, but if your muffins taste just as good, it's still a win.

If this work is valuable, the random seed shouldn't affect the outcome thaaat much.