I guess I'm saying that if there's a reproducibility problem without the weights, then there's still a reproducibility problem with them. A paper whose released weights magically work, while training with the same data and algorithm doesn't, is a paper that isn't reproducible.

IMO, having the weights available sometimes just papers over a deeper issue.

Training, especially on large GPU clusters, is inherently non-deterministic, even if all seeds are fixed.

This boils down to framework implementations, timing issues (the order in which parallel work finishes can change from run to run), and the extra cost of trying to enforce determinism, which still comes without guarantees.
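
To make the timing point concrete, here's a toy sketch in plain Python. It isn't any framework's actual reduction code, and `sum_in_order` and the example values are made up for illustration. It only shows that floating-point addition isn't associative, so adding the same numbers in a different order (which is roughly what can happen when parallel partial sums arrive in a different order) can give a different result, with no random seed involved at all.

```python
def sum_in_order(values, order):
    """Add up `values` following the index sequence `order` (hypothetical helper)."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Same three numbers in both cases; only the accumulation order changes.
values = [1e16, 1.0, -1e16]

# Order A: the 1.0 is absorbed by 1e16 before the large terms cancel.
sum_a = sum_in_order(values, [0, 1, 2])

# Order B: the large terms cancel first, so the 1.0 survives.
sum_b = sum_in_order(values, [0, 2, 1])

print(sum_a, sum_b, sum_a == sum_b)
```

In real training the per-step differences are far smaller than this, but they can compound over many updates, which is why two runs with identical seeds can still drift apart.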