Where's the training data and training scripts since you are calling this open source?

Edit: it seems "open source" was edited out of the parent comment.

Doesn't it get tiring after a while? Using the same (perceived) gotcha, over and over again, for three years now?

No one is ever going to release their training data, because it contains every copyrighted work in existence. Everyone, even the hecking-wholesome, safety-first Anthropic, is using copyrighted data without permission to train their models. There you go.

There is an easy fix already in widespread use: "open weights".

It is very much a valuable thing already; no need to taint it with a false promise.

Though I disagree that it wouldn't get used if it were indeed open source: I might not do it in my home lab today, but at least Qwen and DeepSeek would use and build on what e.g. Facebook was doing with Llama, and they might be pushing the open-weights model frontier forward faster.

> There is an easy fix already in widespread use: "open weights"

They're both correct given how the terms are actually used. We just have to deduce what's meant from context.

There was a moment, around when Llama was first being released, when the semantics hadn't yet settled. The nutter wing of the FOSS community, to my memory, put forward a hard-line and unworkable definition of open source and seemed to reject open weights too. So the definition got punted to the closest thing at hand, which was open weights with limited (unfortunately, not zero) use restrictions. At this point, it's a personal preference that is at most polite to respect if you know your audience has one.

The point is that "open source" by now has an established and widespread definition, and the "source" part implies that what is open is the thing the artifact is built from.

Is this really a debate we still need to be having today? It sounds like grumpiness about the Open Source Initiative having defined this ~25 years ago, back when the term was rarely used this way.

If we refuse to accept a well-defined term and want to keep it a matter of personal preference, we could say that about any word in a natural language.

Yeah, open weights is really good, especially when the base model weights (not just the instruction-tuned ones) are released, like here.
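
In practice, base and instruct releases are just different checkpoints of the same architecture. A minimal sketch with Hugging Face transformers, assuming both checkpoints are published on the Hub (the Qwen model ids below are illustrative, not taken from this thread):

    # pip install transformers torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Base checkpoint: a raw next-token predictor, the thing you fine-tune from.
    base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")

    # Instruct checkpoint: same architecture, further tuned to follow chat prompts.
    instruct = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
    ids = tok("Open weights means you can", return_tensors="pt")
    print(tok.decode(base.generate(**ids, max_new_tokens=20)[0]))

Having the base weights is what lets you run your own fine-tune instead of stacking prompts on top of someone else's instruction tuning.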

Nvidia did, with Nemo.

And they got sued:

https://www.reuters.com/technology/nvidia-is-sued-by-authors...

Every lab has been sued whether they released training data or not.

It's not a gotcha, but people using words in ways others don't like.

It's not about likes; it's a flat-out lie.

They are exactly open source. The training data is the internet. Don't say it's on the internet. It IS the internet.

The training scripts are in Megatron and vLLM.

Aww yes, let me push a couple petabytes to my git repo for everyone to download...

An easier thing would be to say "open weights", yes.
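
And "open weights" already gets you most of the practical value: the released checkpoint is all you need to serve the model yourself. A minimal sketch with vLLM (the model id is illustrative):

    # pip install vllm
    from vllm import LLM, SamplingParams

    # No training data or training scripts needed -- just the checkpoint.
    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
    params = SamplingParams(temperature=0.7, max_tokens=64)
    for out in llm.generate(["What does 'open weights' mean?"], params):
        print(out.outputs[0].text)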

Weights are the source, training data is the compiler.

You got it the wrong way round. It's more akin to (toy sketch below):

1. Training data is the source.
2. Training is compilation/compression.
3. Weights are the compiled artifact, akin to optimized assembly.

However, it's an imperfect analogy on so many levels. Nitpick away.
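
To make the compiler analogy concrete, here's a toy sketch (mine, not from the thread): gradient descent "compiles" a dataset into a single weight, and only the weight gets shipped.

    # Toy "compilation": turn a dataset (the source) into a weight (the artifact).
    data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # (x, y) pairs, roughly y = 2x

    w = 0.0                # the "binary" we will ship
    lr = 0.02              # learning rate
    for _ in range(500):   # the "compile" step: gradient descent on squared error
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad

    print(f"shipped weight: {w:.3f}")  # ~2.04; the dataset itself is discarded

Shipping w alone is "open weights"; shipping the data plus the training loop is what "open source" would mean under this analogy.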

Its dataset [0] is released under a source-available or OSI license, i.e. it's an open dataset or open-source dataset.

[0] https://news.ycombinator.com/item?id=47758408