LLMs are making open source programs both more viable and more valuable.
I have many programs I use that I wish were a little different, but even if they were open source, it would take a while to acquaint myself with the source code organization to make these changes. LLMs, on the other hand, are pretty good at small self-contained changes like tweaks or new minor features.
This makes it easier to modify open source programs, but it also means that if a program isn't open source, I can't make these changes at all. Before, I wasn't going to make the change anyway; now that I actually can, the ability to make changes (i.e. the program being open source) becomes much more important.
So you’re just storing a bunch of forks of open source projects with some AI-generated changes applied to them?
Open-weights only are also not enough, we need control of the dataset and training pipeline.
The average user like me wouldn't be able to run pipelines without serious infrastructure, but it's very important to understand how the data is used and how the models are trained, so that we own the model and can assess its biases openly.
Good luck understanding the biases in a petabyte of text and images and video, or whatever the training set is.
Do you disagree it's important to have access to the data, ease of assessment notwithstanding?
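Even crude assessment becomes tractable once the data is published, because you can sample and count without reading everything. A minimal sketch, assuming the corpus were released as a streamable Hugging Face dataset (the dataset name, text field, and probe terms here are all illustrative):

    # Streaming bias probe: count a few loaded terms over a sample of the
    # corpus without downloading the whole thing. Dataset name is hypothetical.
    from collections import Counter
    from datasets import load_dataset

    ds = load_dataset("example-org/open-training-corpus",
                      split="train", streaming=True)

    counts = Counter()
    for i, row in enumerate(ds):
        for term in ("vaccine", "election", "tiananmen"):
            counts[term] += row["text"].lower().count(term)
        if i >= 100_000:  # sample a slice, not the petabyte
            break
    print(counts)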
I view it as more or less irrelevant. LLMs are fundamentally black boxes. Whether you run the black box locally or use it remotely, whether you train it yourself or use a pretrained version, whether you have access to the training set or not, it's completely irrelevant to control. Using an LLM means giving up control and understanding of the process. Whether it's OpenAI or the training data-guided algorithm that controls the process, it's still not you.
Now, running local models instead of using them as a SaaS has a clear purpose: the price of your local model won't suddenly increase tenfold once you start depending on it, like the SaaS models might. Any level of control beyond that is illusory with LLMs.
I, on the other hand, think it's irrelevant whether a technology is a black box or not. If it's supposed to fit the open-source/FOSS model of the original post, having access to precursors is just as important as having access to the weights.
It's fine for models to have open weights and closed data. It only barely fits the open-source model IMHO, though.
The point of FOSS is control. You want to have access to the source, including build instructions and everything, in order to be able to meaningfully change the program, and understand what it actually does (or pay an expert to do this for you). You also want to make sure that the company that made this doesn't have a monopoly on fixing it for you, so that they can't ask you for exorbitant sums to address an issue you have.
An open-weight model addresses the second part of this, but not the first. In fact, even an open-weight model with all of the training data available doesn't fix the first problem. Even if you somehow got access to enough hardware to train your own GPT-5 based on the published data, you still couldn't meaningfully fix an issue you have with it, not even if you hired Ilya Sutskever and Yann LeCun to do it for you: these are black boxes that no one can actually understand at the level of a program or device.
I'm not an expert on this tech, so I could be talking out my ass, but what you are saying here doesn't ring completely true to me. I'm an avid consumer of Stable Diffusion-based models. The community is very easily able to train adaptations to the network (e.g. LoRA) that push it in a certain direction, to the point where you consistently get the model to produce specific types of output (e.g. perfectly replicating the style of a well-known artist).
I have also seen people train "jailbreaks" of popular open source LLMs (e.g. Google Gemma) that remove the condescending ethical guidelines and just let you talk to the thing normally.
So all in all I am skeptical of the claim that there would be no value in having access to the training data. Clearly there is some ability to steer the direction of the output these models produce.
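For a sense of how lightweight such steering is: those community adaptations are typically LoRA-style low-rank updates, which freeze the base weights and train only a tiny adapter. A minimal hand-rolled sketch in PyTorch (toy dimensions; real Stable Diffusion fine-tuning would target the UNet's attention projections):

    import torch

    class LoRALinear(torch.nn.Module):
        # Frozen base layer plus a trainable low-rank update:
        # y = W x + (alpha / r) * B A x
        def __init__(self, base: torch.nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False  # only the adapter trains
            self.A = torch.nn.Parameter(0.01 * torch.randn(r, base.in_features))
            self.B = torch.nn.Parameter(torch.zeros(base.out_features, r))
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    # Stand-in for one attention projection inside a diffusion UNet.
    layer = LoRALinear(torch.nn.Linear(320, 320))
    out = layer(torch.randn(1, 320))

Since B starts at zero, the adapter initially changes nothing; training moves the output only along the low-rank directions.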
Golden Gate Claude and abliterated models, plus DeepSeek's censoring of Tiananmen Square combined with Grok's alternate political views, imply that these boxes are somewhat translucent, especially to leading experts like Ilya Sutskever. That Grok holds alternative views and produces NSFW dialog while ChatGPT refuses implies that there's additional work during training to align models. Getting access to the source used to train the models would let us see into that model's alignment. It's easy enough to ask ChatGPT how to make cocaine and get a refusal, but what else is lying in wait that hasn't been discovered yet? It's hard to square the notion that these are black boxes that no one understands whatsoever with the fact that the original Llama models, which contain the same refusal, have been edited, at the level of a program, into abliterated models which happily give you a recipe. Note: I am not Pablo Escobar and cannot comment on the veracity of said recipe, only that it no longer refuses.
https://www.anthropic.com/news/golden-gate-claude
https://huggingface.co/mlabonne/Meta-Llama-3.1-8B-Instruct-a...
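The abliteration trick is itself instructive: as I understand the published recipe, it finds a "refusal direction" as a difference of mean activations between harmful and harmless prompts, then projects that direction out of the weights. A minimal sketch of the idea in PyTorch (toy shapes, random stand-in activations):

    import torch

    def refusal_direction(harmful_acts, harmless_acts):
        # Difference of means over cached residual-stream activations,
        # unit-normalized.
        d = harmful_acts.mean(0) - harmless_acts.mean(0)
        return d / d.norm()

    def abliterate(W, r):
        # Remove each output's component along r, so this layer can no
        # longer write the refusal direction into the residual stream.
        return W - torch.outer(r, r) @ W

    # Toy shapes: 4096-dim residual stream, 128 cached prompts per set.
    r = refusal_direction(torch.randn(128, 4096), torch.randn(128, 4096))
    W_edited = abliterate(torch.randn(4096, 4096), r)

That such a small, targeted edit reliably removes refusals is exactly the kind of translucency I mean.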
> having access to precursors is just as important as having access to the weights
They probably can't give you the training set as it would amount to publication of infringing content. Where would you store it, and what would you do with it anyway?
If it's infringing content, it's not open and it's not FOSS. For a fully open stack for local LLMs you need open data too.
It is an interesting question. Of course everyone should have equal access to the data in theory, but I also believe nobody should be forced to offer it for free to others and I don't think I want to spend tax money having the government host and distribute that data.
I'm not sure how everyone can have access to the data without someone else taking on the burden of providing it.
I think torrents are a very good way to redistribute this type of data. You can even selectively sync and redistribute.
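For instance, a minimal sketch of selective sync with the libtorrent Python bindings (the torrent file name is hypothetical):

    import libtorrent as lt

    ses = lt.session()
    info = lt.torrent_info("open-training-corpus.torrent")  # hypothetical file
    h = ses.add_torrent({"ti": info, "save_path": "./data"})

    # Selective sync: skip everything, then fetch only the shard you want
    # to audit; seeding whatever you hold redistributes it to others.
    for i in range(info.num_files()):
        h.file_priority(i, 0)  # 0 = don't download
    h.file_priority(0, 4)      # 4 = default priority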
I'm also not saying anyone should be forced to disclose training data. I'm only saying that an LLM that's only open-weight and not open-data/pipeline barely fits the open-source model of the stack mentioned by OP.
Local Org mode, local LLM, all orchestrated with Emacs, all free software.
If only I were retired and had infinite time!
Isn't local inference infeasible for models of useful size (at least on a typical dev machine with <= 64GB RAM and a single GPU)?
Maybe this is of interest https://laurentcazanove.com/blog/obsidian-rag-api
Seems like this might be possible with opencode? Haven't played with it much.
LM Studio + aider
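LM Studio serves an OpenAI-compatible HTTP API locally, so anything scriptable (Emacs included) can talk to it. A minimal sketch, assuming the default port and whatever model you have loaded:

    import requests

    resp = requests.post(
        "http://localhost:1234/v1/chat/completions",
        json={
            "model": "local-model",  # placeholder; LM Studio uses the loaded model
            "messages": [{"role": "user",
                          "content": "Summarize today's Org agenda."}],
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])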
Apple then.
That rules out the open source part.