Well, that historical content and code still exists, right? Are you just saying “what if we’re in a world of walled gardens now, where OSS dies because people don’t want their work stolen”? In that case: these companies will still get data, and they don’t need OSS anymore. It’s already webcrawled or licensed or commissioned; they pay people to generate novel traces when they need them, or at the very least sets of prompts and tests for verification. Then the synthetic data that passes verification gets added to the training set.

That sounds like it would reduce the blazing progress of the last decades to a snail's pace, some twilight where software is just average, as it always was and always will be. Basically, I'm not convinced that people will keep doing the opposite of what is now incentivized. If just using the LLM gets you ahead in a time of severe pressure, then most people will do that, and by the time anyone realizes they kinda need a FEW people to actually be able to reason about something from start to finish, it might be too late.

We're not such a smart species. It's not like we've managed so far. We're just adding unsolved problems and distracting ourselves with even bigger problems. The world could have been fed and clothed by the mid 20th century and we could have solved climate change by the 1980s (talking out of my ass here, but with confidence in my general point), but instead we now throw everything into the furnace in the hopes it will create a deus ex machina, like in that very bad Isaac Asimov story. I think we are absolutely capable of lobotomizing ourselves (as a species), like a toddler playing with an electrical socket and shocking itself. I don't say this to be snarky, I honestly think we're that unserious and ignorant about what we do and the environment we do it in.

But I also really should look into what you answered about LLMs learning from themselves. I've heard it mentioned before but I still have no real clue. I will try to rectify that. I mean, I really, really want to be wrong on this, only a monster wouldn't.

> by the time anyone realizes they kinda need a FEW people to actually be able to reason about something from start to finish, it might be too late.

I don't think it will be "too late" by any reasonable definition. All those things are learnable, and companies that really need to overcome it will. But they won't be open with their knowledge. Learning/training will be expensive, and once people acquire it, they won't share it like open source projects and programming tech blogs did.

This is super hilarious :-)))

Do you think creating the orders of magnitude of content the internet produced organically, and which LLM creators are stealing, is cheap? If they actually have to pay for content creation while competing with content creators on the, you know, content creation front via LLM generation, the entire business model of LLMs collapses.

You can't have the mountains of data needed for LLMs in the decades to come if your LLMs put the writers and artists out of work.

It’s literally how these models are trained today. They of course use open source data, but that’s no longer the most important source; what matters most is high quality prompts, verifiable tests, and a lot of inference compute. They also have massive flywheels from users, from which they can mine good data or, at the very least, good prompts, which can be just as important.
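
To make the "prompts plus verifiable tests" idea concrete, here is a toy sketch of the filtering step, not how any lab actually does it. Everything in it is a made-up illustration: the `Task` type, the `passes_tests` and `build_verified_set` helpers, and the hard-coded `candidates` list, which stands in for samples you would get from a model. The only point is that candidates which fail the tests never reach the training set.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Task:
    """Hypothetical task: a prompt plus test cases used as the verifier."""
    prompt: str
    tests: List[Tuple[tuple, object]]  # (arguments, expected result)


def passes_tests(source: str, task: Task) -> bool:
    """Run a candidate's source and check it against every test case.

    NOTE: exec() on untrusted code is unsafe; a real pipeline would sandbox
    this. It is used here only to keep the sketch short.
    """
    namespace: Dict[str, object] = {}
    try:
        exec(source, namespace)
        solve = namespace["solve"]  # assumed convention: candidate defines solve()
        return all(solve(*args) == expected for args, expected in task.tests)
    except Exception:
        # Syntax errors, missing solve(), runtime errors: all count as failures.
        return False


def build_verified_set(task: Task, candidates: List[str]) -> List[Dict[str, str]]:
    """Keep only the candidates that pass verification."""
    return [
        {"prompt": task.prompt, "completion": c}
        for c in candidates
        if passes_tests(c, task)
    ]


task = Task(
    prompt="Write solve(a, b) that returns the sum of a and b.",
    tests=[((1, 2), 3), ((-1, 1), 0)],
)

# Stand-ins for model samples: one correct, one wrong, one that does not parse.
candidates = [
    "def solve(a, b):\n    return a + b\n",
    "def solve(a, b):\n    return a - b\n",
    "def solve(a, b) return a + b\n",
]

verified = build_verified_set(task, candidates)
```

In this toy run only the first candidate survives. The expensive parts in practice would be writing good prompts and tests and paying for the inference to sample many candidates, which is the argument being made above.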