This brings up a tangential question for me.
Clearly, companies view the context fed to these tools as valuable. And it certainly has value in the abstract, as information about how they're being used or could be improved.
But is it really useful as training data? Sure, some new codebases might be fed in... but after that, given how context works and how people are "vibe coding", 95% of the novel input is just the output of previous LLMs.
While the utility of synthetic data shows that model collapse is not inevitable, it does seem to be a real concern... and I can say definitively based on my own experience that the _median_ quality of LLM-generated code is much worse than the _median_ quality of human-generated code. Especially since this would include all the code that was rejected during the development process.
Without substantial post-processing to filter out the bad input code, I question how valuable the context from coding agents is for training data. Again, it's probably quite useful for other things.
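Something like this is the kind of filter I have in mind (the field names and thresholds are invented purely to illustrate the shape of it, not anything any vendor actually does): keep only the agent output that the human effectively accepted, and throw away the rest.

```python
from dataclasses import dataclass


@dataclass
class AgentTurn:
    code: str
    tests_passed: bool    # did the project's tests pass after the edit?
    user_kept_it: bool    # did the change survive into the next commit?
    follow_up_edits: int  # how many times the user reworked it afterwards


def worth_training_on(turn: AgentTurn) -> bool:
    """Keep only code the human effectively accepted with little rework."""
    return turn.tests_passed and turn.user_kept_it and turn.follow_up_edits <= 1


def filter_corpus(turns: list[AgentTurn]) -> list[str]:
    return [t.code for t in turns if worth_training_on(t)]
```

The point being that the acceptance signal has to come from the surrounding interaction, not from the code itself, which is exactly the part that's expensive to reconstruct after the fact.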
The human/computer interaction is probably more valuable than any code they could slurp up. It's basically CCTV of people using your product and live-correcting it, in a format you can feed back into the thing to tell it to improve. Maybe one day they will even learn to stop disabling tests to get them to pass.
There is a company, maybe even a YC company, which I saw posting about wanting to pay people for private repos that died on the vine and were never released as products. I believe they were asking for pre-2022 code to avoid LLM taint. This was to be used as training data.
This is all a fuzzy memory, I could have multiple details wrong.
I suspect the product telemetry would be more useful - things like whether an interaction succeeded or needed subsequent editing, success from tool use, and success from context & prompt tuning parameters would be more valuable to the product than just feeding more bits into the core model.
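Roughly the kind of record I'm imagining per interaction (none of these field names come from any real product, it's just a sketch of the signals that seem worth more than the raw code):

```python
from dataclasses import dataclass, field


@dataclass
class InteractionOutcome:
    prompt_tokens: int
    context_strategy: str                  # e.g. "full-repo", "retrieval", "open-tabs"
    tool_calls: list[str] = field(default_factory=list)
    accepted_without_edit: bool = False    # did the user keep the first answer?
    edits_before_accept: int = 0           # how much rework it took
    abandoned: bool = False                # user gave up on the interaction


def acceptance_rate(outcomes: list[InteractionOutcome]) -> float:
    """Fraction of interactions accepted as-is; the metric I'd tune against."""
    if not outcomes:
        return 0.0
    return sum(o.accepted_without_edit for o in outcomes) / len(outcomes)
```

You could slice that by context strategy or tool use to improve the product directly, without ever touching the code as training data.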