I would assume it's important to know what's in that training set too
Because I get reliable generation out of "niche" languages already
Is it code full of SQL injection vulnerabilities, written for a different domain than your own?
It's maybe not good to conflate quantity with quality
This is dated, but a professor told me that LLMs are really, really good at generating bad pandas code because they've been trained on so much of it!