I agree that the training sets for LLMs contain much more Python than Rust. But C++ has existed before Python I believe. So I doubt there is 2 orders of magnitude more Python code than C++ code.

You're overlooking how many fewer programmers there were in the early years, how little of that code was ever public, and, even where it was public, how useful it still is, given that C++ has changed drastically since, say, what we used to write in 2001.

It's not just a question of whether there is more actual code in a given language, but of how much of it is available in public and private training data.

I've done work on reviewing and fine-tuning training data with a couple of providers, and the amount of Python code I got to see, at least, outstripped the C++ code by far more than 2 orders of magnitude. It could be a heavily biased sample, but I have no problem believing it could also be representative.

Python is pretty old, so I had a quick look.

https://en.wikipedia.org/wiki/C%2B%2B#History

> In 1985, the first edition of The C++ Programming Language was released, which became the definitive reference for the language, as there was not yet an official standard. The first commercial implementation of C++ was released in October of the same year.

> In 1998, C++98 was released, standardizing the language, and a minor update (C++03) was released in 2003.

https://en.wikipedia.org/wiki/History_of_Python

> The programming language Python was conceived in the late 1980s, and its implementation was started in December 1989 by Guido van Rossum at CWI in the Netherlands as a successor to ABC capable of exception handling and interfacing with the Amoeba operating system.

> Python reached version 1.0 in January 1994.

Of course, it's hard to say how much of that is reflected in the code available, or whether any of the old code is still valid input for modern use. Broadly, though, it does look like C++ is older.

> But C++ has existed before Python I believe.

Sure, C++ is 42 years old and Python is “only” 34. Both are older than the online code hosts (or even the web itself) from which the code for AI training data is sourced, so age probably isn't a key factor in how much code exists for each. Popularity among projects hosted in accessible public code repos is more relevant.