This stuff tends to happen by accident. Some org has an app that automatically downloads the dataset if it's missing, which is helpful for local development. Then the app gets wired into CI, and no one notices that it's downloading the dataset on every single CI run.
At some point wilful incompetence becomes malice. You really shouldn't allow network requests from your CI runners unless you have a problem that cannot be solved any other way (hint: you don't).
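
For what it's worth, here is a rough sketch of how a "download if missing" helper could keep its convenience for local development while refusing to touch the network in CI. Everything in it is hypothetical: `ensure_dataset`, `running_in_ci`, `DatasetMissingError`, and the `download` callback are made-up names, and it assumes your CI service sets a `CI` environment variable (many do, but check yours).

```python
import os
from pathlib import Path


class DatasetMissingError(RuntimeError):
    """Raised when the dataset is absent and downloading is not allowed."""


def running_in_ci(env=None):
    # Assumption: the CI service exports CI=true (or 1/yes).
    env = os.environ if env is None else env
    return env.get("CI", "").lower() in {"1", "true", "yes"}


def ensure_dataset(path, download, env=None):
    """Return `path` if it exists; otherwise download it, but never in CI.

    `download` is a caller-supplied function that takes the target Path
    and writes the dataset there (hypothetical; plug in your own fetcher).
    """
    path = Path(path)
    if path.exists():
        return path
    if running_in_ci(env):
        # Fail loudly instead of silently hitting someone's server every run.
        raise DatasetMissingError(
            f"{path} is missing; provide it via a cache or artifact "
            "instead of downloading it during CI"
        )
    download(path)
    return path
```

The idea is that a missing dataset in CI becomes a visible failure that someone has to fix with a cache or a pre-baked image, rather than a quiet download nobody notices. It is not a substitute for actually blocking outbound traffic on the runners.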