I'd consider CI one of the most massive wastes of computing resources ever invented, although I don't see how map data would be subject to the same sort of abusive downloading as libraries or other code.
This stuff tends to happen by accident. Some org has an app that automatically downloads the dataset if it's missing, which is helpful for local development. Then it gets loaded into CI, and no one notices that it's downloading that dataset on every single CI run.
At some point wilful incompetence becomes malice. You really shouldn't allow network requests from your CI runners unless you have something that cannot be solved in another way (hint: you don't).
Let's say you're working on an app that incorporates some Italian place names or roads or something. It's easy to imagine how when you build the app, you want to download the Italian region data from geofabrik then process it to extract what you want into your app. You script it, you put the script in your CI...and here we are:
> Just the other day, one user has managed to download almost 10,000 copies of the italy-latest.osm.pbf file in 24 hours!
You shouldn't download that data on demand at build time. Dependencies on state you don't control are bad even without the bandwidth issues.
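For what it's worth, here is a minimal sketch of the "download if missing" helper that stays convenient locally but refuses to hit the network in CI. It assumes the runner sets a `CI` environment variable (many do, but check yours), and `ensure_dataset` and `fetch` are hypothetical names, not anyone's real API:

```python
import os
from pathlib import Path
from typing import Callable, Mapping


def ensure_dataset(
    path: Path,
    fetch: Callable[[Path], None],
    env: Mapping[str, str] = os.environ,
) -> Path:
    """Return the local dataset path, downloading only outside CI."""
    if path.exists():
        return path  # already on disk: no network traffic at all
    if env.get("CI"):
        # Assumption: the CI runner exports CI=<something non-empty>.
        # A missing dataset in CI is treated as a configuration error.
        raise RuntimeError(
            f"{path} is missing; restore it from the CI cache or an internal mirror"
        )
    path.parent.mkdir(parents=True, exist_ok=True)
    fetch(path)  # hypothetical downloader supplied by the caller
    return path
```

The point of raising instead of silently downloading is that the accidental "every CI run pulls the whole file" case becomes a loud failure someone has to fix once.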
Whenever people complain about the energy usage of LLM training runs I wonder how this stacks up against the energy we waste by pointlessly redownloading/recompiling things (even large things) all the time in CI runs.
Optimising CI pipelines has been a strong aspect of my career so far.
Anybody can build a pipeline to get a task done (thousands of quick & shallow howto blog posts) but doing this efficiently so it becomes a flywheel rather than a blocker for teams is the hard part.
Not just caching but optimising job execution order and downstream dependencies too.
The faster it fails, the faster the developer feedback, and the faster a fix can be introduced.
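A rough sketch of the fail-fast ordering idea, assuming you have a duration estimate per job and that the jobs are independent of each other (real pipelines also have dependency edges, which this ignores). `Job` and `run_fail_fast` are made-up names for illustration:

```python
from typing import Callable, List, Optional, Tuple

# (name, estimated seconds, callable returning True on success)
Job = Tuple[str, float, Callable[[], bool]]


def run_fail_fast(jobs: List[Job]) -> Optional[str]:
    """Run the cheapest jobs first; return the first failing job's name, or None."""
    for name, _estimate, run in sorted(jobs, key=lambda job: job[1]):
        if not run():
            return name  # stop here: later, more expensive jobs never start
    return None
```

With a 5-second lint job, a 1-minute unit test job and a 10-minute end-to-end job, a unit test failure is reported after about a minute and the end-to-end job is never started.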
I quite enjoy the work and am always learning new techniques to squeeze out extra performance or save time.
Also for some reason, most CI runners seem to cache nothing except for that minor thing that you really don't want cached.
This is exactly it - you can easily cache all the wrong things, cache only the code you wanted to change, or cache nothing but one small critical file nobody knows about.
No wonder many just turn caching entirely off at some point and never turn it back on.
CI is great for software reliability but it should not be allowed to make network requests.
CI itself doesn't have to be a waste. The problem is most people DGAF about caching.
You don't need caching if your build can run entirely offline in the first place.