Sounds like some people are downloading it in their CI pipelines. Probably unknowingly. This is why most services stopped allowing automated downloads for unauthenticated users.
Make people sign up if they want a URL they can `curl`, and then either block or charge users who download too much.
I'd consider CI one of the biggest wastes of computing resources ever invented, although I don't see how map data would be subject to the same sort of abusive downloading as libraries or other code.
This stuff tends to happen by accident. Some org has an app that automatically downloads the dataset if it's missing, which is helpful for local development. Then it gets loaded into CI, and no one notices that it's downloading that dataset on every single CI run.
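Roughly the pattern, as a minimal sketch (the helper and paths are made up): it looks perfectly reasonable on a laptop, but an ephemeral CI runner starts with an empty filesystem, so the existence check never hits and the file gets fetched on every single run.

```python
# Hypothetical "download if missing" helper: fine for local development,
# pathological in CI where the working directory is fresh on every run.
from pathlib import Path
import urllib.request

DATA_URL = "https://download.geofabrik.de/europe/italy-latest.osm.pbf"
DATA_PATH = Path("data/italy-latest.osm.pbf")

def ensure_dataset() -> Path:
    if not DATA_PATH.exists():  # always False on a fresh CI runner
        DATA_PATH.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(DATA_URL, DATA_PATH)
    return DATA_PATH
```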
At some point wilful incompetence becomes malice. You really shouldn't allow network requests from your CI runners unless you have a problem that can't be solved any other way (hint: you don't).
Let's say you're working on an app that incorporates some Italian place names or roads or something. It's easy to imagine that when you build the app, you want to download the Italian region data from Geofabrik and then process it to extract what you want into your app. You script it, you put the script in your CI... and here we are:
> Just the other day, one user has managed to download almost 10,000 copies of the italy-latest.osm.pbf file in 24 hours!
You shouldn't download that data on demand at build time. Dependencies on state you don't control are bad even without the bandwidth issues.
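One way to break that dependency (everything below is hypothetical, including the internal mirror URL): copy the extract once into storage you control, pin its checksum, and have builds pull only from that mirror and fail loudly on a mismatch. Upstream then sees one download per dataset version instead of one per CI run.

```python
# Sketch: fetch a pinned dataset version from an internal mirror and verify it.
# MIRROR_URL and EXPECTED_SHA256 are placeholders recorded when the mirror
# copy was made.
import hashlib
import urllib.request
from pathlib import Path

MIRROR_URL = "https://artifacts.internal.example/osm/italy-2024-06-01.osm.pbf"
EXPECTED_SHA256 = "<sha256 recorded at mirroring time>"
DEST = Path("data/italy.osm.pbf")

def fetch_pinned() -> Path:
    if not DEST.exists():
        DEST.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(MIRROR_URL, DEST)
    h = hashlib.sha256()
    with DEST.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    if h.hexdigest() != EXPECTED_SHA256:
        raise RuntimeError(f"checksum mismatch for {DEST}: got {h.hexdigest()}")
    return DEST
```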
Whenever people complain about the energy usage of LLM training runs, I wonder how this stacks up against the energy we waste by pointlessly redownloading/recompiling things (even large things) all the time in CI runs.
Optimising CI pipelines has been a major focus of my career so far.
Anybody can build a pipeline that gets a task done (there are thousands of quick, shallow how-to blog posts), but doing it efficiently, so the pipeline becomes a flywheel rather than a blocker for teams, is the hard part.
That means not just caching, but optimising job execution order and downstream dependencies too.
The faster it fails, the faster the developer feedback, and the faster a fix can be introduced.
I quite enjoy the work and am always learning new techniques to squeeze out extra performance or save time.
Also, for some reason, most CI runners seem to cache nothing except the one minor thing you really don't want cached.
This is exactly it: you can easily cache all the wrong things, cache exactly the code you wanted to change, or cache nothing but one small critical file nobody knows about.
No wonder many people just turn caching off entirely at some point and never turn it back on.
CI is great for software reliability but it should not be allowed to make network requests.
CI itself doesn't have to be a waste. The problem is most people DGAF about caching.
You don't need caching if your build can run entirely offline in the first place.
I suspect it's web apps that "query" the GPKG files. Parquet can be queried surgically; I'm not sure if there is a way to do the same with GPKG.
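For anyone curious, this is roughly what "surgical" querying of a remote Parquet file looks like with DuckDB (the URL and column names are placeholders): only the footer metadata and the row groups/columns the query touches are fetched via HTTP range requests. A GPKG is a SQLite database under the hood, which doesn't lend itself as easily to that kind of partial remote access, so naive clients tend to pull the whole file.

```python
# Sketch: query a remote Parquet file with DuckDB's httpfs extension.
# Only the needed byte ranges are downloaded, not the whole file.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
rows = con.execute(
    """
    SELECT name, highway
    FROM read_parquet('https://example.com/italy-roads.parquet')
    WHERE highway = 'motorway'
    LIMIT 100
    """
).fetchall()
```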
Can we identify requests from CI servers reliably?
You can reliably identify requests from GitHub's free CI, which probably covers 99% of requests.
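Specifically, GitHub publishes the IP ranges used by its hosted Actions runners through the meta API, so the blunt option is to rate-limit or block those ranges at the edge (self-hosted runners obviously won't be caught). A rough sketch:

```python
# Sketch: check whether a client IP belongs to GitHub-hosted Actions runners,
# using the CIDR ranges published at https://api.github.com/meta.
import ipaddress
import json
import urllib.request

def github_actions_networks():
    with urllib.request.urlopen("https://api.github.com/meta") as resp:
        meta = json.load(resp)
    return [ipaddress.ip_network(cidr) for cidr in meta.get("actions", [])]

def is_github_actions(ip: str, networks) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

# Cache the ranges and refresh them periodically; they change over time.
```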
For example, GMP blocked GitHub:
https://www.theregister.com/2023/06/28/microsofts_github_gmp...
This "emergency measure" is still in place, but there are mirrors available so it doesn't actually matter too much.
I try to stick to GitHub for GitHub CI downloads.
E.g. my SQLite project downloads code from the GitHub mirror rather than Fossil.
Sure, have a JS script involved in generating a temporary download URL.
That way someone manually downloading the file is not impacted, but if you try to put the URL in a script it won't work.
There is really no reason to add a JS dependency for this - whatever server-side component expires old URLs can just as well update the download page with the new one.
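Either way, the server-side piece is small. A sketch of the idea (names made up, web framework left out): sign the path plus an expiry with a secret, put that link on the download page, and have the download handler reject anything expired or tampered with, so copy-pasting the URL into a cron job stops working after an hour.

```python
# Sketch: HMAC-signed download URLs that expire, no JS required.
import hashlib
import hmac
import time

SECRET = b"rotate-this-secret"  # kept server-side only

def make_signed_url(path: str, ttl_seconds: int = 3600) -> str:
    expires = int(time.time()) + ttl_seconds
    sig = hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{path}?expires={expires}&sig={sig}"

def check_signed_url(path: str, expires: int, sig: str) -> bool:
    if expires < time.time():
        return False
    expected = hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```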
Having some kind of lightweight auth (an API key, or even just email-based) is a good compromise.