For Python (or PyPI) this is easier, since its data is available on Google BigQuery [1], so you can just run:
SELECT * FROM `bigquery-public-data.pypi.distribution_metadata` ORDER BY length(version) DESC LIMIT 10
The winner is: https://pypi.org/project/elvisgogo/#history

The package with the most versions still listed on PyPI is spanishconjugator [2], which consistently published ~240 releases per month between 2020 and 2024.
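That second ranking needs a grouped count rather than the length query above. A minimal sketch against the same table (untested, but name and version are in the published schema):

-- Count distinct published versions per package
SELECT name, COUNT(DISTINCT version) AS n_versions
FROM `bigquery-public-data.pypi.distribution_metadata`
GROUP BY name
ORDER BY n_versions DESC
LIMIT 10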
[1] https://console.cloud.google.com/bigquery?p=bigquery-public-...
Regarding spanishconjugator, commit ec4cb98 has the description "Remove automatic bumping of version".

Prior to that commit, a cronjob ran the 'bumpVersion.yml' workflow four times a day, which in turn executed the bump2version Python module to increase the patch level. [0]
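That cadence should be visible in the BigQuery dataset mentioned upthread. A hedged sketch, assuming the distribution_metadata table's upload_time column:

-- Releases per day for spanishconjugator
SELECT DATE(upload_time) AS day, COUNT(DISTINCT version) AS releases
FROM `bigquery-public-data.pypi.distribution_metadata`
WHERE name = 'spanishconjugator'
GROUP BY day
ORDER BY day DESC
LIMIT 30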
Edit: discussed here: https://github.com/Benedict-Carling/spanish-conjugator/issue...
[0] https://github.com/Benedict-Carling/spanish-conjugator/commi...
i love the package owner’s response in that issue xD
Tangential, but I've only heard about BigQuery from people being surprised with gargantuan bills for running one query on a public dataset. Is there a "safe" way to use it with a cost limit, for example?
Yes, you can set price caps. The cost of a query is knowable ahead of time with the default on-demand pricing model ($6 per TB of data processed by the query). People usually get caught out by running expensive queries repeatedly. BigQuery is very cost-effective and can be used safely.
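You can also audit what recent queries actually cost via BigQuery's INFORMATION_SCHEMA job views; a per-query cap is available too, via the maximum_bytes_billed job setting. A hedged sketch (the region path and the $6/TB rate are assumptions, check current pricing):

-- Approximate on-demand spend per job over the last 7 days
SELECT
  job_id,
  total_bytes_billed,
  total_bytes_billed / 1e12 * 6 AS approx_usd  -- $6/TB, per the figure above
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_billed DESC
LIMIT 10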
You can tell someone has worked in the cloud for too long when they start to think of $6 per database query as a reasonable price.
We really need to go back to on-premise. We have surrendered our autonomy to these megacorps and are now paying for it, quite literally in many cases.
Surely most queries should process much less than 1 TB of data?
My 3 TB, 41-billion-row table costs pennies to query day to day. The billing is based on the data processed by the query, not the table size. I pay more for storage.
Can you actually set "price caps"?
Most cloud services let you set alerts, which are notorious for showing up after you've accidentally spent 50k USD. So even if you had a system that automatically shut down services when the alert fired, you'd be SOL.
Running ripgrep across my 8 TB hard drive would cost me $48 at that price point.
BigQuery data is stored (I assume) in column-oriented files with indices, so a typical query reads only a tiny fraction of the stored data.
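As a concrete illustration (an untested sketch against the public PyPI table), the bytes billed track the columns a query touches, not the table size:

-- Touches every column, so it bills for the whole table:
--   SELECT * FROM `bigquery-public-data.pypi.distribution_metadata`
-- Touches only `version`, so it bills a small fraction of that:
SELECT MAX(LENGTH(version))
FROM `bigquery-public-data.pypi.distribution_metadata`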
I decided my life could not possibly go on until I knew what "elvisgogo" does, so I downloaded the tarball and poked around. It's a pretty ordinary numpy + pandas + matplotlib project that makes graphs from CSV. One line jumped out at me:

str_0 = ['refractive_index','Na','Mg','Al','Si','K','Ca','Ba','Fe','Type']

The University of St Andrews has a laser named "elvis" that goes on a remote-controlled submarine: https://www.st-andrews.ac.uk/~bds2/elvislaser.htm

I was hoping it'd be about go-go dancing to Elvis music, but physics experiments on light in seawater is pretty cool too.
You can also query for free at clickpy.clickhouse.com. If you click on any of the links on the visuals, you can see the query used.

The underlying dataset is hosted at sql.clickhouse.com, e.g. https://sql.clickhouse.com/?query=U0VMRUNUIGNvdW50KCkgICBGUk...
Disclaimer: I built this a while ago, but we maintain it at ClickHouse.

Oh, and RubyGems data is also there.
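A version-count query there might look like this hedged sketch (the pypi.projects table and its name/version columns are assumptions; the exact schema shows up in the queries linked from the visuals):

-- ClickHouse dialect: distinct published versions per project
SELECT name, uniqExact(version) AS n_versions
FROM pypi.projects
GROUP BY name
ORDER BY n_versions DESC
LIMIT 10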
Here [0] is the partial query on the ClickHouse dataset, with different results due to a quota error [1].
[0] https://sql.clickhouse.com?query=U0VMRUNUIHByb2plY3QsIE1BWCh...
[1] Quota read limit exceeded. Results may be incomplete.
We have materialized views you can use to avoid this:
https://sql.clickhouse.com/?query=U0VMRUNUIHByb2plY3QsIE1BWC...
takes 0.1s
> spanishconjugator [2], which consistently published ~240 releases per month between 2020 and 2024
They also stopped updating major and minor versions after hitting 2.3 in Sept 2020. It would be interesting to hear the rationale behind the versioning strategy; it feels like you might as well use a datetimestamp as the version.
deps.dev has a similar BigQuery dataset covering a few more languages, if someone wants to do this analysis across the other ecosystems it supports.
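A hedged sketch against that dataset (the deps_dev_v1 dataset and PackageVersions table/column names are taken from its published schema, but treat them as assumptions):

-- Distinct versions per package across the ecosystems deps.dev indexes
SELECT System, Name, COUNT(DISTINCT Version) AS n_versions
FROM `bigquery-public-data.deps_dev_v1.PackageVersions`
GROUP BY System, Name
ORDER BY n_versions DESC
LIMIT 10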