It seems a lot of people haven't heard of it, but I think it's worth plugging https://perma.cc/, which is really the appropriate tool for something like Wikipedia to use for archiving pages.
It costs money beyond 10 links, which means either a paid subscription or institutional affiliation. This is problematic for an encyclopedia anyone can edit, like Wikipedia.
Wikimedia could pay; they have an endowment of ~$144M [1] (as of June 30, 2024). Perma.cc has Archive.org and Cloudflare as supporting partners, and their mission is aligned with Wikimedia's [2]. It is a natural complementary fit in the preservation ecosystem. For comparison, you have to pay for DOIs too [3] (starting at $275/year and $1/identifier [4] [5]).
All that said, to the best of my knowledge the Internet Archive is already meeting this need without issue.
[1] https://meta.wikimedia.org/wiki/Wikimedia_Endowment
[2] https://perma.cc/about ("Perma.cc was built by Harvard’s Library Innovation Lab and is backed by the power of libraries. We’re both in the forever business: libraries already look after physical and digital materials — now we can do the same for links.")
[3] https://community.crossref.org/t/how-to-get-doi-for-our-jour...
[4] https://www.crossref.org/fees/#annual-membership-fees
[5] https://www.crossref.org/fees/#content-registration-fees
(no affiliation with any entity in scope for this thread)
If the WMF had a dollar for every proposal to spend Endowment-derived funds, their Endowment would double and they could hire one additional grant-writer
If the endowment is invested so that it returns a very conservative 3% a year, it brings in $4.32M a year. Doubling the endowment would add another $4.32M a year, which would pay for quite a few grant writers.
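For anyone who wants to check the arithmetic, here is a quick sketch. The ~$144M figure and the 3% rate come from the comments above; the cost per grant writer is a made-up placeholder for illustration, not a real WMF number.

```python
# Back-of-the-envelope check of the endowment yield figures above.
ENDOWMENT_USD = 144_000_000   # ~$144M, per the thread (as of June 30, 2024)
YIELD_PERCENT = 3             # the "very conservative" rate assumed above

# HYPOTHETICAL placeholder: fully loaded yearly cost of one grant writer.
ASSUMED_COST_PER_WRITER_USD = 150_000


def annual_yield(endowment_usd: int, percent: int) -> int:
    """Yearly return in whole dollars (integer math avoids float noise)."""
    return endowment_usd * percent // 100


current_yield = annual_yield(ENDOWMENT_USD, YIELD_PERCENT)
# Extra yearly income if the endowment doubled.
extra_yield_if_doubled = annual_yield(2 * ENDOWMENT_USD, YIELD_PERCENT) - current_yield
# How many writers that extra income covers under the placeholder cost.
writers_fundable = extra_yield_if_doubled // ASSUMED_COST_PER_WRITER_USD

print(f"current yield:        ${current_yield:,}/year")
print(f"extra if doubled:     ${extra_yield_if_doubled:,}/year")
print(f"writers at that cost: {writers_fundable}")
```

Change the placeholder cost to whatever you think is realistic; the yield figures do not depend on it.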
Does Wikipedia really need to outsource this? They already do basically everything else in-house, even running their own CDN on bare metal. I'm sure they could spin up an archiver that could be implicitly trusted. Bypassing paywalls would be playing with fire, though.
Archive.org is the archiver; rotted links are replaced with Archive.org links by a bot.
https://meta.wikimedia.org/wiki/InternetArchiveBot
https://github.com/internetarchive/internetarchivebot
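To make the "replace a rotted link with an archived copy" idea concrete, here is a minimal sketch using the Wayback Machine availability endpoint. This is not InternetArchiveBot's actual code, which I haven't read; the function names are my own, and the response shape is assumed to be the usual `archived_snapshots.closest` JSON structure.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query_url(dead_url: str, timestamp: Optional[str] = None) -> str:
    """Build the lookup URL; timestamp (YYYYMMDD...) asks for the closest snapshot to that date."""
    params = {"url": dead_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode(params)


def closest_snapshot_url(response: dict) -> Optional[str]:
    """Pull the archived URL out of a parsed availability response, if any."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest.get("url")


def find_replacement(dead_url: str, timestamp: Optional[str] = None) -> Optional[str]:
    """Network call: returns an archived URL to substitute, or None if nothing is archived."""
    with urllib.request.urlopen(availability_query_url(dead_url, timestamp), timeout=30) as resp:
        return closest_snapshot_url(json.load(resp))
```

Passing the date the citation was added as `timestamp` would favor a snapshot close to what the editor actually saw, rather than whatever the page turned into later.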
Yeah, for historical links it makes sense to fall back on IA's existing archives, but going forward Wikipedia could take their own snapshots of cited pages and substitute them in if/when the original rots. It would be more reliable than hoping IA grabbed it.
Not opposed. Wikimedia tech folks are very accessible in my experience; ask them to make a GET or POST to https://web.archive.org/save whenever a link is added via the wiki editing mechanism. Easy peasy. Example CLI tools are https://github.com/palewire/savepagenow and https://github.com/akamhy/waybackpy
A shortcut is to consume the Wikimedia changelog firehose and make these HTTP requests yourself, doing a CDX lookup first to see whether a recent snapshot already exists before issuing a capture request (to be polite to the capture worker queue).
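A rough sketch of the polite "check CDX first, then capture" step, for one URL at a time. Assumptions on my part: the 30-day freshness window is arbitrary, the function names are hypothetical, the CDX query uses `output=json` with a negative `limit` to get the latest capture, and the capture is a plain anonymous GET to the save endpoint (which is rate-limited). Wiring it up to the changelog firehose is left out.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timedelta, timezone
from typing import Optional

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
SAVE_ENDPOINT = "https://web.archive.org/save/"
MAX_AGE = timedelta(days=30)  # ASSUMPTION: what counts as a "recent" snapshot


def cdx_query_url(target_url: str) -> str:
    """CDX lookup asking only for the most recent capture of target_url."""
    params = {"url": target_url, "output": "json",
              "fl": "timestamp,statuscode", "limit": "-1"}
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def latest_snapshot_time(cdx_rows: list) -> Optional[datetime]:
    """Newest capture time from CDX JSON rows (row 0 is the header row)."""
    if len(cdx_rows) < 2:
        return None
    newest = max(row[0] for row in cdx_rows[1:])  # 14-digit timestamps sort lexically
    return datetime.strptime(newest, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def needs_capture(cdx_rows: list, now: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """True when there is no snapshot at all, or the newest one is too old."""
    latest = latest_snapshot_time(cdx_rows)
    return latest is None or now - latest > max_age


def archive_if_stale(target_url: str) -> bool:
    """Network calls: look up CDX, request a capture only if needed. Returns True if requested."""
    with urllib.request.urlopen(cdx_query_url(target_url), timeout=30) as resp:
        body = resp.read().decode("utf-8")
    rows = json.loads(body) if body.strip() else []
    if not needs_capture(rows, datetime.now(timezone.utc)):
        return False
    urllib.request.urlopen(SAVE_ENDPOINT + target_url, timeout=120).close()
    return True
```

In practice you would also want backoff on errors and a small delay between requests; the CLI tools linked above likely handle some of that for you.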
This already happens. Every link added to Wikipedia is automatically archived on the Wayback Machine.
[citation needed]
Ironic, I know. I couldn't find where I originally heard this years ago, but the InternetArchiveBot page linked above says "InternetArchiveBot monitors every Wikimedia wiki for new outgoing links" which is probably referring to what I said.
I didn't know you can just ask IA to grab a page before their crawler gets to it. In that case yeah it would make sense for Wikipedia to ping them automatically.
Why wouldn't Wikipedia just capture and host this themselves? Surely it makes more sense to DIY than to rely on a third party.
Why would they need to own the archive at all? The archive.org infrastructure is built to do this work already. It's outside of WMF's remit to internally archive all of the data it has links to.
Spammers and pirates just got super excited at that plan!
There are various systems in place to defend against them. I recommend against trying it; abusing a public good is poor form and not welcome.
[dead]
The 3 listed alternatives there seem to have nothing to do with digital archiving. Here's a better alternative to g2 that doesn't login-wall you:
https://alternativeto.net/software/freezepage/