Hacker News

It seems a lot of people haven't heard of it, but I think it's worth plugging https://perma.cc/, which is really the appropriate tool for something like Wikipedia to use to archive pages.

More: https://en.wikipedia.org/wiki/Perma.cc

ronsor 4 hours ago

It costs money beyond 10 links, which means either a paid subscription or institutional affiliation. This is problematic for an encyclopedia anyone can edit, like Wikipedia.

toomuchtodo 4 hours ago

Wikimedia could pay, they have an endowment of ~$144M [1] (as of June 30, 2024). Perma.cc has Archive.org and Cloudflare as supporting partners, and their mission is aligned with Wikimedia [2]. It is a natural complementary fit in the preservation ecosystem. You have to pay for DOIs too, for comparison [3] (starting at $275/year and $1/identifier [4] [5]).

With all of this context shared, the Internet Archive is likely meeting this need without issue, to the best of my knowledge.

[1] https://meta.wikimedia.org/wiki/Wikimedia_Endowment

[2] https://perma.cc/about ("Perma.cc was built by Harvard’s Library Innovation Lab and is backed by the power of libraries. We’re both in the forever business: libraries already look after physical and digital materials — now we can do the same for links.")

[3] https://community.crossref.org/t/how-to-get-doi-for-our-jour...

[4] https://www.crossref.org/fees/#annual-membership-fees

[5] https://www.crossref.org/fees/#content-registration-fees

(no affiliation with any entity in scope for this thread)

RupertSalt 3 hours ago

If the WMF had a dollar for every proposal to spend Endowment-derived funds, their Endowment would double and they could hire one additional grant-writer.

nine_k 3 hours ago

If the endowment is invested so that it returns a very conservative 3% a year, that is $4.32M a year. Doubling that would pay for quite a few grant writers.
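
For anyone who wants to check the arithmetic, here is a quick sketch in Python. It assumes a flat one-year return with no compounding, fees, or inflation; the function name `annual_yield` is made up for illustration, and the $144M figure is the one quoted upthread.

```python
ENDOWMENT_USD = 144_000_000  # ~$144M, the figure quoted upthread
RETURN_PERCENT = 3           # the "very conservative" annual return assumed above


def annual_yield(principal_usd, percent):
    """Simple one-year return in whole dollars (no compounding, fees, or inflation)."""
    return principal_usd * percent // 100


print(annual_yield(ENDOWMENT_USD, RETURN_PERCENT))      # 4320000
print(annual_yield(2 * ENDOWMENT_USD, RETURN_PERCENT))  # 8640000
```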

jsheard 4 hours ago

Does Wikipedia really need to outsource this? They already do basically everything else in-house, even running their own CDN on bare metal; I'm sure they could spin up an archiver which could be implicitly trusted. Bypassing paywalls would be playing with fire though.

toomuchtodo 4 hours ago

Archive.org is the archiver; a bot replaces rotted links with Archive.org links.

https://meta.wikimedia.org/wiki/InternetArchiveBot

https://github.com/internetarchive/internetarchivebot
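
To make the "replace a rotted link with an archived one" step concrete, here is a minimal sketch in Python. It is not InternetArchiveBot's actual implementation; it only assumes the public Wayback Availability API at https://archive.org/wayback/available and the JSON shape that API is documented to return (`archived_snapshots` / `closest`). The function names are hypothetical, and the HTTP call itself is left out so the parsing logic can be read on its own.

```python
import json
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query_url(dead_url, timestamp=None):
    """Build the lookup URL; an optional timestamp (YYYYMMDD...) asks for
    the capture closest to that date, e.g. when the citation was added."""
    params = {"url": dead_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def replacement_link(response_body):
    """Return the closest archived snapshot URL from a response body,
    or None when the API reports no usable capture."""
    data = json.loads(response_body)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None
```

A caller would fetch `availability_query_url(...)` with any HTTP client, pass the body to `replacement_link`, and only swap the citation when the result is not None.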

jsheard 4 hours ago

Yeah, for historical links it makes sense to fall back on IA's existing archives, but going forward Wikipedia could take their own snapshots of cited pages and substitute them in if/when the original rots. It would be more reliable than hoping IA grabbed it.

toomuchtodo 4 hours ago

Not opposed. Wikimedia tech folks are very accessible in my experience; ask them to make a GET or POST to https://web.archive.org/save whenever a link is added via the wiki editing mechanism. Easy peasy. Example CLI tools are https://github.com/palewire/savepagenow and https://github.com/akamhy/waybackpy

The shortcut is to consume the Wikimedia changelog firehose and make these HTTP requests yourself, performing a CDX lookup first to see if a recent snapshot was already taken before issuing a capture request (to be polite to the capture worker queue).
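
A rough sketch of that "CDX lookup, then capture only if stale" flow, in Python with only the standard library. Assumptions, all mine: the CDX query parameters (`output=json`, `fl=timestamp`, `limit=-1` for the newest capture), the `GET https://web.archive.org/save/<url>` form of the capture request, the 30-day freshness threshold, and every function name below. Consuming the changelog firehose, rate limiting, and retries are left out.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
SAVE_ENDPOINT = "https://web.archive.org/save/"


def cdx_query_url(target_url):
    """Build a CDX query asking only for the timestamp of the newest capture."""
    params = {"url": target_url, "output": "json", "fl": "timestamp", "limit": "-1"}
    return CDX_ENDPOINT + "?" + urlencode(params)


def latest_capture(cdx_rows):
    """Parse CDX JSON rows (header row first) into the newest capture time, or None."""
    timestamps = [row[0] for row in cdx_rows[1:]]
    if not timestamps:
        return None
    # 14-digit timestamps sort chronologically as strings.
    return datetime.strptime(max(timestamps), "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def needs_capture(cdx_rows, now, max_age=timedelta(days=30)):
    """True when there is no capture at all or the newest one is older than max_age."""
    latest = latest_capture(cdx_rows)
    return latest is None or now - latest > max_age


def http_get(url):
    """Default fetcher: plain GET returning the response body as text."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")


def archive_if_stale(target_url, fetch=http_get, now=None, max_age=timedelta(days=30)):
    """Check CDX first; request a capture only when needed. Returns True if one was requested."""
    now = now or datetime.now(timezone.utc)
    body = fetch(cdx_query_url(target_url))
    rows = json.loads(body) if body.strip() else []
    if not needs_capture(rows, now, max_age):
        return False  # a recent snapshot exists, so stay off the capture queue
    fetch(SAVE_ENDPOINT + target_url)
    return True
```

The `fetch` parameter exists so the decision logic can be exercised with a stub instead of real network calls; in practice you would call `archive_if_stale(url)` for each newly added link.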

Gander5739 3 hours ago

This already happens. Every link added to Wikipedia is automatically archived on the Wayback Machine.

RupertSalt 2 hours ago

[citation needed]

Gander5739 2 hours ago

Ironic, I know. I couldn't find where I originally heard this years ago, but the InternetArchiveBot page linked above says "InternetArchiveBot monitors every Wikimedia wiki for new outgoing links" which is probably referring to what I said.

jsheard 4 hours ago

I didn't know you can just ask IA to grab a page before their crawler gets to it. In that case yeah it would make sense for Wikipedia to ping them automatically.

ferngodfather 4 hours ago

Why wouldn't Wikipedia just capture and host this themselves? Surely it makes more sense to DIY than to rely on a third party.

huslage 3 hours ago

Why would they need to own the archive at all? The archive.org infrastructure is built to do this work already. It's outside of WMF's remit to internally archive all of the data it has links to.

RupertSalt 4 hours ago

Spammers and pirates just got super excited at that plan!

toomuchtodo 4 hours ago

There are various systems in place to defend against them. I recommend against this; poor form against a public good is not welcome.
