By the way, if you install the official Web Archive browser extension, you can configure it to automatically archive every page you visit
This is a good suggestion, with the caveat that entire domains can and do disappear: https://help.archive.org/help/how-do-i-request-to-remove-som...
That's especially annoying when a formerly useful site gets abandoned, a new owner picks up the domain, and then gets IA to delete the old archives as well.
Or even worse, when a domain parking company does that: https://archive.org/post/423432/domainsponsorcom-erasing-pri...
Recently I've come to believe that even IA, and especially archive.is, are ephemeral. I've watched sites I've saved disappear without a trace, except in my self-hosted archives.
A technological conundrum, however, is the fact that I have no way to prove that my archive is an accurate representation of a site at a point in time. Hmmm, or maybe I do? Maybe something funky with cert chains could be done.
There are timestamping services out there, some of which may be free. It should (I think) be possible to basically submit the target site's URL to the timestamping service, and get back a certificate saying "I, Timestamps-R-US, assert that the contents of https://targetsite.com/foo/bar downloaded at 12:34pm on 29/5/2025 hashes to abc12345 with SHA-1", signed with their private key and verifiable (by anyone) with their public key. Then you download the same URL, and check that the hashes match.
IIUC the timestamping service needs to independently download the contents itself in order to hash it, so if you need to be logged in to see the content there might be complications, and if there's a lot of content they'll probably want to charge you.
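To make the shape of the check concrete, here's a rough Python sketch of the client side (using SHA-256 rather than SHA-1). Everything about the service is made up: the timestamps-r-us.example endpoint, the JSON response, and the Ed25519 signature scheme are placeholders. A real protocol like RFC 3161 has its own formats, and an RFC 3161 TSA typically timestamps a hash you submit rather than fetching the URL itself.

    import hashlib
    import json
    import urllib.parse
    import urllib.request

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

    TARGET = "https://targetsite.com/foo/bar"
    SERVICE_PUBKEY_HEX = "..."  # placeholder: the service's published verification key

    # Hypothetical request: ask the timestamping service to fetch and attest the URL.
    attest_url = "https://timestamps-r-us.example/attest?url=" + urllib.parse.quote(TARGET, safe="")
    with urllib.request.urlopen(attest_url) as resp:
        att = json.load(resp)  # e.g. {"url": ..., "sha256": ..., "time": ..., "sig": ...}

    # Download the same URL yourself and hash what you got.
    with urllib.request.urlopen(TARGET) as resp:
        local_hash = hashlib.sha256(resp.read()).hexdigest()

    # 1) Your hash must match the one the service attested to...
    assert local_hash == att["sha256"], "content differs from what the service saw"

    # 2) ...and the service's signature over (url, hash, time) must verify.
    signed = f'{att["url"]}|{att["sha256"]}|{att["time"]}'.encode()
    pubkey = Ed25519PublicKey.from_public_bytes(bytes.fromhex(SERVICE_PUBKEY_HEX))
    try:
        pubkey.verify(bytes.fromhex(att["sig"]), signed)
        print("attested at", att["time"], "hash", local_hash)
    except InvalidSignature:
        print("signature check failed")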
Websites don't really produce consistent content even from identical requests though.
But you also don't need to do this: all you need is a service which will attest that it saw a particular hashsum at a particular time. It's up to other mechanisms to prove what that means.
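If the service only ever sees the hash, the client side shrinks to something like this sketch (the endpoint, response format, and local file path are made up):

    import hashlib
    import json
    import urllib.request

    # Hash the copy you already archived locally; the service never sees the content.
    digest = hashlib.sha256(open("archive/foo_bar.html", "rb").read()).hexdigest()

    # Hypothetical attestation endpoint: it just signs (hash, time) and returns
    # a receipt you store alongside the archive.
    req = urllib.request.Request(
        "https://attest.example/stamp",
        data=json.dumps({"sha256": digest}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        receipt = json.load(resp)  # e.g. {"sha256": ..., "time": ..., "sig": ...}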
> Websites don't really produce consistent content even from identical requests though.
Often true in practice unfortunately, but to the extent that it is true, any approach that tries to use hashes to prove things to a third party is sunk. (We could imagine a timestamping service that allows some kind of post-download "normalisation" step to strip out content that varies between queries and then hash the results of that, but that doesn't seem practical to offer as a free service.)
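For illustration, a hypothetical normalisation step might look like the sketch below. The rules are invented, and the scheme only proves anything if every verifier applies exactly the same, versioned rules:

    import hashlib
    import re

    def normalise(html: str) -> str:
        # Example rules only: strip script tags, drop CSRF-style hidden inputs,
        # and collapse whitespace before hashing.
        html = re.sub(r"<script\b.*?</script>", "", html, flags=re.S | re.I)
        html = re.sub(r'<input[^>]*name="csrf[^>]*>', "", html, flags=re.I)
        return re.sub(r"\s+", " ", html).strip()

    def normalised_hash(html: str) -> str:
        return hashlib.sha256(normalise(html).encode()).hexdigest()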
> all you need is a service which will attest that it saw a particular hashsum at a particular time
Isn't that what I'm proposing?
sign it with gpg and upload the sig to bitcoin
edit: sorry, that would only prove when it was taken, not that it wasn’t fabricated.
hash the contents
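For what it's worth, a minimal sketch of hash-then-sign with python-gnupg (assumes a local GnuPG install and signing key; file names are placeholders), which, per the edit above, still only ties a hash to a key and a time:

    import hashlib
    import gnupg  # python-gnupg

    content = open("archive/foo_bar.html", "rb").read()
    digest = hashlib.sha256(content).hexdigest()

    gpg = gnupg.GPG()
    # Detached signature over the hash; this (plus the hash) is what you'd
    # publish or anchor somewhere timestamped.
    sig = gpg.sign(digest, detach=True)
    open("foo_bar.sha256.asc", "w").write(str(sig))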
Signing it is effectively the same thing. The question is how to prove that what you hashed is what was actually there.
You can't: unless someone else also has a copy, your hash cannot be verified (since both the hash and the claim come from you).
One way to make this work is to have a mechanism like bitcoin (proof of work), where the proof of work is put into the webpage itself as a hash (made by the original author of that page). Then anyone can verify that the contents weren't changed, and if someone wants to make changes and claim otherwise, they'd have to put in even more proof of work to do it (so not impossible, but costly).
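As a toy sketch of that idea (nothing like this is actually deployed; the difficulty target and the convention of publishing the nonce alongside the page are invented here):

    import hashlib
    from itertools import count

    DIFFICULTY = "0000"  # required hash prefix; more zeros means more work

    def find_nonce(canonical_content: bytes) -> int:
        # The author burns CPU to find a nonce; the nonce is then published in
        # the page, outside the canonical content that gets hashed.
        for nonce in count():
            h = hashlib.sha256(canonical_content + str(nonce).encode()).hexdigest()
            if h.startswith(DIFFICULTY):
                return nonce

    def verify(canonical_content: bytes, nonce: int) -> bool:
        # Anyone can check cheaply; forging a modified page means redoing the
        # work for the new content.
        h = hashlib.sha256(canonical_content + str(nonce).encode()).hexdigest()
        return h.startswith(DIFFICULTY)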
I think there was a way to preserve TLS handshake information such that you can verify you got the exact response from a particular server? I can't look it up right now, but I think there was even a Firefox add-on.
What if, instead of the proof of work being in the page as a hash, the distributed proof of work is that some subset of nodes downloads a particular bit of HTML or JSON from a particular URI, and each node then hashes it and saves the contents and the hash to a blockchain-esque distributed database? Subject to a 51% attack, same as any other chain, but still.
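Per node it might look roughly like the toy sketch below; consensus, node selection, and the 51% problem are all hand-waved away:

    import hashlib
    import json
    import time
    import urllib.request

    def fetch_and_hash(url: str) -> str:
        with urllib.request.urlopen(url) as resp:
            return hashlib.sha256(resp.read()).hexdigest()

    def make_block(prev_block_hash: str, url: str) -> dict:
        # Each participating node fetches the URL independently, hashes what it
        # got, and appends a block; agreement between nodes is what gives the
        # record its weight.
        block = {
            "prev": prev_block_hash,
            "url": url,
            "content_sha256": fetch_and_hash(url),
            "time": int(time.time()),
        }
        block["block_hash"] = hashlib.sha256(
            json.dumps(block, sort_keys=True).encode()
        ).hexdigest()
        return block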
> you can configure it to automatically archive every page you visit
What?? I am a heavy user of the Internet Archive services, not just the Wayback Machine, including official and "unofficial" clients and endpoints, and I had absolutely no idea the extension could do this.
To bulk archive, I would do it manually via the web interface or automate it in batches. The limitations of doing it manually one by one are obvious, and doing it in batches requires, well, keeping batches (lists).
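For the batch case, the simplest automation I know of is hitting the public Save Page Now endpoint for each URL in a list, roughly as sketched below. Rate limits apply, and I believe the authenticated Save Page Now 2 API is the better option for big jobs:

    import time
    import urllib.request

    def archive(url: str) -> None:
        # Requesting /save/<url> asks the Wayback Machine to capture that page.
        with urllib.request.urlopen("https://web.archive.org/save/" + url) as resp:
            print(resp.status, url)

    for line in open("urls.txt"):
        archive(line.strip())
        time.sleep(10)  # be polite; the service throttles rapid requests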