storage is cheap, but if you wanted to improve this:
1. find a way to dedup media
3. make sure content blockers are actually working, so ads and trackers don't end up in the archive
4. for news articles, put the page through readability and store the markdown instead. if you wanted to be really fancy, you could attempt to programmatically create a "template" for sites where you've visited multiple pages, so the style is stored once and only the content is stored per page. alternatively a good compression algo could do this for you, if your directory looked like /home/andrew/archive/boehs.org.tar.gz and all the boehs.org pages you visited were saved inside that tar
4. add fts and embeddings over the pages
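
A rough sketch of the per-site archive idea from item 3, to get a feel for how much it could save. It compares gzipping each page on its own against putting all pages of one site into a single tar.gz. The pages are synthetic (one shared "template" plus a small unique body), and the names `make_pages`, `per_page_gzip_size` and `site_tar_gz_size` are made up for illustration, so treat the numbers as a toy result, not a measurement of real sites.

```python
import gzip
import io
import random
import tarfile


def make_pages(n_pages: int = 20, seed: int = 0) -> dict[str, bytes]:
    """Build fake pages that share one site template and differ in a short body."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    words = ["".join(rng.choices(letters, k=rng.randint(3, 9))) for _ in range(400)]
    template = " ".join(words).encode()  # stands in for shared HTML/CSS
    pages = {}
    for i in range(n_pages):
        body = " ".join(rng.choices(words, k=30)).encode()  # per-page content
        pages[f"page{i}.html"] = template + b"\n" + body
    return pages


def per_page_gzip_size(pages: dict[str, bytes]) -> int:
    """Total size if every page is gzipped separately."""
    return sum(len(gzip.compress(data)) for data in pages.values())


def site_tar_gz_size(pages: dict[str, bytes]) -> int:
    """Size of one tar.gz holding all pages, built in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in pages.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())


if __name__ == "__main__":
    pages = make_pages()
    print("gzip per page:", per_page_gzip_size(pages), "bytes")
    print("one tar.gz:   ", site_tar_gz_size(pages), "bytes")
```

One caveat: gzip only looks back 32 KiB, so this works here because the fake pages are small. For real pages a compressor with a longer window (xz or zstd, for example) would probably be needed to pick up the repeated template.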
1 and partly 3 - I use btrfs with compression and deduping for games and other stuff. Works really well and is "invisible" to you.
dedup on btrfs requires setting up a cron job, and you also need to pick one of the dedup tools. It's not completely invisible in my mind because of this ;)
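
For anyone not on btrfs, whole-file dedup of saved media can be approximated in userspace. The sketch below hashes every file under a directory and replaces byte-identical copies with hard links. This is not how btrfs or its dedup tools work internally (they operate on extents, not whole files); `file_digest` and `dedup_tree` are hypothetical names, and it assumes the archive lives on one filesystem that supports hard links.

```python
import hashlib
import os
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large media is not loaded at once."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dedup_tree(root: Path) -> int:
    """Hard-link byte-identical files under root; return the bytes reclaimed."""
    seen: dict[str, Path] = {}
    saved = 0
    # Collect the file list first so replacing files does not disturb the walk.
    files = sorted(p for p in root.rglob("*") if p.is_file() and not p.is_symlink())
    for path in files:
        digest = file_digest(path)
        original = seen.get(digest)
        if original is None:
            seen[digest] = path
            continue
        if os.path.samefile(original, path):
            continue  # already linked by an earlier run
        saved += path.stat().st_size
        # Link under a temporary name, then swap it in atomically.
        tmp = path.with_name(path.name + ".dedup-tmp")
        os.link(original, tmp)
        os.replace(tmp, path)
    return saved
```

Hard links mean that editing one copy edits all of them, which is fine for a write-once archive but worth knowing. Like the btrfs tools, it only helps if something runs it regularly, e.g. `dedup_tree(Path("/home/andrew/archive"))` from a scheduled job.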
>storage is cheap
It is. 1.1TB is both:
- objectively an incredibly huge amount of information
- something that can be stored for the cost of less than a day of this industry's work
Half my reluctance to store big files is just an irrational fear of the effort of managing it.
> - something that can be stored for the cost of less than a day of this industry's work
Far, far less even. You can grab a 1TB external SSD from a reputable brand for less than a day's work at minimum wage in the UK.
I keep getting surprised at just how cheap large storage is every time I need to update stuff.