Hacker News

Self hosting a web search engine is probably quite a feat

It's actually not that hard now, once you get useful content. When I worked on Search (~2009ish), the primary index was called 4BBase, because it was the top 4 billion webpages (actually more like 5.5B during my time, but it had been around for a few years). A typical webpage is about 100K, and HTML compresses at 80-90% compression rates, so you're looking at 10-20K/page. The index would take about 50-100 TB.

Even after the recent AI run-up, disk prices are about $20/TB for a 20TB drive, so you can store this index on 3-5 hard disks that will cost you about $1200-2000. For self-hosted use you don't need to serve results in 50ms, so you don't need to put the whole thing in RAM like Google did; you can serve off of disk.
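For anyone who wants to rerun the back-of-envelope math with their own numbers, here is a rough sketch. The function names are made up for illustration, and it assumes decimal units (1 TB = 10^9 KB), no replication, and no overhead for the index structures themselves, so treat the output as a floor rather than a sizing guide.

```python
import math

KB_PER_TB = 1_000_000_000  # decimal units: 1 TB = 10^9 KB


def compressed_size_tb(pages, kb_per_page):
    """Total size in TB for `pages` documents at `kb_per_page` KB each."""
    return pages * kb_per_page / KB_PER_TB


def disks_needed(size_tb, disk_tb=20):
    """Number of whole disks needed to hold `size_tb` (no redundancy)."""
    return math.ceil(size_tb / disk_tb)


def disk_cost_usd(n_disks, disk_tb=20, usd_per_tb=20):
    """Purchase cost for `n_disks` drives at a flat price per TB."""
    return n_disks * disk_tb * usd_per_tb


if __name__ == "__main__":
    # 10-20 KB/page is the compressed estimate from the comment above.
    for pages in (4_000_000_000, 5_500_000_000):
        low = compressed_size_tb(pages, 10)
        high = compressed_size_tb(pages, 20)
        print(f"{pages:,} pages: {low:.0f}-{high:.0f} TB")

    # The 50-100 TB range quoted above, on 20 TB drives at $20/TB.
    for size_tb in (50, 100):
        n = disks_needed(size_tb)
        print(f"{size_tb} TB: {n} disks, ${disk_cost_usd(n)}")
```

Plugging in 4 billion and 5.5 billion pages brackets the 50-100 TB figure, and the disk count and cost follow directly from the $20/TB price.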

ElasticSearch uses basically the same data structures and gives you the same infrastructure that Google's ~late-00s search stack did, and is actually more advanced in some respects (like ad-hoc queries, debuggability, and updateability), so software isn't much of an issue.

The big part missing that can't really be replicated today is the huge web of authentic hyperlinks. The reason Google was so good at search was because many humans effectively "tagged" a given webpage with a series of short, descriptive words and phrases. When they went to search for a page, Google could mine this huge treasure trove of backlinks to identify exactly what the page was good for, even if those search terms never appeared on the page. SEO and link farms kinda killed this, as did the rise of social media walled gardens, and so the Google of 2009 basically wouldn't work today anyway. Maybe if you pulled old versions of Common Crawl or archive.org you could reconstruct it, but the relevant pages are often offline anyway today.
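To make the "tagging" idea concrete, here is a toy sketch of an anchor-text index. This is not how Google's stack worked internally; it is just the smallest thing I can write that shows a page being found by words that only appear in links pointing at it. The URLs, function names, and scoring (one point per matching link) are all invented for the example.

```python
from collections import Counter, defaultdict


def build_anchor_index(links):
    """Map each anchor-text term to the pages it was used to link to.

    `links` is an iterable of (source_url, target_url, anchor_text).
    """
    index = defaultdict(Counter)
    for _source, target, anchor in links:
        for term in anchor.lower().split():
            index[term][target] += 1  # each inbound link is one "vote"
    return index


def search(index, query):
    """Rank target pages by how many anchor-term votes match the query."""
    scores = Counter()
    for term in query.lower().split():
        scores.update(index.get(term, {}))
    return [url for url, _count in scores.most_common()]


if __name__ == "__main__":
    # Hypothetical crawl output: the target page's own text is never read.
    links = [
        ("https://a.example/blog", "https://target.example/", "cheap flights"),
        ("https://b.example/list", "https://target.example/", "flight deals"),
        ("https://c.example/", "https://other.example/", "cheap hotels"),
    ]
    index = build_anchor_index(links)
    print(search(index, "cheap flights"))
```

Counting every link as a vote is exactly what link farms exploit: anyone who can mint pages can mint votes, which is why this signal degraded.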

opengrass 6 hours ago

If an ex-Googler compares ElasticSearch to the old company's stack, then it must be something good.

BrunoBernardino 10 hours ago

You can self-host Marginalia [1] or Hister [2], for example. Takes up some space, but it's totally doable. Your biggest problem (assuming you have disk space) will be crawling.

[1] : https://github.com/MarginaliaSearch/MarginaliaSearch

[2] : https://github.com/asciimoo/hister

marginalia_nu 9 hours ago

Emphasis on "doable".

At least if we're talking about a more generalist web search, it requires dedicated hardware that's pretty costly. Marginalia's production server cost about $20k back when RAM and SSDs were cheap. It used to run on $5k of PC hardware before, but that was very limiting.

So no data center, but at the same time, not everyone has that sort of cash to throw around.

hootz 11 hours ago

I believe it is a thing. I saw it somewhere, something like a peer-to-peer search engine.