Hacker News

onion2k 14 hours ago

You could probably extract a lot from https://commoncrawl.org/

ks2048 5 hours ago

Here's the latest (although it looks truncated to TLDs with > 1M pages):

https://commoncrawl.github.io/cc-crawl-statistics/plots/tld/...

ccgreg 4 hours ago

The complete list hides in the web graph:

https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main...

and the specific file listing every host we've seen in the latest 3 crawls is:

https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main...
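
For anyone who wants to turn that host file into per-TLD counts, here is a minimal sketch. It assumes each line of the (decompressed) host list is tab-separated with a numeric id followed by the host name in reversed-domain notation (e.g. "com.example.www"); check that against the actual file before relying on it. The function name `count_hosts_per_tld` is made up for illustration and is not part of any Common Crawl tooling.

```python
from collections import Counter
from typing import Iterable


def count_hosts_per_tld(lines: Iterable[str]) -> Counter:
    """Count hosts per top-level domain.

    Assumption: each line looks like "<id><TAB><reversed host>",
    for example "42<TAB>com.example.www". Lines without a tab are skipped.
    """
    counts: Counter = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2 or not parts[1]:
            continue  # blank or malformed line
        # In reversed notation the TLD is the first dot-separated label.
        tld = parts[1].split(".", 1)[0].lower()
        counts[tld] += 1
    return counts


if __name__ == "__main__":
    # Tiny in-memory sample standing in for the real file.
    sample = ["0\tcom.example", "1\tcom.example.www", "2\torg.wikipedia.en"]
    for tld, n in count_hosts_per_tld(sample).most_common():
        print(tld, n)
```

Because it only needs an iterable of lines, you can feed it a file handle opened with `gzip.open(path, "rt")` and stream the file rather than loading it all into memory.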