>Well, the first problem I had, in order to do something like that, was to find an archive with Hacker News comments. Luckily there was one with apparently everything posted on HN from the start to 2023, for a huge 10GB of total data.

This is actually super easy. The data is available in BigQuery.[0] It's up to date, too. I tried the following query, and the latest comment was from yesterday.

    SELECT 
      id,
      text,
      `by` AS username,
      FORMAT_TIMESTAMP('%Y-%m-%dT%H:%M:%SZ', TIMESTAMP_SECONDS(time)) AS timestamp
    FROM 
      `bigquery-public-data.hacker_news.full`
    WHERE 
      type = 'comment'
      AND EXTRACT(YEAR FROM TIMESTAMP_SECONDS(time)) = 2025
    ORDER BY 
      time DESC
    LIMIT 
      100

https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1s...

My favorite which is also up to date is the ClickHouse playground.

For example:

  SELECT * FROM hackernews_history ORDER BY time DESC LIMIT 10;
https://gh-api.clickhouse.tech/play?user=play#U0VMRUNUICogRl...

I subscribe to this issue to keep up with updates:

https://github.com/ClickHouse/ClickHouse/issues/29693#issuec...

And ofc, for those that don't know, the official API https://github.com/HackerNews/API

I didn't know there was an official API! This explains why the data is so readily available in many sources and formats. That's very cool.

With a more straightforward approach, the tool can be reproduced with just a few queries in ClickHouse.

1. Create a table with styles by authors:

    CREATE TABLE hn_styles (name String, vec Array(UInt32)) ENGINE = MergeTree ORDER BY name
2. Calculate and insert style vectors (the insert takes 27 seconds):

    INSERT INTO hn_styles WITH 128 AS vec_size,
    cityHash64(arrayJoin(tokens(lower(decodeHTMLComponent(extractTextFromHTML(text)))))) % vec_size AS n,
    arrayMap((x, i) -> i = n, range(vec_size), range(vec_size)) AS arr
    SELECT by, sumForEach(arr) FROM hackernews_history GROUP BY by
3. Find nearest authors (the query takes ~50 ms):

    SELECT name FROM hn_styles ORDER BY cosineDistance(vec, (SELECT vec FROM hn_styles WHERE name = 'antirez')) LIMIT 25

        ┌─name────────────┬─────────────────dist─┐
     1. │ antirez         │                    0 │
     2. │ geertj          │ 0.009644324175144714 │
     3. │ mrighele        │ 0.009742538810774581 │
     4. │ LukaAl          │ 0.009787061201638525 │
     5. │ adrianratnapala │ 0.010093164015005152 │
     6. │ prmph           │ 0.010097599441156513 │
     7. │ teilo           │ 0.010187607877663263 │
     8. │ lukesandberg    │  0.01035981357655602 │
     9. │ joshuak         │ 0.010421492503861374 │
    10. │ sharikous       │  0.01043547391491162 │
    11. │ lll-o-lll       │  0.01051205287096002 │
    12. │ enriquto        │ 0.010534816136353875 │
    13. │ rileymat2       │ 0.010591026237771195 │
    14. │ afiori          │ 0.010655186410089112 │
    15. │ 314             │ 0.010768594792569197 │
    16. │ superice        │ 0.010842615688153812 │
    17. │ cm2187          │  0.01105111720031593 │
    18. │ jorgeleo        │ 0.011159407590845771 │
    19. │ einhverfr       │ 0.011296755160620009 │
    20. │ goodcanadian    │ 0.011316316959489647 │
    21. │ harperlee       │ 0.011317367800365297 │
    22. │ seren           │ 0.011390119122640763 │
    23. │ abnry           │ 0.011394133096140235 │
    24. │ PraetorianGourd │ 0.011508457949426343 │
    25. │ ufo             │ 0.011538721312575051 │
        └─────────────────┴──────────────────────┘

…i can’t believe i’ve been running a script to ingest the data for the last six hours. thank you.