>Well, the first problem I had, in order to do something like that, was to find an archive with Hacker News comments. Luckily there was one with apparently everything posted on HN from the start to 2023, for a huge 10GB of total data.
This is actually super easy. The data is available in BigQuery.[0] It's up to date, too. I tried the following query, and the latest comment was from yesterday.
SELECT
  id,
  text,
  `by` AS username,
  FORMAT_TIMESTAMP('%Y-%m-%dT%H:%M:%SZ', TIMESTAMP_SECONDS(time)) AS timestamp
FROM
  `bigquery-public-data.hacker_news.full`
WHERE
  type = 'comment'
  AND EXTRACT(YEAR FROM TIMESTAMP_SECONDS(time)) = 2025
ORDER BY
  time DESC
LIMIT
  100
[0] https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1s...
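If you end up with the raw `time` column (Unix seconds) outside BigQuery, the `FORMAT_TIMESTAMP` expression in the query can be mirrored in a few lines of Python. This is only a sketch; the function name `format_hn_time` is made up for illustration.

```python
from datetime import datetime, timezone


def format_hn_time(unix_seconds: int) -> str:
    """Render Unix seconds as an ISO-8601 UTC string, e.g. 1970-01-01T00:00:00Z.

    Uses the same format string as the FORMAT_TIMESTAMP call in the query.
    """
    moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


def is_from_year(unix_seconds: int, year: int) -> bool:
    """Rough equivalent of EXTRACT(YEAR FROM TIMESTAMP_SECONDS(time)) = year."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).year == year
```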
My favorite, which is also up to date, is the ClickHouse playground.
For example:
https://gh-api.clickhouse.tech/play?user=play#U0VMRUNUICogRl...
I subscribe to this issue to keep up with updates:
https://github.com/ClickHouse/ClickHouse/issues/29693#issuec...
And ofc, for those that don't know, the official API https://github.com/HackerNews/API
I didn't know there was an official API! This explains why the data is so readily available in many sources and formats. That's very cool.
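For reference, a minimal sketch of pulling a single item from the official API with only the Python standard library. It assumes the Firebase-hosted endpoints described in the HackerNews/API README (`/v0/item/<id>.json` and `/v0/maxitem.json`); the helper names are my own.

```python
import json
from typing import Optional
from urllib.request import urlopen

# Base URL as documented in the HackerNews/API README.
BASE_URL = "https://hacker-news.firebaseio.com/v0"


def item_url(item_id: int) -> str:
    """Build the URL for one item (story, comment, job, poll...)."""
    return f"{BASE_URL}/item/{item_id}.json"


def fetch_item(item_id: int, timeout: float = 10.0) -> Optional[dict]:
    """Download and decode one item; the API answers `null` for unknown ids."""
    with urlopen(item_url(item_id), timeout=timeout) as response:
        return json.load(response)


def fetch_max_item_id(timeout: float = 10.0) -> int:
    """Return the largest item id currently assigned."""
    with urlopen(f"{BASE_URL}/maxitem.json", timeout=timeout) as response:
        return json.load(response)
```

Walking ids downward from `fetch_max_item_id()` one request at a time is slow for a full archive, which is probably why the BigQuery and ClickHouse mirrors are the nicer option for bulk analysis.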
With a more straightforward approach, the tool can be reproduced with just a few queries in ClickHouse.
1. Create a table with styles by authors:
2. Calculate and insert style vectors (the insert takes 27 seconds):
3. Find nearest authors (the query takes ~50 ms):…
i can’t believe i’ve been running a script to ingest the data for the last six hours. thank you.
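The ClickHouse queries for the three steps above are not shown in the thread, so here is a rough in-memory Python sketch of the same idea rather than a reconstruction of them. It assumes a "style vector" is the relative frequency of a small fixed list of common words per author, and that "nearest" means smallest cosine distance; both the vocabulary and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical vocabulary; a real run would use far more words.
VOCAB = ["the", "a", "of", "and", "to", "i", "is", "that"]


def style_vector(comments):
    """Step 2: relative frequency of each VOCAB word across an author's comments."""
    counts = Counter()
    total = 0
    for text in comments:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(words)
        total += len(words)
    if total == 0:
        return [0.0] * len(VOCAB)
    return [counts[word] / total for word in VOCAB]


def cosine_distance(u, v):
    """1 - cosine similarity; treated as 1.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def nearest_authors(target, styles, k=3):
    """Step 3: the k authors whose style vectors are closest to the target's."""
    base = styles[target]
    ranked = sorted(
        (cosine_distance(base, vec), name)
        for name, vec in styles.items()
        if name != target
    )
    return [name for _, name in ranked[:k]]


# Step 1: the "table" of styles by author is just a dict here.
comments_by_author = {
    "alice": ["the cat and the dog"],
    "bob": ["the bird and the fish"],
    "carol": ["i think that is to be"],
}
styles = {name: style_vector(texts) for name, texts in comments_by_author.items()}
print(nearest_authors("alice", styles, k=1))
```

This scans every author per lookup, which is fine for a toy dataset; the timings quoted in the thread come from ClickHouse, not from anything like this.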