A dataset like this is going to have a bunch of personal information in it. When it’s distributed like this, how does that jibe with regulations like GDPR? If an HN user would like to delete all their comments, how would that request be forwarded to every user of this dataset?

I support this question. Any comments?

NB: It's easy to downvote without commenting...

The Internet is written in ink. You should assume that any and all public posts you make have already been replicated and archived by countless parties in countless ways by the time you hit delete. HN public postings are no different.

The HN API [1] has been around in various forms for years and includes the same public data that's used to generate the public pages on the HN site, but rather than returning HTML pages designed for human consumption, the API returns the data in a JSON serialized form [2] designed for machine consumption [3].
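To make the "JSON for machines" point concrete, here's a rough sketch of reading one item. The base URL and the field names (`id`, `by`, `type`, `deleted`) are the ones listed in the API README [1]; the helper names and the sample payload are made up for illustration, so treat this as a sketch rather than a reference client.

```python
import json
from urllib.request import urlopen

# Base URL as given in the HackerNews/API README [1].
BASE_URL = "https://hacker-news.firebaseio.com/v0"


def item_url(item_id):
    """Build the URL for a single item (story, comment, etc.)."""
    return f"{BASE_URL}/item/{item_id}.json"


def summarize_item(raw_json):
    """Pull a few fields out of an item's JSON payload.

    'by' and 'deleted' are optional in the payload, so use .get().
    """
    item = json.loads(raw_json)
    return {
        "id": item["id"],
        "author": item.get("by"),
        "type": item.get("type"),
        "deleted": item.get("deleted", False),
    }


def fetch_item(item_id):
    """Fetch one item over HTTP and summarize it (needs network access)."""
    with urlopen(item_url(item_id)) as resp:
        return summarize_item(resp.read().decode("utf-8"))


# Hypothetical payload for illustration only, not a real HN item.
SAMPLE = '{"id": 1, "by": "example_user", "type": "comment", "text": "hello", "time": 0}'

if __name__ == "__main__":
    print(item_url(1))
    print(summarize_item(SAMPLE))
```

Anyone who ran something like `fetch_item` before a comment was deleted already has their own copy, which is the point about ink above.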

When the HN API went live, it reduced the overhead and redundant work of every programmer having to independently crawl and parse the site. The HN BigQuery dataset is the same data returned by the HN API; Google just took the next step and did the work of loading it into BigQuery.

[1] https://github.com/HackerNews/API

[2] https://en.wikipedia.org/wiki/Category:Data_serialization_fo...

[3] https://en.wikipedia.org/wiki/Machine_to_machine