A dataset like this is going to have a bunch of personal information in it. When it's distributed like this, how does that jibe with regulations like GDPR? If an HN user would like to delete all their comments, how would that request be forwarded to every user of this dataset?
I support this question. Any comments?
NB: That's easy to downvote without commenting...
The HN API [1] has been around in various forms for years and includes the same public data that's used to generate the public pages on the HN site, but rather than returning HTML pages designed for human consumption, the API returns the data in a JSON-serialized form [2] designed for machine consumption [3].
When the HN API went live, it reduced the overhead and redundant work of every programmer having to independently crawl and parse the site. The HN BigQuery dataset is the same data returned by the HN API; Google just took the next step and did the work of loading it into BigQuery.
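To make the "JSON instead of HTML" point concrete, here is a minimal sketch of reading one item through the API. It assumes the item endpoint and field names (`id`, `type`, `by`, `text`, `deleted`) as described in the API README [1]; the helper names `item_url`, `parse_item`, and `fetch_item` are made up for illustration and are not part of any official client.

```python
import json
from typing import Optional
from urllib.request import urlopen

# Base URL as documented in the HN API README; treat it as an assumption
# and check the README if it has moved.
BASE_URL = "https://hacker-news.firebaseio.com/v0"


def item_url(item_id: int) -> str:
    """Build the URL for a single item (story, comment, job, ...)."""
    return f"{BASE_URL}/item/{item_id}.json"


def parse_item(raw: str) -> Optional[dict]:
    """Parse the JSON body for one item into a small dict.

    Returns None when the body is JSON null (no such item).
    """
    item = json.loads(raw)
    if item is None:
        return None
    return {
        "id": item.get("id"),
        "type": item.get("type"),
        "by": item.get("by"),
        "text": item.get("text"),
        # Deleted items carry a "deleted" flag and usually lack "by"/"text".
        "deleted": bool(item.get("deleted", False)),
    }


def fetch_item(item_id: int, timeout: float = 10.0) -> Optional[dict]:
    """Fetch one item over HTTP and parse it (performs a network call)."""
    with urlopen(item_url(item_id), timeout=timeout) as resp:
        return parse_item(resp.read().decode("utf-8"))
```

Calling `fetch_item(8863)` would request the item the README uses as its example. Note that a copy of the data loaded elsewhere only reflects a deletion if it is refreshed from the API after the deletion happens, which is the crux of the GDPR question above.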
[1] https://github.com/HackerNews/API
[2] https://en.wikipedia.org/wiki/Category:Data_serialization_fo...