The effort definitely gets my upvote.

What did you use to crawl the pages, and how long did it take? Curious about your experience doing it and about the crawler's integration with SeekStorm.

Is it possible to easily expand the index with embeddings (vectors) and perform semantic search in parallel?

Your pricing indicates that hosting an index similar to this demo would cost $500/month. Wondering what kind of infrastructure is supporting the demo? Thanks!

ps. Small quirk: https://deephn.org/?q=how+to+be+productive&filter=%7B%22hash...

The first three tags don't seem relevant to the post itself.

Crawling speed is between 100...1000 pages per second. We crawled about 4 million unique linked web pages.

The pricing for an index like DeepHN would be $99/month: while we are indexing 30 million Hacker News posts, for DeepHN we combine a single HN story with all its comments and its linked web page into a single SeekStorm document. So we index just 4 million SeekStorm documents.
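Roughly, the merge step amounts to something like this (a minimal sketch; the field names and the document shape are just an illustration, not the actual SeekStorm schema):

  # Illustrative merge of one HN story, its comments, and the crawled page
  # into a single document. Field names are placeholders, not the SeekStorm schema.
  def build_document(story: dict, comments: list[dict], page_text: str) -> dict:
      return {
          "id": story["id"],
          "title": story.get("title", ""),
          "url": story.get("url", ""),
          "story_text": story.get("text", ""),
          "comments": " ".join(c.get("text", "") for c in comments),
          "linked_page": page_text,
          "score": story.get("score", 0),
          "time": story.get("time", 0),
      }

That is how 30 million posts collapse into roughly 4 million indexed documents, one per story.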

Yes, it would be possible to expand the index with embeddings (vectors) and perform semantic search. This would be an auxiliary step between crawling and indexing.
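As a rough illustration of where that step would sit, assuming a sentence-transformers model and the merged-document shape sketched above (this is not DeepHN's actual pipeline code):

  # Hypothetical auxiliary embedding step between crawling and indexing.
  # Assumes the sentence-transformers package; any embedding model would do.
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("all-MiniLM-L6-v2")

  def add_embeddings(documents: list[dict]) -> list[dict]:
      """Attach a dense vector to each merged document before it is indexed."""
      texts = [d["title"] + " " + d["linked_page"] for d in documents]
      vectors = model.encode(texts, batch_size=64, show_progress_bar=False)
      for doc, vec in zip(documents, vectors):
          doc["embedding"] = vec.tolist()  # stored alongside the keyword fields
      return documents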

Stupid question, but you were crawling news.ycombinator.com, right?

Its robots.txt (https://news.ycombinator.com/robots.txt) contains

  Crawl-delay: 30
Why did you not follow that?

(I'm not being accusatory. I'm both curious about web crawling in general and have personally been archiving the front page and "new" every 60 seconds or so... (Obviously there's no reason for me to retrieve them more often, but my curiosity persists.))
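For reference, the directive is machine-readable; a minimal check using Python's standard-library robots.txt parser:

  # Read the Crawl-delay directive from HN's robots.txt (stdlib only).
  from urllib.robotparser import RobotFileParser

  rp = RobotFileParser("https://news.ycombinator.com/robots.txt")
  rp.read()
  print(rp.crawl_delay("*"))  # -> 30 seconds, per the file quoted above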

>> you were crawling news.ycombinator.com, right?

No, for retrieving the Hacker News posts we were using the public Hacker News API, which returns the posts in JSON format: https://github.com/HackerNews/API
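A minimal example of pulling one item from that API (standard library only; the endpoint and the example id come from the linked documentation):

  # Fetch a single Hacker News item as JSON via the official Firebase-backed API.
  # Endpoint documented at https://github.com/HackerNews/API
  import json
  from urllib.request import urlopen

  def fetch_item(item_id: int) -> dict:
      url = f"https://hacker-news.firebaseio.com/v0/item/{item_id}.json"
      with urlopen(url) as resp:
          return json.load(resp)

  story = fetch_item(8863)             # example id from the API docs
  comment_ids = story.get("kids", [])  # child comments, fetched the same way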

The crawling speed of 100...1000 pages per second refers to crawling the external pages linked from Hacker News posts. As they are from many different domains, we can achieve a high aggregate crawling speed while remaining a polite crawler with a low crawling rate per domain.
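In rough pseudo-form, the per-domain politeness looks something like this (illustrative only, with an assumed per-domain delay; not our actual crawler code):

  # Illustrative per-domain rate limiting: high aggregate throughput across many
  # domains while each individual domain is fetched politely.
  import asyncio
  import time
  from urllib.parse import urlsplit

  PER_DOMAIN_DELAY = 5.0            # seconds between requests to one domain (illustrative)
  last_hit: dict[str, float] = {}   # domain -> timestamp of last request
  locks: dict[str, asyncio.Lock] = {}

  async def polite_fetch(url: str) -> None:
      domain = urlsplit(url).netloc
      lock = locks.setdefault(domain, asyncio.Lock())
      async with lock:              # serialize requests to the same domain
          last = last_hit.get(domain)
          if last is not None:
              wait = PER_DOMAIN_DELAY - (time.monotonic() - last)
              if wait > 0:
                  await asyncio.sleep(wait)
          last_hit[domain] = time.monotonic()
      # ... the actual HTTP request for `url` would go here ...

  async def crawl(urls: list[str]) -> None:
      # Thousands of different domains can be in flight at once,
      # so the aggregate rate stays high despite the per-domain delay.
      await asyncio.gather(*(polite_fetch(u) for u in urls))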