What does HackerNews think of browsertrix-crawler?

Run a high-fidelity browser-based crawler in a single Docker container

Language: JavaScript

(Disclaimer: I work at Webrecorder)

Our automated crawler browsertrix-crawler (https://github.com/webrecorder/browsertrix-crawler) uses Puppeteer to drive the browsers we archive in: it loads pages, runs behaviors such as auto-scroll, and records the request/response traffic in the WARC format (in Webrecorder tools, the WARCs are then packaged by default into a portable WACZ file: https://specs.webrecorder.net/wacz/1.1.1/). We have custom behaviors for some social media and video sites to make sure their content is captured appropriately. It is a bit of a cat-and-mouse game, as we have to keep updating these behaviors as sites change, but for the most part it works pretty well. The crawler also has job queuing functionality, supports multiple workers/browsers, and is highly configurable with timeouts, page limits, etc.
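
For reference, a basic crawl is a single docker run command, along these lines (a sketch based on the project README; exact flag names, e.g. --limit vs. --pageLimit, vary between crawler versions):

    docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
      --url https://example.com/ \
      --generateWACZ \
      --collection example-crawl \
      --workers 2 \
      --behaviors autoscroll,autoplay,autofetch,siteSpecific \
      --timeout 90 \
      --limit 100

Output (the WARCs and, with --generateWACZ, the packaged WACZ) should land under ./crawls/collections/example-crawl/ on the host.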

The trickier part is replaying the archived websites, as a certain amount of rewriting has to happen to make sure the HTML and JS work against archived assets rather than the live web. One implementation of this is replayweb.page (https://github.com/webrecorder/replayweb.page), which does all of the rewriting client-side in the browser. This lets you interact with archived websites in WARC or WACZ format as if you were interacting with the original site. replayweb.page can run locally in your browser without sending any data to a server, or it can be hosted, including in an embedded mode.
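
The embedded mode is a web component, roughly like this (a sketch based on the replayweb.page embedding docs; the WACZ path here is hypothetical, and the embedding page must also serve the sw.js service worker, by default from a ./replay/ directory next to it):

    <!-- load the replayweb.page UI bundle -->
    <script src="https://cdn.jsdelivr.net/npm/replaywebpage/ui.js"></script>

    <!-- render the archive at /archives/example.wacz, starting at the given URL -->
    <replay-web-page
      source="/archives/example.wacz"
      url="https://example.com/">
    </replay-web-page>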

I use browsertrix-crawler[0] for crawling, and it does well on JS-heavy sites since it uses a real browser to request pages. It even has options to load browser profiles so you can crawl while authenticated on sites (see the sketch after the link below).

[0] https://github.com/webrecorder/browsertrix-crawler
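
Roughly, that workflow is two commands (a sketch based on the README's profile docs; the create-login-profile command and --profile flag are documented there, but exact paths and defaults may differ by version):

    # open an interactive browser, log in, and save the profile
    docker run -v $PWD/crawls/profiles:/crawls/profiles/ -it \
      webrecorder/browsertrix-crawler create-login-profile \
      --url https://example.com/login

    # crawl with the saved, authenticated profile loaded
    docker run -v $PWD/crawls:/crawls/ -it \
      webrecorder/browsertrix-crawler crawl \
      --url https://example.com/ \
      --profile /crawls/profiles/profile.tar.gz \
      --generateWACZ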