What does HackerNews think of browsertrix-crawler?

Run a high-fidelity browser-based crawler in a single Docker container

Language: JavaScript

(Disclaimer: I work at Webrecorder)

Our automated crawler browsertrix-crawler (https://github.com/webrecorder/browsertrix-crawler) uses Puppeteer to drive the browsers we archive in: it loads pages, runs behaviors such as auto-scroll, and records the request/response traffic in the WARC format (in Webrecorder tools, the WARCs are then packaged by default into a portable WACZ file: https://specs.webrecorder.net/wacz/1.1.1/). We have custom behaviors for some social media and video sites to make sure their content is captured appropriately. It is a bit of a cat-and-mouse game, as we have to keep updating these behaviors as sites change, but for the most part it works pretty well. The crawler also has job queuing functionality, supports multiple workers/browsers, and is highly configurable with timeouts, page limits, etc.
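
For reference, a basic crawl is a single docker run command, along these lines (a sketch based on the project README; exact flag names, e.g. --limit vs. --pageLimit, vary between crawler versions):

    docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
      --url https://example.com/ \
      --generateWACZ \
      --collection example-crawl \
      --workers 2 \
      --behaviors autoscroll,autoplay,autofetch,siteSpecific \
      --timeout 90 \
      --limit 100

Output (the WARCs and, with --generateWACZ, the packaged WACZ) should land under ./crawls/collections/example-crawl/ on the host.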

The trickier part is replaying the archived websites, as a certain amount of rewriting has to happen to make sure the HTML and JS work against archived assets rather than the live web. One implementation of this is replayweb.page (https://github.com/webrecorder/replayweb.page), which does all of the rewriting client-side in the browser. This lets you interact with archived websites in WARC or WACZ format as if you were interacting with the original site. replayweb.page can run locally in your browser without sending any data to a server, or it can be hosted, including in an embedded mode.
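
The embedded mode is a web component, roughly like this (a sketch based on the replayweb.page embedding docs; the WACZ path here is hypothetical, and the embedding page must also serve the sw.js service worker, by default from a ./replay/ directory next to it):

    <!-- load the replayweb.page UI bundle -->
    <script src="https://cdn.jsdelivr.net/npm/replaywebpage/ui.js"></script>

    <!-- render the archive at /archives/example.wacz, starting at the given URL -->
    <replay-web-page
      source="/archives/example.wacz"
      url="https://example.com/">
    </replay-web-page>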

I use browsertrix-crawler[0] for crawling, and it does well on JS-heavy sites since it uses a real browser to request pages. It even has options to load browser profiles so you can crawl while authenticated on sites (see the sketch after the link below).

[0] https://github.com/webrecorder/browsertrix-crawler
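
Roughly, that workflow is two commands (a sketch based on the README's profile docs; the create-login-profile command and --profile flag are documented there, but exact paths and defaults may differ by version):

    # open an interactive browser, log in, and save the profile
    docker run -v $PWD/crawls/profiles:/crawls/profiles/ -it \
      webrecorder/browsertrix-crawler create-login-profile \
      --url https://example.com/login

    # crawl with the saved, authenticated profile loaded
    docker run -v $PWD/crawls:/crawls/ -it \
      webrecorder/browsertrix-crawler crawl \
      --url https://example.com/ \
      --profile /crawls/profiles/profile.tar.gz \
      --generateWACZ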