What does HackerNews think of awesome-web-archiving?
An Awesome List for getting started with web archiving
ArchiveBox[1]: Pretty much a self-hosted wayback machine. It can save websites as plain HTML, screenshots, text, and some other formats. I have my bookmarks archived in it and have a bookmarklet to easily add new websites to it. If you use the docker-compose setup you can enable a full-text search backend for an easy search setup.
WebRecorder[2]: A browser extension that creates WACZ archives directly in the browser capturing exactly what content you load. I use it on sites with annoying dynamic content that sites like wayback and ArchiveBox wouldn't be able to copy.
ReplayWeb[3]: An interface to browse archive types like WARC, WACZ, and HAR. The interface is just like browsing through your browser. It can be self-hosted as well for the full offline experience.
browsertrix-crawler[4]: A CLI tool to scrape websites and output to WACZ. It's super easy to run with Docker and I use it to scrape entire blogs and docs for offline use. It uses Chrome to load webpages and has some extra features like custom browser profiles, interactive login, and autoscroll/autoplay. I use the `--generateWACZ` parameter so I can use ReplayWeb to easily browse through the final output.
For bookmark and misc webpage archiving, ArchiveBox should be more than enough. Check out this repo for an amazing list of tools and resources: https://github.com/iipc/awesome-web-archiving
[1] https://github.com/ArchiveBox/ArchiveBox [2] https://webrecorder.net [3] https://replayweb.page [4] https://github.com/webrecorder/browsertrix-crawler
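On the WACZ output mentioned above: a WACZ file is a ZIP package, so you can peek inside one without any special tooling. Here is a minimal Python sketch, assuming the archive follows the usual layout with a `datapackage.json` at the root. The function name `summarize_wacz` is made up for illustration and is not part of any of the tools listed.

```python
import json
import zipfile


def summarize_wacz(path):
    """Return (entry_names, metadata) for a WACZ file.

    entry_names is the list of paths stored in the ZIP container.
    metadata is the parsed datapackage.json, or None if the file
    has no such entry (assumption: it sits at the archive root).
    """
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        metadata = None
        if "datapackage.json" in names:
            metadata = json.loads(zf.read("datapackage.json"))
    return names, metadata
```

Calling `summarize_wacz("crawl.wacz")` on a crawl output (the filename is just a placeholder) gives you the list of stored files and whatever metadata the package declares, which is handy for a quick sanity check before loading it into ReplayWeb.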
There are many similar tools there, from archiving to rendering.
https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...
https://www.wikipedia.org/wiki/List_of_Web_archiving_initiat...
The International Internet Preservation Consortium (IIPC) is an active body linking many of them.
There are various resources giving information on tools etc., e.g.:
https://github.com/iipc/awesome-web-archiving
https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Comm...
Or check out some of the other options here:
- https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Comm...
There are tools like WebRecorder[0] that do this to some extent by recording and replaying all requests. It's certainly a step in the right direction and demonstrates that the approach is viable. This was the only approach I tried that worked for archiving stuff like three.js demos. Worth mentioning there's also an Awesome list[1] that covers various web archival technologies.
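To make the "record and replay all requests" idea concrete, here is a toy Python sketch. It is not how WebRecorder is implemented; real tools store full headers, status codes, and timing in WARC/WACZ files. The class name `RecordingSession` and the `fetch` callable are hypothetical, and exact matching on method and URL is a simplifying assumption.

```python
class RecordingSession:
    """Toy record/replay store keyed by (HTTP method, URL)."""

    def __init__(self, fetch):
        # fetch is any callable (method, url) -> bytes that does the
        # real network request; it is injected so the sketch stays
        # independent of a particular HTTP library.
        self._fetch = fetch
        self._records = {}

    def request(self, method, url):
        """Perform the live request and remember the response body."""
        body = self._fetch(method, url)
        self._records[(method.upper(), url)] = body
        return body

    def replay(self, method, url):
        """Return the recorded body without touching the network.

        Raises KeyError if this exact request was never recorded.
        """
        return self._records[(method.upper(), url)]
```

The point of the design is that replay only works for requests that actually happened during recording, which is why this approach can capture dynamic pages (like the three.js demos) as long as you exercise them while recording.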