What does HackerNews think of awesome-web-archiving?

An Awesome List for getting started with web archiving

#12 in Awesome Lists
Not related to the OP topic or ZIM, but I was looking into archiving my bookmarks and other content like documentation sites and wikis. I'll list some of the things I ended up using.

ArchiveBox[1]: Pretty much a self-hosted Wayback Machine. It can save websites as plain HTML, screenshots, extracted text, and a few other formats. I have my bookmarks archived in it and use a bookmarklet to quickly add new sites. If you use the Docker Compose setup, you can enable a full-text search backend for an easy search setup.
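
A minimal sketch of day-to-day use, assuming the stock ArchiveBox CLI and its Docker Compose setup (the URL is a placeholder):

    # add a single page to the archive
    archivebox add 'https://example.com/some-article'

    # or the same thing through the Docker Compose setup
    docker compose run archivebox add 'https://example.com/some-article'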

WebRecorder[2]: A browser extension that creates WACZ archives directly in the browser, capturing exactly the content you load. I use it on sites with annoying dynamic content that tools like the Wayback Machine and ArchiveBox can't copy.

ReplayWeb[3]: An interface for browsing archive formats like WARC, WACZ, and HAR. Browsing an archive feels just like browsing the live site in your browser. It can be self-hosted as well for the full offline experience.
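
If you want the self-hosted route with local archives, a rough sketch (any static file server works; the hosted viewer may need CORS headers to read files from another origin, and the file name here is a placeholder):

    # serve your archive directory locally
    cd ~/archives
    python3 -m http.server 8080
    # then point the ReplayWeb viewer at http://localhost:8080/my-crawl.wacz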

browsertrix-crawler[4]: A CLI tool that scrapes websites and outputs WACZ files. It's super easy to run with Docker, and I use it to scrape entire blogs and docs for offline use. It uses Chrome to load webpages and has some extra features like custom browser profiles, interactive login, and autoscroll/autoplay. I use the `--generateWACZ` parameter so I can easily browse the final output in ReplayWeb.
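
For reference, a typical invocation looks something like this (adapted from the project README; the URL and collection name are placeholders, and flags can differ between versions):

    docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler \
      crawl --url https://example.com/blog/ --generateWACZ --collection my-blog
    # the finished archive lands under ./crawls/collections/my-blog/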

For bookmarks and misc webpage archiving, ArchiveBox should be more than enough. Check out this repo for an amazing list of tools and resources: https://github.com/iipc/awesome-web-archiving

[1] https://github.com/ArchiveBox/ArchiveBox
[2] https://webrecorder.net
[3] https://replayweb.page
[4] https://github.com/webrecorder/browsertrix-crawler

Relevant 'awesome' list for web archiving: https://github.com/iipc/awesome-web-archiving

There are many similar tools there, covering everything from capture to playback.

For example, this list of initiatives (most of them following the trail blazed by the Internet Archive):

https://www.wikipedia.org/wiki/List_of_Web_archiving_initiat...

The International Internet Preservation Consortium (IIPC) is an active body linking many of them.

https://netpreserve.org/

There are various resources with information on tools and techniques, e.g.:

https://github.com/iipc/awesome-web-archiving

https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Comm...

Try https://github.com/WorldBrain/Memex; it supports annotations and lets you review previously seen versions of a site as you browse.

Or check out some of the other options here:

- https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Comm...

- https://github.com/iipc/awesome-web-archiving

While what you say is true, the above approach is the only way to archive arbitrary web pages. Yes, it depends on user interactions to some extent, but you can reasonably let a page load until its network fetches stop and consider it rendered at that point. Generally speaking, you can only capture some preset set of interactions with a modern web page; you can't hope to capture it all.
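
That "let it load until fetches stop" heuristic is what network-idle page-load conditions encode. A hedged sketch using browsertrix-crawler's `--waitUntil` option, which accepts the usual Puppeteer conditions (check your version's docs for the exact flag names; the URL and collection name are placeholders):

    docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler \
      crawl --url https://example.com/app/ \
      --waitUntil networkidle2 --generateWACZ --collection spa-capture
    # networkidle2: consider the page rendered once no more than two
    # network connections have been active for 500 ms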

There are tools like WebRecorder[0] that do this to some extent by recording and replaying all requests. It's certainly a step in the right direction and demonstrates that the approach is viable; it was the only approach I tried that worked for archiving things like three.js demos. Worth mentioning that there's also an Awesome list[1] covering various web-archiving technologies.
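
For a feel of the replay side of that workflow, a sketch using pywb, the engine behind the linked WebRecorder project (commands as documented by pywb; the collection and file names are placeholders):

    pip install pywb
    wb-manager init my-coll                # create a new collection
    wb-manager add my-coll demo.warc.gz    # import an existing capture
    wayback                                # replay at http://localhost:8080/my-coll/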

[0] https://github.com/webrecorder/webrecorder

[1] https://github.com/iipc/awesome-web-archiving