What does HackerNews think of ArchiveBox?

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Language: Python

#4 in Firefox
#8 in Python
Probably a little hacky, but using a userscript plus an archive tool could work. I have a userscript[1] that I've been using for years to archive some news sites, but it just uses the Wayback Machine. It could probably be modified to wait longer and send the request to something like ArchiveBox[2] (or any archive tool that has a web interface). The downside is you'd need to whitelist specific sites (or whitelist everything and use a keybind or some content matching to trigger the archive function).

[1] https://gist.github.com/antiops/00a37a1de289415fa7cbd9b5d1d2...

[2] ArchiveBox, can have full text search: https://github.com/ArchiveBox/ArchiveBox
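The "send the request to ArchiveBox" step above might look something like this minimal sketch. The base URL, the `/add/` endpoint, and the `url` form field are assumptions modeled on the web UI's add page; check your own instance (and its auth setup) before relying on them.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed default address of a self-hosted ArchiveBox instance.
ARCHIVEBOX_BASE = "http://localhost:8000"

def build_archive_request(page_url: str) -> Request:
    """Build (but don't send) a POST asking ArchiveBox to snapshot page_url."""
    data = urlencode({"url": page_url}).encode("utf-8")
    return Request(f"{ARCHIVEBOX_BASE}/add/", data=data, method="POST")

def archive(page_url: str) -> None:
    """Actually submit it; requires a running, reachable instance."""
    with urlopen(build_archive_request(page_url)) as resp:
        resp.read()

req = build_archive_request("https://example.com/article")
print(req.full_url, req.get_method())  # http://localhost:8000/add/ POST
```

A userscript would do the same thing with an XHR/fetch call from the browser; the point is just that any archiver with a web interface can be driven this way.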

It’s not strictly the same, but ArchiveBox is very well maintained and widely recognised for personal archives.

https://github.com/ArchiveBox/ArchiveBox

For people interested in this, adjacent solutions would be

- [ArchiveBox/ArchiveBox: Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...](https://github.com/ArchiveBox/ArchiveBox)

- [kanishka-linux/reminiscence: Self-Hosted Bookmark And Archive Manager](https://github.com/kanishka-linux/reminiscence)

- [go-shiori/shiori: Simple bookmark manager built with Go](https://github.com/go-shiori/shiori)

- [xwmx/nb: CLI and local web plain text note‑taking, bookmarking, and archiving with linking, tagging, filtering, search, Git versioning & syncing, Pandoc conversion, + more, in a single portable script.](https://github.com/xwmx/nb)

Not exactly what you're asking for, but you can set up SingleFile[1] to automatically save each page you visit.

Then there's also ArchiveBox[2] which can convert your browser history into various formats.

[1] https://github.com/gildas-lormeau/SingleFile

[2] https://github.com/ArchiveBox/ArchiveBox

This tool seems to bundle an extremely primitive hand-rolled browser along with its archiving tool. That's a weird design decision: if you're a fan of the Unix philosophy, why not separate the two? And if you're a fan of modern web standards, why not use a tool like ArchiveBox[1] so you can actually use a modern browser?

[1] https://github.com/ArchiveBox/ArchiveBox

I don't use the archiving feature, but it might be useful to post some of the URLs here that are failing to archive.

I remember seeing in the FAQ [1] that pages above 32 MB won't be archived. That's not much given how bloated pages are these days. I wonder if the page plus its assets are exceeding that limit?

> Is there a size limit on archived links?

> There is a 32 MB size limit per link, which is typically enough for all but the largest pages. Links larger than 32 MB may not be archived.
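As a quick sanity check on that limit, here's a sketch; the asset sizes are made-up examples just to illustrate how easily a modern page can blow past 32 MB.

```python
# Pinboard's FAQ caps each archived link at 32 MB (page plus assets).
PINBOARD_LIMIT_BYTES = 32 * 1024 * 1024  # assuming MB means MiB here

def exceeds_limit(asset_sizes_bytes) -> bool:
    """True if a page and its assets together blow past the archive limit."""
    return sum(asset_sizes_bytes) > PINBOARD_LIMIT_BYTES

# e.g. a 2 MB HTML page, a 25 MB autoplaying video, and 8 MB of images
sizes = [2 * 1024**2, 25 * 1024**2, 8 * 1024**2]
print(exceeds_limit(sizes))  # True: 35 MB total is over the 32 MB cap
```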

As an alternative, ArchiveBox [2] has Pinboard support.

[1]: https://www.pinboard.in/faq/#archiving

[2]: https://github.com/ArchiveBox/ArchiveBox

ArchiveTeam sends its archives to the Internet Archive, but the two are not related. I don't think you confused the two, but I mention this every time just in case.

The Warrior is a small Docker image that downloads files via your ISP connection and forwards them to the AT servers. No need for large drives.

For my personal use, I run a home-server install of https://github.com/ArchiveBox/ArchiveBox, and for that one you may want to get some storage, though I prefer to host its data on an SSD for performance reasons (my archive grows by approx. 5000 items or 150 GB per year). It's like a private Internet Archive on your home network.
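The growth numbers above work out to a manageable storage plan; a back-of-the-envelope sketch (the 1 TB drive is just an example figure, not from the comment):

```python
# ~5000 items and ~150 GB of archive growth per year, as stated above.
items_per_year = 5000
growth_gb_per_year = 150

avg_mb_per_item = growth_gb_per_year * 1000 / items_per_year
print(f"{avg_mb_per_item:.0f} MB per snapshot on average")  # 30 MB

drive_gb = 1000
years_until_full = drive_gb / growth_gb_per_year
print(f"~{years_until_full:.1f} years until a 1 TB drive fills")  # ~6.7 years
```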

I looked at Wallabag and went for Pinboard again. Even subscribed for two years.

It seems that was a mistake.

Archiving stopped at around 85% of my bookmarks.

I've had problems like this before, especially when the owner was preoccupied with U.S. politics and didn't care much about Pinboard anymore. I still went back.

Pinboard's support doesn't reply to emails, and Pinboard's Twitter account doesn't reply either.

I've asked for a refund (within the one week refund window). No reply.

Maybe he'll reply here. The last time I had problems like this he also ignored my emails for a long time and finally replied on HN.

People, take a serious look at Wallabag. Maybe Pocket. And maybe ArchiveBox (https://github.com/ArchiveBox/ArchiveBox).

This looks really useful, as it solves a frequent concern of mine: I read lots of articles, but important details gradually fade from my memory.

Great to see the suggestions people have made (letting people see the product, giving install details, etc.), and you've been super responsive.

You mention in replies below that integrating with other tools is a way for it to work. I don't know how feasible this might be, but a tool I've been keen to use that it might fit well with is ArchiveBox: https://github.com/ArchiveBox/ArchiveBox

If you could integrate with something like that, it would focus on managing the content and this would be about the learning/recall testing. Is that the kind of approach you'd be aiming for?

I’ve not had the chance to try this personally, but I’ve seen ArchiveBox [0] mentioned frequently as a nice bookmark/history archiving system.

[0] https://github.com/ArchiveBox/ArchiveBox

I've been running a setup with Recoll and https://github.com/ArchiveBox/ArchiveBox for a few months now [1]. Each morning ArchiveBox scrapes all the new links I've put into my (text-based) notes and saves them as single-file HTML pages. Then Recoll indexes them.

It's very fast and ~4 lines of code. It's surprising how often I rediscover old blog posts & papers that are much better than what Google gives me.

In my experience, Recoll sadly isn't very good at searching for aliases.

[1] https://siboehm.com/articles/21/a-local-search-engine
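The scrape-and-index workflow above might look roughly like this sketch: pull URLs out of plain-text notes, then hand each one to the `archivebox add` CLI. The note text and the exact invocation are assumptions; adapt them to your own setup.

```python
import re
import subprocess

# Matches http(s) URLs up to whitespace or common closing punctuation.
URL_RE = re.compile(r"https?://[^\s\)\]>\"']+")

def extract_urls(text: str) -> list[str]:
    """Return every http(s) URL in a blob of note text, de-duplicated in order."""
    seen: dict[str, None] = {}
    for url in URL_RE.findall(text):
        seen.setdefault(url)
    return list(seen)

def archive_notes(note_text: str) -> None:
    """Hand each URL to the ArchiveBox CLI (run from your ArchiveBox data dir)."""
    for url in extract_urls(note_text):
        subprocess.run(["archivebox", "add", url], check=True)

notes = "See https://example.com/post and [paper](https://arxiv.org/abs/1234.5678)."
print(extract_urls(notes))
# ['https://example.com/post', 'https://arxiv.org/abs/1234.5678']
```

Run nightly from cron against your notes directory and you get roughly the pipeline described, with Recoll (or any indexer) pointed at the ArchiveBox output directory.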

You and GP might find ArchiveBox to have overlap with what you're describing? https://github.com/ArchiveBox/ArchiveBox

Edit: here's the description from their repo

"ArchiveBox is a powerful, self-hosted internet archiving solution to collect, save, and view sites you want to preserve offline.

You can set it up as a command-line tool, web app, and desktop app (alpha), on Linux, macOS, and Windows.

You can feed it URLs one at a time, or schedule regular imports from browser bookmarks or history, feeds like RSS, bookmark services like Pocket/Pinboard, and more. See input formats for a full list.

It saves snapshots of the URLs you feed it in several formats: HTML, PDF, PNG screenshots, WARC, and more out-of-the-box, with a wide variety of content extracted and preserved automatically (article text, audio/video, git repos, etc.). See output formats for a full list."

I used Evernote for 2, now it's mostly https://github.com/ArchiveBox/ArchiveBox (still using Evernote in those cases where I need to clip a page behind a login). Image handling is a pain point of all text-first note apps, sadly.

I'm pretty fond of using this tool to take trips down memory lane, revisiting lost content I used to enjoy.

Browsing through crawls has this neat side-effect of being able to serendipitously discover things that I missed back in the day just by having everything laid out on the file system.

PSA: there are a lot of holes in most crawls, even for popular stuff. A good way to ensure you can revisit content later is to submit links to the Wayback Machine with its "Save Page Now" [1] feature. Some local archivers like ArchiveBox [2] let you automate this. I highly recommend making a habit of it.

[1] https://web.archive.org/

[2] https://github.com/ArchiveBox/ArchiveBox
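Automating "Save Page Now" yourself is also straightforward. A minimal sketch, assuming the Wayback Machine's `web.archive.org/save/<url>` request pattern (its exact behavior, rate limits, and auth options may change, so this only builds the request rather than sending it):

```python
from urllib.request import Request

SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_page_now_request(url: str) -> Request:
    """Build a request asking the Wayback Machine to snapshot `url`."""
    # Identify yourself politely; the User-Agent string here is made up.
    return Request(SAVE_ENDPOINT + url, headers={"User-Agent": "my-archiver/0.1"})

req = save_page_now_request("https://example.com/post")
print(req.full_url)  # https://web.archive.org/save/https://example.com/post
```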

Not OP, so I can't speak for them. There are a bunch of ways to do this, ranging from turnkey solutions to collections of scripts and extensions. On the turnkey side, there are programs like ArchiveBox[1], which take links and store them as WARC files. You can import your browsing history into ArchiveBox and set up a script to do it automatically. If you'd like to build something yourself, you can extract your browsing history (e.g., Firefox stores its history in a SQLite database) and manually wget those URLs. For a reference to the more "bootstrapped" version, I'll link to Gwern's post on their archiving setup [2]. It's fairly long, so I'd advise skipping to the parts you're interested in first.

1: https://github.com/ArchiveBox/ArchiveBox

2: https://www.gwern.net/Archiving-URLs
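The "bootstrapped" route above can be sketched in a few lines: Firefox keeps history in `places.sqlite` (table `moz_places`), which you can query directly and feed to whatever archiver you like. The demo below builds a throwaway database mimicking that schema so the query is demonstrable without touching a real profile; treat the column names as assumptions and check them against your own file.

```python
import os
import sqlite3
import tempfile

def recent_history_urls(db_path: str, limit: int = 100) -> list[str]:
    """Most recently visited URLs from a places.sqlite-style database.

    Copy the real places.sqlite somewhere first: Firefox locks it while running.
    """
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT url FROM moz_places ORDER BY last_visit_date DESC LIMIT ?",
            (limit,),
        ).fetchall()
    finally:
        con.close()
    return [url for (url,) in rows]

# Demo against a throwaway database mimicking the moz_places schema
# (a real profile lives under ~/.mozilla/firefox/<profile>/places.sqlite).
path = os.path.join(tempfile.gettempdir(), "demo_places.sqlite")
if os.path.exists(path):
    os.remove(path)
demo = sqlite3.connect(path)
demo.execute("CREATE TABLE moz_places (url TEXT, last_visit_date INTEGER)")
demo.executemany(
    "INSERT INTO moz_places VALUES (?, ?)",
    [("https://example.com/a", 2), ("https://example.com/b", 1)],
)
demo.commit()
demo.close()

print(recent_history_urls(path))  # most recent first
```

From there, each URL can be passed to wget, ArchiveBox, or whatever archiver you've settled on.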

Check out ArchiveBox, a fairly comprehensive locally hosted archiver:

https://github.com/ArchiveBox/ArchiveBox

Thank you for saving that, pirate niki! I think I saved a copy as well, right? Yep, here: https://archive.is/jcURO

I've got lots of these archives of issues and comments lying around. It's good to see more!

Btw I knew you'd show up on a comment thread where I posted my stuff. You're like obsessed with my project, right? Why, haha? Maybe because you've got your own archive project, the archive box. "Competition". Hahaha. I hope you haven't got to the point of "search alerts" obsessed, hahaha, don't fret too much, niki pirate! Everyone, check out pirate's archive box, he made the trip here, do him a favor!

https://github.com/ArchiveBox/ArchiveBox

Thanks for showing up, niki, it's good to see you again. See you next time!

This article is blogspam.

The repository has enough information on its own: https://github.com/ArchiveBox/ArchiveBox