What does HackerNews think of ArchiveBox?
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
[1] https://gist.github.com/antiops/00a37a1de289415fa7cbd9b5d1d2...
[2] ArchiveBox, can have full text search: https://github.com/ArchiveBox/ArchiveBox
- [ArchiveBox/ArchiveBox: Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...](https://github.com/ArchiveBox/ArchiveBox)
- [kanishka-linux/reminiscence: Self-Hosted Bookmark And Archive Manager](https://github.com/kanishka-linux/reminiscence)
- [go-shiori/shiori: Simple bookmark manager built with Go](https://github.com/go-shiori/shiori)
- [xwmx/nb: CLI and local web plain text note‑taking, bookmarking, and archiving with linking, tagging, filtering, search, Git versioning & syncing, Pandoc conversion, + more, in a single portable script.](https://github.com/xwmx/nb)
Then there's also ArchiveBox[2] which can convert your browser history into various formats.
I remember seeing in the FAQ [1] that pages above 32 MB won't be archived. That's not much with how bloated pages are these days. I wonder if the page + assets is exceeding that limit?
> Is there a size limit on archived links?
> There is a 32 MB size limit per link, which is typically enough for all but the largest pages. Links larger than 32 MB may not be archived.
As an alternative, ArchiveBox [2] has Pinboard support.
The Warrior is a small Docker image that downloads files via your ISP connection and forwards them to the ArchiveTeam (AT) servers. No need for large drives.
For my personal use, I have a home server install of https://github.com/ArchiveBox/ArchiveBox and for that one you may want to get some storage, though I prefer to host its data on the SSD for performance reasons (my archive grows approx. 5000 items or 150GB per year). It's like a private Internet Archive on your home network.
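For a rough sense of scale, the figures in the comment above (approx. 5000 items or 150GB per year) work out to about 30 MB per item. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 1000 MB) and a constant growth rate; the function names are hypothetical and nothing here comes from ArchiveBox itself:

```python
# Back-of-the-envelope storage estimate for a personal archive.
# Assumption: decimal units (1 GB = 1000 MB) and steady yearly growth.
MB_PER_GB = 1000


def average_item_mb(total_gb: float, items: int) -> float:
    """Average size of one archived item in MB."""
    if items <= 0:
        raise ValueError("items must be positive")
    return total_gb * MB_PER_GB / items


def projected_gb(gb_per_year: float, years: float) -> float:
    """Total storage after `years`, assuming growth stays constant."""
    return gb_per_year * years


# Figures quoted in the comment: ~5000 items / ~150 GB per year.
per_item = average_item_mb(150, 5000)
three_years = projected_gb(150, 3)
```

An average near 30 MB per item includes every saved output format, so actual per-page sizes will vary widely around that number.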
Seems it was a mistake.
Archiving stopped at around 85% of my bookmarks.
I've had those problems before, especially when the owner played U.S. politics and didn't care much about Pinboard anymore. I still went back.
Pinboard's support doesn't reply to mails, and Pinboard's Twitter account doesn't reply either.
I've asked for a refund (within the one week refund window). No reply.
Maybe he'll reply here. Last time I had such problems he also ignored my mails for a long time and finally replied on HN.
People, look at Wallabag seriously. Maybe Pocket. And maybe ArchiveBox (https://github.com/ArchiveBox/ArchiveBox).
Great to see the suggestions people have had (letting people see the product, giving install details etc) and you've been super responsive.
You mention in replies below that integrating with other tools is one way for it to work. I don't know how feasible this might be, but a tool I've been keen to use that it might fit well with is ArchiveBox: https://github.com/ArchiveBox/ArchiveBox
If you could integrate with something like that, it would focus on managing the content and this would be about the learning/recall testing. Is that the kind of approach you'd be aiming for?
It's very fast and ~4 lines of code. It's surprising how often I rediscover old blog posts & papers that are much better than what Google yields me.
From my experience Recoll isn't very good at searching for aliases sadly.
Edit: here's the description from their repo
"ArchiveBox is a powerful, self-hosted internet archiving solution to collect, save, and view sites you want to preserve offline.
You can set it up as a command-line tool, web app, and desktop app (alpha), on Linux, macOS, and Windows.
You can feed it URLs one at a time, or schedule regular imports from browser bookmarks or history, feeds like RSS, bookmark services like Pocket/Pinboard, and more. See input formats for a full list.
It saves snapshots of the URLs you feed it in several formats: HTML, PDF, PNG screenshots, WARC, and more out-of-the-box, with a wide variety of content extracted and preserved automatically (article text, audio/video, git repos, etc.). See output formats for a full list."
Browsing through crawls has this neat side-effect of being able to serendipitously discover things that I missed back in the day just by having everything laid out on the file system.
PSA: There are a lot of holes in most crawls, even for popular stuff. A good way to ensure that you can revisit content later is submitting links to the Wayback Machine with the "Save Page Now" [1] functionality. Some local archivers like ArchiveBox [2] let you automate this. Highly recommended to make a habit of it.
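For anyone who wants to script the "Save Page Now" habit, here is a minimal sketch. It assumes the public `https://web.archive.org/save/<url>` form of Save Page Now, which may be rate limited or change; the helper names are hypothetical, and this is not how ArchiveBox does it internally:

```python
# Sketch: build (and optionally request) a Wayback Machine "Save Page Now" URL.
# Assumption: the https://web.archive.org/save/<url> endpoint accepts plain GETs.
from urllib.parse import quote
from urllib.request import Request, urlopen

SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_page_now_url(url: str) -> str:
    """Return the Save Page Now URL for `url`, keeping URL punctuation intact."""
    return SAVE_ENDPOINT + quote(url, safe=":/?&=#%")


def submit(url: str, timeout: float = 60.0) -> int:
    """Ask the Wayback Machine to capture `url`; returns the HTTP status code.

    Performs a real network request, so call it sparingly.
    """
    request = Request(
        save_page_now_url(url),
        headers={"User-Agent": "save-page-now-sketch"},  # hypothetical UA string
    )
    with urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `submit("https://example.com/")` would trigger one capture; looping over a bookmarks export with a pause between calls is the obvious next step.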
I've got lots of these archives of issues and comments lying around. It's good to see more!
Btw I knew you'd show up on a comment thread where I posted my stuff. You're like obsessed with my project, right? Why, haha? Maybe because you've got your own archive project, the archive box. "Competition". Hahaha. I hope you haven't got to the point of "search alerts" obsessed, hahaha, don't fret too much, niki pirate! Everyone check out pirate's the archive box, he made the trip here, do him a favor!
https://github.com/ArchiveBox/ArchiveBox
Thanks for showing up niiki, it's good to see you again. See you next time!
The repository has enough information on its own: https://github.com/ArchiveBox/ArchiveBox