What does HackerNews think of ArchiveBox?

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Language: Python

#4 in Firefox
#8 in Python
Probably a little hacky, but using a userscript plus an archive tool could work. I have a userscript[1] that I've been using for years to archive some news sites, but it just uses the Wayback Machine. It could probably be modified to wait longer and send the request to something like ArchiveBox[2] (or any archive tool that has a web interface). The downside is you'd need to whitelist specific sites (or whitelist everything and use a keybind or some content matching to trigger the archive function).

[1] https://gist.github.com/antiops/00a37a1de289415fa7cbd9b5d1d2...

[2] ArchiveBox, can have full text search: https://github.com/ArchiveBox/ArchiveBox
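The "send the request to ArchiveBox" step above might look something like this minimal sketch. The base URL, the `/add/` endpoint, and the `url` form field are assumptions modeled on the web UI's add page; check your own instance (and its auth setup) before relying on them.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed default address of a self-hosted ArchiveBox instance.
ARCHIVEBOX_BASE = "http://localhost:8000"

def build_archive_request(page_url: str) -> Request:
    """Build (but don't send) a POST asking ArchiveBox to snapshot page_url."""
    data = urlencode({"url": page_url}).encode("utf-8")
    return Request(f"{ARCHIVEBOX_BASE}/add/", data=data, method="POST")

def archive(page_url: str) -> None:
    """Actually submit it; requires a running, reachable instance."""
    with urlopen(build_archive_request(page_url)) as resp:
        resp.read()

req = build_archive_request("https://example.com/article")
print(req.full_url, req.get_method())  # http://localhost:8000/add/ POST
```

A userscript would do the same thing with an XHR/fetch call from the browser; the point is just that any archiver with a web interface can be driven this way.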

It’s not strictly the same, but ArchiveBox is very well maintained and widely recognised for personal archives.

https://github.com/ArchiveBox/ArchiveBox

For people interested in this, adjacent solutions would be

- [ArchiveBox/ArchiveBox: Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...](https://github.com/ArchiveBox/ArchiveBox)

- [kanishka-linux/reminiscence: Self-Hosted Bookmark And Archive Manager](https://github.com/kanishka-linux/reminiscence)

- [go-shiori/shiori: Simple bookmark manager built with Go](https://github.com/go-shiori/shiori)

- [xwmx/nb: CLI and local web plain text note‑taking, bookmarking, and archiving with linking, tagging, filtering, search, Git versioning & syncing, Pandoc conversion, + more, in a single portable script.](https://github.com/xwmx/nb)

Not exactly what you're asking for, but you can set up SingleFile[1] to automatically save each page you visit.

Then there's also ArchiveBox[2] which can convert your browser history into various formats.

[1] https://github.com/gildas-lormeau/SingleFile

[2] https://github.com/ArchiveBox/ArchiveBox

This tool seems to bundle an extremely primitive hand-rolled browser along with its archiving tool. That's a weird design decision: if you're a fan of the Unix philosophy, why not separate the two? And if you're a fan of modern web standards, why not use a tool like ArchiveBox[1] so you can actually use a modern browser?

[1] https://github.com/ArchiveBox/ArchiveBox

I don't use the archiving feature, but it might be useful to post some of the URLs here that are failing to archive.

I remember seeing in the FAQ [1] that pages above 32 MB won't be archived. That's not much given how bloated pages are these days. I wonder if the page plus its assets are exceeding that limit?

> Is there a size limit on archived links?

> There is a 32 MB size limit per link, which is typically enough for all but the largest pages. Links larger than 32 MB may not be archived.
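As a quick sanity check on that limit, here's a sketch; the asset sizes are made-up examples just to illustrate how easily a modern page can blow past 32 MB.

```python
# Pinboard's FAQ caps each archived link at 32 MB (page plus assets).
PINBOARD_LIMIT_BYTES = 32 * 1024 * 1024  # assuming MB means MiB here

def exceeds_limit(asset_sizes_bytes) -> bool:
    """True if a page and its assets together blow past the archive limit."""
    return sum(asset_sizes_bytes) > PINBOARD_LIMIT_BYTES

# e.g. a 2 MB HTML page, a 25 MB autoplaying video, and 8 MB of images
sizes = [2 * 1024**2, 25 * 1024**2, 8 * 1024**2]
print(exceeds_limit(sizes))  # True: 35 MB total is over the 32 MB cap
```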

As an alternative, ArchiveBox [2] has Pinboard support.

[1]: https://www.pinboard.in/faq/#archiving

[2]: https://github.com/ArchiveBox/ArchiveBox

ArchiveTeam sends its archives to the Internet Archive, but the two are not related. I don't think you confused the two, but I mention this every time just in case.

The Warrior is a small Docker image that downloads files via your ISP connection and forwards them to the AT servers. No need for large drives.

For my personal use, I run a home-server install of https://github.com/ArchiveBox/ArchiveBox, and for that one you may want to get some storage, though I prefer to host its data on an SSD for performance reasons (my archive grows by approx. 5000 items or 150 GB per year). It's like a private Internet Archive on your home network.
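The growth numbers above work out to a manageable storage plan; a back-of-the-envelope sketch (the 1 TB drive is just an example figure, not from the comment):

```python
# ~5000 items and ~150 GB of archive growth per year, as stated above.
items_per_year = 5000
growth_gb_per_year = 150

avg_mb_per_item = growth_gb_per_year * 1000 / items_per_year
print(f"{avg_mb_per_item:.0f} MB per snapshot on average")  # 30 MB

drive_gb = 1000
years_until_full = drive_gb / growth_gb_per_year
print(f"~{years_until_full:.1f} years until a 1 TB drive fills")  # ~6.7 years
```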

I looked at Wallabag and went for Pinboard again. Even subscribed for two years.

It seems that was a mistake.

Archiving stopped at around 85% of my bookmarks.

I've had problems like this before, especially when the owner was preoccupied with U.S. politics and didn't care much about Pinboard anymore. I still went back.

Pinboard's support doesn't reply to emails, and Pinboard's Twitter account doesn't reply either.

I've asked for a refund (within the one week refund window). No reply.

Maybe he'll reply here. The last time I had problems like this he also ignored my emails for a long time and finally replied on HN.

People, take a serious look at Wallabag. Maybe Pocket. And maybe ArchiveBox (https://github.com/ArchiveBox/ArchiveBox).

This looks really useful, as it solves a frequent concern of mine: I read lots of articles, but important details gradually fade from my memory.

Great to see the suggestions people have made (letting people see the product, giving install details, etc.), and you've been super responsive.

You mention in replies below that integrating with other tools is a way for it to work. I don't know how feasible this might be, but a tool I've been keen to use that it might fit well with is ArchiveBox: https://github.com/ArchiveBox/ArchiveBox

If you could integrate with something like that, it would focus on managing the content and this would be about the learning/recall testing. Is that the kind of approach you'd be aiming for?

I’ve not had the chance to try this personally, but I’ve seen ArchiveBox [0] mentioned frequently as a nice bookmark/history archiving system.

[0] https://github.com/ArchiveBox/ArchiveBox

I've been running a setup with Recoll and https://github.com/ArchiveBox/ArchiveBox for a few months now [1]. Each morning ArchiveBox scrapes all the new links I've put into my (text-based) notes and saves them as single-file HTML pages. Then Recoll indexes them.

It's very fast and ~4 lines of code. It's surprising how often I rediscover old blog posts & papers that are much better than what Google gives me.

In my experience, Recoll sadly isn't very good at searching for aliases.

[1] https://siboehm.com/articles/21/a-local-search-engine
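The scrape-and-index workflow above might look roughly like this sketch: pull URLs out of plain-text notes, then hand each one to the `archivebox add` CLI. The note text and the exact invocation are assumptions; adapt them to your own setup.

```python
import re
import subprocess

# Matches http(s) URLs up to whitespace or common closing punctuation.
URL_RE = re.compile(r"https?://[^\s\)\]>\"']+")

def extract_urls(text: str) -> list[str]:
    """Return every http(s) URL in a blob of note text, de-duplicated in order."""
    seen: dict[str, None] = {}
    for url in URL_RE.findall(text):
        seen.setdefault(url)
    return list(seen)

def archive_notes(note_text: str) -> None:
    """Hand each URL to the ArchiveBox CLI (run from your ArchiveBox data dir)."""
    for url in extract_urls(note_text):
        subprocess.run(["archivebox", "add", url], check=True)

notes = "See https://example.com/post and [paper](https://arxiv.org/abs/1234.5678)."
print(extract_urls(notes))
# ['https://example.com/post', 'https://arxiv.org/abs/1234.5678']
```

Run nightly from cron against your notes directory and you get roughly the pipeline described, with Recoll (or any indexer) pointed at the ArchiveBox output directory.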

You and GP might find ArchiveBox to have overlap with what you're describing? https://github.com/ArchiveBox/ArchiveBox

Edit: here's the description from their repo

"ArchiveBox is a powerful, self-hosted internet archiving solution to collect, save, and view sites you want to preserve offline.

You can set it up as a command-line tool, web app, and desktop app (alpha), on Linux, macOS, and Windows.

You can feed it URLs one at a time, or schedule regular imports from browser bookmarks or history, feeds like RSS, bookmark services like Pocket/Pinboard, and more. See input formats for a full list.

It saves snapshots of the URLs you feed it in several formats: HTML, PDF, PNG screenshots, WARC, and more out-of-the-box, with a wide variety of content extracted and preserved automatically (article text, audio/video, git repos, etc.). See output formats for a full list."

I used Evernote for 2, now it's mostly https://github.com/ArchiveBox/ArchiveBox (still using Evernote in those cases where I need to clip a page behind a login). Image handling is a pain point of all text-first note apps, sadly.

I'm pretty fond of using this tool to take trips down memory lane, revisiting lost content I used to enjoy.

Browsing through crawls has this neat side-effect of being able to serendipitously discover things that I missed back in the day just by having everything laid out on the file system.

PSA: there are a lot of holes in most crawls, even for popular stuff. A good way to ensure you can revisit content later is to submit links to the Wayback Machine with its "Save Page Now" [1] feature. Some local archivers like ArchiveBox [2] let you automate this. I highly recommend making a habit of it.

[1] https://web.archive.org/

[2] https://github.com/ArchiveBox/ArchiveBox
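Automating "Save Page Now" yourself is also straightforward. A minimal sketch, assuming the Wayback Machine's `web.archive.org/save/<url>` request pattern (its exact behavior, rate limits, and auth options may change, so this only builds the request rather than sending it):

```python
from urllib.request import Request

SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_page_now_request(url: str) -> Request:
    """Build a request asking the Wayback Machine to snapshot `url`."""
    # Identify yourself politely; the User-Agent string here is made up.
    return Request(SAVE_ENDPOINT + url, headers={"User-Agent": "my-archiver/0.1"})

req = save_page_now_request("https://example.com/post")
print(req.full_url)  # https://web.archive.org/save/https://example.com/post
```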

Not OP, so I can't speak for them. There are a bunch of ways to do this, ranging from turnkey solutions to collections of scripts and extensions. On the turnkey side, there are programs like ArchiveBox[1], which take links and store them as WARC files. You can import your browsing history into ArchiveBox and set up a script to do it automatically. If you'd like to build something yourself, you can extract your browsing history (e.g., Firefox stores its history in a SQLite database) and manually wget those URLs. For a reference to the more "bootstrapped" version, I'll link to Gwern's post on their archiving setup [2]. It's fairly long, so I'd advise skipping to the parts you're interested in first.

1: https://github.com/ArchiveBox/ArchiveBox

2: https://www.gwern.net/Archiving-URLs
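The "bootstrapped" route above can be sketched in a few lines: Firefox keeps history in `places.sqlite` (table `moz_places`), which you can query directly and feed to whatever archiver you like. The demo below builds a throwaway database mimicking that schema so the query is demonstrable without touching a real profile; treat the column names as assumptions and check them against your own file.

```python
import os
import sqlite3
import tempfile

def recent_history_urls(db_path: str, limit: int = 100) -> list[str]:
    """Most recently visited URLs from a places.sqlite-style database.

    Copy the real places.sqlite somewhere first: Firefox locks it while running.
    """
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT url FROM moz_places ORDER BY last_visit_date DESC LIMIT ?",
            (limit,),
        ).fetchall()
    finally:
        con.close()
    return [url for (url,) in rows]

# Demo against a throwaway database mimicking the moz_places schema
# (a real profile lives under ~/.mozilla/firefox/<profile>/places.sqlite).
path = os.path.join(tempfile.gettempdir(), "demo_places.sqlite")
if os.path.exists(path):
    os.remove(path)
demo = sqlite3.connect(path)
demo.execute("CREATE TABLE moz_places (url TEXT, last_visit_date INTEGER)")
demo.executemany(
    "INSERT INTO moz_places VALUES (?, ?)",
    [("https://example.com/a", 2), ("https://example.com/b", 1)],
)
demo.commit()
demo.close()

print(recent_history_urls(path))  # most recent first
```

From there, each URL can be passed to wget, ArchiveBox, or whatever archiver you've settled on.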

Check out ArchiveBox, a fairly comprehensive locally hosted archiver:

https://github.com/ArchiveBox/ArchiveBox

Thank you for saving that, pirate niki! I think I saved a copy as well, right? Yep, here: https://archive.is/jcURO

I've got lots of these archives of issues and comments lying around. It's good to see more!

Btw I knew you'd show up on a comment thread where I posted my stuff. You're like obsessed with my project, right? Why, haha? Maybe because you've got your own archive project, the archive box. "Competition". Hahaha. I hope you haven't got to the point of "search alerts" obsessed, hahaha, don't fret too much, niki pirate! Everyone, check out pirate's archive box, he made the trip here, do him a favor!

https://github.com/ArchiveBox/ArchiveBox

Thanks for showing up, niki, it's good to see you again. See you next time!

This article is blogspam.

The repository has enough information on its own: https://github.com/ArchiveBox/ArchiveBox