What does HackerNews think of internetarchive?

A Python and Command-Line Interface to Archive.org

Language: Python

> Can we talk about how cool the Wayback Machine Compare feature is?

I consider myself an Internet Archive power-user; I spend hours playing with CDX queries, t̶r̶o̶l̶l̶i̶n̶g̶ trawling the archive for interesting tidbits that have been lost to time. I have spent countless evenings building scripts based on the internetarchive (https://github.com/jjjake/internetarchive) and iamine (https://github.com/jjjake/iamine) tools (along with a host of others).

I am utterly ashamed I had never heard of the /diff/ feature between pages until I read your article. Thank you for bringing this to my attention! I am continuously impressed by the work they do at IA.

Sure, you can use `git bundle` to export your repository to a file, then upload it to the archive. There's a nice command-line tool for doing the upload (https://github.com/jjjake/internetarchive), so you can easily script this so that it's automatic.
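For anyone who wants to script this, here is a rough sketch of the bundle-then-upload idea in Python. It assumes the `internetarchive` package is installed and that credentials have already been set up with `ia configure`. The function names, the `out_dir` parameter, and the metadata values are made up for illustration and are not part of the tool.

```python
import subprocess
from pathlib import Path


def bundle_command(repo_dir, bundle_path):
    """Build the `git bundle` invocation that packs every ref into one file."""
    return ["git", "-C", str(repo_dir), "bundle", "create", str(bundle_path), "--all"]


def archive_repo(repo_dir, out_dir, identifier, title):
    """Bundle a repository and upload the bundle to archive.org (sketch)."""
    # Imported lazily so bundle_command() works without the package installed.
    from internetarchive import upload

    # Use an absolute path because `git -C` changes the working directory.
    bundle_path = Path(out_dir).resolve() / f"{identifier}.bundle"
    subprocess.run(bundle_command(repo_dir, bundle_path), check=True)

    # The identifier and metadata values are placeholders for this sketch.
    responses = upload(
        identifier,
        files=[str(bundle_path)],
        metadata={"title": title, "mediatype": "software"},
    )
    return [r.status_code for r in responses]
```

Calling something like `archive_repo(".", "/tmp", "my-project-git-backup", "my-project git bundle")` from a cron job would make it automatic; the identifier there is hypothetical and has to be unique on archive.org.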

I can't speak formally for the Internet Archive, but the existing content and services are not going to disappear overnight: funding comes from several sources, thought has been put into organizational structure, and things have been designed to keep core access and preservation infrastructure running with minimal cost and effort (eg, if the economy tanks).

Getting the content coverage people sometimes assume we already have is another matter. Additional funding (thanks for your donation!) goes towards additional crawling and keeping up with the endless treadmill of media types and protocols. Eg, headless browser crawling development and deployment to capture JavaScript-heavy sites (https://github.com/internetarchive/brozzler); this is much more expensive than "classic" crawling.

For more on increasing storage costs and the under-funded state of web archiving in general, I recommend David Rosenthal's blog, eg:

https://blog.dshr.org/2018/05/longer-talk-at-msst2018.html

https://blog.dshr.org/2014/03/the-half-empty-archive.html

Far more effective and robust than hoping the archive will "suck it up for us" is to upload snapshots/dumps/exports yourself! Anybody can create an archive.org account and upload content (I recommend https://github.com/jjjake/internetarchive over the HTML form), within reasonable limits. Obviously, care needs to be taken to remove sensitive (and personal) information first.

I do wish somebody would crank out an easy to use desktop GUI tool for this.

Edit: this will make it better

https://gareth.halfacree.co.uk/2013/04/bulk-downloading-coll...

and

https://github.com/jjjake/internetarchive