What does HackerNews think of duperemove?

Tools for deduping file systems

Language: C

Very useful for identifying files that may need to be deduplicated or that can be removed entirely. Unfortunately, I don't think this will also find identical directories.

If deleting files isn't what you want, I'd suggest looking into deduplicating tools.

ZFS has its own deduplicator built in, which is nice. It should just deduplicate files and individual extents of files by itself once you enable it. Probably not a good idea on very write-heavy disks, but it's an option.

Other file systems with extent-level deduplication can use https://github.com/markfasheh/duperemove to not only deduplicate files, but also deduplicate individual extents. This can be very useful for file systems that store a lot of duplicate content, like different WINE prefixes. For filesystems without extent deduplication, duperemove should try hard linking files to make them take up practically no disk space.

ZFS now has reflink support, which doesn't require lots of RAM, but isn't done automatically while writing. You need to run something like https://github.com/markfasheh/duperemove

With XFS and reflink, out-of-band deduplication is totally possible and is a userspace [1] issue. But XFS is not doing anything to assist in accelerating the identification of duplicate blocks; instead, it simply implements ioctls for what is essentially extent sharing.

[1] https://github.com/markfasheh/duperemove
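
To make the "ioctls for extent sharing" point concrete, here is a minimal sketch of the Linux FIDEDUPERANGE ioctl from `linux/fs.h`, which asks the kernel to share one byte range between two open files. The helper name `dedupe_range` and its return convention are illustrative assumptions, not something taken from duperemove's source. Finding which ranges are worth submitting is left entirely to userspace; the kernel only verifies that the bytes match before sharing them.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FIDEDUPERANGE, struct file_dedupe_range */

/*
 * Hypothetical helper: ask the kernel to share `len` bytes at `offset`
 * between src_fd and dst_fd (same offset in both files).
 * Returns 0 if the range was deduplicated, 1 if the kernel found that
 * the data differs, or a negative errno value on failure.
 */
int dedupe_range(int src_fd, int dst_fd, uint64_t offset, uint64_t len,
                 uint64_t *bytes_deduped)
{
    /* One request header followed by a single destination entry. */
    struct file_dedupe_range *req =
        calloc(1, sizeof(*req) + sizeof(struct file_dedupe_range_info));
    if (req == NULL)
        return -ENOMEM;

    req->src_offset = offset;
    req->src_length = len;
    req->dest_count = 1;
    req->info[0].dest_fd = dst_fd;
    req->info[0].dest_offset = offset;

    int rc;
    if (ioctl(src_fd, FIDEDUPERANGE, req) < 0) {
        rc = -errno;                      /* whole request rejected */
    } else if (req->info[0].status == FILE_DEDUPE_RANGE_DIFFERS) {
        rc = 1;                           /* contents are not identical */
    } else if (req->info[0].status < 0) {
        rc = req->info[0].status;         /* per-destination negative errno */
    } else {
        if (bytes_deduped != NULL)
            *bytes_deduped = req->info[0].bytes_deduped;
        rc = 0;
    }

    free(req);
    return rc;
}
```

A caller would open both files, pick a candidate range (for example by hashing blocks first), and call `dedupe_range(src, dst, 0, length, &done)`. The kernel may deduplicate fewer bytes than requested, so `done` should be checked against `length`.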

So, I'm asking questions because I'm curious and very probably ignorant. I'm not trying to pick a fight, make points, or get in a dick-waving contest. I also don't know how closely you follow btrfs development. I skim the lists from time to time, so if I'm telling you stuff that you already know, or if my memory is not quite correct, I apologize in advance. Also, if you notice this comment after the comment submission window closes, you can reach me at $MY_HN_USERNAME@gmail

What's wrong with how compression works? From a user's perspective, you either set a mount option, set a bit on a file attribute, or explicitly call for compression with btrfs defrag. [0] If you combine the latter two operations into a single checkbox, this is exactly how NTFS handles things. What am I missing here? Also, I can't agree that modern[1] btrfs handles unexpected power cuts poorly. This just hasn't been my experience. I can, however, agree that log replay isn't complete and still needs work.
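
For reference, the "set a bit on a file attribute" route is the same thing `chattr +c` does. A minimal sketch in C, assuming a Linux system with `linux/fs.h`; the function name `request_compression` is made up for illustration, and whether existing data gets rewritten is up to the filesystem (on btrfs, as far as I know, only newly written data is affected, which is why the defrag step exists):

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_COMPR_FL */

/*
 * Hypothetical helper: set the "compress" inode flag on an open file.
 * Returns 0 on success or a negative errno value on failure.
 */
int request_compression(int fd)
{
    int flags = 0;

    /* Read the current inode flags so the other bits are preserved. */
    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
        return -errno;

    flags |= FS_COMPR_FL;

    /* Write the flags back with the compression bit added. */
    if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0)
        return -errno;

    return 0;
}
```

Filesystems that don't support the flag are expected to reject the second ioctl, so the error return is worth checking.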

I can't speak to the ENOSPC bugs, I haven't run into any in a very, very long time. Some time after 3.14, btrfs got a pool of space called the "Global Reserve" [2] which was intended to address ENOSPC issues. Somewhere around that time, btrfs also grew a better btrfs-specific df function invoked with "btrfs filesystem usage". I don't make extensive use of snapshots, but the numbers I get out of btrfs fi usage almost exactly match the numbers I get from plain old df.
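
As a point of comparison for the "plain old df" numbers: df-style totals come from the generic statvfs(3) interface, which knows nothing about btrfs chunk allocation or the Global Reserve. A minimal sketch, with the helper name `df_bytes` being an assumption for illustration:

```c
#include <errno.h>
#include <stdint.h>
#include <sys/statvfs.h>

/*
 * Hypothetical helper: report total size and space available to
 * unprivileged users, in bytes, for the filesystem containing `path`.
 * Returns 0 on success or a negative errno value on failure.
 */
int df_bytes(const char *path, uint64_t *total, uint64_t *avail)
{
    struct statvfs st;

    if (statvfs(path, &st) != 0)
        return -errno;

    /* Block counts are expressed in units of f_frsize. */
    *total = (uint64_t)st.f_blocks * st.f_frsize;
    *avail = (uint64_t)st.f_bavail * st.f_frsize;
    return 0;
}
```

"btrfs filesystem usage" gets its figures from btrfs-specific interfaces instead, so the two can disagree when snapshots or mixed RAID profiles are involved.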

I'm unaware of snapshot automounting. Can you help me understand what this is? (A brief Google search wasn't enlightening.) I agree that taking away the ability to have subvolumes outside the FS tree if you didn't set things up just right at FS creation time is pretty bullshit. That certainly needs to be reworked. Everything else in my limited experience with subvol management seems okay to me. What seems strange to you?

Btrfs doesn't yet have built-in online dedup, but are you aware of duperemove? [3]

Can you give me a recipe to create an N(N-1)/2 sized random access file? I use BTRFS for some very small random-access databases and large pretty-much-append-only databases and haven't run into this behavior.

I have a couple of things to say about your other comment:

People generally don't write to a mailing list to say what a good time they're having with your software.

It's pretty shit that log replay isn't worked out yet.

I certainly don't intend to stop making backups. OTOH, I've been "getting lucky" for five years straight while running a FS that many people regard as the most untrustworthy thing in the world on a drive that many people regard as entirely unreliable and untrustworthy. :)

Cheers!

[0] Conceptually, this makes a lot of sense, 'cause all defrag does is re-write the file. NTFS re-writes files when you request that they be compressed, too.

[1] "Modern" means btrfs from the past two years or so.

[2] Much like ext*'s reserved-for-the-superuser space.

[3] https://github.com/markfasheh/duperemove