This article is full of misinformation, to the point that I'm not sure there are many redeeming points in it; if there are, they're drowned out by the wrong information:
> Out-of-tree and will never be mainlined
So the article is heavily Linux-focused. Fine. To an end user, having a driver that's not part of the mainline kernel sources is hardly a deal breaker. It still ends up running in kernel space, with all the advantages that come with that.
>> Ubuntu ships ZFS as part of the kernel, not even as a separate loadable module. This redistribution of a combined CDDL/GPLv2 work is probably illegal.
Canonical's lawyers have obviously disagreed with this assessment. So far, so good.
>> Red Hat will not touch this with a bargepole.
Red Hat doesn't touch btrfs either. Or basically anything that isn't ext4 or XFS.
>> You could consider trying the fuse ZFS instead of the in-kernel one at least, as a userspace program it is definitely not a combined work.
No, you really should not. zfs-fuse hasn't been maintained in over a decade, doesn't come anywhere close to supporting the features of modern ZFS, and frankly... it's FUSE. It's slow as molasses.
> Slow performance of encryption
>> ZoL did workaround the Linux symbol issue above by disabling all use of SIMD for encryption, reducing the performance versus an in-tree filesystem.
Only partially true: the damage is limited to some metadata structures, and the bulk of the encryption code does use SIMD instructions (e.g., the parts that encrypt your file data).
> Rigid
>> This RAID-X0 (stripe of mirrors) structure is rigid, you can’t do 0X (mirror of stripes) instead at all. You can’t stack vdevs in any other configuration.
Hard to say why that would be useful. Hardly anyone chooses a mirror of stripes even in the non-ZFS world: it's the same capacity tradeoff, but with a worse failure profile and slower rebuilds than a stripe of mirrors.
>> For argument’s sake, let’s assume most small installations would have a pool with only a single RAID-Z2 vdev.
Okay, not an item that can actually be refuted, but the idea that "most small installations" have only a single raidz2 vdev is a stretch. I'd wager there are a heck of a lot more single-disk and two-disk mirror configurations than all the other types combined.
> Can’t add/remove disks to a RAID
Everything here is accurate. Part of it is down to ZFS's original target audience, and part of it is genuinely hard math that hadn't been solved in a way you could actually pull off before the death of the universe. Mind that mdadm isn't exactly magic either -- it will refuse certain reshape operations too, and its documentation isn't exactly upfront about which scenarios those are.
> RAIDZ is slow
Every block in a raidz vdev is spread across all of the vdev's disks, so for random I/O the whole vdev performs roughly like a single disk. This is a major reason people are generally steered toward mirrors instead of raidz when IOPS matter.
> File-based RAID is slow
ZFS does not use file-based RAID, it uses block-based RAID. Yes, it knows what blocks are used and will only need to scrub/resilver those blocks.
>> Sequential read/write is a far more performant workload for both HDDs and SSDs.
Yes, it is. Which is why OpenZFS reworked scrubbing to sort and issue I/O in largely sequential order (the sorted scrub work landed in 0.8, with sequential resilvering added in 2.0).
>> It’s especially bad if you have a lot of small files.
ZFS RAID isn't file-based, so it doesn't matter whether you have a single 2 TB file or a million small files: to a scrub it's the same amount of work either way.
> Real-world performance is slow
Comparing to ext4 isn't entirely fair. ext4 is a dumb filesystem that will give you raw disk performance every time; if that is the utmost priority, use ext4. ZFS adds compression, checksums, and redundancy to the mix, and that extra protection means it's a bit slower.
> Performance degrades faster with low free space
>> It’s recommended to keep a ZFS volume below 80 - 85% usage and even on SSDs. This means you have to buy bigger drives to get the same usable size compared to other filesystems.
Basically every filesystem gets painful at high utilization, and hitting that point should probably be taken as a sign to either upgrade the storage or start deleting things.
The threshold for where ZFS starts getting painful depends on which anecdote you listen to. I've heard from people running up to 95% utilization on an SSD without feeling the burn.
>> ZFS’s problem is on an entirely different level because it does not have a free-blocks bitmap at all.
ZFS has always tracked free space with space maps rather than a free-blocks bitmap, and those got better with the spacemap_histogram feature in 0.6.4 and spacemap_v2 in 0.8.0. If the output of the "zpool list" command shows a FRAG value other than a hyphen, your pool is already using this.
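If you want to check, it only takes a second (the pool name "tank" below is just a placeholder):

    # FRAG is the fragmentation metric derived from the spacemap histograms;
    # a hyphen there means the spacemap_histogram feature isn't active on this pool.
    zpool list tank
    zpool get fragmentation,capacity tank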
> Layering violation of volume management
Basically "but muh Unix philosophy" argument. It's tiresome :)
In order for ZFS to do what it does so well, it has to incorporate features that used to live in different layers. The reason resilvering a ZFS pool is so much faster than the whole-disk RAID solutions of yore? Precisely because it knows exactly which blocks are in use and which are not.
>> If you use ZFS’s volume management, you can’t have it manage your other drives using ext4, xfs, UFS, ntfs filesystems.
You can create ZFS volumes (zvols) and put ext4, xfs, UFS, or NTFS filesystems on top of them just fine.
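A minimal sketch, assuming a pool called "tank" and a 100G volume (the names are made up):

    # Create a fixed-size ZFS volume (zvol) and format it with ext4.
    zfs create -V 100G tank/vol-ext4
    mkfs.ext4 /dev/zvol/tank/vol-ext4
    mount /dev/zvol/tank/vol-ext4 /mnt/legacy

The zvol still gets ZFS checksums, compression, and snapshots underneath, whatever filesystem sits on top.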
>> And likewise you can’t use ZFS’s filesystem with any other volume manager.
You can, but you probably shouldn't.
> Doesn’t support reflink
There's work in progress to add it (the block cloning feature), but it hasn't shipped in a release yet. That doesn't mean it never will.
> High memory requirements for dedupe
Indeed, and there are research efforts aimed at a new dedup design that would drastically reduce the memory requirement. Don't use dedup unless you really, really need it.
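If you're tempted anyway, you can estimate the cost before enabling it. A sketch, with "tank" as a placeholder pool, and the ~320 bytes per unique block figure being the usual rule of thumb rather than a guarantee:

    # Simulate dedup on the existing data and print the DDT histogram it would produce.
    zdb -S tank
    # Rough RAM estimate: (unique blocks reported) x ~320 bytes, which effectively
    # needs to stay resident in ARC for dedup to perform acceptably.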
> Dedupe is synchronous instead of asynchronous
ZFS dedup is inline by design: it works on live data as it is written, so the space savings happen immediately rather than in a later pass.
>> (By comparison, btrfs’s deduplication and Windows Server deduplication run as a background process, to reclaim space at off-peak times.)
Sometimes "off-peak times" don't exist, and this just highlights the limitations of the two mentioned technologies: they don't have an online dedup mode. I know at least for btrfs, you have to completely take the file system offline to do a dedup pass after-the-fact (and the end result is the same as the aforementioned reflink).
> High memory requirements for ARC
Basically a long rundown of Linux's own page cache fighting with the ARC. It's actually a fair point, but it's probably overblown; it's nowhere near as dire as the section makes it out to be.
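If it does bother you, capping the ARC is a one-liner. A minimal sketch for Linux, assuming you want roughly an 8 GiB ceiling (the value is in bytes):

    # Runtime change; takes effect immediately, lost on reboot.
    echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max

    # Persistent version, read when the zfs module loads (/etc/modprobe.d/zfs.conf):
    options zfs zfs_arc_max=8589934592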
>> Even on FreeBSD where ZFS is supposedly better integrated, ZoL still pretends that every OS is Solaris via the Solaris Porting Layer (SPL) and doesn’t use their page cache neither. This design decision makes it a bad citizen on every OS.
FreeBSD's technical implementation of how ZFS was ported doesn't really matter, and this is the second time the article has said "supposedly better integrated". It's not "supposedly", it's literally as well integrated as ZFS on Solaris. (Guess what? Solaris still has UFS too, so it's pretty much on par with FreeBSD in that respect.)
> Buggy
The flimsiest argument of them all :)
Bug trackers track bugs, and some of those entries aren't even bugs (such is the nature of user-submitted reports). If anything, it goes more to show the popularity and widespread use of ZFS than anything else.
I'd be far more concerned about a software project that has no bug reports on display at all.
> No disk checking tool (fsck)
>> Yikes.
There is, it's called scrubbing. See "man zpool-scrub" for details.
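A minimal example, assuming a pool named "tank":

    # Walk every used block and verify it against its checksums, repairing from
    # redundancy where possible. Runs online, while the pool stays in use.
    zpool scrub tank
    # Watch progress and see anything it found or repaired.
    zpool status tank
    # One-line health summary across all pools.
    zpool status -x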
>> In ZFS you can use zpool clear to roll back to the last good snapshot, that’s better than nothing.
"zpool clear" is an administrative command to wipe away error reports from storage devices. It should only be used when an administrator determines that the problem is not a bad disk.
A ZFS pool can contain many filesystems, each with whatever snapshots you like; there is no single "last good snapshot". Maybe zpool checkpoints are what the author is thinking of, but I doubt it.
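For completeness, a pool checkpoint is the closest thing to a whole-pool "roll back to a known good state". A sketch, again with "tank" as a placeholder:

    # Take a pool-wide checkpoint before a risky operation.
    zpool checkpoint tank
    # Rewinding to it requires an export and a rewind import.
    zpool export tank
    zpool import --rewind-to-checkpoint tank
    # Or discard the checkpoint once you're happy.
    zpool checkpoint -d tank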
>> merely rolling back to the last good snapshot as above does not verify the deduplication table (DDT) and this will cause all snapshots to be unmountable
That really should be impossible. I've never even heard of such a thing happening.
>> coupled with the above point (“Buggy”) if ZFS writes bad data to the disk or writes bad metaslabs, this is a showstopper
Errors like this are detected and reported by the "zpool status" command (and, as mentioned, "zpool clear" can even clear the error counters afterwards).
>> and so it should have an fsck.zfs tool that does more repair steps than just exit 0.
You could replace it with one that does a scrub, but that can take weeks on some pools :)
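If someone really wanted an fsck.zfs that does something, a wrapper along these lines would be the obvious shape of it. Purely a sketch (and "zpool wait" needs OpenZFS 2.0 or newer):

    #!/bin/sh
    # Hypothetical fsck.zfs replacement: scrub the named pool instead of exiting 0.
    pool="$1"
    zpool scrub "$pool" || exit 1
    # Block until the scrub completes, then report overall health.
    zpool wait -t scrub "$pool"
    zpool status -x "$pool"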
> Things to use instead
>> The baseline comparison should just be ext4. Maybe on mdadm.
If you think mdadm+ext4 is comparable to ZFS, you are waaaaay off. Even btrfs can't hold a candle to ZFS and it comes closer than mdadm+ext4.
In this section the author comes across the term "scrubbing" but doesn't really apply it in the sense ZFS uses it.
>> Compression is usually not worthwhile
Hard disagree :)
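lz4 (and zstd in newer releases) is cheap enough to be effectively free for most workloads, and you can see exactly what it buys you. A quick sketch, with "tank" as a placeholder pool:

    # Enable inline compression; child datasets inherit it.
    zfs set compression=lz4 tank
    # Check what it's actually saving.
    zfs get compression,compressratio tank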
>> Checksumming is usually not worthwhile
Disks lie. All the time. The claim that "the physical disk already has CRC checksums" was around over 20 years ago when the ZFS project started, and the fact that disks lie, or don't provide strong protection, is a huge reason ZFS was created in the first place. The problem in 2022 remains the same as it was in 2000.
> Summary
>> you can achieve all the same nice advanced features
You really can't. It's not even close. ZFS is so far ahead of the game, that even if some alternatives (eg, btrfs) offer a few similar features, they don't even approach it.
>> ZFS also has a lot of tuning parameters to set.
Having tuning parameters and requiring them are two different things.
>> In the future we’re waiting to see what stratis
stratis is a dead-on-arrival joke.
>> bcachefs
Probably the only thing that has a shot at competing with ZFS.
You seem very knowledgeable about this. I'm a hobbyist who runs debian with a 6 drive raidz2 array in the basement. Hardware aside, do you have any housekeeping suggestions that will help me keep it running well?
My approach is a login script that tells me the health of my zpool. My crontab has a "0 2 * * 0 /sbin/zpool scrub tank" and 6 variants of "0 2 2 * * /usr/sbin/smartctl --test=long /dev/sda &> /dev/null". I've learned to resilver a dead drive recently, it's on a UPS, and I have automated iterative backups elsewhere for critical data that I've practiced restoring from to verify my solution works. Never had much luck with email alerts unless I want to get all of crontab's emails sent to me.
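Spelled out, the relevant crontab entries look roughly like this (the pool name and device names are obviously specific to my box, with one smartctl line per drive):

    # Weekly scrub, Sunday at 02:00.
    0 2 * * 0  /sbin/zpool scrub tank
    # Monthly long SMART self-test per member disk.
    0 2 2 * *  /usr/sbin/smartctl --test=long /dev/sda > /dev/null 2>&1
    0 2 2 * *  /usr/sbin/smartctl --test=long /dev/sdb > /dev/null 2>&1
    # ...and so on for the remaining four drives.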
I'm using https://habilis.net/cronic/ to make sure I don't mess up the email notification part of the cronjob. It's a simple wrapper script that sends an email in a readable format if a cronjob fails.
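In crontab terms it's just a prefix, something like this (the MAILTO address and paths are examples):

    MAILTO=you@example.com
    # cronic stays silent on success and emails a readable report only when the job fails.
    0 2 * * 0  cronic /sbin/zpool scrub tank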
I use Sanoid to create snapshots on my home server, and use Syncoid to push those to a cloud VPS with a beefy network drive as an off-site backup. Both tools are available here: https://github.com/jimsalterjrs/sanoid
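Roughly, that's a small sanoid.conf for snapshot policy plus a syncoid push from cron; the dataset names, host, and retention numbers below are just examples:

    # /etc/sanoid/sanoid.conf
    [tank/data]
        use_template = production

    [template_production]
        hourly = 24
        daily = 30
        monthly = 12
        autosnap = yes
        autoprune = yes

    # cron: replicate to the off-site VPS once a night.
    0 3 * * *  syncoid tank/data backup@vps.example.com:backuppool/data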
The free tier of https://cronitor.io/ makes sure I'm alerted if a cronjob fails, or fails to run on time. Especially that last bit is interesting: that way I'm sure cronjobs aren't silently failing for days/weeks on end.
I have 4 monitors set up in Cronitor: snapshot creation, zpool status on the local and backup machine, and send/receive with Syncoid. This is how that looks on the Cronitor dashboard: https://img.marceldegraaf.net/v6IpNAyxrZ54vgLqIJpY
Let me know if you want more info or examples, happy to share whatever I can to help :-)
EDIT: feel free to reach out via email as well, my address is in my profile.