What does HackerNews think of openzfs-nvme-databases?

I recently rebuilt a load of infrastructure (mainly LAMP servers) and decided to back them all with ZFS on Linux for the benefit of efficient backup replication and encryption.

I've been using ZFS in combination with rsync for backups for a long time, so I was fairly comfortable with it. It all worked out, but it was a far bigger time sink than I expected, because I wanted to do it right, and there is a lot of misleading advice on the web, particularly around running databases and replication.

For databases (you really should do at least basic tuning, like block size alignment), by far the best resource I found for MariaDB/InnoDB is from the Let's Encrypt people [0]. They give reasons for everything and cite multiple sources, which is gold. Elsewhere on the web you will find endless contradictory advice, anecdotes, and myths accompanied by incomplete and baseless theories. Ultimately you should also test this stuff and understand everything you tune (it's fine to decide not to tune something).
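To make "block size alignment" concrete, here is a sketch of the kind of dataset tuning that gets discussed for InnoDB on ZFS. The pool and dataset names are made up, and the values are illustrative, not a recommendation; the Let's Encrypt doc explains its own choices with citations, and you should test on your workload.

```shell
# Hypothetical dataset layout for a MariaDB/InnoDB server (names are made up).
# InnoDB uses 16 KiB pages by default, so recordsize=16k aligns ZFS records
# with database pages and avoids read-modify-write amplification.
zfs create -o recordsize=16k -o atime=off -o compression=lz4 tank/mysql/data

# Redo and binary logs are written sequentially in large chunks,
# so a larger recordsize tends to suit them better.
zfs create -o recordsize=128k -o atime=off -o compression=lz4 tank/mysql/logs
```

The general principle is one dataset per access pattern, so each can get its own `recordsize` and properties.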

For replication, I can only recommend the man pages... yeah, really! ZFS gives you solid replication tools, but they are agnostic to a fault; they are like git plumbing. They don't assume you're going to be doing it over SSH (even though that's almost always how it's used), so you have to plug it together yourself. That feels scary at first, especially because you probably want it automated, which means considering edge cases, which is why everyone runs to something like syncoid.

But there's something horrible I discovered about replication scripts like syncoid: they don't use ZFS send's -R (--replicate) mode! They try to reimplement it in Perl, for "greater flexibility", but incompletely. This is maddening when you test a fresh restore for the first time and find that all of the encryption roots break and not all dataset properties are synced. ZFS takes care of all of this if you simply use the built-in recursive replicate option. It's not that hard to script manually once you commit to it. Keep it simple, don't add a bunch of unnecessary crap into the pipeline like syncoid does (it actually slows things down if you test); just use pv to monitor progress and it will fly.
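For reference, the bare-bones pipeline being described looks something like this. The dataset, pool, and host names are hypothetical; adapt them, and read zfs-send(8)/zfs-receive(8) before trusting any of it with real data.

```shell
# Minimal full-replication sketch over SSH (names are made up).
# -R sends the whole dataset tree: snapshots, properties, and clones.
# -w sends encrypted datasets raw, so keys never leave the source
#    and encryption roots survive a restore.
SNAP="tank/data@$(date +%Y%m%d%H%M%S)"
zfs snapshot -r "$SNAP"
zfs send -R -w "$SNAP" | pv | ssh backup-host zfs receive -u -d backuppool

# Incremental follow-ups send only the delta between two snapshots:
# zfs send -R -w -I tank/data@old tank/data@new | pv | \
#   ssh backup-host zfs receive -u -d backuppool
```

The automation work is mostly around the edges: choosing and pruning snapshots, and handling a receive that was interrupted mid-stream.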

I might publish my replication scripts at some point because I feel like there are no good functional reference scripts for this stuff that deal with the basics without going nuts and reinventing replication badly like so many others.

[0] https://github.com/letsencrypt/openzfs-nvme-databases

I'm more interested in how they used ZFS to provide redundancy. I always thought ZFS was optimized for spinning platters, with SSDs used for persistent caching. In this scenario they set up all their SSDs in mirrored pairs and then striped across them (RAID-10), with no separate log device (SLOG).
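A layout like that is expressed in ZFS as multiple mirror vdevs in one pool; ZFS stripes writes across vdevs automatically. This is a hypothetical sketch with invented device names, not their actual command:

```shell
# RAID-10-style pool: two mirrored pairs, striped across by ZFS.
# ashift=12 forces 4 KiB-aligned allocation, matching NVMe sector sizes.
zpool create -o ashift=12 db \
  mirror /dev/nvme0n1 /dev/nvme1n1 \
  mirror /dev/nvme2n1 /dev/nvme3n1
# Note there is no "log" vdev: without a separate SLOG, the ZIL lives on
# the pool's own NVMe vdevs, which are already fast for synchronous writes.
```

On all-flash pools a separate SLOG buys little, since its whole point is to give slow data disks a fast place to land synchronous writes.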

They've tweaked a few other settings as well [1].

I'd be curious to see more benchmarks and latency data (especially as they're using compression, and of course ZFS computes checksums over all data, not just metadata as some other filesystems do).

[1] https://github.com/letsencrypt/openzfs-nvme-databases

I'm thankful for their OpenZFS tuning doc which they developed as part of this server migration: https://github.com/letsencrypt/openzfs-nvme-databases

The one thing that I get hung up on when it comes to RAID and SSDs is the wear pattern vs. HDDs. Take for example this quote from the README.md:

> We use RAID-1+0, in order to achieve the best possible performance without being vulnerable to a single-drive failure.

Failure on SSDs is predictable and usually expressed in Terabytes Written (TBW). Failure on spinning HDDs is comparatively random. To my mind, it makes sense to mirror SSD-based vdevs only for performance, not for data integrity: both halves of a mirror receive identical writes, so they are expected to fail after roughly the same TBW, which makes the availability/redundancy guarantee of mirroring relatively unreliable.
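The back-of-envelope endurance math behind this concern is simple. With illustrative (made-up) numbers:

```shell
# Hypothetical figures: a drive rated for 7000 TBW in a mirror
# that absorbs 2 TB of writes per day. Both mirror halves see the
# same writes, so both approach the rating at the same time.
TBW=7000
DAILY_TB=2
echo $(( TBW / DAILY_TB ))   # days until rated endurance for BOTH drives
```

In practice TBW ratings are conservative and drives rarely fail at exactly the rated figure, which blunts the simultaneous-failure argument somewhat.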

Maybe someone with more experience in this area can change my mind, but if it were up to me, I would have configured the mirror drives as spares, and relied on a local HDD-based zpool for quick backup/restore capability. I imagine that would be a better solution, although it probably wouldn't have fit into tryingq's ideal 2U space.
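One practical mitigation, whatever the topology, is to monitor wear divergence between mirror halves. NVMe drives report a self-estimated "Percentage Used" in their SMART health log; this sketch (device names invented) assumes smartmontools is installed:

```shell
# Compare endurance consumption across the drives of a pool.
# "Percentage Used" is the drive's own estimate of consumed endurance;
# it can exceed 100% on drives that outlive their rating.
for dev in /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1; do
  echo "$dev: $(smartctl -A "$dev" | grep 'Percentage Used')"
done
```

If the numbers climb in lockstep, you know to schedule replacements proactively rather than waiting for a double failure inside one mirror.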

> The Let's Encrypt post does not describe how they implement off-machine and off-site backup-and-recovery. I'd like to know if and how they do this.

The section:

> There wasn’t a lot of information out there about how best to set up and optimize OpenZFS for a pool of NVMe drives and a database workload, so we want to share what we learned. You can find detailed information about our setup in this GitHub repository.

points to: https://github.com/letsencrypt/openzfs-nvme-databases

Which states:

> Our primary database server rapidly replicates to two others, including two locations, and is backed up daily. The most business- and compliance-critical data is also logged separately, outside of our database stack. As long as we can maintain durability for long enough to evacuate the primary (write) role to a healthier database server, that is enough.

Which sounds like a traditional master/slave setup with failover?

I enjoyed it as well, and I'm appreciative that they shared their configuration notes. I've been running multiple data stores on ZFS for years now, and it's taken a while to get out of the hardware RAID mindset (although you still need a nice beefy controller anyway).

https://github.com/letsencrypt/openzfs-nvme-databases