Rather silly title, but that's true of many papers reusing this meme.

The abstract presents an alternative to wear leveling that they refer to as "capacity variance", also known as breaking compatibility with all previous filesystems, drivers, and bootloaders. Wear leveling has been a necessary evil, because without it flash storage simply would not be usable at all as a hard drive replacement/successor. Nowadays some of the pieces are in place to allow software to interact with SSDs in a more natural fashion, but it still isn't possible to completely forgo the FTL that emulates hard drive behavior.

Maybe:

① when we really need wear leveling, for compatibility with existing filesystems, we should implement it as a kernel module instead of in the SSD firmware? Then we can debug the failures it causes, tune its performance to our use case, and tune the software running on top of it to perform well with its strengths and weaknesses (a toy sketch of the mechanism follows the list below). Compatibility with previous bootloaders isn't important because writing a bootloader for a given piece of hardware is a one-afternoon task.

② The nearly 3× increase in filesystem capacity the authors report might be worth a significant amount of software work?

③ Hard disks are so different from SSDs that we're probably paying much bigger penalties than that 3× for forcing SSDs to pretend to be hard disks?
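Here's the toy sketch promised in ①. The essential mechanism of a wear leveler is just a logical-to-physical remap table plus per-block erase counts, with every rewrite steered to the least-worn free erase block. This is Python purely for legibility, not kernel code, and the flash object standing in for a raw-NAND interface is hypothetical:

    # Toy wear leveler: just the shape of the mechanism.  It conflates pages
    # with erase blocks and ignores crash recovery, bad blocks, etc.
    import heapq

    class WearLeveler:
        def __init__(self, n_physical_blocks):
            self.remap = {}                        # logical block -> physical block
            self.erase_count = [0] * n_physical_blocks
            self.free = [(0, pb) for pb in range(n_physical_blocks)]
            heapq.heapify(self.free)               # min-heap keyed on erase count

        def write(self, lb, data, flash):          # flash: hypothetical raw-NAND driver
            erases, pb = heapq.heappop(self.free)  # recycle the least-worn free block
            flash.erase(pb)
            flash.program(pb, data)
            self.erase_count[pb] = erases + 1
            old = self.remap.get(lb)
            if old is not None:                    # the stale copy becomes free again
                heapq.heappush(self.free, (self.erase_count[old], old))
            self.remap[lb] = pb

        def read(self, lb, flash):
            return flash.read(self.remap[lb])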

To be concrete, hard disks are about eighty thousand times slower for random access than modern SDRAM: 8000 μs versus 0.1 μs. Their sequential transfer rate is maybe 250 megabytes per second, which means that the "bandwidth-delay product" is about 2 megabytes; if you're accessing data in chunks of less than 2 megabytes, you're wasting most of your disk's time seeking instead of transferring data. 25 years ago this number was about 256K.
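Spelled out, with the same round numbers (not measurements of any particular drive):

    # Rough bandwidth-delay product for a spinning disk, using the round
    # numbers above rather than measurements of any particular drive.
    seek_s      = 8000e-6                # ~8 ms per random access
    disk_bw_Bps = 250e6                  # ~250 MB/s sequential

    print(seek_s * disk_bw_Bps / 1e6)    # -> 2.0 megabytes

    # A 64-KiB read spends almost all of its time seeking:
    xfer_s = 64 * 1024 / disk_bw_Bps
    print(xfer_s / (xfer_s + seek_s))    # -> ~0.03, i.e. only ~3% transferring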

A Samsung 970 Evo 1TB SSD over NVMe PCIe3 can peak at 100k iops and 6.5 GB/s (personal communication), which works out to a bandwidth-delay product of about 64 kilobytes. I've heard that other SSDs can commonly sustain full read bandwidth down to 4KiB per I/O operation. But writes are an order of magnitude slower (100 μs instead of 10 μs) and erases are an order of magnitude slower than that (1000 μs instead of 100 μs). So, aside from the shitty endurances down in the mere hundreds of writes, even when you aren't breaking it, the SSD's performance characteristics are very, very different from a disk's. If you drive it with systems software designed for a disk, you're probably wasting most of its power.
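Same arithmetic for the SSD (rough peak figures, not sustained measurements):

    # Same calculation with the SSD figures quoted above.
    iops       = 100e3
    ssd_bw_Bps = 6.5e9
    latency_s  = 1 / iops                    # ~10 us per operation at that rate

    print(latency_s * ssd_bw_Bps / 1024)     # -> ~63 KiB bandwidth-delay product

    # Ratio of the disk's ~2 MB bandwidth-delay product to the SSD's, at
    # 64-KiB and 4-KiB I/O sizes:
    print(2e6 / (64 * 1024), 2e6 / 4096)     # -> ~31 and ~488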

For example:

ⓐ because the SSD's absolute bandwidth into memory is on the order of 10% of memcpy bandwidth, and 30× the bandwidth from a disk, lots of architectural choices that prefer caching stuff in RAM pay off much more poorly with an SSD (possibly worsening performance rather than improving it), and architectural choices that put one or more memcpys on the data path cost proportionally much more.

ⓑ because the "bandwidth-delay product" is 30–500 times smaller, many design tradeoffs made to reduce random seeks on disks are probably a bad deal. At the very least you want to reduce the size of your B-tree nodes (a back-of-the-envelope version of this is sketched after this list).

ⓒ because writes are so much more expensive than reads, design tradeoffs that increase writes in order to reduce reads (for example, to sort nearby records together in a database, or defrag your disk) are very often a bad deal on SSDs.

ⓓ because erases (which have to cover large contiguous areas in order to get acceptable performance) are so much more expensive than reads or writes, sequential writes are enormously cheaper than random writes, but in a different way than on disks. Shingled (SMR) hard disks have a similar kind of issue.

ⓔ we have the ridiculous situation described in the paper where the SSD wastes 65% of its storage capacity in order to perpetuate the illusion that it's just a really fast disk. 65%! (If the simulator they used is correct.)
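Here's the back-of-the-envelope version of ⓑ promised above: the usual B-tree cost model, lookup time ≈ depth × (latency + node size ÷ bandwidth), plugged full of the rough device numbers from earlier. The billion keys and 16-byte entries are assumptions for illustration, not figures from the paper:

    # Back-of-the-envelope B-tree lookup cost: depth * (latency + node/bandwidth),
    # ignoring caching of the upper levels.  A billion 16-byte entries is an
    # assumed workload, not a figure from the paper.
    from math import ceil, log

    def lookup_s(node_bytes, latency_s, bw_Bps, n_keys=1e9, entry_bytes=16):
        fanout = node_bytes / entry_bytes
        depth = ceil(log(n_keys) / log(fanout))
        return depth * (latency_s + node_bytes / bw_Bps)

    disk = (8000e-6, 250e6)              # ~8 ms, ~250 MB/s
    ssd  = (10e-6, 6.5e9)                # ~10 us, ~6.5 GB/s

    for node in (4096, 64 * 1024, 512 * 1024):
        print(node,
              round(lookup_s(node, *disk) * 1e3, 1), "ms on disk,",
              round(lookup_s(node, *ssd) * 1e6, 1), "us on SSD")
    # -> the disk keeps getting cheaper with bigger nodes (fewer seeks), while
    #    the SSD is already cheapest at 4-KiB nodes.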

Where / how did you get those spiffy enumerators?

http://canonical.org/~kragen/setting-up-keyboard maps my right Alt to Compose, and https://github.com/kragen/xcompose maps Compose ( 1 ) to ①, etc.
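The lines in that xcompose file follow the standard ~/.XCompose pattern, something like:

    <Multi_key> <parenleft> <1> <parenright> : "①" U2460
    <Multi_key> <parenleft> <2> <parenright> : "②" U2461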