Another hidden issue is that as a container gets close to running out of memory, it furiously drops read-only pages from memory, only to need to read some of them back in moments later.

This pathological swapping behavior can impact other workloads on the system.

cgroups2 has better protections against this behavior.

How can you use cgroups2 for this? I know that BSD resource limits are basically useless here, as they only let you limit virtual memory, not RSS.
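
For what it's worth, here is a minimal sketch of the cgroup v2 side of this, not a complete answer. It assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup, root privileges, and the memory controller enabled for the parent group. The helper names and the `demo` group are hypothetical; the `memory.high`, `memory.max` and `cgroup.procs` files are the real kernel interface. `memory.high` is the throttling threshold (above it the group is slowed down and reclaimed from), and `memory.max` is the hard limit.

```python
from pathlib import Path


def set_memory_limits(cgroup_dir, high_bytes, max_bytes):
    """Write cgroup v2 memory limits into an already existing cgroup directory.

    cgroup_dir is a path such as /sys/fs/cgroup/demo (hypothetical group name).
    """
    if high_bytes > max_bytes:
        raise ValueError("memory.high should not exceed memory.max")
    cg = Path(cgroup_dir)
    # Throttle point: above this the kernel reclaims from this group and slows it down.
    (cg / "memory.high").write_text(f"{high_bytes}\n")
    # Hard limit: if usage cannot be kept below this, the group gets OOM-killed.
    (cg / "memory.max").write_text(f"{max_bytes}\n")


def add_process(cgroup_dir, pid):
    """Move a process into the cgroup by writing its PID to cgroup.procs."""
    (Path(cgroup_dir) / "cgroup.procs").write_text(f"{pid}\n")


# Example (needs root and a real cgroup v2 mount, so it is left commented out):
#   Path("/sys/fs/cgroup/demo").mkdir(exist_ok=True)
#   set_memory_limits("/sys/fs/cgroup/demo", 400 * 1024**2, 512 * 1024**2)
#   add_process("/sys/fs/cgroup/demo", 12345)
```

Setting `memory.high` somewhat below `memory.max` is the part that matters for the thrashing case: the offending group gets throttled before it hits the hard limit, rather than dragging the rest of the system down with it.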

EarlyOOM [1] is a configurable daemon that kills processes early enough to (hopefully) prevent thrashing. I'm using it on my Linux desktops, where it has proven to catch my own programs' runaway memory usage before it risks locking up the development machine, but it may also be useful on servers. It logs to syslog and can also be configured to run a program on kill events.

[1] https://github.com/rfjakob/earlyoom, https://launchpad.net/ubuntu/+source/earlyoom, https://packages.debian.org/search?keywords=earlyoom
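
To illustrate the general idea of a user space "kill early" check, here is a rough sketch. It is not EarlyOOM's actual code: the function names are made up and the 10% thresholds are placeholder assumptions. It only covers the decision of when memory is critically low, based on the `MemAvailable`, `MemTotal`, `SwapFree` and `SwapTotal` fields of /proc/meminfo. Picking and killing a victim is left out.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field name: value in kB}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def percent(part, total):
    """Return part as a percentage of total, or 0.0 if total is zero."""
    return 100.0 * part / total if total > 0 else 0.0


def memory_is_critical(info, mem_min_pct=10.0, swap_min_pct=10.0):
    """True if both available RAM and free swap are at or below their thresholds.

    The threshold defaults are illustrative assumptions. With no swap
    configured, free swap counts as 0%, so only RAM decides.
    """
    mem_pct = percent(info["MemAvailable"], info["MemTotal"])
    swap_pct = percent(info.get("SwapFree", 0), info.get("SwapTotal", 0))
    return mem_pct <= mem_min_pct and swap_pct <= swap_min_pct


# A daemon would poll this in a loop, for example:
#   with open("/proc/meminfo") as f:
#       if memory_is_critical(parse_meminfo(f.read())):
#           ...  # choose a victim process and signal it
```

Requiring both RAM and swap to be low is a design choice in this sketch: it avoids killing anything while there is still swap to fall back on, at the cost of reacting later on machines with a lot of swap.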

(Why would a user space OOM killer be necessary if the kernel has better information about the state of the world? I don't know the details, but my interpretation is that because people disliked OOM killing, the kernel devs made the kernel OOM killer trigger so late that it is largely useless. If that's true and thus a social problem, maybe it needs to be solved on that level, too.)

BTW, in my experience, Linux 2.2 used to handle out-of-memory situations much more gracefully than any later kernel version.