I've worked on scheduling bugs in other kernels before (Linux is not an outlier here). The key metric we keep an eye on is run queue latency, to detect when threads are waiting longer than one would expect. There are many ways to measure it; my most recent is runqlat from the bcc/BPF tools, which shows it as a histogram, e.g.:

   # ./runqlat 
   Tracing run queue latency... Hit Ctrl-C to end.
   ^C
        usecs               : count     distribution
            0 -> 1          : 233      |***********                             |
            2 -> 3          : 742      |************************************    |
            4 -> 7          : 203      |**********                              |
            8 -> 15         : 173      |********                                |
           16 -> 31         : 24       |*                                       |
           32 -> 63         : 0        |                                        |
           64 -> 127        : 30       |*                                       |
          128 -> 255        : 6        |                                        |
          256 -> 511        : 3        |                                        |
          512 -> 1023       : 5        |                                        |
         1024 -> 2047       : 27       |*                                       |
         2048 -> 4095       : 30       |*                                       |
         4096 -> 8191       : 20       |                                        |
         8192 -> 16383      : 29       |*                                       |
        16384 -> 32767      : 809      |****************************************|
        32768 -> 65535      : 64       |***                                     |
I'll also use metrics that sum it by thread to estimate the potential speedup (which helps quantify the issue), and do sanity tests.
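
In essence, a tool like runqlat timestamps a thread when it's woken and takes the delta when it's next switched onto a CPU. Here's a simplified bcc sketch of that idea (my own rough sketch, not the real runqlat, which also handles sched_wakeup_new and threads preempted while still runnable):

    #!/usr/bin/env python
    # Simplified run queue latency sketch using bcc (run as root).
    from time import sleep
    from bcc import BPF

    prog = """
    #include <uapi/linux/ptrace.h>

    BPF_HASH(start, u32, u64);    // pid -> wakeup timestamp (ns)
    BPF_HISTOGRAM(dist);          // log2 histogram of latency (us)

    TRACEPOINT_PROBE(sched, sched_wakeup) {
        u32 pid = args->pid;
        u64 ts = bpf_ktime_get_ns();
        start.update(&pid, &ts);
        return 0;
    }

    TRACEPOINT_PROBE(sched, sched_switch) {
        u32 pid = args->next_pid;             // thread chosen to run next
        u64 *tsp = start.lookup(&pid);
        if (tsp == 0)
            return 0;                         // missed the wakeup
        u64 delta = bpf_ktime_get_ns() - *tsp;
        dist.increment(bpf_log2l(delta / 1000));
        start.delete(&pid);
        return 0;
    }
    """

    b = BPF(text=prog)
    print("Tracing run queue latency... Hit Ctrl-C to end.")
    try:
        sleep(99999999)
    except KeyboardInterrupt:
        pass
    b["dist"].print_log2_hist("usecs")

print_log2_hist() emits the same style of histogram shown above.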

Note that this isolates one issue -- wait time in the scheduler -- whereas NUMA and scheduling also affect memory placement, so application runtime can also suffer from higher-latency memory I/O when accessing remote memory. I like to measure and isolate that separately, using PMCs.
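
For example, one rough way to look at the remote-memory side with PMCs is perf's generic "node" events; this is just a sketch that wraps perf stat, and the event names are illustrative (they vary by CPU, so check perf list first):

    #!/usr/bin/env python
    # Rough sketch: count NUMA-node load events for a running process using
    # perf stat. On CPUs that support these generic events, node-load-misses
    # roughly means loads served by a remote node; verify with `perf list`.
    import subprocess
    import sys

    pid = sys.argv[1]    # PID of the workload to inspect (placeholder)
    subprocess.run([
        "perf", "stat",
        "-e", "node-loads,node-load-misses",
        "-p", pid,
        "sleep", "10",   # observe for 10 seconds
    ], check=True)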

I haven't generally seen such severe scheduling issues on our 1- or 2-node Linux systems, although they are testing on 8-node systems, which may exacerbate the issue. Whatever the bugs are, though, I'll be happy to see them fixed, and it may help encourage people to upgrade to newer Linux kernels (which come with other benefits, like BPF).

I assume BPF here means Berkeley Packet Filter, or more likely eBPF (extended Berkeley Packet Filter), just to save anyone else having to look it up. The first link below is the tools; the second is background on BPF.

https://github.com/iovisor/bcc

https://en.wikipedia.org/wiki/Berkeley_Packet_Filter