Conversely, we found in a microbenchmark the other day that allowing THP more than doubled the speed.

Note that glibc lets you turn THP on and off per process, which is pretty useful for benchmarking whether it helps or hinders performance.

  $ hyperfine ' nbdkit -U - data "1 * 10737418240" --run exit '
  Benchmark 1:  nbdkit -U - data "1 * 10737418240" --run exit
    Time (mean ± σ):   3.658 s ±  0.049 s    [User: 0.406 s, System: 3.242 s]
    Range (min … max):    3.576 s …  3.713 s    10 runs

  $ hyperfine ' GLIBC_TUNABLES=glibc.malloc.hugetlb=1 nbdkit -U - data "1 * 10737418240" --run exit '
  Benchmark 1:  GLIBC_TUNABLES=glibc.malloc.hugetlb=1 nbdkit -U - data "1 * 10737418240" --run exit
    Time (mean ± σ):   1.655 s ±  0.007 s    [User: 0.299 s, System: 1.350 s]
    Range (min … max):    1.643 s …  1.666 s    10 runs
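
To sanity-check that a run actually got huge pages (rather than just timing it), you can read the AnonHugePages counter from /proc/<pid>/smaps_rollup (available since Linux 4.14). A minimal C sketch for the current process:

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      /* smaps_rollup sums the per-mapping counters for the whole
         process; AnonHugePages is the amount backed by THP. */
      FILE *f = fopen("/proc/self/smaps_rollup", "r");
      if (!f) { perror("smaps_rollup"); return 1; }

      char line[256];
      while (fgets(line, sizeof line, f))
          if (strncmp(line, "AnonHugePages:", 14) == 0)
              fputs(line, stdout);  /* e.g. "AnonHugePages:   524288 kB" */

      fclose(f);
      return 0;
  }
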
We achieved a more than 3x speedup using "on-demand" transparent huge pages in ClickHouse[1] for a very narrow use case: random access to a hash table that does not fit in the L3 cache but is not much larger.
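
"On-demand" here refers to madvise(MADV_HUGEPAGE) on just the table's arena, with the system-wide THP policy left at "madvise". A minimal sketch of the mechanism (not the actual ClickHouse allocator; the 256 MiB size is an arbitrary stand-in for "bigger than L3 but not much bigger"):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sys/mman.h>

  int main(void)
  {
      size_t len = 256UL * 1024 * 1024;

      /* Anonymous arena that will hold the hash table. */
      void *arena = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (arena == MAP_FAILED) { perror("mmap"); return 1; }

      /* Opt this one region in to transparent huge pages; if the
         kernel refuses, we simply fall back to 4 KiB pages. */
      if (madvise(arena, len, MADV_HUGEPAGE) != 0)
          perror("madvise(MADV_HUGEPAGE)");

      /* ... build and probe the hash table inside `arena` ... */

      munmap(arena, len);
      return 0;
  }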

But there was a surprise... after a few days in production, overall Linux server performance degraded by more than 10x due to increased physical memory fragmentation: https://github.com/ClickHouse/ClickHouse/commit/60054d177c8b...

That was seven years ago, and I hope the Linux kernel has improved since; I will need to try a "revert of the revert" of this commit. These changes cannot be tested with microbenchmarks; only production usage can show their actual impact.

Also, we successfully use huge pages for the text section of the executable, which is beneficial for the stability of performance benchmarks because it lowers the number of iTLB misses.
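
On recent kernels, one way to get this is madvise(MADV_HUGEPAGE) on the binary's own executable mappings, which works for file-backed text if the kernel was built with CONFIG_READ_ONLY_THP_FOR_FS. A sketch of that approach (not necessarily what ClickHouse ships, which remaps the machine code into anonymous huge pages instead):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  #define HUGE_2M (2UL * 1024 * 1024)

  int main(void)
  {
      FILE *maps = fopen("/proc/self/maps", "r");
      if (!maps) { perror("maps"); return 1; }

      char line[512], perms[5];
      unsigned long start, end;
      while (fgets(line, sizeof line, maps)) {
          /* Each line starts with "start-end perms ...". */
          if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
              continue;
          if (strcmp(perms, "r-xp") != 0)  /* executable text only */
              continue;

          /* Round inward to 2 MiB boundaries so whole huge-page
             extents can be collapsed. */
          unsigned long lo = (start + HUGE_2M - 1) & ~(HUGE_2M - 1);
          unsigned long hi = end & ~(HUGE_2M - 1);
          if (hi <= lo)
              continue;

          /* Fails harmlessly on kernels without the config option. */
          if (madvise((void *)lo, hi - lo, MADV_HUGEPAGE) == 0)
              printf("MADV_HUGEPAGE on %#lx-%#lx\n", lo, hi);
      }
      fclose(maps);
      return 0;
  }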

[1] ClickHouse - high-performance OLAP DBMS: https://github.com/ClickHouse/ClickHouse/