This is very good: it uses the original Hoare partitioning algorithm which moves two indices at opposite ends of the array partition toward each other, rather than the ill-considered Lomuto partitioning:

    while (1) {                              \
      do q_i++; while (Q_LESS(q_i, q_l));    \
      do q_j--; while (Q_LESS(q_l, q_j));    \
     if (q_i >= q_j) break; /* Sedgewick says "until j < i" */ \
     Q_SWAP(q_i, q_j);                       \
    }                                        \
Lomuto is that algorithm that moves one index, swapping lower-than-pivot elements into a growing lower partition.

Lomuto ends up performing more comparisons.

What's more, Lomuto has quadratic behavior for sequences that are just repetitions of a value: every partitioning and sub-partitioning is a degenerate case. Hoare's algorithm deftly avoids this problem. You can explain that using the above snippet. If all elements are the same, then Q_LESS(q_i, q_l) is always false. Thus each of the two do/while loops executes one unconditional iteration, so that the two indices q_i and q_j march toward each other and meet in the middle. The repeating sequence thus nicely cut in half, and recursively so, ensuring O(N log N) behavior.

You can't easily banish the worst case from Quicksort, but implementations should not exhibit worst case behavior on inputs that have some obvious pattern, like a repeated value.

Lomuto partitioning has one major advantage over Hoare: Lomuto can be implemented in such that the inner loop is branchless, while the inner loops in Hoare are branchy and wreck the branch predictor.

Lomuto typically fixes the Dutch National Flag problem by keeping two 'midpoints'; left of the left midpoint is less than the pivot, right of the right midpoint is greater than the pivot, between the two midpoints is equal to the pivot. This makes most data inputs that result in the O(n^2) worst case of quicksort give O(n) performance instead.

Unfortunately the code is horrific looking. It's not something I'd ever want to do in a C macro.

And my algorithm pdqsort uses Hoare-style partitioning, is branchless, and properly solves the Dutch National Flag problem by being 0(n log k) for k unique values on average using only a single extra comparison per partitioning.

https://github.com/orlp/pdqsort