What does HackerNews think of pdqsort?

Pattern-defeating quicksort.

Language: C++

My algorithm pdqsort uses Hoare-style partitioning, is branchless, and properly solves the Dutch National Flag problem: it is O(n log k) on average for k unique values, using only a single extra comparison per partitioning.
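The equal-elements trick can be sketched with a toy quicksort. This is illustrative only and assumes a Lomuto-style partition for brevity (pdqsort's real implementation uses branchless Hoare-style partitioning): one extra comparison checks whether the chosen pivot equals the element just left of the subrange — a previously placed pivot — and if so, everything equal to the pivot is a minimum of the subrange and can be swept out of the recursion, giving O(n log k) behavior for k distinct values.

```cpp
#include <algorithm>
#include <vector>

// Toy sketch of the single-extra-comparison trick described above. This is
// NOT pdqsort's code; names and structure here are hypothetical.
// Invariant: when leftmost == false, all of v[lo..hi) >= v[lo - 1].
void quicksort_low_cardinality(std::vector<int>& v, int lo, int hi,
                               bool leftmost = true) {
    while (hi - lo > 1) {
        // Median-of-three pivot selection; the median ends up at v[hi - 1].
        int mid = lo + (hi - lo) / 2;
        if (v[mid] < v[lo]) std::swap(v[mid], v[lo]);
        if (v[hi - 1] < v[lo]) std::swap(v[hi - 1], v[lo]);
        if (v[hi - 1] < v[mid]) std::swap(v[hi - 1], v[mid]);
        std::swap(v[mid], v[hi - 1]);
        int pivot = v[hi - 1];

        // The single extra comparison: a pivot equal to the previous pivot
        // is a minimum of this subrange, so every element equal to it can
        // be swept to the front and excluded from further sorting.
        if (!leftmost && pivot == v[lo - 1]) {
            int i = lo;
            for (int j = lo; j < hi; ++j)
                if (v[j] == pivot) std::swap(v[i++], v[j]);
            lo = i;  // at least the pivot itself is swept, so we progress
            continue;
        }

        // Plain Lomuto partition around v[hi - 1].
        int i = lo;
        for (int j = lo; j < hi - 1; ++j)
            if (v[j] < pivot) { std::swap(v[i], v[j]); ++i; }
        std::swap(v[i], v[hi - 1]);

        quicksort_low_cardinality(v, lo, i, leftmost);
        lo = i + 1;        // continue with the >= side; its left neighbor
        leftmost = false;  // is the pivot we just placed at v[i]
    }
}
```

On an all-equal array this degenerates to a single sweep, i.e. O(n) total work.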

https://github.com/orlp/pdqsort

Closely related is pattern-defeating quicksort ( https://github.com/orlp/pdqsort ), which adapts quicksort to take advantage of sorted runs. I've adapted a few quicksorts to pdqsort and seen good speedups, as people were often sorting partially sorted data.

Basically: Timsort is to mergesort as pdqsort is to quicksort

See also this 2017 thread on pdqsort: https://news.ycombinator.com/item?id=14661659

Direct link to the GitHub home of pdqsort (in C++), which has some performance comparisons: https://github.com/orlp/pdqsort

This article is missing the important reference to the prior work by Edelkamp and Weiss, on branchless quicksort [1].

I've long since implemented that in pdqsort [2], along with other techniques described by me in [3] which is a drop-in replacement for std::sort that's ~twice as fast as std::sort for small data with branchless comparisons. It's also available in Boost.sort [4].

[1] https://arxiv.org/abs/1604.06697 [2] https://github.com/orlp/pdqsort [3] https://arxiv.org/abs/2106.05123 [4] https://www.boost.org/doc/libs/1_76_0/libs/sort/doc/html/sor...

> Agreed on pdqsort being 3x slower worst case on "killer" input, but it's my understanding that's why it's not being used to replace std::sort.

This is just not true. Introsort (the current std::sort in all C++ compilers I know of) also has 'killer' inputs, and on those it switches to heapsort, ending up about twice as slow as pdqsort does. Worse, the libc++ implementation of std::sort has a quadratic killer sequence that I found 7 years ago, and it is still not fixed: https://bugs.llvm.org/show_bug.cgi?id=20837. In my opinion there is no good reason pdqsort has not yet been adopted as std::sort: it is faster for many patterns, faster for random input, and rarely if ever slower.

In fact, pdqsort is the standard unstable sorting algorithm in Rust: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.sor...

Note that any deterministic quicksort variant that does not spend a lot of time on choosing its pivots will always have some 'killer' pattern that forces it into its fallback algorithm (or worse, makes it go O(n^2)), as you can always arrange the input array so that poor pivots are chosen. What pdqsort does is shuffle things around so that those 'killer' patterns are a lot less trivial and not as likely to occur in real-world input. For me determinism was very important, so I didn't go beyond that in https://github.com/orlp/pdqsort; note, however, that the variant implemented in Rust uses non-deterministic swapping, which makes it impossible to construct a 'killer' input against it.
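The pattern-breaking idea can be sketched like this — a toy illustration with a Lomuto partition and simplistic hypothetical heuristics, not pdqsort's actual detection or shuffling: after a highly unbalanced partition, deterministically swap a few elements to perturb the pattern, and after too many bad partitions fall back to heapsort, capping the worst case at O(n log n).

```cpp
#include <algorithm>
#include <vector>

// Median-of-three Lomuto partition; returns the pivot's final index.
static int lomuto_partition(std::vector<int>& v, int lo, int hi) {
    int mid = lo + (hi - lo) / 2;
    if (v[mid] < v[lo]) std::swap(v[mid], v[lo]);
    if (v[hi - 1] < v[lo]) std::swap(v[hi - 1], v[lo]);
    if (v[hi - 1] < v[mid]) std::swap(v[hi - 1], v[mid]);
    std::swap(v[mid], v[hi - 1]);  // median of three becomes the pivot
    int pivot = v[hi - 1], i = lo;
    for (int j = lo; j < hi - 1; ++j)
        if (v[j] < pivot) { std::swap(v[i], v[j]); ++i; }
    std::swap(v[i], v[hi - 1]);
    return i;
}

static void sort_rec(std::vector<int>& v, int lo, int hi, int bad_allowed) {
    while (hi - lo > 1) {
        int p = lomuto_partition(v, lo, hi);
        int n = hi - lo;
        if (std::min(p - lo, hi - 1 - p) < n / 8) {  // highly unbalanced
            if (--bad_allowed <= 0) {
                // Budget exhausted: heapsort guarantees O(n log n).
                std::make_heap(v.begin() + lo, v.begin() + hi);
                std::sort_heap(v.begin() + lo, v.begin() + hi);
                return;
            }
            // Deterministically swap elements within each side so the same
            // killer pattern cannot keep producing poor pivots.
            int ln = p - lo, rn = hi - (p + 1);
            if (ln > 2) std::swap(v[lo], v[lo + ln / 2]);
            if (rn > 2) std::swap(v[p + 1], v[p + 1 + rn / 2]);
        }
        sort_rec(v, lo, p, bad_allowed);  // recurse left, iterate right
        lo = p + 1;
    }
}

void pattern_defeating_sort_sketch(std::vector<int>& v) {
    int budget = 1;
    for (size_t n = v.size(); n > 1; n >>= 1) ++budget;  // ~log2(n) bad splits
    sort_rec(v, 0, (int)v.size(), budget);
}
```

A deterministic adversary can still trigger the fallback here, but the fallback bounds the damage to the heapsort constant rather than O(n^2).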

Just for fun, I added pdqsort to the benchmark:

https://github.com/orlp/pdqsort

Here are some of the results on an Ivy Bridge hackintosh:

      size, qsort, inline, sort, stable, pdqsort, radix7
      1000,  88.0,   60.6, 31.2,   37.5,    24.6,   12.8
     10000, 109.3,   77.1, 45.6,   51.6,    29.1,   12.0
    100000, 137.1,   97.7, 57.7,   71.9,    33.3,   12.5
   1000000, 159.6,  114.2, 70.6,   88.6,    39.7,   20.4
  10000000, 185.2,  133.7, 82.0,   99.7,    43.6,   19.4
Edit: it's not quite as easy to plug radix7 into pdqsort's more varied benchmark program, which expects a template function with a type parameter of int.

Hoare partitioning can also be implemented in a branchless manner; my algorithm pdqsort does this: https://github.com/orlp/pdqsort

C's qsort() is notoriously bad because the comparison function isn't inlined, meaning the majority of time is spent in call overhead.
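The interface difference behind that overhead looks like this (an illustration, not a benchmark): qsort receives an opaque function pointer and must make an indirect call for every comparison, while std::sort is instantiated with the comparator's type, so the compiler can inline the comparison into the sorting loop.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// qsort's comparator crosses a function-pointer boundary on every
// comparison, which generally blocks inlining.
int compare_ints(const void* a, const void* b) {
    int x = *static_cast<const int*>(a);
    int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);  // avoids the overflow of the classic x - y
}

void sort_with_qsort(std::vector<int>& v) {
    std::qsort(v.data(), v.size(), sizeof(int), compare_ints);
}

// std::sort is instantiated per comparator type; the lambda's call operator
// is a direct call the optimizer can inline into the partition loop.
void sort_with_std_sort(std::vector<int>& v) {
    std::sort(v.begin(), v.end(), [](int x, int y) { return x < y; });
}
```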

Zhangxp1998 ported quadsort to C++ and compared with std::sort and found it to be slower: https://gist.github.com/zhangxp1998/0e2fa30656c894017d183e0d...

Shameless self promotion, if you want a faster sorting implementation, check out pdqsort (https://github.com/orlp/pdqsort) which is almost twice as fast as libstdc++'s std::sort for randomly shuffled integers, and has all sorts of tricks for input distributions with patterns (like pre-sorted, reverse sorted, single out-of-order element appended, etc).

It's not even the fastest comparison-based sorter for the vast majority of inputs. There are much faster sorters out there that exploit the details of modern hardware (branch prediction is a big one; parallelism is also an obvious win if you can make use of it), e.g. https://github.com/SaschaWitt/ips4o/ (sequential and parallel, disclaimer: the authors are my colleagues) or https://github.com/orlp/pdqsort (sequential).

I also love pdqsort. It has many of the same advantages, and has the advantage of being easier to implement (particularly if you already have a quicksort lying around).

https://github.com/orlp/pdqsort

Basically, timsort is to mergesort as pdqsort is to quicksort.

That is true - the benchmarks mostly focus on random cases, although there are a few benchmarks with "mostly sorted" arrays (sorted arrays with sqrt(n) random swaps).

If the input array consists of several concatenated ascending or descending sequences, then timsort is the best. After all, timsort was specifically designed to take advantage of that particular case. Pdqsort performs respectably, too, and if you have more than a dozen of these sequences or if the sequences are interspersed, then it starts winning over timsort.

Anyways, both pdqsort and timsort perform well when the input is not quite random. In particular, pdqsort blows introsort (e.g. typical C++ std::sort implementations) out of the water when the input is not random[1]. It's pretty much a strict improvement over introsort. Likewise, timsort (at least the variant implemented in Rust's standard library) is pretty much a strict improvement over merge sort (e.g. typical C++ std::stable_sort implementations).

Regarding radix sort, pdqsort can't quite match its performance (it's O(n log n) after all), but can perform fairly respectably. E.g. ska_sort[2] (a famous radix sort implementation) and Rust's pdqsort perform equally well on my machine when sorting 10 million random 64-bit integers. However, on larger arrays radix sort starts winning easily, which shouldn't be surprising.

I'm aware that benchmarks are tricky to get right, can be biased, and are always controversial. If you have any further questions, feel free to ask.

[1]: https://github.com/orlp/pdqsort

[2]: https://github.com/skarupke/ska_sort

My comments here are not about the posted article, but about the one referenced at the top: http://www.oilshell.org/blog/2016/12/23.html

    def LookupKind(id):
        return KIND_TABLE[id]

    def LookupKind(id):
        return 175 & id & ((id ^ 173) + 11)
> The second implementation is faster because it doesn't require any memory accesses.

This is not necessarily true. If this function is in a hot loop, and the entire lookup table fits in L1 cache, you can treat KIND_TABLE[id] as only taking ~1-2 cycles.
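In C++ terms the two variants look like this. Note a caveat: for this illustration the table is generated from the post's own bit formula, so the two functions agree by construction; the real KIND_TABLE in the referenced post is independently generated data, and the point here is only the lookup-versus-compute tradeoff.

```cpp
#include <array>

// The computed variant: a handful of ALU operations, no memory load.
int lookup_kind_computed(int id) {
    return 175 & id & ((id ^ 173) + 11);
}

// The table variant: a single load, roughly L1 latency while the table
// stays hot. Filled from the formula above purely for illustration.
const std::array<int, 256> kKindTable = [] {
    std::array<int, 256> t{};
    for (int id = 0; id < 256; ++id) t[id] = lookup_kind_computed(id);
    return t;
}();

int lookup_kind_table(int id) {
    return kKindTable[id];
}
```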

> Jeff Dean also posed this question in his talk: How long does it take to quicksort 1 billion 4-byte integers? It surprised me that he estimated it by simply counting the number of branch mispredictions, which is described in this post.

This is no longer true. In state-of-the-art implementations quicksort is implemented in a branchless fashion. See https://github.com/orlp/pdqsort for details (I'm the author of pdqsort, old username on HN). The trick is to replace directly swapping misplaced elements in a loop with first finding a buffer of elements on the left that should be on the right, and a buffer of elements on the right that should be on the left. This can be done in a branchless manner:

    int buffer[64];
    int buffer_num = 0;
    const int buffer_max_size = 64;
    for (int i = 0; i < buffer_max_size; ++i) {
        // With a branch:
        //     if (elements[i] < pivot) { buffer[buffer_num] = i; buffer_num++; }
        // Branchless: the store is unconditional, and the comparison result
        // (0 or 1) decides whether the slot is kept or overwritten next time.
        buffer[buffer_num] = i;
        buffer_num += (elements[i] < pivot);
    }
These buffers can then also be swapped branchlessly.
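Both steps combine into a complete partition roughly like this — a simplified sketch that sizes the offset buffers to the whole range; real pdqsort works in fixed-size blocks with extra cleanup logic, so treat the names and structure as illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Simplified offset-buffer partition (not pdqsort's block-based code): fill
// both buffers branchlessly, then swap the misplaced pairs. Returns p such
// that v[0..p) < pivot and v[p..n) >= pivot.
int partition_with_offset_buffers(std::vector<int>& v, int pivot) {
    int n = (int)v.size();

    // The final boundary is simply how many elements are < pivot.
    int p = 0;
    for (int x : v) p += (x < pivot);

    // Branchless buffer fills, exactly as in the loop above: the store is
    // unconditional, and the comparison result advances the buffer count.
    std::vector<int> offs_l(n), offs_r(n);
    int num_l = 0, num_r = 0;
    for (int i = 0; i < p; ++i) {   // left side: offsets of elements >= pivot
        offs_l[num_l] = i;
        num_l += !(v[i] < pivot);
    }
    for (int i = p; i < n; ++i) {   // right side: offsets of elements < pivot
        offs_r[num_r] = i;
        num_r += (v[i] < pivot);
    }

    // Each side holds the same number of misplaced elements, so the
    // buffered entries swap pairwise with no per-element branch on the data.
    assert(num_l == num_r);
    for (int i = 0; i < num_l; ++i)
        std::swap(v[offs_l[i]], v[offs_r[i]]);
    return p;
}
```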
> But, worst case of Quicksort is still O( n^2 ) so what happens if Quicksort encounters such a partition?

I'd strongly suggest you read the README I wrote for https://github.com/orlp/pdqsort, which explains in detail how I beat heapsort (and other sorting algorithms). It also shows how I prevent the O(n^2) worst case.

> I used to get torqued at considering big-O, and did so just because it ignores constants; as a programmer, I worked hard to do well with such constants. But in the end I gave in and gave up and just accepted big-O notation as the most appropriate, simple, single criterion.

We have already established that O(n log n) is the lower bound for comparison sorts, and we have reached that lower bound. This is where big O stops being useful: a metric that compares equal for every relevant algorithm (mergesort, introsort, pdqsort, timsort, smoothsort, heapsort) is not a useful metric.

> In the land of big-O, a factor of only 2 is considered not very exciting.

I live in the land of real-world performance, powered by actual benchmarks. In this land a factor of 2 is very exciting.

> So, for a hybrid sort routine, consider also radix sort.

I worked in the scope of std::sort, which is strictly a comparison sort.