Mergesort is faster than Quicksort

Intro

Quicksort is quick, and it’s well known that it can be defeated under certain circumstances, from the simplest already-sorted input to a killer adversary. But if someone claimed to have found another algorithm that outperforms Quicksort on random arrays, without any strong constraint on the data to be sorted, you’d probably disdain the claim and assume they got something wrong. At least, that’s what I would have done before my recent discovery: Mergesort is faster than Quicksort, if you use GCC and your CPU is somewhat “new”.

Ridiculous, right? If you don’t believe me, have a look at this Google Colab notebook:
notebook

It appears that on the Google platform, with the NumPy 1.14.6 implementation and a \(10^6\)-sized int64 random array, Mergesort is faster than Quicksort, or else the timing module of IPython is wrong.
Is this a bug? Is it something particular to that machine? Well, you can run a quick benchmark with the following code on your own machine (preferably with the same NumPy version as in the notebook):

# -*- coding: utf-8 -*-
from __future__ import print_function

import time

import numpy as np

print(np.__version__)

size = 10 ** 5


def benchmark(kind):
    a = np.random.randint(0, size, size)
    times = []
    for i in range(100):
        b = a.copy()
        time1 = time.time()
        b.sort(kind=kind)
        times.append(time.time() - time1)
    return np.array(times)


for kind in ["quicksort", "mergesort"]:
    times = benchmark(kind) * 10 ** 6  # use usec as unit
    print("{}: {:.0f}±{:.0f} usec".format(kind, times.mean(), times.std()))

The output on my machine is:

1.15.2
quicksort: 7407±640 usec
mergesort: 5923±691 usec

As we shall see later, the result is CPU and compiler dependent. So if you get the same result, let’s delve into it and see what really happens; if not, the rest of this post will still make sense to you, and by the end you’ll have a clear idea of how to make Mergesort beat Quicksort on your computer.

Get rid of NumPy

At this point it is quite natural to suspect that the abnormal result is related to NumPy itself, because it’s a huge project and lots of places can behave unexpectedly. In fact, in later versions the Quicksort implementation in NumPy is actually introsort, which switches from Quicksort to Heapsort when Quicksort’s worst case seems to be approaching.
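To make that concrete, here is a minimal, self-contained sketch of the introsort scheme just described. This is not NumPy’s actual code; the simple Lomuto partition and the helper names are mine and are for illustration only:

#include <stdint.h>
#include <stddef.h>

/* Sketch of introsort: recurse like Quicksort, but once the depth budget
 * (about 2*log2(n)) is used up, fall back to Heapsort so the worst case
 * stays O(n log n). Illustrative only, not NumPy's implementation. */

static void sift_down(int64_t *a, size_t root, size_t end) {
    while (2 * root + 1 < end) {
        size_t child = 2 * root + 1;
        if (child + 1 < end && a[child] < a[child + 1]) {
            ++child;                              /* pick the larger child */
        }
        if (a[root] >= a[child]) {
            return;
        }
        int64_t t = a[root]; a[root] = a[child]; a[child] = t;
        root = child;
    }
}

static void heap_sort(int64_t *a, size_t n) {
    for (size_t i = n / 2; i-- > 0; ) {
        sift_down(a, i, n);                       /* build a max-heap */
    }
    for (size_t i = n; i-- > 1; ) {
        int64_t t = a[0]; a[0] = a[i]; a[i] = t;
        sift_down(a, 0, i);                       /* pop the maximum */
    }
}

static size_t partition(int64_t *a, size_t n) {
    int64_t pivot = a[n - 1];                     /* simple Lomuto partition */
    size_t i = 0;
    for (size_t j = 0; j + 1 < n; ++j) {
        if (a[j] < pivot) {
            int64_t t = a[i]; a[i] = a[j]; a[j] = t;
            ++i;
        }
    }
    int64_t t = a[i]; a[i] = a[n - 1]; a[n - 1] = t;
    return i;                                     /* final position of the pivot */
}

static void intro_sort_rec(int64_t *a, size_t n, int depth) {
    if (n < 2) {
        return;
    }
    if (depth == 0) {
        heap_sort(a, n);                          /* recursion too deep: bail out */
        return;
    }
    size_t p = partition(a, n);
    intro_sort_rec(a, p, depth - 1);
    intro_sort_rec(a + p + 1, n - p - 1, depth - 1);
}

void int64_intro_sort(int64_t *a, size_t n) {
    int depth = 0;
    for (size_t m = n; m > 1; m >>= 1) {
        depth += 2;                               /* roughly 2 * log2(n) */
    }
    intro_sort_rec(a, n, depth);
}

In the common case the depth limit never triggers, so the hot path is still plain Quicksort partitioning.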
At first I thought this was the origin of the slower Quicksort. The inference is easy to validate (or invalidate) by writing some home-brew sorting code. Quicksort and Mergesort are both very fundamental algorithms and appear in every algorithms course and textbook. Below is the code, which could hardly be simpler:

void int64_quick_sort(int64_t *dst, const size_t size) {
    int64_t *pl, *pr, *pm, pivot;

    if (size < 16) {
        int64_insertion_sort(dst, size);
        return;
    }

    pl = dst;
    pr = pl + size;
    pm = pl + ((pr - pl) >> 1);
    --pr;

    /* median of 3 */
    if (*pm < *pl) {
        SORT_SWAP(*pm, *pl);
    }

    if (*pr < *pm) {
        SORT_SWAP(*pr, *pm);
    }

    if (*pm < *pl) {
        SORT_SWAP(*pm, *pl);
    }

    SORT_SWAP(*pl, *pm);
    pivot = *pl;

    for (;;) {
        do { --pr; }
        while (pivot < *pr);

        if (pr <= pl) {
            break;
        }

        SORT_SWAP(*pl, *pr);

        do { ++pl; }
        while (*pl < pivot);

        if (pr <= pl) {
            break;
        }

        SORT_SWAP(*pl, *pr);
    }

    *pl = pivot;
    int64_quick_sort(dst, pl - dst);
    int64_quick_sort(pl + 1, dst + size - pl - 1);
}


void int64_merge_sort_rec(int64_t *dst, const size_t size, int64_t *pw) {
    int64_t *pl, *pr, *pm;

    if (size < 16) {
        int64_insertion_sort(dst, size);
        return;
    }

    pl = dst;
    pr = pl + size;
    pm = pl + ((pr - pl) >> 1);
    int64_merge_sort_rec(pl, pm - pl, pw);
    int64_merge_sort_rec(pm, pr - pm, pw);
    memcpy(pw, pl, (pm - pl) * sizeof(int64_t));

    while (pm < pr && pl < pm) {
        if (*pw <= *pm) {
            *pl++ = *pw++;
        } else {
            *pl++ = *pm++;
        }
    }

    if (pl < pm) {
        memcpy(pl, pw, (pm - pl) * sizeof(int64_t));
    }

    return;
}

void int64_merge_sort(int64_t *dst, const size_t size) {
    int64_t *pw;

    pw = (int64_t *) malloc(size * sizeof(int64_t));
    int64_merge_sort_rec(dst, size, pw);
    free(pw);
}
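The code above relies on a SORT_SWAP macro and an int64_insertion_sort helper that live elsewhere in the benchmark source (the file also needs <string.h> and <stdlib.h> for memcpy, malloc and free). Their exact definitions are not essential to the argument, but a minimal sketch of what they might look like is:

#include <stdint.h>
#include <stddef.h>

/* Plausible stand-ins for the helpers assumed above; the benchmark repo's
 * actual definitions may differ. */
#define SORT_SWAP(a, b) do { int64_t tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

void int64_insertion_sort(int64_t *dst, const size_t size) {
    for (size_t i = 1; i < size; ++i) {
        int64_t v = dst[i];
        size_t j = i;
        /* shift larger elements one slot to the right, then drop v in place */
        while (j > 0 && dst[j - 1] > v) {
            dst[j] = dst[j - 1];
            --j;
        }
        dst[j] = v;
    }
}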

Integrating this code with a simple benchmark framework, I’m able to produce the following output:

cc -O3 -g -Wall -pedantic main.c -o fast_merge.out
./fast_merge.out
int64_quick_sort              - ok,    25683.0 usec
int64_merge_sort              - ok,    19178.0 usec
float64_quick_sort            - ok,    28916.0 usec
float64_merge_sort            - ok,    27397.0 usec

Surprised, huh? You can double-check that there’s nothing fishy in the source code of the benchmark suite or either algorithm. Again, the result is platform and compiler dependent, so if you can’t reproduce it, don’t worry; you’ll know why by the end of the post.

CPU and Compiler

I soon found out that the CPU (or, more broadly, the platform) and the compiler have an impact on the benchmark result. For the same binary executable, different CPUs lead to different conclusions. Below is a little summary of the platforms I tested:

CPU type      Launch Date  Faster int  Faster float  Comment
E5-2620       Q1’12        Mergesort   Quicksort
E5-2620 v2    Q3’13        Mergesort   The same
i7-4790k      Q2’14        Mergesort   Quicksort
E5-2630 v3    Q3’14        Mergesort   Quicksort
E5-2640 v3    Q3’14        Mergesort   Quicksort
i7-6700HQ     Q3’15        Mergesort   Mergesort     My PC
E5-2676 v3    Q4’15        Mergesort   The same      Amazon VPS
E5-2680 v4    Q1’16        Mergesort   Mergesort
E5-26xx v4    Q?’16        Mergesort   Mergesort     Tencent VPS
Silver 4116   Q3’17        Mergesort   Mergesort
Unknown Xeon  ?            Mergesort   Mergesort     Google cloud console

It seems that with newer CPUs Mergesort performs better on float sorting as well, indicating that the reason why Mergesort is faster lies at the assembly code level.
Another strong piece of evidence that the assembly code matters is the compiler dependency. If Clang is used as the compiler, the world we’re familiar with comes back again:

clang -O3 -g -Wall -pedantic main.c -o fast_merge.out
./fast_merge.out
int64_quick_sort              - ok,    25017.0 usec
int64_merge_sort              - ok,    29115.0 usec
float64_quick_sort            - ok,    27895.0 usec
float64_merge_sort            - ok,    32559.0 usec

Quicksort becomes the best again! Combined with the previous GCC result, the mist around the super-fast Mergesort becomes thinner.


GCC must have done some genius optimization on the Mergesort code that greatly boosts its speed, while Clang is not able to do the same. So the difference between the binaries generated by GCC and Clang is the key to answering why Mergesort is faster.

Disassembly

There are tons of assembly instructions in the binary, and we should focus only on the most important part, that is, the instructions corresponding to the following C code:

while (pm < pr && pl < pm) {
    if (*pw <= *pm) {
        *pl++ = *pw++;
    } else {
        *pl++ = *pm++;
    }
}

This is where the merging process, the heaviest work, takes place. The code is not hard to understand: a plain, usual Mergesort routine. pl, pm and pr point at the left side, middle and right side of the array respectively, while pw points to the buffer into which the left part of the array has already been copied. First let’s look at the code generated by Clang for sorting int64:

0x4009f0 <int64_merge_sort_rec+224>    mov    (%r14),%rax
0x4009f3 <int64_merge_sort_rec+227>    mov    (%r15),%rsi
0x4009f6 <int64_merge_sort_rec+230>    cmp    %rsi,%rax              ;compare *pw and *pm
0x4009f9 <int64_merge_sort_rec+233>    jle    0x400a1e <int64_merge_sort_rec+270>
0x4009fb <int64_merge_sort_rec+235>    lea    0x8(%r15),%rdx         ;the else branch
0x4009ff <int64_merge_sort_rec+239>    lea    0x8(%rbx),%rcx
0x400a03 <int64_merge_sort_rec+243>    mov    %rsi,(%rbx)
0x400a06 <int64_merge_sort_rec+246>    cmp    %r15,%rbx
0x400a09 <int64_merge_sort_rec+249>    setb   %al
0x400a0c <int64_merge_sort_rec+252>    cmp    %rbp,%rdx
0x400a0f <int64_merge_sort_rec+255>    jae    0x400a3b <int64_merge_sort_rec+299>
0x400a11 <int64_merge_sort_rec+257>    cmp    %r15,%rbx
0x400a14 <int64_merge_sort_rec+260>    mov    %rdx,%r15
0x400a17 <int64_merge_sort_rec+263>    mov    %rcx,%rbx
0x400a1a <int64_merge_sort_rec+266>    jb     0x4009f0 <int64_merge_sort_rec+224>
0x400a1c <int64_merge_sort_rec+268>    jmp    0x400a3b <int64_merge_sort_rec+299>
0x400a1e <int64_merge_sort_rec+270>    add    $0x8,%r14              ;the *pw <= *pm branch
0x400a22 <int64_merge_sort_rec+274>    mov    %rax,(%rbx)
0x400a25 <int64_merge_sort_rec+277>    add    $0x8,%rbx
0x400a29 <int64_merge_sort_rec+281>    cmp    %r15,%rbx
0x400a2c <int64_merge_sort_rec+284>    setb   %al
0x400a2f <int64_merge_sort_rec+287>    cmp    %rbp,%r15
0x400a32 <int64_merge_sort_rec+290>    jae    0x400a41 <int64_merge_sort_rec+305>
0x400a34 <int64_merge_sort_rec+292>    cmp    %r15,%rbx
0x400a37 <int64_merge_sort_rec+295>    jb     0x4009f0 <int64_merge_sort_rec+224>

There isn’t much optimization in this code, actually; it is merely a “machine translation” of the C code. Note that two branches appear in this assembly, decided by the jle at <int64_merge_sort_rec+233>, which corresponds to if (*pw <= *pm) in the C code.
So, what about the code GCC generates? The first impression it gives is that it’s short:

0x4009b0 <int64_merge_sort_rec+256>    cmp    %rbx,%rbp
0x4009b3 <int64_merge_sort_rec+259>    jae    0x4009e4 <int64_merge_sort_rec+308>
0x4009b5 <int64_merge_sort_rec+261>    mov    (%r12),%rax
0x4009b9 <int64_merge_sort_rec+265>    mov    (%rbx),%rdx
0x4009bc <int64_merge_sort_rec+268>    lea    0x8(%r12),%rsi
0x4009c1 <int64_merge_sort_rec+273>    lea    0x8(%rbx),%rcx
0x4009c5 <int64_merge_sort_rec+277>    cmp    %rdx,%rax
0x4009c8 <int64_merge_sort_rec+280>    cmovle %rsi,%r12
0x4009cc <int64_merge_sort_rec+284>    add    $0x8,%rbp
0x4009d0 <int64_merge_sort_rec+288>    cmp    %rdx,%rax
0x4009d3 <int64_merge_sort_rec+291>    cmovg  %rcx,%rbx
0x4009d7 <int64_merge_sort_rec+295>    cmovg  %rdx,%rax
0x4009db <int64_merge_sort_rec+299>    cmp    %rbx,%r13
0x4009de <int64_merge_sort_rec+302>    mov    %rax,-0x8(%rbp)
0x4009e2 <int64_merge_sort_rec+306>    ja     0x4009b0 <int64_merge_sort_rec+256>

Without further digging, you can tell from the length alone that this must be very efficient, but let’s dig deeper and see what magic GCC has done.

  1. %rbx and %rbp represent pm and pl respectively. If pl catches up with pm, the loop breaks; otherwise it continues.
  2. The values at the addresses in %r12 (pw) and %rbx (pm) are then loaded into %rax and %rdx.
  3. %rax and %rdx are compared twice, at <int64_merge_sort_rec+277> and <int64_merge_sort_rec+288>, to decide whether %r12 (pw) and %rbx (pm) really need to be incremented.
  4. The smaller of %rdx (*pm) and %rax (*pw) is stored in %rax (<int64_merge_sort_rec+295>) and then moved to *pl (<int64_merge_sort_rec+302>).
  5. %rbx (pm) is finally compared with %r13 (pr) as the other condition deciding whether to break the loop or start over from the beginning.

Note that the data-dependent branch is gone: the if is compiled into conditional moves (cmovle/cmovg), and the only two jump instructions left in the control flow are loop-control jumps that are highly predictable. This pattern lets the CPU pipeline work at maximum power. The price is the repeated comparison between *pw and *pm, which isn’t really a big deal for int64. What a bargain!
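To see the transformation in C terms, here is a sketch of the merge step written the way GCC effectively compiles it: the data-dependent if/else becomes a value select that the compiler can lower to cmov, and the only branch left is the loop condition. The function name and signature below are mine, not from the benchmark source:

#include <stdint.h>
#include <string.h>

/* Branchless merge of the copied left run (pw) and the right run (pm..pr)
 * into pl; equivalent in effect to the original merge loop above. */
static void int64_branchless_merge(int64_t *pl, int64_t *pm, int64_t *pr,
                                   int64_t *pw) {
    while (pm < pr && pl < pm) {
        int64_t a = *pw;              /* head of the left run (in the buffer) */
        int64_t b = *pm;              /* head of the right run (in place)     */
        int take_left = (a <= b);     /* 0 or 1: the comparison result        */
        *pl++ = take_left ? a : b;    /* one unconditional store per element;
                                         the select can be lowered to cmov    */
        pw += take_left;              /* advance whichever run supplied the
                                         element, without a jump              */
        pm += !take_left;
    }
    if (pl < pm) {
        /* right run exhausted first: flush the rest of the left run */
        memcpy(pl, pw, (size_t)(pm - pl) * sizeof(int64_t));
    }
}

Whether a given compiler really lowers the select to a cmov is its own decision; the listings above show that GCC reaches this shape even from the plain if/else, while Clang does not, which is exactly the difference visible in the disassembly.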

Interested readers can compare the length, the number of branches, and the predictability of the jump instructions between the GCC-compiled Mergesort and Quicksort; it then becomes much more plausible why Mergesort can be faster than Quicksort. In fact, perf stats indicate that Quicksort has a 15% branch-miss rate while Mergesort has only 3%.

Now that it’s clear why the GCC-compiled binary is much faster than the Clang-compiled one, what about the relation between the speed of float sorting and the CPU release date? Why do newer CPUs do better with Mergesort on float numbers? Well, I’m not an expert on CPU architecture and performance, but my guess is that comparing floating-point numbers used to be expensive and has kept getting cheaper over the years. Hence old CPUs suffer from the repeated float comparisons, whereas on new CPUs comparing floats is about as cheap as comparing integers.

How to make Mergesort faster than Quicksort

If you’ve tested my benchmark code in your environment and Quicksort is still invincible, basically you need to check two things:

  • Is your CPU too old? If it’s 10 years old, you’d better get a new one.
  • Is the assembly code generated by your compiler “optimized” for Mergesort? My test environments include GCC 5.4.0 (Ubuntu 16.04 default) and GCC 4.8.5 (CentOS 7 default), so these two versions should be OK. My preliminary results on Windows show that no compiler on that platform, including MSVC, does this optimization.
    If all compilers fail on your machine, or you’re simply not interested in disassembly, you can check out the inline-assembly version of Mergesort in this repo, which yields compiler-independent optimized Mergesort code (a minimal sketch of the core idea follows this list).
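For reference, the core of a compiler-independent version is forcing the conditional move by hand with GCC/Clang extended inline assembly (x86-64 only). The snippet below is my own minimal sketch of the idea, not the actual code from the repo:

#include <stdint.h>

/* min(a, b) with an explicit cmov, so the selection can never be compiled
 * into a data-dependent branch. GCC/Clang extended asm, x86-64 only. */
static inline int64_t int64_min_cmov(int64_t a, int64_t b) {
    __asm__ ("cmp %1, %0\n\t"       /* set flags from a - b       */
             "cmovg %1, %0"         /* if a > b, replace a with b */
             : "+r" (a)
             : "r" (b)
             : "cc");
    return a;
}

A full merge loop built on this kind of select also needs branch-free pointer updates, along the lines of the earlier C sketch.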

A convenient way for you to avoid buying a new CPU or setting up lots of compilers is to use the Google cloud console. Try:

git clone https://github.com/liwt31/fast_mergesort.git
cd fast_mergesort/bin
./fast_merge.out

fast_merge.out is a compiled binary with the optimized Mergesort. You’ll probably see a result showing Mergesort faster than Quicksort, like this:

int64_quick_sort              - ok,    30958.0 usec
int64_merge_sort              - ok,    23785.0 usec
float64_quick_sort            - ok,    33966.0 usec
float64_merge_sort            - ok,    27837.0 usec

You can also compile and run for yourself:

cd ..
make run

Conclusion

Mergesort is faster than Quicksort, given a modern CPU and a genius compiler.