Intro
Quicksort is quick, and it is well known that it can be defeated under certain circumstances, from the simplest already-sorted data to the killer adversary. But if someone claimed to have found another algorithm that outperforms Quicksort on random arrays, without strong constraints on the data to be sorted, you would probably disdain the claim and assume they got something wrong. At least, that’s what I would have done before my recent discovery: Mergesort is faster than Quicksort, if you use GCC and your CPU is somewhat “new”.
Ridiculous, right? If you don’t believe me, have a look at this Google Colab notebook:
It appears that on the Google platform, with the NumPy 1.14.6 implementation, for a \(10^6\)-element int64 random array, either Mergesort is faster than Quicksort or the timing module of IPython is wrong.
Is this a bug? Is it specific to this setup? Well, you can run a quick benchmark with the following code on your machine (preferably with the same NumPy version as in the notebook):

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import time

import numpy as np

print(np.__version__)

size = 10 ** 5


def benchmark(kind):
    a = np.random.randint(0, size, size)
    times = []
    for i in range(100):
        b = a.copy()
        time1 = time.time()
        b.sort(kind=kind)
        times.append(time.time() - time1)
    return np.array(times)


for kind in ["quicksort", "mergesort"]:
    times = benchmark(kind) * 10 ** 6  # use usec as unit
    print("{}: {:.0f}±{:.0f} usec".format(kind, times.mean(), times.std()))
```
The output on my machine is:

```
1.15.2
quicksort: 7407±640 usec
mergesort: 5923±691 usec
```
As we shall see later, the result is CPU- and compiler-dependent. If you get the same result, let’s delve in and see what really happens. If not, the rest of this post will still make sense to you, and by the end you’ll have a clear idea of how to make Mergesort beat Quicksort on your computer.
Get rid of NumPy
At this point it is quite natural to suspect that the abnormal result is related to NumPy itself, because it is a huge project and lots of places can behave unexpectedly. In fact, in recent versions the “Quicksort” implementation in NumPy is actually Introsort, which switches from Quicksort to Heapsort when the worst-case scenario of Quicksort seems to be approaching.
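For context, Introsort is essentially a depth-limited Quicksort. A minimal sketch of the idea (the helpers `insertion_sort`, `heap_sort` and `partition` here are placeholders of my own, not NumPy’s actual functions) looks like this:

```c
#include <stdint.h>
#include <stddef.h>

/* Placeholder helpers, assumed to be provided elsewhere. */
void insertion_sort(int64_t *dst, size_t size);
void heap_sort(int64_t *dst, size_t size);
size_t partition(int64_t *dst, size_t size);  /* returns final pivot index */

/* Recurse like Quicksort, but once the depth budget (about 2*log2(n)
 * at the top-level call) is used up, assume partitioning is degenerating
 * and fall back to Heapsort, which is guaranteed O(n log n). */
void intro_sort_rec(int64_t *dst, size_t size, int depth_limit) {
    if (size < 16) {
        insertion_sort(dst, size);  /* small arrays: cheap and cache-friendly */
        return;
    }
    if (depth_limit <= 0) {
        heap_sort(dst, size);       /* worst-case escape hatch */
        return;
    }
    size_t p = partition(dst, size);
    intro_sort_rec(dst, p, depth_limit - 1);
    intro_sort_rec(dst + p + 1, size - p - 1, depth_limit - 1);
}
```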
At first I thought this was the origin of the slower Quicksort. The hypothesis is easy to validate (or invalidate) by writing some home-brew sort code. Quicksort and Mergesort are both very fundamental algorithms and appear in every algorithms course or textbook. Below is the code; it almost couldn’t be simpler:
```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Helpers referenced below. The benchmark suite has its own versions;
 * these minimal ones are included so the listing compiles standalone. */
#define SORT_SWAP(a, b) do { int64_t tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

static void int64_insertion_sort(int64_t *dst, const size_t size) {
    size_t i, j;
    for (i = 1; i < size; ++i) {
        int64_t v = dst[i];
        for (j = i; j > 0 && dst[j - 1] > v; --j) {
            dst[j] = dst[j - 1];
        }
        dst[j] = v;
    }
}

void int64_quick_sort(int64_t *dst, const size_t size) {
    int64_t *pl, *pr, *pm, pivot;
    if (size < 16) {
        int64_insertion_sort(dst, size);
        return;
    }
    pl = dst;
    pr = pl + size;
    pm = pl + ((pr - pl) >> 1);
    --pr;
    /* median of 3 */
    if (*pm < *pl) {
        SORT_SWAP(*pm, *pl);
    }
    if (*pr < *pm) {
        SORT_SWAP(*pr, *pm);
    }
    if (*pm < *pl) {
        SORT_SWAP(*pm, *pl);
    }
    SORT_SWAP(*pl, *pm);
    pivot = *pl;
    for (;;) {
        do { --pr; } while (pivot < *pr);
        if (pr <= pl) {
            break;
        }
        SORT_SWAP(*pl, *pr);
        do { ++pl; } while (*pl < pivot);
        if (pr <= pl) {
            break;
        }
        SORT_SWAP(*pl, *pr);
    }
    *pl = pivot;
    int64_quick_sort(dst, pl - dst);
    int64_quick_sort(pl + 1, dst + size - pl - 1);
}

void int64_merge_sort_rec(int64_t *dst, const size_t size, int64_t *pw) {
    int64_t *pl, *pr, *pm;
    if (size < 16) {
        int64_insertion_sort(dst, size);
        return;
    }
    pl = dst;
    pr = pl + size;
    pm = pl + ((pr - pl) >> 1);
    int64_merge_sort_rec(pl, pm - pl, pw);
    int64_merge_sort_rec(pm, pr - pm, pw);
    memcpy(pw, pl, (pm - pl) * sizeof(int64_t));
    while (pm < pr && pl < pm) {
        if (*pw <= *pm) {
            *pl++ = *pw++;
        } else {
            *pl++ = *pm++;
        }
    }
    if (pl < pm) {
        memcpy(pl, pw, (pm - pl) * sizeof(int64_t));
    }
    return;
}

void int64_merge_sort(int64_t *dst, const size_t size) {
    int64_t *pw;
    pw = (int64_t *) malloc(size * sizeof(int64_t));
    int64_merge_sort_rec(dst, size, pw);
    free(pw);
}
```
Integrating this code with a small benchmark framework, I’m able to produce the following output:

```
cc -O3 -g -Wall -pedantic main.c -o fast_merge.out
./fast_merge.out
int64_quick_sort - ok, 25683.0 usec
int64_merge_sort - ok, 19178.0 usec
float64_quick_sort - ok, 28916.0 usec
float64_merge_sort - ok, 27397.0 usec
```
Surprised, huh? You can double-check that there’s nothing fishy in the source code of the benchmark suite or either algorithm. Again, the result is platform- and compiler-dependent, so if you can’t reproduce it, don’t worry; you’ll know why at the end of the post.
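By the way, if you’d like to reproduce this without the full benchmark suite, a minimal harness along the following lines should do. This is an illustrative sketch, not the suite’s actual code: `now_usec` and `bench` are names I made up, and the sort routines from the listing above are assumed to be in the same file:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Current time in microseconds from the monotonic clock. */
static double now_usec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

/* Fill a random array, time one sort call, and verify the result. */
static void bench(const char *name, void (*sort)(int64_t *, size_t), size_t n) {
    int64_t *a = malloc(n * sizeof(int64_t));
    size_t i;
    for (i = 0; i < n; ++i) {
        a[i] = rand();
    }
    double t0 = now_usec();
    sort(a, n);
    double t1 = now_usec();
    for (i = 1; i < n; ++i) {
        if (a[i - 1] > a[i]) {
            printf("%s - FAILED\n", name);
            free(a);
            return;
        }
    }
    printf("%s - ok, %.1f usec\n", name, t1 - t0);
    free(a);
}

int main(void) {
    srand(42);
    bench("int64_quick_sort", int64_quick_sort, 100000);
    bench("int64_merge_sort", int64_merge_sort, 100000);
    return 0;
}
```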
CPU and Compiler
I soon found out that the CPU (or, more broadly, the platform) and the compiler both have an impact on the benchmark result. For the same binary executable, different CPUs lead to different conclusions. Below is a little summary of the platforms I tested:
| CPU type | Launch Date | Faster (int64) | Faster (float64) | Comment |
|---|---|---|---|---|
| E5-2620 | Q1’12 | Mergesort | Quicksort | |
| E5-2620 v2 | Q3’13 | Mergesort | The same | |
| i7-4790k | Q2’14 | Mergesort | Quicksort | |
| E5-2630 v3 | Q3’14 | Mergesort | Quicksort | |
| E5-2640 v3 | Q3’14 | Mergesort | Quicksort | |
| i7-6700HQ | Q3’15 | Mergesort | Mergesort | My PC |
| E5-2676 v3 | Q4’15 | Mergesort | The same | Amazon VPS |
| E5-2680 v4 | Q1’16 | Mergesort | Mergesort | |
| E5-26xx v4 | Q?’16 | Mergesort | Mergesort | Tencent VPS |
| Silver 4116 | Q3’17 | Mergesort | Mergesort | |
| Unknown Xeon | ? | Mergesort | Mergesort | Google cloud console |
It seems that with newer CPUs Mergesort also performs better on float sorting, suggesting that the reason Mergesort is faster lies at the assembly level.
Another strong piece of evidence that the assembly code matters is the compiler dependency. If Clang is used as the compiler, the world we’re familiar with comes back:

```
clang -O3 -g -Wall -pedantic main.c -o fast_merge.out
./fast_merge.out
int64_quick_sort - ok, 25017.0 usec
int64_merge_sort - ok, 29115.0 usec
float64_quick_sort - ok, 27895.0 usec
float64_merge_sort - ok, 32559.0 usec
```
Quicksort becomes the best again! Combined with the previous GCC result, the mist around the super-fast Mergesort becomes thinner: GCC must be doing some ingenious optimization on the Mergesort code that greatly boosts its speed, while Clang is not able to do the same. So the difference between the binaries generated by GCC and Clang is the key to answering why Mergesort is faster.
Disassembly
There are tons of assembly instructions in the binary, and we should focus only on the most important part, namely the code corresponding to the following C snippet:

```c
while (pm < pr && pl < pm) {
    if (*pw <= *pm) {
        *pl++ = *pw++;
    } else {
        *pl++ = *pm++;
    }
}
```
This is where the merging process, the heaviest work, takes place. The code is not hard to understand: a plain, usual Mergesort routine. `pl`, `pm` and `pr` point at the left side, middle and right side of the array respectively, while `pw` points to the buffer to which the left part of the array has already been copied. First, let’s look at the code generated by Clang for sorting int64:
```
0x4009f0 <int64_merge_sort_rec+224> mov (%r14),%rax
0x4009f3 <int64_merge_sort_rec+227> mov (%r15),%rsi
0x4009f6 <int64_merge_sort_rec+230> cmp %rsi,%rax ;compare *pw and *pm
0x4009f9 <int64_merge_sort_rec+233> jle 0x400a1e <int64_merge_sort_rec+270>
0x4009fb <int64_merge_sort_rec+235> lea 0x8(%r15),%rdx ;the else branch
0x4009ff <int64_merge_sort_rec+239> lea 0x8(%rbx),%rcx
0x400a03 <int64_merge_sort_rec+243> mov %rsi,(%rbx)
0x400a06 <int64_merge_sort_rec+246> cmp %r15,%rbx
0x400a09 <int64_merge_sort_rec+249> setb %al
0x400a0c <int64_merge_sort_rec+252> cmp %rbp,%rdx
0x400a0f <int64_merge_sort_rec+255> jae 0x400a3b <int64_merge_sort_rec+299>
0x400a11 <int64_merge_sort_rec+257> cmp %r15,%rbx
0x400a14 <int64_merge_sort_rec+260> mov %rdx,%r15
0x400a17 <int64_merge_sort_rec+263> mov %rcx,%rbx
0x400a1a <int64_merge_sort_rec+266> jb 0x4009f0 <int64_merge_sort_rec+224>
0x400a1c <int64_merge_sort_rec+268> jmp 0x400a3b <int64_merge_sort_rec+299>
0x400a1e <int64_merge_sort_rec+270> add $0x8,%r14 ;the *pw <= *pm branch
0x400a22 <int64_merge_sort_rec+274> mov %rax,(%rbx)
0x400a25 <int64_merge_sort_rec+277> add $0x8,%rbx
0x400a29 <int64_merge_sort_rec+281> cmp %r15,%rbx
0x400a2c <int64_merge_sort_rec+284> setb %al
0x400a2f <int64_merge_sort_rec+287> cmp %rbp,%r15
0x400a32 <int64_merge_sort_rec+290> jae 0x400a41 <int64_merge_sort_rec+305>
0x400a34 <int64_merge_sort_rec+292> cmp %r15,%rbx
0x400a37 <int64_merge_sort_rec+295> jb 0x4009f0 <int64_merge_sort_rec+224>
```
There isn’t much optimization in this code, actually; it is merely a “machine translation” of the C code. Note that two branches appear in this assembly, decided at `<int64_merge_sort_rec+233>`, corresponding to `if (*pw <= *pm)` in the C code.
So, what about the code GCC generates? The first impression it gives is that it is short:

```
0x4009b0 <int64_merge_sort_rec+256> cmp %rbx,%rbp
0x4009b3 <int64_merge_sort_rec+259> jae 0x4009e4 <int64_merge_sort_rec+308>
0x4009b5 <int64_merge_sort_rec+261> mov (%r12),%rax
0x4009b9 <int64_merge_sort_rec+265> mov (%rbx),%rdx
0x4009bc <int64_merge_sort_rec+268> lea 0x8(%r12),%rsi
0x4009c1 <int64_merge_sort_rec+273> lea 0x8(%rbx),%rcx
0x4009c5 <int64_merge_sort_rec+277> cmp %rdx,%rax
0x4009c8 <int64_merge_sort_rec+280> cmovle %rsi,%r12
0x4009cc <int64_merge_sort_rec+284> add $0x8,%rbp
0x4009d0 <int64_merge_sort_rec+288> cmp %rdx,%rax
0x4009d3 <int64_merge_sort_rec+291> cmovg %rcx,%rbx
0x4009d7 <int64_merge_sort_rec+295> cmovg %rdx,%rax
0x4009db <int64_merge_sort_rec+299> cmp %rbx,%r13
0x4009de <int64_merge_sort_rec+302> mov %rax,-0x8(%rbp)
0x4009e2 <int64_merge_sort_rec+306> ja 0x4009b0 <int64_merge_sort_rec+256>
```
Without further digging, you can tell from the length alone that this must be very efficient. But let’s look closer and see what magic GCC has done:
- `%rbx` and `%rbp` represent `pm` and `pl` respectively. If they become equal, the loop breaks; otherwise it continues.
- The values at addresses `%r12` (`pw`) and `%rbx` (`pm`) are then moved to `%rax` and `%rdx`.
- `%rax` and `%rdx` are compared twice, at `<int64_merge_sort_rec+277>` and `<int64_merge_sort_rec+288>`, to decide whether `%r12` (`pw`) and `%rbx` (`pm`) really need to be incremented.
- The smaller of `%rdx` (`*pm`) and `%rax` (`*pw`) is stored in `%rax` (`<int64_merge_sort_rec+295>`) and then moved to `*pl` (`<int64_merge_sort_rec+302>`).
- `%rbx` (`pm`) is finally compared with `%r13` (`pr`) as the other condition deciding whether to break the loop or start over from the beginning.
Note that there is only one branch and only two highly predictable jump instructions in the control flow. This pattern lets the CPU pipeline work at maximum power. The price is comparing `*pw` and `*pm` multiple times, which isn’t a big deal for int64. What a bargain!
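To make the transformation concrete, here is the same inner loop written branch-free by hand. This is a sketch of what GCC effectively does, not its literal output:

```c
#include <stdint.h>

/* The comparison result picks the value to write and bumps exactly one
 * of the two source pointers, so a compiler can use cmov and arithmetic
 * instead of a hard-to-predict jump; the only jumps left are the two
 * highly predictable loop-exit tests. */
static void merge_step_branchless(int64_t *pl, int64_t *pm,
                                  const int64_t *pr, const int64_t *pw) {
    while (pm < pr && pl < pm) {
        int64_t w = *pw, m = *pm;
        int take_left = (w <= m);   /* 0 or 1, computed without branching */
        *pl++ = take_left ? w : m;  /* typically a cmov under -O3 */
        pw += take_left;
        pm += 1 - take_left;
    }
}
```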
Interested readers can compare the length, the number of branches, and the predictability of the jump instructions between GCC-compiled Mergesort and Quicksort; it then becomes much more plausible why Mergesort can be faster than Quicksort. In fact, `perf stat` indicates that Quicksort has a 15% branch-miss rate while Mergesort has only 3%.
Now that it is clear why the GCC-compiled binary is much faster than the Clang-compiled one, what about the relation between the speed of floating-point sorting and the CPU release date? Why do newer CPUs do better on Mergesort for floats? Well, I’m not an expert on CPU architecture and performance, but I guess the reason is that comparing floating-point numbers used to be expensive and has kept getting cheaper over the years. Old CPUs therefore suffer from the extra floating-point comparisons, whereas on new CPUs a float comparison is about as cheap as an integer comparison.
How to make Mergesort faster than Quicksort
If you’ve tested my benchmark code in your environment and Quicksort is still invincible, you basically need to check two things:

- Is your CPU too old? If it’s 10 years old, well, you’d better get a new one.
- Is the assembly code generated by your compiler “optimized” for Mergesort? You can check with the reduced kernel sketched below. My test environments include GCC 5.4.0 (the Ubuntu 16.04 default) and GCC 4.8.5 (the CentOS 7 default), so these two versions should be OK. My preliminary results on Windows show that no compiler on that platform, including `vc`, can do the optimization.
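For the second point, you don’t need to disassemble the whole benchmark; compiling a reduced kernel is enough. The standalone file below (my own reduction, not part of the repo) lets you check what your compiler emits:

```c
/* merge_kernel.c - reduced merge inner loop for inspecting codegen.
 * Build with:  gcc -O3 -S merge_kernel.c
 * Then look at merge_kernel.s: cmovle/cmovg means you got the
 * branch-free version; a conditional jump like jle means you didn't. */
#include <stdint.h>

void merge_kernel(int64_t *pl, int64_t *pm, const int64_t *pr,
                  const int64_t *pw) {
    while (pm < pr && pl < pm) {
        if (*pw <= *pm) {
            *pl++ = *pw++;
        } else {
            *pl++ = *pm++;
        }
    }
}
```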
If all compilers fail on your machine, or you’re simply not interested in disassembly, you can check out the inline-assembly version of Mergesort in this repo, which yields compiler-independent optimized Mergesort code.
A convenient way to avoid buying a new CPU or setting up lots of compilers is to use the Google cloud console. Try:

```
git clone https://github.com/liwt31/fast_mergesort.git
cd fast_mergesort/bin
./fast_merge.out
```
`fast_merge.out` is a compiled binary with the optimized Mergesort. You’ll probably see a result saying Mergesort is faster than Quicksort, like this:

```
int64_quick_sort - ok, 30958.0 usec
int64_merge_sort - ok, 23785.0 usec
float64_quick_sort - ok, 33966.0 usec
float64_merge_sort - ok, 27837.0 usec
```
You can also compile and run it yourself:

```
cd ..
make run
```
Conclusion
Mergesort is faster than Quicksort, given a modern CPU and a genius compiler.