Pure Python Might Be Faster Than You Think

Recently I was trying to improve the performance of some of my Python code by writing C/C++ extensions. Before I started, I thought “well, I should be able to get a 10x to 100x performance boost without much effort”. But in the end I could only speed my program up by 2 to 3 times, with considerable effort. The performance gain was too small to justify the pain of development. At first I suspected mistakes in my C/C++ extension, but then I realized the real cause: my code spends most of its time calling library functions. What does that mean? Compare the following C++ and Python snippets, which both do set intersection. They are simplified versions of the code I had hoped to make much faster with a C/C++ extension.

  • C++

    auto s1 = unordered_set<int>();
    auto s2 = unordered_set<int>();
    auto s3 = unordered_set<int>();
    // s1 and s2 initialized...
    for (auto j : s1) {
        if (s2.find(j) != s2.end()) {
            s3.insert(j);
        }
    }
  • Python

    s1 = set()
    s2 = set()
    # s1 and s2 initialized...
    s3 = s1.intersection(s2)

The two snippets do essentially the same thing, and the core of each is a library call: find and insert for C++, intersection for Python. This post will show why any attempt to speed up this Python code is almost certainly in vain. The reasons are two-fold. For code like this:

  • The time cost of the Python version is mostly at the C level (for CPython) rather than the Python level, as the quick disassembly below illustrates.
  • The C-level code behind Python’s built-ins is very well constructed.
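
To see how little happens at the Python level, we can disassemble the intersection statement with the standard dis module (a quick side check, not part of the benchmarks below):

import dis

# The whole intersection compiles down to a handful of LOAD/CALL/STORE
# instructions around a single method call; all the real work happens
# inside CPython's C implementation of set.
dis.dis("s3 = s1.intersection(s2)")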

C/C++ or Python: which is faster at set intersection?

One might expect the C++ version to do much better at this set intersection job. But let’s run two complete benchmarks, at least to see whether Python comes close to C/C++. The set size is \(10^5\), and the intersection is run 100 times.

Benchmark for C++

Complete code:

#include <unordered_set>
#include <stdlib.h>

using namespace std;

int SZ = 100000;

int main(){
    auto s1 = unordered_set<int>();
    auto s2 = unordered_set<int>();
    auto s3 = unordered_set<int>();
    for (int i = 0; i < SZ; i++){
        s1.insert(rand() % SZ);
        s2.insert(rand() % SZ);
    }
    for (int i = 0; i < 100; i++){
        for (auto j : s1){
            if (s2.find(j) != s2.end()){
                s3.insert(j);
            }
        }
        s3.clear();
    }
    return 0;
}

Compile with -O3 optimization:

g++ -std=c++11 -O3 test_set.cpp

And test the time cost:

$ time ./a.out
./a.out 1.19s user 0.01s system 99% cpu 1.202 total

That’s about 1 sec. A back-of-the-envelope calculation: the set size is \(10^5\) and the loop runs \(10^2\) times, so there are \(10^7\) lookups in total, or roughly 120 ns each. Considering that both lookup and insertion require hashing (which involves at least a modulo operation) and that random access into a large container inevitably leads to cache misses, the performance is acceptable. What about Python, then?

Benchmark for Python

Complete code:

import random

SZ = 100000

s1 = set()
s2 = set()
for i in range(SZ):
    s1.add(random.randint(0, SZ))
    s2.add(random.randint(0, SZ))

for i in range(100):
    s3 = s1.intersection(s2)

Then test the time cost directly:

$ time python test_set.py
python test_set.py 0.97s user 0.04s system 99% cpu 1.010 total

It takes less than 1 sec and is even faster than C/C++! This result is totally reproducible, and there’s no obvious issue with the C++ intersection code (see this). So it seems the C code underlying Python is extremely efficient. But what if the set size becomes smaller? With smaller containers and more loop cycles, which is probably more realistic, C++ should be able to beat Python. By keeping the total computational cost constant while shrinking the set size, we can find the threshold at which C++ overtakes Python:

Set size    Loop cycles    C++ time    Python time    Ratio (C++ / Python)
\(10^5\)    \(10^2\)       1.20 s      0.97 s         1.2
\(10^4\)    \(10^3\)       0.51 s      0.37 s         1.4
\(10^3\)    \(10^4\)       0.33 s      0.21 s         1.6
\(10^2\)    \(10^5\)       0.27 s      0.14 s         1.9
\(10^1\)    \(10^6\)       0.27 s      0.35 s         0.77

The result is unexpected: only when the size drops to about 10 does C++ become faster than Python. Furthermore, at size 100, Python is nearly twice as fast as C++! As the sets get smaller, both C++ and Python initially speed up, arguably due to better data locality. But Python slows down again when the loop count hits \(10^6\), because of the per-iteration overhead accumulated at the interpreter level.
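
For reference, here is a minimal sketch of how the Python rows of the table can be produced; SZ and LOOPS are knobs I introduce here, not part of the original scripts:

import random
import time

# Each table row keeps SZ * LOOPS at 10**7
# (the C++ side is parametrized the same way).
SZ = 1000
LOOPS = 10 ** 7 // SZ

s1 = {random.randint(0, SZ) for _ in range(SZ)}
s2 = {random.randint(0, SZ) for _ in range(SZ)}

start = time.perf_counter()
for _ in range(LOOPS):
    s3 = s1.intersection(s2)
print(f"size={SZ} loops={LOOPS}: {time.perf_counter() - start:.2f} s")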

Still, C/C++ is generally faster than Python

We’ve seen lots of benchmarks showing that C/C++ is about 10x to 100x faster than Python. How can those benchmarks be reconciled with this post? For the 100x difference to hold, the following statement must be true:

  • The Python version must execute more than \(10^7\) interpreter-level loop iterations per second to compete with the C/C++ version.

This holds for most benchmarks, but not for many real-world applications. The benchmarks are undoubtedly “correct”, but they tend to be loop-intensive, because they aim to accomplish something meaningful in a handful of lines with minimal dependencies. They also favor numerical operations, because those are basic and ubiquitous, but that is exactly Python’s weak point. Real-world programs tend to call lots of library functions, and few execute more than \(10^7\) loop iterations per second. When the majority of time is spent in library functions rather than arithmetic, Python’s excellent libraries can deliver many pleasant surprises, as the comparison below shows.
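
To make the contrast concrete, here is a small sketch (my example, not from the benchmarks above) comparing the built-in intersection with the same logic written as an explicit Python-level loop; the loop version is where the textbook 10x-100x gap shows up:

import random
import time

SZ = 100000
s1 = {random.randint(0, SZ) for _ in range(SZ)}
s2 = {random.randint(0, SZ) for _ in range(SZ)}

# Library call: the whole loop runs in C inside CPython.
start = time.perf_counter()
for _ in range(100):
    s3 = s1.intersection(s2)
print(f"intersection(): {time.perf_counter() - start:.2f} s")

# Same logic as an explicit Python-level loop: every iteration pays
# interpreter overhead, so this is several times slower.
start = time.perf_counter()
for _ in range(100):
    s3 = {j for j in s1 if j in s2}
print(f"Python loop:    {time.perf_counter() - start:.2f} s")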

Conclusion

In conclusion: only Python code that executes more than \(10^7\) loop iterations per second has the potential for a 10x-100x speedup from a C/C++ extension. The code I wanted to improve generally runs \(10^5\) to \(10^6\) loop iterations per second, and each iteration calls a significant number of Python functions. In that situation the room for further performance optimization is really small, because Python’s libraries are very well tuned. If these Python libraries are replaced with a C/C++ library, the code might even become slower. There are exceptions, of course. For example, an array of 64-bit ints beats a list of Python ints in many ways, as sketched below. But care must be taken when attempting such optimizations.
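
As one illustration of that exception, here is a rough sketch (my example, with approximate accounting) comparing the memory footprint of a list of Python ints with a packed array of 64-bit ints from the standard array module; exact sizes vary by platform and Python version:

import sys
from array import array

N = 10 ** 6
lst = list(range(N))        # one boxed Python int object per element
arr = array('q', range(N))  # packed 8-byte machine ints

# Rough tally: the list stores pointers to int objects (small ints up
# to 256 are shared, so this slightly overcounts); the array stores
# the raw values contiguously.
lst_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)
print(f"list:  {lst_bytes:,} bytes")
print(f"array: {sys.getsizeof(arr):,} bytes")

On CPython the packed array is several times smaller, and tight numeric loops over such data are exactly the kind of code where a C/C++ extension does pay off.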