Code for this tutorial found here: https://github.com/shuttle1987/pybind11_tutorial_code

Recently I was involved in some discussion sparked by this blog post, in which “rdrewd” benchmarks a simple program that generates prime numbers in Python. What I like about it is that it’s a very good example of a situation where making pragmatic decisions about implementation is highly relevant: the author focuses on the pragmatics of solving the underlying problem.

There’s a page on the python wiki about performance in python that is a good summary of some key issues. One passage I think is highly relevant (emphasis mine):

> Some people are inappropriately obsessed with speed and think that just because C can provide better performance for certain types of problem, it must therefore be a better language for all purposes. Other people think that speed of development is far more important, and choose Python even for those applications where it will run slower. Often, they are surprised to find Python code can run at quite acceptable speeds, and in some cases even faster than what they could get from C/C++ **with a similar amount of development time invested**. Usually it is not the absolute speed that is important, you should think about what would be an acceptable speed of execution. Optimizations beyond achieving this acceptable speed are wasteful of resources (usually: your time. And thus: money.).

The general philosophy of CPython is to write the bottleneck code in a faster language. The original blog post got my attention because it’s a very good example of a time when calling out to a faster language makes sense. Personally, if I’m faced with a situation where the code is running in CPython, I’ll try PyPy first because the speedups due to the tracing JIT are quite remarkable. Many times in my career, switching to PyPy was sufficient to give a project feasible run times. The times it has not been enough, I’ve used various FFI techniques to get the benefit of fast development in Python while running performance-critical code in a faster language.

If there’s one thing I’d like people to take away from this, it’s that choosing an implementation is not a binary choice; sometimes mixing and matching is best.

## Generation of prime numbers

I mostly want to talk about programming implementation details but first I think it’s good to go over some algorithms.

Even the best implementation in the world can’t overcome a bad algorithm choice in a problem like this. Efficiently generating prime numbers is actually quite involved.

Naive algorithms like the Sieve of Eratosthenes are conceptually easy to understand but allow for almost no locality of reference. Modern CPUs are very sophisticated and are often explicitly designed to leverage locality of reference via memory caching. Efficiently using the CPU cache can have huge impacts on the performance of otherwise similar looking code.
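To make the locality point concrete, here’s a minimal Python sketch of the classic Sieve of Eratosthenes (my own illustration, not code from the original post). Notice that for each prime the inner loop writes to strided positions scattered across the whole array; once the array outgrows the CPU cache, those strided writes become cache-unfriendly:

```python
def sieve_of_eratosthenes(limit):
    """Naive Sieve of Eratosthenes: return all primes <= limit."""
    if limit < 2:
        return []
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            # Strided writes across the whole array: poor locality
            # of reference once `is_prime` no longer fits in cache.
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
    return [n for n, prime in enumerate(is_prime) if prime]

print(sieve_of_eratosthenes(30))
```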

The best algorithms for generating primes allow for good locality of reference. For example, the Sieve of Atkin has a better asymptotic complexity than the Sieve of Eratosthenes, and it also has much better locality of reference than a *naive* implementation of the Sieve of Eratosthenes. (Upon further investigation: even though the asymptotic complexity of an idealized operation is lower for some algorithms, the actual underlying operations on the hardware can carry a significantly large constant factor. In a practical setting this cannot be overlooked; benchmarking, as always, is very useful.)

A good implementation of a prime number generation algorithm will take locality into account. This will outperform naive implementations because algorithms with better locality of reference allow for much better utilization of cache-backed CPU hardware.
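As a sketch of what “taking locality into account” can look like, here’s a hypothetical segmented sieve in Python (the function name and default segment size are my own choices). It first finds the base primes up to the square root of the limit, then sieves one fixed-size window at a time, so the working set stays small enough to fit in CPU cache:

```python
import math

def segmented_sieve(limit, segment_size=32768):
    """Segmented Sieve of Eratosthenes: sieve one cache-sized window at a time."""
    if limit < 2:
        return []
    root = math.isqrt(limit)
    # Step 1: plain sieve for the "base" primes up to sqrt(limit).
    base = [True] * (root + 1)
    base_primes = []
    for p in range(2, root + 1):
        if base[p]:
            base_primes.append(p)
            for m in range(p * p, root + 1, p):
                base[m] = False
    primes = list(base_primes)
    # Step 2: sieve [low, high) windows using only the base primes,
    # so each pass touches only segment_size bytes worth of flags.
    low = root + 1
    while low <= limit:
        high = min(low + segment_size, limit + 1)
        window = [True] * (high - low)
        for p in base_primes:
            # First multiple of p that lands inside the window.
            start = max(p * p, ((low + p - 1) // p) * p)
            for m in range(start, high, p):
                window[m - low] = False
        primes.extend(n for n in range(low, high) if window[n - low])
        low = high
    return primes
```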

All of this applies in any language: when it matters, good algorithm choice along with a good implementation makes a huge difference.

**EDIT:** As **Dana Jacobsen** has pointed out in the comments, the choice of algorithm here is quite involved, and optimizing these algorithms makes a substantial difference. Thanks for pointing out some considerations I overlooked.

## Practical implementation

Let’s say we have chosen the right algorithm and that algorithm intrinsically takes some serious computing power to run. How do we actually get that into code in a way we can easily use in our applications?

The reference code that sparked this off looks like this in Python3:

```python
primes = [2]
cand = 3
while len(primes) < 20:
    for factor in primes:
        if cand % factor == 0:
            # Not a prime
            break
    else:
        primes.append(cand)
    cand += 2
print(primes)
```

Original Python2 code on pastebin: http://pastebin.com/DHQtzN0s

For the sake of the article, let’s assume that this algorithm is the one we have decided to use. The blog post discussed the relative merits of different languages, a C implementation was made, and many people in the comments benchmarked implementations of the same algorithm in various languages, with the inevitable debates about language merits. The author expressed regret that nobody wrote a C++ implementation, so here’s a naive one I threw together today:

```cpp
#include <iostream>
#include <vector>

/**
 * Generates a sequence of prime numbers
 * @param how_many how many primes to generate
 */
std::vector<int> generate_primes(unsigned int how_many){
    std::vector<int> primes;
    primes.reserve(how_many);
    primes.push_back(2); // 2 is the first prime number
    unsigned int candidate = 3;
    bool potential_prime = true;
    while(primes.size() < how_many){
        potential_prime = true;
        for(unsigned int prime: primes){
            if(candidate % prime == 0){
                candidate += 2;
                potential_prime = false;
                break;
            }
        }
        if(potential_prime){
            primes.push_back(candidate);
        }
    }
    return primes;
}

int main(){
    std::vector<int> first_20_primes = generate_primes(20);
    for(int prime : first_20_primes){
        std::cout << prime << std::endl;
    }
    return 0;
}
```

This C++11 solution is fairly close to the original Python code.

Comparatively speaking, you can see that Python makes it really easy to implement *something* quickly. There’s more that could be done with the C++ as well, but this is about the limit of quickly putting something together. As is fairly obvious, C++ is simply a more verbose language.

## Benchmarking

Decisions about performance benefit enormously from benchmarking. Because the impact of performance is very non-uniform, finding *where* the performance matters is a crucial step. If you are ever trying to decide whether you need a foreign function interface, or any other means of writing code in a different language for performance reasons, you really need to benchmark your code to find out where the *real* bottlenecks are.

Some benchmarks provided in the blog post were on extremely small sets of data. The problem size was small enough that the I/O completely eclipsed the run time. *But that was the author’s entire point*. In a throwaway project this small, the run time difference did not actually matter. In the particular case of generating 20 primes, just write it in Python; you will have the solution with the least amount of developer hours spent that way. The computational time is going to be small enough that the time required to write it in another language is just not worth the trade-off.

However, what happens when you want to generate the first thousand primes? How about the first ten thousand? What happens if you put this code on a popular web server that gets a lot of hits and you are charged for the amount of CPU time used?

We should seriously consider another implementation when the balance shifts away from ease of implementation being the main factor. Don’t make the decision lightly: if you are writing high-quality code to begin with, then switching to higher-performance code that complicates the building of your project involves a real trade-off in the form of additional complexity.

First we would benchmark to see where the bottlenecks are, as this enables us to make a judgment call about the cost of improving the performance of the code by re-implementing those bottlenecks.
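As a rough sketch of what that first benchmarking step might look like, here’s a hypothetical `timeit`-based harness (names are my own) around a trial-division implementation equivalent to the article’s algorithm. Watching how the time grows as the problem size grows is what tells you when the Python version stops being acceptable:

```python
import timeit

def generate_primes(how_many):
    """Trial division over previously found primes (the article's algorithm)."""
    primes = [2]
    cand = 3
    while len(primes) < how_many:
        if all(cand % p != 0 for p in primes):
            primes.append(cand)
        cand += 2
    return primes

for n in (20, 200, 2000):
    # Take the best of three single runs to reduce timing noise.
    best = min(timeit.repeat(lambda: generate_primes(n), number=1, repeat=3))
    print(f"first {n:4d} primes: {best:.6f}s")
```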

## FFI and Python

So let’s say you have benchmarked and found that your Python implementation of this function is such a bottleneck that you can’t satisfy your requirements. What do you do next?

One of the big things that has changed recently is how much easier it is to make Python bindings. You don’t even need to know the Python C API; you can use something like pybind11 to generate your Python modules.

Here’s how we would expose that C++ function to Python:

```cpp
#include <vector>
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>

namespace py = pybind11;

/**
 * Generates a sequence of prime numbers
 * @param how_many how many primes to generate
 */
std::vector<int> generate_primes(unsigned int how_many){
    std::vector<int> primes;
    primes.reserve(how_many);
    primes.push_back(2); // 2 is the first prime number
    unsigned int candidate = 3;
    bool potential_prime = true;
    while(primes.size() < how_many){
        potential_prime = true;
        for(unsigned int prime: primes){
            if(candidate % prime == 0){
                candidate += 2;
                potential_prime = false;
                break;
            }
        }
        if(potential_prime){
            primes.push_back(candidate);
        }
    }
    return primes;
}

/**
 * Generate the python bindings for this C++ function
 */
PYBIND11_PLUGIN(primes_py3) {
    py::module m("example", "Generating primes in c++ with python bindings using pybind11");
    m.def("generate_primes", &generate_primes,
          "A function which generates a list of primes. The length of this list is specified by the user");
    return m.ptr();
}
```

Overall the amount of boilerplate is remarkably small. The pybind11 library is also header-only, which is great for keeping builds simple, as we don’t need to link against it. However, the build is still complex enough that a script is worthwhile:

```bash
#!/bin/bash
g++ -O3 -fPIC -shared -std=c++11 \
    -I /home/janis/code_snippets/prime_gen/pybind11-1.8.1/include \
    `python3-config --cflags --ldflags` \
    prime_gen_python3_bindings.cpp -o primes_py3.so
```

This compiles the C++ into a form that can be imported from Python directly:

```
janis@janis-laptop:~/code_snippets/prime_gen (master)$ python3
Python 3.4.3+ (default, Oct 14 2015, 16:03:50)
[GCC 5.2.1 20151010] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import primes_py3
>>> primes_py3.__doc__
'Generating primes in c++ with python bindings using pybind11'
>>> primes_py3.generate_primes.__doc__
'generate_primes(int) -> list<int>\n\nA function which generates a list of primes. The length of the list is specified by the user\n'
>>> primes_py3.generate_primes(20)
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71]
```

As you can see, from the Python side this is really easy to use; you don’t even know the C++ is there.

When working on a team this can be a very powerful technique to have in your arsenal. By making performance critical modules in this way you can keep the ease of use for the consumers of the code while getting performance improvements.
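For example, a consuming module might try the compiled extension first and fall back to a pure-Python implementation with the same signature, so teammates without a compiler toolchain can still run the code. This is a hypothetical sketch (the fallback mirrors the article’s algorithm; only `primes_py3` comes from the tutorial):

```python
# Prefer the compiled pybind11 extension; fall back to pure Python.
try:
    from primes_py3 import generate_primes  # compiled C++ extension
except ImportError:
    def generate_primes(how_many):
        """Pure-Python fallback with the same signature as the extension."""
        primes = [2]
        cand = 3
        while len(primes) < how_many:
            if all(cand % p != 0 for p in primes):
                primes.append(cand)
            cand += 2
        return primes

print(generate_primes(20))
```

Callers never need to know which implementation they got, which is exactly the ease-of-use property described above.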

## Summary

Ease of development in smaller projects is one of the biggest wins that Python delivers, because speed of *development* is one of Python’s strongest points. C++ has made very different decisions because the philosophy of the language is fundamentally different. C++ is designed to be powerful; it isn’t designed for ease of use in smaller projects, though it has become a *lot* easier to use in the more modern revisions of the language. Where C++ really shines is in those situations where the power it gives you is actually important, such as when there are high performance requirements.

As this tutorial shows, when there are performance bottlenecks you have options other than a complete rewrite. Specifically, you can deal with just the parts your profiling/benchmarking has shown to be problematic.

If you are writing your performance-critical code in C++, pybind11 is a convenient way to consume that code from Python. It is by no means the only way: you could use subprocesses or message queues like ZeroMQ, and each of these has its own set of trade-offs. If you have C code then I’d recommend looking into CFFI; it might make your life a lot easier. I used it in my MMAP-backed array library and saved myself a ton of hassles.

Essentially you want to use the right tool for the job, and sometimes this involves using more than one language.

If you have any questions about code implementation decisions and trade-offs please leave a comment below! Improving the performance of code is something we specialize in, contact us if you have a project you are looking for help with.

## 7 Comments

you can also use swing and call your c\c++ from other languages like ruby lua and others

Thanks for commenting! Did you mean swig? If not could you please link me to what you are referring to, I’d like to know more. This could possibly make for a good future blog post, I’ve been looking for ways to interface c++ code with Ruby.

While somewhat tangential to your main point, the fastest sieve implementations use the Sieve of Eratosthenes rather than the Sieve of Atkin. If you read the Wikipedia page for the latter completely, there is a lot of discussion of this. Of course if you use naive versions of both, you get poor performance. Similarly, comparing the 4-line naive SoE to Bernstein’s painstakingly optimized SoA will confirm the latter is faster, but says almost nothing about the algorithms (compare primesieve’s SoE to the Wikipedia SoA code and you’ll get an equally lopsided answer in the other direction). In either case, one needs to use a segmented sieve to control locality, which *is* your point. But the way your paragraph reads, you’re implying selecting what turns out to be the slower algorithm when equal optimization is applied.

Thanks for commenting! I agree that the writing was not so clear there so I made an edit to address your points.

Binaries are generally smaller by a factor of at least 2 compared to equivalent bindings generated by Boost.

Got some benchmarks as a reference? If so, I’d like to see them.