Why Are Python Programs Often Slower Than the Equivalent Program Written in C or C++

Why are Python Programs often slower than the Equivalent Program Written in C or C++?

Python is a higher-level language than C, which means it abstracts the details of the machine away from you (memory management, pointers, and so on) and lets you write programs in a way that is closer to how humans think.

It is true that C code usually runs 10 to 100 times faster than Python code if you measure only the execution time. However, if you also include the development time, Python often beats C. For many projects the development time is far more critical than the runtime performance. Longer development time converts directly into extra cost, fewer features, and slower time to market.

Internally, the reason Python code executes more slowly is that it is interpreted at runtime instead of being compiled to native code ahead of time.

Other bytecode-based platforms, such as the JVM (Java) and .NET, run faster than Python because their standard distributions include a JIT compiler that compiles bytecode to native code at runtime. The reason CPython doesn't already have a JIT compiler is that Python's dynamic nature makes one difficult to write. Work is in progress on faster Python runtimes, so you should expect the performance gap to shrink in the future, but it will probably be a while before the standard Python distribution ships with a powerful JIT compiler.

What is the reason that python is so much slower than C in this instance?

It's CPython (the implementation) that's slow in this case, not Python as a language. CPython has to interpret bytecode, which will almost always be slower than compiled C code; it simply does more work than the equivalent C code does. In theory, each call to sqrt, for example, requires looking the function up by name rather than just jumping to a known address.
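To make the name-lookup cost concrete, here is a small sketch (my own illustration, not from the answer above) showing that merely caching a function in a local variable, so that CPython skips the repeated global lookup, is measurable:

```python
import math
import timeit

def with_global_lookup(n):
    total = 0.0
    for i in range(n):
        total += math.sqrt(i)  # looks up "math", then "sqrt", on every call
    return total

def with_local_alias(n):
    sqrt = math.sqrt  # one lookup, then cheap local-variable access
    total = 0.0
    for i in range(n):
        total += sqrt(i)
    return total

print(timeit.timeit(lambda: with_global_lookup(10_000), number=100))
print(timeit.timeit(lambda: with_local_alias(10_000), number=100))
```

Exact timings vary by machine, but the locally aliased version is consistently faster, despite computing the same result.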

If you want comparable speed from Python, you could either annotate the source with types and compile it with Cython, or run it under PyPy to get JIT performance.

Why does trivial loop in python run so much slower than the same in C++? And how to optimize that?

A smart C compiler can probably optimize your loop away entirely by recognizing that, at the end, a will always be 1. Python can't do that: when iterating over an xrange, it has to call __next__ on the xrange object until StopIteration is raised, and it cannot know whether __next__ has side effects until it actually calls it, so there is no way to optimize the loop away. The take-away from this paragraph is that it is MUCH HARDER to optimize a Python "compiler" than a C compiler, because Python is such a dynamic language that the compiler would need to know how each object will behave in every circumstance. In C, that's much easier, because the compiler knows exactly what type every object is ahead of time.
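Here is a sketch of why the interpreter can't delete the loop: a hypothetical iterator (my own example) whose __next__ has a visible side effect. Removing the loop would change observable behaviour, so every iteration must really run:

```python
class CountingRange:
    """Behaves like range(n), but records how many times __next__ ran."""
    calls = 0

    def __init__(self, n):
        self.i = 0
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        CountingRange.calls += 1  # side effect the compiler cannot see past
        if self.i >= self.n:
            raise StopIteration
        self.i += 1
        return self.i - 1

a = 0
for _ in CountingRange(5):
    a = 1  # the loop body looks dead, but the iteration itself is not

print(a, CountingRange.calls)  # → 1 6  (5 yields plus the final StopIteration)
```

A C compiler facing a plain counted loop has no such worry: the loop's behaviour is fully known at compile time, so it can be folded away.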

Of course, compiler aside, Python needs to do a lot more work. In C, you work with base types using operations supported directly by hardware instructions. In Python, the interpreter executes the bytecode one instruction at a time in software, which is clearly going to take longer than machine-level instructions. And the data model (e.g. calling __next__ over and over again) leads to a lot of function calls that C doesn't need to make. Of course, Python does all of this to be far more flexible than a compiled language can be.

The typical way to speed up Python code is to use libraries or intrinsic functions that provide a high-level interface to low-level compiled code; scipy and numpy are excellent examples of this kind of library. Other things to look into are PyPy, which includes a JIT compiler (you probably won't reach native speeds, but it will likely beat CPython, the most common implementation), or writing the performance-critical sections as C or Fortran extensions using the CPython API, Cython, or f2py.
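A minimal sketch of the "push the loop into compiled code" idea with numpy: one vectorised operation replaces a million interpreter round trips. (The sizes and names here are my own illustration.)

```python
import numpy as np

n = 100_000
xs = np.arange(n, dtype=np.int64)

# Pure-Python loop: one bytecode dispatch cycle per element.
total_slow = 0
for x in xs:
    total_slow += int(x) * int(x)

# Vectorised: the square and the sum each run as a single C-level loop.
total_fast = int(np.sum(xs * xs))

print(total_slow == total_fast)  # → True
```

Timing the two versions on your machine will typically show the vectorised form running one to two orders of magnitude faster, which is exactly the overhead the answer above describes.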

Given this same piece of code, why is the python version 100x+ slower than C?

In short - it's not slower, it's just stuck.

The while loop in your Python version is an infinite loop: because of the indentation, j is never changed, so you never exit it. The fact that your program didn't just "take longer" but actually got completely stuck should have been a hint (don't feel bad though, I once waited three days before being convinced about a similar scenario).

That's one thing; fixing it would make the program stop, but with incorrect results, because the outer for loop is also missing its indentation: you want to run the check for each number in the range.

Fixed code:

import time

def iterative_brute_force(n):
    longest = 0
    terms = 0
    for i in range(1, n + 1):
        j = i
        this_terms = 1

        while j != 1:
            this_terms += 1
            if this_terms > terms:
                terms = this_terms
                longest = i

            if j % 2 == 0:
                j = j // 2  # integer division keeps j an int in Python 3
            else:
                j = 3 * j + 1

    return longest, terms

t0 = time.time()
print(iterative_brute_force(10 ** 6))
t1 = time.time()
total = t1 - t0
print(total)

Gives -

(837799, 525)
34.4885718822

while the c version gives -

longest: 837799 (525)
It took 0.600000 seconds

There, now everything makes sense again. Python is indeed slower, and we can get to the real question :)

On a side note though, this is far from optimized, since you may revisit numbers you have already seen. A better algorithm here would store the chain lengths for those numbers in a handy lookup table.
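A sketch of that lookup-table idea, assuming a memoised helper of my own devising (not code from the answer): each number's chain length is computed once and cached, so shared sub-chains are never re-walked.

```python
def collatz_length(j, cache={1: 1}):
    """Chain length of j, memoised across calls via the default-arg cache."""
    if j not in cache:
        nxt = j // 2 if j % 2 == 0 else 3 * j + 1
        cache[j] = 1 + collatz_length(nxt)
    return cache[j]

def brute_force_memoized(n):
    longest, terms = 0, 0
    for i in range(1, n + 1):
        length = collatz_length(i)
        if length > terms:
            longest, terms = i, length
    return longest, terms

print(brute_force_memoized(10 ** 6))  # → (837799, 525), same answer, much faster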


Now, regarding the fundamental question here (which is relevant even after fixing the code, as you can see): execution time across languages is a tricky domain. Even if you truly implement the same algorithm, the actual execution is shaped by compiler or interpreter behavior. Python is interpreted, so your code runs through another layer of software that manages your program, while C just runs natively. This buys you language features (and rules out some optimizations), and the exact cost depends on benchmarking each application, but it's probably safe to say that on most workloads you'll observe Python (or other interpreted languages) running 10-100x slower due to this overhead.

Furthermore, having compiled the C code in advance allows your compiler to produce far better optimized code. JITting can mitigate that on Python (and reduce the interpreter overhead a bit), but it's not available in all Python implementations (at least not the "pure" ones).

See also the discussion at - Why are Python Programs often slower than the Equivalent Program Written in C or C++?

Python vs CPP: Why is the difference in speed so huge?

Here's a simple example of the difference:

i++ in C++ compiles down to (on x86-64 machines) a simple inc REGISTER instruction. Takes a fraction of a cycle to execute.

i += 1 in Python can be disassembled with the dis module via dis.dis('i += 1') which informs us that the bytecode involved is:

  1           0 LOAD_NAME                0 (i)
              2 LOAD_CONST               0 (1)
              4 INPLACE_ADD
              6 STORE_NAME               0 (i)
              8 LOAD_CONST               1 (None)
             10 RETURN_VALUE


Technically, all the instructions that end in _NAME become _FAST in a function (we disassembled an isolated statement, so it behaved slightly differently), and the LOAD_CONST (None)/RETURN_VALUE pair won't exist for the expression in a real function (the function has to do it, but not for every expression), but close enough. In practice, the real bytecode within a function would be more like:

  1           0 LOAD_FAST                0 (i)
              2 LOAD_CONST               0 (1)
              4 INPLACE_ADD
              6 STORE_FAST               0 (i)

Each of those instructions requires either a run through a switch statement or a computed goto (depending on how CPython was compiled), loading the next instruction, and updating code position information (it also involves repeatedly checking that no other thread is asking for the GIL). LOAD_FAST and LOAD_CONST instructions involve a C array lookup and a reference count adjustment (a single reference count adjustment alone is equivalent to the i++ from before, except it has to change memory, not a register, so it's slower). STORE_FAST similarly involves a C array lookup, a reference count adjustment (to decrement the existing value), and often a memory free (if the decref removed the last reference to the value).
INPLACE_ADD has to dynamically look up and call a function pointer to perform the addition (and it does so through a few layers of function indirection in the first place), which itself has to extract the underlying C value of each Python int to do the work (and if the numbers are big enough, this involves array-based math, which gets ugly), (usually) create a brand-new Python int object, and do yet more reference count adjustment.

Basically, to get the equivalent of what C/C++ does in a single, cheap assembly instruction against a register, Python had to perform (estimating) half a dozen function calls (including one through a function pointer), dozens of memory lookups, a dozen or so reference count adjustments, etc. Frankly, the most surprising thing is that Python only takes ~24x longer than the C++.

I'll note that the relative cost here is highest for simple math operations; the more work a single bytecode does, the less the interpreter overhead matters. Unfortunately for this case, your code is nothing but simple math, so Python (at least, CPython) is at its worst here.

As for speeding it up, the main rules are:

  1. Write Python code, not C code. You're manually maintaining your counters, when Python's range could do the job for you (and save a lot of individual bytecode instructions). As I mentioned, it's the simplest, cheapest operations where the interpreter overhead is highest, but those operations are normally stuff you don't actually need to do very much, because there is usually a better way to do them (e.g. for loops over range rather than while loops with manual counter adjustment).
  2. For mass math operations, use extension modules that can do the work in bulk, e.g. numpy. All that overhead for a single addition is bad; paying it just once for 1000 additions is pretty trivial.
  3. Try alternate interpreters (e.g. PyPy)
  4. Use Cython to compile your Python code to C/C++ (requires adding appropriate cdef declarations)
  5. Use ctypes to call existing C libraries, and/or write raw Python C extensions (when Cython can't handle what you want)

Aside from that, you just have to accept that an interpreted language with dynamic typing will always have overhead that a compiled, statically typed language won't.


To address point #1, a Pythonic version of your code would look like:

def main():
    sum = 1
    for i in range(2, 100000):
        for j in range(2, i):
            if i % j == 0:
                sum += 1
                break

    print(sum)

if __name__ == "__main__":
    main()

You could even replace the inner loop with:

    sum += any(i % j == 0 for j in range(2, i))

though that's unlikely to yield any performance benefits, just a bit of code simplification. The performance benefits come from using range, which bundles all the basic math of incrementing and testing into a single dedicated function, reducing the overhead significantly.
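As a quick sanity check (my own, not from the answer), the any() form counts exactly the same numbers as the explicit inner loop:

```python
def count_loop(limit):
    total = 1
    for i in range(2, limit):
        for j in range(2, i):
            if i % j == 0:
                total += 1
                break
    return total

def count_any(limit):
    total = 1
    for i in range(2, limit):
        # any() short-circuits on the first divisor, like the break above
        total += any(i % j == 0 for j in range(2, i))
    return total

print(count_loop(1000) == count_any(1000))  # → True
```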

For demonstration of the difference in bytecode complexity, consider a function that does nothing but run a loop with either while and a manual counter or for and range:

def whileloop(n):
    i = 0
    while i < n:
        i += 1

def forloop(n):
    for i in range(n):
        pass

Disassembling each function shows:

  3           0 LOAD_CONST               1 (0)
              2 STORE_FAST               1 (i)

  4           4 SETUP_LOOP              20 (to 26)
        >>    6 LOAD_FAST                1 (i)
              8 LOAD_FAST                0 (n)
             10 COMPARE_OP               0 (<)
             12 POP_JUMP_IF_FALSE       24

  5          14 LOAD_FAST                1 (i)
             16 LOAD_CONST               2 (1)
             18 INPLACE_ADD
             20 STORE_FAST               1 (i)
             22 JUMP_ABSOLUTE            6
        >>   24 POP_BLOCK
        >>   26 LOAD_CONST               0 (None)
             28 RETURN_VALUE

for whileloop and:

  8           0 SETUP_LOOP              16 (to 18)
              2 LOAD_GLOBAL              0 (range)
              4 LOAD_FAST                0 (n)
              6 CALL_FUNCTION            1
              8 GET_ITER
        >>   10 FOR_ITER                 4 (to 16)
             12 STORE_FAST               1 (i)

  9          14 JUMP_ABSOLUTE           10
        >>   16 POP_BLOCK
        >>   18 LOAD_CONST               0 (None)
             20 RETURN_VALUE


for forloop. The body of the loop (the stuff executed once per pass, including testing the termination condition) for the while runs from the LOAD_FAST following the SETUP_LOOP to the JUMP_ABSOLUTE, encompassing nine instructions per loop; for the for, it runs from the FOR_ITER to the JUMP_ABSOLUTE, encompassing just three instructions. Since the work done for all of these instructions is pretty trivial, it's easy to see how the overhead of the loop itself would be significantly higher for the manually managed counter with a while loop.
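You can verify the claim directly with timeit. The exact numbers depend on your machine and CPython version, but the while-loop variant is consistently slower:

```python
import timeit

def whileloop(n):
    i = 0
    while i < n:
        i += 1  # nine bytecode instructions per pass

def forloop(n):
    for i in range(n):
        pass    # three bytecode instructions per pass

t_while = timeit.timeit(lambda: whileloop(100_000), number=50)
t_for = timeit.timeit(lambda: forloop(100_000), number=50)
print(f"while: {t_while:.3f}s  for: {t_for:.3f}s")
```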

Is Python slower than Java/C#?

Don't conflate Language and Run-Time.

Python (the language) has many run-time implementations.

  • CPython is usually interpreted, and will be slower than native-code C#. It might be slower than Java, depending on the Java JIT compiler.

  • Jython is interpreted on the JVM and has the same performance profile as Java.

  • IronPython relies on the same .NET libraries and IL as C#, so the performance difference will be relatively small.

  • Python can be translated to native code via Pyrex, PyToC, and others. In this case, it will generally perform as well as C++. You can, to an extent, further optimize C++ and perhaps squeeze out a little better performance than unoptimized Pyrex output.

    For more information, see http://arcriley.blogspot.com/2009/03/so-long-pyrex.html

Note that Python (the language) is not slow. Some Python run-times (CPython, for example) will be slower than native-code C++.

Python faster than C++? How does this happen?

There isn't anything obvious here. Since CPython is written in C, its print most likely uses something like printf under the hood. C++ iostreams, like cout, are often implemented in a way that's much slower than printf. If you want to put C++ on a better footing, you can try changing to:

#include <cstdio>

int main()
{
    int x = 0;
    while (x != 1000000)
    {
        ++x;
        std::printf("%d\n", x);
    }
    return 0;
}

I did change x++ to ++x. Years ago people thought that was a worthwhile "optimization." I will have a heart attack if that change makes any difference in your program's performance (OTOH, I am positive that using std::printf will make a huge difference in runtime performance). I made the change simply because you aren't paying attention to the value of x before incrementing it, so I think it's useful to say that in the code.

Performance differences between Python and C

Use Python until you have a performance problem. If you ever have one, figure out what the problem actually is (it often isn't what you would have guessed up front). Then solve that specific performance problem, which will most likely be an algorithm or data structure change. In the rare case that the problem really needs C, you can write just that portion in C and use it from your Python code.
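The "figure out what the problem is" step can be sketched with the standard library profiler: measure before rewriting anything in C. (The functions below are hypothetical stand-ins for your own code.)

```python
import cProfile
import io
import pstats

def hot_function(n):
    # deliberately does per-element Python work
    return sum(i * i for i in range(n))

def main():
    for _ in range(5):
        hot_function(100_000)

profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())  # shows where the time actually goes
```

Only once the profile names a genuine hotspot is it worth reaching for numpy, Cython, or a C extension for that specific function.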

Python and C++ performance comparison

That's not so easy to answer without implementation details, but in general Python is known for being much less cache-friendly, because you mostly don't have the option of low-level optimizing cache behaviour in Python. However, this isn't always the case: you can sometimes improve cache friendliness in Python directly, or use C++ code for the critical sections. But always consider that you can simply optimize code further in C++. So if you have truly critical code paths, where you want to squeeze out every percent of speed and efficiency, you should use C++. That's the reason many programs use both: C++ for the raw performance parts and Python for a nice interface and program structure.


