Why Is Equivalent Python Code So Much Slower

Summary

"Because the function call overhead in Python is much larger than in Ruby."

Details

Being a microbenchmark, this really doesn't say much about the performance of either language in proper use. Most likely you would rewrite the program to take advantage of the respective strengths of Python and Ruby, but this does illustrate one of Python's current weak points. The root cause of the speed difference is function call overhead. I ran a few tests to illustrate; see below for the code and more details. For the Python tests, I used 2000 for both gcd parameters.

Interpreter: Python 2.6.6
Program type: gcd using function call
Total CPU time: 29.336 seconds

Interpreter: Python 2.6.6
Program type: gcd using inline code
Total CPU time: 13.194 seconds

Interpreter: Python 2.6.6
Program type: gcd using inline code, with dummy function call
Total CPU time: 30.672 seconds

This tells us that it's not the calculation made by the gcd function that contributes most to the time difference; it's the function call itself. With Python 3.1, the difference is similar:

Interpreter: Python 3.1.3rc1
Program type: gcd using function call
Total CPU time: 30.920 seconds

Interpreter: Python 3.1.3rc1
Program type: gcd using inline code
Total CPU time: 15.185 seconds

Interpreter: Python 3.1.3rc1
Program type: gcd using inline code, with dummy function call
Total CPU time: 33.739 seconds

Again, the actual calculation is not the biggest contributor; it's the function call itself. In Ruby, the function call overhead is much smaller. (Note: I had to use smaller parameters (200) for the Ruby versions of the programs because the Ruby profiler really slows down real-time performance. That doesn't affect CPU-time performance, though.)

Interpreter: ruby 1.9.2p0 (2010-08-18 revision 29036) [i486-linux]
Program type: gcd using function call
Total CPU time: 21.66 seconds

Interpreter: ruby 1.9.2p0 (2010-08-18 revision 29036) [i486-linux]
Program type: gcd using inline code
Total CPU time: 21.31 seconds

Interpreter: ruby 1.8.7 (2010-08-16 patchlevel 302) [i486-linux]
Program type: gcd using function call
Total CPU time: 27.00 seconds

Interpreter: ruby 1.8.7 (2010-08-16 patchlevel 302) [i486-linux]
Program type: gcd using inline code
Total CPU time: 24.83 seconds

Notice how neither Ruby 1.8 nor 1.9 suffers greatly from the gcd function call: the function call and inline versions are more or less equal. Ruby 1.9 seems to do a little better, with a smaller difference between the function call and inline versions.

So the answer to the question is: "because the function call overhead in Python is much larger than in Ruby".
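The overhead can also be seen outside a profiler. Here is a minimal sketch (mine, not part of the original benchmarks; written for modern Python 3) that times one modulo operation inline versus the same operation wrapped in a function, using the standard timeit module:

```python
import timeit

def wrapped(m, n):
    # The same single operation as the inline statement below,
    # but behind a function call (frame setup plus name lookup).
    return m % n

inline_t = timeit.timeit('m % n',
                         globals={'m': 2000, 'n': 1234},
                         number=200_000)
call_t = timeit.timeit('wrapped(m, n)',
                       globals={'wrapped': wrapped, 'm': 2000, 'n': 1234},
                       number=200_000)
print(inline_t, call_t)  # the wrapped version pays the call overhead each iteration
```

On CPython the wrapped version is consistently slower, even though the arithmetic is identical.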

Code

# iter_gcd -- Python 2.x version, with gcd function call
# Python 3.x version uses range instead of xrange
from sys import argv, stderr

def gcd(m, n):
    if n > m:
        m, n = n, m
    while n != 0:
        rem = m % n
        m = n
        n = rem
    return m

def main(a1, a2):
    comp = 0
    for j in xrange(a1, 1, -1):
        for i in xrange(1, a2):
            comp += gcd(i, j)
    print(comp)

if __name__ == '__main__':
    if len(argv) != 3:
        stderr.write('usage: {0:s} num1 num2\n'.format(argv[0]))
        exit(1)
    else:
        main(int(argv[1]), int(argv[2]))

# iter_gcd -- Python 2.x version, inline calculation
# Python 3.x version uses range instead of xrange
from sys import argv, stderr

def main(a1, a2):
    comp = 0
    for j in xrange(a1, 1, -1):
        for i in xrange(1, a2):
            if i < j:
                m, n = j, i
            else:
                m, n = i, j
            while n != 0:
                rem = m % n
                m = n
                n = rem
            comp += m
    print(comp)

if __name__ == '__main__':
    if len(argv) != 3:
        stderr.write('usage: {0:s} num1 num2\n'.format(argv[0]))
        exit(1)
    else:
        main(int(argv[1]), int(argv[2]))

# iter_gcd -- Python 2.x version, inline calculation, dummy function call
# Python 3.x version uses range instead of xrange
from sys import argv, stderr

def dummyfunc(n, m):
    a = n + m

def main(a1, a2):
    comp = 0
    for j in xrange(a1, 1, -1):
        for i in xrange(1, a2):
            if i < j:
                m, n = j, i
            else:
                m, n = i, j
            while n != 0:
                rem = m % n
                m = n
                n = rem
            comp += m
            dummyfunc(i, j)
    print(comp)

if __name__ == '__main__':
    if len(argv) != 3:
        stderr.write('usage: {0:s} num1 num2\n'.format(argv[0]))
        exit(1)
    else:
        main(int(argv[1]), int(argv[2]))

# iter_gcd -- Ruby version, with gcd function call

def gcd(m, n)
  if n > m
    m, n = n, m
  end
  while n != 0
    rem = m % n
    m = n
    n = rem
  end
  return m
end

def main(a1, a2)
  comp = 0
  a1.downto 2 do |j|
    1.upto a2-1 do |i|
      comp += gcd(i, j)
    end
  end
  puts comp
end

if __FILE__ == $0
  if ARGV.length != 2
    $stderr.puts('usage: %s num1 num2' % $0)
    exit(1)
  else
    main(ARGV[0].to_i, ARGV[1].to_i)
  end
end

# iter_gcd -- Ruby version, with inline gcd

def main(a1, a2)
  comp = 0
  a1.downto 2 do |j|
    1.upto a2-1 do |i|
      m, n = i, j
      if n > m
        m, n = n, m
      end
      while n != 0
        rem = m % n
        m = n
        n = rem
      end
      comp += m
    end
  end
  puts comp
end

if __FILE__ == $0
  if ARGV.length != 2
    $stderr.puts('usage: %s num1 num2' % $0)
    exit(1)
  else
    main(ARGV[0].to_i, ARGV[1].to_i)
  end
end

Test runs

Finally, the commands used to run Python and Ruby with profiling to get the numbers for comparison were pythonX.X -m cProfile iter_gcdX.py 2000 2000 for Python and rubyX.X -rprofile iter_gcdX.rb 200 200 for Ruby. The reason for the different parameters is that the Ruby profiler adds a lot of overhead. The results are still valid because I'm comparing the difference between a function call and inline code, not the difference between Python and Ruby as such.

See also

Why is python slower compared to Ruby even with this very simple “test”?

Is there something wrong with this python code, why does it run so slow compared to ruby?

The Computer Language Benchmarks Game

Google Search: ruby python function call faster

Why are Python Programs often slower than the Equivalent Program Written in C or C++?

Python is a higher-level language than C, which means it abstracts away details of the machine (memory management, pointers, and so on) and lets you write programs in a way that is closer to how humans think.

It is true that C code usually runs 10 to 100 times faster than Python code if you measure only the execution time. However if you also include the development time Python often beats C. For many projects the development time is far more critical than the run time performance. Longer development time converts directly into extra costs, fewer features and slower time to market.

Internally, the reason Python code executes more slowly is that it is interpreted at runtime instead of being compiled to native code ahead of time.
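You can inspect the bytecode that CPython interprets with the standard dis module; this small sketch (mine, for illustration) disassembles a one-line function:

```python
import dis

def add(a, b):
    return a + b

# Prints the bytecode instructions (argument loads, a binary add,
# a return) that the CPython interpreter loop dispatches one at a time.
dis.dis(add)
```

Each of those instructions goes through the interpreter's dispatch loop, which is where the per-operation overhead relative to native code comes from.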

Other bytecode-interpreted platforms, such as the JVM and .NET, run faster than Python because their standard distributions include a JIT compiler that compiles bytecode to native code at runtime. The reason CPython doesn't already have a JIT compiler is that the dynamic nature of Python makes one difficult to write. There is work in progress on faster Python runtimes, so you should expect the performance gap to narrow in the future, but it will probably be a while before the standard Python distribution includes a powerful JIT compiler.

Exact same python code 20 times slower on faster machine?

This is caused by the difference in dict.keys() between Python 2 and 3. In Python 2, dict.keys() creates and returns a copy of the keys as a list. In Python 3, dict.keys() returns a dictionary view instead, which is a set-like object. Checking whether an element is in a list is much slower than checking whether it's in a set, which explains the difference.

If you make the following change, the code runs in roughly the same time on Python 2 and 3:

if arrang in dictionary:   # instead of: if arrang in dictionary.keys()
    return dictionary[arrang]
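The gap between list and set membership is easy to demonstrate; the sketch below (mine, with made-up data) checks the same key against a list of keys, a set of keys, and a dictionary:

```python
# Membership test cost: a list is scanned linearly (O(n)),
# while a set or dict uses a hash lookup (O(1) on average).
keys_list = list(range(10_000))
keys_set = set(keys_list)

assert 9_999 in keys_list   # walks the whole list to reach the last element
assert 9_999 in keys_set    # a single hash probe

# In Python 3, `x in d.keys()` is also fast because keys() is a view,
# but `x in d` is the idiomatic spelling in both versions.
d = {k: None for k in keys_list}
assert (9_999 in d) and (9_999 in d.keys())
```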

How much slower python classes are compared to their equivalent functions?

No.

In general you will not notice any difference in performance based on using classes or not. The different code structures implied may mean that one is faster than the other, but it's impossible to say which.

Always write code to be read; then, if and only if it's not fast enough, make it faster. Remember: premature optimization is the root of all evil.
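To make that concrete, here is a small sketch (mine) of the same algorithm written as a free function and as a method. Both compute the same result; any timing difference comes from attribute lookup and instance creation, not from "classes" as such:

```python
def gcd_func(m, n):
    # Euclid's algorithm as a plain function.
    while n:
        m, n = n, m % n
    return m

class GcdCalculator:
    def gcd(self, m, n):
        # The same algorithm as a method; the logic is identical.
        while n:
            m, n = n, m % n
        return m

assert gcd_func(2000, 1234) == GcdCalculator().gcd(2000, 1234) == 2
```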

Why is my Python code 100 times slower than the same code in PHP?

The answer is that you aren't using the right tools and data structures for the task in Python.

Calling NumPy functionality carries quite an overhead (scipy.stats.norm.pdf uses NumPy under the hood), so you would never call these functions for one element at a time but rather for a whole array (so-called vectorized computation). That means instead of

for x in range:
    y = norm.pdf(x, x1 + ((x2 - x1) / 2), slope)
    ys.append(y)

one would rather use:

ys = norm.pdf(x, x1 + ((x2 - x1) / 2), slope)

which calculates the pdf for all elements of x and pays the overhead only once rather than len(x) times.

For example, calculating the pdf for 10^4 elements takes less than 10 times as long as calculating it for a single element:

%timeit norm.pdf(0)   # 68.4 µs ± 1.62 µs
%timeit norm.pdf(np.zeros(10**4)) # 415 µs ± 12.4 µs

Using vectorized computation will not only make your program faster but often also shorter and easier to understand. For example:

def calculate_dist_vec(x1, x2, steps, slope):
    x = np.linspace(x1, x2, steps + 2)
    y = norm.pdf(x, x1 + ((x2 - x1) / 2), slope)
    ys = y / np.sum(y)
    return x, ys

Using this vectorized version gives you a speed-up of around 10×.

The problem: norm.pdf is optimized for long vectors (nobody really cares how fast it is for 10 elements if it is very fast for one million), but your test is biased against NumPy because it creates only short arrays, so norm.pdf cannot shine.

So if it is really about small arrays and you are serious about speeding it up, you will have to roll your own version of norm.pdf. Using Cython to create this fast, specialized function might be worth a try.
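Even before reaching for Cython, a plain-Python scalar pdf (a sketch; norm_pdf_scalar is my own name, not a SciPy function) already avoids the per-call array machinery when you only ever evaluate single values:

```python
import math

def norm_pdf_scalar(x, mu=0.0, sigma=1.0):
    # Gaussian density evaluated with plain floats: no array wrapping,
    # broadcasting, or dispatch overhead for a single value.
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

print(norm_pdf_scalar(0.0))  # standard normal density at 0, same value as norm.pdf(0)
```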

Why is my Python script so much slower than its R equivalent?

The most horribly inefficient thing you do is calling the DataFrame.append method in a loop, i.e.

df = pandas.DataFrame(...)
for file in files:
    ...
    for line in file:
        ...
        df = df.append(...)

NumPy data structures are designed with functional programming in mind, so this operation is not meant to be used in an iterative, imperative fashion: the call doesn't change your data frame in place but creates a new one, resulting in an enormous increase in time and memory complexity. If you really want to use data frames, accumulate your rows in a list and pass it to the DataFrame constructor, e.g.

pre_df = []
for file in files:
    ...
    for line in file:
        ...
        pre_df.append(processed_line)

df = pandas.DataFrame(pre_df, ...)

This is the easiest way, since it introduces minimal changes to your existing code. But the better (and computationally beautiful) way is to figure out how to generate your dataset lazily. This can be achieved by splitting your workflow into discrete functions (in the functional programming sense) and composing them with lazy generator expressions and/or the imap and ifilter higher-order functions. Then you can use the resulting generator to build your data frame, e.g.

df = pandas.DataFrame.from_records(processed_lines_generator, columns=column_names, ...)
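A self-contained sketch of that lazy pipeline (mine; the file handles are stubbed with lists of strings, and all names are illustrative):

```python
def read_lines(files):
    # Lazily yield lines from a sequence of iterable "files".
    for file in files:
        yield from file

def parse(line):
    # One discrete processing stage: split a CSV-ish line into a record.
    return tuple(line.strip().split(','))

files = [["a,1\n", "b,2\n"], ["c,3\n"]]   # stand-ins for open file handles
records = (parse(line) for line in read_lines(files))
rows = list(records)   # or feed `records` straight to DataFrame.from_records
print(rows)  # [('a', '1'), ('b', '2'), ('c', '3')]
```

No intermediate data frame is ever copied; each record is produced on demand as the constructor consumes the generator.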

As for reading multiple files in one run you might want to read this.

P.S.

If you've got performance issues you should profile your code before trying to optimise it.

What is the reason that python is so much slower than C in this instance?

It's CPython (the implementation) that's slow in this case, not Python as such. CPython needs to interpret the bytecode, which will almost always be slower than compiled C code; it simply does more work than the equivalent C code. In theory, each call to sqrt, for example, requires looking up that function by name rather than jumping to a known address.

If you want comparable speed from Python, you could either annotate the source with types and compile it with Cython, or try running it under PyPy for some JIT performance.

