Why Does Python Code Run Faster in a Function

Why does Python code run faster in a function?

You might ask why it is faster to store local variables than globals. This is a CPython implementation detail.

Remember that CPython is compiled to bytecode, which the interpreter runs. When a function is compiled, its local variables are stored in a fixed-size array (not a dict) and variable names are assigned to indexes. This is possible because you can't dynamically add local variables to a function. Retrieving a local variable is then literally a pointer lookup into that array plus a refcount increase on the PyObject, which is trivial.

Contrast this to a global lookup (LOAD_GLOBAL), which is a true dict search involving a hash and so on. Incidentally, this is why you need to specify global i if you want it to be global: if you ever assign to a variable inside a scope, the compiler will issue STORE_FASTs for its access unless you tell it not to.

By the way, global lookups are still pretty optimised. Attribute lookups foo.bar are the really slow ones!

Here is a small illustration of local-variable efficiency.
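The dis module makes the difference visible: it shows which opcode the compiler chose for each variable access (the function names below are invented for the demo).

```python
import dis

x = 1

def local_lookup():
    x = 1
    return x  # x is local: compiled to STORE_FAST / LOAD_FAST

def global_lookup():
    return x  # x is global: compiled to LOAD_GLOBAL

# Collect the opcode names each function actually uses.
local_ops = {ins.opname for ins in dis.get_instructions(local_lookup)}
global_ops = {ins.opname for ins in dis.get_instructions(global_lookup)}

print("LOAD_FAST" in local_ops)     # True
print("LOAD_GLOBAL" in global_ops)  # True
```

Running `dis.dis()` on either function prints the full bytecode listing if you want to inspect it by eye.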

Is it REALLY true that Python code runs faster in a function?

The flaw in your test is the way timeit compiles the code of your stmt. It's actually compiled within the following template:

template = """
def inner(_it, _timer):
%(setup)s
_t0 = _timer()
for _i in _it:
%(stmt)s
_t1 = _timer()
return _t1 - _t0
"""

Thus stmt is actually running in a function, using the fastlocals array (i.e. STORE_FAST).

Here's a test with your function in the question as f_opt versus the unoptimized compiled stmt executed in the function f_no_opt:

>>> import timeit, types
>>> code = compile(stmt, '<string>', 'exec')
>>> f_no_opt = types.FunctionType(code, globals())

>>> t_no_opt = min(timeit.repeat(f_no_opt, repeat=10, number=10))
>>> t_opt = min(timeit.repeat(f_opt, repeat=10, number=10))
>>> t_opt / t_no_opt
0.4931101445632647
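The experiment can be reproduced end to end. The original question's stmt and f_opt aren't shown above, so this sketch assumes a simple counting loop for both; the point is only that the same bytecode runs with dict-based name lookups in one case and fast locals in the other.

```python
import timeit
import types

# Assumed stmt: a simple counting loop (stand-in for the original question's code).
stmt = """\
i = 0
while i < 100000:
    i += 1
"""

def f_opt():
    # Same work, but inside a function body: i uses STORE_FAST / LOAD_FAST.
    i = 0
    while i < 100000:
        i += 1

# A module-level code object wrapped as a function: i is stored and loaded
# through dict-based name lookups instead of the fast-locals array.
code = compile(stmt, '<string>', 'exec')
f_no_opt = types.FunctionType(code, globals())

t_no_opt = min(timeit.repeat(f_no_opt, repeat=5, number=10))
t_opt = min(timeit.repeat(f_opt, repeat=5, number=10))
print(t_opt < t_no_opt)  # locals win; the ratio is typically well below 1
```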

Why does my Python function run faster than the one in C++?

There are many reasons why this performance test does not give useful results.

  1. Don't compare, or pay attention to, debug-build timing. The entire point of using a language like C or C++ is to enable (static) compiler optimizations, so unoptimized results tell you little. On the other hand, it is important to make sure that aggressive compiler optimizations don't optimize out your entire test (due to the result of the computation going unused, due to undefined behaviour anywhere in your program, or due to the compiler assuming that part of the code can't actually be reached because there would be undefined behaviour if it were reached).

  2. for i in [x]: is a pointless loop: it creates a Python list of one element, and iterates once. That one iteration does i *= a, i.e., it multiplies i, which is the Numpy array. The code only works accidentally; it happens that Numpy arrays specially define * to do a loop and multiply each element. Which brings us to...

  3. The entire point of using Numpy is that it optimizes number-crunching by using code written in C behind the scenes to implement algorithms and data structures. i simply contains a pointer to a memory allocation that looks essentially the same as the one the C program uses, and i *= a does a few O(1) checks and then sets up and executes a loop that looks essentially the same as the one in the C code.

  4. This is not reliable timing methodology, in general. That is a whole other kettle of fish. The Python standard library includes a timeit module intended to make timing easier and help avoid some of the more basic traps. But doing this properly in general is a research topic beyond the scope of a Stack Overflow question.
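Point 2 is easy to see in isolation: the for loop runs exactly once, and the real work happens inside Numpy's overloaded *= (the array values here are made up).

```python
import numpy as np

a = np.array([1, 2, 3])
i = np.array([10, 20, 30])

# "for j in [i]:" builds a one-element list and iterates once;
# the elementwise multiplication is done by Numpy's in-place *=.
for j in [i]:
    j *= a

print(i.tolist())  # [10, 40, 90]
```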


"But I want to see the slow performance of native Python, rather than Numpy's optimized stuff - "

If you just want to see the slow performance of Python iteration, then you need for the loop to actually iterate over the elements of the array (and write them back):

def mult(x, a):
    for i in range(len(x)):
        x[i] *= a

Except that experienced Pythonistas won't write the code that way, because range(len(...)) is ugly. The Pythonic approach is to create a new list:

def mult(x, a):
    return [i*a for i in x]

That will also show you the inefficiency of native Python data structures (we need to create a new list, which contains pointers to int objects).

On my machine, it is actually even slower to process the Numpy array this way than a native Python list. This is presumably because of the extra work that has to be done to interface the Numpy code with native Python, and "box" the raw integer data into int objects.
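That claim is easy to check: this sketch times the same list-comprehension mult over a plain Python list and over a Numpy array (the sizes are arbitrary).

```python
import timeit
import numpy as np

def mult(x, a):
    return [i * a for i in x]

py_list = list(range(10000))
np_arr = np.arange(10000)

t_list = timeit.timeit(lambda: mult(py_list, 3), number=100)
t_arr = timeit.timeit(lambda: mult(np_arr, 3), number=100)

# Element-by-element access to the Numpy array has to box each raw
# integer into a Python-level object, so it tends to come out slower.
print(t_arr > t_list)
```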

Why does the second code segment run so much faster than the first?

While it's true that snippet 1 has more instructions to execute, two if statements will only roughly double your execution time for those if statements ("only" being relative here). The majority of your time is lost in the for loop:

python -m timeit -s 'i=0' 'for x in range(1000): i+=1' 
10000 loops, best of 3: 46.4 usec per loop

python -m timeit -s 'i=0' 'while i<1000: i+=1'
10000 loops, best of 3: 0.0299 usec per loop

You are losing multiple orders of magnitude in the for loop, so the if statement is relatively inconsequential:

python -m timeit -s 'x=1; y=4' 'x<y'
10000000 loops, best of 3: 0.0256 usec per loop

However, I will point out that this holds for Python 3's range and Python 2's xrange. If you are using Python 2's range, as @jdowner pointed out, you will be generating the entire list of numbers ahead of time.
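That laziness is easy to see directly in Python 3 (the size bound used in the check below is just illustrative):

```python
import sys

lazy = range(10**6)         # Python 3 range: computes values on demand
eager = list(range(10**6))  # what Python 2's range built up front

print(sys.getsizeof(lazy))              # a small, constant-size object
print(sys.getsizeof(eager) > 8 * 10**6) # True: roughly a million pointers
```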

Why is the sequential code faster than the multithreaded code?

Due to the GIL, a Python interpreter cannot execute Python bytecode in more than one thread at any given time.

But beyond that, you're appending to a list in a loop:

>>> import timeit
>>> def appending():
...     output = []
...     for i in range(1000000):
...         output.append(i)
...     return output
...
>>> def gen_exp():
...     return [i for i in range(1000000)]

>>> print(f"{timeit.timeit(appending, number=100):.2}")
8.1

>>> print(f"{timeit.timeit(gen_exp, number=100):.2}")
5.2

This cost of repeatedly appending to a list is also what shows up in readline/readlines performance differences.

Leaving those issues aside, the time benchmarking itself would normally be simplified as follows.

import math
import timeit
from concurrent import futures
import multiprocessing

def run(i):
    i = i * math.pi
    return i ** 2

def wrapper_sequential():
    return [run(i) for i in range(20000)]

def wrapper_thread_pool():
    with futures.ThreadPoolExecutor(max_workers=10) as exc:
        fut = [exc.submit(run, i) for i in range(20000)]
        output = [f.result() for f in fut]

    return output

def wrapper_multiprocess():
    with multiprocessing.Pool(10) as pool:
        output = pool.map(run, (i for i in range(20000)))

    return output

if __name__ == '__main__':
    print(f"Thr: {timeit.timeit(wrapper_thread_pool, number=10):.4}")
    print(f"Seq: {timeit.timeit(wrapper_sequential, number=10):.4}")
    print(f"Mlt: {timeit.timeit(wrapper_multiprocess, number=10):.4}")

Thr: 5.146
Seq: 0.05411
Mlt: 4.055

The cost of creating threads just isn't worth it here, because the GIL only allows the interpreter to execute Python code in one thread at any given moment.

For multiprocessing, there is no direct way for Python interpreters in separate processes to share objects, so pickle is used internally to serialize data for inter-process communication - and that serialization is overhead.

If the calculation is heavy enough, multiprocessing will eventually amortize that overhead and pull ahead of sequential execution, but threads never will.
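One caveat: the GIL penalty applies to CPU-bound work like run above. When the work releases the GIL, as real I/O does, threads do pay off. A minimal sketch, with io_task as an invented stand-in for an I/O call:

```python
import time
import timeit
from concurrent import futures

def io_task(_):
    time.sleep(0.01)  # stands in for real I/O, which releases the GIL

def sequential():
    for i in range(20):
        io_task(i)

def threaded():
    with futures.ThreadPoolExecutor(max_workers=10) as exc:
        list(exc.map(io_task, range(20)))

t_seq = timeit.timeit(sequential, number=1)
t_thr = timeit.timeit(threaded, number=1)
print(t_thr < t_seq)  # True: the sleeps overlap across threads
```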

How can I make my Python code run faster

This is a rough first pass at tightening up your for loops. Since you only use each file's shape once, you can move that handling outside the inner loops, which cuts down on redundant data loading in the middle of processing. I still don't understand what counter and inc do, as they don't seem to be updated in the loop. As starting points, you should also look into the cost of repeated string concatenation and of all those appends to predictors_wrf and names_wrf.

k_space = np.arange(0,37)
j_space = np.arange(80,170)
i_space = np.arange(200,307)

predictors_wrf = []
names_wrf = []

counter = 0
cdate = start_date
while cdate <= end_date:
    if cdate.month not in month_keep:
        cdate += inc
        continue
    yy = cdate.strftime('%Y')
    mm = cdate.strftime('%m')
    dd = cdate.strftime('%d')
    filename = wrf_path+'\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00'
    file_exists = os.path.isfile(filename)
    if file_exists:
        f = nc.Dataset(filename,'r')
        times = f.variables['Times'][1:]
        num_lines = times.shape[0]
    for i in i_space:
        for j in j_space:
            for k in k_space:
                if file_exists:
                    if num_lines == 144:
                        u = f.variables['U'][1:,k,j,i]
                        v = f.variables['V'][1:,k,j,i]
                        wspd = np.sqrt(u**2.+v**2.)
                        w = f.variables['W'][1:,k,j,i]
                        p = f.variables['P'][1:,k,j,i]
                        t = f.variables['T'][1:,k,j,i]
                    if num_lines < 144:
                        print "partial files for WRF: "+filename
                        u = np.ones((144,))*99.99
                        v = np.ones((144,))*99.99
                        wspd = np.ones((144,))*99.99
                        w = np.ones((144,))*99.99
                        p = np.ones((144,))*99.99
                        t = np.ones((144,))*99.99
                else:
                    u = np.ones((144,))*99.99
                    v = np.ones((144,))*99.99
                    wspd = np.ones((144,))*99.99
                    w = np.ones((144,))*99.99
                    p = np.ones((144,))*99.99
                    t = np.ones((144,))*99.99
                counter = counter + 1
                predictors_wrf.append(u)
                predictors_wrf.append(v)
                predictors_wrf.append(wspd)
                predictors_wrf.append(w)
                predictors_wrf.append(p)
                predictors_wrf.append(t)
                u_names = 'u_'+str(k)+'_'+str(j)+'_'+str(i)
                v_names = 'v_'+str(k)+'_'+str(j)+'_'+str(i)
                wspd_names = 'wspd_'+str(k)+'_'+str(j)+'_'+str(i)
                w_names = 'w_'+str(k)+'_'+str(j)+'_'+str(i)
                p_names = 'p_'+str(k)+'_'+str(j)+'_'+str(i)
                t_names = 't_'+str(k)+'_'+str(j)+'_'+str(i)
                names_wrf.append(u_names)
                names_wrf.append(v_names)
                names_wrf.append(wspd_names)
                names_wrf.append(w_names)
                names_wrf.append(p_names)
                names_wrf.append(t_names)
    cdate += inc
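On the string-concatenation point: the six concatenations and six appends per grid point can be collapsed into one comprehension and a single extend(). This is a sketch, not the author's code; make_names is a hypothetical helper, and it uses f-strings (Python 3) rather than the + chains above.

```python
# Hypothetical helper: builds all six variable names for one (k, j, i) grid point.
def make_names(k, j, i):
    return [f"{var}_{k}_{j}_{i}" for var in ("u", "v", "wspd", "w", "p", "t")]

names_wrf = []
names_wrf.extend(make_names(0, 80, 200))  # one call per grid point
print(names_wrf[0])  # u_0_80_200
```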

