How to Parallelize a Simple Python Loop

How do I parallelize a simple Python loop?

Using multiple threads on CPython won't give you better performance for pure-Python code due to the global interpreter lock (GIL). I suggest using the multiprocessing module instead:

import multiprocessing

pool = multiprocessing.Pool(4)
out1, out2, out3 = zip(*pool.map(calc_stuff, range(0, 10 * offset, offset)))

Note that this won't work in the interactive interpreter: the worker processes have to be able to import calc_stuff, so it needs to live in an actual module.

To avoid the usual FUD around the GIL: there wouldn't be any advantage to using threads for this example anyway. You want processes here, not threads, because each worker runs in its own interpreter, which sidesteps the GIL and the shared-state bugs that come with threads.
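
To make the snippet above runnable end to end, here is a minimal self-contained sketch; calc_stuff and offset are hypothetical stand-ins for whatever your loop body actually computes:

import multiprocessing

def calc_stuff(x):
    # hypothetical placeholder: any picklable function of one argument
    # that returns a tuple of three results
    return x, x ** 2, x ** 3

if __name__ == '__main__':
    offset = 1
    with multiprocessing.Pool(4) as pool:
        results = pool.map(calc_stuff, range(0, 10 * offset, offset))
    out1, out2, out3 = zip(*results)
    print(out1, out2, out3)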

parallelize 'for' loop in Python 3

My guess is that you want to work on several files at the same time. To do so, the best way (in my opinion) is to use multiprocessing. All it requires is an elementary step that can run independently, and your per-file computation already is one.

import numpy as np
import xarray as xray  # xarray was formerly distributed under the name xray
import multiprocessing as mp
import os

def f(file):
    mindex = np.zeros((1200, 1200))
    for i in range(1200):
        var1 = xray.open_dataset(file)['variable'][:, i, :].data
        for j in range(1200):
            var2 = var1[:, j]
            ## Mathematical Calculations to find var3[i,j]##
            mindex[i, j] = var3[i, j]
    return (file, mindex)


if __name__ == '__main__':
    N = mp.cpu_count()

    # folder is the directory holding the input files
    files = os.scandir(folder)

    with mp.Pool(processes=N) as p:
        # file.path (rather than file.name) keeps the full path for the workers
        results = p.map(f, [file.path for file in files])

This should return a list, results, in which each element is a tuple holding the file name and its mindex matrix. With this, you can work on multiple files at the same time; it pays off most when the computation on each file is long compared with the cost of starting the worker processes.
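
Since each result carries its file name, turning the list into a dict makes the matrices easy to look up afterwards; a small follow-up sketch:

# results is the list of (file, mindex) tuples produced by p.map above
mindex_by_file = dict(results)
for name, matrix in mindex_by_file.items():
    print(name, matrix.shape)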

Implement Parallel for loops in Python

You can also use concurrent.futures in Python 3, which offers a simpler interface than multiprocessing. See the concurrent.futures documentation for more details about the differences between the two.

from concurrent import futures

total_error = 0

with futures.ProcessPoolExecutor() as pool:
    for error in pool.map(some_function_call, parameters1, parameters2):
        total_error += error

In this case, parameters1 and parameters2 should each be a list or other iterable of the same size as the number of times you want to run the function (24 times, as per your example).
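
As a concrete illustration, here is a minimal sketch in which some_function_call is a hypothetical two-argument stand-in and both parameter lists have 24 entries:

from concurrent import futures

def some_function_call(a, b):
    # hypothetical stand-in for the real computation
    return abs(a - b)

if __name__ == '__main__':
    parameters1 = list(range(24))
    parameters2 = [x * 2 for x in range(24)]
    with futures.ProcessPoolExecutor() as pool:
        total_error = sum(pool.map(some_function_call, parameters1, parameters2))
    print(total_error)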

If parameters1 and parameters2 are not iterables/mappable, but you just want to run the function 24 times, you can submit the job the required number of times and collect the results later using a callback.

class TotalError:
    def __init__(self):
        self.value = 0

    def __call__(self, r):
        self.value += r.result()

total_error = TotalError()
with futures.ProcessPoolExecutor() as pool:
    for i in range(24):
        future_result = pool.submit(some_function_call, parameters1, parameters2)
        future_result.add_done_callback(total_error)

print(total_error.value)
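
If you would rather avoid the callback class, an equivalent sketch using futures.as_completed (with the same hypothetical some_function_call and parameters as above) collects the futures and sums their results as they finish:

total_error = 0
with futures.ProcessPoolExecutor() as pool:
    jobs = [pool.submit(some_function_call, parameters1, parameters2)
            for _ in range(24)]
    for job in futures.as_completed(jobs):
        total_error += job.result()

print(total_error)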

parallelize for loop and merge pandas dataframes

Edit: using multiprocessing instead of threading

After reading your comments it seems that you want to run your function in different processes (in parallel):

import multiprocessing
import pandas as pd

def make_df(year):
    df = pd.DataFrame({str(year): [str(year), str(year + 1), str(year + 2), str(year + 3)]})
    return df

if __name__ == '__main__':
    df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                       'A': ['A0', 'A1', 'A2', 'A3']})
    year_start = 2020
    year_stop = 2015
    year_range = range(year_start, year_stop, -1)

    # one worker per year; the __main__ guard is required so the
    # spawned workers can import make_df without re-running this block
    pool = multiprocessing.Pool(year_start - year_stop)
    df_list = pool.map(func=make_df, iterable=year_range)
    pool.close()
    pool.join()

    df = df.join(df_list)
    print(df)
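
With the per-year frames joined onto df's shared 0-3 index, the printed result should look roughly like this:

  key   A  2020  2019  2018  2017  2016
0  K0  A0  2020  2019  2018  2017  2016
1  K1  A1  2021  2020  2019  2018  2017
2  K2  A2  2022  2021  2020  2019  2018
3  K3  A3  2023  2022  2021  2020  2019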

