How to Vectorize (Make Use of Pandas/Numpy) Instead of Using a Nested for Loop

You can also use a compiler like Numba to do this job. This would also outperform the vectorized solution and doesn't need a temporary array.

Example

import numba as nb
import numpy as np
import pandas as pd
import time

@nb.njit(fastmath=True, parallel=True, error_model='numpy')
def your_function(df1_in, df1_out, df1_vals, df2_in, df2_out, df2_vals):
    total = 0.
    for i in nb.prange(len(df1_in)):
        for j in range(len(df2_in)):
            if df1_in[i] <= df2_out[j] and df1_out[i] >= df2_in[j]:
                total += df1_vals[i] * df2_vals[j]
    return total

Testing

dict1 = {'vals': np.random.randint(1, 100, 1000),
         'in': np.random.randint(1, 10, 1000),
         'out': np.random.randint(1, 10, 1000)}
df1 = pd.DataFrame(data=dict1)
dict2 = {'vals': np.random.randint(1, 100, 1500),
         'in': 5*np.random.random(1500),
         'out': 5*np.random.random(1500)}
df2 = pd.DataFrame(data=dict2)

# First call has some compilation overhead
res = your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
                    df2['in'].values, df2['out'].values, df2['vals'].values)

t1 = time.time()
for i in range(1000):
    res = your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
                        df2['in'].values, df2['out'].values, df2['vals'].values)

print(time.time() - t1)

Timings

vectorized solution by @AGN Gazer: 9.15 ms
parallelized Numba version:        0.7 ms

How to vectorize python for loop that modifies each element of a dataframe?

First convert your list to an array:

arr = np.asarray(sample_list)

Then note that your addition needs to broadcast to produce a 2D output. To add a "virtual" dimension to an array, use np.newaxis:

arr[:,np.newaxis] + arr

That gives you:

array([[ 2,  3,  6],
       [ 3,  4,  7],
       [ 6,  7, 10]])

That result is trivially divided by 2 to get the final answer.

Doing it the other way around is more efficient, as the division then happens in 1D instead of 2D:

arr = np.asarray(sample_list) / 2
arr[:,np.newaxis] + arr

Pandas utilize Numpy Vectorization in a user defined function instead of using loops/lambda.apply()

Unfortunately, when working with strings in numpy/pandas, there are always loops under the hood.

The idea is to create a DataFrame by splitting on whitespace, forward-fill the last values per row, filter with isin, and finally test whether all values per row are True:

df1['IsPerson'] = (df1['Vendor'].str.split(expand=True)
                                .ffill(axis=1)
                                .isin(df2['Persons'].tolist())
                                .all(axis=1))

Solution with sets:

s = set(df2['Persons'])
df1['IsPerson'] = ~df1['Vendor'].map(lambda x: s.isdisjoint(x.split()))

Performance

Performance depends on the length of both DataFrames, the number of unique values, and the number of matching values, so results on real data may differ.

np.random.seed(123)

N = 100000
L = list('abcdefghijklmno ')

df1 = pd.DataFrame({'Vendor': [''.join(x) for x in np.random.choice(L, (N, 5))]})
df2 = pd.DataFrame({'Persons': [''.join(x) for x in np.random.choice(L, (N * 10, 5))]})

In [133]: %%timeit
     ...: s = set(df2['Persons'])
     ...: df1['IsPerson1'] = ~df1['Vendor'].map(lambda x: s.isdisjoint(x.split()))
     ...:
470 ms ± 7.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [134]: %%timeit
     ...: df1['IsPerson2'] = (df1['Vendor'].str.split(expand=True)
     ...:                                  .ffill(axis=1)
     ...:                                  .isin(df2['Persons'].tolist())
     ...:                                  .all(axis=1))
     ...:
858 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Making the nested loop run faster, e.g. by vectorization in Python

You could try the option below, in which your outer loop is hidden away inside numpy's apply_along_axis(). I'm not sure about the performance benefit; only a test at a decent scale can tell (especially as there's some initial overhead involved in converting lists to numpy arrays):

import numpy as np
import array

ids = [[0, 2, 1], [3, 4, 3]]
ids_arr = np.array(ids)  # Convert to numpy array. Expensive operation?

range_index = 0  # Initialize. To be bumped up by each invocation of my_func()

inv = {}
for i in range(np.max(ids_arr)):
    inv[i] = array.array('I')

def my_func(my_slice):
    global range_index
    for i in range(my_slice[0], my_slice[1]):
        inv[i].append(range_index)
    range_index += 1

np.apply_along_axis(my_func, 0, ids_arr)
print(inv)

Output:

{0: array('I', [0]), 1: array('I', [0, 2]), 2: array('I', [0, 1, 2]),
3: array('I', [1])}

Edit:

I feel that using a dictionary might not be a good idea here. I suspect that in this particular context, dictionary indexing might actually be slower than numpy array indexing. Use the lines below to create and initialize inv as a numpy array of Python arrays; the rest of the code can remain as-is:

inv_len = np.max(ids_arr)
inv = np.empty(shape=(inv_len,), dtype=array.array)
for i in range(inv_len):
    inv[i] = array.array('I')

(Note: This assumes that your application isn't doing dict-specific stuff on inv, such as inv.items() or inv.keys(). If that's the case, however, you might need an extra step to convert the numpy array into a dict)

how to apply Numpy Vectorize instead of apply function

You don't need apply or vectorize for this:

df['bool'] = df['DAYS'] < 365

output:

   ID  DAYS   bool
0   0   293   True
1   0  1111  False
2   0  3020  False
3   0   390  False
4   0   210   True
5   1    10   True

Vectorize loop operation

If you have a correspondence between struct types (C types) and numpy numerical types, this should be fairly simple. The documentation for struct is here, while numpy's is here. The relevant conversions are:

  • 'f' -> np.single (Python does not have the equivalent type)
  • 'B' -> np.ubyte
  • 'I' -> np.uintc

You can create the output as an array of values by making a custom dtype, much like struct allows you to do:

dt = np.dtype([(c, np.single) for c in 'XYZW'] +
              [(c, np.ubyte) for c in 'RGBA'] +
              [('zeros', np.intc, 3)])

The reason for creating separate fields for each channel (e.g. [('X', np.single), ('Y', np.single), ...]) rather than a single field for all channels (e.g. [('XYZW', np.single, 4), ...]) is that you want to be able to access the arrays with uniform strides. The empty portion that you will not be assigning to can be a single block in each element: ('zeros', np.intc, 3).

You can use other dtypes that give you the results you want. For example, you can rename your fields, or split them up into the individual channels. I would recommend doing that after you've written the output array, via a view, to simplify the processing.

Now that you have a dtype, make an array with it:

output = np.zeros(rgb.shape[:2], dtype=dt)

Now you can store the fields using the dt.fields attribute combined with output.setfield:

for name, plane in zip('XYZ', np.moveaxis(points, -1, 0)):
    tp, offset, *_ = dt.fields[name]
    output.setfield(plane, tp, offset)

for name, plane in zip('RGB', np.moveaxis(rgb, -1, 0)):
    tp, offset, *_ = dt.fields[name]
    output.setfield(plane, tp, offset)

tp, offset, *_ = dt.fields['A']
output.setfield(255, tp, offset)

You can shorten this to a single loop using itertools.chain:

from itertools import chain

for name, plane in zip('XYZRGBA', chain(np.moveaxis(points, -1, 0),
                                        np.moveaxis(rgb, -1, 0),
                                        [255])):
    tp, offset, *_ = dt.fields[name]
    output.setfield(plane, tp, offset)

Notice that the loop is not terribly expensive here. It only goes through seven iterations. Each element of the result is a buffer of the exact form your struct call was creating. You can discard the shape information by raveling the output array.

The result is an array with your custom dtype, equivalent to the struct format spec 'ffffBBBBIII'. Each element is a scalar that can be indexed by field name:

>>> output[0, 0]['W']
0.0

You can create alternate views into the array if you want, e.g., to group values into categories or something like that:

>>> dt2 = np.dtype([('p', np.single, 4), ('i', np.ubyte, 4), ('z', np.intc, 3)])
>>> output2 = output.view(dtype=dt2)
>>> output2[0, 0]['p']
array([0.501182 , 0.7935149, 0.9981835, 0. ], dtype=float32) # Random example data

This view does not copy the data, just interprets the existing buffer in a different way. Internally, the representation is still a packed version of what you were trying to achieve with your struct.

Python vectorizing nested for loops

Approach #1

Here's a vectorized approach -

m,n,r = volume.shape
x,y,z = np.mgrid[0:m,0:n,0:r]
X = x - roi[0]
Y = y - roi[1]
Z = z - roi[2]
mask = X**2 + Y**2 + Z**2 < radius**2

Possible improvement: We can probably speed up the last step with the numexpr module -

import numexpr as ne

mask = ne.evaluate('X**2 + Y**2 + Z**2 < radius**2')

Approach #2

We can also gradually build the three ranges corresponding to the shape parameters and perform the subtraction against the three elements of roi on the fly, without actually creating the meshes as done earlier with np.mgrid. This benefits from broadcasting for efficiency. The implementation would look like this -

m,n,r = volume.shape
vals = ((np.arange(m)-roi[0])**2)[:,None,None] + \
       ((np.arange(n)-roi[1])**2)[:,None] + ((np.arange(r)-roi[2])**2)
mask = vals < radius**2

Simplified version : Thanks to @Bi Rico for suggesting an improvement here as we can use np.ogrid to perform those operations in a bit more concise manner, like so -

m,n,r = volume.shape
x,y,z = np.ogrid[0:m,0:n,0:r]
x,y,z = x-roi[0], y-roi[1], z-roi[2]  # newer numpy no longer accepts `np.ogrid[...] - roi` directly
mask = (x**2+y**2+z**2) < radius**2

Runtime test

Function definitions -

def vectorized_app1(volume, roi, radius):
    m,n,r = volume.shape
    x,y,z = np.mgrid[0:m,0:n,0:r]
    X = x - roi[0]
    Y = y - roi[1]
    Z = z - roi[2]
    return X**2 + Y**2 + Z**2 < radius**2

def vectorized_app1_improved(volume, roi, radius):
    m,n,r = volume.shape
    x,y,z = np.mgrid[0:m,0:n,0:r]
    X = x - roi[0]
    Y = y - roi[1]
    Z = z - roi[2]
    return ne.evaluate('X**2 + Y**2 + Z**2 < radius**2')

def vectorized_app2(volume, roi, radius):
    m,n,r = volume.shape
    vals = ((np.arange(m)-roi[0])**2)[:,None,None] + \
           ((np.arange(n)-roi[1])**2)[:,None] + ((np.arange(r)-roi[2])**2)
    return vals < radius**2

def vectorized_app2_simplified(volume, roi, radius):
    m,n,r = volume.shape
    x,y,z = np.ogrid[0:m,0:n,0:r]
    x,y,z = x-roi[0], y-roi[1], z-roi[2]
    return (x**2+y**2+z**2) < radius**2

Timings -

In [106]: # Setup input arrays
     ...: volume = np.random.rand(90,110,100) # Half of original input sizes
     ...: roi = np.random.rand(3)
     ...: radius = 3.4
     ...:

In [107]: %timeit _make_mask(volume, roi, radius)
1 loops, best of 3: 41.4 s per loop

In [108]: %timeit vectorized_app1(volume, roi, radius)
10 loops, best of 3: 62.3 ms per loop

In [109]: %timeit vectorized_app1_improved(volume, roi, radius)
10 loops, best of 3: 47 ms per loop

In [110]: %timeit vectorized_app2(volume, roi, radius)
100 loops, best of 3: 4.26 ms per loop

In [139]: %timeit vectorized_app2_simplified(volume, roi, radius)
100 loops, best of 3: 4.36 ms per loop

So, as always, broadcasting shows its magic: an almost 10,000x speedup over the original code, and more than 10x faster than creating full meshes, thanks to on-the-fly broadcasted operations.

Is there a way to vectorize counting items' co-occurences in pandas/numpy?

A numpy/scipy solution using sparse incidence matrices:

from itertools import chain
import numpy as np
from scipy import sparse
from simple_benchmark import BenchmarkBuilder, MultiArgument

B = BenchmarkBuilder()

@B.add_function()
def pp(L):
    SZS = np.fromiter(chain((0,), map(len, L)), int, len(L)+1).cumsum()
    unq, idx = np.unique(np.concatenate(L), return_inverse=True)
    S = sparse.csr_matrix((np.ones(idx.size, int), idx, SZS), (len(L), len(unq)))
    SS = (S.T@S).tocoo()
    idx = (SS.col > SS.row).nonzero()
    return unq[SS.row[idx]], unq[SS.col[idx]], SS.data[idx]  # left, right, count


from collections import Counter
from itertools import combinations

@B.add_function()
def yatu(L):
    return Counter(chain.from_iterable(combinations(sorted(i), r=2) for i in L))

@B.add_function()
def feature_engineer(L):
    return Counter((min(nodes), max(nodes))
                   for row in L for nodes in combinations(row, 2))

from string import ascii_lowercase as ltrs

ltrs = np.array([*ltrs])

@B.add_arguments('array size')
def argument_provider():
    for exp in range(4, 30):
        n = int(1.4**exp)
        L = [ltrs[np.maximum(0, np.random.randint(-2, 2, 26)).astype(bool).tolist()]
             for _ in range(n)]
        yield n, L

r = B.run()
r.plot()

(Benchmark plot: runtime over number of sublists for pp, yatu, and feature_engineer.)

We see that the method presented here (pp) comes with the typical numpy constant overhead, but from roughly 100 sublists onward it starts winning.

OP's example:

import pandas as pd

df = pd.DataFrame({'letters': [['b','a','e','f','c'],['a','c','d'],['c','b','j']]})
pd.DataFrame(dict(zip(["left", "right", "count"],pp(df['letters']))))

Prints:

   left right  count
0     a     b      1
1     a     c      2
2     b     c      2
3     c     d      1
4     a     d      1
5     c     e      1
6     a     e      1
7     b     e      1
8     c     f      1
9     e     f      1
10    a     f      1
11    b     f      1
12    b     j      1
13    c     j      1

How to solve this using numpy vectorization

Here is a way to do what you've asked (UPDATED to simplify the code).

A few notes first:

  • numpy arrays must be of homogeneous type, so the numbers you show in your question will be converted by numpy to strings to match the data type of the labels (if pandas is an option, it might allow you to have columns of numbers co-exist with distinct columns of strings).
  • Though I have taken the result all the way through to match the original homogeneous data type (string), you can stop early and use the intermediate 1D numerical results if that's all you need.
  • I have used int as the numeric type, and you can change this to float if required.

import numpy

arr_to_check = numpy.array([['A', 20], ['B', 100], ['C', 80], ['D', 90], ['E', 100]])
max_possible = {'A': 25, 'B': 40, 'C': 90, 'D': 50, 'F': 100, 'G': 90}
print('arr_to_check:'); print(arr_to_check)

aT = arr_to_check.T
labels = aT[0, :]
values = aT[1, :].astype(int)
print('labels:'); print(labels)
print('values:'); print(values)

for label, value in max_possible.items():
    curMask = (labels == label)
    values[curMask] *= (values[curMask] <= value)
print('values:'); print(values)

aT[1, :] = values
arr_to_check = aT.T
print('arr_to_check:'); print(arr_to_check)

Input:

arr_to_check:
[['A' '20']
 ['B' '100']
 ['C' '80']
 ['D' '90']
 ['E' '100']]

Output:

labels:
['A' 'B' 'C' 'D' 'E']
values:
[ 20 100  80  90 100]
values:
[ 20   0  80   0 100]
arr_to_check:
[['A' '20']
 ['B' '0']
 ['C' '80']
 ['D' '0']
 ['E' '100']]

Explanation:

  • Transpose the input so that we can use vectorized operations directly on the numeric vector (values).
  • Iterate over each key/value pair in max_possible and use a vectorized expression to zero out values that exceed the cap, for rows whose label (in labels) matches the key (a loop-free variant is sketched after this list).
  • Update the original numpy array using values.

