How to vectorize (make use of pandas/numpy) instead of using a nested for loop
You can also use a compiler like Numba to do this job. This would also outperform the vectorized solution and doesn't need a temporary array.
Example
import numba as nb
import numpy as np
import pandas as pd
import time
@nb.njit(fastmath=True,parallel=True,error_model='numpy')
def your_function(df1_in,df1_out,df1_vals,df2_in,df2_out,df2_vals):
sum=0.
for i in nb.prange(len(df1_in)):
for j in range(len(df2_in)):
if (df1_in[i] <= df2_out[j] and df1_out[i] >= df2_in[j]):
sum+=df1_vals[i]*df2_vals[j]
return sum
Testing
dict1 = {'vals': np.random.randint(1, 100, 1000),
'in': np.random.randint(1, 10, 1000),
'out': np.random.randint(1, 10, 1000)}
df1 = pd.DataFrame(data=dict1)
dict2 = {'vals': np.random.randint(1, 100, 1500),
'in': 5*np.random.random(1500),
'out': 5*np.random.random(1500)}
df2 = pd.DataFrame(data=dict2)
# First call has some compilation overhead
res=your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
df2['in'].values, df2['out'].values, df2['vals'].values)
t1 = time.time()
for i in range(1000):
res = your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
df2['in'].values, df2['out'].values, df2['vals'].values)
print(time.time() - t1)
Timings
vectorized solution @AGN Gazer: 9.15ms
parallelized Numba Version: 0.7ms
How to vectorize python for loop that modifies each element of a dataframe?
First convert your list to an array:
arr = np.asarray(sample_list)
Then note that your addition needs to broadcast to produce a 2D output. To add a "virtual" dimension to an array, use np.newaxis
:
arr[:,np.newaxis] + arr
That gives you:
array([[ 2, 3, 6],
[ 3, 4, 7],
[ 6, 7, 10]])
Which is trivially divided by 2 to get the final result.
Doing the other way around is more efficient, as the divisions are in 1D instead of 2D:
arr = np.asarray(sample_list) / 2
arr[:,np.newaxis] + arr
Pandas utilize Numpy Vectorization in a user defined function instead of using loops/lambda.apply()
Unfortunately if working with strings in numpy/pandas always are loops under the hoods.
Idea is create DataFrame
from split by whitespaces, forward filling last values, filter by isin
and last test if all True
s per rows:
df1['IsPerson'] = (df1['Vendor'].str.split(expand=True)
.ffill(axis=1)
.isin(df2['Persons'].tolist())
.all(axis=1))
Solution with sets:
s = set(df2['Persons'])
df1['IsPerson'] = ~df1['Vendor'].map(lambda x: s.isdisjoint(x.split()))
Performance
Depends of length of Both DataFrames, number of unique values and number of matched values. So in real data should be different.
np.random.seed(123)
N = 100000
L = list('abcdefghijklmno ')
df1 = pd.DataFrame({'Vendor': [''.join(x) for x in np.random.choice(L, (N, 5))]})
df2 = pd.DataFrame({'Persons': [''.join(x) for x in np.random.choice(L, (N * 10, 5))]})
In [133]: %%timeit
...: s = set(df2['Persons'])
...: df1['IsPerson1'] = ~df1['Vendor'].map(lambda x: s.isdisjoint(x.split()))
...:
470 ms ± 7.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [134]: %%timeit
...: df1['IsPerson2'] = (df1['Vendor'].str.split(expand=True)
...: .ffill(axis=1)
...: .isin(df2['Persons'].tolist())
...: .all(axis=1))
...:
858 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Making the nested loop run faster, e.g. by vectorization in Python
You could try the below option, in which, your outer loop is hidden away within numpy's C-language implementation of apply_along_axis()
. Not sure about about performance benefit, only a test at a decent scale can tell (especially as there's some initial overhead involved in converting lists to numpy arrays):
import numpy as np
import array
ids = [[0,2,1],[3,4,3]]
ids_arr = np.array(ids) # Convert to numpy array. Expensive operation?
range_index = 0 # Initialize. To be bumped up by each invocation of my_func()
inv = {}
for i in range(np.max(ids_arr)):
inv[i] = array.array('I')
def my_func(my_slice):
global range_index
for i in range(my_slice[0], my_slice[1]):
inv[i].append(range_index)
range_index += 1
np.apply_along_axis (my_func,0,ids_arr)
print (inv)
Output:
{0: array('I', [0]), 1: array('I', [0, 2]), 2: array('I', [0, 1, 2]),
3: array('I', [1])}
Edit:
I feel that using a dictionary might not be a good idea here. I suspect that in this particular context, dictionary-indexing might actually be slower than numpy
array indexing. Use the below lines to create and initialize inv
as a numpy
array of Python arrays. The rest of the code can remain as-is:
inv_len = np.max(ids_arr)
inv = np.empty(shape=(inv_len,), dtype=array.array)
for i in range(inv_len):
inv[i] = array.array('I')
(Note: This assumes that your application isn't doing dict
-specific stuff on inv
, such as inv.items()
or inv.keys()
. If that's the case, however, you might need an extra step to convert the numpy array into a dict
)
how to apply Numpy Vectorize instead of apply function
You don't need apply
nor vectorize
for this:
df['bool'] = df['DAYS'] < 365
output:
ID DAYS bool
0 0 293 True
1 0 1111 False
2 0 3020 False
3 0 390 False
4 0 210 True
5 1 10 True
Vectorize loop operation
If you have a correspondence between struct types (C types) and numpy numerical types, this should be fairly simple. The documentation for struct is here, while numpy's is here. The relevant conversions are:
'f'
->np.single
(Python does not have the equivalent type)'B'
->np.ubyte
'I'
->np.uintc
You can create the output as an array of values by making a custom dtype
, much like struct
allows you to do:
dt = np.dtype([(c, np.single) for c in 'XYZW'] +
[(c, np.ubyte) for c in 'RGBA'] +
[('', np.intc, 3)])
The reason for creating separate fields for each channel (e.g. [('X', np.single), ('Y', np.single), ...
) rather than creating a single field for all channels (e.g. [('XYZW', np.single, 4), ...
) is that you want to be able to access arrays with uniform strides. The empty portion that you will not be assigning to can be a single block in each element: ('zeros', np.intc, 3)
.
You can have other dtypes that give you want results. For example, you can name your fields, or split them up into the individual channels. I would recommend doing that after you've written the output array, in a view, to simplify the processing.
Now that you have a dtype, make an array with it:
output = np.zeros(rgb.shape[:2], dtype=dt)
Now you can store the fields using the dt.fields
attribute combined with output.setfield
:
for name, plane in zip('XYZ', np.moveaxis(points, -1, 0)):
tp, offset, *_ = dt.fields[name]
output.setfield(plane, tp, offset)
for name, plane in zip('RGB', np.moveaxis(rgb, -1, 0)):
tp, offset, *_ = dt.fields[name]
output.setfield(plane, tp, offset)
tp, offset, *_ = dt.fields['A']
output.setfield(255, tp, offset)
You can shorten to a single loop using itertoools.chain
:
from itertools import chain
for name, plane in zip('XYZRGBA', chain(np.moveaxis(points, -1, 0),
np.moveaxis(rgb, -1, 0),
[255])):
tp, offset, *_ = dt.fields[name]
output.setfield(plane, tp, offset)
Notice that the loop is not terribly expensive here. It only goes through seven iterations. Each element of the result is a buffer of the exact form your struct call was creating. You can discard the shape information by raveling the output
array.
The result is an array with your custom dtype, equivalent the struct format spec to 'ffffBBBBIII'
. Each element canis a scalar that be indexed by field name:
>>> output[0, 0]['W']
0.0
You can create alternate views into the array if you want, e.g., to group values into categories or something like that:
>>> dt2 = np.dtype([('p', np.single, 4), ('i', np.ubyte, 4), ('z', np.intc, 3)]
>>> output2 = output.view(dtype=dt2)
>>> output2[0, 0]['p']
array([0.501182 , 0.7935149, 0.9981835, 0. ], dtype=float32) # Random example data
This view does not copy the data, just interprets the existing buffer in a different way. Internally, the representation is still a packed version of what you were trying to achieve with your struct
.
Python vectorizing nested for loops
Approach #1
Here's a vectorized approach -
m,n,r = volume.shape
x,y,z = np.mgrid[0:m,0:n,0:r]
X = x - roi[0]
Y = y - roi[1]
Z = z - roi[2]
mask = X**2 + Y**2 + Z**2 < radius**2
Possible improvement : We can probably speedup the last step with numexpr
module -
import numexpr as ne
mask = ne.evaluate('X**2 + Y**2 + Z**2 < radius**2')
Approach #2
We can also gradually build the three ranges corresponding to the shape parameters and perform the subtraction against the three elements of roi
on the fly without actually creating the meshes as done earlier with np.mgrid
. This would be benefited by the use of broadcasting
for efficiency purposes. The implementation would look like this -
m,n,r = volume.shape
vals = ((np.arange(m)-roi[0])**2)[:,None,None] + \
((np.arange(n)-roi[1])**2)[:,None] + ((np.arange(r)-roi[2])**2)
mask = vals < radius**2
Simplified version : Thanks to @Bi Rico for suggesting an improvement here as we can use np.ogrid
to perform those operations in a bit more concise manner, like so -
m,n,r = volume.shape
x,y,z = np.ogrid[0:m,0:n,0:r]-roi
mask = (x**2+y**2+z**2) < radius**2
Runtime test
Function definitions -
def vectorized_app1(volume, roi, radius):
m,n,r = volume.shape
x,y,z = np.mgrid[0:m,0:n,0:r]
X = x - roi[0]
Y = y - roi[1]
Z = z - roi[2]
return X**2 + Y**2 + Z**2 < radius**2
def vectorized_app1_improved(volume, roi, radius):
m,n,r = volume.shape
x,y,z = np.mgrid[0:m,0:n,0:r]
X = x - roi[0]
Y = y - roi[1]
Z = z - roi[2]
return ne.evaluate('X**2 + Y**2 + Z**2 < radius**2')
def vectorized_app2(volume, roi, radius):
m,n,r = volume.shape
vals = ((np.arange(m)-roi[0])**2)[:,None,None] + \
((np.arange(n)-roi[1])**2)[:,None] + ((np.arange(r)-roi[2])**2)
return vals < radius**2
def vectorized_app2_simplified(volume, roi, radius):
m,n,r = volume.shape
x,y,z = np.ogrid[0:m,0:n,0:r]-roi
return (x**2+y**2+z**2) < radius**2
Timings -
In [106]: # Setup input arrays
...: volume = np.random.rand(90,110,100) # Half of original input sizes
...: roi = np.random.rand(3)
...: radius = 3.4
...:
In [107]: %timeit _make_mask(volume, roi, radius)
1 loops, best of 3: 41.4 s per loop
In [108]: %timeit vectorized_app1(volume, roi, radius)
10 loops, best of 3: 62.3 ms per loop
In [109]: %timeit vectorized_app1_improved(volume, roi, radius)
10 loops, best of 3: 47 ms per loop
In [110]: %timeit vectorized_app2(volume, roi, radius)
100 loops, best of 3: 4.26 ms per loop
In [139]: %timeit vectorized_app2_simplified(volume, roi, radius)
100 loops, best of 3: 4.36 ms per loop
So, as always broadcasting
showing its magic for a crazy almost 10,000x
speedup over the original code and more than 10x
better than creating meshes by using on-the-fly broadcasted operations!
Is there a way to vectorize counting items' co-occurences in pandas/numpy?
A numpy/scipy solution using sparse incidence matrices:
from itertools import chain
import numpy as np
from scipy import sparse
from simple_benchmark import BenchmarkBuilder, MultiArgument
B = BenchmarkBuilder()
@B.add_function()
def pp(L):
SZS = np.fromiter(chain((0,),map(len,L)),int,len(L)+1).cumsum()
unq,idx = np.unique(np.concatenate(L),return_inverse=True)
S = sparse.csr_matrix((np.ones(idx.size,int),idx,SZS),(len(L),len(unq)))
SS = (S.T@S).tocoo()
idx = (SS.col>SS.row).nonzero()
return unq[SS.row[idx]],unq[SS.col[idx]],SS.data[idx] # left, right, count
from collections import Counter
from itertools import combinations
@B.add_function()
def yatu(L):
return Counter(chain.from_iterable(combinations(sorted(i),r=2) for i in L))
@B.add_function()
def feature_engineer(L):
Counter((min(nodes), max(nodes))
for row in L for nodes in combinations(row, 2))
from string import ascii_lowercase as ltrs
ltrs = np.array([*ltrs])
@B.add_arguments('array size')
def argument_provider():
for exp in range(4, 30):
n = int(1.4**exp)
L = [ltrs[np.maximum(0,np.random.randint(-2,2,26)).astype(bool).tolist()] for _ in range(n)]
yield n,L
r = B.run()
r.plot()
We see that the method presented here (pp
) comes with the typical numpy constant overhead, but from ~100 sublists it starts winning.
OPs example:
import pandas as pd
df = pd.DataFrame({'letters': [['b','a','e','f','c'],['a','c','d'],['c','b','j']]})
pd.DataFrame(dict(zip(["left", "right", "count"],pp(df['letters']))))
Prints:
left right count
0 a b 1
1 a c 2
2 b c 2
3 c d 1
4 a d 1
5 c e 1
6 a e 1
7 b e 1
8 c f 1
9 e f 1
10 a f 1
11 b f 1
12 b j 1
13 c j 1
How to solve this using numpy vectorization
Here is a way to do what you've asked (UPDATED to simplify the code).
A few notes first:
- numpy arrays must be of homogeneous type, so the numbers you show in your question will be converted by numpy to strings to match the data type of the labels (if pandas is an option, it might allow you to have columns of numbers co-exist with distinct columns of strings).
- Though I have taken the result all the way through to match the original homogeneous data type (string), you can stop early and use the intermediate 1D numerical results if that's all you need.
- I have used
int
as the numeric type, and you can change this tofloat
if required.
import numpy
arr_to_check = numpy.array([['A', 20],['B', 100],['C', 80],['D', 90], ['E', 100]])
max_possible = {'A': 25, 'B': 40, 'C': 90, 'D': 50, 'F': 100, 'G': 90}
print('arr_to_check:'); print(arr_to_check)
aT = arr_to_check.T
labels = aT[0,:]
values = aT[1,:].astype(int)
print('labels:'); print(labels)
print('values:'); print(values)
for label, value in max_possible.items():
curMask = (labels == label)
values[curMask] *= (values[curMask] <= value)
print('values:'); print(values)
aT[1,:] = values
arr_to_check = aT.T
print('arr_to_check:'); print(arr_to_check)
Input:
arr_to_check:
[['A' '20']
['B' '100']
['C' '80']
['D' '90']
['E' '100']]
Output:
labels:
['A' 'B' 'C' 'D' 'E']
values:
[ 20 100 80 90 100]
values:
[ 20 0 80 0 100]
arr_to_check:
[['A' '20']
['B' '0']
['C' '80']
['D' '0']
['E' '100']]
Explanation:
- Transpose the input so that we can use vectorized operations directly on the numeric vector (
values
). - Iterate over each key/value pair in
max_possible
and use a vectorized formula to multiplyvalues
by 0 if the value inmax_possible
has been breached for rows whose label (inlabels
) matches the key inmax_possible
. - Update the original numpy array using
values
.
Related Topics
How to Get Max() to Return Variable Names Instead of Values in Python
How to Convert Column With Dtype as Object to String in Pandas Dataframe
Python: Searching for Common Values in Two Files
Tf.Data.Dataset: How to Get the Dataset Size (Number of Elements in an Epoch)
Test If Dictionary Key Exists, Is Not None and Isn't Blank
How to Get String Objects Instead of Unicode from Json
How to Upgrade the Sqlite Version Used by Python'S Sqlite3 Module on Mac
How to Save Training History on Every Epoch in Keras
Sqlalchemy - Select for Update Example
Large File Crashing on Jupyter Notebook
Render_Template in Python-Flask Is Not Working
Importerror: No Module Named Psycopg2 After Install
How to Find the Unit Digits of a Specific Number
Adding Months to a Pandas Object in Python
Getting the Bounding Box of the Recognized Words Using Python-Tesseract
In Python, How to Check If a String Only Contains Certain Characters
Extract Values Between Two Strings in a Text File Using Python