Pandas Dataframe Calculations With Previous Row

How to Fill Pandas Column with calculations involving cells from previous rows without Using for loop

Recursive calculations cannot be vectorized, so to improve performance you can use numba:

import numpy as np
import pandas as pd
from numba import jit

@jit(nopython=True)
def f(a):
    # Each output value depends on the previous output value
    d = np.empty(a.shape)
    d[0] = a[0]
    for i in range(1, a.shape[0]):
        d[i] = d[i-1] * 0.3 + a[i] * 0.7
    return d

df['Val'] = f(df['Par'].to_numpy())
print(df)
   Par      Val
0   50  50.0000
1   60  57.0000
2   70  66.1000
3   80  75.8300
4   90  85.7490
5  100  95.7247
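As a cross-check on the table above, here is a minimal sketch that reproduces the values without numba. It assumes only NumPy and pandas; the helper name recur is made up for illustration. This specific recurrence (weights 0.3 and 0.7 that sum to 1, first value copied from the input) also happens to match pandas' exponentially weighted mean with adjust=False, so for this particular formula a built-in exists. Other recurrences will generally not have one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Par': [50, 60, 70, 80, 90, 100]})

def recur(a, w_prev=0.3, w_cur=0.7):
    # Plain-Python reference version of the recurrence (hypothetical helper name)
    d = np.empty(len(a), dtype=np.float64)
    d[0] = a[0]
    for i in range(1, len(a)):
        d[i] = d[i - 1] * w_prev + a[i] * w_cur
    return d

val_loop = recur(df['Par'].to_numpy())

# With adjust=False, ewm computes y[0] = x[0] and y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
val_ewm = df['Par'].ewm(alpha=0.7, adjust=False).mean().to_numpy()

print(val_loop)
print(np.allclose(val_loop, val_ewm))
```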

Performance comparison for 1k rows:

from numba import jit
import itertools

np.random.seed(2022)

df = pd.DataFrame({'Par': np.random.randint(100, size=1000)})


In [64]: %%timeit
    ...:
    ...: df['Val1'] = f(df['Par'].to_numpy())
    ...:
    ...: import itertools
    ...:
    ...: df.loc[0,"Val"] = df.loc[0,"Par"]
    ...: for _ in itertools.repeat(None, len(df)):
    ...:     df["Val"] = df["Val"].fillna((df["Par"]*0.7)+(df["Val"].shift(1)*(0.3)))
    ...:
1.05 s ± 193 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [65]: %%timeit
    ...: @jit(nopython=True)
    ...: def f(a):
    ...:     d = np.empty(a.shape)
    ...:     d[0] = a[0]
    ...:     for i in range(1, a.shape[0]):
    ...:         d[i] = d[i-1] * 0.3 + a[i] * 0.7
    ...:     return d
    ...:
    ...: df['Val1'] = f(df['Par'].to_numpy())
    ...:
121 ms ± 3.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Test for 100k rows:

np.random.seed(2022)
df = pd.DataFrame({'Par': np.random.randint(100, size=100000)})


In [70]: %%timeit
    ...:
    ...: df['Val1'] = f(df['Par'].to_numpy())
    ...:
    ...: import itertools
    ...:
    ...: df.loc[0,"Val"] = df.loc[0,"Par"]
    ...: for _ in itertools.repeat(None, len(df)):
    ...:     df["Val"] = df["Val"].fillna((df["Par"]*0.7)+(df["Val"].shift(1)*(0.3)))
    ...:
4min 47s ± 5.39 s per loop (mean ± std. dev. of 7 runs, 1 loop each)



In [71]: %%timeit
    ...: @jit(nopython=True)
    ...: def f(a):
    ...:     d = np.empty(a.shape)
    ...:     d[0] = a[0]
    ...:     for i in range(1, a.shape[0]):
    ...:         d[i] = d[i-1] * 0.3 + a[i] * 0.7
    ...:     return d
    ...:
    ...: df['Val1'] = f(df['Par'].to_numpy())
    ...:
129 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
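One thing worth noting about the numba cells above: f is defined and decorated inside the timed cell, so the reported time most likely includes JIT compilation on every run, which would explain why 1k and 100k rows take nearly the same time. The following is a sketch of timing only the steady-state call. It uses time.perf_counter rather than %%timeit so it also runs outside IPython, and it falls back to an undecorated function if numba is not installed (that fallback is my addition, not part of the original answer).

```python
import time
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Assumption: fall back to plain Python so the sketch still runs without numba
    def njit(func):
        return func

@njit
def f(a):
    d = np.empty(a.shape[0], dtype=np.float64)
    d[0] = a[0]
    for i in range(1, a.shape[0]):
        d[i] = d[i - 1] * 0.3 + a[i] * 0.7
    return d

np.random.seed(2022)
df = pd.DataFrame({'Par': np.random.randint(100, size=100000)})
arr = df['Par'].to_numpy()

# Warm-up call: triggers compilation for this argument type before timing starts
f(arr[:10])

start = time.perf_counter()
df['Val1'] = f(arr)
elapsed = time.perf_counter() - start
print(f"steady-state call: {elapsed * 1e3:.3f} ms")
```

In a real workflow the decorated function would normally be defined once at module level and called many times, so the compilation cost is paid only on the first call.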

Need to do calculation in dataframe with previous row value

You can try using the cumulative sum in pandas to achieve this:

df['Amount'].cumsum()

# Edit-1
condition = df['Balance Created'].isnull()
df.loc[condition, 'Balance Created'] = df['Amount'].loc[condition]

You can also apply it per group, such as deposit and withdraw:

df.groupby('transaction')['Amount'].cumsum()
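To make the two calls above concrete, here is a small sketch on made-up data. The column names transaction and Amount come from the snippets above; the values and the two output column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({
    'transaction': ['deposit', 'withdraw', 'deposit', 'withdraw', 'deposit'],
    'Amount': [100, 30, 50, 20, 25],
})

# Running total over all rows
df['running_total'] = df['Amount'].cumsum()

# Running total restarted separately for each transaction type
df['running_by_type'] = df.groupby('transaction')['Amount'].cumsum()

print(df)
```

The grouped result keeps the original row order, so it can be assigned straight back as a column.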

How to speed up calculations involving previous row in pandas?

We can use numba to speed up the calculation here; see the Enhancing performance section in the pandas docs.

import numba
import numpy as np

@numba.njit
def func(a, b_0=5):
    n = len(a)
    # Pre-fill with the starting value; b[0] stays at b_0
    b = np.full(n, b_0, dtype=np.float64)
    for i in range(1, n):
        b[i] = (b[i - 1] + a[i - 1]) / 2
    return b

df['b'] = func(df['a'].to_numpy())
df

   a     b
0  1  5.00
1  6  3.00
2  2  4.50
3  8  3.25

Comparing performance

Benchmarking code, for reference.

[Benchmark plot]

The blue line represents the performance of the fastest version of your current method (using .at). The orange line represents numba's performance.
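The original benchmarking code is not included here, so the following is only a rough sketch of how such a comparison could be set up. The function names with_at and with_array, the data size, and the repeat count are all assumptions for illustration. The array version is left undecorated so the sketch runs without numba; adding @numba.njit to it would be the natural next step.

```python
import timeit
import numpy as np
import pandas as pd

def with_at(df, b_0=5.0):
    # Row-by-row version using .at for scalar access (hypothetical reconstruction)
    out = df.copy()
    out['b'] = float(b_0)
    for i in range(1, len(out)):
        out.at[i, 'b'] = (out.at[i - 1, 'b'] + out.at[i - 1, 'a']) / 2
    return out

def with_array(a, b_0=5.0):
    # Same recurrence on a plain NumPy array; this is the part numba would compile
    n = len(a)
    b = np.full(n, b_0, dtype=np.float64)
    for i in range(1, n):
        b[i] = (b[i - 1] + a[i - 1]) / 2
    return b

rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.integers(0, 10, size=2000)})

# Average over a few repetitions; absolute numbers depend on the machine
t_at = timeit.timeit(lambda: with_at(df), number=3) / 3
t_arr = timeit.timeit(lambda: with_array(df['a'].to_numpy()), number=3) / 3
print(f".at loop:   {t_at * 1e3:.2f} ms")
print(f"array loop: {t_arr * 1e3:.2f} ms")
```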

Add previous row value to current row and so on Python

Use cumsum

df['tvd'] = df['tvd'].cumsum()

Example:

import pandas as pd
import numpy as np
from io import StringIO

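# Columns are separated by two or more spaces; the unnamed first column becomes the index
txt = """MD  Incl.  Azi.
0  0.00  0.00  350.00
1  161.00  0.00  350.00
2  261.00  0.00  350.00
3  361.00  0.00  350.00
4  461.00  0.00  350.00"""

# A regex separator needs a raw string and the python parsing engine
df = pd.read_csv(StringIO(txt), sep=r'\s\s+', engine='python')

# The expression is already vectorized over the whole column, so no loop is needed
incl = np.deg2rad(df['Incl.'])
df['TVD_diff'] = ((df['MD'] - df['MD'].shift()) / 2) * (np.cos(incl).shift() + np.cos(incl))

df['TVD_diff'] = df['TVD_diff'].cumsum()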

print(df)

Output:

      MD  Incl.   Azi.  TVD_diff
0    0.0    0.0  350.0       NaN
1  161.0    0.0  350.0     161.0
2  261.0    0.0  350.0     261.0
3  361.0    0.0  350.0     361.0
4  461.0    0.0  350.0     461.0
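The first row is NaN because shift() has no previous row to pull from. If a starting depth of 0 is what you want (an assumption about the intended result, not something stated in the question), one option is to fill that first step with 0 before taking the cumulative sum. A minimal sketch using the same MD and Incl. values; the column name TVD is made up here:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'MD': [0.0, 161.0, 261.0, 361.0, 461.0],
    'Incl.': [0.0, 0.0, 0.0, 0.0, 0.0],
})

incl = np.deg2rad(df['Incl.'])

# Per-row step; diff() is equivalent to subtracting the shifted column
step = (df['MD'].diff() / 2) * (np.cos(incl).shift() + np.cos(incl))

# Treat the missing first step as 0 so the running total starts at 0 instead of NaN
df['TVD'] = step.fillna(0).cumsum()

print(df)
```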

Is there a way to use the previous calculated row value with the sum of a different column in a Pandas Dataframe?

We can define a function fast_sum to perform the required calculation and then use just-in-time compilation to compile it to machine code, so that it runs at C-like speed:

import numba
import numpy as np

@numba.jit(nopython=True)
def fast_sum(a):
    # Output has the same dtype as the input, so pass a float array
    b = np.zeros_like(a)
    b[0] = a[0]
    for i in range(1, len(a)):
        b[i] = (b[i - 1] * 5 + a[i]) / 6
    return b

df['B'] = fast_sum(df['A'].fillna(0).to_numpy())


                         A         B
2021-05-19 07:00:00   0.00  0.000000
2021-05-19 07:30:00   0.00  0.000000
2021-05-19 08:00:00   0.00  0.000000
2021-05-19 08:30:00   0.00  0.000000
2021-05-19 09:00:00  19.91  3.318333
2021-05-19 09:30:00   0.11  2.783611
2021-05-19 10:00:00   0.00  2.319676
2021-05-19 10:30:00  22.99  5.764730
2021-05-19 11:00:00   0.00  4.803942
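One detail to watch with np.zeros_like: the output array inherits the dtype of the input. The column above holds floats, so it works as shown, but with an integer column the intermediate results would be truncated. The sketch below shows this in plain NumPy without numba; the function name smooth and the sample values are made up, and I would expect the compiled version to behave similarly, though that is an assumption rather than something verified here.

```python
import numpy as np

def smooth(a):
    # Same recurrence as fast_sum, undecorated so it runs without numba
    b = np.zeros_like(a)  # inherits the dtype of a
    b[0] = a[0]
    for i in range(1, len(a)):
        b[i] = (b[i - 1] * 5 + a[i]) / 6
    return b

ints = np.array([0, 12, 0, 24])

as_int = smooth(ints)                        # integer output, fractions are dropped
as_float = smooth(ints.astype(np.float64))   # float output keeps the fractions

print(as_int.dtype, as_int)
print(as_float.dtype, as_float)
```

Casting the input with astype(np.float64) before the call, or allocating the output with an explicit float dtype, avoids the truncation.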

Performance test on sample dataframe with 90000 rows

df = pd.concat([df] * 10000, ignore_index=True)

%%timeit
df['B'] = fast_sum(df['A'].fillna(0).to_numpy())
# 1.62 ms ± 93.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

