How to Fill Pandas Column with calculations involving cells from previous rows without Using for loop
Recursive calculations are not vectorisable; to improve performance, use numba:
import numpy as np
from numba import jit

@jit(nopython=True)
def f(a):
    # d[i] depends on d[i-1], so the values are filled in one row at a time
    d = np.empty(a.shape)
    d[0] = a[0]
    for i in range(1, a.shape[0]):
        d[i] = d[i-1] * 0.3 + a[i] * 0.7
    return d

df['Val'] = f(df['Par'].to_numpy())
print(df)
   Par      Val
0   50  50.0000
1   60  57.0000
2   70  66.1000
3   80  75.8300
4   90  85.7490
5  100  95.7247
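For this particular recurrence, pandas' exponentially weighted mean with adjust=False follows the same update rule (y[0] = x[0], y[t] = (1 - alpha) * y[t-1] + alpha * x[t]), so alpha = 0.7 should reproduce the Val column without numba. The following is a minimal sketch, assuming the same six-row Par column as above. The shortcut only applies to recurrences of exactly this form, not to recursive calculations in general.

```python
import numpy as np
import pandas as pd

# Same sample values as the Par column shown above
df = pd.DataFrame({'Par': [50, 60, 70, 80, 90, 100]})

# adjust=False selects the recursive form:
# Val[0] = Par[0], Val[i] = 0.3 * Val[i-1] + 0.7 * Par[i]
df['Val'] = df['Par'].ewm(alpha=0.7, adjust=False).mean()
print(df)
```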
Performance difference for 1k rows:
from numba import jit
import itertools
np.random.seed(2022)
df = pd.DataFrame({'Par': np.random.randint(100, size=1000)})
In [64]: %%timeit
...:
...: df['Val1'] = f(df['Par'].to_numpy())
...:
...: import itertools
...:
...: df.loc[0,"Val"] = df.loc[0,"Par"]
...: for _ in itertools.repeat(None, len(df)):
...:     df["Val"] = df["Val"].fillna((df["Par"]*0.7)+(df["Val"].shift(1)*(0.3)))
...:
1.05 s ± 193 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [65]: %%timeit
...: @jit(nopython=True)
...: def f(a):
...:     d = np.empty(a.shape)
...:     d[0] = a[0]
...:     for i in range(1, a.shape[0]):
...:         d[i] = d[i-1] * 0.3 + a[i] * 0.7
...:     return d
...:
...: df['Val1'] = f(df['Par'].to_numpy())
...:
121 ms ± 3.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Test for 100k rows:
np.random.seed(2022)
df = pd.DataFrame({'Par': np.random.randint(100, size=100000)})
In [70]: %%timeit
...:
...: df['Val1'] = f(df['Par'].to_numpy())
...:
...: import itertools
...:
...: df.loc[0,"Val"] = df.loc[0,"Par"]
...: for _ in itertools.repeat(None, len(df)):
...:     df["Val"] = df["Val"].fillna((df["Par"]*0.7)+(df["Val"].shift(1)*(0.3)))
...:
4min 47s ± 5.39 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [71]: %%timeit
...: @jit(nopython=True)
...: def f(a):
...:     d = np.empty(a.shape)
...:     d[0] = a[0]
...:     for i in range(1, a.shape[0]):
...:         d[i] = d[i-1] * 0.3 + a[i] * 0.7
...:     return d
...:
...: df['Val1'] = f(df['Par'].to_numpy())
...:
...:
129 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
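Both numba timings above define the jitted function inside the %%timeit cell, so each timed run likely includes compilation as well as execution. The near-identical 121 ms and 129 ms for 1k and 100k rows are consistent with that. The sketch below separates the first (compiling) call from a later call. The name smooth and the fallback decorator are assumptions for illustration, and the fallback only exists so the snippet still runs, much more slowly, where numba is not installed.

```python
import time
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for illustration: run as plain Python if numba is missing
    def njit(func):
        return func

@njit
def smooth(a):
    # Same recurrence as f above, under a hypothetical name
    d = np.empty(a.shape)
    d[0] = a[0]
    for i in range(1, a.shape[0]):
        d[i] = d[i - 1] * 0.3 + a[i] * 0.7
    return d

rng = np.random.default_rng(2022)
a = rng.integers(100, size=100_000)

t0 = time.perf_counter()
smooth(a)                       # first call: compilation plus execution
first_call = time.perf_counter() - t0

t0 = time.perf_counter()
result = smooth(a)              # later calls reuse the compiled code
second_call = time.perf_counter() - t0

print(f"first call:  {first_call:.4f} s")
print(f"second call: {second_call:.6f} s")
```

Defining the function once at module level and timing only the call gives a fairer picture of the steady-state cost.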
Need to do calculation in dataframe with previous row value
You can try using pandas' cumulative sum to achieve this:
df['Amount'].cumsum()
# Edit-1
condition = df['Balance Created'].isnull()
df.loc[condition, 'Balance Created'] = df['Amount'].loc[condition]
You can also apply it per group, for example separately for deposits and withdrawals:
df.groupby('transaction')['Amount'].cumsum()
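To make the difference between the two calls concrete, here is a small sketch on made-up data. Only the column names transaction and Amount come from the snippets above; the values and the two result column names are hypothetical.

```python
import pandas as pd

# Hypothetical data; only the column names come from the answer above
df = pd.DataFrame({
    'transaction': ['deposit', 'withdraw', 'deposit', 'withdraw'],
    'Amount': [100, 30, 50, 20],
})

# Running total over all rows
df['running_total'] = df['Amount'].cumsum()

# Running total restarted within each transaction type
df['running_by_type'] = df.groupby('transaction')['Amount'].cumsum()
print(df)
```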
How to speed up calculations involving previous row in pandas?
We can use numba to speed up the calculation here; see the Enhancing performance section in the pandas docs.
import numba
import numpy as np

@numba.njit
def func(a, b_0=5):
    n = len(a)
    # every element starts at b_0; b[1:] is overwritten by the loop
    b = np.full(n, b_0, dtype=np.float64)
    for i in range(1, n):
        b[i] = (b[i - 1] + a[i - 1]) / 2
    return b

df['b'] = func(df['a'].to_numpy())
df
   a     b
0  1  5.00
1  6  3.00
2  2  4.50
3  8  3.25
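Where numba is not available, the same recurrence can be written as a fold with itertools.accumulate from the standard library. This is only a sketch on the four sample values of column a shown above. It is still a Python-level loop, so it is not expected to match numba's speed.

```python
from itertools import accumulate

a = [1, 6, 2, 8]   # sample values of column a from the output above
b_0 = 5            # same starting value as the default in func

# accumulate passes the previous result and the next input to the lambda;
# a[:-1] is used because b[i] depends on a[i - 1], not a[i]
b = list(accumulate(a[:-1], lambda prev, x: (prev + x) / 2, initial=b_0))
print(b)  # [5, 3.0, 4.5, 3.25]
```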
Comparing performance
Benchmarking code, for reference.
The blue line represents the performance of the fastest version of your current method (using .at). The orange line represents numba's performance.
Add previous row value to current row and so on Python
Use cumsum
df['tvd'] = df['tvd'].cumsum()
Example:
import pandas as pd
import numpy as np
from io import StringIO

txt = """     MD  Incl.    Azi.
0    0.00   0.00  350.00
1  161.00   0.00  350.00
2  261.00   0.00  350.00
3  361.00   0.00  350.00
4  461.00   0.00  350.00"""
# columns are separated by two or more spaces; a regex separator needs the python engine
df = pd.read_csv(StringIO(txt), sep=r'\s\s+', engine='python')

# no row loop is needed: shift() pairs each row with the previous one
incl = np.deg2rad(df['Incl.'])
df['TVD_diff'] = ((df['MD'] - df['MD'].shift()) / 2) * (np.cos(incl).shift() + np.cos(incl))
df['TVD_diff'] = df['TVD_diff'].cumsum()
print(df)
Output:
      MD  Incl.   Azi.  TVD_diff
0    0.0    0.0  350.0       NaN
1  161.0    0.0  350.0     161.0
2  261.0    0.0  350.0     261.0
3  361.0    0.0  350.0     361.0
4  461.0    0.0  350.0     461.0
Is there a way to use the previous calculated row value with the sum of a different column in a Pandas Dataframe?
We can define a function fast_sum to perform the required calculation and then use just-in-time (JIT) compilation to compile it to machine code, so that it runs at C-like speeds:
import numba
import numpy as np

@numba.jit(nopython=True)
def fast_sum(a):
    # b takes the dtype of a (float here, since column A holds floats)
    b = np.zeros_like(a)
    b[0] = a[0]
    for i in range(1, len(a)):
        b[i] = (b[i - 1] * 5 + a[i]) / 6
    return b

df['B'] = fast_sum(df['A'].fillna(0).to_numpy())
                         A         B
2021-05-19 07:00:00   0.00  0.000000
2021-05-19 07:30:00   0.00  0.000000
2021-05-19 08:00:00   0.00  0.000000
2021-05-19 08:30:00   0.00  0.000000
2021-05-19 09:00:00  19.91  3.318333
2021-05-19 09:30:00   0.11  2.783611
2021-05-19 10:00:00   0.00  2.319676
2021-05-19 10:30:00  22.99  5.764730
2021-05-19 11:00:00   0.00  4.803942
Performance test on a sample dataframe with 90000 rows:
df = pd.concat([df] * 10000, ignore_index=True)
%%timeit
df['B'] = fast_sum(df['A'].fillna(0).to_numpy())
# 1.62 ms ± 93.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
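One caveat with np.zeros_like(a) is that the output buffer inherits the dtype of the input. Column A holds floats in the example, so the result is fine there, but an integer column would have its fractional parts truncated on assignment. The sketch below shows the effect in plain NumPy without numba. The name running_avg and the out_dtype parameter are hypothetical and not part of the answer above.

```python
import numpy as np

def running_avg(a, out_dtype=None):
    # Plain-Python version of the fast_sum loop; out_dtype=None keeps a's dtype
    b = np.zeros_like(a, dtype=out_dtype)
    b[0] = a[0]
    for i in range(1, len(a)):
        b[i] = (b[i - 1] * 5 + a[i]) / 6
    return b

ints = np.array([0, 6, 0])
print(running_avg(ints))               # integer buffer: fractions are truncated
print(running_avg(ints, np.float64))   # float buffer keeps them
```

Casting the input with astype(float) before calling the function is another way to avoid the truncation.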