How to Invoke Pandas.Rolling.Apply with Parameters from Multiple Columns

Pandas rolling apply using multiple columns

How about this:

def masscenter(ser):
    # ser is one window of df.price; its index selects the matching rows of df
    print(df.loc[ser.index])
    return 0

rol = df.price.rolling(window=2)
rol.apply(masscenter, raw=False)

It uses the rolling logic to get subsets from an arbitrary column. The raw=False option provides you with index values for those subsets (which are given to you as Series), then you use those index values to get multi-column slices from your original DataFrame.
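To make the trick concrete, here is a minimal sketch that returns a value instead of printing. The frame, its `volume` column and the volume-weighted formula are made up for illustration; the original question's data is not shown.

```python
import pandas as pd

# Hypothetical data: the frame from the original question is not shown
df = pd.DataFrame({"price": [10.0, 12.0, 11.0, 13.0],
                   "volume": [1.0, 3.0, 2.0, 2.0]})

def masscenter(ser):
    # ser is one window of df.price (a Series, because raw=False);
    # its index picks the matching rows of the full frame
    window = df.loc[ser.index]
    return (window["price"] * window["volume"]).sum() / window["volume"].sum()

df["vwap"] = df["price"].rolling(window=2).apply(masscenter, raw=False)
print(df)
```

The function still has to return a single scalar per window, and it reads `df` from the enclosing scope, so it only works while that name refers to the frame being rolled over.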

Using custom function for Pandas Rolling Apply that depends on colname

I wouldn't be surprised if there's a "better" solution, but I think this could at least be a "good start" (I don't do a whole lot with .rolling(...)).

With this solution, I make two critical assumptions:

  1. All denom_<X> have a corresponding <X> column.
  2. Everything you do with the (<X>, denom_<X>) pairs is the same. (This should be straightforward to customize as needed.)

With that said, I do the .rolling within the function, rather than outside, in part because it seems like .apply(...) on a RollingGroupBy can only work column-wise, which isn't too helpful here (imo).

from typing import Tuple

import pandas as pd

def cust_fn(df: pd.DataFrame, rolling_args: Tuple) -> pd.DataFrame:
    cols = df.columns
    denom_cols = ["id"]  # the whole dataframe is passed, so place identifiers / uncomputable variables here.

    # (<X>, denom_<X>) pairs: rolling sum of <X> divided by the group's largest denom_<X>
    for denom_col in cols[cols.str.startswith("denom_")]:
        denom_cols += [denom_col, denom_col.replace("denom_", "")]
        col = denom_cols[-1]  # sugar
        df[f"calc_{col}"] = df[col].rolling(*rolling_args).sum() / df[denom_col].max()

    # every remaining column: plain rolling mean
    for col in cols[~cols.isin(denom_cols)]:
        print(col, df[col])
        df[f"calc_{col}"] = df[col].rolling(*rolling_args).mean()

    return df

Then the way you'd go about running this is the following (and you get the corresponding output):

>>> df.groupby("id").apply(cust_fn, rolling_args=(2, 1))
   id  a  b  c  denom_a  denom_b    calc_a    calc_b  calc_c
0  a0  4  3  7        7       10  0.444444  0.250000     7.0
1  a0  5  4  4        8       11  1.000000  0.583333     5.5
2  a0  6  5  3        9       12  1.222222  0.750000     3.5
3  a1  1  3  8        7       10  0.111111  0.250000     8.0
4  a1  2  2  9        8       11  0.333333  0.416667     8.5
5  a1  3  4  7        9       12  0.555556  0.500000     8.0
6  a2  7  1  4        7       10  0.875000  0.090909     4.0
7  a2  9  3  6        8       11  2.000000  0.363636     5.0
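For a single (<X>, denom_<X>) pair, the same numbers can also be reached without a custom function. The sketch below is an alternative technique, not the answer's method: it rolls within each group and divides by the per-group maximum via `transform`. The input values are read back from the `id`, `a` and `denom_a` columns of the output above, since the original input frame is not shown.

```python
import pandas as pd

# Input values taken from the id / a / denom_a columns of the output above
df = pd.DataFrame({
    "id": ["a0"] * 3 + ["a1"] * 3 + ["a2"] * 2,
    "a": [4, 5, 6, 1, 2, 3, 7, 9],
    "denom_a": [7, 8, 9, 7, 8, 9, 7, 8],
})

# Rolling sum within each id; drop the group level so the original row labels remain
rolled = (
    df.groupby("id")["a"]
    .rolling(2, min_periods=1)
    .sum()
    .reset_index(level=0, drop=True)
)

# Largest denominator of each group, broadcast back to every row of that group
group_max = df.groupby("id")["denom_a"].transform("max")

df["calc_a"] = rolled / group_max
print(df)
```

This avoids passing whole sub-frames through `groupby(...).apply(...)`, at the cost of writing one expression per column pair.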

If you need to state dynamically which non-numeric/uncomputable columns exist, then it might make sense to define cust_fn as follows:

def cust_fn(df: pd.DataFrame, rolling_args: Tuple, index_cols: List = []) -> pd.DataFrame:
    cols = df.columns
    # Copy, so the += further down never mutates the caller's list or the shared default
    denom_cols = list(index_cols)

    # ... the rest is unchanged

Then you would adapt your calling of cust_fn as follows:

>>> df.groupby("id").apply(cust_fn, rolling_args=(2, 1), index_cols=["id"])

Of course, comment on this if you run into issues adapting it to your uses.

Pandas apply, rolling, groupby with multiple input & multiple output columns

Important notes

  1. The combination of apply & rolling in pandas has a very strict output requirement: you have to return one single value. You cannot return a pd.Series, a list, an array, or even an array hidden inside an array; just one value, e.g. one integer. This requirement makes it hard to get a working solution when trying to return multiple outputs for multiple columns. I don’t understand why it has this requirement for 'apply & rolling', because without rolling 'apply' doesn’t have this requirement. Must be due to some internal pandas functions.
  2. The combination of 'apply & rolling' combined with multiple input columns simply does not work! Imagine a dataframe with 2 columns, 6 rows and you want to apply a custom function with a rolling window of 2. Your function should get an input array with 2x2 values - 2 values of each column for 2 rows. But it seems pandas can’t handle rolling and multiple input columns at the same time. I tried to use the axis parameter to get it working but:

    • Axis = 0, will call your function per column. In the dataframe described above, it will call your function 10 times (not 12 because rolling=2) and since it’s per column, it only provides the 2 rolling values of that column…
    • Axis = 1, will call your function per row. This is what you probably want, but pandas will not provide a 2x2 input. It actually completely ignores the rolling and only provides one row with values of 2 columns...
  3. When using 'apply' with multiple input columns, you can provide a parameter called raw (boolean). It’s False by default, which means the input will be a pd.Series and thus includes indexes next to the values. If you don’t need the indexes you can set raw to True to get a Numpy array, which often achieves a much better performance.
  4. When combining 'rolling & groupby', the result is a multi-indexed Series, which can’t easily serve as input for a new column. The easiest solution is to append a reset_index(drop=True), as answered & commented here (Python - rolling functions for GroupBy object).
  5. You might ask me, when would you ever want to use a rolling, groupby custom function with multiple outputs!? Answer: I recently had to do a Fourier transform with sliding windows (rolling) over a dataset of 5 million records (speed/performance is important) with different batches within the dataset (groupby). And I needed to save both the power & phase of the Fourier transform in different columns (multiple outputs). Most people probably only need some of the basic examples below, but I believe that especially in the Machine Learning/Data-science sectors the more complex examples can be useful.
  6. Please let me know if you have even better, clearer or faster ways to perform any of the solutions below. I'll update my answer and we can all benefit!
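Notes 2 and 3 can be checked with a small probe. This is only a sketch: the frame and the `probe` helper are made up for illustration, and the function simply records what pandas hands to it.

```python
import pandas as pd

# Hypothetical 2-column, 6-row frame, as described in note 2
frame = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [6, 5, 4, 3, 2, 1]})

seen = []  # (type name, length) of every window passed to the function

def probe(x):
    seen.append((type(x).__name__, len(x)))
    return 0.0  # rolling.apply needs a single scalar back

frame.rolling(2).apply(probe, raw=True)
raw_calls = list(seen)

seen.clear()
frame.rolling(2).apply(probe, raw=False)
series_calls = list(seen)

print(len(raw_calls), set(raw_calls))
print(len(series_calls), set(series_calls))
```

Every recorded window has length 2 and belongs to a single column, which is why the multi-column examples further down need a workaround.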


Code examples

Let’s create a dataframe first that will be used in all the examples below, including a group-column for the groupby examples.
For the rolling window and multiple input/output columns I just use 2 in all code examples below, but obviously this could be any number > 1.

df = pd.DataFrame(np.random.randint(0,5,size=(6, 2)), columns=list('ab'))
df['group'] = [0, 0, 0, 1, 1, 1]
df = df[['group', 'a', 'b']]

It will look like this:

   group  a  b
0      0  2  2
1      0  4  1
2      0  0  4
3      1  0  2
4      1  3  2
5      1  3  0


Input 1 column, output 1 column

Basic

def func_i1_o1(x):
    return x + 1

df['c'] = df['b'].apply(func_i1_o1)


Rolling

def func_i1_o1_rolling(x):
    # x is a NumPy array holding the 2 values of the current window (raw=True)
    return x[0] + x[1]

df['d'] = df['c'].rolling(2).apply(func_i1_o1_rolling, raw=True)


Rolling & Groupby

Add the reset_index solution (see notes above) to the rolling function.

df['e'] = df.groupby('group')['c'].rolling(2).apply(func_i1_o1_rolling, raw=True).reset_index(drop=True)
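One caveat worth sketching: `reset_index(drop=True)` throws away the row labels, so the assignment only lines up when the rows are already sorted by group and carry a default 0..n-1 index, as in the example frame. The hypothetical frame below interleaves the groups to show the difference; dropping only the group level is one way to keep the original row labels.

```python
import pandas as pd

# Hypothetical frame with interleaved groups
demo = pd.DataFrame({"group": [0, 1, 0, 1, 0, 1], "c": [1, 2, 3, 4, 5, 6]})

# Result has a two-level index: (group, original row label), ordered by group
rolled = demo.groupby("group")["c"].rolling(2).sum()

# Labels discarded: values land on rows 0..5 in group order, which is wrong here
demo["positional"] = rolled.reset_index(drop=True)

# Only the group level dropped: values align on the original row labels
demo["aligned"] = rolled.reset_index(level=0, drop=True)

print(demo)
```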


Input 2 columns, output 1 column

Basic

def func_i2_o1(x):
    # x holds the values of one row across columns b and c
    return np.sum(x)

df['f'] = df[['b', 'c']].apply(func_i2_o1, axis=1, raw=True)


Rolling

As explained in point 2 in the notes above, there isn't a 'normal' solution for 2 inputs. The workaround below uses the 'raw=False' to ensure the input is a pd.Series, which means we also get the indexes next to the values. This enables us to get values from other columns at the correct indexes to be used.

def func_i2_o1_rolling(x):
    values_b = x
    # look up column c at the same row labels as the current window of b
    values_c = df.loc[x.index, 'c'].to_numpy()
    return np.sum(values_b) + np.sum(values_c)

df['g'] = df['b'].rolling(2).apply(func_i2_o1_rolling, raw=False)
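If looking values up through the index feels indirect, another option is to loop over the windows of a multi-column selection. This is a sketch that assumes a reasonably recent pandas (rolling objects became iterable in 1.1, as far as I know). Each window arrives as a small DataFrame, so both columns are visible at once. Iteration also yields the shorter windows at the start, so they have to be skipped by hand. The frame is a fixed stand-in for the random one above.

```python
import numpy as np
import pandas as pd

# Fixed stand-in data; c = b + 1 as in the examples above
demo = pd.DataFrame({"b": [2, 1, 4, 2, 2, 0]})
demo["c"] = demo["b"] + 1

window_size = 2
results = []
for window in demo[["b", "c"]].rolling(window_size):
    if len(window) < window_size:
        # incomplete window at the start of the frame
        results.append(np.nan)
    else:
        # window is a (window_size x 2) DataFrame holding both columns
        results.append(window.to_numpy().sum())

demo["g"] = results
print(demo)
```

This is a plain Python loop, so it is unlikely to be fast on large frames; it trades speed for a function body that sees the whole 2x2 block.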


Rolling & Groupby

Add the reset_index solution (see notes above) to the rolling function.

df['h'] = df.groupby('group')['b'].rolling(2).apply(func_i2_o1_rolling, raw=False).reset_index(drop=True)


Input 1 column, output 2 columns

Basic

You could use a 'normal' solution by returning pd.Series:

def func_i1_o2(x):
    return pd.Series((x + 1, x + 2))

df[['i', 'j']] = df['b'].apply(func_i1_o2)

Or you could use the zip/tuple combination which is about 8 times faster!

def func_i1_o2_fast(x):
    return x + 1, x + 2

df['k'], df['l'] = zip(*df['b'].apply(func_i1_o2_fast))


Rolling

As explained in point 1 in the notes above, we need a workaround if we want to return more than 1 value when using rolling & apply combined. I found 2 working solutions.

1

def func_i1_o2_rolling_solution1(x):
    output_1 = np.max(x)
    output_2 = np.min(x)
    # Last index is where to place the final values: x.index[-1]
    # .loc rather than .at, because .at is meant for a single cell and two columns are set here
    df.loc[x.index[-1], ['m', 'n']] = output_1, output_2
    return 0

df['m'], df['n'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i1_o2_rolling_solution1, raw=False)

Pros: Everything is done within 1 function.

Cons: You have to create the columns first and it is slower since it doesn't use the raw input.

2

rolling_w = 2
nan_prefix = (rolling_w - 1) * [np.nan]
output_list_1 = nan_prefix.copy()
output_list_2 = nan_prefix.copy()

def func_i1_o2_rolling_solution2(x):
    # collect the real outputs in the lists; the returned 0 is discarded
    output_list_1.append(np.max(x))
    output_list_2.append(np.min(x))
    return 0

df['b'].rolling(rolling_w).apply(func_i1_o2_rolling_solution2, raw=True)
df['o'] = output_list_1
df['p'] = output_list_2

Pros: It uses the raw input which makes it about twice as fast. And since it doesn't use indexes to set the output values the code looks a bit more clear (to me at least).

Cons: You have to create the nan-prefix yourself and it takes a bit more lines of code.
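A third option, sketched here under the assumption that NumPy 1.20 or newer is available, skips `apply` altogether: `sliding_window_view` exposes every window as a row of a 2-D array, so several outputs can be computed with ordinary vectorised reductions. The frame is a fixed stand-in for the random one used above.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Fixed stand-in data
demo = pd.DataFrame({"b": [2, 1, 4, 2, 2, 0]})
rolling_w = 2

# One row per complete window: shape (len(demo) - rolling_w + 1, rolling_w)
windows = sliding_window_view(demo["b"].to_numpy(), rolling_w)

# The first rolling_w - 1 rows have no complete window
nan_prefix = np.full(rolling_w - 1, np.nan)

demo["o"] = np.concatenate([nan_prefix, windows.max(axis=1)])
demo["p"] = np.concatenate([nan_prefix, windows.min(axis=1)])
print(demo)
```

The NaN prefix is still built by hand, but no Python function is called per window.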

Rolling & Groupby

Normally, I would use the faster 2nd solution above. However, since we're combining groups and rolling, you'd have to manually set NaNs/zeros (depending on the number of groups) at the right indexes somewhere in the middle of the dataset. To me it seems that when combining rolling, groupby and multiple output columns, the first solution is easier because it handles the NaNs and the grouping automatically. Once again, I use the reset_index solution at the end.

def func_i1_o2_rolling_groupby(x):
    output_1 = np.max(x)
    output_2 = np.min(x)
    # Last index is where to place the final values: x.index[-1]
    # .loc rather than .at, because .at is meant for a single cell and two columns are set here
    df.loc[x.index[-1], ['q', 'r']] = output_1, output_2
    return 0

df['q'], df['r'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i1_o2_rolling_groupby, raw=False).reset_index(drop=True)
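As an alternative sketch that writes no side effects from inside `apply`, each group can be processed with a small helper that returns all outputs at once. The helper name `window_max_min`, the fixed data, and the use of NumPy's `sliding_window_view` (NumPy 1.20 or newer) are assumptions for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Fixed stand-in data with two groups
demo = pd.DataFrame({"group": [0, 0, 0, 1, 1, 1], "b": [2, 1, 4, 2, 2, 0]})
rolling_w = 2

def window_max_min(values):
    # Hypothetical helper: one output row per input value, NaN where no full window exists
    out = np.full((len(values), 2), np.nan)
    if len(values) >= rolling_w:
        windows = sliding_window_view(values, rolling_w)
        out[rolling_w - 1:, 0] = windows.max(axis=1)
        out[rolling_w - 1:, 1] = windows.min(axis=1)
    return out

demo["q"] = np.nan
demo["r"] = np.nan
for _, part in demo.groupby("group"):
    # part.index holds the row labels of this group in the full frame
    demo.loc[part.index, ["q", "r"]] = window_max_min(part["b"].to_numpy())

print(demo)
```

Because every group is handled separately, the NaNs at the start of each group fall into place without manual bookkeeping.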


Input 2 columns, output 2 columns

Basic

I suggest using the same 'fast' way as for i1_o2 with the only difference that you get 2 input values to use.

def func_i2_o2(x):
    return np.mean(x), np.median(x)

df['s'], df['t'] = zip(*df[['b', 'c']].apply(func_i2_o2, axis=1))


Rolling

As I use a workaround for applying rolling with multiple inputs and I use another workaround for rolling with multiple outputs, you can guess I need to combine them for this one.

1. Get values from other columns using indexes (see func_i2_o1_rolling)

2. Set the final multiple outputs on the correct index (see func_i1_o2_rolling_solution1)

def func_i2_o2_rolling(x):
    values_b = x.to_numpy()
    values_c = df.loc[x.index, 'c'].to_numpy()
    output_1 = np.min([np.sum(values_b), np.sum(values_c)])
    output_2 = np.max([np.sum(values_b), np.sum(values_c)])
    # Last index is where to place the final values: x.index[-1]
    # .loc rather than .at, because .at is meant for a single cell and two columns are set here
    df.loc[x.index[-1], ['u', 'v']] = output_1, output_2
    return 0

df['u'], df['v'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i2_o2_rolling, raw=False)


Rolling & Groupby

Add the reset_index solution (see notes above) to the rolling function.

def func_i2_o2_rolling_groupby(x):
    values_b = x.to_numpy()
    values_c = df.loc[x.index, 'c'].to_numpy()
    output_1 = np.min([np.sum(values_b), np.sum(values_c)])
    output_2 = np.max([np.sum(values_b), np.sum(values_c)])
    # Last index is where to place the final values: x.index[-1]
    # .loc rather than .at, because .at is meant for a single cell and two columns are set here
    df.loc[x.index[-1], ['w', 'x']] = output_1, output_2
    return 0

df['w'], df['x'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i2_o2_rolling_groupby, raw=False).reset_index(drop=True)

rolling.apply on custom function that requires multiple columns of dataframe to reduce single column

It appears that this feature is currently not available. There is an open issue on this topic in the pandas GitHub repository. Please check: https://github.com/pandas-dev/pandas/issues/15095.

using a custom function with arguments on Pandas rolling

According to the documentation, Series.rolling().apply() does not accept **kwargs directly. Instead, there's a kwargs option, which takes a dictionary:

series.rolling(10).apply(testfunc, raw=False, kwargs={'mult':2})

Output:

0           NaN
1           NaN
2           NaN
3           NaN
4           NaN
        ...
95    11.782115
96    10.999794
97     9.678652
98     9.669550
99    10.093348
Name: test, Length: 100, dtype: float64

