Restart Cumsum and Get Index If Cumsum More Than Value

Here's one with numba and array-initialization -

import numpy as np
from numba import njit

@njit
def cumsum_breach_numba2(x, target, result):
    # Accumulate; on each breach of `target`, record the index,
    # advance the write position and reset the running total
    total = 0
    iterID = 0
    for i, x_i in enumerate(x):
        total += x_i
        if total >= target:
            result[iterID] = i
            iterID += 1
            total = 0
    return iterID

def cumsum_breach_array_init(x, target):
    x = np.asarray(x)
    result = np.empty(len(x), dtype=np.uint64)
    idx = cumsum_breach_numba2(x, target, result)
    return result[:idx]
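A quick sanity check on a small, made-up input (illustrative values only):

# running totals: 4, 7 (>= 5, reset), 2, 8 (>= 5, reset), 1
cumsum_breach_array_init([4, 3, 2, 6, 1], 5)
# array([1, 3], dtype=uint64)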

Timings

Including @piRSquared's solutions and using the benchmarking setup from the same post -

In [58]: np.random.seed([3, 1415])
    ...: x = np.random.randint(100, size=1000000).tolist()

# @piRSquared soln1
In [59]: %timeit list(cumsum_breach(x, 10))
10 loops, best of 3: 73.2 ms per loop

# @piRSquared soln2
In [60]: %timeit cumsum_breach_numba(np.asarray(x), 10)
10 loops, best of 3: 69.2 ms per loop

# From this post
In [61]: %timeit cumsum_breach_array_init(x, 10)
10 loops, best of 3: 39.1 ms per loop

Numba: Appending vs. array-initialization

For a closer look at how array-initialization helps, which seems to be the big difference between the two numba implementations, let's time them on the array data directly, since creating the array was itself heavy on runtime and both implementations depend on it -

In [62]: x = np.array(x)

In [63]: %timeit cumsum_breach_numba(x, 10)  # with appending
10 loops, best of 3: 31.5 ms per loop

In [64]: %timeit cumsum_breach_array_init(x, 10)
1000 loops, best of 3: 1.8 ms per loop

To force the output to have its own memory space, we can make a copy. It won't change things in a big way, though -

In [65]: %timeit cumsum_breach_array_init(x, 10).copy()
100 loops, best of 3: 2.67 ms per loop
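For reference, @piRSquared's appending-based cumsum_breach_numba is not reproduced in this post; a plausible reconstruction (an assumption, not the original code) would be:

from numba import njit

@njit
def cumsum_breach_numba(x, target):
    # same logic, but grows a list instead of filling a preallocated array
    total = 0
    result = []
    for i, x_i in enumerate(x):
        total += x_i
        if total >= target:
            result.append(i)
            total = 0
    return result

The list grows dynamically inside the compiled function, which is exactly the overhead the array-initialization version avoids.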

Resetting Cumulative Sum once a value is reached and setting a flag to 1

"Ordinary" cumsum() is here useless, as this function "doesn't know"
where to restart summation.

You can do it with the following custom function:

def myCumSum(x, thr):
    # reset the running total once it has reached the threshold
    if myCumSum.prev >= thr:
        myCumSum.prev = 0
    myCumSum.prev += x
    return myCumSum.prev

This function is "with memory" (from the previous call) - prev, so there
is a way to "know" where to restart.
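If you prefer not to rely on function attributes, the same per-call memory can be kept in a closure; a minimal sketch of that variant (not from the original answer):

def make_cum_sum(thr):
    state = {'prev': 0}
    def cum_sum(x):
        # reset once the previous total has reached the threshold
        if state['prev'] >= thr:
            state['prev'] = 0
        state['prev'] += x
        return state['prev']
    return cum_sum

Building a fresh cum_sum for each run also removes the need to reset myCumSum.prev by hand.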

To speed up the execution, define a vectorized version of this function:

myCumSumV = np.vectorize(myCumSum, otypes=[int], excluded=['thr'])

Then execute:

threshold = 40
myCumSum.prev = 0  # Set the "previous" value
# Replace "a" column with your cumulative sum
df.a = myCumSumV(df.a.values, thr=threshold)  # pass thr by keyword so `excluded` applies
df['flag'] = df.a.ge(threshold).astype(int)  # Compute "flag" column

The result is:

     a  b  flag
0    5  1     0
1   11  1     0
2   41  1     1
3  170  0     1
4    5  1     0
5   15  1     0

Cumsum with restarts

Not sure how this could be vectorized, if it even can be, since taking the cumulative sum would propagate the remainders each time the threshold is surpassed. So this is probably a good case for numba, which compiles the code down to machine code and allows a loopy but performant approach:

import numpy as np
from numba import njit, int32

@njit('int32[:](int32[:], uintc)')
def windowed_cumsum(a, thr):
    indices = np.zeros(len(a), int32)
    window = 0
    ix = 0
    for i in range(len(a)):
        window += a[i]
        if window >= thr:
            indices[ix] = i
            ix += 1
            window = 0
    return indices[:ix]

The explicit signature implies ahead-of-time compilation, though it enforces specific dtypes on the input array. The inferred dtype for the example array is int32; if that might not always be the case, or if you want a more flexible solution, you can omit the dtypes from the signature, which simply means the function will be compiled on its first execution.
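For illustration, a sketch of the lazily compiled variant (the name windowed_cumsum_lazy is made up here); it is identical except the signature is dropped, so dtypes are inferred on the first call:

@njit
def windowed_cumsum_lazy(a, thr):
    # same logic as above; compiled on first execution with the incoming dtypes
    indices = np.zeros(len(a), np.int64)
    window = 0
    ix = 0
    for i in range(len(a)):
        window += a[i]
        if window >= thr:
            indices[ix] = i
            ix += 1
            window = 0
    return indices[:ix]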

input_data = np.array([4000, 5000, 6000, 2000, 8000, 3000], dtype=np.int32)  # int32 to match the explicit signature

windowed_cumsum(input_data, 10000)
# array([2, 4])

Also, @jdehesa raises an interesting point: for arrays that are very long compared to the number of bins, a better option might be to simply append the indices to a list. So here is an alternative approach using lists (also in no-python mode), along with timings under different scenarios:

from numba import njit

@njit
def windowed_cumsum_list(a, thr):
    indices = []
    window = 0
    for i in range(len(a)):
        window += a[i]
        if window >= thr:
            indices.append(i)
            window = 0
    return indices

a = np.random.randint(0,10,10_000)

%timeit windowed_cumsum(a, 20)
# 16.1 µs ± 232 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit windowed_cumsum_list(a, 20)
# 65.5 µs ± 623 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit windowed_cumsum(a, 2000)
# 7.38 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit windowed_cumsum_list(a, 2000)
# 7.1 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

So it seems that under most scenarios the preallocated-array version will be the faster option: even in the second case, with a length-10000 array reduced to roughly 20 bin indices, both perform similarly. For memory-efficiency reasons, though, the list-based version might be more convenient in some cases.

Reset cumulative sum based on condition in pandas and return another cumulative sum

I would first go through the 2 columns once for their cumulative sums.

cum_amount = df['amount'].cumsum()
cum_duration = df['duration'].cumsum()

Get a list ready for the results

results = []

Then loop through each index (equivalent to a counter)

for idx in cum_duration.index:
    # keep only rows within 5; the max index is where the required numbers are located
    wanted_idx = cum_duration[cum_duration < 5].index.max()

    # read those numbers with the wanted index
    results.append({'idx': idx,
                    'cum_duration': cum_duration[wanted_idx],
                    'cum_amount': cum_amount[wanted_idx]})

    # subtract the lag (we need only the leads, not the lags)
    cum_amount -= cum_amount[idx]
    cum_duration -= cum_duration[idx]

Finally, collect the results in a DataFrame.

pd.DataFrame(results)

    idx  cum_duration  cum_amount
0     0          2.29      4834.0
1     1          2.21      3599.0
2     2          1.85      2429.0
3     3          4.80      2316.0
4     4          3.99      1109.0
5     5          1.20      1261.0
6     6          4.24      1068.0
7     7          3.07      1098.0
8     8          2.08      1215.0
9     9          4.09      1043.0
10   10          2.95      1176.0
11   11          3.96      1038.0
12   12          3.95      1119.0
13   13          3.92      1074.0
14   14          3.91      1076.0
15   15          1.50      1224.0
16   16          3.65       962.0
17   17          3.85      1039.0
18   18          3.82      1062.0
19   19          3.34       917.0

How to group based on cumulative sum that resets on a condition

We can do this with a self-defined function:

def dymcumsum(v, limit):
    idx = []
    sums = 0
    for i in range(len(v)):
        sums += v[i]
        if sums >= limit:
            idx.append(i)
            sums = 0
    return idx

df['New'] = np.nan
df.loc[dymcumsum(df.WORD_COUNT, 20), 'New'] = 1
df.New = df.New.iloc[::-1].eq(1).cumsum()[::-1].factorize()[0] + 1

df
   ARTICLE  WORD_COUNT  New
0        0           6    1
1        1          10    1
2        3           5    1
3        4           7    2
4        5          26    2
5        6           7    3
6        9           4    3
7       10         133    3
8       11          42    4
9       12           1    5
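The group-labelling one-liner is dense; decomposed into steps, the same logic reads roughly like this (a sketch, not part of the original answer):

marks = df.New.iloc[::-1].eq(1)   # True at the rows flagged 1, scanned bottom-up
keys = marks.cumsum()[::-1]       # running count of marks, one key per group
df.New = keys.factorize()[0] + 1  # relabel the keys as consecutive group numbers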

Reset Pandas Cumsum for every multiple of 1000

IIUC, this simply is an integer division problem with some tricks:

thresh = 1000
df['cumsum'] = df['Production'].cumsum()

# how many times cumsum has passed thresh
multiple = df['cumsum'] // thresh

# detect where thresh is passed
mask = multiple.diff().ne(0)

# update the number generated:
df['numberGenerated'] = np.where(mask, multiple, 0)

# then the adjusted cumsum
df['adjCumsum'] = (df['numberGenerated'].mul(thresh)) + df['cumsum'] % thresh
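For reference, the question does not show the input frame, but one matching the output below can be reconstructed like this (a sketch):

import pandas as pd

df = pd.DataFrame(
    {'Production': [1054, 0, 0, 3054, 0, 500],
     'ID': [1323217] * 5 + [1323218]},
    index=pd.to_datetime(['2017-10-19', '2017-10-20', '2017-10-21',
                          '2017-10-22', '2017-10-23', '2017-10-23']))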

Output:

            Production       ID  cumsum  adjCumsum  numberGenerated
2017-10-19        1054  1323217    1054       1054                1
2017-10-20           0  1323217    1054         54                0
2017-10-21           0  1323217    1054         54                0
2017-10-22        3054  1323217    4108       4108                4
2017-10-23           0  1323217    4108        108                0
2017-10-23         500  1323218    4608        608                0

Perform cumulative sum over a column but reset to 0 if sum becomes negative in Pandas

A slight modification of the frompyfunc approach; note that this method is slower than the numba solution below:

sumlm = np.frompyfunc(lambda a, b: 0 if a + b < 0 else a + b, 2, 1)
newx = sumlm.accumulate(df.Value.values, dtype=object)
newx
Out[147]: array([7, 9, 3, 0, 8, 8], dtype=object)

numba solution

from numba import njit

@njit
def cumli(x, lim):
    total = 0
    result = []
    for y in x:
        total += y
        if total < lim:
            total = 0
        result.append(total)
    return result

cumli(df.Value.values, 0)
Out[166]: [7, 9, 3, 0, 8, 8]

Cumulative Sum With Reset Condition

I'm not familiar with pandas but my understanding is that it is based on numpy. Using numpy you can define custom functions that can be used with accumulate.

Here is one that I think is close to what you're looking for:

import numpy as np

def capsum(array, cap):
    # add b onto the running total a, or restart from b once a has reached the cap
    capAdd = np.frompyfunc(lambda a, b: a + b if a < cap else b, 2, 1)
    return capAdd.accumulate(array, dtype=object)

values = np.random.rand(1000000) * 3 // 1

result = capsum(values, 5)  # --> produces the result in 0.17 sec.

I believe (or I hope) you can use numpy functions on dataframes.
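For instance, applied to a DataFrame column it might look like this (a sketch; df and the column name 'Value' are made up):

import pandas as pd

df = pd.DataFrame({'Value': [2, 2, 2, 1, 3, 2]})
# accumulate over the raw numpy values, then assign back as a new column
df['capped'] = capsum(df['Value'].to_numpy(), 5)
# df['capped'] -> [2, 4, 6, 1, 4, 6]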


