Cumulative sum for positive numbers only
One option is
x1 <- inverse.rle(within.list(rle(x), values[!!values] <-
(cumsum(values))[!!values]))
x[x1!=0] <- ave(x[x1!=0], x1[x1!=0], FUN=seq_along)
x
#[1] 1 2 3 4 5 0 1 0 0 0 1 2
Or, as a one-liner:
x[x>0] <- with(rle(x), sequence(lengths[!!values]))
x
#[1] 1 2 3 4 5 0 1 0 0 0 1 2
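For readers more comfortable outside R, the same idea can be sketched in plain Python (a hypothetical helper, not a translation of the R internals): number the positive entries by their position within each run of positives, leaving the rest at 0.

```python
from itertools import groupby

def seq_within_positive_runs(x):
    # Split x into runs of positive / non-positive values, then
    # emit 1..n within each positive run and 0 elsewhere.
    out = []
    for is_pos, run in groupby(x, key=lambda v: v > 0):
        run = list(run)
        if is_pos:
            out.extend(range(1, len(run) + 1))
        else:
            out.extend([0] * len(run))
    return out

x = [3, 7, 2, 5, 9, 0, 4, 0, 0, 0, 6, 8]
print(seq_within_positive_runs(x))
# [1, 2, 3, 4, 5, 0, 1, 0, 0, 0, 1, 2]
```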
Cumulative sum on time series split by consecutive negative or positive values
Putting 0 in with the positives, you can use the shift-compare-cumsum pattern:
In [33]: sign = df["values"] >= 0
In [34]: df["vsum"] = df["values"].groupby((sign != sign.shift()).cumsum()).cumsum()
In [35]: df
Out[35]:
date values vsum
0 2017-05-01 1.00 1.00
1 2017-05-02 0.50 1.50
2 2017-05-03 -2.00 -2.00
3 2017-05-04 -1.00 -3.00
4 2017-05-05 -1.25 -4.25
5 2017-05-06 0.50 0.50
6 2017-05-07 0.50 1.00
which works because (sign != sign.shift()).cumsum() gives us a new number for each contiguous group:
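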
In [36]: sign != sign.shift()
Out[36]:
0 True
1 False
2 True
3 False
4 False
5 True
6 False
Name: values, dtype: bool
In [37]: (sign != sign.shift()).cumsum()
Out[37]:
0 1
1 1
2 2
3 2
4 2
5 3
6 3
Name: values, dtype: int64
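The session above can be reproduced as a self-contained script (the data is assumed from the example output):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2017-05-01", periods=7),
    "values": [1.0, 0.5, -2.0, -1.0, -1.25, 0.5, 0.5],
})

sign = df["values"] >= 0                    # True for non-negatives
group_id = (sign != sign.shift()).cumsum()  # new id at every sign flip
df["vsum"] = df["values"].groupby(group_id).cumsum()

print(df["vsum"].tolist())
# [1.0, 1.5, -2.0, -3.0, -4.25, 0.5, 1.0]
```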
Running total of positive and negative numbers where the sum cannot go below zero
Unfortunately, there is no way to do this without cycling through the records one-by-one. That, in turn, requires something like a recursive CTE.
with recursive t as (
select t.*, row_number() over (order by date) as seqnum
from mytable t
),
cte as (
select NULL as number, 0 as desired, 0 as seqnum
union all
select t.number,
(case when cte.desired + t.number < 0 then 0
else cte.desired + t.number
end),
cte.seqnum + 1
from cte join
t
on t.seqnum = cte.seqnum + 1
)
select cte.*
from cte
where cte.number is not null;
I would recommend this approach only if your data is rather small. But then again, if you have to do this, there are not many alternatives other than going through the table row-by-agonizing-row.
Here is a db<>fiddle (using Postgres).
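To make the row-by-row dependency concrete, here is the same computation the recursive CTE performs, sketched procedurally in Python (the input data is assumed for illustration):

```python
def clamped_running_total(numbers):
    # Running total that resets to 0 whenever it would dip below zero --
    # exactly the case expression in the recursive member of the CTE.
    total = 0
    out = []
    for n in numbers:
        total = max(0, total + n)
        out.append(total)
    return out

print(clamped_running_total([5, -3, -4, 6, -2]))
# [5, 2, 0, 6, 4]
```

Each output value depends on the previous one, which is why the SQL needs recursion rather than a plain window function.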
Cumulative sum that resets when turning negative/positive
The data
I'm going to change the data in the example you provided.
df = pl.DataFrame(
{
"a": [11, 10, 10, 10, 9, 8, 8, 8, 8, 8, 15, 15, 15],
"b": [11, 9, 9, 9, 9, 9, 10, 8, 8, 10, 11, 11, 15],
}
)
print(df)
shape: (13, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 11 ┆ 11 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 10 ┆ 9 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 10 ┆ 9 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 10 ┆ 9 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 9 ┆ 9 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 8 ┆ 9 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 8 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 8 ┆ 8 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 8 ┆ 8 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 8 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 15 ┆ 11 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 15 ┆ 11 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 15 ┆ 15 │
└─────┴─────┘
Notice the cases where the two columns are the same. Your post didn't address what to do in these cases, so I made some assumptions as to what should happen. (You can adapt the code to handle those cases differently.)
The algorithm
df = (
df
.with_column((pl.col("a") - pl.col("b")).sign().alias("sign_a_minus_b"))
.with_column(
pl.when(pl.col("sign_a_minus_b") == 0)
.then(None)
.otherwise(pl.col("sign_a_minus_b"))
.forward_fill()
.alias("run_type")
)
.with_column(
(pl.col("run_type") != pl.col("run_type").shift_and_fill(1, 0))
.cumsum()
.alias("run_id")
)
.with_column(pl.col("sign_a_minus_b").cumsum().over("run_id").alias("result"))
)
print(df)
shape: (13, 6)
┌─────┬─────┬────────────────┬──────────┬────────┬────────┐
│ a ┆ b ┆ sign_a_minus_b ┆ run_type ┆ run_id ┆ result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ u32 ┆ i64 │
╞═════╪═════╪════════════════╪══════════╪════════╪════════╡
│ 11 ┆ 11 ┆ 0 ┆ null ┆ 1 ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 10 ┆ 9 ┆ 1 ┆ 1 ┆ 2 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 10 ┆ 9 ┆ 1 ┆ 1 ┆ 2 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 10 ┆ 9 ┆ 1 ┆ 1 ┆ 2 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 9 ┆ 9 ┆ 0 ┆ 1 ┆ 2 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 9 ┆ -1 ┆ -1 ┆ 3 ┆ -1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 10 ┆ -1 ┆ -1 ┆ 3 ┆ -2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 8 ┆ 0 ┆ -1 ┆ 3 ┆ -2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 8 ┆ 0 ┆ -1 ┆ 3 ┆ -2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 10 ┆ -1 ┆ -1 ┆ 3 ┆ -3 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15 ┆ 11 ┆ 1 ┆ 1 ┆ 4 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15 ┆ 11 ┆ 1 ┆ 1 ┆ 4 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15 ┆ 15 ┆ 0 ┆ 1 ┆ 4 ┆ 2 │
└─────┴─────┴────────────────┴──────────┴────────┴────────┘
I've left the intermediate calculations in the output, merely to show how the algorithm works. (You can drop them.)
The basic idea is to calculate a run_id for each run of positive or negative values. We then use the cumsum function and the over windowing expression to create a running count of positives/negatives over each run_id.
Key assumption: ties in columns a and b do not interrupt a run, but they do not contribute to the total for that run of positive/negative values.
sign_a_minus_b does two things: it identifies whether a run is positive or negative, and whether there is a tie in columns a and b.
run_type extends any run to include cases where a tie occurs in columns a and b. The null value at the top of the column is intentional: it shows what happens when a tie occurs in the first row.
result is the output column. Note that tied rows do not interrupt a run, but they don't contribute to the totals for that run.
One final note: if ties in columns a and b are not allowed, then this algorithm can be simplified ... and will run faster.
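The run-id logic is independent of Polars; here is a plain-Python sketch of the same algorithm (not the Polars code itself), using the a/b data from above:

```python
a = [11, 10, 10, 10, 9, 8, 8, 8, 8, 8, 15, 15, 15]
b = [11, 9, 9, 9, 9, 9, 10, 8, 8, 10, 11, 11, 15]

# sign(a - b): +1, -1, or 0 for a tie
sign = [(x > y) - (x < y) for x, y in zip(a, b)]

# run_type: forward-fill ties (zeros) with the previous non-zero sign
run_type, last = [], None
for s in sign:
    last = s if s != 0 else last
    run_type.append(last)

# start a new run whenever run_type changes; accumulate sign within each run
result, run_sum, prev = [], 0, object()
for s, rt in zip(sign, run_type):
    if rt != prev:
        run_sum, prev = 0, rt
    run_sum += s
    result.append(run_sum)

print(result)
# [0, 1, 2, 3, 3, -1, -2, -2, -2, -3, 1, 2, 2]
```

This matches the result column in the table above, including the leading 0 for the tie in the first row.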
pandas calculate/show dataframe cumsum() only for positive values and other condition
Here is a solution (but there might be a more elegant one):
indexes = (df.col_2 == 'closed') & (df.col_values > 0)
df.loc[indexes, 'new_col'] = df.loc[indexes].groupby('col_1')['col_values'].cumsum()
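A runnable toy example of the masked groupby-cumsum (the column names come from the question; the data itself is assumed):

```python
import pandas as pd

df = pd.DataFrame({
    "col_1": ["a", "a", "a", "b", "b"],
    "col_2": ["closed", "open", "closed", "closed", "closed"],
    "col_values": [3, 5, -2, 4, 1],
})

# Only rows that are 'closed' AND positive participate in the cumsum;
# the rest are left as NaN in the new column.
indexes = (df.col_2 == "closed") & (df.col_values > 0)
df.loc[indexes, "new_col"] = df.loc[indexes].groupby("col_1")["col_values"].cumsum()

print(df["new_col"].tolist())
# [3.0, nan, nan, 4.0, 5.0]
```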
Conditional cumulative sum
Here's another way.
> r <- rle(sign(t$v2))
> diff(c(0,cumsum(t$v1)[cumsum(r$lengths)]))[r$values==1]
[1] 8 10 12 2
It's easier to understand if you split it up; it works by picking out the right elements of the cumulative sum and subtracting them.
> (s <- cumsum(t$v1))
[1] 1 3 4 8 14 21 29 31 34 38 46 47 49
> (r <- rle(sign(t$v2)))
Run Length Encoding
lengths: int [1:7] 4 2 2 1 2 1 1
values : num [1:7] 1 -1 1 -1 1 -1 1
> (k <- cumsum(r$lengths))
[1] 4 6 8 9 11 12 13
> (a <- c(0,s[k]))
[1]  0  8 21 31 34 46 47 49
> (d <- diff(a))
[1] 8 13 10 3 12 1 2
> d[r$values==1]
[1] 8 10 12 2
Similarly, but without rle:
> k <- which(diff(c(sign(t$v2),0))!=0)
> diff(c(0,cumsum(t$v1)[k]))[t$v2[k]>0]
[1] 8 10 12 2
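The same pick-out-and-diff trick can be sketched in Python; here the v1 values are reconstructed from the cumulative sums shown above, and v2 is represented only by its sign (an assumption consistent with the run lengths 4, 2, 2, 1, 2, 1, 1):

```python
from itertools import groupby

v1 = [1, 2, 1, 4, 6, 7, 8, 2, 3, 4, 8, 1, 2]
v2_sign = [1, 1, 1, 1, -1, -1, 1, 1, -1, 1, 1, -1, 1]

# Sum v1 within each run of v2's sign, then keep the sums for positive runs.
sums = []
i = 0
for s, run in groupby(v2_sign):
    n = len(list(run))
    sums.append((s, sum(v1[i:i + n])))
    i += n

print([total for s, total in sums if s == 1])
# [8, 10, 12, 2]
```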
Cumulative sum of certain numbers in a dataframe ordered by date
You can use this.
library(data.table)
df <- as.data.table(df)
# Order by date
df <- df[order(date)]
# Perform the cumsum for positives and negatives separately
df[, expected := cumsum(values), by = sign(values)]
# Just for the negatives, get the previous positive value
df[, expected := ifelse(values > 0, expected, c(0, expected[-.N]))]
print(df)
date values expected
1: 2016-12-05 5 5
2: 2016-12-07 -10 5
3: 2016-12-08 10 15
4: 2017-01-05 5 20
5: 2017-01-10 -7 20
6: 2017-01-11 8 28
7: 2017-01-11 8 36
Note that if there is more than one consecutive negative value, you have to repeat the operation. For instance, if your data frame is this one:
df <- data.frame(date = as.Date(c("2016-12-08", "2016-12-07", "2016-12-05", "2017-01-05","2017-01-10", "2017-01-10", "2017-01-11", "2017-01-11")),
values = c(10, -10, 5, 5, -7, -15, 8, 8))
One single execution of the above code would produce the following output:
date values expected
1: 2016-12-05 5 5
2: 2016-12-07 -10 5
3: 2016-12-08 10 15
4: 2017-01-05 5 20
5: 2017-01-10 -7 20
6: 2017-01-10 -15 -17
7: 2017-01-11 8 28
8: 2017-01-11 8 36
The value -17 would be wrong. In order to avoid this problem, you can repeat the process until there aren't any negative values left. So the full code would be:
df <- df[order(date)]
df[, expected := cumsum(values), by = sign(values)]
# If there are negative values, repeat the process
while(length(which(df$expected < 0))){
df[, expected := ifelse(values > 0, expected, c(0, expected[-.N]))]
}
print(df)
date values expected
1: 2016-12-05 5 5
2: 2016-12-07 -10 5
3: 2016-12-08 10 15
4: 2017-01-05 5 20
5: 2017-01-10 -7 20
6: 2017-01-10 -15 20
7: 2017-01-11 8 28
8: 2017-01-11 8 36
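As a cross-check, the final result is equivalent to a single pass where the running total only accumulates positive values and negative rows simply show the total so far (sketched in Python with the data from the second example):

```python
values = [5, -10, 10, 5, -7, -15, 8, 8]  # already ordered by date

expected, total = [], 0
for v in values:
    if v > 0:
        total += v          # only positives contribute to the running total
    expected.append(total)  # negatives display the current total unchanged

print(expected)
# [5, 5, 15, 20, 20, 20, 28, 36]
```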
Perform cumulative sum over a column but reset to 0 if sum becomes negative in Pandas
A slightly modified approach with np.frompyfunc (note that this method is slower than the numba solution below):
import numpy as np

sumlm = np.frompyfunc(lambda a, b: 0 if a + b < 0 else a + b, 2, 1)
newx = sumlm.accumulate(df.Value.values, dtype=object)
newx
Out[147]: array([7, 9, 3, 0, 8, 8], dtype=object)
The numba solution:
from numba import njit
@njit
def cumli(x, lim):
total = 0
result = []
for i, y in enumerate(x):
total += y
if total < lim:
total = 0
result.append(total)
return result
cumli(df.Value.values,0)
Out[166]: [7, 9, 3, 0, 8, 8]