Group vector on conditional sum
Here is my Rcpp solution (close to Khashaa's solution but a bit shorter/stripped down). Because you said speed was important, Rcpp is probably the way to go:
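# create the data
# (the x used here comes from R < 3.6.0; on newer versions, call
#  RNGkind(sample.kind = "Rounding") first to reproduce it)
set.seed(1)
x <- sample(10, 20, replace = TRUE)
y <- c(1, 1, 1, 2, 2, 3, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10)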
# create the Rcpp function
library(Rcpp)
cppFunction('
  IntegerVector sotosGroup(NumericVector x, int cutoff) {
    IntegerVector groupVec (x.size());
    int group = 1;
    double runSum = 0;
    for (int i = 0; i < x.size(); i++) {
      runSum += x[i];
      // start a new group once the running sum exceeds the cutoff
      if (runSum > cutoff) {
        group++;
        runSum = x[i];
      }
      groupVec[i] = group;
    }
    return groupVec;
  }
')
# use the function as usual
y_cpp <- sotosGroup(x, 15)
sapply(split(x, y_cpp), sum)
#> 1 2 3 4 5 6 7 8 9 10
#> 13 13 9 10 15 12 12 13 14 8
all.equal(y, y_cpp)
#> [1] TRUE
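For anyone without a C++ toolchain, the same running-sum-with-reset loop can be sketched in plain R. This is only a sketch, not part of the original answer: group_by_cutoff is a hypothetical name, and x is written out literally (the values that reproduce the group sums shown above) so that the result does not depend on the R version's sampler. Expect it to be noticeably slower than the compiled version on long vectors.

```r
# Plain-R sketch of the same logic as sotosGroup(); group_by_cutoff is a
# hypothetical name used only for illustration.
group_by_cutoff <- function(x, cutoff) {
  groups <- integer(length(x))
  group <- 1L
  run_sum <- 0
  for (i in seq_along(x)) {
    run_sum <- run_sum + x[i]
    if (run_sum > cutoff) {
      # the running sum went over the cutoff: start a new group at x[i]
      group <- group + 1L
      run_sum <- x[i]
    }
    groups[i] <- group
  }
  groups
}

# x written out literally so the example is independent of the RNG
x <- c(3, 4, 6, 10, 3, 9, 10, 7, 7, 1, 3, 2, 7, 4, 8, 5, 8, 10, 4, 8)
y <- c(1, 1, 1, 2, 2, 3, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10)
y_r <- group_by_cutoff(x, 15)
all.equal(y, y_r)
```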
In case anyone needs to be convinced by the speed:
# Speed Benchmarks
library(data.table)
library(microbenchmark)
dt <- data.table(x)
frank <- function(DT, n = 15) {
  DT[, xc := cumsum(x)]
  b = DT[.(shift(xc, fill=0) + n + 1), on=.(xc), roll=-Inf, which=TRUE]
  z = 1; res = z
  while (!is.na(z))
    res <- c(res, z <- b[z])
  DT[, g := cumsum(.I %in% res)][]
}
microbenchmark(
  frank(dt),
  sotosGroup(x, 15),
  times = 100
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> frank(dt) 1720.589 1831.320 2148.83096 1878.0725 1981.576 13728.830 100 b
#> sotosGroup(x, 15) 2.595 3.962 6.47038 7.5035 8.290 11.579 100 a
R: Conditional summation of a numeric vector
Try this:
MakeNonNeg <- function(v) {
  size <- length(v)
  myOut <- as.numeric(v)
  if (size > 1L) {
    for (i in 1:(size-1L)) {
      if (myOut[i] >= 0) {next}
      myOut[i+1L] <- myOut[i]+myOut[i+1L]
      myOut[i] <- 0
    }
  }
  myOut
}
MakeNonNeg(inVector)
[1] 2 0 0 29 7
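To make the carry-forward rule easier to follow, here is a small trace on a made-up vector (v below is a hypothetical input, not the inVector from the question). The function is repeated so the snippet runs on its own. Each negative element is added to its right-hand neighbour and then zeroed, so the total is preserved; the last element is never touched, so it can stay negative.

```r
# Copy of MakeNonNeg from above so this snippet is self-contained
MakeNonNeg <- function(v) {
  size <- length(v)
  myOut <- as.numeric(v)
  if (size > 1L) {
    for (i in 1:(size - 1L)) {
      if (myOut[i] >= 0) next
      myOut[i + 1L] <- myOut[i] + myOut[i + 1L]  # push the deficit to the right
      myOut[i] <- 0
    }
  }
  myOut
}

v <- c(5, -3, 4, -6, 2)   # hypothetical input
res <- MakeNonNeg(v)
res                       # -3 is absorbed by 4, -6 is absorbed by 2
sum(v) == sum(res)        # the total is unchanged
```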
Below is a more exotic example:
set.seed(4242)
BigVec <- sample(-40000:100000, 100000, replace = TRUE)
gmp::sum.bigz(BigVec)
Big Integer ('bigz') :
[1] 2997861106
t3 <- MakeNonNeg(BigVec)
gmp::sum.bigz(t3)
Big Integer ('bigz') :
[1] 2997861106
BigVec[1:20]
[1] 98056 8680 -7814 53620 58390 90832 74970 -16392 52648 83779 -17229 38484 -36589 75156 71200 95968 -11599 57705
[19] 19209 -21596
t3[1:20]
[1] 98056 8680 0 45806 58390 90832 74970 0 36256 83779 0 21255 0 38567 71200 95968 0 46106 19209 0
Here is my system info:
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Below are timings for both functions with JIT disabled.
microbenchmark(
makeNonNeg = MakeNonNeg(BigVec),
zeroElement = zeroElement(BigVec),
times=10)
Unit: milliseconds
expr min lq mean median uq max neval
makeNonNeg 254.1255 255.8430 267.9527 258.6369 277.0222 303.6516 10
zeroElement 152.0358 164.7988 175.3191 166.4948 198.3855 209.8739 10
With JIT enabled, we obtain very different results for makeNonNeg. However, the results for zeroElement don't change that much (I'm thinking that since Reduce is the major part of the function and it is already bytecoded, there is not much room for improvement).
library(compiler)
enableJIT(3)
[1] 0
microbenchmark(
makeNonNeg = MakeNonNeg(BigVec),
zeroElement = zeroElement(BigVec),
times=10)
Unit: milliseconds
expr min lq mean median uq max neval
makeNonNeg 11.20514 11.55366 12.76953 11.84655 12.20554 20.60036 10
zeroElement 144.15123 149.33591 163.66421 157.34711 176.20139 198.57268 10
So, with JIT disabled, zeroElement is about 50% faster, and with JIT enabled, MakeNonNeg is about 13x faster.
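The timings above were taken on R 3.3.0, where the JIT is off by default; as far as I know, R 3.4.0 and later enable it at level 3 out of the box. If you only want to byte-compile a single loop-heavy function rather than change the global setting, compiler::cmpfun can be used. A minimal sketch, with loop_sum as a hypothetical stand-in for any function built around an explicit for loop:

```r
library(compiler)

# Hypothetical loop-heavy function, used only to illustrate byte-compilation
loop_sum <- function(v) {
  total <- 0
  for (e in v) total <- total + e
  total
}

old_level <- enableJIT(0)          # switch the JIT off; returns the previous level
loop_sum_cmp <- cmpfun(loop_sum)   # explicitly byte-compile this one function
res_plain <- loop_sum(1:1000)
res_cmp <- loop_sum_cmp(1:1000)
invisible(enableJIT(old_level))    # restore the previous JIT level
identical(res_plain, res_cmp)      # same result, only the execution path differs
```

Timing loop_sum against loop_sum_cmp with microbenchmark while the JIT is off should show the same kind of gap as the MakeNonNeg numbers above, though the exact ratio will depend on the machine and R version.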
Creating a vector of conditional sums
Since you did not provide a reproducible example, I'll use the iris dataset.
aggregate(iris[,1:4], by=list(iris$Species), sum)
# Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 setosa 250.3 171.4 73.1 12.3
#2 versicolor 296.8 138.5 213.0 66.3
#3 virginica 329.4 148.7 277.6 101.3
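The same table can also be written with the formula interface of aggregate, which names the grouping column Species instead of Group.1. This is just an alternative spelling of the call above, not a different method:

```r
# Formula interface: "." stands for all remaining columns, grouped by Species
sums <- aggregate(. ~ Species, data = iris, FUN = sum)
sums
```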
Create grouping based on cumulative sum and another group
This works! It uses a combination of purrr's accumulate (similar to cumsum but more versatile) and cumsum with appropriate use of group_by to get what you're looking for. I've added comments to indicate what each part is doing. I'll note that next_group2 is a bit of a misnomer (it's more of a not_next_group2), but hopefully the rest is clear.
library(tidyverse)
domain <- c(rep(1,5),rep(2,8))
value <- c(1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)
## Modified from: https://stackoverflow.com/questions/49076769/dplyr-r-cumulative-sum-with-reset
sum_reset_at = function(val_col, threshold, include.equals = TRUE) {
  if (include.equals) {
    purrr::accumulate({{val_col}}, ~if_else(.x>=threshold , .y, .x+.y))
  } else {
    purrr::accumulate({{val_col}}, ~if_else(.x>threshold , .y, .x+.y))
  }
}
df_raw %>%
  group_by(domain) %>%
  mutate(cumsum_val = sum_reset_at(value, 1)) %>%
  ## binary interpretation of whether there should be a new group
  mutate(next_group1 = ifelse(lag(cumsum_val) >= 1 | row_number() == 1, 1, 0)) %>%
  ungroup %>%
  ## generate new groups
  mutate(group1 = cumsum(next_group1)) %>%
  group_by(domain, group1) %>%
  ## similar to above, but grouped by your new group1; we ask it only to
  ## transition at the first value of the group that doesn't reach 1
  mutate(next_group2 = ifelse(max(cumsum_val) < 1 & row_number() == 1, 1, 0)) %>%
  ungroup %>%
  ## cancel out the next_group1 binary if it meets the conditions of next_group2
  mutate(group2 = cumsum(next_group1 - next_group2)) %>%
  select(-starts_with("next_"))
And as specified, this produces:
# A tibble: 13 x 5
domain value cumsum_val group1 group2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1
2 1 0 0 2 2
3 1 2 2 2 2
4 1 2.5 2.5 3 3
5 1 0.1 0.1 4 3
6 2 0.1 0.1 5 4
7 2 0.5 0.6 5 4
8 2 0 0.6 5 4
9 2 0.2 0.8 5 4
10 2 0.6 1.4 5 4
11 2 0 0 6 4
12 2 0 0 6 4
13 2 0.1 0.1 6 4
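If you only need the cumsum_val column and would rather avoid the tidyverse dependency, the reset rule can be sketched in base R with Reduce(accumulate = TRUE) and ave. This is a rough equivalent under the same rule as sum_reset_at with include.equals = TRUE; cumsum_reset is a hypothetical helper name, and the group1/group2 columns are not reproduced here.

```r
domain <- c(rep(1, 5), rep(2, 8))
value  <- c(1, 0, 2, 2.5, 0.1, 0.1, 0.5, 0, 0.2, 0.6, 0, 0, 0.1)

# Hypothetical helper: cumulative sum that restarts on the element after the
# running total has reached the threshold
cumsum_reset <- function(v, threshold) {
  Reduce(function(acc, val) if (acc >= threshold) val else acc + val,
         v, accumulate = TRUE)
}

# ave() applies the helper within each domain and keeps the original row order
cumsum_val <- ave(value, domain, FUN = function(v) cumsum_reset(v, 1))
cumsum_val
```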
Grouping elements based on their sum
We can do a double cumsum to create a grouping variable and split the vector ('x') based on that.
lst <- split(x, cumsum(cumsum(x)>10))
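To see what the double cumsum does, it helps to print the intermediate vectors. The x below is a made-up input used only for illustration. Note that the running total is never reset, so for non-negative values every element after the first crossing of 10 ends up in a group of its own.

```r
x <- c(3, 4, 5, 2, 1)    # hypothetical input
running <- cumsum(x)     # running total of x
over <- running > 10     # TRUE from the first element that pushes the total past 10
grp <- cumsum(over)      # 0 before the crossing, then 1, 2, 3, ...
lst <- split(x, grp)
lst
```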
If we need individual objects in the global environment, list2env can be used (though this is not recommended):
list2env(setNames(lst, letters[seq_along(lst)]), envir = .GlobalEnv)
Total sum of a numeric vector conditional on values from a character vector
Test data, in a dataframe for convenience rather than two vectors:
> d = data.frame(x=runif(10),y=sample(LETTERS[1:3],10,TRUE))
> d
x y
1 0.25927547 B
2 0.95012667 C
3 0.85133149 C
4 0.64658480 B
etc
Use tapply:
> tapply(d$x,d$y,sum)
A B C
1.547225 1.891884 2.666552
Convince yourself:
> sum(d$x[d$y=="A"])
[1] 1.547225
> sum(d$x[d$y=="B"])
[1] 1.891884
> sum(d$x[d$y=="C"])
[1] 2.666552
The aggregate function with a formula argument makes it easy to do with more than one column:
> d=data.frame(x=runif(10),y=sample(LETTERS[1:3],10,TRUE),z=sample(LETTERS[1:3],10,TRUE))
> d
x y z
1 0.4166217 A C
2 0.5816940 B C
3 0.9915231 A C
4 0.7177323 B C
etc
We want to sum x within classes defined by y and z.
> aggregate(x~y+z,d,sum)
y z x
1 C A 1.6392171
2 B B 0.9389463
3 C B 0.3330299
4 A C 2.3748477
5 B C 1.2994263
Convince yourself:
> sum(d$x[d$y=="C" & d$z=="A"])
[1] 1.639217
> sum(d$x[d$y=="B" & d$z=="B"])
[1] 0.9389463
Note that C,A is a separate entry to A,C...
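Because the data above is random, here is a small fixed example (the values are made up) that puts tapply and aggregate side by side. It also shows that tapply accepts a list of grouping vectors, in which case it returns a cross-classified matrix with NA for combinations that do not occur, whereas aggregate returns one row per observed combination.

```r
# Small fixed data frame (hypothetical values) so the result is reproducible
d <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c("A", "B", "A", "C", "B", "A"),
  z = c("C", "C", "C", "A", "B", "C")
)

by_y  <- tapply(d$x, d$y, sum)              # one grouping vector
by_yz <- tapply(d$x, list(d$y, d$z), sum)   # rows = y, columns = z
agg   <- aggregate(x ~ y + z, d, sum)       # long format, observed combinations only

by_y
by_yz
agg
```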