Find Consecutive Sequence of Zeros in R

Using data.table, as your question suggests you want to, this does what you are after, as far as I can see:

library(data.table)

DT <- data.table(myOriginalDf)

# add the original order, so you can't lose it
DT[, orig := .I]

# rle by id, saving the run length as a new variable

DT[, rleLength := {rr <- rle(value); rep(rr$lengths, rr$lengths)}, by = 'id']

# key by value and length to subset 

setkey(DT, value, rleLength)

# which rows are value = 0 and length > 2

DT[list(0, unique(rleLength[rleLength>2])),nomatch=0]

##    value rleLength id orig
## 1:     0         3  x    6
## 2:     0         3  x    7
## 3:     0         3  x    8
## 4:     0         4  y   10
## 5:     0         4  y   11
## 6:     0         4  y   12
## 7:     0         4  y   13
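The key step above is expanding each run length to one entry per row. A minimal base R sketch of the same idea, using ave instead of data.table; the data frame df and the helper run_length_per_row are made up for illustration and are not the original poster's data:

```r
# Hypothetical sample data (not the original myOriginalDf)
df <- data.frame(
  id    = c("x", "x", "x", "x", "x", "y", "y", "y", "y"),
  value = c(1, 0, 0, 0, 2, 0, 0, 3, 0)
)

# Repeat each run length once per element of that run
run_length_per_row <- function(v) {
  rr <- rle(v)
  rep(rr$lengths, rr$lengths)
}

# Apply the helper separately within each id
df$rleLength <- ave(df$value, df$id, FUN = run_length_per_row)

# Keep rows that belong to a run of more than two zeros
result <- df[df$value == 0 & df$rleLength > 2, ]
print(result)
```

Here only the three zeros in group x survive the filter, because the zero runs in group y have lengths 2 and 1.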

How to find the indices where there are n consecutive zeroes in a row

Here are two base R approaches:

1) rle. First run rle, then compute ok to pick out the runs of zeros that are more than 3 long. We then compute the starts and ends of all runs, subsetting to the ok ones at the end.

with(rle(x), {
  ok <- values == 0 & lengths > 3
  ends <- cumsum(lengths)
  starts <- ends - lengths + 1
  data.frame(starts, ends)[ok, ]
})

giving:

  starts ends
1      6   17
2     34   58
3     72   89
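Since the original x is not shown, here is the same rle approach on a small made-up vector, followed by one possible way to expand the start/end pairs into the full set of indices. The vector and the names runs and idx are assumptions for illustration only:

```r
# Hypothetical input: zero runs of length 4, 2 and 5
x <- c(5, 0, 0, 0, 0, 2, 0, 0, 7, 0, 0, 0, 0, 0)

runs <- with(rle(x), {
  ok <- values == 0 & lengths > 3   # zero runs longer than 3
  ends <- cumsum(lengths)           # last position of every run
  starts <- ends - lengths + 1      # first position of every run
  data.frame(starts, ends)[ok, ]
})
print(runs)

# Expand each start/end pair into the individual positions
idx <- unlist(Map(seq, runs$starts, runs$ends))
print(idx)
```

The run of two zeros at positions 7 and 8 is dropped, so runs holds the pairs (2, 5) and (10, 14).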

2) gregexpr. Take the sign of each number (0 or 1, assuming x has no negative values) and concatenate the results into one long string. Then use gregexpr to find the locations of runs of at least 4 zeros. The result gives the starts; the ends are the starts plus the match.length attribute minus 1.

s <- paste(sign(x), collapse = "")
g <- gregexpr("0{4,}", s)[[1]]
data.frame(starts = 0, ends = attr(g, "match.length") - 1) + g

giving:

  starts ends
1      6   17
2     34   58
3     72   89
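To see the intermediate string, here is the gregexpr idea on a small made-up vector (the vector is an assumption, not the original x). The sketch assumes no negative values, since a sign of -1 would occupy two characters and shift the positions, and it assumes at least one match, since gregexpr returns -1 otherwise:

```r
# Hypothetical non-negative input
x <- c(5, 0, 0, 0, 0, 2, 0, 0, 7, 0, 0, 0, 0, 0)

# One character per element: "10000100100000"
s <- paste(sign(x), collapse = "")

# Positions where a run of at least four "0" characters begins
g <- gregexpr("0{4,}", s)[[1]]

starts <- as.integer(g)
ends <- starts + attr(g, "match.length") - 1L
print(data.frame(starts, ends))
```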

Find distribution of consecutive zeros

1) We can use rleid from data.table

library(data.table)

data.table(x)[, strrep(0, sum(x==0)) ,rleid(x == 0)][V1 != "",.N , V1]
#    V1 N
#1:   0 3
#2:  00 2
#3: 000 1

2) or we can use tidyverse

library(tidyverse)
tibble(x) %>%
    group_by(grp = cumsum(x != 0)) %>% 
    filter(x == 0)  %>% 
    count(grp) %>% 
    ungroup %>% 
    count(n)
# A tibble: 3 x 2
#     n    nn
#   <int> <int>
#1     1     3
#2     2     2
#3     3     1

3) Or we can use tabulate with rleid

tabulate(tabulate(rleid(x)[x==0]))
#[1] 3 2 1
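For readers without data.table, a base R sketch of the same counting idea with rle and tabulate. The input vector is made up for illustration; it is not the x from the question:

```r
# Hypothetical input with zero runs of length 1, 2, 1, 3, 2, 1
x <- c(1, 0, 2, 0, 0, 3, 0, 4, 0, 0, 0, 5, 0, 0, 6, 0)

# Lengths of the runs where x == 0 is TRUE
zero_runs <- with(rle(x == 0), lengths[values])

# Position k of the result counts the zero runs of length k
dist <- tabulate(zero_runs)
print(dist)
```

With this input there are three runs of length 1, two of length 2 and one of length 3.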

Benchmarks

By checking with system.time on @SymbolixAU's dataset

system.time({
  tabulate(tabulate(rleid(x2)[x2==0]))
 })
#  user  system elapsed 
#  0.03    0.00    0.03

Comparing with the Rcpp function, the above is not that bad

 system.time({
  m <- zeroPattern(x2)
  m[m[,2] > 0, ]
})
#   user  system elapsed 
#   0.01    0.01    0.03

With microbenchmark, I removed the methods that consume more time (based on @SymbolixAU's comparisons) and ran a new comparison. Note that this is still not exactly apples to apples, but it is much closer: the previous comparison included the overhead of data.table along with some formatting to replicate the OP's expected output.

microbenchmark(
    akrun = {
        tabulate(tabulate(rleid(x2)[x2==0]))
    },
    G = {
        with(rle(x2), table(lengths[values == 0]))
    },
    sym = {
        m <- zeroPattern(x2)
        m[m[,2] > 0, ]
    },
    times = 5, unit = "relative"
)
#Unit: relative
#  expr      min       lq     mean   median       uq      max neval cld
# akrun 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000     5  a 
#     G 6.049181 8.272782 5.353175 8.106543 7.527412 2.905924     5   b
#   sym 1.385976 1.338845 1.661294 1.399635 3.845435 1.211131     5  a

Maximum of a consecutive sequence in a column with zeros

Here is one way to do it:

library(dplyr)

test %>% 
  group_by(id = data.table::rleid(vals)) %>% 
  summarise(max = ifelse(sum(vals) != 0,
                         list(max(cumsum, na.rm = TRUE)),
                         list(NULL))
            ) %>% 
  pull(max) %>%
  unlist

#> [1] 3 3 1

# the data
id = 1:16
vals = c(0,1,1,1,0,0,0,0,1,1,1,0,0,0,1,0)
cumsum  = c(0, 1, 2, 3, 0, 0, 0, 0, 1, 2, 3, 0, 0, 0, 1, 0)
test = data.frame(id,vals, cumsum)

Created on 2021-08-16 by the reprex package (v2.0.1)
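A rough base R equivalent of the grouping above, for comparison. It rebuilds the run ids with rle instead of data.table::rleid and uses tapply for the per-run maximum; the names run_id, run_max, run_sum and result are introduced here for illustration:

```r
# Same data as in the answer above
vals   <- c(0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0)
cumsum <- c(0, 1, 2, 3, 0, 0, 0, 0, 1, 2, 3, 0, 0, 0, 1, 0)
test <- data.frame(vals, cumsum)

# One id per run of equal values in vals
run_id <- with(rle(test$vals), rep(seq_along(lengths), lengths))

# Maximum of the cumsum column and sum of vals within each run
run_max <- tapply(test$cumsum, run_id, max)
run_sum <- tapply(test$vals, run_id, sum)

# Keep only the runs that are not all zeros
result <- as.vector(run_max[run_sum != 0])
print(result)
```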

How to count consecutive zero in last run?

Reverse a and then compute its cumulative sum. Assuming a has no negative values, the leading 0s of the reversed vector are the only 0s left in the cumulative sum, so ! of it is TRUE for each of them and FALSE for every other element. The sum of that is the desired number.

sum(!cumsum(rev(a)))
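A short worked example of the one-liner, with made-up vectors and a hypothetical wrapper name trailing_zeros, assuming non-negative values:

```r
# Hypothetical wrapper around the one-liner above
trailing_zeros <- function(a) sum(!cumsum(rev(a)))

a <- c(3, 0, 1, 0, 0, 2, 0, 0, 0)
# rev(a)         : 0 0 0 2 0 0 1 0 3
# cumsum(rev(a)) : 0 0 0 2 2 2 3 3 6
n_last <- trailing_zeros(a)
print(n_last)

# Edge cases: no trailing zeros, and a vector of only zeros
n_none <- trailing_zeros(c(0, 0, 4))
n_all  <- trailing_zeros(c(0, 0, 0))
```

The zeros in the middle of a do not count, because the cumulative sum is already positive by the time they are reached.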

Find consecutive zeroes in a row

#Had to fix Client 4, one number was missing
DF <- read.table(text = 'Clients     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
                 "Client 1"    123 768 678 452 213 123 55  10  0   0   0   0
                 "Client 2"    549 542 21  321 31  59  998 0   546 980 0   987
                 "Client 3"    500 0   500 0   500 0   500 0   500 0   500 0
                 "Client 4"    126 545 2315 27  268 126 56  0   0   0   0   0   
                 "Client 5"    546 546 0   0   0   328 486 326 0   0   66  0
                 "Client 6"    0   0   0   25  78  563 698 631 230 53  0   0', header = TRUE)

Loop over rows, reverse the order, and find which entry is the first non-zero; if the client never had a transaction, return length(x):

n <- apply(DF[, -1], 1, function(x) if (any(x != 0)) which.max(rev(x) != 0) - 1 else length(x))
#[1] 4 0 1 5 1 2

DF$Clients[n >= 3]
#[1] Client 1 Client 4
#Levels: Client 1 Client 2 Client 3 Client 4 Client 5 Client 6

Finding the first number after consecutive zeros in data frame

We can use rle to select the first row after first consecutive zeroes in each group (ID).

library(dplyr)

data %>%
 group_by(ID) %>%
 slice(with(rle(event == 0), sum(lengths[1:which.max(values)])) + 1)

#     ID  time event
#  <int> <int> <dbl>
#1     1     8     1
#2     2     6     1
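The index arithmetic inside slice can be traced on a single made-up vector. The vector event below is hypothetical and stands in for one ID group:

```r
# Hypothetical events for one ID: the first zero run covers positions 3 to 5
event <- c(1, 1, 0, 0, 0, 2, 0, 3)

r <- rle(event == 0)
# r$values  : FALSE TRUE FALSE TRUE FALSE
# r$lengths :     2    3     1    1     1

# Index of the first run of zeros (first TRUE in r$values)
first_zero_run <- which.max(r$values)

# Total length of everything up to and including that run, plus one
idx <- sum(r$lengths[1:first_zero_run]) + 1
print(event[idx])
```

This sketch assumes the group contains at least one zero and that the first zero run is not at the very end; otherwise which.max falls back to the first run, or idx points past the last element.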

Count of consecutive zeros in a dataframe

Solution using rle:

getConsecZeroRle <- function(x) {
    foo <- rle(x)
    foo$lengths[tail(which(foo$values), 1)]
}
result <- apply(df[, -1] == 0, 1, function(x) getConsecZeroRle(x))
df$test <- as.numeric(result)
df$test[is.na(df$test)] <- 0

Explanation:

Use apply to iterate over the rows of the subset of your dataframe. For each row, calculate the lengths of the runs of consecutive zeros (rle) and extract the last one using tail. Rows that don't have zeros will produce NA; use is.na(df$test) to replace them with zeros.

Solution using sum:

getConsecZeroSum <- function(x) {
    x[1:tail(which(!x), 1)] <- FALSE
    sum(x)
}
df$test <- apply(df[, -1] == 0, 1, function(x) getConsecZeroSum(x))

Explanation:

Extract last FALSE value in each row and turn everything to FALSE before it (x[1:tail(which(!x), 1)] <- FALSE) then use sum to count zero values from the end.

Result:

#      a 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 test
# 1 row1 0 0 0 1 0 0 1 0 0  0  0  0  0  0  0    8
# 2 row2 0 0 0 1 1 1 1 1 1  1  1  1  1  1  0    1
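Both helpers agree on the two rows shown, but as far as I can tell they are not equivalent in general: the rle version returns the length of the last run of zeros wherever it sits, while the sum version counts only zeros after the last non-zero value. A small check on two made-up rows (the vectors trailing and interior are assumptions for illustration):

```r
# The two helpers from the answer above
getConsecZeroRle <- function(x) {
  foo <- rle(x)
  foo$lengths[tail(which(foo$values), 1)]
}
getConsecZeroSum <- function(x) {
  x[1:tail(which(!x), 1)] <- FALSE
  sum(x)
}

# Logical input, TRUE where the original value is zero
trailing <- c(1, 0, 0) == 0   # zeros at the end
interior <- c(0, 0, 1) == 0   # zeros only before the last non-zero

rle_trailing <- getConsecZeroRle(trailing)
sum_trailing <- getConsecZeroSum(trailing)
rle_interior <- getConsecZeroRle(interior)
sum_interior <- getConsecZeroSum(interior)
print(c(rle_trailing, sum_trailing, rle_interior, sum_interior))
```

Which behaviour is wanted depends on whether a row that ends in a non-zero value should count as having zero consecutive zeros.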

Change zero to ones in vector if surrounded by less than five consecutive zeros

A possible solution with rle which does not change short sequences of zeros at the beginning or end of x:

# create the run length encoding
r <- rle(x)

# create an index of which zeros should be changed
i <- r$values == 0 & r$lengths < 5 & 
  c(tail(r$values, -1) == 1, FALSE) & 
  c(FALSE, head(r$values, -1) == 1)

# set the appropriate values to 1
r$values[i] <- 1

# use the inverse of rle to recreate the vector
inverse.rle(r)

which gives:

[1] 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1
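Since x itself is not shown, here is the same logic traced on a small made-up vector, assuming x contains only 0s and 1s. The vector and the name filled are for illustration only:

```r
# Hypothetical 0/1 input: runs are 00 | 1 | 00 | 11 | 00000 | 1 | 0
x <- c(0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0)

r <- rle(x)

# A run qualifies if it is zeros, shorter than 5,
# and has a run of ones on both sides
i <- r$values == 0 & r$lengths < 5 &
  c(tail(r$values, -1) == 1, FALSE) &
  c(FALSE, head(r$values, -1) == 1)

r$values[i] <- 1
filled <- inverse.rle(r)
print(filled)
```

Only the interior run at positions 4 and 5 is filled. The leading and trailing zeros have no run of ones on one side, and the run of five zeros is too long.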

tidyverse : consecutive appearance of zeros

One option could be:

library(dplyr)

tb1 %>%
 group_by(rleid = with(rle(a), rep(seq_along(lengths), lengths))) %>%
 mutate(b = 1:n() * (a != 1)) 

       a     b rleid
   <dbl> <int> <int>
 1     1     0     1
 2     0     1     2
 3     0     2     2
 4     0     3     2
 5     0     4     2
 6     1     0     3
 7     0     1     4
 8     0     2     4
 9     0     3     4
10     0     4     4
11     0     5     4
12     1     0     5
13     1     0     5
14     0     1     6
15     0     2     6
16     0     3     6
17     0     4     6
18     1     0     7
19     0     1     8
20     0     2     8
