Find start and end positions/indices of runs/consecutive values
Core logic:
# Example vector and rle object
x = rev(rep(6:10, 1:5))
rle_x = rle(x)
# Compute endpoints of run
end = cumsum(rle_x$lengths)
# head(end, -1) drops the last endpoint; base stats::lag() does not shift a plain vector
start = c(1, head(end, -1) + 1)
# Display results
data.frame(start, end)
# start end
# 1 1 5
# 2 6 9
# 3 10 12
# 4 13 14
# 5 15 15
Tidyverse/dplyr way (data frame-centric):
library(dplyr)
rle(x) %>%
  unclass() %>%
  as.data.frame() %>%
  mutate(end = cumsum(lengths),
         start = c(1, dplyr::lag(end)[-1] + 1)) %>%
  magrittr::extract(c(1,2,4,3)) # To re-order start before end for display
Because the start and end vectors are the same length as the values component of the rle object, solving the related problem of identifying endpoints for runs meeting some condition is straightforward: filter or subset the start and end vectors using the condition on the run values.
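The same filtering idea can be sketched outside R. Below is a minimal Python version using only the standard library; the helper name run_bounds and the condition (value at least 8) are assumptions made for illustration, and positions are kept 1-based and inclusive to match the R output above.

```python
from itertools import groupby

def run_bounds(seq):
    """Return (value, start, end) for each run in seq, using 1-based inclusive positions."""
    bounds = []
    end = 0
    for value, group in groupby(seq):
        length = sum(1 for _ in group)
        start = end + 1          # a run starts right after the previous run ends
        end += length            # running total of lengths, like cumsum(lengths)
        bounds.append((value, start, end))
    return bounds

# Same vector as the R example: 10 x5, 9 x4, 8 x3, 7 x2, 6 x1
x = [10] * 5 + [9] * 4 + [8] * 3 + [7] * 2 + [6]
all_runs = run_bounds(x)

# Hypothetical condition: keep only runs whose value is at least 8
selected = [(start, end) for value, start, end in all_runs if value >= 8]
print(selected)
```

Because each run carries its value alongside its endpoints, any condition on the run values can be applied in the final list comprehension.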
Get start and end index of runs of values
A solution from base R.
a <- c(1,1,0,0,1,2,0,0)
# Get run length encoding
b <- rle(a)
# Create a data frame
dt <- data.frame(number = b$values, lengths = b$lengths)
# Get the end
dt$end <- cumsum(dt$lengths)
# Get the start
dt$start <- dt$end - dt$lengths + 1
# Select columns
dt <- dt[, c("number", "start", "end")]
# Sort rows
dt <- dt[order(dt$number), ]
dt
# number start end
#2 0 3 4
#5 0 7 8
#1 1 1 2
#3 1 5 5
#4 2 6 6
Update
Here is a solution using with() to make the code more concise.
with(rle(a), data.frame(number = values,
                        start = cumsum(lengths) - lengths + 1,
                        end = cumsum(lengths))[order(values),])
# number start end
#2 0 3 4
#5 0 7 8
#1 1 1 2
#3 1 5 5
#4 2 6 6
find start end index of bouts of consecutive equal values
Use the shifting cumsum trick to mark consecutive groups, then use groupby to get indices and filter by your conditions.
v = (df['A'] != df['A'].shift()).cumsum()
u = df.groupby(v)['A'].agg(['all', 'count'])
m = u['all'] & u['count'].ge(3)
df.groupby(v).apply(lambda x: (x.index[0], x.index[-1]))[m]
A
3 (3, 5)
7 (9, 11)
dtype: object
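To see what the shift-and-cumsum trick produces, here is a minimal pure-Python sketch of the same steps. The sample column A is hypothetical (the original df is not shown), and the filter mirrors the conditions above: runs of truthy values that are at least 3 long.

```python
from itertools import accumulate

# Hypothetical sample column; falsy values (0) stand in for rows that fail the condition
A = [0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1]

# True wherever a value differs from its predecessor (the first element always starts a run)
changed = [True] + [cur != prev for prev, cur in zip(A, A[1:])]

# Running sum of the change flags gives one integer label per run
group_ids = list(accumulate(int(flag) for flag in changed))

# Collect the first and last index of each run
bounds = {}
for idx, gid in enumerate(group_ids):
    first, _ = bounds.get(gid, (idx, idx))
    bounds[gid] = (first, idx)

# Keep runs of truthy values that are at least 3 long
result = {
    gid: (first, last)
    for gid, (first, last) in bounds.items()
    if A[first] and last - first + 1 >= 3
}
print(group_ids)
print(result)
```

Since every element of a run is equal, checking the first element of each run is enough to decide whether the whole run is truthy.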
R search function to return start and end location
The link provided by @markus solves your problem; you just need to adapt it to your requirements.
get_inds <- function(test, a, b) {
  test <- subset(test, C1 == a)
  inds <- rle(test$C1 == a & test$C2 == b)
  end = cumsum(inds$lengths)
  start = c(1, head(end, -1) + 1)
  df = data.frame(start, end)[inds$values, ]
  row.names(df) <- NULL
  df
}
get_inds(test, 'aa', 'J')
# start end
#1 1 3
#2 5 8
#3 10 11
You need to change the condition for rle and remove the rows where the condition is not satisfied.
Find runs and lengths of consecutive values in an array
Find consecutive runs and length of runs with condition
import numpy as np
arr = np.array([0, 3, 0, 1, 0, 1, 2, 1, 2, 2, 2, 2, 1, 3, 4])
res = np.ones_like(arr)
np.bitwise_xor(arr[:-1], arr[1:], out=res[1:]) # set equal, consecutive elements to 0
# for float arrays, use this instead (bitwise XOR is not defined for floats)
# arr = np.array([0, 3, 0, 1, 0, 1, 2, 1, 2.4, 2.4, 2.4, 2, 1, 3, 4, 4, 4, 5])
# res = np.hstack([True, ~np.isclose(arr[:-1], arr[1:])])
idxs = np.flatnonzero(res) # get indices of non zero elements
values = arr[idxs]
counts = np.diff(idxs, append=len(arr)) # difference between consecutive indices are the length
cond = counts > 2
values[cond], counts[cond], idxs[cond]
Output
(array([2]), array([4]), array([8]))
# (array([2.4, 4. ]), array([3, 3]), array([ 8, 14]))
Locate start and end of consecutive values below threshold
# Add a dummy element whose value (-1) cannot occur in the thresholded data,
# because runs$lengths has a blank name for its last element
vec <- c(vec, "Dummy" = -1)
reclass <- c(vec)
reclass[vec>thrs] <- 1
reclass[vec<=thrs & vec>=0] <- 0 #be careful not to assign these categories to the dummy
runs <- rle(reclass)
Then, purely by looking at the pattern:
> runs$lengths
20160315 20160410 20160515 20160605 20160725 20160815 20160905 20161115 Dummy
2 2 1 1 2 1 1 2 2 1
> runs$values
20160215 20160330 20160410 20160515 20160630 20160725 20160815 20161005 20161225 Dummy
0 1 0 1 0 1 0 1 0 -1
> (endingDates<-names(runs$values[runs$values==0 & runs$lengths >=2]))
[1] "20160215" "20160630" "20161225"
> (offset<-runs$lengths[which(names(runs$values) %in% endingDates)]-1)
20160315 20160725 Dummy
1 1 1
> (startingDates <- names(reclass)[which(names(reclass) %in% endingDates) - offset])
[1] "20160101" "20160605" "20161115"
How to find the indices where there are n consecutive zeroes in a row
Here are two base R approaches:
1) rle. First run rle and then compute ok to pick out the sequences of zeros that are more than 3 long. We then compute the starts and ends of all runs, subsetting to the ok ones at the end.
with(rle(x), {
  ok <- values == 0 & lengths > 3
  ends <- cumsum(lengths)
  starts <- ends - lengths + 1
  data.frame(starts, ends)[ok, ]
})
giving:
starts ends
1 6 17
2 34 58
3 72 89
2) gregexpr. Take the sign of each number (0 or 1, assuming x has no negative values) and concatenate the results into one long string. Then use gregexpr to find the locations of runs of at least 4 zeros. The result gives the starts; the ends are the starts plus the match.length attribute minus 1.
s <- paste(sign(x), collapse = "")
g <- gregexpr("0{4,}", s)[[1]]
data.frame(starts = 0, ends = attr(g, "match.length") - 1) + g
giving:
starts ends
1 6 17
2 34 58
3 72 89
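The string-matching approach carries over to other languages with a regex engine. Here is a minimal Python sketch of it; the function name zero_runs and the sample input are assumptions for illustration (the original x is not shown), and the results are reported as 1-based inclusive positions to match the R output.

```python
import re

def zero_runs(x, min_len=4):
    """Return 1-based (start, end) pairs for runs of at least min_len zeros.

    Each element maps to exactly one character ('0' for zero, '1' otherwise),
    so string offsets line up with positions in x even for negative values.
    """
    s = "".join("0" if v == 0 else "1" for v in x)
    pattern = re.compile("0{%d,}" % min_len)
    # m.start() is 0-based and m.end() is exclusive, so (start + 1, end) is 1-based inclusive
    return [(m.start() + 1, m.end()) for m in pattern.finditer(s)]

# Hypothetical input, not the x from the original question
x = [3, 0, 0, 0, 0, 0, 2, 0, 0, 5, 0, 0, 0, 0]
print(zero_runs(x))
```

Mapping each element to a single character, rather than pasting the raw numbers, is what keeps the match offsets aligned with the vector positions.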
Find indices of sequential duplicates in string in R
We can use split by the run-length-id of 'string' into a list, get the range of values, and rbind the list elements
rl <- rle(string)
lst <- lapply(split(seq_along(string), rep(seq_along(rl$values), rl$lengths)), range)
names(lst) <- rl$values
do.call(rbind, lst)
# [,1] [,2]
#A 1 3
#C 4 4
#G 5 6
#C 7 8
#T 9 12
Or in a compact way
library(data.table)
data.table(letter = string)[, .(letter = letter[1], start = .I[1],
                                end = .I[.N]), rleid(letter)]
Or with tidyverse
library(tidyverse)
library(data.table)
string %>%
  tibble(letter = .) %>%
  mutate(rn = row_number()) %>%
  group_by(grp = rleid(letter)) %>%
  summarise(letter = first(letter),
            start = first(rn),
            end = last(rn)) %>%
  ungroup() %>%
  select(-grp)
Finding Runs of Consecutive Ones
We could get the indices of consecutive pairs of ones using stri_locate_all. We paste the vector 'v1' into a single string and use a regex lookahead ((?=11)) to match the pattern. stri_locate_all gives the 'start' and 'end' index of every such pair of 11. Here I extracted only the start column ([,1]).
library(stringi)
stri_locate_all(paste(v1, collapse=""), regex="(?=11)")[[1]][,1]
#[1] 4 5 8
Regarding the OP's function, it has two input variables, 'x' and 'k', where 'x' is the vector ('v1') and 'k' is the tuple length, which I guess would be 2. We assign 'n' as the length of the vector and create a NULL vector 'runs' to hold the output indices. Then we loop over positions 1 to n-k+1 (which is 8 for the 'v1' below when k is 2). For each 'i' we take the sequence from 'i' to i+k-1 (if 'i' is 1, the end index is 2 and the sequence is 1:2), extract the corresponding elements of the vector (v1[1:2]), and check whether they are equal to 1. If all the elements are 1, we concatenate 'runs' with the corresponding index ('i').
data
v1 <- c(1,0,0,1,1,1,0,1,1)
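The loop described above can be sketched in Python as follows. The OP's function is not shown here, so the name find_runs and its exact shape are assumptions; the sketch returns 1-based positions so the result lines up with the R output, and it includes a lookahead variant of the regex approach for comparison.

```python
import re

def find_runs(x, k):
    """Return 1-based start positions of every window of k consecutive ones."""
    n = len(x)
    runs = []
    # Window starts range over 1..n-k+1 (1-based), i.e. 0..n-k (0-based)
    for i in range(n - k + 1):
        if all(value == 1 for value in x[i:i + k]):
            runs.append(i + 1)  # convert to 1-based to match the R output
    return runs

v1 = [1, 0, 0, 1, 1, 1, 0, 1, 1]
print(find_runs(v1, 2))  # [4, 5, 8]

# Regex lookahead variant: a zero-width match lets overlapping pairs be found
s = "".join(str(v) for v in v1)
lookahead_starts = [m.start() + 1 for m in re.finditer(r"(?=11)", s)]
print(lookahead_starts)  # [4, 5, 8]
```

The lookahead matters because the pairs overlap: a plain "11" pattern would consume both characters and miss the pair starting at position 5.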