Find Start and End Positions/Indices of Runs/Consecutive Values


Core logic:

# Example vector and rle object
x = rev(rep(6:10, 1:5))
rle_x = rle(x)

# Compute endpoints of run
end = cumsum(rle_x$lengths)
start = c(1, head(end, -1) + 1) # base R: stats::lag() does not shift a plain vector, so use head()

# Display results
data.frame(start, end)
# start end
# 1 1 5
# 2 6 9
# 3 10 12
# 4 13 14
# 5 15 15

Tidyverse/dplyr way (data frame-centric):

library(dplyr)

rle(x) %>%
  unclass() %>%
  as.data.frame() %>%
  mutate(end = cumsum(lengths),
         start = c(1, dplyr::lag(end)[-1] + 1)) %>%
  magrittr::extract(c(1,2,4,3)) # To re-order start before end for display

Because the start and end vectors are the same length as the values component of the rle object, solving the related problem of identifying endpoints for runs meeting some condition is straightforward: filter or subset the start and end vectors using the condition on the run values.
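For readers following along in Python, the same idea can be sketched with itertools.groupby. This is only an illustrative sketch: the helper name run_bounds and the condition value >= 9 are made up for the example, and positions are reported 1-based and inclusive to match the R output above.

```python
from itertools import groupby

def run_bounds(seq):
    """Return (value, start, end) per run, with 1-based inclusive positions as in R."""
    bounds = []
    pos = 0  # number of elements consumed so far
    for value, group in groupby(seq):
        length = sum(1 for _ in group)
        bounds.append((value, pos + 1, pos + length))
        pos += length
    return bounds

# Same vector as rev(rep(6:10, 1:5)) in the R example
x = [10] * 5 + [9] * 4 + [8] * 3 + [7] * 2 + [6]
runs = run_bounds(x)

# Keep only the endpoints of runs whose value meets a condition
# (value >= 9 is an arbitrary condition chosen for illustration)
selected = [(start, end) for value, start, end in runs if value >= 9]
print(runs)
print(selected)
```

Because each run carries its value alongside its endpoints, filtering on the value is a single list comprehension.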

Get start and end index of runs of values

A solution from base R.

a <- c(1,1,0,0,1,2,0,0)

# Get run length encoding
b <- rle(a)

# Create a data frame
dt <- data.frame(number = b$values, lengths = b$lengths)
# Get the end
dt$end <- cumsum(dt$lengths)
# Get the start
dt$start <- dt$end - dt$lengths + 1

# Select columns
dt <- dt[, c("number", "start", "end")]
# Sort rows
dt <- dt[order(dt$number), ]

dt
# number start end
#2 0 3 4
#5 0 7 8
#1 1 1 2
#3 1 5 5
#4 2 6 6

Update

Here is a solution using with to make the code more concise.

with(rle(a), data.frame(number = values,
start = cumsum(lengths) - lengths + 1,
end = cumsum(lengths))[order(values),])
# number start end
#2 0 3 4
#5 0 7 8
#1 1 1 2
#3 1 5 5
#4 2 6 6

find start end index of bouts of consecutive equal values

Use the shifting cumsum trick to mark consecutive groups, then use groupby to get indices and filter by your conditions.

v = (df['A'] != df['A'].shift()).cumsum()
u = df.groupby(v)['A'].agg(['all', 'count'])
m = u['all'] & u['count'].ge(3)

df.groupby(v).apply(lambda x: (x.index[0], x.index[-1]))[m]

A
3 (3, 5)
7 (9, 11)
dtype: object
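The snippet above assumes an existing df. A self-contained sketch of the same shift-and-cumsum trick follows; the data frame here is hypothetical (a boolean column A invented for the example), and it uses plain groupby aggregations on the index instead of apply.

```python
import pandas as pd

# Hypothetical data: a boolean column A
df = pd.DataFrame({"A": [False, True, True, False, True, True, True, False]})

# A new group id starts every time the value differs from the previous row
run_id = (df["A"] != df["A"].shift()).cumsum()

idx = df.index.to_series()
summary = pd.DataFrame({
    "all": df["A"].groupby(run_id).all(),     # True if the run consists of True values
    "count": df["A"].groupby(run_id).size(),  # run length
    "start": idx.groupby(run_id).first(),     # first index label of the run
    "end": idx.groupby(run_id).last(),        # last index label of the run
})

# Keep runs of True values that are at least 3 long
result = summary[summary["all"] & (summary["count"] >= 3)]
print(result[["start", "end"]])
```

With this made-up data the only qualifying run covers index labels 4 through 6.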

R search function to return start and end location

The link provided by @markus solves your problem; you just need to adapt it to your requirement.

get_inds <- function(test, a, b) {
  test <- subset(test, C1 == a)
  inds <- rle(test$C1 == a & test$C2 == b)
  end = cumsum(inds$lengths)
  start = c(1, head(end, -1) + 1)
  df = data.frame(start, end)[inds$values, ]
  row.names(df) <- NULL
  df
}

get_inds(test, 'aa', 'J')

# start end
#1 1 3
#2 5 8
#3 10 11

You need to change the condition for rle and remove the rows where the condition is not satisfied.

Find runs and lengths of consecutive values in an array

Find consecutive runs and length of runs with condition

import numpy as np

arr = np.array([0, 3, 0, 1, 0, 1, 2, 1, 2, 2, 2, 2, 1, 3, 4])

res = np.ones_like(arr)
np.bitwise_xor(arr[:-1], arr[1:], out=res[1:]) # set equal, consecutive elements to 0
# use this for np.floats instead
# arr = np.array([0, 3, 0, 1, 0, 1, 2, 1, 2.4, 2.4, 2.4, 2, 1, 3, 4, 4, 4, 5])
# res = np.hstack([True, ~np.isclose(arr[:-1], arr[1:])])
idxs = np.flatnonzero(res) # get indices of non zero elements
values = arr[idxs]
counts = np.diff(idxs, append=len(arr)) # difference between consecutive indices are the length

cond = counts > 2
values[cond], counts[cond], idxs[cond]

Output

(array([2]), array([4]), array([8]))
# (array([2.4, 4. ]), array([3, 3]), array([ 8, 14]))

Locate start and end of consecutive values below threshold


vec <- c(vec, "Dummy" = -1) # append a dummy element with a value that does not occur in the data (-1), because the last entry of runs$lengths has a blank name

reclass <- c(vec)
reclass[vec>thrs] <- 1
reclass[vec<=thrs & vec>=0] <- 0 #be careful not to assign these categories to the dummy
runs <- rle(reclass)

Then, purely by looking at the pattern:

> runs$lengths
20160315 20160410 20160515 20160605 20160725 20160815 20160905 20161115 Dummy
2 2 1 1 2 1 1 2 2 1
> runs$values
20160215 20160330 20160410 20160515 20160630 20160725 20160815 20161005 20161225 Dummy
0 1 0 1 0 1 0 1 0 -1
> (endingDates<-names(runs$values[runs$values==0 & runs$lengths >=2]))
[1] "20160215" "20160630" "20161225"
> (offset<-runs$lengths[which(names(runs$values) %in% endingDates)]-1)
20160315 20160725 Dummy
1 1 1
> (startingDates <- names(reclass)[which(names(reclass) %in% endingDates) - offset])
[1] "20160101" "20160605" "20161115"
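The same kind of lookup can be sketched in Python without the dummy element. The readings, the threshold thrs = 0.5, and the minimum run length of 2 below are invented for illustration only, since the original vec and thrs are not shown.

```python
from itertools import groupby

# Hypothetical (date label, value) pairs; the values are made up
readings = [
    ("20160101", 0.2), ("20160215", 0.3), ("20160315", 0.9), ("20160330", 0.8),
    ("20160410", 0.1), ("20160515", 0.7), ("20160605", 0.4), ("20160630", 0.2),
]
thrs = 0.5    # assumed threshold
min_len = 2   # assumed minimum run length

below_runs = []
# groupby splits the readings into consecutive runs of "at or below" vs "above" threshold
for is_below, group in groupby(readings, key=lambda r: r[1] <= thrs):
    items = list(group)
    if is_below and len(items) >= min_len:
        # record the first and last date label of the run
        below_runs.append((items[0][0], items[-1][0]))

print(below_runs)
```

Each tuple holds the starting and ending date label of a run that stays at or below the threshold for at least min_len readings.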

How to find the indices where there are n consecutive zeroes in a row

Here are two base R approaches:

1) rle First run rle and then compute ok to pick out the sequences of zeros that are more than 3 long. We then compute the starts and ends of all repeated sequences subsetting to the ok ones at the end.

with(rle(x), {
  ok <- values == 0 & lengths > 3
  ends <- cumsum(lengths)
  starts <- ends - lengths + 1
  data.frame(starts, ends)[ok, ]
})

giving:

  starts ends
1 6 17
2 34 58
3 72 89

2) gregexpr Take the sign of each number (0 or 1, assuming x has no negative values) and concatenate the signs into one long string. Then use gregexpr to find the locations of runs of at least 4 zeros. The result gives the starts; the ends can be computed as the start plus the match.length attribute minus 1.

s <- paste(sign(x), collapse = "")
g <- gregexpr("0{4,}", s)[[1]]
data.frame(starts = 0, ends = attr(g, "match.length") - 1) + g

giving:

  starts ends
1 6 17
2 34 58
3 72 89
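The regex approach carries over to Python's re module. A minimal sketch, assuming a made-up vector x (the original x is not shown) and reporting 1-based inclusive positions as R does:

```python
import re

# Hypothetical input; the original x is not shown in the answer
x = [3, 0, 0, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0, 0]

# One character per element: "0" for zero, "1" for anything else
s = "".join("0" if v == 0 else "1" for v in x)

# m.start() is 0-based and m.end() is exclusive, so (start + 1, end)
# gives 1-based inclusive positions
spans = [(m.start() + 1, m.end()) for m in re.finditer(r"0{4,}", s)]
print(spans)
```

Mapping every non-zero element to "1" keeps one character per element, so string positions line up with vector positions even when negative values are present.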

Find indices of sequential duplicates in string in R

We can split the indices of 'string' by run-length id into a list, get the range of each list element, and rbind the list elements

rl <- rle(string)
lst <- lapply(split(seq_along(string), rep(seq_along(rl$values), rl$lengths)), range)
names(lst) <- rl$values
do.call(rbind, lst)
# [,1] [,2]
#A 1 3
#C 4 4
#G 5 6
#C 7 8
#T 9 12

Or in a compact way

library(data.table)
data.table(letter = string)[, .(letter = letter[1], start = .I[1],
end = .I[.N]), rleid(letter)]

Or with tidyverse

library(tidyverse)
library(data.table)
string %>%
tibble(letter = .) %>%
mutate(rn = row_number()) %>%
group_by(grp = rleid(letter)) %>%
summarise(letter = first(letter),
start = first(rn),
end = last(rn)) %>%
ungroup %>%
select(-grp)

Finding Runs of Consecutive Ones

We can get the start index of each consecutive pair of ones using stri_locate_all. We paste the vector 'v1' into a single string and use a regex lookahead ((?=11)) so that overlapping matches are found. stri_locate_all gives the 'start' and 'end' index of every match of 11; here only the start column ([,1]) is extracted.

library(stringi)
stri_locate_all(paste(v1, collapse=""), regex="(?=11)")[[1]][,1]
#[1] 4 5 8

Regarding the OP's function: it takes two inputs, 'x' (the vector, here 'v1') and 'k' (the run length, which I guess would be 2). It assigns 'n' as the length of the vector and creates a NULL vector 'runs' to collect the output indices. It then loops over 'i' from 1 to n-k+1 (the 8th element for the 'v1' below with k = 2). For each 'i', it takes the sequence from i to i+k-1 (if 'i' is 1, the end index is 2 and the sequence is 1:2), extracts the corresponding elements (v1[1:2]), and checks whether all of them equal 1. If they do, 'i' is appended to 'runs'.

data

 v1 <- c(1,0,0,1,1,1,0,1,1)
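
The OP's function is not reproduced here, so the following Python sketch is only a reconstruction from the description above; the name find_runs is hypothetical. It reports 1-based positions to match the stringi output.

```python
def find_runs(x, k):
    """Return the 1-based start positions of every window of k consecutive ones."""
    n = len(x)
    runs = []
    # Slide a window of length k over the vector: n - k + 1 windows in total
    for i in range(n - k + 1):
        if all(v == 1 for v in x[i:i + k]):
            runs.append(i + 1)  # convert the 0-based index to a 1-based position
    return runs

v1 = [1, 0, 0, 1, 1, 1, 0, 1, 1]
print(find_runs(v1, 2))  # [4, 5, 8]
```

Windows overlap, which is why positions 4 and 5 are both reported for the run of three ones.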

