Sequence Length Encoding Using R

EDIT: added a check so that character vectors are handled as well.

Based on rle, I came to the following solution:

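somefunction <- function(x){

    # Coerce non-numeric input; strings that are not numbers become NA
    if(!is.numeric(x)) x <- as.numeric(x)
    n <- length(x)
    # TRUE where an element is not its predecessor plus one (a run ends there)
    y <- x[-1L] != x[-n] + 1L
    # End positions of the runs; an NA comparison also ends a run
    i <- c(which(y | is.na(y)), n)

    list(
        lengths = diff(c(0L, i)),
        values = x[head(c(0L, i) + 1L, -1L)]
    )
}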

> s <- c(2,4,1:4, rep(5, 4), 6:9,4,4,4)

> somefunction(s)
$lengths
[1] 1 1 5 1 1 5 1 1 1

$values
[1] 2 4 1 5 5 5 4 4 4

This one works on every test case I tried and uses vectorized operations without ifelse clauses, so it should run faster. It converts non-numeric strings to NA, so you keep a numeric output.

> S <- c(4,2,1:5,5, "other" , "other",4:6,2)

> somefunction(S)
$lengths
[1] 1 1 5 1 1 1 3 1

$values
[1] 4 2 1 5 NA NA 4 2

Warning message:
In somefunction(S) : NAs introduced by coercion
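
To make the vectorized trick easier to follow, here is a minimal sketch that exposes the intermediate values on a small, made-up input. The names breaks, ends, run_lengths and run_starts are illustrative only and are not part of the answer above; NA handling is left out to keep it short.

```r
# Made-up input: runs of increasing consecutive integers are 1:3, 7:8 and 5
x <- c(1, 2, 3, 7, 8, 5)
n <- length(x)

# TRUE wherever an element is not its predecessor plus one
breaks <- x[-1L] != x[-n] + 1L

# Positions where a run ends; the last element always ends a run
ends <- c(which(breaks), n)

# Run lengths are the gaps between consecutive end positions
run_lengths <- diff(c(0L, ends))

# Each run starts one position after the previous run ended
run_starts <- x[head(c(0L, ends) + 1L, -1L)]

print(run_lengths)
print(run_starts)
```

Here the runs have lengths 3, 2 and 1 and start at the values 1, 7 and 5.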

Run length encoding of sequences

juxt can be used in the pack function:

(defn pack [xs]
  (map (juxt count first) (partition-by identity xs)))

(defn unpack [xs]
  (mapcat #(apply repeat %) xs))
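
For comparison with the R answers on this page, a rough analogue of pack and unpack can be sketched with base R's rle and inverse.rle. This is not a translation of the Clojure code, only an illustration of the same round trip; the input vector is made up.

```r
# Made-up input with three runs
xs <- c("a", "a", "b", "c", "c", "c")

# "pack": run lengths and the value of each run
packed <- rle(xs)

# "unpack": repeat each value by its run length
unpacked <- inverse.rle(packed)

print(packed$lengths)
print(packed$values)
print(identical(unpacked, xs))
```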

Random sampling using running length encoding (rle)

Here's a function to do it. You probably need some big numbers to make this worthwhile over just expanding out the rle explicitly.

x <- rle(c(1,1,1,1,1,2,2))

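sample_rle <- function(x, ...) {
  # Sample positions of the expanded vector, then map them to run values
  pos <- sample(sum(x$lengths), ...)
  x$values[findInterval(pos, cumsum(x$lengths), left.open = TRUE) + 1]
}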

sample_rle(x, size = 2, replace = FALSE)
#> [1] 2 1
sample_rle(x, size = 7, replace = FALSE)
#> [1] 2 1 2 1 1 1 1
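
One way to turn a sampled position into a run value without expanding the rle is to look the position up in the cumulative run lengths. The sketch below checks that idea on every position of the example, assuming findInterval with left.open = TRUE is used for the lookup; the names positions, run_index and looked_up are illustrative.

```r
x <- rle(c(1, 1, 1, 1, 1, 2, 2))

# Every position of the (never materialised) expanded vector
positions <- seq_len(sum(x$lengths))

# Index of the run that each position falls into:
# positions 1..5 belong to run 1, positions 6..7 to run 2
run_index <- findInterval(positions, cumsum(x$lengths), left.open = TRUE) + 1L

# Looking up the run values reproduces the expanded vector
looked_up <- x$values[run_index]
print(identical(looked_up, inverse.rle(x)))
```

Because the lookup only needs the cumulative lengths, memory use depends on the number of runs rather than on the length of the expanded vector.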

Run-length encoding and group by

We can use rleid in data.table

dat[, .(date = as.Date(time)[1], n = .N), .(bike_id, grp = rleid(address))][, grp := NULL][]

If there are multiple 'date' values within a group (as in the second dataset), the previous call selects only the first 'date' ([1]). If we want to get all of the dates, then either use

dat[, .(date = unique(as.Date(time)), n = .N), keyby = .(bike_id, grp = rleid(lon, lat))]
# bike_id grp date n
#1: 1 1 2017-11-22 3
#2: 1 2 2017-11-22 3
#3: 1 2 2017-11-21 3

But this also gives multiple rows for each group. If we need only a single row per group, either create a list column (which preserves the class)

dat[, .(date = list(unique(as.Date(time))), n = .N), keyby = .(bike_id, grp = rleid(lon, lat))]
# bike_id grp date n
#1: 1 1 2017-11-22 3
#2: 1 2 2017-11-22,2017-11-21 3

Or paste the unique elements together
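
A minimal sketch of the pasting step in base R, assuming the unique dates are simply collapsed into one comma-separated string. The timestamps are hypothetical and only stand in for the 'time' values of one group.

```r
# Hypothetical timestamps for a single group (not the OP's data)
time <- as.POSIXct(
  c("2017-11-22 09:00:00", "2017-11-22 10:30:00", "2017-11-21 08:15:00"),
  tz = "UTC"
)

# Unique dates in order of first appearance
dates <- unique(as.Date(time, tz = "UTC"))

# Collapse into one string; note that the Date class is lost
pasted <- paste(format(dates), collapse = ", ")
print(pasted)
```

Inside the data.table call, an expression like this would take the place of list(unique(as.Date(time))), at the cost of turning the column into character.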


Based on the update in the OP's post for the expected output (from the second dataset), we also need to use 'date' as a grouping variable

dat[, .(n = .N), keyby = .(bike_id, date = as.Date(time), grp = rleid(lon, lat))][, grp := NULL][]
# bike_id date n
#1: 1 2017-11-21 1
#2: 1 2017-11-22 3
#3: 1 2017-11-22 2
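
For readers unfamiliar with rleid, the grouping id it produces can be approximated in base R. The helper run_id below is hypothetical and handles a single vector only, whereas data.table's rleid also accepts several columns such as lon and lat; the address values are made up.

```r
# Hypothetical base R stand-in for rleid on one vector (not part of data.table)
run_id <- function(x) {
  r <- rle(x)
  # Number the runs 1, 2, 3, ... and repeat each number over its run
  rep(seq_along(r$lengths), r$lengths)
}

# Made-up addresses: "A" appears in two separate runs
address <- c("A", "A", "B", "B", "B", "A")
ids <- run_id(address)
print(ids)

# Rows per run, comparable to n = .N grouped by the run id
print(as.vector(table(ids)))
```

The second run of "A" gets its own id, which is why grouping by the run id differs from grouping by the address itself.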

Count average number of increasing consecutive integers in R with rle

There is probably a nicer way, but...

aggregate(data$moment,list(data$id), function(x) mean(rle(diffinv(diff(x)!=1))$lengths))
# Group.1 x
# 1 1 1.428571
# 2 2 2.500000


We first take the difference between neighbouring values. We then look for those differences that are not 1, i.e. the non-consecutive numbers (diff(x)!=1). We then take the inverse of the difference (diffinv) to get back to the original length. We now have a vector that increments at every non-consecutive number, so it is constant within each run of consecutive integers. Take rle of that, then the lengths, and finally apply mean, and you're done.
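
The steps can be traced on a small, made-up vector. This sketch swaps in cumsum(c(0, ...)) for diffinv, which should give the same run identifier here; the variable names are illustrative.

```r
# Made-up input: consecutive runs are 1:3, 5:6 and 9
x <- c(1, 2, 3, 5, 6, 9)

# TRUE where the step to the next value is not exactly 1
not_consecutive <- diff(x) != 1

# Running count of breaks, padded with a leading 0 to restore the length of x
run_id <- cumsum(c(0, not_consecutive))

# Length of each run of consecutive integers, then their average
run_lengths <- rle(run_id)$lengths
avg <- mean(run_lengths)

print(run_id)
print(run_lengths)
print(avg)
```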

Edit1: Removed a step that was unnecessary.

Finding length of flats of a sequence in R

We can use sequence, which behaves like unlist(lapply(yourvector, seq_len)): it loops (lapply) through the values of the vector, gets the sequence (seq_len) for each and unlists the result. Here we apply it to the run lengths from rle.

sequence(rle(v1)$lengths) - 1
#[1] 0 0 0 1 2 0 0 1 0

We subtract 1 from the result of sequence to get the desired output.
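
As a quick check of that description, the sketch below compares sequence with the explicit lapply form on the run lengths of v1. The names via_sequence and via_lapply are illustrative.

```r
v1 <- c(1, 2, 3, 3, 3, 4, 5, 5, 1)

# Lengths of the runs of identical values
run_lengths <- rle(v1)$lengths

# Both forms count 1, 2, ... within each run
via_sequence <- sequence(run_lengths)
via_lapply <- unlist(lapply(run_lengths, seq_len))
print(all(via_sequence == via_lapply))

# Shift to start at 0 so that each value is the offset within its flat
flat_offsets <- via_sequence - 1L
print(flat_offsets)
```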

Another option is using rleid from data.table (at the time of writing it was only available in the devel version, v1.9.5).

setDT(list(v1))[, seq_along(V1)-1,rleid(V1)]$V1
#[1] 0 0 0 1 2 0 0 1 0

We convert 'v1' to a 'data.table', group by rleid(V1), get the sequence along 'V1' within each group and subtract 1 from it.


The data used above:

v1 <- c(1,2,3,3,3,4,5,5,1)
runs <- rle(v1)

Calculate lengths of sequences of repeating numbers in a vector in R

Using dplyr and data.table's rleid function.


library(dplyr)

tibble(marker) %>%
  # Drop rows before the first 1
  filter(row_number() >= match(1, marker)) %>%
  # Count samples in each run of identical markers
  add_count(grp = data.table::rleid(marker), name = 'n_samples') %>%
  # Create trial number: runs of markers other than 0 and 1 are numbered in order
  mutate(trial_number = with(rle(!marker %in% c(1, 0)),
                             rep(cumsum(values) * values, lengths))) %>%
  # Drop the helper grouping column
  select(-grp)

This returns:

#   marker n_samples trial_number
#1 1 2 0
#2 1 2 0
#3 2 4 1
#4 2 4 1
#5 2 4 1
#6 2 4 1
#7 0 2 0
#8 0 2 0
#9 1 3 0
#10 1 3 0
#11 1 3 0
#12 3 3 2
#13 3 3 2
#14 3 3 2
#15 1 2 0
#16 1 2 0
#17 2 3 3
#18 2 3 3
#19 2 3 3
#20 0 2 0
#21 0 2 0
#22 1 3 0
#23 1 3 0
#24 1 3 0
#25 5 4 4
#26 5 4 4
#27 5 4 4
#28 5 4 4
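
The trial_number expression is the least obvious part, so here is a base R sketch of just that step on a short, made-up marker vector. The names is_trial and r are illustrative.

```r
# Made-up markers: 0 and 1 are not trials, anything else is
marker <- c(1, 1, 2, 2, 0, 1, 3, 3, 3)

# TRUE for rows that belong to a trial
is_trial <- !marker %in% c(1, 0)

# Runs of trial / non-trial rows
r <- rle(is_trial)

# cumsum(r$values) numbers the TRUE runs 1, 2, ...;
# multiplying by r$values resets the non-trial runs to 0
trial_number <- rep(cumsum(r$values) * r$values, r$lengths)
print(trial_number)
```

Neighbouring runs of 0 and 1 collapse into a single non-trial run here, which is harmless because they all receive the number 0.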

