Sequence Length Encoding Using R

EDIT: added a check so that character vectors are handled as well.

Based on rle, I came up with the following solution:

somefunction <- function(x){

    if(!is.numeric(x)) x <- as.numeric(x)
    n <- length(x)
    y <- x[-1L] != x[-n] + 1L
    i <- c(which(y|is.na(y)),n)

    list(
      lengths = diff(c(0L,i)),
      values = x[head(c(0L,i)+1L,-1L)]
    )

}

> s <- c(2,4,1:4, rep(5, 4), 6:9,4,4,4)

> somefunction(s)
$lengths
[1] 1 1 5 1 1 5 1 1 1

$values
[1] 2 4 1 5 5 5 4 4 4

This works on every test case I tried and uses vectorized operations instead of ifelse clauses, so it should run faster than an element-by-element approach. Strings are converted to NA, so the output stays numeric.

> S <- c(4,2,1:5,5, "other" , "other",4:6,2)

> somefunction(S)
$lengths
[1] 1 1 5 1 1 1 3 1

$values
[1]  4  2  1  5 NA NA  4  2

Warning message:
In somefunction(S) : NAs introduced by coercion
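
One way to sanity-check the encoding is to expand it again. The sketch below uses a hypothetical helper, seq_decode (not part of the original answer), and assumes every run increases by exactly 1, as in the encoder above. Applied to the output of the first example, it should reproduce s.

```r
# Hypothetical helper (not from the original answer): expand each
# (value, length) pair into value, value + 1, ..., value + length - 1.
seq_decode <- function(enc) {
  unlist(Map(function(v, len) v + seq_len(len) - 1, enc$values, enc$lengths))
}

# Encoding copied from the first example's output
enc <- list(lengths = c(1, 1, 5, 1, 1, 5, 1, 1, 1),
            values  = c(2, 4, 1, 5, 5, 5, 4, 4, 4))

decoded <- seq_decode(enc)
s <- c(2, 4, 1:4, rep(5, 4), 6:9, 4, 4, 4)
all.equal(decoded, s)   # TRUE
```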

Run length encoding of sequences

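juxt can be used in the pack function. (juxt count first) builds a function that returns [count value] for each run produced by partition-by, and unpack expands each pair again with repeat: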

(defn pack [xs]
  (map (juxt count first) (partition-by identity xs)))

(defn unpack [xs]
  (mapcat #(apply repeat %) xs))
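
For comparison with the R answers in this thread, a rough base R analogue of pack and unpack is sketched below. It relies on rle() and inverse.rle() from base R; the vector xs is made up for illustration.

```r
xs <- c("a", "a", "b", "c", "c", "c")   # made-up input

packed   <- rle(xs)               # run lengths and run values
unpacked <- inverse.rle(packed)   # expands the runs back into a vector

packed$lengths            # 2 1 3
packed$values             # "a" "b" "c"
identical(unpacked, xs)   # TRUE
```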

Random sampling using run-length encoding (rle)

Here's a function to do it. The run lengths probably need to be large for this to be worthwhile compared with expanding the rle explicitly (for example with inverse.rle) and sampling from the result.

x <- rle(c(1,1,1,1,1,2,2))

sample_rle <- function(x, ...) {
  x$values[1+findInterval(
    sample(sum(x$lengths), ...),
    cumsum(x$lengths), 
    left.open=TRUE)]
}

sample_rle(x, size = 2, replace = FALSE)
#> [1] 2 1
sample_rle(x, size = 7, replace = FALSE)
#> [1] 2 1 2 1 1 1 1
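
As a quick check that the index arithmetic matches plain sampling from the expanded vector, the sketch below draws one set of positions and looks them up both ways. The seed and the sample size are arbitrary choices for illustration.

```r
x <- rle(c(1, 1, 1, 1, 1, 2, 2))
expanded <- inverse.rle(x)          # the explicit vector: 1 1 1 1 1 2 2

set.seed(42)                        # arbitrary seed, only for reproducibility
idx <- sample(sum(x$lengths), size = 4)

# Lookup through the run boundaries, as in sample_rle
via_rle <- x$values[1 + findInterval(idx, cumsum(x$lengths), left.open = TRUE)]
# Lookup in the expanded vector
via_expanded <- expanded[idx]

identical(via_rle, via_expanded)    # TRUE
```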

Run-length encoding and group by

We can use rleid in data.table

dat[, .(date = as.Date(time)[1], n = .N), .(bike_id, grp = rleid(address))][, grp := NULL][]

If a group contains more than one 'date' (as in the second dataset), the code above keeps only the first one ([1]). To keep every 'date', one option is

dat[, .(date = unique(as.Date(time)), n = .N), keyby = .(bike_id, grp = rleid(lon, lat))]
#   bike_id grp       date n
#1:       1   1 2017-11-22 3
#2:       1   2 2017-11-22 3
#3:       1   2 2017-11-21 3

However, this returns multiple rows for a group. If only a single row per group is needed, either create a list column (which preserves the Date class)

dat[, .(date = list(unique(as.Date(time))), n = .N), keyby = .(bike_id, grp = rleid(lon, lat))]
#   bike_id grp                  date n
#1:       1   1            2017-11-22 3
#2:       1   2 2017-11-22,2017-11-21 3

Or paste the unique elements together
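
A minimal sketch of the paste option, assuming toString() as the pasting function. The dat object is not shown in this thread, so the rows below are made up for illustration. Note that the resulting 'date' column is character, not Date.

```r
library(data.table)

# Made-up data: the real 'dat' is not shown in the thread.
dat <- data.table(
  bike_id = 1L,
  time = as.POSIXct(c("2017-11-22 10:00:00", "2017-11-22 09:00:00",
                      "2017-11-22 08:00:00", "2017-11-22 07:00:00",
                      "2017-11-22 06:00:00", "2017-11-21 23:00:00"),
                    tz = "UTC"),
  lon = c(1, 1, 1, 2, 2, 2),
  lat = c(1, 1, 1, 2, 2, 2)
)

# One row per run of identical (lon, lat); unique dates collapsed into a string
res <- dat[, .(date = toString(unique(as.Date(time))), n = .N),
           by = .(bike_id, grp = rleid(lon, lat))]

res$date   # "2017-11-22" "2017-11-22, 2017-11-21"
res$n      # 3 3
```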

Update

Based on the OP's updated expected output (for the second dataset), 'date' must also be used as a grouping variable

dat[, .(n = .N), keyby = .(bike_id, date = as.Date(time), grp = rleid(lon, lat))][, grp := NULL][]
#   bike_id       date n
#1:       1 2017-11-21 1
#2:       1 2017-11-22 3
#3:       1 2017-11-22 2

Count average number of increasing consecutive integers in R with rle

There is probably a nicer way, but...

aggregate(data$moment,list(data$id), function(x) mean(rle(diffinv(diff(x)!=1))$lengths))
#   Group.1        x
# 1       1 1.428571
# 2       2 2.500000

Explanation

We first take the differences (diff(x)) and flag the positions where neighbouring numbers are not consecutive (diff(x)!=1). We then apply the inverse of the difference (diffinv), which is a cumulative sum starting at 0, to get back to the original length. This gives a vector that increments at every non-consecutive number, so each run of consecutive integers shares one value. Take rle of that, extract the lengths, and finally apply mean.
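
The steps can be traced on a small vector. The values of x below are made up (they stand in for data$moment within a single id), and as.numeric() is applied before diffinv() only to make the input type explicit.

```r
x <- c(1, 2, 3, 5, 6, 9)                 # made-up values for one id

d      <- diff(x)                        # 1 1 2 1 3
breaks <- d != 1                         # FALSE FALSE TRUE FALSE TRUE
grp    <- diffinv(as.numeric(breaks))    # 0 0 0 1 1 2 (same length as x)
runs   <- rle(grp)$lengths               # 3 2 1
avg    <- mean(runs)                     # 2
```

Here the runs are 1-2-3, 5-6 and 9, so the average run length is (3 + 2 + 1) / 3 = 2.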

Edit1: Removed a step that was unnecessary.

Finding length of flats of a sequence in R

We can use sequence, which is equivalent to unlist(lapply(yourvector, seq_len)): it loops (lapply) through the values of the vector, gets the sequence (seq_len) for each value, and unlists the result.

 sequence(runs$lengths)-1
 #[1] 0 0 0 1 2 0 0 1 0

Subtracting 1 makes each run start at 0, which gives the desired output.
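
A small check of that equivalence on the data used in this answer (v1 is repeated here so that the snippet runs on its own):

```r
v1 <- c(1, 2, 3, 3, 3, 4, 5, 5, 1)
runs <- rle(v1)                     # lengths: 1 1 3 1 2 1

a <- sequence(runs$lengths) - 1
b <- unlist(lapply(runs$lengths, seq_len)) - 1

a                  # 0 0 0 1 2 0 0 1 0
identical(a, b)    # TRUE
```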

Another option is rleid from data.table. It was introduced in the devel version v1.9.5 and has been part of the CRAN releases since v1.9.6.

 library(data.table)#v1.9.5+
 setDT(list(v1))[, seq_along(V1)-1,rleid(V1)]$V1
 #[1] 0 0 0 1 2 0 0 1 0

We convert 'v1' to a data.table, group by rleid(V1), take the sequence along 'V1' within each group, and subtract 1 from it.
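
For readers without data.table, the same grouping can be sketched in base R. This is an illustration, not how rleid is implemented: the run id is built with cumsum() over the change points, and ave() numbers the elements within each run.

```r
v1 <- c(1, 2, 3, 3, 3, 4, 5, 5, 1)

# A new run starts at position 1 and wherever the value changes
run_id <- cumsum(c(TRUE, v1[-1] != v1[-length(v1)]))   # 1 2 3 3 3 4 5 5 6

# Position within each run, shifted to start at 0
pos <- ave(seq_along(v1), run_id, FUN = seq_along) - 1
pos   # 0 0 0 1 2 0 0 1 0
```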

data

 v1 <- c(1,2,3,3,3,4,5,5,1)
 runs <- rle(v1)

Calculate lengths of sequences of repeating numbers in a vector in R

Using dplyr and data.table's rleid function.

library(dplyr)

tibble(marker) %>%
  #Drop rows before first 1
  filter(row_number() >= match(1, marker)) %>%
  #Count samples in each group
  add_count(grp = data.table::rleid(marker), name = 'n_samples') %>%
  #Create trial number
  mutate(trial_number = with(rle(!marker %in% c(1, 0)), 
                            rep(cumsum(values) * values, lengths))) %>%
  select(-grp)

This returns -

#   marker n_samples trial_number
#1       1         2            0
#2       1         2            0
#3       2         4            1
#4       2         4            1
#5       2         4            1
#6       2         4            1
#7       0         2            0
#8       0         2            0
#9       1         3            0
#10      1         3            0
#11      1         3            0
#12      3         3            2
#13      3         3            2
#14      3         3            2
#15      1         2            0
#16      1         2            0
#17      2         3            3
#18      2         3            3
#19      2         3            3
#20      0         2            0
#21      0         2            0
#22      1         3            0
#23      1         3            0
#24      1         3            0
#25      5         4            4
#26      5         4            4
#27      5         4            4
#28      5         4            4
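
The trial_number expression can be followed on a shorter vector. The marker values below are made up; the logic is the same as in the pipeline: rle() marks the runs that are neither 1 nor 0, cumsum() numbers those runs, and multiplying by values sets the remaining runs to 0.

```r
marker <- c(1, 1, 2, 2, 0, 1, 3, 3, 3)   # made-up input

r <- rle(!marker %in% c(1, 0))
r$values    # FALSE TRUE FALSE TRUE
r$lengths   # 2 2 2 3

# cumsum(values) is 0 1 1 2; multiplying by values gives 0 1 0 2
trial_number <- rep(cumsum(r$values) * r$values, r$lengths)
trial_number   # 0 0 1 1 0 0 2 2 2
```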
