Sequence length encoding using R
EDIT: added a check so that character vectors are handled as well.
Based on rle, I came up with the following solution:
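somefunction <- function(x){
    # Coerce non-numeric input; strings that are not numbers become NA
    if(!is.numeric(x)) x <- as.numeric(x)
    n <- length(x)
    # TRUE where an element is not its predecessor plus one (a run ends there)
    y <- x[-1L] != x[-n] + 1L
    # End positions of all runs; an NA comparison also ends a run
    i <- c(which(y|is.na(y)),n)
    list(
        lengths = diff(c(0L,i)),
        values = x[head(c(0L,i)+1L,-1L)]
    )
}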
> s <- c(2,4,1:4, rep(5, 4), 6:9,4,4,4)
> somefunction(s)
$lengths
[1] 1 1 5 1 1 5 1 1 1
$values
[1] 2 4 1 5 5 5 4 4 4
This works on every test case I tried and is fully vectorized, with no ifelse clauses, so it should run faster. Strings that are not numbers are converted to NA (with a warning), so the output stays numeric.
> S <- c(4,2,1:5,5, "other" , "other",4:6,2)
> somefunction(S)
$lengths
[1] 1 1 5 1 1 1 3 1
$values
[1] 4 2 1 5 NA NA 4 2
Warning message:
In somefunction(S) : NAs introduced by coercion
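One way to read the output for s: each entry in values starts a run that increases by one for the corresponding number of elements in lengths. A minimal sketch of the inverse, assuming no NA values; decode_seq is a hypothetical helper name that is not part of the answer, and the encoding is copied from the output shown above.

```r
# Hypothetical helper (not from the answer): rebuild the vector from the encoding
decode_seq <- function(enc) {
  # Each run starts at its value and increases by 1 for `lengths` elements
  unlist(Map(function(v, n) v + seq_len(n) - 1, enc$values, enc$lengths))
}

# Encoding copied from the somefunction(s) output above
enc <- list(lengths = c(1, 1, 5, 1, 1, 5, 1, 1, 1),
            values  = c(2, 4, 1, 5, 5, 5, 4, 4, 4))
s <- c(2, 4, 1:4, rep(5, 4), 6:9, 4, 4, 4)

decoded <- decode_seq(enc)
isTRUE(all.equal(decoded, s))  # the round trip recovers s
```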
Run length encoding of sequences
juxt can be used in the pack function:
(defn pack [xs]
  (map (juxt count first) (partition-by identity xs)))
(defn unpack [xs]
  (mapcat #(apply repeat %) xs))
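For comparison with the R answers collected here, the same pack/unpack pair can be sketched in base R on top of rle. This is an illustration rather than a translation of the Clojure code; the names pack and unpack and the list-of-pairs layout are assumptions chosen to mirror it.

```r
# pack: one (count, value) pair per run of identical elements
pack <- function(xs) {
  r <- rle(xs)
  Map(function(n, v) list(count = n, value = v), r$lengths, r$values)
}

# unpack: repeat each value `count` times and concatenate
unpack <- function(pairs) {
  unlist(lapply(pairs, function(p) rep(p$value, p$count)))
}

xs <- c("a", "a", "b", "c", "c", "c")
packed <- pack(xs)
restored <- unpack(packed)
identical(restored, xs)
```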
Random sampling using running length encoding (rle)
Here's a function to do it. The run lengths probably need to be large for this to be worthwhile compared with simply expanding the rle explicitly.
x <- rle(c(1,1,1,1,1,2,2))
sample_rle <- function(x, ...) {
  x$values[1+findInterval(
    sample(sum(x$lengths), ...),
    cumsum(x$lengths),
    left.open=TRUE)]
}
sample_rle(x, size = 2, replace = FALSE)
#> [1] 2 1
sample_rle(x, size = 7, replace = FALSE)
#> [1] 2 1 2 1 1 1 1
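To see why this works, it can help to map every position of the expanded vector to its run, without sampling. The sketch below reuses the x from the answer; ends, pos and run_index are illustrative names. It checks that the findInterval lookup agrees with the explicit expansion from inverse.rle.

```r
x <- rle(c(1, 1, 1, 1, 1, 2, 2))

# Last position covered by each run in the expanded vector
ends <- cumsum(x$lengths)

# Every position of the expanded vector
pos <- seq_len(sum(x$lengths))

# With left.open = TRUE a position equal to a run's end still belongs to that run
run_index <- 1 + findInterval(pos, ends, left.open = TRUE)

# The lookup reproduces the explicit expansion
expanded <- inverse.rle(x)
identical(x$values[run_index], expanded)
```

Sampling positions and then looking them up, as sample_rle does, is therefore equivalent to sampling from the expanded vector, but it never materializes that vector.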
Run-length encoding and group by
We can use rleid from data.table:
dat[, .(date = as.Date(time)[1], n = .N), .(bike_id, grp = rleid(address))][, grp := NULL][]
If there are multiple 'date' values within a group (as in the second dataset), the code above keeps only the first one ([1]). If we want all of the dates, we can use unique:
dat[, .(date = unique(as.Date(time)), n = .N),, .(bike_id, grp = rleid(lon, lat))]
# bike_id grp date n
#1: 1 1 2017-11-22 3
#2: 1 2 2017-11-22 3
#3: 1 2 2017-11-21 3
But this still gives multiple rows per group. If we need only a single row per group, we can either create a list column (which preserves the class)
dat[, .(date = list(unique(as.Date(time))), n = .N),, .(bike_id, grp = rleid(lon, lat))]
# bike_id grp date n
#1: 1 1 2017-11-22 3
#2: 1 2 2017-11-22,2017-11-21 3
Or paste the unique elements together.
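The difference between the list and paste variants can be sketched in base R, outside data.table. The timestamps below are hypothetical stand-ins for the 'time' values of one group, not the OP's data.

```r
# Hypothetical timestamps for a single group (not the OP's data)
time <- as.POSIXct(c("2017-11-22 10:00:00",
                     "2017-11-22 11:00:00",
                     "2017-11-21 09:00:00"), tz = "UTC")

dates <- unique(as.Date(time, tz = "UTC"))

# list variant: one element per group, Date class preserved
as_list <- list(dates)
class(as_list[[1]])

# paste variant: a single string per group, Date class lost
as_string <- paste(dates, collapse = ",")
class(as_string)
```

The list column keeps real Date objects for later computation, while the pasted string is easier to print or export.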
Update
Based on the expected output in the OP's updated post (for the second dataset), we also need to use 'date' as a grouping variable
dat[, .(n = .N),, .(bike_id, date = as.Date(time), grp = rleid(lon, lat))][, grp := NULL][]
# bike_id date n
#1: 1 2017-11-21 1
#2: 1 2017-11-22 3
#3: 1 2017-11-22 2
Count average number of increasing consecutive integers in R with rle
There is probably a nicer way, but...
aggregate(data$moment,list(data$id), function(x) mean(rle(diffinv(diff(x)!=1))$lengths))
# Group.1 x
# 1 1 1.428571
# 2 2 2.500000
Explanation
We first take the differences between consecutive elements and flag those that are not equal to 1 (diff(x)!=1), which marks the non-consecutive numbers. We then take the inverse of the difference (diffinv) to get back to the original length. This gives a vector that increments at each non-consecutive number, so every run of consecutive integers shares one value. Take rle of that, extract the lengths, and finally apply mean, and you're done.
Edit1: Removed a step that was unnecessary.
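The steps in the explanation can be traced on a small vector. The x below is made-up illustration data, not the OP's, and as.numeric is added here only to make the coercion of the logical vector explicit.

```r
# Made-up example: runs of consecutive integers are 1:3, 7:8 and 10
x <- c(1, 2, 3, 7, 8, 10)

# TRUE where the step to the next element is not exactly 1
breaks <- diff(x) != 1

# Cumulative count of breaks, padded with a leading 0 to the original length
run_id <- diffinv(as.numeric(breaks))

# Lengths of the runs of consecutive integers
run_lengths <- rle(run_id)$lengths

# Average run length
avg <- mean(run_lengths)
avg
```

Here run_id is 0 0 0 1 1 2, so the run lengths are 3, 2 and 1, and their mean is 2.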
Finding length of flats of a sequence in R
We can use sequence, which is a wrapper for unlist(lapply(yourvector, seq_len)). It loops (lapply) through the values of the vector, gets the sequence (seq_len) for each, and unlists the result.
sequence(runs$lengths)-1
#[1] 0 0 0 1 2 0 0 1 0
We subtract 1 so that each run starts at 0, which is the desired output.
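As a quick check of that description, the sketch below compares sequence with the explicit lapply form on the run lengths of the example vector. lens and by_hand are illustrative names.

```r
# Run lengths of the example vector
lens <- rle(c(1, 2, 3, 3, 3, 4, 5, 5, 1))$lengths

# Explicit form: one seq_len per run, then flattened
by_hand <- unlist(lapply(lens, seq_len))

# Same result as sequence()
same <- all(sequence(lens) == by_hand)
same

# Zero-based position within each run
by_hand - 1
```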
Another option is rleid from data.table (introduced in the devel version, v1.9.5):
library(data.table)#v1.9.5+
setDT(list(v1))[, seq_along(V1)-1,rleid(V1)]$V1
#[1] 0 0 0 1 2 0 0 1 0
We convert 'v1' to a data.table, group by rleid(V1), take the sequence along 'V1' within each group, and subtract 1 from it.
data
v1 <- c(1,2,3,3,3,4,5,5,1)
runs <- rle(v1)
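For readers without data.table, a run id for a single vector can be sketched in base R from rle. rleid_base is a hypothetical helper name, not part of any package, and the sketch assumes the vector has no missing values, since rle treats each NA as its own run.

```r
# Hypothetical base R stand-in for a run id on one vector (assumes no NAs)
rleid_base <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

v1 <- c(1, 2, 3, 3, 3, 4, 5, 5, 1)
ids <- rleid_base(v1)

# Zero-based position within each run, grouped by the run id
pos <- ave(seq_along(v1), ids, FUN = seq_along) - 1
pos
```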
Calculate lengths of sequences of repeating numbers in a vector in R
Using dplyr and data.table's rleid function.
library(dplyr)
tibble(marker) %>%
  # Drop rows before the first 1
  filter(row_number() >= match(1, marker)) %>%
  # Count samples in each run of identical marker values
  add_count(grp = data.table::rleid(marker), name = 'n_samples') %>%
  # Create trial number (0 for runs of 0 or 1)
  mutate(trial_number = with(rle(!marker %in% c(1, 0)),
                             rep(cumsum(values) * values, lengths))) %>%
  select(-grp)
This returns:
# marker n_samples trial_number
#1 1 2 0
#2 1 2 0
#3 2 4 1
#4 2 4 1
#5 2 4 1
#6 2 4 1
#7 0 2 0
#8 0 2 0
#9 1 3 0
#10 1 3 0
#11 1 3 0
#12 3 3 2
#13 3 3 2
#14 3 3 2
#15 1 2 0
#16 1 2 0
#17 2 3 3
#18 2 3 3
#19 2 3 3
#20 0 2 0
#21 0 2 0
#22 1 3 0
#23 1 3 0
#24 1 3 0
#25 5 4 4
#26 5 4 4
#27 5 4 4
#28 5 4 4
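The two computed columns can also be traced in base R. The sketch below uses the first 14 marker values from the output above and is only an illustration of the rle logic, not a replacement for the pipeline. is_trial and r are illustrative names.

```r
# First 14 marker values from the output above
marker <- c(1, 1, 2, 2, 2, 2, 0, 0, 1, 1, 1, 3, 3, 3)

# n_samples: length of the run each element belongs to
runs <- rle(marker)
n_samples <- rep(runs$lengths, runs$lengths)

# trial_number: number the runs whose marker is neither 0 nor 1, and use 0 elsewhere
is_trial <- !marker %in% c(1, 0)
r <- rle(is_trial)
trial_number <- rep(cumsum(r$values) * r$values, r$lengths)

data.frame(marker, n_samples, trial_number)
```

Multiplying cumsum(r$values) by r$values zeroes out the non-trial runs while the cumulative sum keeps counting across them.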