Is There a Dplyr Equivalent to Data.Table::Rleid

Is there a dplyr equivalent to data.table::rleid?

You can just do (when you have both data.table and dplyr loaded):

DT <- DT %>% mutate(rlid = rleid(grp))

this gives:

> DT
grp value rlid
1: A 1 1
2: A 2 1
3: B 3 2
4: B 4 2
5: C 5 3
6: C 6 3
7: C 7 3
8: A 8 4
9: B 9 5
10: B 10 5

When you don't want to load data.table separately you can also use (as mentioned by @DavidArenburg in the comments):

DT <- DT %>% mutate(rlid = data.table::rleid(grp))

And as @RichardScriven said in his comment you can just copy/steal it:

myrleid <- data.table::rleid

Is there a R function to sign a value when it is equal to its 5 previous values?

data.table::rleid is very useful for things like this:

## sample data
d = data.frame(x = c(1, 2,2, 3,3,3,3,3,3,3,3, 4,4,4 2,2,2,2,2,2,2))

library(dplyr)
d %>%
group_by(rleid = data.table::rleid(x)) %>%
mutate(bad_flag = row_number() > 5)
# # A tibble: 18 x 3
# # Groups: rleid [5]
# x rleid bad_flag
# <dbl> <int> <lgl>
# 1 1 1 FALSE
# 2 2 2 FALSE
# 3 2 2 FALSE
# 4 3 3 FALSE
# 5 3 3 FALSE
# 6 3 3 FALSE
# 7 3 3 FALSE
# 8 3 3 FALSE
# 9 3 3 TRUE
# 10 3 3 TRUE
# 11 4 4 FALSE
# 12 4 4 FALSE
# 13 2 5 FALSE
# 14 2 5 FALSE
# 15 2 5 FALSE
# 16 2 5 FALSE
# 17 2 5 FALSE
# 18 2 5 TRUE

This does assume that your measurements are not floating point numbers (integers are fine). If they are floating points, you may need to round them to be safe.

Group observations chronologically and by group R / data.table

data.table::rleid(): Consecutive runs of identical values belong to the same group

dt[, tracker := rleid(track), by = student]

student year track tracker
1: 1 2001 Highschool 1
2: 1 2002 Highschool 1
3: 1 2003 Highschool 1
4: 1 2004 Vocational 2
5: 1 2005 Vocational 2
6: 1 2006 Uni 3
7: 1 2007 Vocational 4
8: 1 2008 Vocational 4
9: 2 2001 Vocational 1
10: 2 2002 Vocational 1
11: 2 2003 Highschool 2
12: 2 2004 Highschool 2
13: 2 2005 Highschool 2
14: 2 2006 Highschool 2
15: 2 2007 Vocational 3
16: 2 2008 Vocational 3

Without rleid() just for fun:

dt[, tracker := cumsum(shift(track, fill = track[1]) != track) + 1L, by = student]

What is the equivalent of mutate_at (dplyr) in data.table?

With data.table, we can specify the columns of interest in .SDcols, loop through the .SD with lapply and apply the function of interest. Here, the funcion rollapply is repeated with only change in width parameter. So, it may be better to create a function to avoid repeating the whole arguments. Also, while applying the function (f1), the output can be kept in a list, later unlist with recursive = FALSE and assign (:=) to columns of interest

library(data.table)
library(zoo)
nm1 <- c("B", "C")
nm2 <- paste0(nm1, "_Roll.Mean.Week")
nm3 <- paste0(nm1, "_Roll.Mean.Two.Week")
f1 <- function(x, width) rollapply(x, width = width, mean,
align = "right", fill = 0, na.rm = TRUE, partial = TRUE)
setDT(Data)[, c(nm2, nm3) := unlist(lapply(.SD, function(x)
list(f1(x, 7), f1(x, 14))), recursive = FALSE), by = A, .SDcols = nm1]
head(Data)
# A B C B_Roll.Mean.Week C_Roll.Mean.Week B_Roll.Mean.Two.Week C_Roll.Mean.Two.Week
#1: 1 1 101 1 1 101 101
#2: 2 2 102 2 2 102 102
#3: 1 3 103 2 2 102 102
#4: 2 4 104 3 3 103 103
#5: 1 5 105 3 3 103 103
#6: 2 6 106 4 4 104 104

Note that funs is deprecated in tidyverse and in its place, can use list(~ or just ~

Data %>% 
group_by(A) %>%
mutate_at(vars(B,C), list(Roll.Mean.Week = ~f1(., 7),
Roll.Mean.Two.Week = ~ f1(., 14)))%>%
ungroup()

data.table equivalent of dplyr::filter_at

We can specify the 'cols' in .SDcols, loop through the Subset of Data.table (.SD) to compare whether the value is "Passed", Reduce it to a single vector with | and subset the rows

res2 <- DT[DT[,  Reduce(`|`, lapply(.SD, `==`, "Passed")), .SDcols = cols]]

Comparing with the dplyr output in the OP's post

identical(as.data.table(res1), res2)
#[1] TRUE

Is there a dplyr or data.table equivalent to plyr::join_all? Joining by a list of data frames?

Combining @SimonOHanlon's data.table method with @Jaap's Reduce and merge techniques appears to yield the most performant results:

library(data.table)
setDT(df)
count_x_dt <- function(dt) dt[, list(count_x = .N), keyby = x]
sum_y_dt <- function(dt) dt[, list(sum_y = sum(y)), keyby = x]
mean_y_dt <- function(dt) dt[, list(mean_y = mean(y)), keyby = x]

Reduce(function(...) merge(..., all = TRUE, by = c("x")),
list(count_x_dt(df), sum_y_dt(df), mean_y_dt(df)))

Updating to include a tidyverse / purrr (purrr::reduce) approach:

library(tidyverse)
list(count_x(df), sum_y(df), mean_y(df)) %>%
reduce(left_join)

Dplyr or data.table consolidate consecutive rows within grouped data based on value in another column

Here's an option, using data.table::rleid to make an id for runs of the same ID and CLASS:

# make START and END Date class for easier manipulation
df <- df %>% mutate(START = as.Date(START, '%d-%b-%y'),
END = as.Date(END, '%d-%b-%y'))
# More concise alternative:
# df <- df %>% mutate_each(funs(as.Date(., '%d-%b-%y')), START, END)

# group and make rleid as mentioned above
df %>% group_by(ID, CLASS, rleid = data.table::rleid(ID, CLASS)) %>%
# collapse with summarise, replacing START and END with their min and max for each group
summarise(START = min(START), END = max(END)) %>%
# clean up arrangement and get rid of added rleid column
ungroup() %>% arrange(rleid) %>% select(-rleid)

# Source: local data frame [7 x 4]
#
# ID CLASS START END
# (int) (fctr) (date) (date)
# 1 100 GA 2015-01-03 2015-02-01
# 2 100 G 2015-02-01 2015-02-22
# 3 100 GA 2015-02-28 2015-03-17
# 4 100 G 2015-04-01 2015-04-18
# 5 200 FA 2014-01-03 2014-02-22
# 6 200 G 2014-02-28 2014-03-15
# 7 200 F 2014-04-01 2014-04-20

Here's the pure data.table analogue:

library(data.table)
setDT(df)
datecols = c("START","END")
df[, (datecols) := lapply(.SD, as.IDate, format = '%d-%b-%y'), .SDcols = datecols]

df[, .(START = START[1L], END = END[.N]), by=.(ID, CLASS, r = rleid(ID, CLASS))][, r := NULL][]

Column that tracks when the value in another column changes

using rle

with(rle(df$Score >= 19), rep(seq_along(values), lengths))
#[1] 1 1 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 4 4

Extract the first and last observation of each group, every time that group appears within a large dataset?

Here is a data.table approach. As mentioned by @Henrik in the comments, you can use rleid to create a new column to group by, instead of using station as values for station get repeated. rleid. Then, for each group, it will include the first and last .N values. Note that unique is added to consider situations when only one row of data may be present for a given group. I hope this may be a fast solution for you.

library(data.table)

setDT(df)

df[ , id := rleid(station)][ , .SD[unique(c(1, .N))], by = id]

Output

   id tagID       date station temp depth
1: 1 8272 2020-07-12 4 10 6.14
2: 1 8272 2020-07-13 4 11 21
3: 2 8272 2020-07-13 5 12 23.5
4: 2 8272 2020-07-16 5 10 15.4
5: 3 8272 2020-07-17 6 12 54
6: 3 8272 2020-07-29 6 12 23
7: 4 8272 2020-07-30 4 12 33.3
8: 4 8272 2020-08-04 4 9 32.7


Related Topics



Leave a reply



Submit