Is there a dplyr equivalent to data.table::rleid?
You can just do (when you have both data.table and dplyr loaded):
DT <- DT %>% mutate(rlid = rleid(grp))
this gives:
> DT
grp value rlid
1: A 1 1
2: A 2 1
3: B 3 2
4: B 4 2
5: C 5 3
6: C 6 3
7: C 7 3
8: A 8 4
9: B 9 5
10: B 10 5
When you don't want to load data.table separately you can also use (as mentioned by @DavidArenburg in the comments):
DT <- DT %>% mutate(rlid = data.table::rleid(grp))
And as @RichardScriven said in his comment you can just copy/steal it:
myrleid <- data.table::rleid
Is there a R function to sign a value when it is equal to its 5 previous values?
data.table::rleid
is very useful for things like this:
## sample data
d = data.frame(x = c(1, 2,2, 3,3,3,3,3,3,3,3, 4,4,4 2,2,2,2,2,2,2))
library(dplyr)
d %>%
group_by(rleid = data.table::rleid(x)) %>%
mutate(bad_flag = row_number() > 5)
# # A tibble: 18 x 3
# # Groups: rleid [5]
# x rleid bad_flag
# <dbl> <int> <lgl>
# 1 1 1 FALSE
# 2 2 2 FALSE
# 3 2 2 FALSE
# 4 3 3 FALSE
# 5 3 3 FALSE
# 6 3 3 FALSE
# 7 3 3 FALSE
# 8 3 3 FALSE
# 9 3 3 TRUE
# 10 3 3 TRUE
# 11 4 4 FALSE
# 12 4 4 FALSE
# 13 2 5 FALSE
# 14 2 5 FALSE
# 15 2 5 FALSE
# 16 2 5 FALSE
# 17 2 5 FALSE
# 18 2 5 TRUE
This does assume that your measurements are not floating point numbers (integers are fine). If they are floating points, you may need to round them to be safe.
Group observations chronologically and by group R / data.table
data.table::rleid()
: Consecutive runs of identical values belong to the same group
dt[, tracker := rleid(track), by = student]
student year track tracker
1: 1 2001 Highschool 1
2: 1 2002 Highschool 1
3: 1 2003 Highschool 1
4: 1 2004 Vocational 2
5: 1 2005 Vocational 2
6: 1 2006 Uni 3
7: 1 2007 Vocational 4
8: 1 2008 Vocational 4
9: 2 2001 Vocational 1
10: 2 2002 Vocational 1
11: 2 2003 Highschool 2
12: 2 2004 Highschool 2
13: 2 2005 Highschool 2
14: 2 2006 Highschool 2
15: 2 2007 Vocational 3
16: 2 2008 Vocational 3
Without rleid()
just for fun:
dt[, tracker := cumsum(shift(track, fill = track[1]) != track) + 1L, by = student]
What is the equivalent of mutate_at (dplyr) in data.table?
With data.table
, we can specify the columns of interest in .SDcols
, loop through the .SD
with lapply
and apply the function of interest. Here, the funcion rollapply
is repeated with only change in width
parameter. So, it may be better to create a function to avoid repeating the whole arguments. Also, while applying the function (f1
), the output can be kept in a list
, later unlist
with recursive = FALSE
and assign (:=
) to columns of interest
library(data.table)
library(zoo)
nm1 <- c("B", "C")
nm2 <- paste0(nm1, "_Roll.Mean.Week")
nm3 <- paste0(nm1, "_Roll.Mean.Two.Week")
f1 <- function(x, width) rollapply(x, width = width, mean,
align = "right", fill = 0, na.rm = TRUE, partial = TRUE)
setDT(Data)[, c(nm2, nm3) := unlist(lapply(.SD, function(x)
list(f1(x, 7), f1(x, 14))), recursive = FALSE), by = A, .SDcols = nm1]
head(Data)
# A B C B_Roll.Mean.Week C_Roll.Mean.Week B_Roll.Mean.Two.Week C_Roll.Mean.Two.Week
#1: 1 1 101 1 1 101 101
#2: 2 2 102 2 2 102 102
#3: 1 3 103 2 2 102 102
#4: 2 4 104 3 3 103 103
#5: 1 5 105 3 3 103 103
#6: 2 6 106 4 4 104 104
Note that funs
is deprecated in tidyverse
and in its place, can use list(~
or just ~
Data %>%
group_by(A) %>%
mutate_at(vars(B,C), list(Roll.Mean.Week = ~f1(., 7),
Roll.Mean.Two.Week = ~ f1(., 14)))%>%
ungroup()
data.table equivalent of dplyr::filter_at
We can specify the 'cols' in .SDcols
, loop through the Subset of Data.table (.SD
) to compare whether the value is "Passed", Reduce
it to a single vector
with |
and subset the rows
res2 <- DT[DT[, Reduce(`|`, lapply(.SD, `==`, "Passed")), .SDcols = cols]]
Comparing with the dplyr
output in the OP's post
identical(as.data.table(res1), res2)
#[1] TRUE
Is there a dplyr or data.table equivalent to plyr::join_all? Joining by a list of data frames?
Combining @SimonOHanlon's data.table
method with @Jaap's Reduce
and merge
techniques appears to yield the most performant results:
library(data.table)
setDT(df)
count_x_dt <- function(dt) dt[, list(count_x = .N), keyby = x]
sum_y_dt <- function(dt) dt[, list(sum_y = sum(y)), keyby = x]
mean_y_dt <- function(dt) dt[, list(mean_y = mean(y)), keyby = x]
Reduce(function(...) merge(..., all = TRUE, by = c("x")),
list(count_x_dt(df), sum_y_dt(df), mean_y_dt(df)))
Updating to include a tidyverse
/ purrr
(purrr::reduce
) approach:
library(tidyverse)
list(count_x(df), sum_y(df), mean_y(df)) %>%
reduce(left_join)
Dplyr or data.table consolidate consecutive rows within grouped data based on value in another column
Here's an option, using data.table::rleid
to make an id for runs of the same ID
and CLASS
:
# make START and END Date class for easier manipulation
df <- df %>% mutate(START = as.Date(START, '%d-%b-%y'),
END = as.Date(END, '%d-%b-%y'))
# More concise alternative:
# df <- df %>% mutate_each(funs(as.Date(., '%d-%b-%y')), START, END)
# group and make rleid as mentioned above
df %>% group_by(ID, CLASS, rleid = data.table::rleid(ID, CLASS)) %>%
# collapse with summarise, replacing START and END with their min and max for each group
summarise(START = min(START), END = max(END)) %>%
# clean up arrangement and get rid of added rleid column
ungroup() %>% arrange(rleid) %>% select(-rleid)
# Source: local data frame [7 x 4]
#
# ID CLASS START END
# (int) (fctr) (date) (date)
# 1 100 GA 2015-01-03 2015-02-01
# 2 100 G 2015-02-01 2015-02-22
# 3 100 GA 2015-02-28 2015-03-17
# 4 100 G 2015-04-01 2015-04-18
# 5 200 FA 2014-01-03 2014-02-22
# 6 200 G 2014-02-28 2014-03-15
# 7 200 F 2014-04-01 2014-04-20
Here's the pure data.table analogue:
library(data.table)
setDT(df)
datecols = c("START","END")
df[, (datecols) := lapply(.SD, as.IDate, format = '%d-%b-%y'), .SDcols = datecols]
df[, .(START = START[1L], END = END[.N]), by=.(ID, CLASS, r = rleid(ID, CLASS))][, r := NULL][]
Column that tracks when the value in another column changes
using rle
with(rle(df$Score >= 19), rep(seq_along(values), lengths))
#[1] 1 1 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 4 4
Extract the first and last observation of each group, every time that group appears within a large dataset?
Here is a data.table
approach. As mentioned by @Henrik in the comments, you can use rleid
to create a new column to group by, instead of using station
as values for station
get repeated. rleid
. Then, for each group, it will include the first and last .N
values. Note that unique
is added to consider situations when only one row of data may be present for a given group. I hope this may be a fast solution for you.
library(data.table)
setDT(df)
df[ , id := rleid(station)][ , .SD[unique(c(1, .N))], by = id]
Output
id tagID date station temp depth
1: 1 8272 2020-07-12 4 10 6.14
2: 1 8272 2020-07-13 4 11 21
3: 2 8272 2020-07-13 5 12 23.5
4: 2 8272 2020-07-16 5 10 15.4
5: 3 8272 2020-07-17 6 12 54
6: 3 8272 2020-07-29 6 12 23
7: 4 8272 2020-07-30 4 12 33.3
8: 4 8272 2020-08-04 4 9 32.7
Related Topics
Change the Class from Factor to Numeric of Many Columns in a Data Frame
Convert Dataframe Column to 1 or 0 for "True"/"False" Values and Assign to Dataframe
Loop Through Data Frame and Variable Names
Using Ifelse Statement on the Whole Dataset Instead of a Single Column
Remove Unwanted Symbols from Expression Function - R
R - Getting Characters After Symbol
Calculate Row Means on Subset of Columns
Duplicate Columns in Spark Dataframe
Combine Two Lists in a Dataframe in R
Easier Way to Use Grepl and Ifelse Across Multiple Columns
Create Counter Within Consecutive Runs of Values
R: Error in Usemethod("Group_By_"):Applied to an Object of Class
Calculate the Area Under a Curve
How to Add a Diagonal Line to a Plot
Gsub a Every Element After a Keyword in R
Delete Rows That Exist in Another Data Frame