dplyr::slice in data.table
We can use .I to extract the row indices, which should be faster:
out <- df[df[, .I[seq_len(10)], by = b]$V1]
dim(out)
#[1] 5000 2
Checking if there are NAs (as the OP commented)
any(out[, Reduce(`|`, lapply(.SD, is.na))])
#[1] FALSE
dim(df)
#[1] 374337 2
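One caveat with `.I[seq_len(10)]`: for a group with fewer rows than requested, the out-of-range positions come back as NA, which is presumably what the NA check guards against. A minimal sketch that avoids this with `head()`, using a made-up toy table (the data and the group size of 2 are assumptions, not the OP's data):

```r
library(data.table)

# Hypothetical toy data: group "a" has 3 rows, group "b" has only 1
df <- data.table(b = c("a", "a", "a", "b"), x = 1:4)

# .I holds the row numbers of df within each group;
# head() returns at most 2 of them and never pads with NA
idx <- df[, head(.I, 2L), by = b]$V1
out <- df[idx]
out
```

Here `out` keeps two rows for group "a" and the single row for group "b", with no NA rows.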
Benchmarks
f3 <- function(df) {
  df[df[, .I[seq_len(10)], by = b]$V1]
}
microbenchmark(f1(df), f2(df), f3(df), unit = "relative", times = 10L)
#Unit: relative
# expr min lq mean median uq max neval cld
# f1(df) 5.727822 5.480741 4.945486 5.672206 4.317531 5.10003 10 b
# f2(df) 24.572633 23.774534 17.842622 23.070634 16.099822 11.58287 10 c
# f3(df) 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 10 a
Slice a list for a new column in a data.table in R
One possible solution with transpose:
dt[, transpose(stringr::str_split(a,"\t"))]
V1 V2 V3
<char> <char> <char>
1: feature1 item1 item2
2: feature2 item3 item4
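If you would rather stay within data.table, `tstrsplit()` does the split and the transpose in one step. A minimal sketch, assuming an input column `a` of tab-separated strings shaped like the output above (the input data is reconstructed for illustration):

```r
library(data.table)

# Hypothetical input: one tab-separated string per row
dt <- data.table(a = c("feature1\titem1\titem2", "feature2\titem3\titem4"))

# tstrsplit() splits each string and transposes the result into columns;
# fixed = TRUE treats "\t" as a literal separator rather than a regex
res <- dt[, tstrsplit(a, "\t", fixed = TRUE)]
res
```

The resulting columns get the default names V1, V2, V3, as in the transpose version.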
dplyr on data.table, am I really using data.table?
There is no straightforward/simple answer because the philosophies of both these packages differ in certain aspects. So some compromises are unavoidable. Here are some of the concerns you may need to address/consider.
Operations involving i (== filter() and slice() in dplyr)
Assume DT has, say, 10 columns. Consider these data.table expressions:
DT[a > 1, .N] ## --- (1)
DT[a > 1, mean(b), by=.(c, d)] ## --- (2)
(1) gives the number of rows in DT where column a > 1. (2) returns mean(b) grouped by c,d for the same expression in i as (1).
Commonly used dplyr expressions would be:
DT %>% filter(a > 1) %>% summarise(n()) ## --- (3)
DT %>% filter(a > 1) %>% group_by(c, d) %>% summarise(mean(b)) ## --- (4)
Clearly, the data.table code is shorter. It is also more memory efficient (1). Why? Because in both (3) and (4), filter() returns rows for all 10 columns first, when in (3) we just need the number of rows, and in (4) we just need columns b, c, d for the successive operations. To overcome this, we have to select() columns a priori:
DT %>% select(a) %>% filter(a > 1) %>% summarise(n()) ## --- (5)
DT %>% select(a,b,c,d) %>% filter(a > 1) %>% group_by(c,d) %>% summarise(mean(b)) ## --- (6)
It is essential to highlight a major philosophical difference between the two packages:
In data.table, we like to keep these related operations together, and that allows it to look at the j-expression (from the same function call) and realise there's no need for any columns in (1). The expression in i gets computed, and .N is just the sum of that logical vector, which gives the number of rows; the entire subset is never realised. In (2), just columns b, c, d are materialised in the subset; other columns are ignored.
But in dplyr, the philosophy is to have a function do precisely one thing well. There is (at least currently) no way to tell if the operation after filter() needs all those columns we filtered. You'll need to think ahead if you want to perform such tasks efficiently. I personally find it counter-intuitive in this case.
Note that in (5) and (6), we still subset column a, which we don't require. But I'm not sure how to avoid that. If the filter() function had an argument to select the columns to return, we could avoid this issue, but then the function would not do just one task (which is also a dplyr design choice).
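To make the comparison concrete, here is a small runnable sketch of expressions (1), (2), (5) and (6). The toy table and the column names `mean_b` and `n` are made up for illustration, and the dplyr pipelines run on a plain data.frame copy so the example does not depend on how dplyr treats data.table objects:

```r
library(data.table)
library(dplyr)

# Hypothetical toy table; column e stands in for the columns we never need
DT <- data.table(a = c(0, 2, 3, 5), b = c(10, 20, 30, 40),
                 c = c("x", "x", "y", "y"), d = c("p", "p", "p", "q"),
                 e = letters[1:4])

# data.table: i, j and by live in one call
n_dt <- DT[a > 1, .N]                                  # (1)
m_dt <- DT[a > 1, .(mean_b = mean(b)), by = .(c, d)]   # (2)

# dplyr: select the needed columns up front, then filter
DF <- as.data.frame(DT)
n_dp <- DF %>% select(a) %>% filter(a > 1) %>% summarise(n = n())   # (5)
m_dp <- DF %>% select(a, b, c, d) %>% filter(a > 1) %>%
  group_by(c, d) %>%
  summarise(mean_b = mean(b), .groups = "drop")                      # (6)
```

Both routes give the same numbers; the difference discussed above is only in what gets materialised along the way.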
Sub-assign by reference
dplyr will never update by reference. This is another huge (philosophical) difference between the two packages.
For example, in data.table you can do:
DT[a %in% some_vals, a := NA]
which updates column a by reference on just those rows that satisfy the condition. At the moment dplyr deep copies the entire data.table internally to add a new column. @BrodieG already mentioned this in his answer.
But the deep copy can be replaced by a shallow copy when FR #617 is implemented. Also relevant: dplyr: FR#614. Note that, even then, the column you modify will always be copied (and will therefore be a tad slower / less memory efficient). There will be no way to update columns by reference.
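A quick way to see what "by reference" means in practice is to compare a plain alias with an explicit `copy()`. This is a minimal sketch on a made-up table; `some_vals`, `alias` and `snapshot` are illustrative names:

```r
library(data.table)

DT <- data.table(a = c(1L, 2L, 3L, 4L), b = letters[1:4])
some_vals <- c(2L, 3L)          # hypothetical values to blank out

alias    <- DT                  # plain assignment: no copy, same object
snapshot <- copy(DT)            # explicit deep copy

DT[a %in% some_vals, a := NA]   # sub-assign by reference

alias$a      # reflects the update, because it is the same table
snapshot$a   # unchanged, because it was copied beforehand
```

If you need the old values later, take the `copy()` before using `:=`.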
Other functionalities
In data.table, you can aggregate while joining, which is more straightforward to understand and is memory efficient, since the intermediate join result is never materialised. Check this post for an example. You can't (at the moment?) do that using dplyr's data.table/data.frame syntax.
data.table's rolling joins feature is not supported in dplyr's syntax either.
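For readers who have not seen these two features, here is a minimal sketch of each. All tables and column names below are made up for illustration:

```r
library(data.table)

# --- Aggregate while joining ---
sales  <- data.table(id = c(1L, 1L, 2L, 3L), amount = c(10, 20, 5, 7))
lookup <- data.table(id = c(1L, 2L))

# For each row of `lookup`, sum the matching rows of `sales`;
# by = .EACHI evaluates j once per row of i
agg <- sales[lookup, on = "id", .(total = sum(amount)), by = .EACHI]

# --- Rolling join ---
prices <- data.table(time = c(1, 5, 10), price = c(100, 105, 110))
trades <- data.table(time = c(4, 10, 12))

# roll = TRUE carries the last observed price forward to each trade time
rolled <- prices[trades, on = "time", roll = TRUE]
```

In `agg`, id 1 gets the sum of its two sales and id 2 its single sale; in `rolled`, each trade picks up the most recent price at or before its time.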
We recently implemented overlap joins in data.table to join over interval ranges (here's an example), which is a separate function foverlaps() at the moment, and therefore could be used with the pipe operators (magrittr / pipeR? - never tried it myself).
But ultimately, our goal is to integrate it into [.data.table so that we can harvest the other features like grouping, aggregating while joining etc., which will have the same limitations outlined above.
Since 1.9.4, data.table implements automatic indexing using secondary keys for fast binary search based subsets on regular R syntax. Ex: DT[x == 1] and DT[x %in% some_vals] will automatically create an index on the first run, which will then be used on successive subsets from the same column to fast subset using binary search. This feature will continue to evolve. Check this gist for a short overview of this feature.
From the way filter() is implemented for data.tables, it doesn't take advantage of this feature.
A dplyr feature is that it also provides an interface to databases using the same syntax, which data.table doesn't at the moment.
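A minimal sketch of the overlap join and of the indexing feature mentioned above. The interval tables are made up, and whether an index is actually created on the first subset depends on your data.table version and options, so treat the last two lines as something to try rather than a guaranteed result:

```r
library(data.table)

# --- Overlap join with foverlaps() ---
x <- data.table(start = c(1L, 8L), end = c(3L, 9L), id = c("x1", "x2"))
y <- data.table(start = c(2L, 5L), end = c(4L, 7L), grp = c("y1", "y2"))
setkey(y, start, end)   # foverlaps() requires y to be keyed on its interval columns

# type = "any": keep pairs whose intervals overlap at all;
# nomatch = 0L drops rows of x without an overlap
ov <- foverlaps(x, y, type = "any", nomatch = 0L)

# --- Secondary indices ---
DT  <- data.table(x = c(3L, 1L, 2L, 1L), v = 1:4)
sub <- DT[x == 1L]   # a subset of this form may create an index on x
indices(DT)          # lists whatever secondary indices exist on DT
```

Only x1 ([1, 3]) overlaps an interval in `y` (y1, [2, 4]), so `ov` has a single row.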
So, you will have to weigh these (and probably other) points and decide based on whether these trade-offs are acceptable to you.
HTH
(1) Note that being memory efficient directly impacts speed (especially as data gets larger), as the bottleneck in most cases is moving the data from main memory onto cache (and making use of data in cache as much as possible - reduce cache misses - so as to reduce accessing main memory). Not going into details here.
R: Slicing a grouped data frame conditional on a column
You can use filter() to keep all cases where the condition is zero or the index equals the group's minimum index.
library(dplyr)
df %>%
  group_by(group) %>%
  filter(condition == 0 | index == min(index))
# A tibble: 5 x 3
# Groups: group [3]
group condition index
<chr> <dbl> <int>
1 A 0 1
2 A 0 2
3 A 0 3
4 B 1 1
5 C 1 2
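Since the input is not shown, here is a self-contained sketch with a made-up `df` that is consistent with the output above (the dropped rows are my assumption about what the original data looked like):

```r
library(dplyr)

# Hypothetical input: group A has condition 0 throughout,
# groups B and C have condition 1 and two rows each
df <- tibble(group     = c("A", "A", "A", "B", "B", "C", "C"),
             condition = c(0, 0, 0, 1, 1, 1, 1),
             index     = c(1L, 2L, 3L, 1L, 2L, 2L, 3L))

res <- df %>%
  group_by(group) %>%
  # min(index) is computed per group because of group_by()
  filter(condition == 0 | index == min(index)) %>%
  ungroup()
res
```

Group A keeps all three rows, while B and C each keep only the row with their smallest index.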
Slice based on multiple date ranges and multiple columns to formating a new dataframe with R
You may use dplyr::case_when:
library(dplyr)
df %>%
  mutate(type = case_when(
           date >= '2021-10-12' & date <= '2021-10-15' ~ 1,
           date >= '2021-10-16' & date <= '2021-10-18' ~ 2,
           date >= '2021-10-21' & date <= '2021-10-23' ~ 3,
           TRUE ~ NA_real_
         ),
         value = case_when(
           type == 1 ~ value1,
           type == 2 ~ value2,
           type == 3 ~ value3,
           TRUE ~ NA_real_
         )) %>%
  select(date, value, type) %>%
  filter(!is.na(type))
date value type
1 2021-10-12 1.015000 1
2 2021-10-13 NA 1
3 2021-10-14 NA 1
4 2021-10-15 1.015000 1
5 2021-10-16 1.072135 2
6 2021-10-17 1.061520 2
7 2021-10-18 1.051010 2
8 2021-10-21 1.160541 3
9 2021-10-22 1.177949 3
10 2021-10-23 1.195618 3
Group data table and apply cut function to columns by reference
For the given sample dataset with 1 grouping column and 3 value columns to be transformed, the data.table equivalent of OP's dplyr code simply is
library(data.table)
mycut <- \(x) cut(x, unique(quantile(x, probs = seq(0, 1, 0.025))), include.lowest = TRUE)
cutme <- setDT(cutme)[, lapply(.SD, mycut), .SDcols = colstoCut, by = Date]
cutme
Date val1 val2 val3
<Date> <fctr> <fctr> <fctr>
1: 2022-01-01 (1.9,2] (305.4,306] (278.09,278.12]
2: 2022-01-01 [1,1.1] [291,291.9] (275.12,275.21]
3: 2022-01-01 [1,1.1] (305.4,306] (277.58,277.84]
4: 2022-01-01 [1,1.1] (299.1,300] (274.14,274.38]
5: 2022-01-01 (1.9,2] (305.4,306] [271.98,272.22]
6: 2022-01-02 [0,0.1] (294.6,295] (314.5,314.7]
7: 2022-01-02 [0,0.1] (298.9,299] (322.4,322.6]
8: 2022-01-02 (0.9,1] [291,291.4] (312.4,312.6]
9: 2022-01-02 [0,0.1] (301.7,302] (320.7,321.3]
10: 2022-01-02 [0,0.1] (297.7,298] [310.7,310.9]
11: 2022-01-03 [0,0.1] (300.9,301] [294.8,295.9]
12: 2022-01-03 (0.9,1] (299.7,300] (304.8,305.9]
13: 2022-01-03 (0.9,1] [291,291.6] (316.6,317.1]
14: 2022-01-03 (1.9,2] (300.9,301] (319.2,319.4]
15: 2022-01-03 (0.9,1] (296.4,297] (311.7,312.3]
16: 2022-01-04 (0.9,1] [290,290.3] [309.3,309.39]
17: 2022-01-04 (1.9,2] (293.9,294] (313.59,313.97]
18: 2022-01-04 (0.9,1] (297.6,298] (317.02,317.36]
19: 2022-01-04 [0,0.1] (292.7,293] (310.12,310.21]
20: 2022-01-04 [0,0.1] (293.9,294] (319.97,320.26]
21: 2022-01-05 (0.9,1] (309.5,310] (310.2,310.4]
22: 2022-01-05 (0.9,1] (304.4,305] (296.4,296.6]
23: 2022-01-05 [0,0.1] [293,293.6] [294.6,294.8]
24: 2022-01-05 [0,0.1] (320.8,322] (305,305.9]
25: 2022-01-05 (0.9,1] (298.4,299] (308.1,308.3]
Date val1 val2 val3
dplyr: How to slice row1 of group1, row2 of group2, row3 of group3, ...rowN of groupN
df %>%
  group_by(group) %>%
  slice(cur_group_id())
# # A tibble: 5 x 2
# # Groups: group [5]
# group value
# <dbl> <dbl>
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
# 5 5 5
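Note that cur_group_id() is the position of the group among the groups, not the group's label, so this picks row 1 of the first group, row 2 of the second, and so on, whatever the labels are. A minimal sketch with made-up labels 10, 20, 30 to show the difference:

```r
library(dplyr)

# Hypothetical data: three groups labelled 10, 20, 30, three rows each
df <- tibble(group = rep(c(10, 20, 30), each = 3),
             value = rep(1:3, times = 3))

res <- df %>%
  group_by(group) %>%
  slice(cur_group_id()) %>%   # group ids are 1, 2, 3, regardless of the labels
  ungroup()
res
```

A group with fewer rows than its id simply contributes no row, since slice() ignores positions beyond the group size.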
dplyr group_by() and slice() within group
Using slice with group_by
library(dplyr)
x %>%
  group_by(name) %>%
  slice(if (!all(is.na(colour))) row_number() else 1) %>%
  ungroup()
-output
# A tibble: 4 x 2
# name colour
# <chr> <chr>
#1 alice <NA>
#2 bob green
#3 mary orange
#4 mary orange
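For a runnable version, here is the same pipeline on a made-up `x` that is consistent with the output above (the two all-NA rows for alice are my assumption about the original input):

```r
library(dplyr)

# Hypothetical input: alice has only missing colours
x <- tibble(name   = c("alice", "alice", "bob", "mary", "mary"),
            colour = c(NA, NA, "green", "orange", "orange"))

res <- x %>%
  group_by(name) %>%
  # keep every row when the group has at least one non-NA colour,
  # otherwise keep just the first row
  slice(if (!all(is.na(colour))) row_number() else 1) %>%
  ungroup()
res
```

The all-NA group collapses to one row, while the other groups are returned unchanged.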
Slicing data by dynamic variable names in dplyr
Here is one way using group_by_at, which takes a string as input, and filter_at
library(dplyr)
rawData %>%
  filter(complete.cases(theValue)) %>%
  group_by_at(theID) %>%
  distinct(theValue) %>%
  filter_at(vars(theID), any_vars(. == 10001))
# A tibble: 1 x 2
# Groups: userID [1]
# theValue userID
# <chr> <dbl>
#1 foo 10001
Or by converting to a symbol (sym) and evaluating (!!)
rawData %>%
  filter(complete.cases(theValue)) %>%
  group_by(!! rlang::sym(theID)) %>%
  distinct(theValue) %>%
  filter(!! rlang::sym(theID) == 10001)
# A tibble: 1 x 2
# Groups: userID [1]
# theValue userID
# <chr> <dbl>
#1 foo 10001
The issue in the OP's code is trying to apply tidyverse methods outside the tidyverse environment, i.e. in base R.
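With dplyr 1.0 and later, the scoped verbs (the _at variants) are superseded, and the same thing can be written with across(all_of()) for grouping and the .data pronoun for filtering. A minimal sketch, assuming `theID` holds the column name as a string and `theValue` is an actual column (the input data is made up):

```r
library(dplyr)

# Hypothetical input modelled on the output above
rawData <- tibble(userID   = c(10001, 10001, 10002, 10003),
                  theValue = c("foo", "foo", "bar", NA))
theID <- "userID"   # column name stored as a string

res <- rawData %>%
  filter(!is.na(theValue)) %>%
  group_by(across(all_of(theID))) %>%   # group by the column named in theID
  distinct(theValue) %>%
  filter(.data[[theID]] == 10001) %>%   # look the column up by name
  ungroup()
res
```

This leaves a single row for user 10001 with the value "foo".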
More efficient way of using group_by mutate slice
summarize makes more sense to me than mutate and slice. This should save you some time.
library(dplyr)
result <- df %>%
  group_by(Month, ID) %>%
  summarize(across(.cols = Qty:Leads, ~sum(.x, na.rm = TRUE)),
            Region = first(Region))
result
# # A tibble: 4 x 6
# # Groups: Month [3]
# Month ID Qty Sales Leads Region
# <chr> <dbl> <dbl> <dbl> <dbl> <chr>
# 1 April 11 230 2100 22 East
# 2 June 11 260 2450 15 North
# 3 May 10 110 1000 8 East
# 4 May 12 110 900 9 North
Here is a data.table solution.
library(data.table)
setDT(df)
cols <- c("Qty", "Sales", "Leads")
df[, c(lapply(.SD, sum, na.rm = TRUE),
       Region = first(Region)),
   .SDcols = cols,
   by = .(Month, ID)][]
# Month ID Qty Sales Leads Region
# 1: April 11 230 2100 22 East
# 2: May 12 110 900 9 North
# 3: May 10 110 1000 8 East
# 4: June 11 260 2450 15 North