data.table Alternative for dplyr case_when

data.table alternative for dplyr case_when

FYI, here is a more recent answer for those coming across this post after 2019. data.table version 1.13.0 and above has the fcase function, which can be used here. Note that it is not a drop-in replacement for dplyr::case_when, as the syntax is different, but it is a "native" data.table way of doing the calculation.

# Lazy evaluation
x = 1:10
data.table::fcase(
  x < 5L, 1L,
  x >= 5L, 3L,
  x == 5L, stop("provided value is an unexpected one!")
)
# [1] 1 1 1 1 3 3 3 3 3 3

dplyr::case_when(
  x < 5L ~ 1L,
  x >= 5L ~ 3L,
  x == 5L ~ stop("provided value is an unexpected one!")
)
# Error in eval_tidy(pair$rhs, env = default_env) :
#   provided value is an unexpected one!
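
# (Sketch, not from the original post: fcase drops straight into a data.table
# assignment; DT and the grp column are illustrative names.)
DT = data.table::data.table(x = x)
DT[, grp := data.table::fcase(x < 5L, 1L,
                              x >= 5L, 3L)]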

# Benchmark
x = sample(1:100, 3e7, replace = TRUE) # 114 MB
microbenchmark::microbenchmark(
  dplyr::case_when(
    x < 10L ~ 0L,
    x < 20L ~ 10L,
    x < 30L ~ 20L,
    x < 40L ~ 30L,
    x < 50L ~ 40L,
    x < 60L ~ 50L,
    x > 60L ~ 60L
  ),
  data.table::fcase(
    x < 10L, 0L,
    x < 20L, 10L,
    x < 30L, 20L,
    x < 40L, 30L,
    x < 50L, 40L,
    x < 60L, 50L,
    x > 60L, 60L
  ),
  times = 5L,
  unit = "s")
# Unit: seconds
#                expr   min    lq  mean median    uq   max neval
#    dplyr::case_when 11.57 11.71 12.22  11.82 12.00 14.02     5
#   data.table::fcase  1.49  1.55  1.67   1.71  1.73  1.86     5

Source: data.table NEWS for 1.13.0, released 24 Jul 2020.

data.table alternative for slow group_by() and case_when() function

Here's how you can make your existing dplyr code more efficient:

library(dplyr)

lookup = data.frame(First_Order_contains_x = c(TRUE, FALSE),
                    Customer_Type = c("Customer with X in first order",
                                      "Customer without x in first order"))

df %>%
  group_by(id) %>%
  # flag whether the customer's first order contained x
  mutate(First_Order_contains_x = any(date == min(date) & indicator == 1)) %>%
  ungroup() %>%
  left_join(lookup, by = "First_Order_contains_x")

# A tibble: 3,000 x 5
   id    date       indicator First_Order_contains_x Customer_Type
   <fct> <date>         <dbl> <lgl>                  <fct>
 1 5056  2018-03-10         1 TRUE                   Customer with X in first order
 2 5291  2018-12-28         0 FALSE                  Customer without x in first order
 3 5173  2018-04-19         0 FALSE                  Customer without x in first order
 4 5159  2018-11-13         0 TRUE                   Customer with X in first order
 5 5252  2018-05-30         0 TRUE                   Customer with X in first order
 6 5200  2018-01-20         0 FALSE                  Customer without x in first order
 7 4578  2018-12-18         1 FALSE                  Customer without x in first order
 8 5308  2018-03-24         1 FALSE                  Customer without x in first order
 9 5234  2018-05-29         1 TRUE                   Customer with X in first order
10 5760  2018-06-12         1 TRUE                   Customer with X in first order
# … with 2,990 more rows
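
If you'd rather do the whole thing in data.table (which is what the question title asks for), here is a sketch of the same logic; it assumes the same df and column names as above, and data.table >= 1.13.0 for fcase:

library(data.table)

dt <- as.data.table(df)
dt[, First_Order_contains_x := any(date == min(date) & indicator == 1), by = id]
dt[, Customer_Type := fcase(
  First_Order_contains_x,  "Customer with X in first order",
  !First_Order_contains_x, "Customer without x in first order"
)]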

data.table fcase versus dplyr case_when

You don't explicitly state a question, but I assume you'd like to know whether there's a way to use a vectorized default value with fcase(). One way to do that is to supply a vector of TRUEs, of the same length as your other conditions, as the final condition, similar to how case_when()'s TRUE ~ ... default works:

library(data.table)
library(dplyr)

set.seed(123)

tbl.test <- data.table(x = rnorm(1e6))

bench::mark(
  fcase     = tbl.test[, .(fcase(x < 1, x + 1, rep_len(TRUE, length(x)), x))],
  case_when = tbl.test[, .(case_when(x < 1 ~ x + 1, TRUE ~ x))]
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 fcase        16.5ms   24.3ms     33.3     36.1MB     43.1
#> 2 case_when   146.4ms  147.9ms      6.60   129.9MB     28.1
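
As a usage note, the same trick drops straight into a column assignment; a sketch assuming the tbl.test object above (inside j, .N is the row count, so rep_len(TRUE, .N) plays the role of case_when's TRUE ~ default):

tbl.test[, y := fcase(x < 1, x + 1,
                      rep_len(TRUE, .N), x)]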

R: any perfect alternative to case_when() when detecting strings with multiple conditions and replacing them?

You may use case_when with grepl and a regex alternation:

library(dplyr)

statement$col <- case_when(
  grepl("(addiction|mental|Diabetes|health|healthy|Oranga|unwell|AOD| well| surgery|dental|recovery|oranga|Mirimiri|asthma|anger|checks|alcohol|pregnant|clinical|clinic)", statement$stmt) ~ "APC",
  grepl("(whanau direct|whānau direct|money|transport|home|repairs|social|budget|job|housing|house|financial|finance|Ohanga|furniture|accommodation|welfare|living|work|babies arrival|AT hop card|Entitlements|ohunga|bills|electricity|water|employment)", statement$stmt) ~ "PDP",
  grepl("(Kaupapa|Te reo|language|Tikanga|Iwi|relationship|Tikinga|Reunite)", statement$stmt) ~ "APGA",
  grepl("(Studying|training|NCEA|ECE|Counseling|counsel|Knowledge|School|Education|matauranga|parenting|skills)", statement$stmt) ~ "APP",
  grepl("(self-management|Rangitiratanga|custody|police|court|CYFS|advocacy|Oranga Tamariki|rangatiratanga|section 101|EPOA|Familly issues)", statement$stmt) ~ "rangatiratanga",
  TRUE ~ NA_character_
)
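
If the keyword lists grow, you can keep them in plain character vectors and build each alternation with paste(collapse = "|") rather than hand-writing the long patterns. A sketch with trimmed keyword lists, assuming the same statement$stmt column and labels as above:

library(dplyr)

apc_terms <- c("addiction", "mental", "Diabetes", "health", "dental", "asthma")
pdp_terms <- c("money", "transport", "housing", "welfare", "employment")

statement$col <- case_when(
  grepl(paste(apc_terms, collapse = "|"), statement$stmt) ~ "APC",
  grepl(paste(pdp_terms, collapse = "|"), statement$stmt) ~ "PDP",
  TRUE ~ NA_character_
)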

Using case_when and between with a correspondence threshold table

cut in base R can do this relatively easily:

bigdataset <- data.frame(seq(1, 25000, 1))
names(bigdataset) <- 'id'
thresholds <- data.frame(
  c(50, 1500, 8900, 10000, 12000, 13000, 14000, 15000, 16000, 25000),
  c('grp1', 'grp2', 'grp3', 'grp4', 'grp5', 'grp6', 'grp7', 'grp8', 'grp9', 'grp10'))
names(thresholds) <- c('last_id', 'group_name')

cut(bigdataset$id,
    breaks = c(min(bigdataset$id), thresholds$last_id + 1),
    labels = thresholds$group_name[1:10],
    right = FALSE) -> bigdataset$group_name

Output:

> bigdataset
id group_name
1 1 grp1
2 2 grp1
3 3 grp1
4 4 grp1
5 5 grp1
6 6 grp1
7 7 grp1
8 8 grp1
9 9 grp1
10 10 grp1
11 11 grp1
12 12 grp1
13 13 grp1
14 14 grp1
15 15 grp1
16 16 grp1
17 17 grp1
18 18 grp1
19 19 grp1
20 20 grp1
21 21 grp1
22 22 grp1
23 23 grp1
24 24 grp1
25 25 grp1
26 26 grp1
27 27 grp1
28 28 grp1
29 29 grp1
30 30 grp1
31 31 grp1
32 32 grp1
33 33 grp1
34 34 grp1
35 35 grp1
36 36 grp1
37 37 grp1
38 38 grp1
39 39 grp1
40 40 grp1
41 41 grp1
42 42 grp1
43 43 grp1
44 44 grp1
45 45 grp1
46 46 grp1
47 47 grp1
48 48 grp1
49 49 grp1
50 50 grp2
51 51 grp2
52 52 grp2
53 53 grp2
54 54 grp2
55 55 grp2
56 56 grp2
57 57 grp2
58 58 grp2
59 59 grp2
60 60 grp2

Note that you need to pad your thresholds with min(bigdataset$id); that way you have 11 cutpoints for 10 classes.
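
A data.table rolling join can produce the same mapping without computing breaks at all; this is a sketch assuming the bigdataset and thresholds objects above, and it treats each last_id as the last id that belongs to its group:

library(data.table)

dt <- data.table(id = bigdataset$id)
th <- as.data.table(thresholds)

# roll = -Inf matches each id to the next last_id at or above it,
# so ids up to 50 map to grp1, 51-1500 to grp2, and so on
res <- th[dt, on = .(last_id = id), roll = -Inf]
setnames(res, "last_id", "id")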


