Efficiently LOCF by Groups in a Single R data.table


A very simple na.locf can be built by carrying forward (via cummax) the non-NA indices, (!is.na(x)) * seq_along(x), and subsetting accordingly:

x = c(1, NA, NA, 6, 4, 5, 4, NA, NA, 2)
x[cummax((!is.na(x)) * seq_along(x))]
# [1] 1 1 1 6 4 5 4 4 4 2
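
To see why this works, look at the intermediate vectors: the multiplication zeroes out the NA positions, and cummax() then carries the last non-NA index forward:

(!is.na(x)) * seq_along(x)
# [1]  1  0  0  4  5  6  7  0  0 10
cummax((!is.na(x)) * seq_along(x))
# [1]  1  1  1  4  5  6  7  7  7 10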

This replicates na.locf with the na.rm = TRUE argument; to get na.rm = FALSE behavior we simply need to make sure the first element passed to cummax is TRUE:

x = c(NA, NA, 1, NA, 2)
x[cummax(c(TRUE, tail((!is.na(x)) * seq_along(x), -1)))]
# [1] NA NA  1  1  2

In this case, we need to take into account not only the non-NA indices but also the indices where the (ordered, or to-be-ordered) "id" column changes value:

id = c(10, 10, 11, 11, 11, 12, 12, 12, 13, 13)
c(TRUE, id[-1] != id[-length(id)])
# [1] TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE

Combining the above:

id = c(10, 10, 11, 11, 11, 12, 12, 12, 13, 13)
x = c(1, NA, NA, 6, 4, 5, 4, NA, NA, 2)

x[cummax(((!is.na(x)) | c(TRUE, id[-1] != id[-length(id)])) * seq_along(x))]
# [1] 1 1 NA 6 4 5 4 4 NA 2

Note that here we OR the first element with TRUE, i.e. force it to TRUE, thus getting the na.rm = FALSE behavior.
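
Wrapped into a small reusable helper (the function name locf_by_group is ours, not from the original):

# group-aware LOCF using the cummax trick above
locf_by_group <- function(x, id) {
  new_group <- c(TRUE, id[-1] != id[-length(id)])  # TRUE wherever a new id starts
  x[cummax(((!is.na(x)) | new_group) * seq_along(x))]
}

locf_by_group(x  = c(1, NA, NA, 6, 4, 5, 4, NA, NA, 2),
              id = c(10, 10, 11, 11, 11, 12, 12, 12, 13, 13))
# [1]  1  1 NA  6  4  5  4  4 NA  2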

And for this example, the same trick can be applied to every column of a data.table DT at once.
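
DT itself is not shown in the original; the construction below is an assumption, reconstructed to be consistent with the output:

library(data.table)

DT <- data.table(
  id = rep(1:3, each = 4),
  aa = c("A", NA, "B", "C", NA, NA, "D", "E", "F", NA, NA, NA),
  bb = NA,
  cc = c(1, NA, NA, NA, NA, 4, NA, 5, 6, NA, 7, NA)
)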

id_change = DT[, c(TRUE, id[-1] != id[-.N])]
DT[, lapply(.SD, function(x) x[cummax(((!is.na(x)) | id_change) * .I)])]
#     id aa bb cc
#  1:  1  A NA  1
#  2:  1  A NA  1
#  3:  1  B NA  1
#  4:  1  C NA  1
#  5:  2 NA NA NA
#  6:  2 NA NA  4
#  7:  2  D NA  4
#  8:  2  E NA  5
#  9:  3  F NA  6
# 10:  3  F NA  6
# 11:  3  F NA  7
# 12:  3  F NA  7

Last observation carried forward by group over multiple columns

library(data.table)
library(zoo)

cols = grep("^K|^L|^M", names(diagnosis), value = TRUE)

diagnosis[, (cols) := na.locf(.SD, na.rm = FALSE), by = patient, .SDcols = cols]
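
A minimal sketch of what this does, on a hypothetical diagnosis table (the column names K50, L20, M10 are invented here to match the ^K|^L|^M pattern):

# hypothetical data: per-patient visits with sparsely recorded codes
diagnosis <- data.table(
  patient = c(1, 1, 1, 2, 2),
  K50     = c(NA, 1, NA, NA, NA),
  L20     = c(0, NA, NA, 1, NA),
  M10     = c(NA, NA, 1, NA, NA)
)

# running the two lines above then gives:
diagnosis
#    patient K50 L20 M10
# 1:       1  NA   0  NA
# 2:       1   1   0  NA
# 3:       1   1   0   1
# 4:       2  NA   1  NA
# 5:       2  NA   1  NA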

Also take a look at "efficiently locf by groups in a single R data.table" above.

Efficiently fill out (locf/nocb) values of data.table column, then aggregate by another column

In my limited testing this is faster than either of your options (by the way, use CJ instead of data.table(expand.grid(...))), and it doesn't use much memory:

dat[dat, on = .(day >= day), mean(val[!duplicated(custid)]), by = .EACHI]

This assumes the data is sorted by day, as in the OP's example.
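
A minimal sketch with made-up data (the columns day, custid, val come from the answer; the values are invented). For each row's day, the non-equi join picks every row with a day at or after it, and !duplicated(custid) keeps each customer's first such row before averaging:

library(data.table)

dat <- data.table(day    = c(1, 1, 2, 3),
                  custid = c(1, 2, 1, 2),
                  val    = c(10, 20, 30, 60))

dat[dat, on = .(day >= day), mean(val[!duplicated(custid)]), by = .EACHI]
#    day V1
# 1:   1 15
# 2:   1 15
# 3:   2 45
# 4:   3 60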

Efficiently fill NAs by group

This is the code I used to compare the three approaches: your code vs. akrun's vs. mine. zoo is not always the fastest, but it is the cleanest. In any case, you can test it yourself.

UPDATE:
It has been tested with more data (100,000 rows) and Process 03 (subset and merge) wins by far.

Last UPDATE: function comparison with rbenchmark:

library(dplyr)
library(tidyr)
library(data.table)
library(zoo)
library(rbenchmark)

# data.frame of 10,000 individuals with 10 observations each
data <- data.frame(group = rep(1:10000, each = 10), value = NA)
# the first 500 individuals get a value at their fifth observation; the rest stay NA
data$value[seq(5, 5000, 10)] <- rnorm(500)

# Process 01: dplyr + tidyr::fill
P01 <- function (data){
  data01 <- data %>%
    group_by(group) %>%                # by group
    fill(value) %>%                    # default direction: down
    fill(value, .direction = "up")     # also fill NAs upwards
  return(data01)
}

# Process 02: data.table + zoo::na.locf
P02 <- function (data){
  data02 <- setDT(data)[, value := na.locf(na.locf(value, na.rm = FALSE),
                                           fromLast = TRUE), by = group]
  return(data02)
}

# Process 03: subset the observed values, then merge back by group
P03 <- function (data){
  dataU <- subset(unique(data), !is.na(value))           # keep the observed value per group
  dataM <- merge(data, dataU, by = "group", all = TRUE)  # merge tables
  data03 <- data.frame(group = dataM$group, value = dataM$value.y) # same shape as data
  return(data03)
}

benchmark("P01_dplyr"      = {data01 <- P01(data)},
          "P02_zoo"        = {data02 <- P02(data)},
          "P03_data.table" = {data03 <- P03(data)},
          replications = 10,
          columns = c("test", "replications", "elapsed"))

Results with 10,000 groups (100,000 rows), 10 replications, on an Intel i5-7400:

            test replications elapsed
1      P01_dplyr           10  257.78
2        P02_zoo           10   10.35
3 P03_data.table           10    0.09

How to efficiently sample from a data.table by column in R?

You can use sample() on .N within each group to select one random row per group.

library(data.table)
set.seed(123)
dt[, .SD[sample(.N, 1)], by = A]

#    A   B   C
# 1: A  31 143
# 2: D  16 175
# 3: B 100 165
# 4: E  27 190
# 5: C  90 197
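
If materializing .SD per group is slow on large tables, a common index-based variant (not from the original answer) samples a row number per group with .I and subsets once:

# sample one row index per group, then subset the whole table in one go
dt[dt[, .I[sample(.N, 1)], by = A]$V1]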

dplyr has the slice_sample() function (previously sample_n()) for this:

library(dplyr)
dt %>% group_by(A) %>% slice_sample(n = 1)

data.table way of complete+fill from tidyr with groups of different lengths

Here is something raw:
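
df is not given in the original; a minimal reconstruction consistent with the output below (one observed row per person, with ind1/ind2 marking the desired observation range) is:

library(data.table)

df <- data.frame(person         = c(1, 2),
                 observation_id = c(3, 5),
                 value          = c(1, 1),
                 ind1           = c(2, 4),
                 ind2           = c(5, 7))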

DT <- setDT(copy(df))
DT[DT[, .(observation_id = ind1[1]:ind2[1]), by = person], on = .(person, observation_id)
][, value := nafill(value, "locf"), by = person][]

#    person observation_id value ind1 ind2
# 1:      1              2    NA   NA   NA
# 2:      1              3     1    2    5
# 3:      1              4     1   NA   NA
# 4:      1              5     1   NA   NA
# 5:      2              4    NA   NA   NA
# 6:      2              5     1    4    7
# 7:      2              6     1   NA   NA
# 8:      2              7     1   NA   NA

Note 1: nafill() needs data.table >= 1.12.4; at the time of writing it was only available in the development version.

Note 2: the final [] is just for printing the results and can be skipped.

Expand last observed values within group in data table in R

This should be faster.

Using na.locf (forward-filling NAs) from the zoo package, you can do:
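
dtable is not shown in the original answer; this reconstruction, consistent with the printed result below, makes the example runnable:

library(data.table)
library(zoo)

dtable <- data.table(id      = c(1, 1, 1, 2, 2, 2),
                     time    = c(1, 2, 3, 2, 3, 4),
                     value_a = c(NA, "Yes", NA, "No", NA, "Yes"),
                     value_b = c("No", "Yes", NA, NA, NA, NA))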

dtable[, c('value_a', 'value_b') := lapply(.SD, na.locf, na.rm = FALSE),
       by = id, .SDcols = c('value_a', 'value_b')]

print(dtable)

   id time value_a value_b
1:  1    1      NA      No
2:  1    2     Yes     Yes
3:  1    3     Yes     Yes
4:  2    2      No      NA
5:  2    3      No      NA
6:  2    4     Yes      NA

data.table fill missing values from other rows by group

With data.table and zoo:

library(data.table)
library(zoo)

# fill NAs backwards (next observation carried backward) within each group
dt[, colB := na.locf0(colB, fromLast = TRUE), by = colA]

# then fill the remaining NAs forwards (last observation carried forward)
dt[, colB := na.locf(colB), by = colA][]

Or in a single chain:

dt[, colB := na.locf0(colB, fromLast = TRUE), by = colA][
   , colB := na.locf(colB), by = colA][]

Both return:

    colA colB
 1:    1    4
 2:    1    1
 3:    1    1
 4:    1    1
 5:    2    4
 6:    2    3
 7:    2    3
 8:    2    3
 9:    3    4
10:    3    2
11:    3    2
12:    3    2

Data:

text <- "colA colB
1 4
1 NA
1 NA
1 1
2 4
2 3
2 NA
2 NA
3 4
3 NA
3 2
3 NA"

dt <- fread(input = text, stringsAsFactors = FALSE)
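
A zoo-free alternative (not part of the original answer) is data.table's own nafill(), which works here because colB is numeric:

library(data.table)

# NOCB first, then LOCF for the remaining trailing NAs -- same result as above
dt[, colB := nafill(nafill(colB, "nocb"), "locf"), by = colA][]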

