efficiently locf by groups in a single R data.table
A very simple na.locf can be built by forwarding (cummax) the non-NA indices ((!is.na(x)) * seq_along(x)) and subsetting accordingly:
x = c(1, NA, NA, 6, 4, 5, 4, NA, NA, 2)
x[cummax((!is.na(x)) * seq_along(x))]
# [1] 1 1 1 6 4 5 4 4 4 2
This replicates na.locf with na.rm = TRUE. To get na.rm = FALSE behavior, we simply need to make sure the first element passed to cummax is TRUE:
x = c(NA, NA, 1, NA, 2)
x[cummax(c(TRUE, tail((!is.na(x)) * seq_along(x), -1)))]
#[1] NA NA 1 1 2
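The two variants above can be folded into one small helper. This is only a sketch: locf_cummax is a hypothetical name, not a function from zoo or data.table, and it assumes x is a plain atomic vector.

```r
# Hypothetical helper: carry the last non-NA value forward using cummax.
locf_cummax <- function(x, na.rm = FALSE) {
  # 0 where x is NA, the element's own position otherwise
  idx <- (!is.na(x)) * seq_along(x)
  # Forcing the first index to 1 keeps leading NAs in place (na.rm = FALSE)
  if (!na.rm && length(idx) > 0L) idx[1L] <- 1L
  x[cummax(idx)]
}

x <- c(NA, NA, 1, NA, 2)
locf_cummax(x)                # NA NA 1 1 2
locf_cummax(x, na.rm = TRUE)  # 1 1 2
```

With na.rm = TRUE the leading zeros in the index vector select nothing, which is why the result is shorter than the input.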
In this case, we need to take into account not only the non-NA indices but also the indices where the (ordered, or to be ordered) "id" column changes value:
id = c(10, 10, 11, 11, 11, 12, 12, 12, 13, 13)
c(TRUE, id[-1] != id[-length(id)])
# [1] TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
Combining the above:
id = c(10, 10, 11, 11, 11, 12, 12, 12, 13, 13)
x = c(1, NA, NA, 6, 4, 5, 4, NA, NA, 2)
x[cummax(((!is.na(x)) | c(TRUE, id[-1] != id[-length(id)])) * seq_along(x))]
# [1] 1 1 NA 6 4 5 4 4 NA 2
Note that here we OR the first element with TRUE, i.e. make it equal to TRUE, thus getting the na.rm = FALSE behavior.
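As a sanity check, the combined expression can be wrapped in a function and compared against a slow per-group loop. Both locf_by_group and fill_one are hypothetical names used only for this sketch, and it assumes id is already ordered so that each group is contiguous.

```r
# Hypothetical wrapper around the combined cummax expression.
# Assumes `id` is sorted (each group forms one contiguous run).
locf_by_group <- function(x, id) {
  id_change <- c(TRUE, id[-1] != id[-length(id)])
  x[cummax(((!is.na(x)) | id_change) * seq_along(x))]
}

# Slow reference: forward-fill one group with an explicit loop.
fill_one <- function(v) {
  for (i in seq_along(v)[-1]) if (is.na(v[i])) v[i] <- v[i - 1]
  v
}

id <- c(10, 10, 11, 11, 11, 12, 12, 12, 13, 13)
x  <- c(1, NA, NA, 6, 4, 5, 4, NA, NA, 2)

fast <- locf_by_group(x, id)
slow <- ave(x, id, FUN = fill_one)  # apply fill_one within each id
identical(fast, slow)
```

The vectorised version never splits the data, which is where the speed comes from; the loop is there only to confirm that the result matches.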
And for this example:
id_change = DT[, c(TRUE, id[-1] != id[-.N])]
DT[, lapply(.SD, function(x) x[cummax(((!is.na(x)) | id_change) * .I)])]
# id aa bb cc
# 1: 1 A NA 1
# 2: 1 A NA 1
# 3: 1 B NA 1
# 4: 1 C NA 1
# 5: 2 NA NA NA
# 6: 2 NA NA 4
# 7: 2 D NA 4
# 8: 2 E NA 5
# 9: 3 F NA 6
#10: 3 F NA 6
#11: 3 F NA 7
#12: 3 F NA 7
Last observation carried forward by group over multiple columns
cols = grep("^K|^L|^M", names(diagnosis), value = TRUE)
diagnosis[, (cols) := na.locf(.SD, na.rm = FALSE), by = patient, .SDcols = cols]
Also take a look at efficiently locf by groups in a single R data.table.
Efficiently fill out (locf/nocb) values of data.table column, then aggregate by another column
In my limited testing this is faster than either of your options and doesn't use much memory (btw, use CJ instead of data.table(expand.grid(...))):
dat[dat, on = .(day >= day), mean(val[!duplicated(custid)]), by = .EACHI]
This assumes data is sorted by day as in OP.
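To illustrate the CJ remark, here is a minimal sketch comparing the two ways of building a full grid. The column names custid and day are taken from the join above; the values are made up for the example.

```r
library(data.table)

# Cross join built directly as a data.table; CJ sorts the result and sets a key
grid_cj <- CJ(custid = 1:2, day = 1:3)

# The same combinations via expand.grid, then converted and keyed
grid_eg <- data.table(expand.grid(custid = 1:2, day = 1:3))
setkey(grid_eg, custid, day)

key(grid_cj)  # "custid" "day"
nrow(grid_cj) # 6
```

CJ skips the intermediate data.frame and the separate sorting step, which tends to matter once the grid gets large.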
Efficiently fill NAs by group
This is the code I have used: Your code vs akrun vs mine. Sometimes zoo is not the fastest process but it is the cleanest. Anyway, you can test it.
UPDATE:
It has been tested with more data (100,000) and Process 03 (subset and merge) wins by far.
Last UPDATE
Function comparison with rbenchmark:
library(dplyr)
library(tidyr)
library(data.table)
library(zoo)
library(rbenchmark)
# data.frame of 10,000 individuals with 10 observations each
data <- data.frame(group = rep(1:10000, each = 10), value = NA)
# the first 500 individuals get a value at their fifth observation
# (the 50 random draws are recycled); the others have no value
data$value[seq(5, 5000, 10)] <- rnorm(50)
#Process01
P01 <- function(data) {
  data01 <- data %>%
    group_by(group) %>%              # by group
    fill(value) %>%                  # default direction is down
    fill(value, .direction = "up")   # also fill NAs upwards
  return(data01)
}
#Process02
P02 <- function(data) {
  data02 <- setDT(data)[, value := na.locf(na.locf(value, na.rm = FALSE),
                                           fromLast = TRUE), group]
  return(data02)
}
#Process03
P03 <- function(data) {
  dataU <- subset(unique(data), !is.na(value))              # keep the rows that have a value
  dataM <- merge(data, dataU, by = "group", all = TRUE)     # merge tables
  data03 <- data.frame(group = dataM$group, value = dataM$value.y)  # same shape as data
  return(data03)
}
benchmark("P01_dplyr" = {data01 <- P01(data)},
          "P02_zoo" = {data02 <- P02(data)},
          "P03_data.table" = {data03 <- P03(data)},
          replications = 10,
          columns = c("test", "replications", "elapsed")
)
Results with 10,000 groups, 10 replications, on an Intel i5-7400:
            test replications elapsed
1      P01_dplyr           10  257.78
2        P02_zoo           10   10.35
3 P03_data.table           10    0.09
How to efficiently sample from a datatable by column in R?
You can use sample on .N for each group and select 1 random row.
library(data.table)
set.seed(123)
dt[, .SD[sample(.N, 1)], A]
# A B C
#1: A 31 143
#2: D 16 175
#3: B 100 165
#4: E 27 190
#5: C 90 197
dplyr has the slice_sample (previously sample_n) function for this:
library(dplyr)
dt %>% group_by(A) %>% slice_sample(n = 1)
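For a self-contained run, here is a sketch on a made-up table (the original dt is not shown, so the values below are assumptions). It also shows a variant that samples row numbers with .I and subsets once, which avoids building .SD for every group.

```r
library(data.table)
set.seed(123)

# Hypothetical data: 3 groups of 4 rows each
dt <- data.table(A = rep(c("A", "B", "C"), each = 4),
                 B = 1:12,
                 C = 101:112)

# One random row per group, as in the answer above
one_per_group <- dt[, .SD[sample(.N, 1)], by = A]

# Variant: sample a global row number per group, then subset once
idx <- dt[, .I[sample(.N, 1)], by = A]$V1
picked <- dt[idx]
```

Both return one row per value of A; the two calls draw independently, so they generally pick different rows.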
data.table way of complete+fill from tidyr with groups of difference length
Here is something raw:
DT <- setDT(copy(df))
DT[DT[, .(observation_id = ind1[1]:ind2[1]), by = person], on = .(person, observation_id)
][, value := nafill(value, "locf"), by = person][]
# person observation_id value ind1 ind2
# 1: 1 2 NA NA NA
# 2: 1 3 1 2 5
# 3: 1 4 1 NA NA
# 4: 1 5 1 NA NA
# 5: 2 4 NA NA NA
# 6: 2 5 1 4 7
# 7: 2 6 1 NA NA
# 8: 2 7 1 NA NA
Note 1: when this was written, nafill() was only available in the development version of data.table; it has since been released (data.table 1.12.4 and later).
Note 2: the final [] is just for printing the results and can be skipped.
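The original df is not shown, so the following is a reconstruction under assumptions: a hypothetical input with one observed row per person, chosen so that the same join and fill produce the output printed above.

```r
library(data.table)

# Hypothetical input: one observed row per person, with the range to complete
df <- data.frame(person         = c(1L, 2L),
                 observation_id = c(3L, 5L),
                 value          = c(1, 1),
                 ind1           = c(2L, 4L),
                 ind2           = c(5L, 7L))

DT <- setDT(copy(df))

# Step 1: build every observation_id in ind1..ind2 per person and join onto it
full <- DT[DT[, .(observation_id = ind1[1]:ind2[1]), by = person],
           on = .(person, observation_id)]

# Step 2: carry the last observed value forward within each person
full[, value := nafill(value, type = "locf"), by = person]

full$value  # NA 1 1 1 NA 1 1 1
```

The first observation_id of each person stays NA because there is nothing earlier to carry forward; nafill(type = "nocb") would be the counterpart for filling backwards.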
Expand last observed values within group in data table in R
This should be faster.
Using na.locf (forward filling NA) from the zoo package, you can do:
dtable[, c('value_a','value_b') := lapply(.SD, na.locf, na.rm = FALSE), .SDcols = c('value_a','value_b'), .(id)]
print(dtable)
id time value_a value_b
1: 1 1 NA No
2: 1 2 Yes Yes
3: 1 3 Yes Yes
4: 2 2 No NA
5: 2 3 No NA
6: 2 4 Yes NA
data.table fill missing values from other rows by group
With data.table and zoo:
library(data.table)
library(zoo)
# Next observation carried backward (fromLast = TRUE) within each group
dt <- dt[, colB := na.locf0(colB, fromLast = TRUE), by = colA]
# Last observation carried forward for the NAs left at the end of each group
dt[, colB := na.locf(colB), by = colA][]
Or in a single chain:
dt[, colB := na.locf0(colB, fromLast = TRUE), by = colA][
, colB := na.locf(colB), by = colA][]
Both return:
colA colB
1: 1 4
2: 1 1
3: 1 1
4: 1 1
5: 2 4
6: 2 3
7: 2 3
8: 2 3
9: 3 4
10: 3 2
11: 3 2
12: 3 2
Data:
text <- "colA colB
1 4
1 NA
1 NA
1 1
2 4
2 3
2 NA
2 NA
3 4
3 NA
3 2
3 NA"
dt <- fread(input = text, stringsAsFactors = FALSE)
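The two passes can also be wrapped in one helper. This is a sketch, with fill_both as a hypothetical name; it uses na.locf0 for both directions because na.locf0 always returns a vector of the same length as its input, whereas na.locf with its default na.rm = TRUE drops any NAs it cannot fill. The table is rebuilt with data.table() instead of fread so the block runs on its own.

```r
library(data.table)
library(zoo)

# Same values as the Data section above
dt <- data.table(colA = rep(1:3, each = 4),
                 colB = c(4, NA, NA, 1,  4, 3, NA, NA,  4, NA, 2, NA))

# Hypothetical helper: fill backward first, then forward for what is left
fill_both <- function(x) na.locf0(na.locf0(x, fromLast = TRUE))

dt[, colB := fill_both(colB), by = colA]
dt$colB  # 4 1 1 1 4 3 3 3 4 2 2 2
```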