Fastest way to replace NAs in a large data.table
Here's a solution using data.table's :=
operator, building on Andrie and Ramnath's answers.
require(data.table) # v1.6.6
require(gdata) # v2.8.2
set.seed(1)
dt1 = create_dt(2e5, 200, 0.1)
dim(dt1)
[1] 200000    200   # 200 columns here, not 5 as in Ramnath's answer
f_andrie = function(dt) remove_na(dt)
f_gdata = function(dt, un = 0) gdata::NAToUnknown(dt, un)
f_dowle = function(dt) {    # see EDIT later for more elegant solution
  na.replace = function(v, value=0) { v[is.na(v)] = value; v }
  for (i in names(dt))
    eval(parse(text=paste("dt[,", i, ":=na.replace(", i, ")]")))
}
system.time(a_gdata <- f_gdata(dt1))
user system elapsed
18.805 12.301 134.985
system.time(a_andrie <- f_andrie(dt1))
Error: cannot allocate vector of size 305.2 Mb
Timing stopped at: 14.541 7.764 68.285
system.time(f_dowle(dt1))
user system elapsed
7.452 4.144 19.590   # see EDIT below for something faster than this
identical(a_gdata, dt1)
[1] TRUE
Note that f_dowle updated dt1 by reference. If a local copy is required, then an explicit call to the copy function is needed to make a local copy of the whole dataset. data.table's setkey, key<- and := do not copy-on-write.
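For example, a minimal sketch of the copy() pattern (toy data, not from the benchmark above): after an explicit copy(), the := update no longer touches the original.

```r
library(data.table)

DT  <- data.table(x = c(1, NA, 3))
DT2 <- copy(DT)          # explicit deep copy; DT2 no longer shares memory with DT
DT2[is.na(x), x := 0]    # updates DT2 by reference
DT[["x"]]                # DT is untouched: 1 NA 3
DT2[["x"]]               # DT2 has the replacement: 1 0 3
```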
Next, let's see where f_dowle is spending its time.
Rprof()
f_dowle(dt1)
Rprof(NULL)
summaryRprof()
$by.self
self.time self.pct total.time total.pct
"na.replace" 5.10 49.71 6.62 64.52
"[.data.table" 2.48 24.17 9.86 96.10
"is.na" 1.52 14.81 1.52 14.81
"gc" 0.22 2.14 0.22 2.14
"unique" 0.14 1.36 0.16 1.56
... snip ...
There, I would focus on na.replace and is.na, where there are a few vector copies and vector scans. Those could fairly easily be eliminated by writing a small na.replace C function that updates NA by reference in the vector. That would at least halve the 20 seconds, I think. Does such a function exist in any R package?
The reason f_andrie fails may be that it copies the whole of dt1, or creates a logical matrix as big as the whole of dt1, a few times. The other two methods work on one column at a time (although I only briefly looked at NAToUnknown).
EDIT (more elegant solution as requested by Ramnath in comments) :
f_dowle2 = function(DT) {
  for (i in names(DT))
    DT[is.na(get(i)), (i):=0]
}
system.time(f_dowle2(dt1))
user system elapsed
6.468 0.760 7.250 # faster, too
identical(a_gdata, dt1)
[1] TRUE
I wish I did it that way to start with!
EDIT2 (over 1 year later, now)
There is also set(). This can be faster if there are a lot of columns being looped through, as it avoids the (small) overhead of calling [,:=,] in a loop. set is a loopable :=; see ?set.
f_dowle3 = function(DT) {
  # either of the following for loops
  # by name :
  for (j in names(DT))
    set(DT, which(is.na(DT[[j]])), j, 0)
  # or by number (slightly faster than by name) :
  for (j in seq_len(ncol(DT)))
    set(DT, which(is.na(DT[[j]])), j, 0)
}
Replace NAs with previous values using data.table
You can use zoo::na.locf().
assets[,XPTO := zoo::na.locf(XPTO)]
To answer jblood94's question: this fills each NA with the most recent preceding non-NA value.
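As a side note not in the original answer: recent data.table versions ship their own nafill()/setnafill() with type = "locf", which avoids the zoo dependency for numeric columns. Unlike zoo::na.locf() (whose default na.rm = TRUE drops leading NAs), a leading NA is kept, since there is nothing to carry forward:

```r
library(data.table)

x <- c(NA, 1, NA, NA, 3, NA)
nafill(x, type = "locf")   # NA 1 1 1 3 3 -- leading NA stays NA
```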
Replace NAs in a Single Column of a Data Table in R
Your code isn't off, unless the data in the column is not character, in which case you would have to set -999 as an integer/numeric value, without the quotes.
data <- read.table(header=TRUE, text='
id weight size
1 20 small
2 27 large
3 24 medium
')
data <- data.table(data)
> data[size == 'small', weight := NA]
> data
size id weight
1: small 1 NA
2: large 2 27
3: medium 3 24
> is.na(data)
size id weight
[1,] FALSE FALSE TRUE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE FALSE
> data[is.na(weight), weight := -999]
> data
size id weight
1: small 1 -999
2: large 2 27
3: medium 3 24
> data[size == 'small', weight := NA]
> data[is.na(weight), weight := "-999"]
Warning message:
In `[.data.table`(data, is.na(weight), `:=`(weight, "-999")) :
Coerced 'character' RHS to 'integer' to match the column's type.
EDIT: I just saw that this is what @dracodoc suggested in a comment.
Replacing all missing values in R data.table with a value
is.na (being a primitive) has relatively little overhead and is usually quite fast. So you can just loop through the columns and use set to replace NA with 0.
Using <- to assign will result in a copy of all the columns, and this is not the idiomatic way of using data.table.
First I'll illustrate as to how to do it and then show how slow this can get on huge data (due to the copy):
One way to do this efficiently:
for (i in seq_along(tt)) set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
You'll get a warning here that "0" is being coerced to character to match the type of column. You can ignore it.
Why shouldn't you use <- here:
# by reference - idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# modifies value by reference - no copy
system.time({
  for (i in seq_along(tt))
    set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
})
# user system elapsed
# 0.284 0.083 0.386
# by copy - NOT the idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# makes copy
system.time({tt[is.na(tt)] <- 0})
# a bunch of "tracemem" output showing the copies being made
# user system elapsed
# 4.110 0.976 5.187
Replace NAs with 0 in all numeric columns using data.table in R
We can do this with set by looping over the numeric columns (needed_names) and setting the elements that are NA (the rows given in i) to 0:
for(j in needed_names){
  set(dt, i = which(is.na(dt[[j]])), j = j, value = 0)
}
dt
# a b c
#1: 1 0 a
#2: 2 0 b
#3: 3 0 c
#4: 0 20 d
#5: 0 21 e
#6: 0 22 f
#7: 10 23 g
#8: 11 24 NA
#9: 12 25 NA
Regarding the OP's code: when there is more than one element, we use ifelse or replace, and the output should be assigned back to the columns of interest; otherwise we would only get the columns specified in .SDcols, and the original dataset would not be updated:
dt[, (needed_names) := lapply(.SD, function(x)
replace(x, is.na(x), 0)), .SDcols = needed_names]
R data.table fill row NAs, by row
One good method, using a for loop. It's not row-by-row; it operates on "all rows with an NA in column 'X'" at one time, for each column in cols.
for (i in seq_along(cols)[-1]) {
  prevcol <- cols[i-1]
  thiscol <- cols[i]
  dt[is.na(get(thiscol)), (thiscol) := fcoalesce(get(thiscol), get(prevcol))]
}
dt
# a b c d e f g h i j xx
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1: 1 3 3 2 3 1 1 1 4 3 1
# 2: 1 1 2 2 1 2 2 1 2 4 1
# 3: 3 2 3 1 1 4 3 3 2 1 2
# 4: 2 3 3 1 2 2 1 4 3 4 2
# 5: 1 2 3 4 4 3 2 2 2 4 3
# 6: 4 1 4 2 1 4 4 3 3 4 3
Admittedly the use of get(.) is not perfect, but I think it'll generally be okay.
Another method, about as fast (depending on the size of data):
dt[, (cols) := Reduce(function(prev,this) fcoalesce(this, prev), .SD, accumulate = TRUE), .SDcols = cols]
# same results
Benchmarking, since you said that with 2M rows performance is important. I'll go with 2M rows and update the method for randomizing the NAs.
library(data.table)
set.seed(456)
n <- 2e6 # 6e5
dt <- data.table(
  a = sample(1:4, n, replace=TRUE), b = sample(1:4, n, replace=TRUE),
  c = sample(1:4, n, replace=TRUE), d = sample(1:4, n, replace=TRUE),
  e = sample(1:4, n, replace=TRUE), f = sample(1:4, n, replace=TRUE),
  g = sample(1:4, n, replace=TRUE), h = sample(1:4, n, replace=TRUE),
  i = sample(1:4, n, replace=TRUE), j = sample(1:4, n, replace=TRUE),
  xx = sample(1:4, n, replace=TRUE)
)
mtx <- cbind(sample(nrow(dt), ceiling(n*11/20), replace=TRUE), sample(ncol(dt), ceiling(n*11/20), replace=TRUE))
mtx <- mtx[!duplicated(mtx),]
dt[mtx] <- NA
head(dt)
# a b c d e f g h i j xx
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1: 1 2 2 3 2 1 2 3 3 2 2
# 2: 1 3 4 1 4 4 3 2 4 3 3
# 3: 3 4 2 2 3 4 2 2 1 NA 1
# 4: 2 1 4 1 2 3 NA 4 4 4 3
# 5: 1 2 3 3 4 3 3 NA 1 4 1
# 6: 4 3 4 2 2 NA 4 1 2 4 2
Unfortunately, the transpose method fails:
system.time({
  dt2 = transpose(dt)
  setnafill(dt2, type = 'locf')
  dt2 = transpose(dt2)
  setnames(dt2, names(dt))
})
# Error: cannot allocate vector of size 30.6 Gb
but the for loop (and Reduce, incidentally) works fine:
cols <- setdiff(names(dt),"N")
system.time({
  for (i in seq_along(cols)[-1]) {
    prevcol <- cols[i-1]
    thiscol <- cols[i]
    dt[is.na(get(thiscol)), (thiscol) := fcoalesce(get(thiscol), get(prevcol))]
  }
})
# user system elapsed
# 0.14 0.00 0.11
head(dt)
# a b c d e f g h i j xx
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1: 1 2 2 3 2 1 2 3 3 2 2
# 2: 1 3 4 1 4 4 3 2 4 3 3
# 3: 3 4 2 2 3 4 2 2 1 1 1
# 4: 2 1 4 1 2 3 3 4 4 4 3
# 5: 1 2 3 3 4 3 3 3 1 4 1
# 6: 4 3 4 2 2 2 4 1 2 4 2
If I simplify the problem-set to 600K rows, then I can get both to work. (I don't know the tipover point for my system ... it might be 1M, who knows, I just wanted to compare them side-by-side.) With n <- 6e5 and generating dt, I see the following data and simple timing:
head(dt)
# a b c d e f g h i j xx
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1: 1 2 3 1 3 4 NA 3 3 3 3
# 2: 1 3 2 2 4 3 1 2 2 4 1
# 3: 3 4 2 1 1 1 1 4 2 4 2
# 4: 2 4 1 NA 1 4 3 1 4 1 1
# 5: 1 NA 4 2 NA NA 4 4 2 2 NA
# 6: 4 1 4 4 1 2 3 3 1 1 2
sum(is.na(dt))
# [1] 321782
system.time({
  dt2 = transpose(dt)
  setnafill(dt2, type = 'locf')
  dt2 = transpose(dt2)
  setnames(dt2, names(dt))
})
# user system elapsed
# 4.27 4.50 7.74
sum(is.na(dt)) # 'dt' is unchanged, only important here to compare the 'for' loop
# [1] 321782
sum(is.na(dt2)) # rows with leading columns having 'NA', nothing to coalesce, not surprising
# [1] 30738
cols <- setdiff(names(dt),"N")
system.time({
  for (i in seq_along(cols)[-1]) {
    prevcol <- cols[i-1]
    thiscol <- cols[i]
    dt[is.na(get(thiscol)), (thiscol) := fcoalesce(get(thiscol), get(prevcol))]
  }
})
# user system elapsed
# 0.10 0.03 0.06
identical(dt, dt2)
# [1] TRUE
### regenerate `dt` so it has `NA`s again
system.time({
dt[, (cols) := Reduce(function(prev,this) fcoalesce(this,prev), .SD, accumulate = TRUE), .SDcols = cols]
})
# user system elapsed
# 0.03 0.00 0.03
identical(dt, dt2)
# [1] TRUE
A more robust benchmark such as bench::mark is going to be encumbered a little by the need to copy(dt) every pass. Though this overhead is not huge,
bench::mark(copy(dt))
# # A tibble: 1 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 copy(dt) 7.77ms 20.9ms 45.1 25.2MB 0 23 0 510ms <data.table[,11] [600,000 x 11]> <Rprofmem[,3] [14 x 3]> <bch:tm [23]> <tibble [23 x 3]>
it is still extra. As such, I'll benchmark the transpose code twice, once with and once without the copy, in order to compare it to the for and reduce answers more honestly. (Note that bench::mark's default action is to verify that all outputs are identical. This can be disabled, but I have not done that, so all code blocks return the same results.)
bench::mark(
  transpose1 = {
    dt2 = transpose(dt)
    setnafill(dt2, type = 'locf')
    dt2 = transpose(dt2)
    setnames(dt2, names(dt))
    dt2
  },
  transpose2 = {
    dt0 = copy(dt)
    dt2 = transpose(dt0)
    setnafill(dt2, type = 'locf')
    dt2 = transpose(dt2)
    setnames(dt2, names(dt0))
    dt2
  },
  forloop = {
    dt0 <- copy(dt)
    for (i in seq_along(cols)[-1]) {
      prevcol <- cols[i-1]
      thiscol <- cols[i]
      dt0[is.na(get(thiscol)), (thiscol) := fcoalesce(get(thiscol), get(prevcol))]
    }
    dt0
  },
  reduce = {
    dt0 <- copy(dt)
    dt0[, (cols) := Reduce(function(prev, this) fcoalesce(this, prev), .SD, accumulate = TRUE), .SDcols = cols]
  },
  min_iterations = 10)
# Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# # A tibble: 4 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 transpose1 4.94s 5.48s 0.154 1.28GB 0.201 10 13 1.08m <data.table[,11] [600,000 x 11]> <Rprofmem[,3] [33,008 x 3]> <bch:tm [10]> <tibble [10 x 3]>
# 2 transpose2 5.85s 6.29s 0.130 1.3GB 0.259 10 20 1.29m <data.table[,11] [600,000 x 11]> <Rprofmem[,3] [15,316 x 3]> <bch:tm [10]> <tibble [10 x 3]>
# 3 forloop 48.37ms 130.91ms 2.87 71.14MB 0 10 0 3.49s <data.table[,11] [600,000 x 11]> <Rprofmem[,3] [191 x 3]> <bch:tm [10]> <tibble [10 x 3]>
# 4 reduce 48.08ms 75.82ms 4.70 71MB 0.470 10 1 2.13s <data.table[,11] [600,000 x 11]> <Rprofmem[,3] [38 x 3]> <bch:tm [10]> <tibble [10 x 3]>
From this:
- Time: after normalizing to milliseconds, 4840ms compares poorly against 48ms; and
- Memory: 1.28GB compares poorly with 71MB.
(Edited to increase the benchmark's minimum iterations to 10.)
Fastest way to replace multiple values in large data.frame
I looked at some sed commands and figured I should post what I found in case someone has a similar issue.
The sed command that I found to work in the terminal is (this creates a new file, but you don't have to create a new file):
sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+./.+0.01+g' R.test.txt > R.test.edit.txt
or this works as well in R:
system("sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+./.+0.01+g' R.test.txt > R.test.edit.txt")
You can also use the data.table::fread method mentioned by IceCreamToucan:
df <- fread("sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+./.+0.01+g' /R/R.test.txt")
It is interesting to note that the sed command you would typically use is
sed 's/old text/new text/g' file > new.file
but since what I needed to replace already contained a forward slash (/), I used the + sign as the delimiter so sed doesn't get confused.
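A minimal sketch of the delimiter choice (the file name here is a made-up sample, not the real dataset). One caveat worth knowing: in the pattern ./., the dot is a regex metacharacter matching any character; it only works above because the specific genotypes are substituted first, and escaping as \./\. is stricter:

```shell
# sed accepts (almost) any character after s as the delimiter; + avoids escaping the slashes
printf '0/0\n0/1\n./.\n' > genotypes.txt   # hypothetical sample file
sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+\./\.+0.01+g' genotypes.txt
# prints:
# 0
# 1
# 0.01
```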
I am going to do a performance test using my two older methods (posted above), the new sed method, and F. Prive's method that he posted as an answer. I am going to use a smaller subset of the full dataset, because it would take too long to test all four methods on the full data.
EDIT
So I tested the four different methods to see which one was fastest. The smaller file I created for the test had 1,000,000 rows and 340 columns.
METHOD 1
lookup_table <- c("0/0" = 0, "0/1" = 1, "1/1" = 2, "./." = 0.1)
df[-(1:2)] <- lapply(df[-(1:2)], function(x) lookup_table[x])
Runtime - 8 minutes
METHOD 2
replacement <- function(x){
  x = replace(x, which(x == './.'), 0.01)
  x = replace(x, which(x == '0/0'), 0)
  x = replace(x, which(x == '0/1'), 1)
  x = replace(x, which(x == '1/1'), 2)
}
df = apply(df, 2, replacement)
df <- as.data.frame(df)
Runtime - 46 seconds
METHOD 3
df <- df %>% mutate_at(
vars(- CHROM, - POS),
funs(case_when(
. == "0/0" ~ 0,
. == "0/1" ~ 1,
. == "1/1" ~ 2,
. == "./." ~ 0.01
))
)
Runtime - 42 seconds
METHOD 4
df <- fread("sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+./.+0.01+g' /R/R.test.txt")
Runtime - 2 min 34 seconds, which was surprising
Conclusion - I wasted my time