Fastest way to replace NAs in a large data.table

Here's a solution using data.table's := operator, building on Andrie and Ramnath's answers.

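The helper create_dt() is defined in the question rather than in this answer. A reconstruction consistent with how it is called below (arguments: rows, columns, proportion of NAs; the exact body is an assumption):

create_dt <- function(nrow = 5, ncol = 5, propNA = 0.5) {
  # random numeric data, with propNA of the cells set to NA
  v <- runif(nrow * ncol)
  v[sample(seq_len(nrow * ncol), propNA * nrow * ncol)] <- NA
  data.table(matrix(v, ncol = ncol))
}
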
require(data.table)  # v1.6.6
require(gdata) # v2.8.2

set.seed(1)
dt1 = create_dt(2e5, 200, 0.1)
dim(dt1)
[1] 200000    200    # more columns than Ramnath's benchmark, which used 5 rather than 200

f_andrie = function(dt) remove_na(dt)  # remove_na() is defined in Andrie's answer

f_gdata = function(dt, un = 0) gdata::NAToUnknown(dt, un)

f_dowle = function(dt) {   # see EDIT later for more elegant solution
  na.replace = function(v, value=0) { v[is.na(v)] = value; v }
  for (i in names(dt))
    eval(parse(text=paste("dt[,",i,":=na.replace(",i,")]")))
}

system.time(a_gdata <- f_gdata(dt1))
   user  system elapsed
 18.805  12.301 134.985

system.time(a_andrie <- f_andrie(dt1))
Error: cannot allocate vector of size 305.2 Mb
Timing stopped at: 14.541 7.764 68.285

system.time(f_dowle(dt1))
   user  system elapsed
  7.452   4.144  19.590   # the EDIT below is faster still

identical(a_gdata, dt1)
[1] TRUE

Note that f_dowle updated dt1 by reference. If a local copy is required, an explicit call to copy() is needed to make a local copy of the whole dataset. data.table's setkey, key<- and := do not copy-on-write.
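
For example (hypothetical object names):

dt_backup <- copy(dt1)   # true deep copy; safe from later := updates
dt_alias  <- dt1         # NOT a copy: f_dowle(dt_alias) would still modify dt1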

Next, let's see where f_dowle is spending its time.

Rprof()
f_dowle(dt1)
Rprof(NULL)
summaryRprof()
$by.self
               self.time self.pct total.time total.pct
"na.replace"        5.10    49.71       6.62     64.52
"[.data.table"      2.48    24.17       9.86     96.10
"is.na"             1.52    14.81       1.52     14.81
"gc"                0.22     2.14       0.22      2.14
"unique"            0.14     1.36       0.16      1.56
... snip ...

There, I would focus on na.replace and is.na, where there are a few vector copies and vector scans. Those can fairly easily be eliminated by writing a small na.replace C function that updates NA by reference in the vector. That would at least halve the 20 seconds I think. Does such a function exist in any R package?

The reason f_andrie fails is probably that it copies the whole of dt1, or creates a logical matrix as big as the whole of dt1, a few times. The other two methods work on one column at a time (although I only briefly looked at NAToUnknown).

EDIT (more elegant solution, as requested by Ramnath in comments):

f_dowle2 = function(DT) {
  for (i in names(DT))
    DT[is.na(get(i)), (i) := 0]
}

system.time(f_dowle2(dt1))
   user  system elapsed
  6.468   0.760   7.250   # faster, too

identical(a_gdata, dt1)
[1] TRUE

I wish I did it that way to start with!

EDIT2 (over a year later, now):

There is also set(). This can be faster if there are a lot of columns being looped through, as it avoids the (small) overhead of calling [,:=,] in a loop. set is a loopable :=. See ?set.

f_dowle3 = function(DT) {
  # either of the following for loops

  # by name :
  for (j in names(DT))
    set(DT, which(is.na(DT[[j]])), j, 0)

  # or by number (slightly faster than by name) :
  for (j in seq_len(ncol(DT)))
    set(DT, which(is.na(DT[[j]])), j, 0)
}
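
A note for later readers: recent data.table versions also provide setnafill(), which fills NAs by reference in C; for the all-numeric case here, assuming data.table >= 1.12.4, something like:

setnafill(DT, fill = 0)   # constant fill, numeric columns, by reference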

Replace NAs with previous values using data.table

You can use zoo::na.locf().

assets[,XPTO := zoo::na.locf(XPTO)]

To answer jblood94's question: this fills each NA with the latest non-NA value.
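
One caveat (not in the original answer): by default zoo::na.locf() drops leading NAs (na.rm = TRUE), which makes the result shorter than the column and breaks the := assignment when the first value is NA; na.rm = FALSE avoids that. data.table's own nafill() is an alternative for numeric columns. A sketch with toy data:

library(data.table)
assets <- data.table(XPTO = c(NA, 1, NA, NA, 2, NA))
assets[, XPTO := zoo::na.locf(XPTO, na.rm = FALSE)]   # NA 1 1 1 2 2
assets[, XPTO := nafill(XPTO, type = "locf")]         # numeric columns only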

Replace NAs in a Single Column of a Data Table in R

Your code isn't off, unless the data in the column is not character, in which case you would have to set -999 as integer/numeric, without quotes.

data <- read.table(header=TRUE, text='
id weight size
1 20 small
2 27 large
3 24 medium
')

library(data.table)
data <- data.table(data)

> data[size == 'small', weight := NA]
> data
     size id weight
1:  small  1     NA
2:  large  2     27
3: medium  3     24
> is.na(data)
      size    id weight
[1,] FALSE FALSE   TRUE
[2,] FALSE FALSE  FALSE
[3,] FALSE FALSE  FALSE
> data[is.na(weight), weight := -999]
> data
     size id weight
1:  small  1   -999
2:  large  2     27
3: medium  3     24
> data[size == 'small', weight := NA]
> data[is.na(weight), weight := "-999"]
Warning message:
In `[.data.table`(data, is.na(weight), `:=`(weight, "-999")) :
  Coerced 'character' RHS to 'integer' to match the column's type.
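
To avoid the coercion message entirely you can match the column's type exactly, e.g. with an integer literal (an aside, not in the original answer):

data[is.na(weight), weight := -999L]   # -999L is integer, matching weight's storage type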

EDIT: this is, I just saw, what @dracodoc suggested in a comment.

Replacing all missing values in R data.table with a value

is.na (being a primitive) has relatively little overhead and is usually quite fast. So you can just loop through the columns and use set to replace NA with 0.

Using <- to assign will result in a copy of all the columns, and this is not the idiomatic data.table way.

First I'll illustrate how to do it, then show how slow this can get on huge data (due to the copy):

One way to do this efficiently:

for (i in seq_along(tt)) set(tt, i=which(is.na(tt[[i]])), j=i, value=0)

You'll get a warning here that "0" is being coerced to character to match the type of column. You can ignore it.
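
As an aside (an assumption about your data, not in the original answer), if the columns really are character, passing a value whose type already matches avoids the coercion warning:

for (i in seq_along(tt)) set(tt, i = which(is.na(tt[[i]])), j = i, value = "0")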

Why shouldn't you use <- here:

# by reference - idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# modifies value by reference - no copy
system.time({
  for (i in seq_along(tt))
    set(tt, i = which(is.na(tt[[i]])), j = i, value = 0)
})
#   user  system elapsed
#  0.284   0.083   0.386

# by copy - NOT the idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# makes copy
system.time({ tt[is.na(tt)] <- 0 })
# a bunch of "tracemem" output showing the copies being made
#   user  system elapsed
#  4.110   0.976   5.187

Replace NAs with 0 in all numeric columns using data.table in R

We can do this with set by looping over the numeric columns (needed_names) and setting the elements that are NA (given in i) to 0.
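
needed_names comes from the question; for "all numeric columns" it would be built with something like this (a sketch, not part of the original answer):

needed_names <- names(dt)[vapply(dt, is.numeric, logical(1))]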

for (j in needed_names) {
  set(dt, i = which(is.na(dt[[j]])), j = j, value = 0)
}
dt
# a b c
#1: 1 0 a
#2: 2 0 b
#3: 3 0 c
#4: 0 20 d
#5: 0 21 e
#6: 0 22 f
#7: 10 23 g
#8: 11 24 NA
#9: 12 25 NA

Regarding the OP's code: when there is more than one element, use ifelse or replace, and the output should be assigned back to the columns of interest; otherwise we only get the columns specified in .SDcols, and the original dataset will not be updated.

dt[, (needed_names) := lapply(.SD, function(x) 
replace(x, is.na(x), 0)), .SDcols = needed_names]

R data.table fill row NAs, by row

One good method, using a for loop. It's not row-by-row: it operates on "all rows with an NA in column X" at one time, for each column in cols.

for (i in seq_along(cols)[-1]) {
  prevcol <- cols[i-1]
  thiscol <- cols[i]
  dt[is.na(get(thiscol)), (thiscol) := fcoalesce(get(thiscol), get(prevcol))]
}

dt
# a b c d e f g h i j xx
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1: 1 3 3 2 3 1 1 1 4 3 1
# 2: 1 1 2 2 1 2 2 1 2 4 1
# 3: 3 2 3 1 1 4 3 3 2 1 2
# 4: 2 3 3 1 2 2 1 4 3 4 2
# 5: 1 2 3 4 4 3 2 2 2 4 3
# 6: 4 1 4 2 1 4 4 3 3 4 3

Admittedly the use of get(.) is not perfect, but I think it'll generally be okay.

Another method, about as fast (depending on the size of data):

dt[, (cols) := Reduce(function(prev, this) fcoalesce(this, prev),
                      .SD, accumulate = TRUE), .SDcols = cols]
# same results
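
To see why this works, here is a toy illustration (hypothetical vectors): with accumulate = TRUE, each step coalesces the current column with everything carried forward so far, which amounts to a left-to-right fill within each row.

cols3 <- list(a = c(1, NA, NA), b = c(NA, 2, NA), c = c(NA, NA, NA))
Reduce(function(prev, this) fcoalesce(this, prev), cols3, accumulate = TRUE)
# [[1]] 1 NA NA    [[2]] 1 2 NA    [[3]] 1 2 NA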

Benchmarking, since you said that with 2M rows, performance is important.

I'll go with 2M rows and update the method for randomizing the NAs.

library(data.table)
set.seed(456)
n <- 2e6 # 6e5
dt <- data.table(a = sample(1:4, n, replace = TRUE), b = sample(1:4, n, replace = TRUE),
                 c = sample(1:4, n, replace = TRUE), d = sample(1:4, n, replace = TRUE),
                 e = sample(1:4, n, replace = TRUE), f = sample(1:4, n, replace = TRUE),
                 g = sample(1:4, n, replace = TRUE), h = sample(1:4, n, replace = TRUE),
                 i = sample(1:4, n, replace = TRUE), j = sample(1:4, n, replace = TRUE),
                 xx = sample(1:4, n, replace = TRUE))
# random (row, column) positions to blank out with NA
mtx <- cbind(sample(nrow(dt), ceiling(n*11/20), replace = TRUE),
             sample(ncol(dt), ceiling(n*11/20), replace = TRUE))
mtx <- mtx[!duplicated(mtx), ]
dt[mtx] <- NA
head(dt)
# a b c d e f g h i j xx
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1: 1 2 2 3 2 1 2 3 3 2 2
# 2: 1 3 4 1 4 4 3 2 4 3 3
# 3: 3 4 2 2 3 4 2 2 1 NA 1
# 4: 2 1 4 1 2 3 NA 4 4 4 3
# 5: 1 2 3 3 4 3 3 NA 1 4 1
# 6: 4 3 4 2 2 NA 4 1 2 4 2

Unfortunately, the transpose method fails here: transposing turns 2M rows into 2M columns, which is very expensive:

system.time({
  dt2 = transpose(dt)
  setnafill(dt2, type = 'locf')
  dt2 = transpose(dt2)
  setnames(dt2, names(dt))
})
# Error: cannot allocate vector of size 30.6 Gb

but the for loop (and Reduce, incidentally) works fine:

cols <- setdiff(names(dt),"N")
system.time({
  for (i in seq_along(cols)[-1]) {
    prevcol <- cols[i-1]
    thiscol <- cols[i]
    dt[is.na(get(thiscol)), (thiscol) := fcoalesce(get(thiscol), get(prevcol))]
  }
})
# user system elapsed
# 0.14 0.00 0.11
head(dt)
# a b c d e f g h i j xx
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1: 1 2 2 3 2 1 2 3 3 2 2
# 2: 1 3 4 1 4 4 3 2 4 3 3
# 3: 3 4 2 2 3 4 2 2 1 1 1
# 4: 2 1 4 1 2 3 3 4 4 4 3
# 5: 1 2 3 3 4 3 3 3 1 4 1
# 6: 4 3 4 2 2 2 4 1 2 4 2

If I simplify the problem-set to 600K rows, then I can get both to work. (I don't know the tipping point for my system ... it might be 1M, who knows, I just wanted to compare them side-by-side.) With n <- 6e5 and generating dt, I see the following data and simple timing:

head(dt)
# a b c d e f g h i j xx
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1: 1 2 3 1 3 4 NA 3 3 3 3
# 2: 1 3 2 2 4 3 1 2 2 4 1
# 3: 3 4 2 1 1 1 1 4 2 4 2
# 4: 2 4 1 NA 1 4 3 1 4 1 1
# 5: 1 NA 4 2 NA NA 4 4 2 2 NA
# 6: 4 1 4 4 1 2 3 3 1 1 2

sum(is.na(dt))
# [1] 321782
system.time({
  dt2 = transpose(dt)
  setnafill(dt2, type = 'locf')
  dt2 = transpose(dt2)
  setnames(dt2, names(dt))
})
# user system elapsed
# 4.27 4.50 7.74

sum(is.na(dt)) # 'dt' is unchanged, only important here to compare the 'for' loop
# [1] 321782
sum(is.na(dt2)) # rows with leading columns having 'NA', nothing to coalesce, not surprising
# [1] 30738

cols <- setdiff(names(dt),"N")
system.time({
  for (i in seq_along(cols)[-1]) {
    prevcol <- cols[i-1]
    thiscol <- cols[i]
    dt[is.na(get(thiscol)), (thiscol) := fcoalesce(get(thiscol), get(prevcol))]
  }
})
# user system elapsed
# 0.10 0.03 0.06

identical(dt, dt2)
# [1] TRUE

### regenerate `dt` so it has `NA`s again
system.time({
  dt[, (cols) := Reduce(function(prev, this) fcoalesce(this, prev),
                        .SD, accumulate = TRUE), .SDcols = cols]
})
# user system elapsed
# 0.03 0.00 0.03

identical(dt, dt2)
# [1] TRUE

A more robust benchmark such as bench::mark is going to be encumbered a little by the need to copy(dt) on every pass. Though this overhead is not huge,

bench::mark(copy(dt))
# # A tibble: 1 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 copy(dt) 7.77ms 20.9ms 45.1 25.2MB 0 23 0 510ms <data.table[,11] [600,000 x 11]> <Rprofmem[,3] [14 x 3]> <bch:tm [23]> <tibble [23 x 3]>

it is still extra. As such, I'll benchmark the transpose code twice, once with and once without the copy, to compare it to the for and reduce answers more honestly. (Note that bench::mark's default is to verify that all outputs are identical. This can be disabled, but I have not done so, so all code blocks return the same results.)

bench::mark(
  transpose1 = {
    dt2 = transpose(dt)
    setnafill(dt2, type = 'locf')
    dt2 = transpose(dt2)
    setnames(dt2, names(dt))
    dt2
  },
  transpose2 = {
    dt0 = copy(dt)
    dt2 = transpose(dt0)
    setnafill(dt2, type = 'locf')
    dt2 = transpose(dt2)
    setnames(dt2, names(dt0))
    dt2
  },
  forloop = {
    dt0 <- copy(dt)
    for (i in seq_along(cols)[-1]) {
      prevcol <- cols[i-1]
      thiscol <- cols[i]
      dt0[is.na(get(thiscol)), (thiscol) := fcoalesce(get(thiscol), get(prevcol))]
    }
    dt0
  },
  reduce = {
    dt0 <- copy(dt)
    dt0[, (cols) := Reduce(function(prev, this) fcoalesce(this, prev),
                           .SD, accumulate = TRUE), .SDcols = cols]
  },
  min_iterations = 10)
# Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# # A tibble: 4 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 transpose1 4.94s 5.48s 0.154 1.28GB 0.201 10 13 1.08m <data.table[,11] [600,000 x 11]> <Rprofmem[,3] [33,008 x 3]> <bch:tm [10]> <tibble [10 x 3]>
# 2 transpose2 5.85s 6.29s 0.130 1.3GB 0.259 10 20 1.29m <data.table[,11] [600,000 x 11]> <Rprofmem[,3] [15,316 x 3]> <bch:tm [10]> <tibble [10 x 3]>
# 3 forloop 48.37ms 130.91ms 2.87 71.14MB 0 10 0 3.49s <data.table[,11] [600,000 x 11]> <Rprofmem[,3] [191 x 3]> <bch:tm [10]> <tibble [10 x 3]>
# 4 reduce 48.08ms 75.82ms 4.70 71MB 0.470 10 1 2.13s <data.table[,11] [600,000 x 11]> <Rprofmem[,3] [38 x 3]> <bch:tm [10]> <tibble [10 x 3]>

From this:

  • Time: after normalizing to milliseconds, 4940ms compares poorly against 48ms; and
  • Memory: 1.28GB compares poorly with 71MB.

(Edited to increase the benchmark's minimum iterations to 10.)

Fastest way to replace multiple values in large data.frame

I looked at some sed commands and I figured I should post what I found just in case someone has a similar issue.

The sed command that I found to work in the terminal is the following (this creates a new file, but you don't have to create one)

sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+./.+0.01+g' R.test.txt > R.test.edit.txt

or this works as well from within R

system("sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+./.+0.01+g' R.test.txt > R.test.edit.txt")

You can also use the data.table::fread method mentioned by IceCreamToucan

df <- fread("sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+./.+0.01+g' /R/R.test.txt")

It is interesting to note that the typical form of the sed command is

sed 's/old text/new text/g' file > new.file

but since what I needed to replace already contained a forward slash (/), I used the plus sign (+) as the delimiter so sed doesn't get confused. (Strictly speaking, the dots in ./. are regex metacharacters and would be safer escaped, as in 's+\./\.+0.01+g'; unescaped, the pattern matches any character on either side of a slash.)

I am going to do a performance test using my two older methods (posted above), the new sed method, and F. Prive's method that he posted as an answer. I am going to make a smaller subset of the full dataset because it would take too long to test the four methods.

EDIT

So I tested the four different methods to see which one was fastest. I created a smaller file for the test: 1,000,000 rows and 340 columns.

METHOD 1

lookup_table <- c("0/0" = 0, "0/1" = 1, "1/1" = 2, "./." = 0.01)
df[-(1:2)] <- lapply(df[-(1:2)], function(x) lookup_table[x])

Runtime - 8 minutes
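
(This method works because indexing a named vector with a character vector looks each element up by name; a quick illustration:)

lookup_table[c("0/0", "./.", "1/1")]
#  0/0  ./.  1/1
# 0.00 0.01 2.00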

METHOD 2

replacement <- function(x){
  x = replace(x, which(x == './.'), 0.01)
  x = replace(x, which(x == '0/0'), 0)
  x = replace(x, which(x == '0/1'), 1)
  x = replace(x, which(x == '1/1'), 2)
  x
}
df = apply(df, 2, replacement)
df <- as.data.frame(df)

Runtime - 46 seconds

METHOD 3

df <- df %>% mutate_at(
  vars(-CHROM, -POS),
  funs(case_when(
    . == "0/0" ~ 0,
    . == "0/1" ~ 1,
    . == "1/1" ~ 2,
    . == "./." ~ 0.01
  ))
)

Runtime - 42 seconds
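
(funs() is deprecated in newer dplyr; a modern equivalent, assuming dplyr >= 1.0, uses across():)

df <- df %>% mutate(across(
  !c(CHROM, POS),
  ~ case_when(
    .x == "0/0" ~ 0,
    .x == "0/1" ~ 1,
    .x == "1/1" ~ 2,
    .x == "./." ~ 0.01
  )
))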

METHOD 4

df <- fread("sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+./.+0.01+g' /R/R.test.txt")

Runtime - 2 min 34 seconds, which was surprising

Conclusion - I wasted my time


