Replacing All Missing Values in R Data.Table with a Value

Replacing all missing values in R data.table with a value

is.na (being a primitive) has relatively very less overhead and is usually quite fast. So, you can just loop through the columns and use set to replace NA with0`.

Using <- to assign will result in a copy of all the columns and this is not the idiomatic way using data.table.

First I'll illustrate as to how to do it and then show how slow this can get on huge data (due to the copy):

One way to do this efficiently:

for (i in seq_along(tt)) set(tt, i=which(is.na(tt[[i]])), j=i, value=0)

You'll get a warning here that "0" is being coerced to character to match the type of column. You can ignore it.

Why shouldn't you use <- here:

# by reference - idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# modifies value by reference - no copy
system.time({
for (i in seq_along(tt))
set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
})
# user system elapsed
# 0.284 0.083 0.386

# by copy - NOT the idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# makes copy
system.time({tt[is.na(tt)] <- 0})
# a bunch of "tracemem" output showing the copies being made
# user system elapsed
# 4.110 0.976 5.187

Replace NULL with NA in r data.table with lists

My toy example is too small to compare timings, but combining both solutions suggested by @B. Christian Kamgang and @Ronak Shah works well for me:

# Function to replace NULL with NA in lists:
null2na <- function(dtcol){
fullcol = replace(dtcol, lengths(dtcol) == 0L, NA)
return(fullcol)

# Apply function to dataset:
dt[, names(dt) := lapply(.SD, null2na)]

Two things I found advantageous with this approach (thanks to both respondants for suggesting):

  1. Avoiding use of base r ifelse, dplyr::if_else and data.table::fifelse; base r ifelse converts all columns to a list unless you specify them before-hand, and the dplyr and data.table versions of ifelse, while they respect the original column classes don't work in this scenario because NA is interpreted as differing in type from the other values in the list.

  2. The use of the function lengths(dtcol) == 0L targets specifically only the list elements that are null and doesn't do anything to the other columns or values. This means that it is not necessary to specify the subset of columns that are lists before-hand, as inherently it deals only with those.

  3. I've gone with replace() rather than subsetting dtcol in the function as I think with larger datasets the former might be slightly faster (but have yet to test that).

Replace NAs with previous values using data.table

You can use zoo::na.locf().

assets[,XPTO := zoo::na.locf(XPTO)]

To answer jblood94 question, this function fills all NAs with the latest Non-NA value.

Data.table replace sequence of values with NA

Try to use

dt[2:5, (specific_column) := NA]

replacing all NA with a 0 in data.table in R

We can either specify the .SDcols with the names of the columns ('nm1'), loop over the .SD (Subset of Data.table) and assign the NA to 0 (replace_na from tidyr)

library(data.table)
library(tidyr)
nm1 <- paste0("claim", 9:12, "month")
setDT(claimsMonthly)[, (nm1) := lapply(.SD, replace_na, 0), .SDcols = nm1]

Or as @jangorecki mentioned in the comments, nafill from data.table would be better

setDT(claimsMonthly)[, (nm1) := lapply(.SD, nafill, fill = 0), .SDcols = nm1]

or using a loop with set, assign the columns of interest with 0 based on the NA values in each column by specifying the i (for row index) and j for column index/name

for(j in nm1){
set(claimsMonthly, i = which(is.na(claimsMonthly[[j]])), j =j, value = 0)
}

Or with setnafill

setnafill(claimsMonthly, cols = nm1, fill = 0)

Replace NAs in a Single Column of a Data Table in R

Your code isn't off unless the data in the column is not a character in which case you would have to set -999 as inter/numeric without ""

data <- read.table(header=TRUE, text='
id weight size
1 20 small
2 27 large
3 24 medium
')

data <- data.table(data)

> data[size == 'small', weight := NA]
> data
size id weight
1: small 1 NA
2: large 2 27
3: medium 3 24
> is.na(data)
size id weight
[1,] FALSE FALSE TRUE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE FALSE
> data[is.na(weight), weight := -999]
> data
size id weight
1: small 1 -999
2: large 2 27
3: medium 3 24
> data[size == 'small', weight := NA]
> data[is.na(weight), weight := "-999"]
Warning message:
In `[.data.table`(data, is.na(weight), `:=`(weight, "-999")) :
Coerced 'character' RHS to 'integer' to match the column's type.

EDIT: This is, I just saw, what @dracodoc suggested in comment

Replace all values in a data.table given a condition

A native data.table way to do this would be:

for(col in names(ppp)) set(ppp, i=which(ppp[[col]]==6), j=col, value=NA)
# Test
> ppp
A B C D
1: 1 NA 1 3
2: 2 5 2 4
3: 3 4 3 5
4: 4 3 4 NA
5: 5 2 5 7
6: NA 1 NA 8

This approach - while perhaps more verbose - is nevertheless going to be significantly faster than ppp[ppp == 6] <- NA, because it avoids the copying of all columns.



Related Topics



Leave a reply



Submit