Replacing all missing values in R data.table with a value
is.na (being a primitive) has relatively little overhead and is usually quite fast. So you can just loop through the columns and use set to replace NA with 0.
Using <- to assign will result in a copy of all the columns, and this is not the idiomatic data.table way.
First I'll illustrate how to do it, and then show how slow the copying approach can get on huge data.
One way to do this efficiently:
for (i in seq_along(tt)) set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
You'll get a warning here that "0" is being coerced to character to match the type of column. You can ignore it.
Why you shouldn't use <- here:
# by reference - idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# modifies value by reference - no copy
system.time({
for (i in seq_along(tt))
set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
})
# user system elapsed
# 0.284 0.083 0.386
# by copy - NOT the idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# makes copy
system.time({tt[is.na(tt)] <- 0})
# a bunch of "tracemem" output showing the copies being made
# user system elapsed
# 4.110 0.976 5.187
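To make the by-reference behaviour easier to see than in the large benchmark, here is a minimal sketch on a tiny table. The table toy and its columns are hypothetical, not from the original question; the columns are numeric, so no coercion warning is expected here. address() from data.table is used to check that the table object was not reallocated.

```r
library(data.table)

# Hypothetical toy table with numeric columns (not from the original post)
toy <- data.table(a = c(1, NA, 3), b = c(NA, 2, NA))
addr_before <- address(toy)

# Replace NA with 0 column by column, modifying toy in place
for (j in seq_along(toy)) {
  set(toy, i = which(is.na(toy[[j]])), j = j, value = 0)
}

print(toy)

# Compare the memory address of the table before and after the loop
identical(addr_before, address(toy))
```

On character columns the same loop should still work, but value = 0 would then be coerced to match the column type, which is where the warning mentioned above comes from.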
Replace NULL with NA in r data.table with lists
My toy example is too small to compare timings, but combining both solutions suggested by @B. Christian Kamgang and @Ronak Shah works well for me:
# Function to replace NULL with NA in lists:
null2na <- function(dtcol){
  # lengths() is 0 for NULL (or otherwise empty) list elements
  fullcol = replace(dtcol, lengths(dtcol) == 0L, NA)
  return(fullcol)
}
# Apply function to dataset:
dt[, names(dt) := lapply(.SD, null2na)]
Two things I found advantageous with this approach (thanks to both respondents for suggesting):
Avoiding base R ifelse, dplyr::if_else and data.table::fifelse. Base R ifelse converts all columns to a list unless you specify them beforehand, and the dplyr and data.table versions, while they respect the original column classes, don't work in this scenario because NA is interpreted as differing in type from the other values in the list.
Using lengths(dtcol) == 0L targets only the list elements that are NULL and doesn't do anything to the other columns or values. This means it is not necessary to specify the subset of list columns beforehand, as it inherently deals only with those.
I've gone with replace() rather than subsetting dtcol in the function, as I think the former might be slightly faster with larger datasets (but I have yet to test that).
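A small runnable sketch of the approach, since the original toy data is not shown here. The table dt below, with an atomic id column and a list column vals, is a hypothetical stand-in.

```r
library(data.table)

# Same helper as above, written compactly
null2na <- function(dtcol) {
  replace(dtcol, lengths(dtcol) == 0L, NA)
}

# Hypothetical toy data: one atomic column, one list column containing a NULL
dt <- data.table(id = 1:3, vals = list(1:2, NULL, "x"))

# Apply to every column; atomic columns have no zero-length elements,
# so they pass through unchanged
dt[, names(dt) := lapply(.SD, null2na)]

str(dt$vals)
```

Note that lengths() is also 0 for empty vectors such as character(0), so those would be replaced with NA as well.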
Replace NAs with previous values using data.table
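You can use zoo::na.locf().
assets[, XPTO := zoo::na.locf(XPTO, na.rm = FALSE)]
To answer jblood94's question: this function fills each NA with the most recent non-NA value. na.rm = FALSE keeps any leading NAs, so the result has the same length as the column.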
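As a possible alternative without the zoo dependency, data.table ships its own nafill() with type = "locf". This is a sketch under assumptions: the column is numeric, the installed data.table is recent enough to include nafill, and the toy values and the XPTO_filled column name are made up for illustration.

```r
library(data.table)

# Hypothetical toy data; only the column name XPTO comes from the answer above
assets <- data.table(XPTO = c(NA, 1.5, NA, NA, 2.0, NA))

# Last observation carried forward; a leading NA has nothing before it and stays NA
assets[, XPTO_filled := nafill(XPTO, type = "locf")]

print(assets)
```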
Data.table replace sequence of values with NA
Try to use
dt[2:5, (specific_column) := NA]
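A minimal sketch of that one-liner, assuming specific_column is a character variable holding the column name (the parentheses make := use the variable's value rather than creating a column literally called specific_column). The table and its values are hypothetical.

```r
library(data.table)

# Hypothetical toy data
dt <- data.table(id = 1:6, score = c(10, 20, 30, 40, 50, 60))

# Name of the column to modify, held in a variable
specific_column <- "score"

# Set rows 2 to 5 of that column to NA, by reference
dt[2:5, (specific_column) := NA]

print(dt)
```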
replacing all NA with a 0 in data.table in R
We can either specify the .SDcols with the names of the columns ('nm1'), loop over the .SD (Subset of Data.table) and assign the NA to 0 (replace_na from tidyr)
library(data.table)
library(tidyr)
nm1 <- paste0("claim", 9:12, "month")
setDT(claimsMonthly)[, (nm1) := lapply(.SD, replace_na, 0), .SDcols = nm1]
Or, as @jangorecki mentioned in the comments, nafill from data.table would be better
setDT(claimsMonthly)[, (nm1) := lapply(.SD, nafill, fill = 0), .SDcols = nm1]
Or, using a loop with set, assign 0 to the columns of interest based on the NA values in each column, by specifying i (the row index) and j (the column index/name)
for(j in nm1){
set(claimsMonthly, i = which(is.na(claimsMonthly[[j]])), j =j, value = 0)
}
Or with setnafill
setnafill(claimsMonthly, cols = nm1, fill = 0)
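Since claimsMonthly itself is not shown, here is a self-contained sketch with a hypothetical stand-in table that has the same column names as nm1. It uses setnafill on a copy so the NA counts before and after can be compared.

```r
library(data.table)

# Hypothetical stand-in for claimsMonthly (values are made up)
claimsMonthly <- data.table(
  id = 1:3,
  claim9month  = c(10, NA, 30),
  claim10month = c(NA, NA, 5),
  claim11month = c(1, 2, 3),
  claim12month = c(NA, 7, NA)
)
nm1 <- paste0("claim", 9:12, "month")

# Work on a copy so the original stays available for comparison
filled <- copy(claimsMonthly)
setnafill(filled, cols = nm1, fill = 0)

# Count NAs per column before and after
claimsMonthly[, lapply(.SD, function(x) sum(is.na(x))), .SDcols = nm1]
filled[, lapply(.SD, function(x) sum(is.na(x))), .SDcols = nm1]
```

setnafill and nafill are meant for numeric columns; for character columns the set loop above is the safer choice.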
Replace NAs in a Single Column of a Data Table in R
Your code isn't off, unless the data in the column is not character, in which case you would have to set -999 as integer/numeric, without the quotes:
data <- read.table(header=TRUE, text='
id weight size
1 20 small
2 27 large
3 24 medium
')
data <- data.table(data)
> data[size == 'small', weight := NA]
> data
size id weight
1: small 1 NA
2: large 2 27
3: medium 3 24
> is.na(data)
size id weight
[1,] FALSE FALSE TRUE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE FALSE
> data[is.na(weight), weight := -999]
> data
size id weight
1: small 1 -999
2: large 2 27
3: medium 3 24
> data[size == 'small', weight := NA]
> data[is.na(weight), weight := "-999"]
Warning message:
In `[.data.table`(data, is.na(weight), `:=`(weight, "-999")) :
Coerced 'character' RHS to 'integer' to match the column's type.
EDIT: This is, I just saw, what @dracodoc suggested in comment
Replace all values in a data.table given a condition
A native data.table way to do this would be:
for(col in names(ppp)) set(ppp, i=which(ppp[[col]]==6), j=col, value=NA)
# Test
> ppp
A B C D
1: 1 NA 1 3
2: 2 5 2 4
3: 3 4 3 5
4: 4 3 4 NA
5: 5 2 5 7
6: NA 1 NA 8
This approach, while perhaps more verbose, is nevertheless going to be significantly faster than ppp[ppp == 6] <- NA, because it avoids the copying of all columns.
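The input table is not shown above, so the following sketch uses a hypothetical ppp chosen to contain a few 6s; it is not the data from the original question. It applies the same set loop and then counts the resulting NAs per column.

```r
library(data.table)

# Hypothetical input (integer columns), not the original poster's data
ppp <- data.table(A = 1:6, B = 6:1, C = 1:6, D = 3:8)

# Replace every 6 with NA, column by column, by reference
for (col in names(ppp)) {
  set(ppp, i = which(ppp[[col]] == 6), j = col, value = NA)
}

# Number of NAs introduced in each column
na_counts <- sapply(ppp, function(x) sum(is.na(x)))
print(na_counts)
```

which() drops any NA produced by the comparison, so columns that already contain missing values are handled without extra care.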