Update a Column of Nas in One Data Table with the Value from a Column in Another Data Table

Update a column of NAs in one data table with the value from a column in another data table

We've created new (and more comprehensive) HTML vignettes for some of the data.table concepts. Have a look here for the other vignettes that we are working on. I'm working on vignettes for joins, which when done will hopefully clarify these type of problems better.


The idea is to first setkey() on DT1 on the column tract.

setkey(DT1, tract)

In data.tables, a join of the form x[i] requires key for x, but not necessarily for i. This results in two scenarios:

  • If i also has key set -- the first key column of i is matched against first key column of x, second against second and so on..

  • If i doesn't have key set -- the first column of i is matched against the first key column of x, second column of i against second key column of x and so on..

In this case, since your first column in i is also tract, we'll skip setting key on i.

Then, we perform a join of the form x[i]. By doing this, for each i the matching row indices in x is computed, and then the join result is materialised. However, we don't want the entire join result as a new data.table. Rather, we want to update DT1's CreditScore column with DT2's on those matching rows..

In data.tables, we can perform that operation while joining, by providing the expression in j, as follows:

DT1[DT2, CreditScore := i.CreditScore]
# tract CreditScore
# 1: 36067013000 777
# 2: 36083052304 663
# 3: 36083052403 650
# 4: 36091062602 335
# 5: 36107020401 635

DT1[DT2 part finds the matching rows in DT1 for each row in DT2. And if there's a match, we want DT2's value to be updated in DT1. We accomplish that by using i.CreditScore -- it refers to DT2's CreditScore column (i. is a prefix used to distinguish columns with identical names between x and i data.tables).


Update: As pointed out under comments, the solution above would also update the non-NA values in DT1. Therefore the way to do it would be:

DT1[is.na(CreditScore), CreditScore := DT2[.(.SD), CreditScore]]

On those rows where CreditScore from DT1 is NA, replace CreditScore from DT1 with the values from CreditScore obtained from the join of DT2[.(.SD)], where .SD corresponds to the subset of data.table that contains all the rows where CreditScore is NA.

HTH

R: Updating NAs in a data table with values of another data table

We can join the two datasets on 'ID', for NA values in 'P', we assign 'P' as 'P0', and then remove the 'P0' by assigning it to 'NULL'.

library(data.table)#v1.9.6+
DT2[DT1, on='ID'][is.na(P), P:= P0][, P0:= NULL][]

Or as @DavidArenburg mentioned, we can use ifelse condition after joining on 'ID' to replace the NA elements in 'P'.

DT2[DT1, P := ifelse(is.na(P), i.P0, P), on = 'ID']

Fill NA values in one data table with observed values from a second data table in R

We can do this with a join on the 'Country', 'Year' columns

library(data.table)
nm1 <- names(dt1)[3:4]
nm2 <- paste0("i.", nm1)
dt3 <- copy(dt1)
dt3[dt2, (nm1) := Map(function(x, y)
fifelse(is.na(x), y, x), mget(nm1), mget(nm2)), on = .(Country, Year)]
dt3
# Country Year acctm acctf
#1: FRA 2010 20 20
#2: FRA 2020 30 60
#3: DEU 2010 10 15
#4: DEU 2020 60 80
#5: CHE 2010 20 40
#6: CHE 2020 70 100

Or to make this compact, use fcoalesce from data.table (comments from @IceCreamToucan)

dt3[dt2,  (nm1) := Map(fcoalesce, mget(nm1), mget(nm2)), on = .(Country, Year)]

If the datasets are of same dimensions and have the same values for 'Country', 'Year', then another option is

library(purrr)
library(dplyr)
list(dt1[, .(acctm, acctf)], dt2[, .(acctm, acctf)]) %>%
reduce(coalesce) %>%
bind_cols(dt1[, .(Country, Year)], .)

Replace NAs in a Single Column of a Data Table in R

Your code isn't off unless the data in the column is not a character in which case you would have to set -999 as inter/numeric without ""

data <- read.table(header=TRUE, text='
id weight size
1 20 small
2 27 large
3 24 medium
')

data <- data.table(data)

> data[size == 'small', weight := NA]
> data
size id weight
1: small 1 NA
2: large 2 27
3: medium 3 24
> is.na(data)
size id weight
[1,] FALSE FALSE TRUE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE FALSE
> data[is.na(weight), weight := -999]
> data
size id weight
1: small 1 -999
2: large 2 27
3: medium 3 24
> data[size == 'small', weight := NA]
> data[is.na(weight), weight := "-999"]
Warning message:
In `[.data.table`(data, is.na(weight), `:=`(weight, "-999")) :
Coerced 'character' RHS to 'integer' to match the column's type.

EDIT: This is, I just saw, what @dracodoc suggested in comment

Update a subset of values in data.table column with values from another data.table column

You can use get to grab the i.name variable programmatically in the update join, and stay within standard data.table join operations. Example data and code:

library(data.table)
data <- data.table(snp.gene.key=1:5, dval = letters[1:5])
all_tmp <- data.table(snp.gene.key=1:3, dval=letters[11:13])
setkey(data, snp.gene.key)
setkey(all_tmp, snp.gene.key)

data
# snp.gene.key dval
#1: 1 a
#2: 2 b
#3: 3 c
#4: 4 d
#5: 5 e

Then specify (name) on the RHS of the := assignment so it is interpreted rather than treated literally, along with using get on the LHS to grab the variable you want for the update join.

name <- "dval"
data[all_tmp, (name) := get(paste0("i.", name)) ]

data
# snp.gene.key dval
#1: 1 k
#2: 2 l
#3: 3 m
#4: 4 d
#5: 5 e

updating a data.table column based on matching to another data.table

We can do a join on the 'spp' and assign the 'state' with the second dataset 'state' column i.e. i.state

d1[d2, state := i.state, on = .(spp)]

Find all NAs in R data.table

For the sake of completeness, here is a Minimal, Reproducible Example where only row 6 is complete, i.e., without any NA and the columns are of a different types:

library(data.table)
options(datatable.print.class = TRUE)
n <- 7
dt <- data.table(1:n, pi * as.numeric(1:n),
letters[1:n], rep(c(TRUE, FALSE), length.out = n),
factor(LETTERS[1:n]))

for (i in 1:ncol(dt)) set(dt, i, i, NA)
for (i in 1:ncol(dt)) set(dt, nrow(dt), i, NA)
dt
      V1        V2     V3     V4     V5
<int> <num> <char> <lgcl> <fctr>
1: NA 3.141593 a TRUE A
2: 2 NA b FALSE B
3: 3 9.424778 <NA> TRUE C
4: 4 12.566371 d NA D
5: 5 15.707963 e TRUE <NA>
6: 6 18.849556 f FALSE F
7: NA NA <NA> NA <NA>

alodi's answer

works as expected:

dt[!complete.cases(dt)]
      V1        V2     V3     V4     V5
<int> <num> <char> <lgcl> <fctr>
1: NA 3.141593 a TRUE A
2: 2 NA b FALSE B
3: 3 9.424778 <NA> TRUE C
4: 4 12.566371 d NA D
5: 5 15.707963 e TRUE <NA>
6: NA NA <NA> NA <NA>

clemenskuehn's answer

fails

dt[is.na(rowSums(dt))]
Error: 'x' must be numeric

because it assumes all columns of dt are numeric.

Count the NAs in each row

dt[rowSums(is.na(dt)) > 0]
      V1        V2     V3     V4     V5
<int> <num> <char> <lgcl> <fctr>
1: NA 3.141593 a TRUE A
2: 2 NA b FALSE B
3: 3 9.424778 <NA> TRUE C
4: 4 12.566371 d NA D
5: 5 15.707963 e TRUE <NA>
6: NA NA <NA> NA <NA>

This displays all rows where at least one NA is found.

Filling a column based on the value of another column in data.table

In two steps:

dat[,estimation:=zoo::na.locf(type)][amount_of_categories!=1&is.na(type) ,estimation:=NA][]

amount_of_categories municipality type estimation
<int> <fctr> <char> <char>
1: 2 Area A cat_1 cat_1
2: 2 Area A cat_1 cat_1
3: 2 Area A cat_1 cat_1
4: 2 Area A cat_1 cat_1
5: 2 Area A cat_1 cat_1
6: 2 Area A cat_1 cat_1
7: 2 Area A cat_1 cat_1
8: 2 Area A cat_1 cat_1
9: 2 Area A cat_1 cat_1
10: 2 Area A cat_1 cat_1
11: 2 Area A cat_1 cat_1
12: 2 Area A <NA> <NA>
13: 2 Area A cat_2 cat_2
14: 1 Area B <NA> cat_2
15: 1 Area B <NA> cat_2
16: 1 Area B cat_2 cat_2
17: 1 Area B cat_2 cat_2
18: 1 Area B cat_2 cat_2
19: 1 Area B cat_2 cat_2
20: 1 Area B cat_2 cat_2
amount_of_categories municipality type estimation

Note that I used zoo::na.locf because data.table::nafill(type='locf') doesn't yet work with characters.

An alternative approach by municipality with na.fill following you edit (example 2):

dat[,estimation:=zoo::na.fill(type,fill=type[which.max(!is.na(type))]),by=municipality][amount_of_categories!=1&is.na(type) ,estimation:=NA][]

amount_of_categories municipality type estimation
<int> <fctr> <char> <char>
1: 2 Area A cat_1 cat_1
2: 2 Area A cat_1 cat_1
3: 2 Area A cat_1 cat_1
4: 2 Area A cat_1 cat_1
5: 2 Area A cat_1 cat_1
6: 2 Area A cat_1 cat_1
7: 2 Area A cat_1 cat_1
8: 2 Area A cat_1 cat_1
9: 2 Area A cat_1 cat_1
10: 2 Area A cat_1 cat_1
11: 2 Area A cat_1 cat_1
12: 2 Area A <NA> <NA>
13: 2 Area A cat_2 cat_2
14: 1 Area B <NA> cat_3
15: 1 Area B <NA> cat_3
16: 1 Area B cat_3 cat_3
17: 1 Area B cat_3 cat_3
18: 1 Area B cat_3 cat_3
19: 1 Area B cat_3 cat_3
20: 1 Area B cat_3 cat_3
amount_of_categories municipality type estimation

R: Change value to NA in one column if in another column the datapoint is also NA

Here is a possible solution using an ifelse statement:

library(dplyr)

df %>%
mutate(Sum_Size= ifelse(is.na(Category), NA_real_, Sum_Size))
   Name  Date       Category  Size Sum_Size
<chr> <chr> <int> <int> <dbl>
1 A 01.09.2018 1 34 34
2 A 02.09.2018 1 23 23
3 A 03.09.2018 NA 12 NA
4 A 05.11.2021 NA 53 NA
5 A 06.11.2021 2 23 35
6 A 07.11.2021 2 53 53
7 B 01.09.2018 3 23 23
8 B 02.09.2018 3 54 54
9 B 03.09.2018 3 65 65
10 B 05.11.2021 NA 75 NA
11 B 06.11.2021 NA 67 NA
12 B 07.11.2021 4 45 45
13 C 01.09.2018 4 45 45
14 C 02.09.2018 2 23 23
15 C 03.09.2018 NA 23 NA
16 C 05.11.2021 2 12 12
17 C 06.11.2021 2 12 35
18 C 07.11.2021 NA NA NA


Related Topics



Leave a reply



Submit