Update a column of NAs in one data table with the value from a column in another data table
We've created new (and more comprehensive) HTML vignettes for some of the data.table concepts. Have a look here for the other vignettes that we are working on. I'm working on vignettes for joins, which when done will hopefully clarify these type of problems better.
The idea is to first setkey()
on DT1
on the column tract
.
setkey(DT1, tract)
In data.tables, a join of the form x[i]
requires key for x
, but not necessarily for i
. This results in two scenarios:
If
i
also has key set -- the first key column ofi
is matched against first key column ofx
, second against second and so on..If
i
doesn't have key set -- the first column ofi
is matched against the first key column ofx
, second column ofi
against second key column ofx
and so on..
In this case, since your first column in i
is also tract
, we'll skip setting key on i
.
Then, we perform a join of the form x[i]
. By doing this, for each i
the matching row indices in x
is computed, and then the join result is materialised. However, we don't want the entire join result as a new data.table. Rather, we want to update DT1
's CreditScore
column with DT2
's on those matching rows..
In data.tables, we can perform that operation while joining, by providing the expression in j
, as follows:
DT1[DT2, CreditScore := i.CreditScore]
# tract CreditScore
# 1: 36067013000 777
# 2: 36083052304 663
# 3: 36083052403 650
# 4: 36091062602 335
# 5: 36107020401 635
DT1[DT2
part finds the matching rows in DT1
for each row in DT2
. And if there's a match, we want DT2
's value to be updated in DT1
. We accomplish that by using i.CreditScore
-- it refers to DT2
's CreditScore
column (i.
is a prefix used to distinguish columns with identical names between x
and i
data.tables).
Update: As pointed out under comments, the solution above would also update the non-NA values in DT1
. Therefore the way to do it would be:
DT1[is.na(CreditScore), CreditScore := DT2[.(.SD), CreditScore]]
On those rows where CreditScore
from DT1
is NA
, replace CreditScore
from DT1
with the values from CreditScore
obtained from the join of DT2[.(.SD)]
, where .SD
corresponds to the subset of data.table that contains all the rows where CreditScore
is NA
.
HTH
R: Updating NAs in a data table with values of another data table
We can join the two datasets on
'ID', for NA values in 'P', we assign 'P' as 'P0', and then remove the 'P0' by assigning it to 'NULL'.
library(data.table)#v1.9.6+
DT2[DT1, on='ID'][is.na(P), P:= P0][, P0:= NULL][]
Or as @DavidArenburg mentioned, we can use ifelse
condition after joining on 'ID' to replace the NA elements in 'P'.
DT2[DT1, P := ifelse(is.na(P), i.P0, P), on = 'ID']
Fill NA values in one data table with observed values from a second data table in R
We can do this with a join on
the 'Country', 'Year' columns
library(data.table)
nm1 <- names(dt1)[3:4]
nm2 <- paste0("i.", nm1)
dt3 <- copy(dt1)
dt3[dt2, (nm1) := Map(function(x, y)
fifelse(is.na(x), y, x), mget(nm1), mget(nm2)), on = .(Country, Year)]
dt3
# Country Year acctm acctf
#1: FRA 2010 20 20
#2: FRA 2020 30 60
#3: DEU 2010 10 15
#4: DEU 2020 60 80
#5: CHE 2010 20 40
#6: CHE 2020 70 100
Or to make this compact, use fcoalesce
from data.table
(comments from @IceCreamToucan)
dt3[dt2, (nm1) := Map(fcoalesce, mget(nm1), mget(nm2)), on = .(Country, Year)]
If the datasets are of same dimensions and have the same values for 'Country', 'Year', then another option is
library(purrr)
library(dplyr)
list(dt1[, .(acctm, acctf)], dt2[, .(acctm, acctf)]) %>%
reduce(coalesce) %>%
bind_cols(dt1[, .(Country, Year)], .)
Replace NAs in a Single Column of a Data Table in R
Your code isn't off unless the data in the column is not a character in which case you would have to set -999 as inter/numeric without ""
data <- read.table(header=TRUE, text='
id weight size
1 20 small
2 27 large
3 24 medium
')
data <- data.table(data)
> data[size == 'small', weight := NA]
> data
size id weight
1: small 1 NA
2: large 2 27
3: medium 3 24
> is.na(data)
size id weight
[1,] FALSE FALSE TRUE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE FALSE
> data[is.na(weight), weight := -999]
> data
size id weight
1: small 1 -999
2: large 2 27
3: medium 3 24
> data[size == 'small', weight := NA]
> data[is.na(weight), weight := "-999"]
Warning message:
In `[.data.table`(data, is.na(weight), `:=`(weight, "-999")) :
Coerced 'character' RHS to 'integer' to match the column's type.
EDIT: This is, I just saw, what @dracodoc suggested in comment
Update a subset of values in data.table column with values from another data.table column
You can use get
to grab the i.name
variable programmatically in the update join, and stay within standard data.table join operations. Example data and code:
library(data.table)
data <- data.table(snp.gene.key=1:5, dval = letters[1:5])
all_tmp <- data.table(snp.gene.key=1:3, dval=letters[11:13])
setkey(data, snp.gene.key)
setkey(all_tmp, snp.gene.key)
data
# snp.gene.key dval
#1: 1 a
#2: 2 b
#3: 3 c
#4: 4 d
#5: 5 e
Then specify (name)
on the RHS of the :=
assignment so it is interpreted rather than treated literally, along with using get
on the LHS to grab the variable you want for the update join.
name <- "dval"
data[all_tmp, (name) := get(paste0("i.", name)) ]
data
# snp.gene.key dval
#1: 1 k
#2: 2 l
#3: 3 m
#4: 4 d
#5: 5 e
updating a data.table column based on matching to another data.table
We can do a join on
the 'spp' and assign the 'state' with the second dataset 'state' column i.e. i.state
d1[d2, state := i.state, on = .(spp)]
Find all NAs in R data.table
For the sake of completeness, here is a Minimal, Reproducible Example where only row 6 is complete, i.e., without any NA
and the columns are of a different types:
library(data.table)
options(datatable.print.class = TRUE)
n <- 7
dt <- data.table(1:n, pi * as.numeric(1:n),
letters[1:n], rep(c(TRUE, FALSE), length.out = n),
factor(LETTERS[1:n]))
for (i in 1:ncol(dt)) set(dt, i, i, NA)
for (i in 1:ncol(dt)) set(dt, nrow(dt), i, NA)
dt
V1 V2 V3 V4 V5
<int> <num> <char> <lgcl> <fctr>
1: NA 3.141593 a TRUE A
2: 2 NA b FALSE B
3: 3 9.424778 <NA> TRUE C
4: 4 12.566371 d NA D
5: 5 15.707963 e TRUE <NA>
6: 6 18.849556 f FALSE F
7: NA NA <NA> NA <NA>
alodi's answer
works as expected:
dt[!complete.cases(dt)]
V1 V2 V3 V4 V5
<int> <num> <char> <lgcl> <fctr>
1: NA 3.141593 a TRUE A
2: 2 NA b FALSE B
3: 3 9.424778 <NA> TRUE C
4: 4 12.566371 d NA D
5: 5 15.707963 e TRUE <NA>
6: NA NA <NA> NA <NA>
clemenskuehn's answer
fails
dt[is.na(rowSums(dt))]
Error: 'x' must be numeric
because it assumes all columns of dt
are numeric.
Count the NAs in each row
dt[rowSums(is.na(dt)) > 0]
V1 V2 V3 V4 V5
<int> <num> <char> <lgcl> <fctr>
1: NA 3.141593 a TRUE A
2: 2 NA b FALSE B
3: 3 9.424778 <NA> TRUE C
4: 4 12.566371 d NA D
5: 5 15.707963 e TRUE <NA>
6: NA NA <NA> NA <NA>
This displays all rows where at least one NA
is found.
Filling a column based on the value of another column in data.table
In two steps:
dat[,estimation:=zoo::na.locf(type)][amount_of_categories!=1&is.na(type) ,estimation:=NA][]
amount_of_categories municipality type estimation
<int> <fctr> <char> <char>
1: 2 Area A cat_1 cat_1
2: 2 Area A cat_1 cat_1
3: 2 Area A cat_1 cat_1
4: 2 Area A cat_1 cat_1
5: 2 Area A cat_1 cat_1
6: 2 Area A cat_1 cat_1
7: 2 Area A cat_1 cat_1
8: 2 Area A cat_1 cat_1
9: 2 Area A cat_1 cat_1
10: 2 Area A cat_1 cat_1
11: 2 Area A cat_1 cat_1
12: 2 Area A <NA> <NA>
13: 2 Area A cat_2 cat_2
14: 1 Area B <NA> cat_2
15: 1 Area B <NA> cat_2
16: 1 Area B cat_2 cat_2
17: 1 Area B cat_2 cat_2
18: 1 Area B cat_2 cat_2
19: 1 Area B cat_2 cat_2
20: 1 Area B cat_2 cat_2
amount_of_categories municipality type estimation
Note that I used zoo::na.locf
because data.table::nafill(type='locf')
doesn't yet work with characters.
An alternative approach by municipality
with na.fill
following you edit (example 2):
dat[,estimation:=zoo::na.fill(type,fill=type[which.max(!is.na(type))]),by=municipality][amount_of_categories!=1&is.na(type) ,estimation:=NA][]
amount_of_categories municipality type estimation
<int> <fctr> <char> <char>
1: 2 Area A cat_1 cat_1
2: 2 Area A cat_1 cat_1
3: 2 Area A cat_1 cat_1
4: 2 Area A cat_1 cat_1
5: 2 Area A cat_1 cat_1
6: 2 Area A cat_1 cat_1
7: 2 Area A cat_1 cat_1
8: 2 Area A cat_1 cat_1
9: 2 Area A cat_1 cat_1
10: 2 Area A cat_1 cat_1
11: 2 Area A cat_1 cat_1
12: 2 Area A <NA> <NA>
13: 2 Area A cat_2 cat_2
14: 1 Area B <NA> cat_3
15: 1 Area B <NA> cat_3
16: 1 Area B cat_3 cat_3
17: 1 Area B cat_3 cat_3
18: 1 Area B cat_3 cat_3
19: 1 Area B cat_3 cat_3
20: 1 Area B cat_3 cat_3
amount_of_categories municipality type estimation
R: Change value to NA in one column if in another column the datapoint is also NA
Here is a possible solution using an ifelse
statement:
library(dplyr)
df %>%
mutate(Sum_Size= ifelse(is.na(Category), NA_real_, Sum_Size))
Name Date Category Size Sum_Size
<chr> <chr> <int> <int> <dbl>
1 A 01.09.2018 1 34 34
2 A 02.09.2018 1 23 23
3 A 03.09.2018 NA 12 NA
4 A 05.11.2021 NA 53 NA
5 A 06.11.2021 2 23 35
6 A 07.11.2021 2 53 53
7 B 01.09.2018 3 23 23
8 B 02.09.2018 3 54 54
9 B 03.09.2018 3 65 65
10 B 05.11.2021 NA 75 NA
11 B 06.11.2021 NA 67 NA
12 B 07.11.2021 4 45 45
13 C 01.09.2018 4 45 45
14 C 02.09.2018 2 23 23
15 C 03.09.2018 NA 23 NA
16 C 05.11.2021 2 12 12
17 C 06.11.2021 2 12 35
18 C 07.11.2021 NA NA NA
Related Topics
Different Robust Standard Errors of Logit Regression in Stata and R
Lapply to Add Columns to Each Dataframe in a List
Why Is Subsetting on a "Logical" Type Slower Than Subsetting on "Numeric" Type
Dual Y Axis in Ggplot2 for Multiple Panel Figure
Add Text on Right of Shinydashboard Header
Changing Format of Some Axis Labels in Ggplot2 According to Condition
Ggpairs Plot with Heatmap of Correlation Values
As.Date(As.Posixct()) Gives the Wrong Date
Gsub in R with Unicode Replacement Give Different Results Under Windows Compared with Unix
How to Read the Source Code for an R Function
Automate Zip File Reading in R
Spatialpolygons - Creating a Set of Polygons in R from Coordinates
Legend of a Raster Map with Categorical Data
Format Text Inside R Code Chunk
Consolidating Data Frames in R