Data.table Inner/Outer Join with NA in Join Column of Type Double Bug

data.table inner/outer join with NA in join column of type double bug?

Yes, it looks like an (embarrassing) new bug related to the NA in the key. There have been other discussions about NA in key not being possible, but I didn't realise it could mess up in that way. Will investigate. Thanks ...

#2453 NA in double key column messes up joins (NA in integer and character ok)

Now fixed in 1.8.7 (commit 780), from NEWS :

NA in a join column of type double could cause both X[Y] and merge(X,Y) to return incorrect results, #2453. Due to an errant x==NA_REAL in the C source which should have been ISNA(x). Support for double in keyed joins is a relatively recent addition to data.table, but embarrassing all the same. Fixed and tests added. Many thanks to statquant for the thorough and reproducible report.
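
For context, a minimal sketch of the kind of join affected (the data here is illustrative, not taken from the original report):

library(data.table)
# a join column of type double that contains NA
X <- data.table(k = c(1.5, 2.5, NA_real_), x = 1:3, key = "k")
Y <- data.table(k = c(NA_real_, 2.5),      y = 1:2, key = "k")

# with data.table >= 1.8.7 both forms give the expected result
X[Y]
merge(X, Y)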

How does one do a full join using data.table?

You actually have it right there. Use merge.data.table, which is exactly what you are doing when you call

merge(a, b, by = "dog", all = TRUE)

since a is a data.table, merge(a, b, ...) dispatches to merge.data.table(a, b, ...).
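
A small self-contained illustration (the data here is made up, since only the merge call appears in the question):

library(data.table)
a <- data.table(dog = c("collie", "beagle"), va = 1:2)
b <- data.table(dog = c("beagle", "poodle"), vb = 3:4)

merge(a, b, by = "dog", all = TRUE)   # full outer join: all rows from both tables, NA where there is no match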

merge.data.table with all=TRUE introduces an NA row. Is this correct?

The example in the question is far too simple to show the problem, hence the confusion and discussion. Using two one-column data.tables isn't enough to show what merge does!

Here's a better example :

> a = data.table(P=1:2,Q=3:4,key='P')
> b = data.table(P=2:3,R=5:6,key='P')
> a
P Q
1: 1 3
2: 2 4
> b
P R
1: 2 5
2: 3 6
> merge(a,b) # correct
P Q R
1: 2 4 5
> merge(a,b,all=TRUE) # correct.
P Q R
1: 1 3 NA
2: 2 4 5
3: 3 NA 6
> merge(a,b[0],all=TRUE) # incorrect result when y is empty, agreed
P Q R
1: NA NA NA
2: NA NA NA
3: 1 3 NA
4: 2 4 NA
> merge.data.frame(a,b[0],all=TRUE) # correct
P Q R
1 1 3 NA
2 2 4 NA

Ricardo got to the bottom of this and fixed it in v1.8.9. From NEWS :

merge no longer returns spurious NA row(s) when y is empty and all.y=TRUE (or all=TRUE), #2633. Thanks to Vinicius Almendra for reporting. Test added.

Left join using data.table

You can try this:

# example data; key 'B' on the column used for the join
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14, key = 'a')

# left join: keep every row of A, add matching columns from B
B[A]
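
In recent data.table versions (>= 1.9.6) you can skip setting the key and give the join column with on= instead; a minor variant of the same left join:

B[A, on = 'a']   # all rows of A, with matching rows of B joined on column 'a'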

R Data.Table Join - Transform 'Missing' from NA to a Default Value

You can do it this way :

right[left, list(value=default(value,"none"))]

Which gives :

    k value
1: NA none
2: 1 a
3: 2 b
4: 3 none
5: 4 none

Your solution doesn't work because when you do value := default(value,"none"), the default function is only applied to the value column of right, i.e. default(c("a","b"),"none"). The value column is then updated with the result of that function only for the rows that had a value before the join. The other rows of left, which have no corresponding row in right, get NA instead.

Sorry, not sure my explanation is very clear...
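
To make it concrete, here is a minimal sketch; the default() helper and the left/right tables are assumed reconstructions of the question's setup, not shown in this excerpt:

library(data.table)
default <- function(x, d) ifelse(is.na(x), d, x)   # assumed helper: replace NA with a fallback value

left  <- data.table(k = c(NA, 1:4))
right <- data.table(k = 1:2, value = c("a", "b"), key = "k")

# join right to left and apply the fallback to the joined column
right[left, .(k = i.k, value = default(value, "none")), on = "k"]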

merging data.tables based on column names

Update: Since data.table v1.9.6 (released September 19, 2015), merge.data.table() accepts and handles the arguments by.x= and by.y=. The feature request referenced below has since been closed.


Yes, this is a feature request not yet implemented :

FR#2033 Add by.x and by.y to merge.data.table

There isn't anything preventing it; it just hasn't been done yet. I very rarely need merge and was slow to realise its usefulness more generally. We've made good progress in bringing merge performance up to that of X[Y], and this feature request is among the highest priorities. If you'd like it sooner you are more than welcome to add those arguments to merge.data.table and commit the change yourself. We try to keep the source code short and together in one function/file, so by looking at the merge.data.table source hopefully you can follow it and see what needs to be done.
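
With data.table >= 1.9.6 this now works directly; a quick sketch with made-up tables:

library(data.table)
DT1 <- data.table(id_a = 1:3, v1 = letters[1:3])
DT2 <- data.table(id_b = 2:4, v2 = LETTERS[2:4])

merge(DT1, DT2, by.x = "id_a", by.y = "id_b")   # join on differently named columns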

Can someone explain how mult works in data.table when performing updates in joins (using .EACHI and mult)?

I'd like to use the first value in the right table to override the value of the left table

Select the first values and update with them alone:

X[unique(Y, by="xx", fromLast=FALSE), on=.(x=xx), y := i.y]

x y t
1: a 6 0
2: a 6 1
3: b NA 2
4: c 3 3
5: d 2 4

fromLast= can select the first or last row when dropping dupes.


How multiple matches are handled:

In x[i, mult=], if a row of i has multiple matches, mult determines which matching row(s) of x are selected. This explains the results shown in the OP.
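
For example, with illustrative data (not from the OP):

library(data.table)
x <- data.table(a = c(1, 1, 2), v = 1:3)
i <- data.table(a = 1)

x[i, on = "a"]                  # mult = "all" (default): both matching rows of x
x[i, on = "a", mult = "first"]  # only the first matching row of x
x[i, on = "a", mult = "last"]   # only the last matching row of x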

In x[i, v := i.v], if multiple rows of i match to the same row in x, all of the relevant i-rows write to the x-row sequentially, so the last i-row gets the final write. Turn on verbose output to see how many edits are made in an update -- it will exceed the number of x rows in this case (because the rows are edited repeatedly):

options(datatable.verbose=TRUE)
data.table(a=1,b=2)[.(a=1, b=3:4), on=.(a), b := i.b][]
# Assigning to 2 row subset of 1 rows
a b
1: 1 4

Rolling join two data.tables with date in R

We may use a non-equi join:

dt1[dt2, date_2 := date2, on = .(group, date1 > date2), mult = "first"]
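
A sketch with assumed example data (dt1/dt2 aren't shown in this excerpt), just to show the shape of the call:

library(data.table)
dt1 <- data.table(group = c("A", "A", "B"),
                  date1 = as.IDate(c("2020-01-05", "2020-02-01", "2020-01-10")))
dt2 <- data.table(group = c("A", "B"),
                  date2 = as.IDate(c("2020-01-01", "2020-01-08")))

# for each dt2 row, find dt1 rows in the same group with date1 > date2
# and write date2 into the first such row
dt1[dt2, date_2 := date2, on = .(group, date1 > date2), mult = "first"]
dt1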

pandas - Merging on string columns not working (bug?)

The issue was that the object dtype is misleading. I thought it meant that all items were strings. But apparently, while reading the file, pandas was converting some elements to ints and leaving the rest as strings.

The solution was to make sure that every field is a string:

>>> df1.col1 = df1.col1.astype(str)
>>> df2.col2 = df2.col2.astype(str)

Then the merge works as expected.

(I wish there was a way of specifying a dtype of str...)

Merge multiple data tables with duplicate column names

Here's a way of keeping a counter within Reduce, if you want to rename during the merge:

Reduce((function() {
  counter <- 0                    # counter lives in the closure's environment
  function(x, y) {
    counter <<- counter + 1
    d <- merge(x, y, all = TRUE, by = 'x')
    # rename the newly merged 'y' column to a unique name: y.1, y.2, ...
    setnames(d, c(head(names(d), -1), paste0('y.', counter)))
  }
})(), list(DT1, DT2, DT3, DT4, DT5))
# x y.x y.1 y.2 y.3 y.4
#1: a 10 11 12 13 14
#2: b 11 12 13 14 15
#3: c 12 13 14 15 16
#4: d 13 14 15 16 17
#5: e 14 15 16 17 18
#6: f 15 16 17 18 19
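
For reference, input tables consistent with the output above could look like this (an assumption; the question's data isn't shown here):

library(data.table)
DTs <- lapply(1:5, function(i) data.table(x = letters[1:6], y = (9 + i):(14 + i)))
names(DTs) <- paste0("DT", 1:5)
list2env(DTs, envir = .GlobalEnv)   # creates DT1 ... DT5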

