Why Is Allow.Cartesian Required at Times When When Joining Data.Tables with Duplicate Keys

Why is allow.cartesian required at times when when joining data.tables with duplicate keys?

You don't have to avoid duplicate keys. As long as the result does not get bigger than max(nrow(x), nrow(i)), you won't get this error, even if you've duplicates. It is basically a precautionary measure.

When you've duplicate keys, the resulting join can sometimes get much bigger. Since data.table knows the total number of rows that'll result from this join early enough, it provides this error message and asks you to use the argument allow.cartesian=TRUE if you're really sure.

Here's an (exaggerated) example that illustrates the idea behind this error message:

require(data.table)
DT1 <- data.table(x=rep(letters[1:2], c(1e2, 1e7)),
y=1L, key="x")
DT2 <- data.table(x=rep("b", 3), key="x")

# not run
# DT1[DT2] ## error

dim(DT1[DT2, allow.cartesian=TRUE])
# [1] 30000000 2

The duplicates in DT2 resulted in 3 times the total number of "a" in DT1 (=1e7). Imagine if you performed the join with 1e4 values in DT2, the results would explode! To avoid this, there's the allow.cartesian argument which by default is FALSE.

That being said, I think Matt once mentioned that it maybe possible to just provide the error in case of "large" joins (or joins that results in huge number of rows - which might be set arbitrarily I guess). This, when/if implemented, will make the join properly without this error message in case of joins that don't combinatorially explode.

data.table cartesian join warning on legitimate join

Question 1:

Looks great! But the equivalent of inner join is to also add nomatch=0L. Otherwise you'd also get all the rows from dtMoveB. We can't make use of by=.EACHI here AFAICT.

Read this answer and the answer linked under this comment for understanding the purpose of allow.cartesian = TRUE.

Question 2: From ?data.table under the entry for allow.cartesian:

FALSE prevents joins that would result in more than max(nrow(x), nrow(i)) rows. This is usually caused by duplicate values in i's join columns, each of which join to the same group in x over and over again: a misspecified join.

Usually this was not intended and the join needs to be changed. The word 'cartesian' is used loosely in this context. The traditional cartesian join is (deliberately) difficult to achieve in data.table: where every row in i joins to every row in x (a nrow(x) * nrow(i) row result). 'cartesian' is just meant in a 'large multiplicative' sense.

Does this answer your question?

Question 3:

Yes. Joins are of the form x[i]. When x and i both share a column name and will be in the joined result, those columns have a i. prefix added to it. It's the same prefix that also allows you to access i's columns in j, where both x and i share the column name in operations of the form x[i, j] or x[i, j, by=.EACHI].

While joining you can change the names to whatever you like though. Here, you could change to position.x and position.y with:

dtMoveA[dtMoveB, .(RobotID, Position.x=Position, Position.y=i.Position,
AbsDistance = abs(Position - i.Position)), allow.cartesian=TRUE]

HTH


PS: If you've any suggestions, feel free to add a FR here. Please have a look at the posting guidelines before to do so.

Joining data.table with by argument

If, I understood your requirement correctly, There is a direct merge option that you can use,

dx <- data.table(a = c(1,1,2,2), b = 3:6)
dy <- data.table(a = c(1,1,2), c = 7:9)
merge(x = dx, y = dy, by = "a", all = TRUE)

It gives your desired output that you mentioned.
How to join (merge) data frames (inner, outer, left, right)?

I hope it clears your doubt if not, I am sory.

r data.table ( = 1.9.4) join behaviour

Just to keep this answered, this behaviour with allow.cartesian has been fixed in the current development version v1.9.5, and will be soon available on CRAN as v1.9.6. Odd versions are devel, and even stable. From NEWS:


  1. allow.cartesian is ignored during joins when:

    • i has no duplicates and mult="all". Closes #742. Thanks to @nigmastar for the report.
    • assigning by reference, i.e., j has :=. Closes #800. Thanks to @matthieugomez for the report.

    In both these cases (and during a not-join which was already fixed in 1.9.4), allow.cartesian can be safely ignored.



Related Topics



Leave a reply



Submit