Why is allow.cartesian required at times when when joining data.tables with duplicate keys?
You don't have to avoid duplicate keys. As long as the result does not get bigger than max(nrow(x), nrow(i))
, you won't get this error, even if you've duplicates. It is basically a precautionary measure.
When you've duplicate keys, the resulting join can sometimes get much bigger. Since data.table
knows the total number of rows that'll result from this join early enough, it provides this error message and asks you to use the argument allow.cartesian=TRUE
if you're really sure.
Here's an (exaggerated) example that illustrates the idea behind this error message:
require(data.table)
DT1 <- data.table(x=rep(letters[1:2], c(1e2, 1e7)),
y=1L, key="x")
DT2 <- data.table(x=rep("b", 3), key="x")
# not run
# DT1[DT2] ## error
dim(DT1[DT2, allow.cartesian=TRUE])
# [1] 30000000 2
The duplicates in DT2
resulted in 3 times the total number of "a" in DT1
(=1e7). Imagine if you performed the join with 1e4 values in DT2
, the results would explode! To avoid this, there's the allow.cartesian
argument which by default is FALSE.
That being said, I think Matt once mentioned that it maybe possible to just provide the error in case of "large" joins (or joins that results in huge number of rows - which might be set arbitrarily I guess). This, when/if implemented, will make the join properly without this error message in case of joins that don't combinatorially explode.
data.table cartesian join warning on legitimate join
Question 1:
Looks great! But the equivalent of inner join is to also add nomatch=0L
. Otherwise you'd also get all the rows from dtMoveB
. We can't make use of by=.EACHI
here AFAICT.
Read this answer and the answer linked under this comment for understanding the purpose of allow.cartesian = TRUE
.
Question 2: From ?data.table
under the entry for allow.cartesian
:
FALSE
prevents joins that would result in more thanmax(nrow(x), nrow(i))
rows. This is usually caused by duplicate values ini
's join columns, each of which join to the same group inx
over and over again: a misspecified join.Usually this was not intended and the join needs to be changed. The word 'cartesian' is used loosely in this context. The traditional cartesian join is (deliberately) difficult to achieve in data.table: where every row in
i
joins to every row inx
(anrow(x) * nrow(i)
row result). 'cartesian' is just meant in a 'large multiplicative' sense.
Does this answer your question?
Question 3:
Yes. Joins are of the form x[i]
. When x
and i
both share a column name and will be in the joined result, those columns have a i.
prefix added to it. It's the same prefix that also allows you to access i
's columns in j
, where both x
and i
share the column name in operations of the form x[i, j]
or x[i, j, by=.EACHI]
.
While joining you can change the names to whatever you like though. Here, you could change to position.x
and position.y
with:
dtMoveA[dtMoveB, .(RobotID, Position.x=Position, Position.y=i.Position,
AbsDistance = abs(Position - i.Position)), allow.cartesian=TRUE]
HTH
PS: If you've any suggestions, feel free to add a FR here. Please have a look at the posting guidelines before to do so.
Joining data.table with by argument
If, I understood your requirement correctly, There is a direct merge option that you can use,
dx <- data.table(a = c(1,1,2,2), b = 3:6)
dy <- data.table(a = c(1,1,2), c = 7:9)
merge(x = dx, y = dy, by = "a", all = TRUE)
It gives your desired output that you mentioned.
How to join (merge) data frames (inner, outer, left, right)?
I hope it clears your doubt if not, I am sory.
r data.table ( = 1.9.4) join behaviour
Just to keep this answered, this behaviour with allow.cartesian
has been fixed in the current development version v1.9.5
, and will be soon available on CRAN as v1.9.6
. Odd versions are devel, and even stable. From NEWS:
allow.cartesian
is ignored during joins when:
i
has no duplicates andmult="all"
. Closes #742. Thanks to @nigmastar for the report.- assigning by reference, i.e.,
j
has:=
. Closes #800. Thanks to @matthieugomez for the report.
In both these cases (and during a
not-join
which was already fixed in 1.9.4),allow.cartesian
can be safely ignored.
Related Topics
Why Does Unlist() Kill Dates in R
Adding Space Between Bars in Ggplot2
Subset a Column in Data Frame Based on Another Data Frame/List
Remove All Punctuation Except Apostrophes in R
Dplyr Mutate Rowwise Max of Range of Columns
Twitter, Roauth and Windows: Register Ok, But Certificate Verify Failed
How to Use R with Google Colaboratory
Check for Installed Packages Before Running Install.Packages()
How to Use Objects from Global Environment in Rstudio Markdown
Remove Multiple Objects with Rm()
Creating a Prompt/Answer System to Input Data into R
How to Install Development Version of R Packages Github Repository
Display a Time Clock in the R Command Line
Adaptive Moving Average - Top Performance in R