Translating SQL Joins on Foreign Keys to R Data.Table Syntax

Good question. Note the following (admittedly buried) in ?data.table:

When i is a data.table, x must have a key. i is joined to x using the key and the rows in x that match are returned. An equi-join is performed between each column in i to each column in x's key. The match is a binary search in compiled C in O(log n) time. If i has less columns than x's key then many rows of x may match to each row of i. If i has more columns than x's key, the columns of i not involved in the join are included in the result. If i also has a key, it is i's key columns that are used to match to x's key columns and a binary merge of the two tables is carried out.

So, the key here is that i doesn't have to be keyed. Only x must be keyed.

X2 <- data.table(id = 11:15, y_id = c(14,14,11,12,12), key="id")
     id y_id
[1,] 11   14
[2,] 12   14
[3,] 13   11
[4,] 14   12
[5,] 15   12
Y2 <- data.table(id = 11:15, b = letters[1:5], key="id")
     id b
[1,] 11 a
[2,] 12 b
[3,] 13 c
[4,] 14 d
[5,] 15 e
Y2[J(X2$y_id)] # binary search for each item of (unsorted and unkeyed) i
     id b
[1,] 14 d
[2,] 14 d
[3,] 11 a
[4,] 12 b
[5,] 12 b

or,

Y2[SJ(X2$y_id)]  # binary merge of keyed i, see ?SJ
     id b
[1,] 11 a
[2,] 12 b
[3,] 12 b
[4,] 14 d
[5,] 14 d

identical(Y2[J(X2$y_id)], Y2[X2$y_id])  # a bare numeric vector in i selects by row number; it is not a join
[1] FALSE
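With more recent data.table releases the same foreign-key lookup can also be written with the on= argument, which avoids setting keys at all. This is only a minimal sketch, assuming a data.table version that supports on= (1.9.6 or later). The table names X and Y and the integer columns are made up for illustration and are not the tables from the answer above.

```r
library(data.table)

# Illustrative tables: X$y_id is a foreign key pointing at Y$id
X <- data.table(id = 11:15, y_id = c(14L, 14L, 11L, 12L, 12L))
Y <- data.table(id = 11:15, b = letters[1:5])

# Look up each X$y_id in Y$id; rows come back in the order of X
res <- Y[X, on = .(id = y_id)]

# Or add the looked-up column to X by reference
X[Y, b := i.b, on = .(y_id = id)]
```

The first form returns a new table, the second modifies X in place. Which one fits depends on whether you want to keep X unchanged.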

data.table joins - Select all columns in the i argument

How about constructing the j-expression and just eval'ing it?

nc = names(current)[-1L]
nn = paste0("i.", nc)
expr = lapply(nn, as.name)
setattr(expr, 'names', nc)
expr = as.call(c(quote(`:=`), expr))

> current[new[c(1,3)], eval(expr)]
> current
##    id var var2
## 1:  1  11   11
## 2:  2   2    2
## 3:  3  13   13
## 4:  4   4    4
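To make the snippet above reproducible, here is a self-contained sketch. The contents of current and new are assumptions chosen to match the printed result (they are not given in the answer), and on = "id" is used so that no key is required.

```r
library(data.table)

# Assumed example data: 'new' holds replacement values for the same ids
current <- data.table(id = 1:4, var = 1:4, var2 = 1:4)
new     <- data.table(id = 1:4, var = 11:14, var2 = 11:14)

# Build the call `:=`(var = i.var, var2 = i.var2) programmatically
nc   <- names(current)[-1L]            # every column except the join column
expr <- lapply(paste0("i.", nc), as.name)
setattr(expr, "names", nc)
expr <- as.call(c(quote(`:=`), expr))

# Update rows 1 and 3 of 'current' from the matching rows of 'new'
current[new[c(1, 3)], eval(expr), on = "id"]
```

Only the ids present in new[c(1, 3)] are touched; the other rows of current keep their old values.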

Better syntax for adding a column from other data.table

We can do this with a join:

A[B, b := b, on = .(index)]

The setkey step is not needed here, because on = names the join column directly.
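A small runnable sketch of that update join follows. The contents of A and B are invented for illustration, and i.b is written out instead of plain b to make it explicit that the value comes from B.

```r
library(data.table)

# Hypothetical tables sharing an 'index' column
A <- data.table(index = 1:5, a = letters[1:5])
B <- data.table(index = c(2L, 4L), b = c("x", "y"))

# Add column b to A by reference; rows of A without a match in B get NA
A[B, b := i.b, on = .(index)]
```

Because := works by reference, there is no need to assign the result back to A.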

SQL convert foreign key table into a specific value from other table

You need to join familyMember to relations twice:

SELECT *
FROM
relations r
INNER JOIN familyMember c ON r.child = c.id
INNER JOIN familyMember f ON r.father = f.id

The details of the child come from the c alias (c.*) and the details of the father from the f alias (f.*). Always remember that you can join the same table multiple times. In cases like this you must, because no single row of familyMember is simultaneously the child and the father. You can't say "relations join familyMember on child = x and father = x": you'll get no rows (unless there is an error in the relations data and someone has been listed as their own parent).

When joining a table multiple times, always give each copy a good alias. Here I use f for father and c for child. fm1 and fm2 would be examples of bad aliases.
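For anyone carrying this over to data.table, the double join can be sketched as two update joins against the same lookup table. The rows below, and the name column on familyMember, are assumptions made up for the example; only the child, father and id columns come from the query above.

```r
library(data.table)

# Hypothetical data: relations holds two foreign keys into familyMember
familyMember <- data.table(id = 1:4, name = c("Adam", "Bob", "Carl", "Dan"))
relations    <- data.table(child = c(3L, 4L), father = c(1L, 2L))

res <- copy(relations)

# First join: resolve the child id (the role of alias c in the SQL)
res[familyMember, child_name := i.name, on = .(child = id)]

# Second join: resolve the father id (the role of alias f in the SQL)
res[familyMember, father_name := i.name, on = .(father = id)]
```

Note that an update join keeps unmatched rows of relations (with NA names), so this behaves like a LEFT JOIN rather than the INNER JOIN in the SQL.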

Merge data tables like data frames in R

Following on from "Translating SQL joins on foreign keys to R data.table syntax":

x2 = data.table(index = 1:10, key = "index")
y2 = data.table(index = c(2,4,6), weight = c(.3,.5,.2), key = "index")
y2[J(x2$index)]
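The same left join can also be written without keys. This is a sketch assuming a data.table version with on= support; the tables are rebuilt here with integer index columns so that both sides have the same type.

```r
library(data.table)

x2 <- data.table(index = 1:10)
y2 <- data.table(index = c(2L, 4L, 6L), weight = c(.3, .5, .2))

# Every row of x2 is kept; weight is NA where y2 has no matching index
res <- y2[x2, on = "index"]

# Equivalent result using the data.frame-style interface
res_merge <- merge(x2, y2, by = "index", all.x = TRUE)
```

Both forms return ten rows, three of which carry a weight.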

Tips or tricks for translating sql joins from literal language to SQL syntax?

  • Use SQL Query Designer to easily build join queries from the visual table collection right there. Then, if you want to learn how it works, simply inspect the generated query; that's how I learned it.
    You won't notice how charming it is till you try it.

  • Visual Representation of SQL Joins - A walkthrough explaining SQL JOINs.

  • Complete ref of SQL-Server Join, Inner Join, Left Outer Join, Right Outer Join, Full Outer Join, in SQL-Server 2005 (view snapshot below).

  • ToTraceString on an Entity Framework ObjectQuery (one that you have added Include shapings to) is also a good way to learn it.

  • SQL-Server Join types (with detailed examples for each join type):

    INNER JOIN - Match rows between the two tables specified in the INNER JOIN statement based on one or more columns having matching data. Preferably the join is based on referential integrity enforcing the relationship between the tables to ensure data integrity.

    Just to add a little commentary to the basic definitions above, in general the INNER JOIN option is considered to be the most common join needed in applications and/or queries. Although that is the case in some environments, it is really dependent on the database design, referential integrity and data needed for the application. As such, please take the time to understand the data being requested then select the proper join option.

    Although most join logic is based on matching values between the two columns specified, it is possible to also include logic using greater than, less than, not equals, etc.

    LEFT OUTER JOIN - Based on the two tables specified in the join clause, all data is returned from the left table. On the right table, the matching data is returned in addition to NULL values where a record exists in the left table, but not in the right table.

    Another item to keep in mind is that the LEFT and RIGHT OUTER JOIN logic is opposite of one another. So you can change either the order of the tables in the specific join statement or change the JOIN from left to right or vice versa and get the same results.

    RIGHT OUTER JOIN - Based on the two tables specified in the join clause, all data is returned from the right table. On the left table, the matching data is returned in addition to NULL values where a record exists in the right table but not in the left table.

    Self Join - In this circumstance, the same table is specified twice with two different aliases in order to match the data within the same table.

    CROSS JOIN - Based on the two tables specified in the join clause, a Cartesian product is created if a WHERE clause does not filter the rows. The size of the Cartesian product is the number of rows in the left table multiplied by the number of rows in the right table. Please exercise caution when using a CROSS JOIN.

    FULL JOIN - Based on the two tables specified in the join clause, all data is returned from both tables regardless of matching data.

SQL Query designer sample

How to join (merge) data frames (inner, outer, left, right)

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)

Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable. I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.

You can merge on multiple columns by giving by a vector, e.g., by = c("CustomerId", "OrderId").

If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2" where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)
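The calls above can be tried on a pair of small data frames. This is just an illustrative sketch; the contents of df1 and df2 (and the Product and State columns) are made up.

```r
# Hypothetical data frames sharing a CustomerId column
df1 <- data.frame(CustomerId = 1:6,
                  Product = c("Toaster", "Toaster", "Toaster",
                              "Radio", "Radio", "Radio"))
df2 <- data.frame(CustomerId = c(2L, 4L, 7L),
                  State = c("Alabama", "Alabama", "Ohio"))

inner <- merge(df1, df2, by = "CustomerId")                # ids in both
outer <- merge(df1, df2, by = "CustomerId", all = TRUE)    # ids in either
left  <- merge(df1, df2, by = "CustomerId", all.x = TRUE)  # all of df1
right <- merge(df1, df2, by = "CustomerId", all.y = TRUE)  # all of df2
cross <- merge(df1, df2, by = NULL)                        # every pairing

# Row counts for each join type
sapply(list(inner = inner, outer = outer, left = left,
            right = right, cross = cross), nrow)
```

Comparing the row counts is a quick sanity check that the join type you picked is the one you meant.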

How to do a data.table merge operation

You are quoting the wrong part of documentation. If you have a look at the doc of [.data.table you will read:

When i is a data.table, x must have a key, meaning join i to x and return the rows in x that match. An equi-join is performed between each column in i to each column in x’s key in order. This is similar to base R functionality of subsetting a matrix by a 2-column matrix, and in higher dimensions subsetting an n-dimensional array by an n-column matrix.

I admit the description of the package (the part you quoted) is somewhat confusing, because it seems to say that the "["-operation can be used instead of merge. But I think what it says is: if x and y are both data.tables we use a join on an index (which is invoked like merge) instead of binary search.


One more thing:

The data.table library I installed via install.packages was missing the merge.data.table method, so using merge would call merge.data.frame. After installing the package from R-Forge, R used the faster merge.data.table method.

You can check if you have the merge.data.table method by checking the output of:

methods(generic.function="merge")
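The same check can be done programmatically. This is a rough sketch assuming a current data.table installation; the tables X and Y are made up, and getS3method comes from the utils package that ships with R.

```r
library(data.table)

# TRUE when an S3 merge method for data.table is registered
has_dt_merge <- !is.null(getS3method("merge", "data.table", optional = TRUE))

# Hypothetical tables to confirm which method gets dispatched
X <- data.table(id = 1:3, a = c("x", "y", "z"))
Y <- data.table(id = 2:4, b = c(TRUE, FALSE, TRUE))

m <- merge(X, Y, by = "id")   # inner join on id
is.data.table(m)              # a data.table result indicates merge.data.table ran
```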

EDIT [Answer no longer valid]: This answer refers to data.table version 1.3. In version 1.5.3 the behaviour of data.table changed and x[y] returns the expected results. Thank you Matthew Dowle, author of data.table, for pointing this out in the comments.
