R Data.Table Join: SQL "Select *" Alike Syntax in Joined Tables

R data.table join: SQL select * alike syntax in joined tables?

This should answer your need precisely.

It uses a very powerful R feature called computing on the language (or metaprogramming), which is well described in the official R Language Definition manual. This is an exceptional feature of the R language and should not be forgotten, IMO.

library(data.table)
DT1 = data.table(x=c("c", "a", "b", "a", "b"), a=1:5)
DT2 = data.table(x=c("d", "c", "b"), b=6:8)

jj = as.call(c(
  list(as.name(".")),
  list(sum = quote(a+b)),
  lapply(unique(c(names(DT1), names(DT2))), as.name)
))
print(jj)
#.(sum = a + b, x, a, b)
DT1[DT2, eval(jj), on="x"]
#   sum x  a b
#1:  NA d NA 6
#2:   8 c  1 7
#3:  11 b  3 8
#4:  13 b  5 8
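
To see what the as.call construction does without data.table in the picture, here is a minimal base R sketch. The names cols and row_env and their contents are made up for illustration; the sketch builds the call list(sum = a + b, a, b) from a character vector of names and evaluates it against a plain list.

```r
# Hypothetical column names and data, used only for illustration.
cols <- c("a", "b")
row_env <- list(a = 1:3, b = 4:6)

# Build the unevaluated call list(sum = a + b, a, b) piece by piece.
jj <- as.call(c(
  list(as.name("list")),
  list(sum = quote(a + b)),
  lapply(cols, as.name)
))
print(jj)

# Evaluate the call with the list elements visible as variables.
res <- eval(jj, row_env)
str(res)
```

The data.table version above works the same way, except that the head of the call is . (data.table's alias for list) and the evaluation happens inside j, where the columns of both tables are visible.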

How to join (merge) data frames (inner, outer, left, right)

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)

Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable. I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.

You can merge on multiple columns by passing a vector to by, e.g., by = c("CustomerId", "OrderId").

If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2" where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)
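
As a quick check of the join types listed above, here is a small sketch in base R. The contents of df1 and df2 are invented for illustration (CustomerId 2 and 4 appear in both frames); only the merge arguments come from the answer.

```r
# Hypothetical example data: CustomerId 2 and 4 appear in both frames.
df1 <- data.frame(CustomerId = 1:4,
                  Product = c("Toaster", "Radio", "Kettle", "Lamp"))
df2 <- data.frame(CustomerId = c(2L, 4L, 6L),
                  State = c("Alabama", "Ohio", "Texas"))

inner <- merge(df1, df2, by = "CustomerId")                # rows matched in both
outer <- merge(df1, df2, by = "CustomerId", all = TRUE)    # all rows from both
left  <- merge(df1, df2, by = "CustomerId", all.x = TRUE)  # every row of df1
right <- merge(df1, df2, by = "CustomerId", all.y = TRUE)  # every row of df2
cross <- merge(df1, df2, by = NULL)                        # every pairing of rows

# Row counts per join type.
sapply(list(inner = inner, outer = outer, left = left,
            right = right, cross = cross), nrow)
```

With this data the row counts are 2, 5, 4, 3 and 12 respectively, which makes it easy to see which rows each variant keeps.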

data.table joins - Select all columns in the i argument

How about constructing the j-expression and just eval'ing it?

nc = names(current)[-1L]
nn = paste0("i.", nc)
expr = lapply(nn, as.name)
setattr(expr, 'names', nc)
expr = as.call(c(quote(`:=`), expr))

current[new[c(1,3)], eval(expr)]
current
##    id var var2
## 1:  1  11   11
## 2:  2   2    2
## 3:  3  13   13
## 4:  4   4    4
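
The snippet above relies on current and new from the question, which are not shown here. The following self-contained sketch uses made-up stand-ins for both tables (an assumption, chosen only so that the printed result has the same shape) and passes on = "id" so that no key is needed.

```r
library(data.table)

# Hypothetical inputs; the tables from the original question are not shown.
current <- data.table(id = 1:4, var = 1:4, var2 = 1:4)
new     <- data.table(id = 1:3, var = 11:13, var2 = 11:13)

nc   <- names(current)[-1L]            # columns to overwrite: var, var2
expr <- lapply(paste0("i.", nc), as.name)
names(expr) <- nc
expr <- as.call(c(quote(`:=`), expr))  # `:=`(var = i.var, var2 = i.var2)

# Update the rows of 'current' whose id matches rows 1 and 3 of 'new'.
current[new[c(1, 3)], eval(expr), on = "id"]
print(current)
```

Because := updates by reference, current itself is modified; rows with id 2 and 4 have no match and keep their old values.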

Left join using data.table

You can try this:

# used data
# set the key in 'B' to the column which you use to join
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14, key = 'a')

B[A]
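
If you would rather not set a key, the same left join can be sketched with the on= argument (available in data.table 1.9.6 and later) or with merge. This is only an illustration on the same A and B; the suffixes are arbitrary labels picked for this example.

```r
library(data.table)

A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14)   # no key set in this sketch

# Ad hoc join: keep every row of A, look up matches in B by column 'a'.
res_on <- B[A, on = "a"]

# merge() form of the same left join; all.x keeps every row of A.
res_merge <- merge(A, B, by = "a", all.x = TRUE, suffixes = c(".A", ".B"))

print(res_on)
print(res_merge)
```

In both results the rows with a = 1 and a = 4 carry NA in the column that comes from B, since B has no matching row for them.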

Translating SQL joins on foreign keys to R data.table syntax

Good question. Note the following (admittedly buried) in ?data.table :

When i is a data.table, x must have a key. i is joined to x using the key and the rows in x that match are returned. An equi-join is performed between each column in i to each column in x's key. The match is a binary search in compiled C in O(log n) time. If i has less columns than x's key then many rows of x may match to each row of i. If i has more columns than x's key, the columns of i not involved in the join are included in the result. If i also has a key, it is i's key columns that are used to match to x's key columns and a binary merge of the two tables is carried out.

So, the key here is that i doesn't have to be keyed. Only x must be keyed.

X2 <- data.table(id = 11:15, y_id = c(14,14,11,12,12), key="id")
     id y_id
[1,] 11   14
[2,] 12   14
[3,] 13   11
[4,] 14   12
[5,] 15   12
Y2 <- data.table(id = 11:15, b = letters[1:5], key="id")
     id b
[1,] 11 a
[2,] 12 b
[3,] 13 c
[4,] 14 d
[5,] 15 e
Y2[J(X2$y_id)] # binary search for each item of (unsorted and unkeyed) i
     id b
[1,] 14 d
[2,] 14 d
[3,] 11 a
[4,] 12 b
[5,] 12 b

or,

Y2[SJ(X2$y_id)]  # binary merge of keyed i, see ?SJ
     id b
[1,] 11 a
[2,] 12 b
[3,] 12 b
[4,] 14 d
[5,] 14 d

identical(Y2[J(X2$y_id)], Y2[X2$y_id])
[1] FALSE
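
In data.table 1.9.6 and later, the join columns can also be named with on=, in which case neither table needs a key. A minimal sketch of the same foreign-key lookup, assuming unkeyed copies of X2 and Y2 with an integer y_id:

```r
library(data.table)

# Same shape as X2 and Y2 above, but without keys (an assumption of this sketch).
X2 <- data.table(id = 11:15, y_id = c(14L, 14L, 11L, 12L, 12L))
Y2 <- data.table(id = 11:15, b = letters[1:5])

# For each row of X2, look up the Y2 row whose id equals X2's y_id.
res <- Y2[X2, on = c(id = "y_id")]
print(res)
```

The rows come back in the order of X2, so the b column reads d, d, a, b, b, matching the J() example above.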

In a join, how to prefix all column names with the table it came from

You could

select ah.*, l.*, u.*, pi.* from ...

then the columns will be returned ordered by table at least.

For better distinction between every two sets of columns, you could also add "delimiter" columns like this:

select ah.*, ':', l.*, ':', u.*, ':', pi.* from ...
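
If the result ends up in R anyway, the prefixing can also be done there before joining. A rough sketch in base R, where the tables users and logs, their columns, and the helper prefix_cols are all hypothetical stand-ins:

```r
# Hypothetical tables standing in for two of the joined tables.
users <- data.frame(id = 1:2, name = c("ann", "bob"))
logs  <- data.frame(id = 1:3, user_id = c(1L, 1L, 2L),
                    name = c("login", "edit", "login"))

# Prefix every column with a table alias, like u.* and l.* with labels.
prefix_cols <- function(df, alias) {
  setNames(df, paste0(alias, ".", names(df)))
}

joined <- merge(prefix_cols(logs, "l"), prefix_cols(users, "u"),
                by.x = "l.user_id", by.y = "u.id")
names(joined)
```

Every column in joined now carries the alias of the table it came from, so the two name columns no longer collide.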


Alias for column name on a SELECT * join

It looks like you have to give your column names aliases.

SELECT client.column1 as col1, client.column2 as col2, person.column1 as colp1 FROM client INNER JOIN person ON client.personID = person.personID

Of course, replace the column names with the real column names and use more appealing aliases.

Let us know if it helps

UPDATE #1

I tried creating two tables with sqlfiddle in MySQL 5.5 and 5.6.

See link: http://sqlfiddle.com/#!9/e70ab/1

It works as expected.

Maybe you could share your table schema.

Here's the example code:

CREATE TABLE Person
(
personID int,
name varchar(255)
);

CREATE TABLE Client
(
ID int,
name varchar(255),
personID int
);

insert into Person values(1, 'person1');
insert into Person values(2, 'person2');
insert into Person values(3, 'person3');

insert into Client values(1, 'client1', 1);
insert into Client values(2, 'client2', 1);
insert into Client values(3, 'client1', 1);

SELECT * FROM client
INNER JOIN person
ON client.personID = person.personID;

how to sort order of LEFT JOIN in SQL query?

Try using MAX with a GROUP BY.

SELECT u.userName, MAX(c.carPrice)
FROM users u
LEFT JOIN cars c ON u.id = c.belongsToUser
WHERE u.id = 4
GROUP BY u.userName;

Further information on GROUP BY

The GROUP BY clause is used to split the selected records into groups based on unique combinations of the GROUP BY columns. This then allows us to use aggregate functions (e.g. MAX, MIN, SUM, AVG, ...) that are applied to each group of records in turn. The database returns a single result record for each group.

For example, if we have a set of records representing temperatures over time and location in a table like this:

Location  Time   Temperature
--------  -----  -----------
London    12:00  10.0
Bristol   12:00  12.0
Glasgow   12:00  5.0
London    13:00  14.0
Bristol   13:00  13.0
Glasgow   13:00  7.0
...

If we want to find the maximum temperature by location, we need to split the temperature records into groups, where each record in a particular group has the same location. We then find the maximum temperature of each group. The query to do this would be as follows:

SELECT Location, MAX(Temperature)
FROM Temperatures
GROUP BY Location;
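
Since most of this page works in R, the same grouping can be sketched there as well. This uses base R's aggregate on the six sample rows from the table above; the data frame name temperatures is just a label chosen for this example.

```r
# The sample rows from the table above, as a data frame.
temperatures <- data.frame(
  Location    = c("London", "Bristol", "Glasgow",
                  "London", "Bristol", "Glasgow"),
  Time        = c("12:00", "12:00", "12:00", "13:00", "13:00", "13:00"),
  Temperature = c(10, 12, 5, 14, 13, 7)
)

# One result row per Location, holding that group's maximum Temperature.
max_by_location <- aggregate(Temperature ~ Location,
                             data = temperatures, FUN = max)
print(max_by_location)
```

As in the SQL version, the result has one row per location: 13 for Bristol, 7 for Glasgow and 14 for London.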

