Perform a Semi-Join with data.table

Perform a semi-join with data.table

More possibilities:

w = unique(x[y,which=TRUE])  # the row numbers in x which have a match from y
x[w]

If there are duplicate key values in x, this needs allow.cartesian=TRUE:

w = unique(x[y,which=TRUE,allow.cartesian=TRUE])
x[w]

Or, the other way around:

setkey(y,x)                                 # key y by its column named x
w = !is.na(y[x,which=TRUE,mult="first"])    # TRUE for each row of x that has a match in y
x[w]

If nrow(x) << nrow(y) then the y[x] approach should be faster.

If nrow(x) >> nrow(y) then the x[y] approach should be faster.

But the anti-anti-join (a double negation that amounts to a semi-join) appeals too :-)
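
To make this concrete, here's a minimal, self-contained sketch; the toy tables and the id column are my own, not from the original question:

library(data.table)

# Keep every row of x whose id appears in y, each at most once
x = data.table(id = c(1, 2, 2, 3, 4), vx = letters[1:5], key = "id")
y = data.table(id = c(2, 3, 3), vy = c("p", "q", "r"), key = "id")

w = unique(x[y, which=TRUE, allow.cartesian=TRUE, nomatch=0L])  # nomatch=0L guards against y ids absent from x
x[w]   # rows with ids 2, 2, 3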

Filter data.table with another data.table with different column names

If I'm understanding the question correctly, this is a join of dt with dt_score_1 on the conditions area = zone and cluster = cluster_mode.

dt[dt_score_1, on = .(area = zone, cluster = cluster_mode)]

#     record area score cluster i.score cluster_pct cluster_freq record_freq
#  1:      1    A     1       X       1   100.00000            2           2
#  2:      2    A     1       X       1   100.00000            2           2
#  3:      7    B     1       X       1    66.66667            2           3
#  4:      8    B     1       X       1    66.66667            2           3
#  5:     11    C     2       X       1   100.00000            1           1
#  6:     12    C     1       X       1   100.00000            1           1
#  7:     14    D     1       Z       1    80.00000            4           5
#  8:     15    D     1       Z       1    80.00000            4           5
#  9:     16    D     1       Z       1    80.00000            4           5
# 10:     17    D     1       Z       1    80.00000            4           5
# 11:     20    D     3       Z       1    80.00000            4           5
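
Note that this result carries columns from both tables. If you only want dt's rows filtered (a true semi-join), the which=TRUE idiom from the first answer carries over; a sketch reusing the question's table and column names:

# Semi-join variant: keep only dt's own rows and columns
w = unique(dt[dt_score_1, on = .(area = zone, cluster = cluster_mode),
              which = TRUE, nomatch = 0L])
dt[w]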

For a more detailed explanation of join-as-filter, see the link below, posted by @Frank.

Perform a semi-join with data.table

R - Most efficient way to remove all non-matched rows in a data.table rolling join (instead of 2-step procedure with semi join)

Essentially, I found two approaches that are both viable solutions.

Solution 1

The first, proposed by lil_barnacle, is an elegant one-liner that reads as follows:

# Rolling join with the nomatch argument set to 0
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE, nomatch=0]

Original approach

Setting the nomatch argument to 0 (nomatch = 0) is equivalent to doing the rolling join first and the semi-join thereafter.

# Rolling join without specified nomatch argument 
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]

# Semi-join required
dt_final <- sjoin(dt_res, dt, by = "id")
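
Note that sjoin() is a helper defined in the original question, not a data.table built-in. A minimal semi-join helper along those lines might look like this (my own sketch):

# Hypothetical semi-join helper: rows of x that have a match in y on the `by` column(s)
sjoin = function(x, y, by) {
  w = unique(x[y, on = by, which = TRUE, nomatch = 0L])
  x[w]
}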

Solution 2

Second, the solution I came up with was to 'align' both data sets before the rolling join by filtering on the join variable, like so:

# Align data sets by filtering according to the join variable
dt_fill <- dt_fill[id_2 %in% dt[ , unique(id)]]

# Rolling join without need to specify nomatch argument
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]

Unable to perform rolling join on time and another variable in data.table

This seemed to solve the problem:

test = df2[df1, on=.(bike_id, time = time), roll="nearest",
           .(bike_id, time, st_x, st_y)]
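
For reference, here's what roll = "nearest" does on toy data; df1 and df2 mirror the answer's names, but the values are made up:

library(data.table)

df1 = data.table(bike_id = 1L, time = as.POSIXct("2020-01-01 10:00:04"))
df2 = data.table(bike_id = 1L,
                 time = as.POSIXct("2020-01-01 10:00:00") + c(0, 10),
                 st_x = c(5.1, 5.3), st_y = c(52.0, 52.1))

# Each df1 row is matched to the df2 row with the closest time per bike_id
df2[df1, on=.(bike_id, time), roll="nearest", .(bike_id, time, st_x, st_y)]
# picks the 10:00:00 ping (st_x = 5.1), which is 4s away rather than 6s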

R Multiple condition join using data.table

A solution I found that avoids a cross join first prepares the data by removing rows where A and B are entirely out of range:

Prep = dt1[A >= min(lookup$minA) & A <= max(lookup$maxA) & B >= min(lookup$minB)]

Then you build a data.table of the first positions where each condition is met, corresponding to the lowest possible Val:

Indices = Prep[, list(min(which(A >= lookup$minA)),
                      min(which(A <= lookup$maxA)),
                      min(which(B >= lookup$minB)), A, B), by=ID]

Then you must get Val at the lowest point where all three conditions are satisfied:

Indices[,list(Val=lookup$Val[max(V1,V2,V3)], A, B),by=ID]

See if this gets you what you're looking for:

    ID Val     A    B
 1: 19 0.1 10000 2000
 2: 20 0.1 20000 2000
 3: 21 0.2 30000 2000
 4: 22 0.3 40000 2000
 5: 23 0.3 50000 2000
 6: 24 0.3 60000 2000
 7: 25 0.3 70000 2000
 8: 26 0.5 80000 2000
 9: 27 0.5 90000 2000
10: 28 0.1 10000 3000
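
As an aside, data.table 1.9.8+ supports non-equi joins, which can express such range conditions directly. A sketch under the assumption that lookup has columns minA, maxA, minB and Val as used above (an upper bound on B, if one exists, would be added the same way):

# Non-equi update join: pull Val from the lookup row whose ranges cover A and B
# (if several lookup rows match, the last match wins)
dt1[lookup, on = .(A >= minA, A <= maxA, B >= minB), Val := i.Val]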

How to perform a serial join in data.table?

IIUC, I'd formulate your question as follows: for each dict$cat, I'd like to get prod(y) corresponding to each sample for that cat, and then sum them all up.

Let's construct this step by step now:

  1. For each dict$cat - sounds like you need to group by cat:

    dict[, ,by=cat]

    All that's left is to fill up j properly.

  2. You need to get prod(y) from x for each sample for this group:

    x[samp %in% .SD$samp, prod(y), by=samp]

    extracts those rows from x corresponding to this group's samp (using .SD which stands for subset of data) and computes prod(y) on them, grouped by samp. Great!

  3. We still need to sum them.

    sum(x[samp %in% .SD$samp, prod(y), by=samp]$V1)
  4. We have the complete j expression. Let's plug it all in:

    dict[, sum(x[samp %in% .SD$samp, prod(y), by=samp]$V1), by=cat]
    #    cat         V1
    # 1:   1  1.7770272
    # 2:   2  0.7578771
    # 3:   3 -1.0295633

Hope this helps.


Note 1: there's some redundant computation of prod(y) here, but the upside is that we don't materialise much intermediate data, so it's memory efficient. If you have too many groups, this might get slower, and you might want to compute prod(y) in a separate variable like so:

x_p = x[, .(p = prod(y)), by=samp]

With this, we can simplify j as follows:

dict[, x_p[samp %in% .SD$samp, sum(p)], by=cat]

Note 2: the %in% expression creates an auto index on x's samp column on the first run, so binary-search-based subsetting is used from then on. There is therefore no need to worry about the performance of vector scans.
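
To see all the pieces run together, here's a self-contained toy setup; the shapes (x with columns samp and y, dict with cat and samp) follow the question, but the values are invented:

library(data.table)
set.seed(1)

x    = data.table(samp = rep(1:4, each = 2), y = rnorm(8))
dict = data.table(cat = c(1, 1, 2, 2), samp = 1:4)

# Direct j expression from step 4
dict[, sum(x[samp %in% .SD$samp, prod(y), by=samp]$V1), by=cat]

# Variant with prod(y) precomputed once
x_p = x[, .(p = prod(y)), by=samp]
dict[, x_p[samp %in% .SD$samp, sum(p)], by=cat]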

How to perform join over date ranges using data.table?

You can use the foverlaps() function, which implements joins over intervals efficiently. In your case, we just need a dummy column in measurments.

Note 1: You should install the development version of data.table, v1.9.5, as a bug with foverlaps() has been fixed there. You can find the installation instructions here.

Note 2: I'll call whatWasMeasured = dt1 and measurments = dt2 here for convenience.

require(data.table) ## 1.9.5+
dt2[, dummy := time]

setkey(dt1, start, end)
ans = foverlaps(dt2, dt1, by.x=c("time", "dummy"), nomatch=0L)[, dummy := NULL]

See ?foverlaps for more info and this post for a performance comparison.
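
A self-contained sketch of the idea with toy tables; as in the answer, dt1 needs start/end columns and dt2 a time column, and the values here are invented:

library(data.table)   # 1.9.5+

dt1 = data.table(start = c(1L, 10L), end = c(5L, 15L), what = c("a", "b"))
dt2 = data.table(time = c(2L, 7L, 12L), val = c(0.1, 0.2, 0.3))

dt2[, dummy := time]                  # zero-width interval [time, time]
setkey(dt1, start, end)

ans = foverlaps(dt2, dt1, by.x=c("time", "dummy"), nomatch=0L)[, dummy := NULL]
ans   # time 2 falls in [1,5], time 12 in [10,15]; time 7 overlaps nothing and is dropped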

What is a semi-join in a database?

A simple example. Let's select students with grades using a left outer join:

SELECT DISTINCT s.id
FROM students s
LEFT JOIN grades g ON g.student_id = s.id
WHERE g.student_id IS NOT NULL

Now the same with a left semi-join:

SELECT s.id
FROM students s
WHERE EXISTS (SELECT 1 FROM grades g
              WHERE g.student_id = s.id)

The latter is generally more efficient (depending on the concrete DBMS and its query optimizer).
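
Back in data.table land, the same semi-join can be written with the which=TRUE idiom from the top of this page; a sketch with invented students/grades tables:

library(data.table)

students = data.table(id = 1:3, name = c("Ann", "Bob", "Cid"))
grades   = data.table(student_id = c(1L, 1L, 3L), grade = c(90, 85, 70))

# Students having at least one grade (ids 1 and 3), each returned once
w = unique(students[grades, on = .(id = student_id), which = TRUE, nomatch = 0L])
students[w]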

Inner Join data.table without unwanted column(s)

One option is to use the sqldf package and do a database-style join:

sql <- "SELECT t2.x, t1.y, t1.v, t2.z
FROM DT2 t2
LEFT JOIN DT1 t1
ON t2.x = t1.x"
result <- sqldf(sql)

One advantage of using sqldf here is that it makes it easy to choose which columns you want in the result, and in which order.
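
If you'd rather stay in data.table, the same column selection can be done in the join's j. A sketch assuming, to match the SQL above, that DT1 has columns x, y, v and DT2 has x, z:

# DT2 drives the join (like FROM DT2 LEFT JOIN DT1); choose and order columns in j
result = DT1[DT2, .(x, y, v, z), on = "x"]

Unmatched DT2 rows get NA in y and v, matching the SQL LEFT JOIN semantics.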


