Perform a semi-join with data.table
More possibilities:
w = unique(x[y,which=TRUE]) # the row numbers in x which have a match from y
x[w]
If there are duplicate key values in x, then that needs:
w = unique(x[y,which=TRUE,allow.cartesian=TRUE])
x[w]
Or, the other way around:
setkey(y,x)
w = !is.na(y[x,which=TRUE,mult="first"])
x[w]
If nrow(x) << nrow(y) then the y[x] approach should be faster.
If nrow(x) >> nrow(y) then the x[y] approach should be faster.
But the anti-anti-join appeals too :-)
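Both idioms can be sketched on toy data (hypothetical x and y, not the asker's tables):

```r
library(data.table)

# Hypothetical toy tables (not the asker's data), keyed on id
x <- data.table(id = c(1L, 2L, 3L, 4L), v = c(10, 20, 30, 40), key = "id")
y <- data.table(id = c(2L, 4L, 4L), w = c("a", "b", "c"), key = "id")

# x[y] approach: unique row numbers of x that have a match from y
w1 <- unique(x[y, which = TRUE])
res1 <- x[w1]

# y[x] approach: logical vector, TRUE where x's key is found in y
w2 <- !is.na(y[x, which = TRUE, mult = "first"])
res2 <- x[w2]
```

Both keep exactly the rows of x (ids 2 and 4 here) that have at least one match in y, without duplicating the row for y's repeated id 4.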
Filter data.table with another data.table with different column names
If I'm understanding the question correctly, this is a merge of dt with dt_score_1 under the conditions area = zone and cluster = cluster_mode.
dt[dt_score_1, on = .(area = zone, cluster = cluster_mode)]
#     record area score cluster i.score cluster_pct cluster_freq record_freq
#  1:      1    A     1       X       1   100.00000            2           2
#  2:      2    A     1       X       1   100.00000            2           2
#  3:      7    B     1       X       1    66.66667            2           3
#  4:      8    B     1       X       1    66.66667            2           3
#  5:     11    C     2       X       1   100.00000            1           1
#  6:     12    C     1       X       1   100.00000            1           1
#  7:     14    D     1       Z       1    80.00000            4           5
#  8:     15    D     1       Z       1    80.00000            4           5
#  9:     16    D     1       Z       1    80.00000            4           5
# 10:     17    D     1       Z       1    80.00000            4           5
# 11:     20    D     3       Z       1    80.00000            4           5
For a more detailed explanation of join-as-filter, see the link below posted by @Frank
Perform a semi-join with data.table
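Since dt and dt_score_1 themselves aren't reproduced above, the on = .(area = zone, cluster = cluster_mode) idiom can be sketched with small hypothetical stand-in tables:

```r
library(data.table)

# Small hypothetical stand-ins for the question's dt and dt_score_1
dt <- data.table(record = 1:4,
                 area = c("A", "A", "B", "C"),
                 cluster = c("X", "X", "Y", "Z"))
dt_score_1 <- data.table(zone = c("A", "C"),
                         cluster_mode = c("X", "Z"),
                         score = c(1, 1))

# Join on differently named columns: dt's area/cluster vs
# dt_score_1's zone/cluster_mode
res <- dt[dt_score_1, on = .(area = zone, cluster = cluster_mode)]
```

Only dt rows whose (area, cluster) pair appears in dt_score_1 survive (records 1, 2 and 4 here), and dt_score_1's extra columns come along for the ride.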
R - Most efficient way to remove all non-matched rows in a data.table rolling join (instead of 2-step procedure with semi join)
Essentially, I found two approaches that are both viable solutions.
Solution 1
First, proposed by lil_barnacle, is an elegant one-liner that reads as follows:
# Rolling join with nomatch argument set to 0
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE, nomatch=0]
Original approach
Adding the nomatch argument and setting it to 0, as in nomatch = 0, is equivalent to doing the rolling join first and the semi-join thereafter.
# Rolling join without specified nomatch argument
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]
# Semi-join required afterwards (sjoin() is the semi-join helper from the question)
dt_final <- sjoin(dt_res, dt, by = "id")
Solution 2
Second, the solution that I came up with was to 'align' both data sets before the rolling join by filtering on the joined variable, like so:
# Aligning data sets by filtering accd. to joined 'variable'
dt_fill <- dt_fill[id_2 %in% dt[ , unique(id)]]
# Rolling join without need to specify nomatch argument
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]
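The equivalence can be sketched with hypothetical dt and dt_fill (the question's data isn't shown above); the two-step version is written here as a plain NA filter rather than the asker's sjoin() helper:

```r
library(data.table)

# Hypothetical stand-ins for dt and dt_fill
dt <- data.table(id = c("a", "a", "b"),
                 Date = as.IDate(c("2020-01-01", "2020-01-10", "2020-01-05")),
                 val = 1:3)
dt_fill <- data.table(id_2 = c("a", "c"),
                      Date_2 = as.IDate(c("2020-01-12", "2020-01-03")))

# One step: nomatch = 0 drops the unmatched id_2 == "c" row directly
one_step <- dt[dt_fill, on = .(id = id_2, Date = Date_2),
               roll = TRUE, nomatch = 0]

# Two steps: roll first, then filter out non-matches
two_step <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]
two_step <- two_step[!is.na(val)]
```

Both end up with the single rolled match for id "a"; the row for the unknown id "c" is gone either way.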
Unable to perform rolling join on time and another variable in data.table
This seemed to solve the problem:
test = df2[df1, on=.(bike_id, time = time), roll="nearest",
.(bike_id, time, st_x, st_y)]
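A minimal sketch of what roll = "nearest" does here, with made-up bike data (the question's df1 and df2 are not shown above):

```r
library(data.table)

# Made-up data: df2 holds positions at known times, df1 holds query times
df1 <- data.table(bike_id = 1L,
                  time = as.POSIXct("2020-01-01 10:00:20", tz = "UTC"))
df2 <- data.table(bike_id = c(1L, 1L),
                  time = as.POSIXct(c("2020-01-01 10:00:00",
                                      "2020-01-01 10:01:00"), tz = "UTC"),
                  st_x = c(5.1, 5.2),
                  st_y = c(52.0, 52.1))

# For each df1 row, take the df2 reading closest in time for that bike
test <- df2[df1, on = .(bike_id, time), roll = "nearest",
            .(bike_id, time, st_x, st_y)]
```

The query at 10:00:20 picks up the 10:00:00 reading, since it is closer than the 10:01:00 one.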
R Multiple condition join using data.table
A solution I found without cross joining first needs to prepare the data by getting rid of rows where A and B are out of range entirely:
Prep = dt1[A >= min(lookup$minA) & A <= max(lookup$maxA) & B >= min(lookup$minB)]
Then you make a data table of where each of the conditions is met that corresponds to the lowest possible Val:
Indices = Prep[,list(min(which(A >= lookup$minA)),
min(which(A <= lookup$maxA)),
min(which(B >= lookup$minB)), A, B),by=ID]
Then you must get Val at the lowest point where all three conditions are satisfied:
Indices[,list(Val=lookup$Val[max(V1,V2,V3)], A, B),by=ID]
See if this gets you what you're looking for:
    ID Val     A    B
 1: 19 0.1 10000 2000
 2: 20 0.1 20000 2000
 3: 21 0.2 30000 2000
 4: 22 0.3 40000 2000
 5: 23 0.3 50000 2000
 6: 24 0.3 60000 2000
 7: 25 0.3 70000 2000
 8: 26 0.5 80000 2000
 9: 27 0.5 90000 2000
10: 28 0.1 10000 3000
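As an aside, data.table (1.9.8+) can express the range conditions directly as a non-equi join, which avoids the index bookkeeping above; a sketch with small hypothetical dt1 and lookup tables (the question's data is not shown here):

```r
library(data.table)

# Hypothetical stand-ins for dt1 and lookup
lookup <- data.table(minA = c(0, 50000), maxA = c(49999, 100000),
                     minB = c(0, 2500), Val = c(0.1, 0.5))
dt1 <- data.table(ID = 1:2, A = c(10000, 80000), B = c(2000, 3000))

# Each dt1 row picks up Val from the lookup row whose ranges it falls in
res <- lookup[dt1, on = .(minA <= A, maxA >= A, minB <= B),
              .(ID, A, B, Val)]
```

Here ID 1 (A = 10000) matches the first lookup band and ID 2 (A = 80000) the second.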
How to perform a serial join in data.table?
IIUC, I'd formulate your question as follows: for each dict$cat, I'd like to get prod(y) corresponding to each sample for that cat, and then sum them all up.
Let's construct this step by step now:
For each dict$cat - sounds like you need to group by cat:
dict[, , by=cat]
All that's left is to fill up j properly.
You need to get prod(y) from x for each sample for this group:
x[samp %in% .SD$samp, prod(y), by=samp]
This extracts those rows from x corresponding to this group's samp (using .SD, which stands for Subset of Data) and computes prod(y) on them, grouped by samp. Great!
We still need to sum them:
sum(x[samp %in% .SD$samp, prod(y), by=samp]$V1)
We've got the complete j expression. Let's plug it all in:
dict[, sum(x[samp %in% .SD$samp, prod(y), by=samp]$V1), by=cat]
#    cat         V1
# 1:   1  1.7770272
# 2:   2  0.7578771
# 3:   3 -1.0295633
Hope this helps.
Note 1: there's some redundant computation of prod(y) here, but the upside is that we don't materialise much intermediate data, so it's memory efficient. If you have too many groups this might get slower, and you might want to compute prod(y) in another variable like so:
x_p = x[, .(p = prod(y)), by=samp]
With this, we can simplify j as follows:
dict[, x_p[samp %in% .SD$samp, sum(p)], by=cat]
Note 2: the %in% expression creates an auto index on x's samp column on the first run, so that binary search based subsetting is used from then on. Therefore there's no need to worry about performance due to vector scans.
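Putting both variants together on made-up x and dict (the question's data isn't reproduced here) shows they agree:

```r
library(data.table)
set.seed(1)

# Made-up x and dict in the shape the answer assumes
x <- data.table(samp = rep(1:4, each = 2), y = rnorm(8))
dict <- data.table(samp = 1:4, cat = c(1, 1, 2, 2))

# The full j expression from above
ans1 <- dict[, sum(x[samp %in% .SD$samp, prod(y), by = samp]$V1), by = cat]

# Pre-computing prod(y) once per samp, then summing per cat
x_p <- x[, .(p = prod(y)), by = samp]
ans2 <- dict[, x_p[samp %in% .SD$samp, sum(p)], by = cat]
```

Both return one row per cat with the same summed products.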
How to perform join over date ranges using data.table?
You can use the foverlaps() function, which implements joins over intervals efficiently. In your case, we just need a dummy column for measurments.
Note 1: You should install the development version of data.table, v1.9.5, as a bug with foverlaps() has been fixed there. You can find the installation instructions here.
Note 2: I'll call whatWasMeasured = dt1 and measurments = dt2 here for convenience.
require(data.table) ## 1.9.5+
dt2[, dummy := time]
setkey(dt1, start, end)
ans = foverlaps(dt2, dt1, by.x=c("time", "dummy"), nomatch=0L)[, dummy := NULL]
See ?foverlaps for more info, and this post for a performance comparison.
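With the dt1/dt2 naming from Note 2, a self-contained sketch (the interval and measurement values are made up):

```r
library(data.table)  # 1.9.5+ for the foverlaps() fix mentioned above

# Made-up data: dt1 (whatWasMeasured) has intervals, dt2 (measurments)
# has point measurements
dt1 <- data.table(start = as.IDate(c("2020-01-01", "2020-02-01")),
                  end   = as.IDate(c("2020-01-31", "2020-02-28")),
                  what  = c("temp", "pressure"))
dt2 <- data.table(time  = as.IDate(c("2020-01-15", "2020-03-10")),
                  value = c(3.2, 7.7))

dt2[, dummy := time]            # zero-width interval [time, time]
setkey(dt1, start, end)
ans <- foverlaps(dt2, dt1, by.x = c("time", "dummy"), nomatch = 0L)[, dummy := NULL]
```

The January measurement falls inside the "temp" interval and is kept; the March one overlaps nothing and is dropped by nomatch = 0L.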
What is semi-join in database?
Simple example. Let's select students with grades using a left outer join:
SELECT DISTINCT s.id
FROM students s
LEFT JOIN grades g ON g.student_id = s.id
WHERE g.student_id IS NOT NULL
Now the same with left semi-join:
SELECT s.id
FROM students s
WHERE EXISTS (SELECT 1 FROM grades g
WHERE g.student_id = s.id)
The latter is generally more efficient (depending on the concrete DBMS and its query optimizer).
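The same semi-join reads naturally in data.table too; a sketch with made-up students and grades tables mirroring the SQL schema above:

```r
library(data.table)

# Made-up tables mirroring the SQL schema above
students <- data.table(id = 1:3, name = c("Ann", "Bob", "Cid"))
grades   <- data.table(student_id = c(1L, 1L, 3L), grade = c(90, 85, 70))

# Semi-join: keep students with at least one grade, no row duplication
with_grades <- students[id %in% grades$student_id]
```

Like the EXISTS form, this never duplicates a student, even though student 1 has two grades.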
Inner Join data.table without unwanted column(s)
One option is to use the sqldf package and do a database-style join:
sql <- "SELECT t2.x, t1.y, t1.v, t2.z
FROM DT2 t2
LEFT JOIN DT1 t1
ON t2.x = t1.x"
result <- sqldf(sql)
One advantage of using sqldf here is that it makes it easy to choose which columns you want in the result, and in which order.
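For comparison, the same column selection can be done natively in data.table's j during the join; DT1 and DT2 here are small made-up tables in the spirit of the question:

```r
library(data.table)

# Made-up tables: DT1 has x/y/v, DT2 has x/z
DT1 <- data.table(x = c("a", "b"), y = 1:2, v = c(10, 20))
DT2 <- data.table(x = c("b", "c"), z = c(100, 200))

# DT2 left-joined to DT1, keeping only x, y, v, z in that order
result <- DT1[DT2, .(x, y, v, z), on = "x"]
```

Listing the columns in j gives the same control over the result's columns and their order as the SELECT list does in the sqldf version.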