How to self join a data.table on a condition
Try the following:
unique(merge(DT, DT, by="n")[abs(t.x - t.y) <= 10, list(n, sum(v.x * abs(t.x - t.y))), by=list(t.x, v.x)])
Breakdown for the above line:
You can merge a table with itself; the output will also be a data.table. Notice that the column names will be given the suffixes .x and .y:
merge(DT, DT, by="n")
... then you can just filter and calculate as with any data.table:
# this will give you your desired rows
[abs(t.x - t.y) <= 10, ]
# this is the expression you outlined
[ ... , sum(v.x * abs(t.x - t.y)) ]
# summing by t.x and v.x
[ ... , ... , by=list(t.x, v.x)]
Then finally wrap it all in unique() to remove any duplicated rows.
UPDATE: The line below is what matches your output. The only difference between this and the one at the top of this answer is the term v.y in sum(v.y * ...); however, the by statement still uses v.x. Is that intentional?
unique(merge(DT, DT, by="n")[abs(t.x - t.y) <= 10, list(n, sum(v.y * abs(t.x - t.y))), by=list(t.x, v.x)])
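The merge-filter-aggregate pattern above translates directly to SQL. A minimal sketch via Python's sqlite3, using a hypothetical toy table with the same (n, t, v) columns as DT:

```python
import sqlite3

# Self join on the key, filter pairs within 10 time units, then aggregate.
# The table and its rows are made-up illustrations, not the asker's data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dt (n TEXT, t INTEGER, v REAL)")
con.executemany("INSERT INTO dt VALUES (?, ?, ?)",
                [("a", 1, 2.0), ("a", 5, 3.0), ("a", 20, 4.0), ("b", 1, 1.0)])

rows = con.execute("""
    SELECT x.n, x.t, x.v, SUM(x.v * ABS(x.t - y.t)) AS s
    FROM dt x
    JOIN dt y ON y.n = x.n          -- self join on the key column
    WHERE ABS(x.t - y.t) <= 10      -- keep only pairs within 10 time units
    GROUP BY x.n, x.t, x.v          -- one result row per left-hand row
""").fetchall()
print(rows)
```

The GROUP BY plays the role of data.table's by= plus unique(): each left-hand row appears once, with its aggregate.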
Conditional self join SQL Server
Assuming you want to return all rows, even when EthnicityParentID is null:
select a.EthnicityText + case when b.EthnicityText is null then '' else ' - ' + b.EthnicityText end
from DimEthnicity a
left join DimEthnicity b on b.EthnicityID = a.EthnicityParentID
A left join gives you the condition you're looking for (the self-joined row is returned only when it exists). More info on left joins: https://www.w3schools.com/sql/sql_join_left.asp
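The same left-join shape can be sketched in SQLite via Python; the DimEthnicity rows here are invented, and SQLite concatenates with || rather than +, but the CASE guard against NULL parents is identical:

```python
import sqlite3

# Left self join: every ethnicity appears, with " - parent" appended only
# when a parent row exists. Toy data, assumed schema.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE DimEthnicity (
    EthnicityID INTEGER PRIMARY KEY,
    EthnicityText TEXT,
    EthnicityParentID INTEGER)""")
con.executemany("INSERT INTO DimEthnicity VALUES (?, ?, ?)",
                [(1, "Asian", None), (2, "Chinese", 1), (3, "European", None)])

rows = con.execute("""
    SELECT a.EthnicityText ||
           CASE WHEN b.EthnicityText IS NULL THEN ''
                ELSE ' - ' || b.EthnicityText END
    FROM DimEthnicity a
    LEFT JOIN DimEthnicity b ON b.EthnicityID = a.EthnicityParentID
""").fetchall()
print(rows)
```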
How do I self join a data.table in a manner like dcast
Update: faster versions of melt and dcast are now implemented (in C) in data.table versions >= 1.9.0. Check this post for more info.
Now you can just do:
dcast.data.table(DT, X~Y)
In the case of dcast alone, at the moment it has to be written out completely (as it's not an S3 generic yet in reshape2). We'll try to fix this as soon as possible. For melt, you can just use melt(.) as you'd do normally.
The general idea is this:
setkey(DT, X, Y)
DT[CJ(1:5, c("A", "B"))][, as.list(Z), by=X]
You can rename the columns V1 and V2 to A and B using setnames().
But this may not be efficient on large data or when the cast formula is complex; or rather, it could be made much more efficient. We're in the process of finding such an implementation to integrate melt and cast into data.table. Until then, you can work around it as shown above.
I'll update this post once we've made significant progress with melt/cast.
self join ON clause in SQL
In some databases the != operator is written as <>; the query is otherwise the same:
SELECT *
FROM point_2d p1
INNER JOIN point_2d p2
    ON p1.x <> p2.y;
If you prefer not to use an explicit join, you can also write it this way:
SELECT *
FROM
point_2d p1, point_2d p2
WHERE p1.x <> p2.y
But I prefer the first way because it is more explicit and, I think, easier to read.
If you have any doubts, here is a list of the operators used in SQL:
https://www.w3schools.com/sql/sql_operators.asp
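A quick check, via Python's sqlite3 on a made-up point_2d table, that <> and != are interchangeable and that the explicit join and the comma-style join return the same rows:

```python
import sqlite3

# Two tiny points; the join condition p1.x <> p2.y keeps only the
# mismatched pairs, regardless of which operator or join style is used.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE point_2d (x INTEGER, y INTEGER)")
con.executemany("INSERT INTO point_2d VALUES (?, ?)", [(1, 1), (1, 2)])

explicit = con.execute("""
    SELECT * FROM point_2d p1
    INNER JOIN point_2d p2 ON p1.x <> p2.y
""").fetchall()

implicit = con.execute("""
    SELECT * FROM point_2d p1, point_2d p2
    WHERE p1.x != p2.y
""").fetchall()
print(explicit)
```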
SQL self join table to find next row that matches a condition
What I want is a query that returns all DELIVER events together with the preceding DEPART (or maybe NEWTRIP) events, to see how long the trip took.
If I understand correctly, you can use outer apply:
select d.*, previous.*
from tmain d outer apply
(select top (1) e.*
from tmain e
where e.cVehicleId = d.cVehicleId and
e.cFixedId in ('DEPART', 'NEWTRIP') and
e.cDateTime < d.cDateTime
order by e.cDateTime desc
) previous
where d.ceventid = 'DELIVER';
This uses the string versions for clarity.
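SQLite has no OUTER APPLY, but the same "most recent earlier DEPART/NEWTRIP" lookup can be sketched as a correlated subquery; the tmain schema and rows below are guesses from the column names in the query:

```python
import sqlite3

# For each DELIVER event, fetch the latest earlier DEPART/NEWTRIP time
# for the same vehicle (correlated subquery instead of OUTER APPLY).
# Schema and data are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE tmain (
    cVehicleId INTEGER, cEventId TEXT, cDateTime TEXT)""")
con.executemany("INSERT INTO tmain VALUES (?, ?, ?)", [
    (1, "DEPART",  "2020-01-01 08:00"),
    (1, "DELIVER", "2020-01-01 09:30"),
    (1, "NEWTRIP", "2020-01-01 10:00"),
    (1, "DELIVER", "2020-01-01 11:00"),
])

rows = con.execute("""
    SELECT d.cVehicleId, d.cDateTime,
           (SELECT MAX(e.cDateTime)
            FROM tmain e
            WHERE e.cVehicleId = d.cVehicleId
              AND e.cEventId IN ('DEPART', 'NEWTRIP')
              AND e.cDateTime < d.cDateTime) AS departed_at
    FROM tmain d
    WHERE d.cEventId = 'DELIVER'
""").fetchall()
print(rows)
```

Unlike OUTER APPLY, a scalar subquery can only return one column; for all of the previous row's columns you would need one subquery per column or a window-function rewrite.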
Self join a table with a condition coming from a second table, joining a third table for some more data
SELECT e.*
FROM employeedetails e
JOIN employeesalary s
ON s.empid = e.empid
JOIN globaldata g
ON g.name = 'newincrmeent'
JOIN employeesalary x
ON x.empid <> s.empid
AND x.empid = 3
AND x.salary < s.salary+g.value;
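A runnable sketch of the query above via Python's sqlite3. The schemas are guessed from the column names and the toy rows are invented; employee 3 is the reference row, so the query returns every other employee whose salary plus the global increment exceeds employee 3's salary:

```python
import sqlite3

# Guessed schemas: details, salaries, and a global key/value table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employeedetails (empid INTEGER, name TEXT);
    CREATE TABLE employeesalary (empid INTEGER, salary INTEGER);
    CREATE TABLE globaldata (name TEXT, value INTEGER);
    INSERT INTO employeedetails VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO employeesalary VALUES (1, 100), (2, 50), (3, 90);
    INSERT INTO globaldata VALUES ('newincrmeent', 20);
""")

rows = con.execute("""
    SELECT e.*
    FROM employeedetails e
    JOIN employeesalary s ON s.empid = e.empid
    JOIN globaldata g ON g.name = 'newincrmeent'
    JOIN employeesalary x ON x.empid <> s.empid        -- exclude the reference row
                         AND x.empid = 3               -- reference employee
                         AND x.salary < s.salary + g.value
""").fetchall()
print(rows)
```

With these rows, only employee 1 qualifies: 90 < 100 + 20, while 90 < 50 + 20 fails for employee 2.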
Translate SQL self join query to data.table syntax
Final Edit: I replaced uniqueN with length(unique()). This provided fast results. Also, I had a typo in my previous edit for rule 7. I used unique(am_data) to remove duplicates, and that seemed to fix everything except rule_4.
> res_2[, lapply(.SD, sum), .SDcols = 2:8]
rule_1 rule_2 rule_3 rule_4 rule_5 rule_6 rule_7
1: 17167 10448 17165 2 606 16040 17072
> res[, lapply(.SD,sum), .SDcols = 2:8]
rule_1 rule_2 rule_3 rule_4 rule_5 rule_6 rule_7
1: 17167 10448 17165 0 606 16040 17072
am_data <- unique(am_data)
# Prepare for Rules 1 - 3 -------------------------------------------------
am_data2 <- copy(am_data)[!is.na(device_id)]
a <- copy(am_data2)
setnames(a, paste0('a.', names(a)))
# Make Rules 1-3 happen ---------------------------------------------------
self_join <- am_data2[a,
on = .(device_id = a.device_id,
sent_at < a.sent_at),
allow.cartesian = TRUE
,nomatch = 0L
][customer_id != a.customer_id]
rule_1 = self_join[, length(unique(customer_id)), by = a.app_id]
rule_2 = self_join[rejected == 1 , length(unique(customer_id)), by = a.app_id]
rule_3 = self_join[, length(unique(person_id)), by = a.app_id]
# Prepare for Rule 4 ------------------------------------------------------
am_data2 <- copy(am_data)[!is.na(ip_address_id)]
a <- copy(am_data2)
setnames(a, paste0('a.', names(a)))
a[, a.sent_at_range := a.sent_at - 14]
# Make Rule 4 happen ------------------------------------------------------
self_join <- am_data2[rejected == 1
][a,
on = .(ip_address_id = a.ip_address_id,
sent_at < a.sent_at,
sent_at >= a.sent_at_range),
allow.cartesian = TRUE
,nomatch = 0L
][customer_id != a.customer_id]
rule_4 <- self_join[, length(unique(customer_id)), by = a.app_id]
# Prepare for Rule 5 ------------------------------------------------------
am_data2 <- copy(am_data)[!is.na(contact_phone_id)]
a <- copy(am_data)[!is.na(mobile_phone_id)]
setnames(a, paste0('a.', names(a)))
# Make Rule 5 happen ------------------------------------------------------
self_join <- am_data2[rejected == 1
][a,
on = .(contact_phone_id = a.mobile_phone_id,
sent_at < a.sent_at),
allow.cartesian = TRUE
,nomatch = 0L
][customer_id != a.customer_id]
rule_5 <- self_join[, length(unique(customer_id)), by = a.app_id]
# Prepare for Rule 6 ------------------------------------------------------
am_data2 <- copy(am_data)[!is.na(work_phone_id)]
a <- copy(am_data)[!is.na(mobile_phone_id)]
setnames(a, paste0('a.', names(a)))
# Make Rule 6 Happen ------------------------------------------------------
self_join <- am_data2[rejected == 1
][a,
on = .(work_phone_id = a.mobile_phone_id,
sent_at < a.sent_at),
allow.cartesian = TRUE
,nomatch = 0L
][customer_id != a.customer_id]
rule_6 <- self_join[, length(unique(customer_id)), by = a.app_id]
# Prepare for Rule 7 ------------------------------------------------------
am_data2 <- copy(am_data)[!is.na(person_id)]
a <- copy(am_data2)
setnames(a, paste0('a.', names(a)))
# Make Rule 7 Happen ------------------------------------------------------
self_join <- am_data2[a,
on = .(person_id = a.person_id,
sent_at < a.sent_at),
allow.cartesian = TRUE
# ,nomatch = 0L
][customer_id != a.customer_id & passport_id != a.passport_id]
rule_7 <- self_join[, length(unique(customer_id)), by = a.app_id]
# Combine and cast the rules we made --------------------------------------
res_2 <- dcast(rbindlist(list(rule_1, rule_2, rule_3, rule_4, rule_5, rule_6, rule_7), idcol = 'rule'), formula = a.app_id ~ rule , fill = 0L)
setnames(res_2,2:8, paste0('rule_', 1:7))
Results
> res_2
a.app_id rule_1 rule_2 rule_3 rule_4 rule_5 rule_6 rule_7
1: 89033 0 0 0 0 0 1 0
2: 95775 0 0 0 0 0 1 0
3: 96542 0 0 0 0 0 1 0
4: 106447 0 0 0 0 0 1 0
5: 113040 0 0 0 0 0 1 0
---
21925: 34904219 1 1 1 0 0 1 0
21926: 34904725 1 1 1 0 0 0 1
21927: 34904750 1 0 1 0 0 1 1
21928: 34904921 1 0 1 0 0 0 1
21929: 34905033 0 0 0 0 0 1 1
> res[order(a.app_id) & (rule_1 > 0 | rule_2 > 0 | rule_3 > 0 |
rule_4 > 0 | rule_5 >0 | rule_6 > 0 | rule_7 > 0)]
a.app_id rule_1 rule_2 rule_3 rule_4 rule_5 rule_6 rule_7
1: 89033 0 0 0 0 0 1 0
2: 95775 0 0 0 0 0 1 0
3: 96542 0 0 0 0 0 1 0
4: 106447 0 0 0 0 0 1 0
5: 113040 0 0 0 0 0 1 0
---
22403: 34904219 1 1 1 0 0 1 1
22404: 34904725 1 1 1 0 0 0 1
22405: 34904750 1 0 1 0 0 1 1
22406: 34904921 1 0 1 0 0 0 1
22407: 34905033 0 0 0 0 0 1 1
Original: Kept as it's keyed by device and may be helpful.
This is the data.table equivalent of the SQL for rule1. I spot checked the first 5 and last 5 results and they match up.
tmp2 <- am_data[!is.na(device_id), ..cols]
tmp2[tmp2,
on = .(device_id = device_id,
sent_at > sent_at),
allow.cartesian = TRUE
][customer_id != i.customer_id | is.na(customer_id),
.N,
keyby = device_id]
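For comparison, rule 1's shape in plain SQL via Python's sqlite3, on a made-up four-column slice of am_data (app_id, customer_id, device_id, sent_at are assumed column names): count, per application, the distinct earlier customers seen on the same device.

```python
import sqlite3

# Non-equi self join: for each row a, find earlier rows b on the same
# device from a different customer, then count distinct customers.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE am_data (
    app_id INTEGER, customer_id TEXT, device_id TEXT, sent_at INTEGER)""")
con.executemany("INSERT INTO am_data VALUES (?, ?, ?, ?)", [
    (1, "c1", "d1", 1),
    (2, "c2", "d1", 2),
    (3, "c1", "d1", 3),
])

rows = con.execute("""
    SELECT a.app_id, COUNT(DISTINCT b.customer_id) AS n_customers
    FROM am_data a
    JOIN am_data b ON b.device_id = a.device_id   -- same device
                  AND b.sent_at < a.sent_at       -- sent earlier
    WHERE b.customer_id <> a.customer_id          -- different customer
    GROUP BY a.app_id
""").fetchall()
print(rows)
```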
Data.table self-join on condition using a matrix
This is how I finally solved the problem:
sddx <<- CJ(ID1 = sdd$DASDelayID, ID2 = sdd$DASDelayID)[
    ID1 < ID2][
    , ':='(Connected = eqXrossM[cbind(sdd[DASDelayID == ID2, EquipmentID],
                                      sdd[DASDelayID == ID1, EquipmentID])] == 1,
           Distance = as.integer(sdd[DASDelayID == ID2, DelayStartTimeSeconds] -
                                 sdd[DASDelayID == ID1, DelayEndTimeSeconds]))]
Step by step:
Generate all combinations of DelayID; the number is large, but each row holds only two integer columns.
sddx<<-CJ(ID1=sdd$DASDelayID,ID2=sdd$DASDelayID)
This cuts the size in half, since IDs are assigned in creation order (rows are ordered by DelayStartTime, and DelayEndTime > DelayStartTime).
[ID1<ID2]
This enforces the external condition by indexing into the matrix; note the cbind:
[,':='(Connected=eqXrossM[cbind(sdd[DASDelayID==ID2,EquipmentID],sdd[DASDelayID==ID1,EquipmentID])]==1,
This calculates the distance between delays, which can then be used to filter out the ones where it is not strictly positive:
Distance=as.integer(sdd[DASDelayID==ID2,DelayStartTimeSeconds]-sdd[DASDelayID==ID1,DelayEndTimeSeconds]))
]
I hope it helps someone else.
SQL - self join 'n' times with condition
Well, this requires some sequential ordering column, but you could express this as:
select max(case when [Type] = 1 then Id end) OneId,
max(case when [Type] = 2 then Id end) TwoId,
max(case when [Type] = 3 then Id end) ThreeId
from (select *,
row_number() over (order by (select 1)) Seq
from table
) t
group by (Seq - [Type]);
EDIT: However, if you also want to include the group, add it to the select list:
select (Seq - [Type]) as GroupId,
max(case when [Type] = 1 then 'OneId' end) OneI,
max(case when [Type] = 2 then 'TwoId' end) TwoI,
max(case when [Type] = 3 then 'ThreeId' end) ThreeI
from (select *,
row_number() over (order by (select 1)) Seq
from table
) t
group by (Seq - [Type]);
For your updated table you can use a group by clause directly on your GroupId column; then you don't need the row_number() function or the subquery:
select max(case when [Type] = 1 then 'OneId' end) OneI,
max(case when [Type] = 2 then 'TwoId' end) TwoI,
max(case when [Type] = 3 then 'ThreeId' end) ThreeI,
GroupId
from table t
group by GroupId;
Demo
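The (Seq - Type) trick above is a gaps-and-islands technique: within each consecutive 1, 2, 3 run, the row number and the type grow in step, so their difference is constant per run. A sketch via Python's sqlite3 on a toy table (requires SQLite >= 3.25 for window functions):

```python
import sqlite3

# Two runs: Types 1,2,3 then 1,2. (Seq - Type) is 0 for the first run
# and 3 for the second, so GROUP BY pivots each run into one row.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (Id INTEGER PRIMARY KEY, Type INTEGER)")
con.executemany("INSERT INTO t (Id, Type) VALUES (?, ?)",
                [(10, 1), (11, 2), (12, 3), (13, 1), (14, 2)])

rows = con.execute("""
    SELECT MAX(CASE WHEN Type = 1 THEN Id END) AS OneId,
           MAX(CASE WHEN Type = 2 THEN Id END) AS TwoId,
           MAX(CASE WHEN Type = 3 THEN Id END) AS ThreeId
    FROM (SELECT *, ROW_NUMBER() OVER (ORDER BY Id) AS Seq FROM t) AS t2
    GROUP BY (Seq - Type)
""").fetchall()
print(rows)
```

The second run has no Type 3 row, so its ThreeId comes back NULL, matching the MAX(CASE ...) behaviour in the answer.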