How to Self Join a Data.Table on a Condition


Try the following:

unique(merge(DT, DT, by="n")[abs(t.x - t.y) <= 10, list(n, sum(v.x * abs(t.x - t.y))), by=list(t.x, v.x)])

Breakdown for the above line:

You can merge a table with itself; the output will also be a data.table. Notice that the column names will be given the suffixes .x and .y:

merge(DT, DT, by="n")

... you can just filter and calculate as with any DT

# this will give you your desired rows
[abs(t.x - t.y) <= 10, ]

# this is the expression you outlined
[ ... , sum(v.x * abs(t.x - t.y)) ]

# summing by t.x and v.x
[ ... , ... , by=list(t.x, v.x)]

Then finally wrapping it all in unique to remove any duplicated rows.
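For intuition, the whole merge-filter-aggregate pipeline can be sketched in plain Python; the toy rows below (columns n, t, v) are invented for illustration:

```python
from itertools import product
from collections import defaultdict

# invented sample data standing in for DT, with columns n, t, v
DT = [
    {"n": 1, "t": 0, "v": 2},
    {"n": 1, "t": 5, "v": 3},
    {"n": 1, "t": 50, "v": 4},
    {"n": 2, "t": 7, "v": 1},
]

# merge(DT, DT, by="n"): pair every row with every row sharing the same n
merged = [(x, y) for x, y in product(DT, DT) if x["n"] == y["n"]]

# filter abs(t.x - t.y) <= 10, then sum v.x * abs(t.x - t.y) by (t.x, v.x)
sums = defaultdict(int)
for x, y in merged:
    if abs(x["t"] - y["t"]) <= 10:
        sums[(x["t"], x["v"])] += x["v"] * abs(x["t"] - y["t"])

print(dict(sums))  # {(0, 2): 10, (5, 3): 15, (50, 4): 0, (7, 1): 0}
```

Grouping on a dict key plays the role of both `by=` and the final `unique`, since each (t.x, v.x) group appears only once.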


UPDATE: The line below is what matches your output. The only difference from the one at the top of this answer is the term v.y in sum(v.y * ...); however, the by statement still uses v.x. Is that intentional?

unique(merge(DT, DT, by="n")[abs(t.x - t.y) <= 10, list(n, sum(v.y * abs(t.x - t.y))), by=list(t.x, v.x)])

Conditional self join SQL Server

Assuming you want to return all values, even when ParentID is null:

select a.EthnicityText + case when b.EthnicityText is null then '' else ' - ' + b.EthnicityText end
from DimEthnicity a
left join DimEthnicity b on b.EthnicityID = a.EthnicityParentID

Left join allows for the condition you're looking for (only return the self join row when it exists). More info on left joins: https://www.w3schools.com/sql/sql_join_left.asp
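As a quick sanity check, the same pattern runs against an in-memory SQLite database; the sample ethnicity rows are invented, and SQLite uses || instead of SQL Server's + for string concatenation:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""create table DimEthnicity
               (EthnicityID int, EthnicityText text, EthnicityParentID int)""")
# invented sample rows: 'Chinese' has 'Asian' as its parent ethnicity
con.executemany("insert into DimEthnicity values (?, ?, ?)",
                [(1, "Asian", None), (2, "Chinese", 1)])

rows = con.execute("""
    select a.EthnicityText || case when b.EthnicityText is null
                                   then '' else ' - ' || b.EthnicityText end
    from DimEthnicity a
    left join DimEthnicity b on b.EthnicityID = a.EthnicityParentID
    order by a.EthnicityID
""").fetchall()
print(rows)  # [('Asian',), ('Chinese - Asian',)]
```

The row with a null parent survives the left join and gets an empty suffix, which is exactly the case the CASE expression guards against.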

How do I self join a data.table in a manner like dcast

Update: faster versions of melt and dcast are now implemented (in C) in data.table versions >= 1.9.0. Check this post for more info.

Now you can just do:

dcast.data.table(DT, X~Y)

In the case of dcast alone, at the moment it has to be written out in full as dcast.data.table (since it's not an S3 generic in reshape2 yet). We'll try to fix this as soon as possible. For melt, you can just use melt(.) as you normally would.


The general idea is this:

setkey(DT, X, Y)
DT[CJ(1:5, c("A", "B"))][, as.list(Z), by=X]

You can name the columns V1 and V2 as A and B using setnames.

But this may not be efficient on large data or when the cast formula is complex; or rather, it could be made much more efficient. We're in the process of finding such an implementation to integrate melt and cast into data.table. Until then, you can work around it as above.

I'll update this post once we've made significant progress with melt/cast.
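For intuition, the long-to-wide reshape that dcast performs can be sketched in plain Python; the (X, Y, Z) tuples below are invented for illustration:

```python
# long-format (X, Y, Z) rows, invented for illustration
data = [(1, "A", 10), (1, "B", 11), (2, "A", 12), (2, "B", 13)]

# spread Z into one entry per Y value, keyed by X -- the shape dcast(DT, X ~ Y) produces
wide = {}
for x, y, z in data:
    wide.setdefault(x, {})[y] = z

print(wide)  # {1: {'A': 10, 'B': 11}, 2: {'A': 12, 'B': 13}}
```

Each X becomes one row and each distinct Y becomes one column; the CJ step in the answer above fills in X/Y combinations that are missing from the long data.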

self join ON clause in SQL

In some databases the != operator is written as <>; the query is otherwise the same.

SELECT *
FROM point_2d p1
INNER JOIN point_2d p2
    ON p1.x <> p2.y;

If you prefer not to use the explicit join syntax, you can also write it this way:

SELECT *
FROM point_2d p1, point_2d p2
WHERE p1.x <> p2.y

But I prefer the first way because it is more explicit and, I think, easier to read.

If you have any doubts, here is a list of the operators used in SQL:
https://www.w3schools.com/sql/sql_operators.asp
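The inequality join is easy to check against an in-memory SQLite database; the two sample points are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table point_2d (x int, y int)")
# invented sample points
con.executemany("insert into point_2d values (?, ?)", [(1, 1), (2, 2)])

rows = con.execute("""
    SELECT p1.x, p1.y, p2.x, p2.y
    FROM point_2d p1
    INNER JOIN point_2d p2 ON p1.x <> p2.y
    ORDER BY p1.x, p2.x
""").fetchall()
print(rows)  # [(1, 1, 2, 2), (2, 2, 1, 1)]
```

Each row pairs only with rows whose y differs from its x, so the self-pairings (1,1)-(1,1) and (2,2)-(2,2) are excluded.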

SQL self join table to find next row that matches a condition

What I want is a query that returns all DELIVER events together with the preceding DEPART (or maybe NEWTRIP) events, to see how long the trip took.

If I understand correctly, you can use apply:

select d.*, previous.*
from tmain d outer apply
     (select top (1) e.*
      from tmain e
      where e.cVehicleId = d.cVehicleId and
            e.cEventId in ('DEPART', 'NEWTRIP') and
            e.cDateTime < d.cDateTime
      order by e.cDateTime desc
     ) previous
where d.cEventId = 'DELIVER';

This uses the string versions for clarity.
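OUTER APPLY is SQL Server specific; as a rough equivalent for testing elsewhere, a correlated subquery can fetch the time of the preceding DEPART/NEWTRIP event. The table contents and integer timestamps below are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table tmain (cVehicleId text, cEventId text, cDateTime int)")
# invented events: two trips for vehicle V1
con.executemany("insert into tmain values (?, ?, ?)", [
    ("V1", "DEPART", 1), ("V1", "DELIVER", 3),
    ("V1", "DEPART", 5), ("V1", "DELIVER", 8),
])

rows = con.execute("""
    select d.cDateTime,
           (select max(e.cDateTime)
            from tmain e
            where e.cVehicleId = d.cVehicleId
              and e.cEventId in ('DEPART', 'NEWTRIP')
              and e.cDateTime < d.cDateTime) as depart_time
    from tmain d
    where d.cEventId = 'DELIVER'
    order by d.cDateTime
""").fetchall()
print(rows)  # [(3, 1), (8, 5)]
```

Each DELIVER is matched with the latest earlier departure for the same vehicle, so the trip duration is simply the difference of the two times.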

self join a table with data for condition coming from second table and joining with third table for some more data

SELECT e.*
FROM employeedetails e
JOIN employeesalary s
ON s.empid = e.empid
JOIN globaldata g
ON g.name = 'newincrmeent'
JOIN employeesalary x
ON x.empid <> s.empid
AND x.empid = 3
AND x.salary < s.salary+g.value;

Translate SQL self join query to data.table syntax

Final Edit: I replaced uniqueN with length(unique()). This provided faster results. Also, I had a typo in my previous edit for rule 7. I used unique(am_data) to remove duplicates, and that seemed to fix everything except rule_4.

> res_2[, lapply(.SD, sum), .SDcols = 2:8]
rule_1 rule_2 rule_3 rule_4 rule_5 rule_6 rule_7
1: 17167 10448 17165 2 606 16040 17072
> res[, lapply(.SD,sum), .SDcols = 2:8]
rule_1 rule_2 rule_3 rule_4 rule_5 rule_6 rule_7
1: 17167 10448 17165 0 606 16040 17072
am_data <- unique(am_data)

# Prepare for Rules 1 - 3 -------------------------------------------------

am_data2 <- copy(am_data)[!is.na(device_id)]
a <- copy(am_data2)
setnames(a, paste0('a.', names(a)))

# Make Rules 1-3 happen ---------------------------------------------------

self_join <- am_data2[a,
on = .(device_id = a.device_id,
sent_at < a.sent_at),
allow.cartesian = TRUE
,nomatch = 0L
][customer_id != a.customer_id]

rule_1 = self_join[, length(unique(customer_id)), by = a.app_id]
rule_2 = self_join[rejected == 1 , length(unique(customer_id)), by = a.app_id]
rule_3 = self_join[, length(unique(person_id)), by = a.app_id]

# Prepare for Rule 4 ------------------------------------------------------

am_data2 <- copy(am_data)[!is.na(ip_address_id)]
a <- copy(am_data2)
setnames(a, paste0('a.', names(a)))
a[, a.sent_at_range := a.sent_at - 14]

# Make Rule 4 happen ------------------------------------------------------

self_join <- am_data2[rejected == 1
][a,
on = .(ip_address_id = a.ip_address_id,
sent_at < a.sent_at,
sent_at >= a.sent_at_range),
allow.cartesian = TRUE
,nomatch = 0L
][customer_id != a.customer_id]

rule_4 <- self_join[, length(unique(customer_id)), by = a.app_id]

# Prepare for Rule 5 ------------------------------------------------------
am_data2 <- copy(am_data)[!is.na(contact_phone_id)]
a <- copy(am_data)[!is.na(mobile_phone_id)]
setnames(a, paste0('a.', names(a)))

# Make Rule 5 happen ------------------------------------------------------

self_join <- am_data2[rejected == 1
][a,
on = .(contact_phone_id = a.mobile_phone_id,
sent_at < a.sent_at),
allow.cartesian = TRUE
,nomatch = 0L
][customer_id != a.customer_id]

rule_5 <- self_join[, length(unique(customer_id)), by = a.app_id]

# Prepare for Rule 6 ------------------------------------------------------
am_data2 <- copy(am_data)[!is.na(work_phone_id)]
a <- copy(am_data)[!is.na(mobile_phone_id)]
setnames(a, paste0('a.', names(a)))

# Make Rule 6 Happen ------------------------------------------------------

self_join <- am_data2[rejected == 1
][a,
on = .(work_phone_id = a.mobile_phone_id,
sent_at < a.sent_at),
allow.cartesian = TRUE
,nomatch = 0L
][customer_id != a.customer_id]

rule_6 <- self_join[, length(unique(customer_id)), by = a.app_id]

# Prepare for Rule 7 ------------------------------------------------------
am_data2 <- copy(am_data)[!is.na(person_id)]
a <- copy(am_data2)
setnames(a, paste0('a.', names(a)))

# Make Rule 7 Happen ------------------------------------------------------
self_join <- am_data2[a,
on = .(person_id = a.person_id,
sent_at < a.sent_at),
allow.cartesian = TRUE
# ,nomatch = 0L
][customer_id != a.customer_id & passport_id != a.passport_id]

rule_7 <- self_join[, length(unique(customer_id)), by = a.app_id]

# Combine and cast the rules we made --------------------------------------

res_2 <- dcast(rbindlist(list(rule_1, rule_2, rule_3, rule_4, rule_5, rule_6, rule_7), idcol = 'rule'), formula = a.app_id ~ rule , fill = 0L)
setnames(res_2,2:8, paste0('rule_', 1:7))

Results

> res_2
a.app_id rule_1 rule_2 rule_3 rule_4 rule_5 rule_6 rule_7
1: 89033 0 0 0 0 0 1 0
2: 95775 0 0 0 0 0 1 0
3: 96542 0 0 0 0 0 1 0
4: 106447 0 0 0 0 0 1 0
5: 113040 0 0 0 0 0 1 0
---
21925: 34904219 1 1 1 0 0 1 0
21926: 34904725 1 1 1 0 0 0 1
21927: 34904750 1 0 1 0 0 1 1
21928: 34904921 1 0 1 0 0 0 1
21929: 34905033 0 0 0 0 0 1 1
> res[order(a.app_id) & (rule_1 > 0 | rule_2 > 0 | rule_3 > 0 |
rule_4 > 0 | rule_5 >0 | rule_6 > 0 | rule_7 > 0)]

a.app_id rule_1 rule_2 rule_3 rule_4 rule_5 rule_6 rule_7
1: 89033 0 0 0 0 0 1 0
2: 95775 0 0 0 0 0 1 0
3: 96542 0 0 0 0 0 1 0
4: 106447 0 0 0 0 0 1 0
5: 113040 0 0 0 0 0 1 0
---
22403: 34904219 1 1 1 0 0 1 1
22404: 34904725 1 1 1 0 0 0 1
22405: 34904750 1 0 1 0 0 1 1
22406: 34904921 1 0 1 0 0 0 1
22407: 34905033 0 0 0 0 0 1 1

Original: Kept as it's keyed by device and may be helpful.

This is the data.table equivalent of the SQL for rule1. I spot checked the first 5 and last 5 results and they match up.

tmp2 <- am_data[!is.na(device_id), ..cols]
tmp2[tmp2,
on = .(device_id = device_id,
sent_at > sent_at),
allow.cartesian = TRUE
][customer_id != i.customer_id | is.na(customer_id),
.N,
keyby = device_id]
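The shape of that non-equi self join is easy to mimic in plain Python; the toy events below are invented:

```python
from collections import defaultdict

# invented events: device d1 is shared by two different customers
events = [
    {"device_id": "d1", "customer_id": "c1", "sent_at": 1},
    {"device_id": "d1", "customer_id": "c2", "sent_at": 2},
    {"device_id": "d1", "customer_id": "c1", "sent_at": 3},
    {"device_id": "d2", "customer_id": "c3", "sent_at": 1},
]

# self join on device_id with sent_at > i.sent_at, then keep rows
# where the earlier event came from a different customer
counts = defaultdict(int)
for x in events:
    for y in events:
        if (x["device_id"] == y["device_id"]
                and x["sent_at"] > y["sent_at"]
                and x["customer_id"] != y["customer_id"]):
            counts[x["device_id"]] += 1

print(dict(counts))  # {'d1': 2}
```

Device d2 never appears because it only ever saw one customer, which mirrors the nomatch/filter behaviour of the data.table version.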

Data.table self-join on condition using a matrix

This is how I finally solved the problem:

sddx <<- CJ(ID1 = sdd$DASDelayID, ID2 = sdd$DASDelayID)[
  ID1 < ID2][
  , ':='(Connected = eqXrossM[cbind(sdd[DASDelayID == ID2, EquipmentID],
                                    sdd[DASDelayID == ID1, EquipmentID])] == 1,
         Distance = as.integer(sdd[DASDelayID == ID2, DelayStartTimeSeconds] -
                               sdd[DASDelayID == ID1, DelayEndTimeSeconds]))]

Step by step:

Generate all the combinations of DelayID; the number is large, but each row has only two integer columns.

sddx<<-CJ(ID1=sdd$DASDelayID,ID2=sdd$DASDelayID) 

This cuts the size in half, since IDs are assigned in creation order (ordered by DelayStartTime) and DelayEndTime > DelayStartTime.

[ID1<ID2] 

This enforces the external condition by indexing into the matrix; note the cbind:

[,':='(Connected=eqXrossM[cbind(sdd[DASDelayID==ID2,EquipmentID],sdd[DASDelayID==ID1,EquipmentID])]==1,

This calculates the distance between delays, which can be used to filter out the pairs where it is not strictly positive:

Distance=as.integer(sdd[DASDelayID==ID2,DelayStartTimeSeconds]-sdd[DASDelayID==ID1,DelayEndTimeSeconds]))  
]

I hope it helps someone else.
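Since ID1 < ID2 over a cross join just enumerates unordered pairs, the CJ(...)[ID1 < ID2] step is equivalent to itertools.combinations, which is a handy way to sanity-check the pair count; the delay IDs below are invented:

```python
from itertools import combinations

# invented delay IDs standing in for sdd$DASDelayID
delay_ids = [10, 11, 12, 13]

# CJ(ID1, ID2)[ID1 < ID2]: full cross join, keep only the upper triangle
pairs = [(i, j) for i in delay_ids for j in delay_ids if i < j]

assert pairs == list(combinations(sorted(delay_ids), 2))
print(len(pairs))  # 6, i.e. n*(n-1)/2 for n = 4
```

This confirms the "cuts the size to half" step: of the n^2 cross-join rows, only n*(n-1)/2 survive the filter.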

SQL - self join 'n' times with condition

Well, this would require some sequential ordering column, but you could also express this as:

select max(case when [Type] = 1 then Id end) OneId,
max(case when [Type] = 2 then Id end) TwoId,
max(case when [Type] = 3 then Id end) ThreeId
from (select *,
row_number() over (order by (select 1)) Seq
from table
) t
group by (Seq - [Type]);

EDIT: However, if you also want to include the group, add it to the select statement:

select (Seq - [Type]) as GroupId,
max(case when [Type] = 1 then 'OneId' end) OneI,
max(case when [Type] = 2 then 'TwoId' end) TwoI,
max(case when [Type] = 3 then 'ThreeId' end) ThreeI
from (select *,
row_number() over (order by (select 1)) Seq
from table
) t
group by (Seq - [Type]);

For your updated table you can group directly on your GroupId column; then you don't need the row_number() function or the subquery:

select max(case when [Type] = 1 then 'OneId' end) OneI,
max(case when [Type] = 2 then 'TwoId' end) TwoI,
max(case when [Type] = 3 then 'ThreeId' end) ThreeI,
GroupId
from table t
group by GroupId;
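The GroupId variant can be checked with a tiny in-memory SQLite session; the rows below are invented, and the demo selects the Id values rather than the literal strings for clarity:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table t (Id int, [Type] int, GroupId int)")
# invented rows: two groups, the second one missing Type 3
con.executemany("insert into t values (?, ?, ?)", [
    (100, 1, 1), (101, 2, 1), (102, 3, 1),
    (200, 1, 2), (201, 2, 2),
])

rows = con.execute("""
    select GroupId,
           max(case when [Type] = 1 then Id end) OneId,
           max(case when [Type] = 2 then Id end) TwoId,
           max(case when [Type] = 3 then Id end) ThreeId
    from t
    group by GroupId
    order by GroupId
""").fetchall()
print(rows)  # [(1, 100, 101, 102), (2, 200, 201, None)]
```

A missing Type within a group simply yields NULL for that column, since max() ignores the all-NULL case branch.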

Demo


