How to Label "Transitive Groups" with SQL


Now, a new demand in 2013: I need to work with 10,000 items.
Using @GordonLinoff's elegant solution (above),
1000 items take about 1 second, but 2000 items take about 1 day... The performance is not good. The performance problem was also noted here.

The @NealB algorithm

(this is the best solution, and so fast!)
See the original and didactic description. Here the table T1 is the same as in the question text, and a second (temporary) table R is used to process and to show the results:

CREATE TABLE R (
  id    integer NOT NULL, -- PRIMARY KEY,
  label integer NOT NULL DEFAULT 0
);

CREATE FUNCTION t1r_labeler() RETURNS void AS $funcBody$
DECLARE
  label1   integer;
  label2   integer;
  newlabel integer;
  t        t1%rowtype;
BEGIN
  DELETE FROM R;
  INSERT INTO R(id)
    SELECT DISTINCT unnest(array[id1,id2])
    FROM T1 ORDER BY 1;
  newlabel := 0;
  FOR t IN SELECT * FROM t1
  LOOP  -- -- BASIC LABELING: -- --
    SELECT label INTO label1 FROM R WHERE id = t.id1;
    SELECT label INTO label2 FROM R WHERE id = t.id2;
    IF label1 = 0 AND label2 = 0 THEN
      newlabel := newlabel + 1;
      UPDATE R SET label = newlabel WHERE id IN (t.id1, t.id2);
    ELSIF label1 = 0 AND label2 != 0 THEN
      UPDATE R SET label = label2 WHERE id = t.id1;
    ELSIF label1 != 0 AND label2 = 0 THEN
      UPDATE R SET label = label1 WHERE id = t.id2;
    ELSIF label1 != label2 THEN  -- time-consuming
      UPDATE R SET label = label1 WHERE label = label2;
    END IF;
  END LOOP;
END;
$funcBody$ LANGUAGE plpgsql VOLATILE;

Preparing and running,

-- same CREATE TABLE T1 (id1 integer, id2 integer);
DELETE FROM T1;
INSERT INTO T1(id1,id2)  -- populate the standard input
VALUES (1, 2), (1, 5), (4, 7), (7, 8), (9, 1);
-- or SELECT id1, id2 FROM table_with_1000000_items;

SELECT t1r_labeler();        -- run
SELECT * FROM R ORDER BY 2;  -- show

Dealing with the worst case

The last condition, when label1 != label2,
is the most time-consuming operation. It must be avoided, or it can be separated out for cases of high connectivity, which are the worst ones.

To raise some kind of alert, you can count the proportion of times the procedure runs this last condition, and/or you can separate the last update into its own pass.

If you separate it, you can analyse those cases and deal with them a little better.
So, eliminate the last ELSIF, and after the first loop add your checks and this second loop:

-- ... first loop and checks here ...
FOR t IN SELECT * FROM t1
LOOP  -- -- MERGING LABELS: -- --
  SELECT label INTO label1 FROM R WHERE id = t.id1;
  SELECT label INTO label2 FROM R WHERE id = t.id2;
  IF label1 != 0 AND label2 != 0 AND label1 != label2 THEN
    UPDATE R SET label = label1 WHERE label = label2;
  END IF;
END LOOP;
-- ...

Example of a worst case: a group with more than 1000 connected nodes out of 10,000 nodes, with an average of about 10 nodes per labeled group (the cores) and only a few paths connecting the cores.
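The costly merge in the last ELSIF (relabeling a whole group with one UPDATE) is exactly what a disjoint-set (union-find) structure avoids: with path compression and union by size, each merge is near-constant time. This is not part of the SQL answer above; it is a minimal Python sketch of the same labeling over the sample T1 data, with illustrative names (`label_groups`, `find`):

```python
def label_groups(edges):
    """Connected-component labeling via union-find
    (path compression + union by size)."""
    parent, size = {}, {}

    def find(x):
        parent.setdefault(x, x)
        size.setdefault(x, 1)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            if size[ra] < size[rb]:  # union by size: attach small under big
                ra, rb = rb, ra
            parent[rb] = ra
            size[ra] += size[rb]

    roots = {i: find(i) for i in parent}
    # renumber roots to dense labels 1, 2, ...
    dense = {r: n for n, r in enumerate(sorted(set(roots.values())), 1)}
    return {i: dense[r] for i, r in roots.items()}

t1 = [(1, 2), (1, 5), (4, 7), (7, 8), (9, 1)]
print(label_groups(t1))  # {1: 1, 2: 1, 5: 1, 4: 2, 7: 2, 8: 2, 9: 1}
```

Inside PostgreSQL the same idea would need the parent map kept in a table or in plpgsql arrays, but the sketch shows why the worst case disappears: merging two labels never rewrites every member of a group.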

Array-oriented algorithm

This other solution is slower (it is a brute-force algorithm), but it can be useful when you need direct processing with arrays and do not need the fastest solution (and have no "worst cases").

As @peter.petrov and @RBarryYoung suggested using a more adequate data structure... I went back to my arrays as that "more adequate data structure". After all, there is a good speed-up (compared with @GordonLinoff's algorithm) with the solution below (!).

The first step is to translate the table t1 of the question text into a temporary one, transgroup1, where we can run the new process:

-- DROP TABLE transgroup1;
CREATE TABLE transgroup1 (
  id    serial NOT NULL PRIMARY KEY,
  items integer[],  -- two or more items in the transitive relationship
  dels  integer[] DEFAULT array[]::integer[]
);
INSERT INTO transgroup1(items)
  SELECT array[id1, id2] FROM t1;  -- now suppose t1 is a 10000-item table

then, with these two functions, we can solve the problem:

CREATE FUNCTION array_uunion(anyarray, anyarray) RETURNS anyarray AS $$
  -- ensures distinct items of a concatenation
  SELECT ARRAY(SELECT unnest($1) UNION SELECT unnest($2))
$$ LANGUAGE sql IMMUTABLE;

CREATE FUNCTION transgroup1_loop() RETURNS void AS $BODY$
DECLARE
  cp_dels integer[];
  i       integer;
  max_i   integer;
BEGIN
  i := 1;
  max_i := 10;  -- or 100 or more, but needs some control to be safe
  LOOP
    UPDATE transgroup1
    SET items = array_uunion(transgroup1.items, t2.items),
        dels  = transgroup1.dels || t2.id
    FROM transgroup1 AS t1, transgroup1 AS t2
    WHERE transgroup1.id = t1.id AND t1.id > t2.id AND t1.items && t2.items;

    cp_dels := array(
      SELECT DISTINCT unnest(dels) FROM transgroup1
    );  -- ensures all items to delete
    EXIT WHEN i > max_i OR array_length(cp_dels, 1) = 0;

    DELETE FROM transgroup1 WHERE id IN (SELECT unnest(cp_dels));
    UPDATE transgroup1 SET dels = array[]::integer[];
    i := i + 1;
  END LOOP;
  UPDATE transgroup1  -- only to beautify
  SET items = ARRAY(SELECT unnest(items) ORDER BY 1 DESC);
END;
$BODY$ LANGUAGE plpgsql VOLATILE;

Of course, to run and see the results, you can use

SELECT transgroup1_loop();  -- not 1 day but some hours!
SELECT *, dense_rank() OVER (ORDER BY id) AS group FROM transgroup1;

resulting in

 id |   items   | dels | group
----+-----------+------+-------
  4 | {8,7,4}   | {}   |     1
  5 | {9,5,2,1} | {}   |     2
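
The loop above is, in essence, repeated merging of overlapping sets until a fixpoint. As an illustration only (not part of the SQL solution), here is a brute-force Python sketch of the same idea over the sample T1 data, with `merge_groups` an illustrative name:

```python
def merge_groups(edges):
    """Brute force: repeatedly merge any two overlapping sets
    until no pair of sets shares an element."""
    groups = [set(e) for e in edges]
    changed = True
    while changed:
        changed = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if groups[i] & groups[j]:  # overlap: merge j into i
                    groups[i] |= groups.pop(j)
                    changed = True
                    break
            if changed:
                break
    return groups

t1 = [(1, 2), (1, 5), (4, 7), (7, 8), (9, 1)]
print(sorted(sorted(g) for g in merge_groups(t1)))  # [[1, 2, 5, 9], [4, 7, 8]]
```

Like the SQL version, this is quadratic per pass, which is why the author caps the passes with max_i.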

How to label a big set of “transitive groups” with a constraint?

An SQL-only solution appears to be a bit of a problem here. With the help of some procedural
programming on top of SQL, the solution appears to be fairly simple and efficient. Here is a brief outline
of a solution as it could be implemented in any procedural language invoking SQL.

Declare table R with primary key ID, where ID is drawn from the same domain as ID1 and ID2 of table T1.
Table R contains one other non-key column, a Label number.

Populate table R with the range of values found in T1. Set Label to zero (no label).
Using your example data, the initial setup for R would look like:

Table R
ID Label
== =====
1 0
2 0
4 0
5 0
7 0
8 0
9 0

Using a host-language cursor plus an auxiliary counter, read each row from T1. Look up ID1 and ID2 in R. You will find one of
four cases:

 Case 1: ID1.Label == 0 and ID2.Label == 0

In this case neither one of these IDs has been "seen" before: add 1 to the counter and then update both
rows of R to the value of the counter: update R set R.Label = :counter where R.ID in (:ID1, :ID2)

 Case 2: ID1.Label == 0 and ID2.Label <> 0

In this case, ID1 is new but ID2 has already been assigned a label. ID1 needs to be assigned the
same label as ID2: update R set R.Label = :ID2.Label where R.ID = :ID1

 Case 3: ID1.Label <> 0 and ID2.Label == 0

In this case, ID2 is new but ID1 has already been assigned a label. ID2 needs to be assigned the
same label as ID1: update R set R.Label = :ID1.Label where R.ID = :ID2

 Case 4: ID1.Label <> 0 and ID2.Label <> 0

In this case, the row contains redundant information. Both rows of R should contain the same Label value. If not,
there is some sort of data integrity problem. Ahhh... not quite; see the edit below...

EDIT I just realized that there are situations where both Label values here could be non-zero and different. If both are non-zero and different, then two Label groups need to be merged at this point. All you need to do is choose one Label and update the others to match, with something like: update R set R.Label = :ID1.Label where R.Label = :ID2.Label. Now both groups have been merged under the same Label value.

Upon completion of the cursor, table R will contain Label values needed to update T2.

Table R
ID Label
== =====
1 1
2 1
4 2
5 1
7 2
8 2
9 1

Process table T2
using something along the lines of: update T2 set T2.Label = R.Label where T2.ID1 = R.ID. The end result should be:

  table T2
ID1 | ID2 | LABEL
1 | 2 | 1
1 | 5 | 1
4 | 7 | 2
7 | 8 | 2
9 | 1 | 1

This process is purely iterative and should scale to fairly large tables without difficulty.
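As a cross-check of the four cases, here is the same cursor logic sketched in Python, with a plain dict standing in for table R (illustrative only; a real implementation would be host-language code issuing the UPDATEs shown above):

```python
def label_r(t1_rows):
    # R: id -> label; 0 means "no label yet"
    r = {i: 0 for e in t1_rows for i in e}
    counter = 0
    for id1, id2 in t1_rows:
        l1, l2 = r[id1], r[id2]
        if l1 == 0 and l2 == 0:      # case 1: both new
            counter += 1
            r[id1] = r[id2] = counter
        elif l1 == 0:                # case 2: id1 joins id2's label
            r[id1] = l2
        elif l2 == 0:                # case 3: id2 joins id1's label
            r[id2] = l1
        elif l1 != l2:               # case 4 (edit): merge the two labels
            for k, v in r.items():
                if v == l2:
                    r[k] = l1
    return r

t1 = [(1, 2), (1, 5), (4, 7), (7, 8), (9, 1)]
print(label_r(t1))  # {1: 1, 2: 1, 5: 1, 4: 2, 7: 2, 8: 2, 9: 1}
```

On the example data this reproduces the final R table above: IDs 1, 2, 5, 9 get label 1 and IDs 4, 7, 8 get label 2.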

Convert transitive connections of elements into groups in R or SQL

Your data is a graph, defined by the list of its edges,
and you want its connected components.
This is what the clusters function in the igraph package computes.

# Sample data
d <- structure(c("A", "B", "C", "C", "E", "I", "H", "J", "K", "B",
"C", "D", "G", "F", "E", "G", "K", "L"), .Dim = c(9L, 2L), .Dimnames = list(
NULL, c("e1", "e2")))

library(igraph)
g <- graph.edgelist( as.matrix(d) )
clusters(g)
# $membership
# [1] 1 1 1 1 1 2 2 2 1 3 3 3
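
If igraph is not at hand, the same connected components can be computed with a plain breadth-first search. A minimal Python sketch over the same edge list (`components` is an illustrative helper, not an igraph function):

```python
from collections import defaultdict, deque

def components(edges):
    """Connected components of an undirected graph via BFS."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for v in adj:
        if v not in seen:
            comp, queue = set(), deque([v])
            seen.add(v)
            while queue:
                u = queue.popleft()
                comp.add(u)
                for w in adj[u] - seen:
                    seen.add(w)
                    queue.append(w)
            comps.append(comp)
    return comps

edges = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "G"), ("E", "F"),
         ("I", "E"), ("H", "G"), ("J", "K"), ("K", "L")]
for c in components(edges):
    print(sorted(c))
```

This yields the same three groups as the igraph membership vector: {A, B, C, D, G, H}, {E, F, I}, and {J, K, L}.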

Labelling groups of two columns in SQL (BigQuery SQL if possible)

Here is my solution for the first iteration. It is a bit long and might be improved, but this is what I have.

Step 1.

select name, nest(ip) ips, group_concat(string(ip)) sip from 
(select 'a' name, 1 ip),
(select 'b' name, 1 ip),
(select 'c' name, 1 ip),
(select 'b' name, 2 ip),
(select 'c' name, 2 ip),
(select 'd' name, 3 ip),
(select 'e' name, 2 ip)
group by name

Store the results in temporary table x

Step 2.

select a.name name, group_concat(b.name) as cluster from (
select a.name, b.name from (
select a.*, b.* from dataset.x a cross join dataset.x b
) omit record if every(not b.sip contains string(a.ips))
group by 1, 2 order by 1, 2) group by 1

Store the results in temporary table y

Step 3.

select cluster from (
select group_concat(part) cluster from (
select name, part from (
select a.name name, split(b.cluster) part
from dataset.y a cross join dataset.y b
where b.cluster contains a.name) group by 1, 2 order by 1, 2)
group by name) group by cluster

This should produce all unique clusters, i.e.

a,b,c,e
d
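
The three steps amount to clustering names that share an IP, directly or through a chain of shared IPs. As a cross-check, here is a small Python sketch over the same rows (`cluster_names` is an illustrative name, not BigQuery syntax):

```python
def cluster_names(pairs):
    """pairs: (name, ip). Names sharing an IP, directly or through a
    chain of shared IPs, end up in the same cluster."""
    clusters = []  # list of (names, ips) set pairs
    for name, ip in pairs:
        # find every existing cluster this row touches, merge them all
        hits = [c for c in clusters if ip in c[1] or name in c[0]]
        merged = ({name}, {ip})
        for c in hits:
            merged = (merged[0] | c[0], merged[1] | c[1])
            clusters.remove(c)
        clusters.append(merged)
    return [",".join(sorted(names)) for names, _ in clusters]

pairs = [("a", 1), ("b", 1), ("c", 1), ("b", 2), ("c", 2), ("d", 3), ("e", 2)]
print(sorted(cluster_names(pairs)))  # ['a,b,c,e', 'd']
```

Note that a single SQL pass like the three steps above may need to be repeated until the clusters stop changing; the Python loop merges transitively in one pass because it can mutate its working set.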

Transitive matching in SQL

It took me 3 CTEs and a couple of cups of coffee, but here you have it...
My primary concern is this, which I read in the comments:

It's a repeatable task. There will be several groups and we
will have to do it for each group. The total record count across all
groups could be millions.

This cannot be a repeatable task, as the resource consumption would be high. I recommend using it once to normalize your groups, and then adding logic in your application or stored procedures to store new data with the desired groups.

DECLARE @table TABLE (id int not null identity, [Group] varchar(3), Member varchar(3), Address varchar(3), Phone varchar(3), Email varchar(3)) 

insert @table values
('G1', 'M1', 'A1', 'P1', 'E1'),
('G1', 'M2', 'A2', 'P2', 'E2'),
('G1', 'M3', 'A1', 'P3', 'E1'),
('G1', 'M4', 'A4', 'P3', 'E4'),
('G1', 'M5', 'A5', 'P5', 'E2'),
('G1', 'M6', 'A6', 'P6', 'E6'),
('G1', 'M7', 'A7', 'P6', 'E7'),
('G1', 'M8', 'A8', 'P8', 'E4'),
('G1', 'M9', 'A9', 'P9', 'E7'),
('G1', 'M10', 'A10', 'P10', 'E10');

with
/* Find all matches
id Member MatchWith
1 M1 M3
2 M2 M5
3 M3 M1
3 M3 M4 ...
*/
matches as (
SELECT t.id, t.[Group], t.Member, a.member as MatchWith
from
@table t
outer apply (
select distinct member
from @table
where member <> t.member and [group] = t.[group] and (Address = t.Address OR Phone = t.Phone OR Email = t.Email)
) a
)
/* Stuffing the matches per member
id Member AllMatches
1 M1 M1,M3
2 M2 M2,M5
3 M3 M1,M3,M4 .....
*/
, matchsummary as (
SELECT DISTINCT id, [Group], Member, STUFF((
SELECT ',' + Member FROM (
SELECT m.Member
UNION ALL
SELECT DISTINCT MatchWith
FROM matches
WHERE Member = m.Member) U
ORDER BY Member
FOR XML PATH('')
), 1, 1, '') as AllMatches
FROM matches m
)
/* Recursive CTE to find "cousins" records (M1, M3 matches on Address and Email; M3 in turn matches with M4 on Phone)
id Member AllMatches gr
1 M1 M1,M3 1
2 M2 M2,M5 2
3 M3 M1,M3,M4 1
4 M4 M3,M4,M8 1
*/
, tree as (
select *, ROW_NUMBER() over (order by id) as gr
from matchsummary where AllMatches LIKE member+'%'
/* The groups are created using the Members who are the first one in their matches
id Member AllMatches gr
1 M1 M1,M3 1
2 M2 M2,M5 2
6 M6 M6,M7 3
10 M10 M10 4
*/
union all
select s.*, t.gr
from matchsummary s
join tree t on s.Member <> t.Member and s.[Group] = t.[Group] and s.AllMatches NOT LIKE s.member+'%' and t.AllMatches like '%' + s.Member
)
select * from tree
order by id
option(maxrecursion 0)

Output:

ID  Group  Member  NewGroup
 1  G1     M1      1
 2  G1     M2      2
 3  G1     M3      1
 4  G1     M4      1
 5  G1     M5      2
 6  G1     M6      3
 7  G1     M7      3
 8  G1     M8      1
 9  G1     M9      3
10  G1     M10     4
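
The recursive CTE is doing transitive matching over shared Address/Phone/Email values. The expected grouping can be verified outside the database with a small Python sketch (`group_members` is an illustrative helper; it unions members through each shared attribute value):

```python
def group_members(rows):
    """rows: (member, address, phone, email). Members sharing any
    attribute value, directly or transitively, get the same group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    seen = {}  # (column, value) -> first member with that value;
               # keying by column avoids accidental cross-column collisions
    for member, *attrs in rows:
        find(member)
        for col, val in enumerate(attrs):
            if (col, val) in seen:
                union(seen[(col, val)], member)
            else:
                seen[(col, val)] = member

    groups, out = {}, {}
    for member, *_ in rows:
        r = find(member)
        groups.setdefault(r, len(groups) + 1)  # dense group numbers 1, 2, ...
        out[member] = groups[r]
    return out

rows = [("M1", "A1", "P1", "E1"), ("M2", "A2", "P2", "E2"),
        ("M3", "A1", "P3", "E1"), ("M4", "A4", "P3", "E4"),
        ("M5", "A5", "P5", "E2"), ("M6", "A6", "P6", "E6"),
        ("M7", "A7", "P6", "E7"), ("M8", "A8", "P8", "E4"),
        ("M9", "A9", "P9", "E7"), ("M10", "A10", "P10", "E10")]
print(group_members(rows))
```

On the sample data this reproduces the NewGroup column above: {M1, M3, M4, M8} = 1, {M2, M5} = 2, {M6, M7, M9} = 3, {M10} = 4.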

Second Option

Given the size of your table, I recommend this option. I am not a big fan of loops, but here I think they are worth it; that way you don't need to process all your data at once.

First, you need to add a new column to your table to store the new group. My first thought was that it would be better to change your application's logic to calculate the group when a new record is inserted, but on reflection, an insert can cause several groups to become one, and you probably need fast response times in your application. So you can set up a job to regroup your data as often as you need. If you have an UpdatedDate field in your table, you can also refine this solution with a Log table, reprocessing only the groups that were modified after the last execution.

IF OBJECT_ID('tempdb..#table') IS NOT NULL
    DROP TABLE #table;
CREATE TABLE #table ([Group] varchar(3), Member varchar(3), Address varchar(3), Phone varchar(3), Email varchar(3))

INSERT #table ([Group], Member, Address, Phone, Email)
VALUES
('G1', 'M1', 'A1', 'P1', 'E1'),
('G1', 'M2', 'A2', 'P2', 'E2'),
('G1', 'M3', 'A1', 'P3', 'E1'),
('G1', 'M4', 'A4', 'P3', 'E4'),
('G1', 'M5', 'A5', 'P5', 'E2'),
('G1', 'M6', 'A6', 'P6', 'E6'),
('G1', 'M7', 'A7', 'P6', 'E7'),
('G1', 'M8', 'A8', 'P8', 'E4'),
('G1', 'M9', 'A9', 'P9', 'E7'),
('G1', 'M10', 'A10', 'P10', 'E10');

ALTER TABLE #table ADD newGroup INT

/******************************************************************
    START HERE
******************************************************************/

IF OBJECT_ID('tempdb..#Groups') IS NOT NULL
    DROP TABLE #Groups;

SELECT DISTINCT [Group] INTO #Groups FROM #table

DECLARE @Group VARCHAR(3)

WHILE EXISTS (SELECT 1 FROM #Groups)
BEGIN
    SELECT TOP 1 @Group = [Group] FROM #Groups

    UPDATE #table SET newGroup = NULL
    WHERE [Group] = @Group

    DECLARE @newGroup INT = 1
    DECLARE @member varchar(3)

    WHILE EXISTS (SELECT 1 FROM #table WHERE [Group] = @Group AND newGroup IS NULL)
    BEGIN
        SELECT TOP 1 @member = Member FROM #table WHERE [Group] = @Group AND newGroup IS NULL

        UPDATE #table SET newGroup = @newGroup
        WHERE Member = @member

        WHILE @@ROWCOUNT > 0
        BEGIN
            UPDATE T
            SET newGroup = @newGroup
            FROM #table T
            WHERE [Group] = @Group AND newGroup IS NULL
              AND EXISTS (
                  SELECT 1 FROM #table
                  WHERE newGroup = @newGroup
                    AND (Address = T.Address OR Phone = T.Phone OR Email = T.Email)
              )
        END
        SET @newGroup += 1
    END
    DELETE #Groups WHERE [Group] = @Group
END

SELECT * FROM #table

