How to Label "Transitive Groups" with SQL


Now, a new demand in 2013: I need to work with 10,000 items.
Using @GordonLinoff's elegant solution (above),
1000 items take about 1 second, but 2000 items take about 1 day... The performance is not good. The performance problem was also noted here.

The @NealB algorithm

(this is the best solution, and so fast!)
See the original and didactic description. Here the table T1 is the same as in the question text, and a second (temporary) table R is used to process and to show the results:

CREATE TABLE R (
  id    integer NOT NULL, -- PRIMARY KEY,
  label integer NOT NULL DEFAULT 0
);

CREATE FUNCTION t1r_labeler() RETURNS void AS $funcBody$
DECLARE
  label1   integer;
  label2   integer;
  newlabel integer;
  t        t1%rowtype;
BEGIN
  DELETE FROM R;
  INSERT INTO R(id)
    SELECT DISTINCT unnest(array[id1,id2])
    FROM T1 ORDER BY 1;
  newlabel := 0;
  FOR t IN SELECT * FROM t1
  LOOP  -- -- BASIC LABELING: -- --
    SELECT label INTO label1 FROM R WHERE id = t.id1;
    SELECT label INTO label2 FROM R WHERE id = t.id2;
    IF label1 = 0 AND label2 = 0 THEN
      newlabel := newlabel + 1;
      UPDATE R SET label = newlabel WHERE id IN (t.id1, t.id2);
    ELSIF label1 = 0 AND label2 != 0 THEN
      UPDATE R SET label = label2 WHERE id = t.id1;
    ELSIF label1 != 0 AND label2 = 0 THEN
      UPDATE R SET label = label1 WHERE id = t.id2;
    ELSIF label1 != label2 THEN  -- time-consuming
      UPDATE R SET label = label1 WHERE label = label2;
    END IF;
  END LOOP;
END;
$funcBody$ LANGUAGE plpgsql VOLATILE;

Preparing and running,

-- same CREATE TABLE T1 (id1 integer, id2 integer);
DELETE FROM T1;
INSERT INTO T1(id1,id2)  -- populate the standard input
VALUES (1, 2), (1, 5), (4, 7), (7, 8), (9, 1);
-- or SELECT id1, id2 FROM table_with_1000000_items;

SELECT t1r_labeler();        -- run
SELECT * FROM R ORDER BY 2;  -- show

Dealing with the worst case

The last condition, when label1 != label2,
is the most time-consuming operation. It must be avoided, or it can be separated out for cases of high connectivity, which are the worst ones.

To raise some kind of alert, you can count the proportion of times the procedure runs this last condition, and/or you can separate the last update into its own pass.

If you separate it, you can analyse those cases and deal with them a little better.
So, eliminate the last ELSIF, and after the first loop add your checks and this second loop:

-- ... first loop and checks here ...
FOR t IN SELECT * FROM t1
LOOP  -- -- MERGING LABELS: -- --
  SELECT label INTO label1 FROM R WHERE id = t.id1;
  SELECT label INTO label2 FROM R WHERE id = t.id2;
  IF label1 != 0 AND label2 != 0 AND label1 != label2 THEN
    UPDATE R SET label = label1 WHERE label = label2;
  END IF;
END LOOP;
-- ...

Example of a worst case: a group with more than 1000 connected nodes out of 10,000 nodes, with an average of about 10 nodes per labeled group (the cores) and only a few paths connecting the cores.
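The costly merge in the last ELSIF (relabeling a whole group with one UPDATE) is exactly what a disjoint-set (union-find) structure avoids: with path compression and union by size, each merge is near-constant time. This is not part of the SQL answer above; it is a minimal Python sketch of the same labeling over the sample T1 data, with illustrative names (`label_groups`, `find`):

```python
def label_groups(edges):
    """Connected-component labeling via union-find
    (path compression + union by size)."""
    parent, size = {}, {}

    def find(x):
        parent.setdefault(x, x)
        size.setdefault(x, 1)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            if size[ra] < size[rb]:  # union by size: attach small under big
                ra, rb = rb, ra
            parent[rb] = ra
            size[ra] += size[rb]

    roots = {i: find(i) for i in parent}
    # renumber roots to dense labels 1, 2, ...
    dense = {r: n for n, r in enumerate(sorted(set(roots.values())), 1)}
    return {i: dense[r] for i, r in roots.items()}

t1 = [(1, 2), (1, 5), (4, 7), (7, 8), (9, 1)]
print(label_groups(t1))  # {1: 1, 2: 1, 5: 1, 4: 2, 7: 2, 8: 2, 9: 1}
```

Inside PostgreSQL the same idea would need the parent map kept in a table or in plpgsql arrays, but the sketch shows why the worst case disappears: merging two labels never rewrites every member of a group.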

Array-oriented algorithm

This other solution is slower (it is a brute-force algorithm), but it can be useful when you need direct processing with arrays and do not need the fastest solution (and have no "worst cases").

As @peter.petrov and @RBarryYoung suggested using a more adequate data structure... I went back to my arrays as that "more adequate data structure". After all, there is a good speed-up (compared with @GordonLinoff's algorithm) with the solution below (!).

The first step is to translate the table t1 of the question text into a temporary one, transgroup1, where we can run the new process:

-- DROP TABLE transgroup1;
CREATE TABLE transgroup1 (
  id    serial NOT NULL PRIMARY KEY,
  items integer[],  -- two or more items in the transitive relationship
  dels  integer[] DEFAULT array[]::integer[]
);
INSERT INTO transgroup1(items)
  SELECT array[id1, id2] FROM t1;  -- now suppose t1 is a 10000-item table

then, with these two functions, we can solve the problem:

CREATE FUNCTION array_uunion(anyarray, anyarray) RETURNS anyarray AS $$
  -- ensures distinct items of a concatenation
  SELECT ARRAY(SELECT unnest($1) UNION SELECT unnest($2))
$$ LANGUAGE sql IMMUTABLE;

CREATE FUNCTION transgroup1_loop() RETURNS void AS $BODY$
DECLARE
  cp_dels integer[];
  i       integer;
  max_i   integer;
BEGIN
  i := 1;
  max_i := 10;  -- or 100 or more, but needs some control to be safe
  LOOP
    UPDATE transgroup1
    SET items = array_uunion(transgroup1.items, t2.items),
        dels  = transgroup1.dels || t2.id
    FROM transgroup1 AS t1, transgroup1 AS t2
    WHERE transgroup1.id = t1.id AND t1.id > t2.id AND t1.items && t2.items;

    cp_dels := array(
      SELECT DISTINCT unnest(dels) FROM transgroup1
    );  -- ensures all items to delete
    EXIT WHEN i > max_i OR array_length(cp_dels, 1) = 0;

    DELETE FROM transgroup1 WHERE id IN (SELECT unnest(cp_dels));
    UPDATE transgroup1 SET dels = array[]::integer[];
    i := i + 1;
  END LOOP;
  UPDATE transgroup1  -- only to beautify
  SET items = ARRAY(SELECT unnest(items) ORDER BY 1 DESC);
END;
$BODY$ LANGUAGE plpgsql VOLATILE;

Of course, to run and see the results, you can use

SELECT transgroup1_loop();  -- not 1 day but some hours!
SELECT *, dense_rank() OVER (ORDER BY id) AS group FROM transgroup1;

resulting in

 id |   items   | dels | group
----+-----------+------+-------
  4 | {8,7,4}   | {}   |     1
  5 | {9,5,2,1} | {}   |     2
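
The loop above is, in essence, repeated merging of overlapping sets until a fixpoint. As an illustration only (not part of the SQL solution), here is a brute-force Python sketch of the same idea over the sample T1 data, with `merge_groups` an illustrative name:

```python
def merge_groups(edges):
    """Brute force: repeatedly merge any two overlapping sets
    until no pair of sets shares an element."""
    groups = [set(e) for e in edges]
    changed = True
    while changed:
        changed = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if groups[i] & groups[j]:  # overlap: merge j into i
                    groups[i] |= groups.pop(j)
                    changed = True
                    break
            if changed:
                break
    return groups

t1 = [(1, 2), (1, 5), (4, 7), (7, 8), (9, 1)]
print(sorted(sorted(g) for g in merge_groups(t1)))  # [[1, 2, 5, 9], [4, 7, 8]]
```

Like the SQL version, this is quadratic per pass, which is why the author caps the passes with max_i.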

How to label a big set of “transitive groups” with a constraint?

An SQL-only solution appears to be a bit of a problem here. With the help of some procedural
programming on top of SQL, the solution appears to be fairly simple and efficient. Here is a brief outline
of a solution as it could be implemented in any procedural language invoking SQL.

Declare table R with primary key ID, where ID is drawn from the same domain as ID1 and ID2 of table T1.
Table R contains one other non-key column, a Label number.

Populate table R with the range of values found in T1. Set Label to zero (no label).
Using your example data, the initial setup for R would look like:

Table R
ID Label
== =====
1 0
2 0
4 0
5 0
7 0
8 0
9 0

Using a host-language cursor plus an auxiliary counter, read each row from T1. Look up ID1 and ID2 in R. You will find one of
four cases:

 Case 1: ID1.Label == 0 and ID2.Label == 0

In this case neither one of these IDs has been "seen" before: add 1 to the counter and then update both
rows of R to the value of the counter: update R set R.Label = :counter where R.ID in (:ID1, :ID2)

 Case 2: ID1.Label == 0 and ID2.Label <> 0

In this case, ID1 is new but ID2 has already been assigned a label. ID1 needs to be assigned the
same label as ID2: update R set R.Label = :ID2.Label where R.ID = :ID1

 Case 3: ID1.Label <> 0 and ID2.Label == 0

In this case, ID2 is new but ID1 has already been assigned a label. ID2 needs to be assigned the
same label as ID1: update R set R.Label = :ID1.Label where R.ID = :ID2

 Case 4: ID1.Label <> 0 and ID2.Label <> 0

In this case, the row contains redundant information. Both rows of R should contain the same Label value. If not,
there is some sort of data integrity problem. Ahhh... not quite; see the edit below...

EDIT I just realized that there are situations where both Label values here could be non-zero and different. If both are non-zero and different, then two Label groups need to be merged at this point. All you need to do is choose one Label and update the others to match, with something like: update R set R.Label = :ID1.Label where R.Label = :ID2.Label. Now both groups have been merged under the same Label value.

Upon completion of the cursor, table R will contain Label values needed to update T2.

Table R
ID Label
== =====
1 1
2 1
4 2
5 1
7 2
8 2
9 1

Process table T2
using something along the lines of: update T2 set T2.Label = R.Label where T2.ID1 = R.ID. The end result should be:

  table T2
ID1 | ID2 | LABEL
1 | 2 | 1
1 | 5 | 1
4 | 7 | 2
7 | 8 | 2
9 | 1 | 1

This process is purely iterative and should scale to fairly large tables without difficulty.
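As a cross-check of the four cases, here is the same cursor logic sketched in Python, with a plain dict standing in for table R (illustrative only; a real implementation would be host-language code issuing the UPDATEs shown above):

```python
def label_r(t1_rows):
    # R: id -> label; 0 means "no label yet"
    r = {i: 0 for e in t1_rows for i in e}
    counter = 0
    for id1, id2 in t1_rows:
        l1, l2 = r[id1], r[id2]
        if l1 == 0 and l2 == 0:      # case 1: both new
            counter += 1
            r[id1] = r[id2] = counter
        elif l1 == 0:                # case 2: id1 joins id2's label
            r[id1] = l2
        elif l2 == 0:                # case 3: id2 joins id1's label
            r[id2] = l1
        elif l1 != l2:               # case 4 (edit): merge the two labels
            for k, v in r.items():
                if v == l2:
                    r[k] = l1
    return r

t1 = [(1, 2), (1, 5), (4, 7), (7, 8), (9, 1)]
print(label_r(t1))  # {1: 1, 2: 1, 5: 1, 4: 2, 7: 2, 8: 2, 9: 1}
```

On the example data this reproduces the final R table above: IDs 1, 2, 5, 9 get label 1 and IDs 4, 7, 8 get label 2.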

Convert transitive connections of elements into groups in R or SQL

Your data is a graph, defined by the list of its edges,
and you want its connected components.
This is what the clusters function in the igraph package computes.

# Sample data
d <- structure(c("A", "B", "C", "C", "E", "I", "H", "J", "K", "B",
"C", "D", "G", "F", "E", "G", "K", "L"), .Dim = c(9L, 2L), .Dimnames = list(
NULL, c("e1", "e2")))

library(igraph)
g <- graph.edgelist( as.matrix(d) )
clusters(g)
# $membership
# [1] 1 1 1 1 1 2 2 2 1 3 3 3
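
If igraph is not at hand, the same connected components can be computed with a plain breadth-first search. A minimal Python sketch over the same edge list (`components` is an illustrative helper, not an igraph function):

```python
from collections import defaultdict, deque

def components(edges):
    """Connected components of an undirected graph via BFS."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for v in adj:
        if v not in seen:
            comp, queue = set(), deque([v])
            seen.add(v)
            while queue:
                u = queue.popleft()
                comp.add(u)
                for w in adj[u] - seen:
                    seen.add(w)
                    queue.append(w)
            comps.append(comp)
    return comps

edges = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "G"), ("E", "F"),
         ("I", "E"), ("H", "G"), ("J", "K"), ("K", "L")]
for c in components(edges):
    print(sorted(c))
```

This yields the same three groups as the igraph membership vector: {A, B, C, D, G, H}, {E, F, I}, and {J, K, L}.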

Labelling groups of two columns in SQL (BigQuery SQL if possible)

Here is my solution for the first iteration. It is a bit long and might be improved, but this is what I have.

Step 1.

select name, nest(ip) ips, group_concat(string(ip)) sip from 
(select 'a' name, 1 ip),
(select 'b' name, 1 ip),
(select 'c' name, 1 ip),
(select 'b' name, 2 ip),
(select 'c' name, 2 ip),
(select 'd' name, 3 ip),
(select 'e' name, 2 ip)
group by name

Store the results in temporary table x

Step 2.

select a.name name, group_concat(b.name) as cluster from (
select a.name, b.name from (
select a.*, b.* from dataset.x a cross join dataset.x b
) omit record if every(not b.sip contains string(a.ips))
group by 1, 2 order by 1, 2) group by 1

Store the results in temporary table y

Step 3.

select cluster from (
select group_concat(part) cluster from (
select name, part from (
select a.name name, split(b.cluster) part
from dataset.y a cross join dataset.y b
where b.cluster contains a.name) group by 1, 2 order by 1, 2)
group by name) group by cluster

This should produce all unique clusters, i.e.

a,b,c,e
d
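
The three steps amount to clustering names that share an IP, directly or through a chain of shared IPs. As a cross-check, here is a small Python sketch over the same rows (`cluster_names` is an illustrative name, not BigQuery syntax):

```python
def cluster_names(pairs):
    """pairs: (name, ip). Names sharing an IP, directly or through a
    chain of shared IPs, end up in the same cluster."""
    clusters = []  # list of (names, ips) set pairs
    for name, ip in pairs:
        # find every existing cluster this row touches, merge them all
        hits = [c for c in clusters if ip in c[1] or name in c[0]]
        merged = ({name}, {ip})
        for c in hits:
            merged = (merged[0] | c[0], merged[1] | c[1])
            clusters.remove(c)
        clusters.append(merged)
    return [",".join(sorted(names)) for names, _ in clusters]

pairs = [("a", 1), ("b", 1), ("c", 1), ("b", 2), ("c", 2), ("d", 3), ("e", 2)]
print(sorted(cluster_names(pairs)))  # ['a,b,c,e', 'd']
```

Note that a single SQL pass like the three steps above may need to be repeated until the clusters stop changing; the Python loop merges transitively in one pass because it can mutate its working set.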

Transitive matching in SQL

It took me 3 CTEs and a couple of cups of coffee, but here you have it...
My primary concern is this, which I read in the comments:

It's a repeatable task. There will be several groups and we
will have to do it for each group. The total record count across all
groups could be millions.

This cannot be a repeatable task, as the resource consumption would be high. I recommend using it once to normalize your groups, and then adding logic in your application or stored procedures to store new data with the desired groups.

DECLARE @table TABLE (id int not null identity, [Group] varchar(3), Member varchar(3), Address varchar(3), Phone varchar(3), Email varchar(3)) 

insert @table values
('G1', 'M1', 'A1', 'P1', 'E1'),
('G1', 'M2', 'A2', 'P2', 'E2'),
('G1', 'M3', 'A1', 'P3', 'E1'),
('G1', 'M4', 'A4', 'P3', 'E4'),
('G1', 'M5', 'A5', 'P5', 'E2'),
('G1', 'M6', 'A6', 'P6', 'E6'),
('G1', 'M7', 'A7', 'P6', 'E7'),
('G1', 'M8', 'A8', 'P8', 'E4'),
('G1', 'M9', 'A9', 'P9', 'E7'),
('G1', 'M10', 'A10', 'P10', 'E10');

with
/* Find all matches
id Member MatchWith
1 M1 M3
2 M2 M5
3 M3 M1
3 M3 M4 ...
*/
matches as (
SELECT t.id, t.[Group], t.Member, a.member as MatchWith
from
@table t
outer apply (
select distinct member
from @table
where member <> t.member and [group] = t.[group] and (Address = t.Address OR Phone = t.Phone OR Email = t.Email)
) a
)
/* Stuffing the matches per member
id Member AllMatches
1 M1 M1,M3
2 M2 M2,M5
3 M3 M1,M3,M4 .....
*/
, matchsummary as (
SELECT DISTINCT id, [Group], Member, STUFF((
SELECT ',' + Member FROM (
SELECT m.Member
UNION ALL
SELECT DISTINCT MatchWith
FROM matches
WHERE Member = m.Member) U
ORDER BY Member
FOR XML PATH('')
), 1, 1, '') as AllMatches
FROM matches m
)
/* Recursive CTE to find "cousins" records (M1, M3 matches on Address and Email; M3 in turn matches with M4 on Phone)
id Member AllMatches gr
1 M1 M1,M3 1
2 M2 M2,M5 2
3 M3 M1,M3,M4 1
4 M4 M3,M4,M8 1
*/
, tree as (
select *, ROW_NUMBER() over (order by id) as gr
from matchsummary where AllMatches LIKE member+'%'
/* The groups are created using the Members who are the first one in their matches
id Member AllMatches gr
1 M1 M1,M3 1
2 M2 M2,M5 2
6 M6 M6,M7 3
10 M10 M10 4
*/
union all
select s.*, t.gr
from matchsummary s
join tree t on s.Member <> t.Member and s.[Group] = t.[Group] and s.AllMatches NOT LIKE s.member+'%' and t.AllMatches like '%' + s.Member
)
select * from tree
order by id
option(maxrecursion 0)

Output:

ID  Group  Member  NewGroup
 1  G1     M1      1
 2  G1     M2      2
 3  G1     M3      1
 4  G1     M4      1
 5  G1     M5      2
 6  G1     M6      3
 7  G1     M7      3
 8  G1     M8      1
 9  G1     M9      3
10  G1     M10     4
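
The recursive CTE is doing transitive matching over shared Address/Phone/Email values. The expected grouping can be verified outside the database with a small Python sketch (`group_members` is an illustrative helper; it unions members through each shared attribute value):

```python
def group_members(rows):
    """rows: (member, address, phone, email). Members sharing any
    attribute value, directly or transitively, get the same group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    seen = {}  # (column, value) -> first member with that value;
               # keying by column avoids accidental cross-column collisions
    for member, *attrs in rows:
        find(member)
        for col, val in enumerate(attrs):
            if (col, val) in seen:
                union(seen[(col, val)], member)
            else:
                seen[(col, val)] = member

    groups, out = {}, {}
    for member, *_ in rows:
        r = find(member)
        groups.setdefault(r, len(groups) + 1)  # dense group numbers 1, 2, ...
        out[member] = groups[r]
    return out

rows = [("M1", "A1", "P1", "E1"), ("M2", "A2", "P2", "E2"),
        ("M3", "A1", "P3", "E1"), ("M4", "A4", "P3", "E4"),
        ("M5", "A5", "P5", "E2"), ("M6", "A6", "P6", "E6"),
        ("M7", "A7", "P6", "E7"), ("M8", "A8", "P8", "E4"),
        ("M9", "A9", "P9", "E7"), ("M10", "A10", "P10", "E10")]
print(group_members(rows))
```

On the sample data this reproduces the NewGroup column above: {M1, M3, M4, M8} = 1, {M2, M5} = 2, {M6, M7, M9} = 3, {M10} = 4.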

Second Option

Given the size of your table, I recommend this option. I am not a big fan of loops, but here I think they are worth it; that way you don't need to process all your data at once.

First, you need to add a new column to your table to store the new group. My first thought was that it would be better to change your application's logic to calculate the group when a new record is inserted, but on reflection, an insert can cause several groups to become one, and you probably need fast response times in your application. So you can set up a job to regroup your data as often as you need. If you have an UpdatedDate field in your table, you can also refine this solution with a Log table, reprocessing only the groups that were modified after the last execution.

IF OBJECT_ID('tempdb..#table') IS NOT NULL
    DROP TABLE #table;
CREATE TABLE #table ([Group] varchar(3), Member varchar(3), Address varchar(3), Phone varchar(3), Email varchar(3))

INSERT #table ([Group], Member, Address, Phone, Email)
VALUES
('G1', 'M1', 'A1', 'P1', 'E1'),
('G1', 'M2', 'A2', 'P2', 'E2'),
('G1', 'M3', 'A1', 'P3', 'E1'),
('G1', 'M4', 'A4', 'P3', 'E4'),
('G1', 'M5', 'A5', 'P5', 'E2'),
('G1', 'M6', 'A6', 'P6', 'E6'),
('G1', 'M7', 'A7', 'P6', 'E7'),
('G1', 'M8', 'A8', 'P8', 'E4'),
('G1', 'M9', 'A9', 'P9', 'E7'),
('G1', 'M10', 'A10', 'P10', 'E10');

ALTER TABLE #table ADD newGroup INT

/******************************************************************
    START HERE
******************************************************************/

IF OBJECT_ID('tempdb..#Groups') IS NOT NULL
    DROP TABLE #Groups;

SELECT DISTINCT [Group] INTO #Groups FROM #table

DECLARE @Group VARCHAR(3)

WHILE EXISTS (SELECT 1 FROM #Groups)
BEGIN
    SELECT TOP 1 @Group = [Group] FROM #Groups

    UPDATE #table SET newGroup = NULL
    WHERE [Group] = @Group

    DECLARE @newGroup INT = 1
    DECLARE @member varchar(3)

    WHILE EXISTS (SELECT 1 FROM #table WHERE [Group] = @Group AND newGroup IS NULL)
    BEGIN
        SELECT TOP 1 @member = Member FROM #table WHERE [Group] = @Group AND newGroup IS NULL

        UPDATE #table SET newGroup = @newGroup
        WHERE Member = @member

        WHILE @@ROWCOUNT > 0
        BEGIN
            UPDATE T
            SET newGroup = @newGroup
            FROM #table T
            WHERE [Group] = @Group AND newGroup IS NULL
              AND EXISTS (
                  SELECT 1 FROM #table
                  WHERE newGroup = @newGroup
                    AND (Address = T.Address OR Phone = T.Phone OR Email = T.Email)
              )
        END
        SET @newGroup += 1
    END
    DELETE #Groups WHERE [Group] = @Group
END

SELECT * FROM #table

