Group All Related Records in a Many-to-Many Relationship, SQL Graph Connected Components

Finding SQL connected components between many-to-many entities

SQL Server solution

I guess the main problem is that you need to compare by PR_ID, then by FP_ID. So the recursive part needs a CASE expression: on the first pass we match rows by FP_ID, on the second by PR_ID, and so on, alternating with the help of the modulo operator.
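
The answer assumes a #FP_PR table that isn't shown; here is a minimal setup, with sample rows reconstructed from the output below:

-- sample data inferred from the output further down
CREATE TABLE #FP_PR (FP_ID int, PR_ID int);

INSERT INTO #FP_PR (FP_ID, PR_ID)
VALUES (1, 1), (2, 1), (2, 2), (3, 2), (4, 2);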

DECLARE @fp int = 1

;WITH cte AS (
    -- anchor: start from the chosen FP_ID
    SELECT f.FP_ID,
           f.PR_ID,
           1 AS lev
    FROM #FP_PR f
    WHERE f.FP_ID = @fp

    UNION ALL

    -- recursion: alternate the matching column by level parity
    SELECT f.FP_ID,
           f.PR_ID,
           lev + 1
    FROM cte c
    CROSS JOIN #FP_PR f -- You can use INNER JOIN instead
    WHERE CASE (lev + 1) % 2 WHEN 0 THEN f.PR_ID WHEN 1 THEN f.FP_ID END
        = CASE (lev + 1) % 2 WHEN 0 THEN c.PR_ID WHEN 1 THEN c.FP_ID END
      AND NOT (f.PR_ID = c.PR_ID AND f.FP_ID = c.FP_ID)
)

SELECT *
FROM cte

Output:

FP_ID  PR_ID  lev
1      1      1
2      1      2
2      2      3
3      2      4
4      2      4
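
Every row the CTE returns belongs to @fp's component, so swapping out the final SELECT lets you read the membership directly, for example:

-- all PR_IDs reachable from @fp (1 and 2 for the data above)
SELECT DISTINCT PR_ID
FROM cte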

Grouping a many-to-many relationship from a two-column map

Converting 500K nodes into an adjacency matrix was too much for my computer's memory, so I couldn't use igraph. The RBGL package hasn't been updated for R version 2.15.1, so that was out as well.

After writing a lot of dumb code that didn't seem to work, I think the following gets me to the right answer.

library(data.table)  # aubk is a data.table with columns author_id and book_id

aubk[, grp := author_id]                    # seed each row's group with its author id
num.grp.old <- aubk[, length(unique(grp))]
iterations <- 0
repeat {
  aubk[, grp := min(grp), by = author_id]   # spread the smallest label across each author
  aubk[, grp := min(grp), by = book_id]     # ...and then across each book
  num.grp.new <- aubk[, length(unique(grp))]
  if (num.grp.new == num.grp.old) { break } # stop once the number of groups is stable
  num.grp.old <- num.grp.new
  iterations <- iterations + 1
}
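
Since this page is otherwise SQL-focused, the same fixed-point idea translates directly to SQL. A minimal T-SQL sketch, assuming a hypothetical table aubk(author_id, book_id, grp):

-- seed each row's group with its author id
UPDATE aubk SET grp = author_id;

DECLARE @changed int = 1;
WHILE @changed > 0
BEGIN
    -- pull the smallest group label across each author...
    UPDATE a
    SET    grp = m.min_grp
    FROM   aubk a
    JOIN  (SELECT author_id, MIN(grp) AS min_grp
           FROM aubk
           GROUP BY author_id) m
        ON m.author_id = a.author_id
    WHERE  a.grp > m.min_grp;
    SET @changed = @@ROWCOUNT;

    -- ...and then across each book, until neither pass changes a row
    UPDATE a
    SET    grp = m.min_grp
    FROM   aubk a
    JOIN  (SELECT book_id, MIN(grp) AS min_grp
           FROM aubk
           GROUP BY book_id) m
        ON m.book_id = a.book_id
    WHERE  a.grp > m.min_grp;
    SET @changed = @changed + @@ROWCOUNT;
END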

How to count all the connected nodes (rows) in a graph on Postgres?

You can use a recursive cte:

with recursive t(account_id, device_id) as (
        select 1, 10 union all
        select 1, 11 union all
        select 1, 12 union all
        select 2, 10 union all
        select 3, 11 union all
        select 3, 13 union all
        select 3, 14 union all
        select 4, 15 union all
        select 5, 15 union all
        select 6, 16
     ),
     a as (
        -- distinct pairs of accounts that share a device (kept in one direction only)
        select distinct t.account_id as a, t2.account_id as a2
        from t join
             t t2
             on t2.device_id = t.device_id and t.account_id >= t2.account_id
     ),
     cte as (
        select a.a, a.a2 as mina
        from a
        union all
        select a.a, cte.a
        from cte join
             a
             on a.a2 = cte.a and a.a > cte.a
     )
select grp, array_agg(a)
from (select a, min(mina) as grp
      from cte
      group by a
     ) a
group by grp;

Here is a SQL Fiddle.
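
Tracing the query against the inlined sample data: accounts 1, 2, and 3 are linked through shared devices 10 and 11, accounts 4 and 5 through device 15, and account 6 stands alone. So the result should look like this (array element order may vary):

 grp | array_agg
-----+-----------
   1 | {1,2,3}
   4 | {4,5}
   6 | {6}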

Which fields in my database diagram should I separate into relation tables?

I think the tables as you have them are mostly fine, because it's as your data grows that you see the value of the design.
Here are a few ideas you can use when making these decisions, if you're looking for optimizations:

  1. Which tables are simply used for referential integrity, with set values that will never change? I would think state and difficulty. Columns in the recipe table like state should just hold the state value directly so a lookup isn't needed. Sometimes these kinds of fixed values end up in name-value pair tables (fieldname, fieldvalue, fieldtext) with the integrity handled on the front end. Of course, this means that when I look at the recipe table I can see the difficulty and state right away, without looking the value up elsewhere (see the sketch after this list).
  2. If you had to move this into reporting or, say, analytics, do you have enough descriptors for an end-user to use? This might be a good reason to keep the description for difficulty.
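
To illustrate the first point, a rough sketch of the two options (all table and column names here are hypothetical):

-- Option A: store the display value inline; reading a recipe needs no join
CREATE TABLE recipe (
    recipe_id  int PRIMARY KEY,
    state      varchar(20) NOT NULL,  -- e.g. 'draft', 'published'
    difficulty varchar(20) NOT NULL   -- e.g. 'easy', 'hard'
);

-- Option B: normalize into a lookup table enforced by a foreign key;
-- the description column is the kind of descriptor point 2 argues for
CREATE TABLE difficulty (
    difficulty_id int PRIMARY KEY,
    name          varchar(20) NOT NULL,
    description   varchar(200)
);
-- recipe would then hold difficulty_id REFERENCES difficulty(difficulty_id)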

Creating a linked ID

Nothing scary about this question, once you jump in (and give yourself carpal tunnel from typing too much on a cell phone). This is just a slightly modified version of the standard recursive hierarchical query. Of note, the join condition in the recursion is that the current account number is some parent's child account. As for the numbering, we just use DENSE_RANK over the top-level parents.
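
The question's source table isn't reproduced here; a hypothetical temptable for trying the query, with the column shape inferred from the joins:

-- hypothetical sample data: two account chains under two top-level parents
CREATE TABLE temptable (
    Account_Number int,
    Child_Account  int,
    Parent_Account int
);

INSERT INTO temptable VALUES
    (100, 101, NULL),  -- top-level parent -> pos 1
    (101, 102, 100),
    (102, NULL, 101),
    (200, 201, NULL),  -- top-level parent -> pos 2
    (201, NULL, 200);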

WITH cte AS (
    -- anchor: top-level parents, numbered with DENSE_RANK
    SELECT m.*, DENSE_RANK() OVER (ORDER BY m.Account_Number) AS pos
    FROM temptable m
    WHERE Parent_Account IS NULL

    UNION ALL

    -- recursion: an account attaches to the parent row that lists it as a child
    SELECT m.*, cte.pos
    FROM temptable m
    INNER JOIN cte
        ON m.Account_Number = cte.Child_Account
)

SELECT *
FROM cte
ORDER BY pos;

Demo

Note: I give massive credit to the brilliant accepted answer here, written by @Quassnoi.

Group the related values in one group

We can do a simple check using set intersection to solve the problem.
(I'm not aware of GraphFrames :()

Step 1: combine all parts into a single array for each row

from pyspark.sql import functions as F

df_part_groups1 = df_part_groups.withColumn(
    'parts',
    F.array('partnumber', 'colVal1', 'colVal2', 'colVal3', 'colVal4', 'colVal5')
)

Step 2: collect all_parts, a list of lists of the combined parts, since the group needs to be determined across rows.

def clean_lists(plists):
    # drop empty strings / None entries from each collected parts list
    return [list(filter(None, pl)) for pl in plists]

all_parts = clean_lists(
    df_part_groups1.groupBy(F.lit(1))
                   .agg(F.collect_list('parts').alias('parts'))
                   .collect()[0]
                   .parts
)

Step 3: build the groups from the collected all_parts

def part_of_existing_group(gps, pl):
    # if pl overlaps an existing group, merge it in and report success
    for key in gps.keys():
        if set(gps[key]) & set(pl):
            gps[key] = list(set(gps[key] + pl))
            return True
    return False

def findGroups(plists):
    groups = {}
    index = 1
    for pl in plists:
        # start a new group only when pl touches none of the existing ones
        if len(groups.keys()) == 0 or (not part_of_existing_group(groups, pl)):
            groups[f'G{index}'] = pl
            index += 1
    return groups

Step 4: Assign groups based on the groups map that you created.

from pyspark.sql.functions import udf

groups = findGroups(all_parts)

@udf
def get_group_val(part):
    # look up which group a part belongs to
    for key in groups.keys():
        if part in groups[key]:
            return key
    return '-1'  # string, to match the udf's default StringType

df_part_groups2 = (df_part_groups1
                   .withColumn('part', F.explode('parts'))
                   .dropDuplicates(['part'])
                   .where(~F.col('part').like(''))
                   .select('part', 'parts')
                   .withColumn('Group', get_group_val('part')))

df_part_groups2.show()
+------+--------------------+-----+
|  part|               parts|Group|
+------+--------------------+-----+
| part0|[part0, part1, , ...|   G1|
| part1|[part0, part1, , ...|   G1|
|part10|[part10, part11, ...|   G2|
|part11|[part10, part11, ...|   G2|
|part13|[part11, part13, ...|   G2|
|part18|[part13, part21, ...|   G2|
| part2|[part1, , part2, ...|   G1|
|part20|[part13, part21, ...|   G2|
|part21|[part11, part13, ...|   G2|
| part3|[part2, part3, , ...|   G1|
| part4|[part1, , part2, ...|   G1|
| part5|[part2, part3, , ...|   G1|
| part6|[part2, part3, , ...|   G1|
| part7|[part2, part3, , ...|   G1|
+------+--------------------+-----+

