Find Groups with Matching Rows

This counts the number of rows for each name using a common table expression (CTE) with count(*) over (partition by name).

The matches CTE then uses a self-join where the names do not match, the models match, the per-name counts match, and one of the names is 'Lisa'. The having clause ensures that the number of matched rows (count(*)) equals the number of models that name has.

matches by itself would only return the name of each matching person, so we join back to the source table t to get the full list of models for each match.

;with cte as (
    select *
        , cnt = count(*) over (partition by name)
    from t
)
, matches as (
    select x2.name
    from cte as x
    inner join cte as x2
        on x.name <> x2.name
        and x.model = x2.model
        and x.cnt = x2.cnt
        and x.name = 'Lisa'
    group by x2.name, x.cnt
    having count(*) = x.cnt
)
select t.*
from t
inner join matches m
    on t.name = m.name

rextester demo: http://rextester.com/SUKP78304

returns:

+-------+-------+
| name | model |
+-------+-------+
| Kevin | Civic |
| Kevin | Focus |
+-------+-------+
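As a quick check, here is a runnable sketch of the same logic against sqlite3. The table `t` and its rows are made-up sample data in which Kevin owns exactly Lisa's models while Bob only shares one; the T-SQL `cnt = count(*) ...` alias syntax is rewritten as `... as cnt` for standard SQL:

```python
import sqlite3

# Hypothetical sample data: Kevin matches Lisa exactly; Bob does not.
con = sqlite3.connect(":memory:")
con.execute("create table t (name text, model text)")
con.executemany("insert into t values (?, ?)",
                [("Lisa", "Civic"), ("Lisa", "Focus"),
                 ("Kevin", "Civic"), ("Kevin", "Focus"),
                 ("Bob", "Civic")])

rows = con.execute("""
with cte as (
    select *, count(*) over (partition by name) as cnt
    from t
),
matches as (
    select x2.name
    from cte as x
    inner join cte as x2
        on x.name <> x2.name
        and x.model = x2.model
        and x.cnt = x2.cnt
        and x.name = 'Lisa'
    group by x2.name, x.cnt
    having count(*) = x.cnt
)
select t.name, t.model
from t
inner join matches m on t.name = m.name
order by t.name, t.model
""").fetchall()
print(rows)  # [('Kevin', 'Civic'), ('Kevin', 'Focus')]
```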

We could also write it without the CTEs, but that makes it a little harder to follow:

select t.*
from t
inner join (
    select x2.Name
    from (
        select *, cnt = count(*) over (partition by name)
        from t
        where name = 'Lisa'
    ) as x
    inner join (
        select *, cnt = count(*) over (partition by name)
        from t
    ) as x2
        on x.name <> x2.name
        and x.model = x2.model
        and x.cnt = x2.cnt
    group by x2.name, x.cnt
    having count(*) = x.cnt
) as m
    on t.name = m.name

Find groups containing several rows matching condition

You have two conditions:

  1. hit is a subset of the group: x['B'].isin(hit).sum()==len(hit)

  2. the value at B is contained in hit: x['B'].isin(hit)

So you can express both conditions like this:

hit = frozenset('GHI')
print(df[df.groupby('A')['B'].transform(hit.issubset) & df['B'].isin(hit)])

Output

    A  B
2   A  G
3   A  H
4   A  I
11  C  G
12  C  H
16  C  I

The expression:

df.groupby('A')['B'].transform(hit.issubset)

is the equivalent of condition 1.
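To see the whole thing run end to end, here is a self-contained sketch with a made-up frame (the original `df` is not shown in the answer) in which groups 'A' and 'C' contain all of G, H, I and group 'B' does not:

```python
import pandas as pd

# Hypothetical data: groups 'A' and 'C' cover hit fully; 'B' is missing 'I'.
df = pd.DataFrame({
    "A": list("AAAAABBBCCCC"),
    "B": list("EFGHIGHXGHII"),
})
hit = frozenset("GHI")

# transform(hit.issubset) broadcasts a per-group True/False (condition 1);
# isin(hit) then keeps only the matching rows themselves (condition 2).
mask = df.groupby("A")["B"].transform(hit.issubset) & df["B"].isin(hit)
print(df[mask])
```

Note that `hit.issubset` receives each group's values as an iterable, so the per-group result is a single boolean that `transform` broadcasts back to every row of the group.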

Find group of records that match multiple values

You can do this with conditional aggregation:

select parentid 
from tablename
group by parentid
having sum(case when datavalue = 1 then 1 else 0 end) > 0
   and sum(case when datavalue = 6 then 1 else 0 end) > 0

Another way is to use exists:

select distinct parentid
from tablename t1
where exists (select * from tablename where parentid = t1.parentid and datavalue = 1)
  and exists (select * from tablename where parentid = t1.parentid and datavalue = 6)

Another way is counting distinct occurrences:

select parentid 
from tablename
where datavalue in (1, 6)
group by parentid
having count(distinct datavalue) = 2
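All three approaches give the same answer; here is a quick sqlite3 sketch checking the first and third against made-up rows where only parent 1 has both values:

```python
import sqlite3

# Hypothetical data: parent 1 has datavalues 1 and 6; parents 2 and 3 do not.
con = sqlite3.connect(":memory:")
con.execute("create table tablename (parentid int, datavalue int)")
con.executemany("insert into tablename values (?, ?)",
                [(1, 1), (1, 6), (1, 3), (2, 1), (3, 6)])

agg = con.execute("""
select parentid
from tablename
group by parentid
having sum(case when datavalue = 1 then 1 else 0 end) > 0
   and sum(case when datavalue = 6 then 1 else 0 end) > 0
""").fetchall()

distinct = con.execute("""
select parentid
from tablename
where datavalue in (1, 6)
group by parentid
having count(distinct datavalue) = 2
""").fetchall()

print(agg, distinct)  # [(1,)] [(1,)]
```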

Get only matching rows for groups in Pandas groupby

You can reset_index, then use duplicated with a boolean index to filter your dataframe:

gb = gb.reset_index()
gb[gb.duplicated(subset=['Col2','Col3'], keep=False)]

Output:

  Col1 Col2 Col3  sum  mean
0    a    x    m    1     1
2    b    x    m    2     2
3    b    z    l    2     2
5    c    z    l    2     2

How do I find groups of rows where all rows in each group have a specific column value

SELECT ID1, ID2
FROM MyTable
GROUP BY ID1, ID2
HAVING COUNT(Type) = SUM(CASE WHEN Type = 'A' THEN 1 ELSE 0 END)
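The having clause keeps a (ID1, ID2) group only when every row's Type is 'A'. A sqlite3 sketch with assumed sample data, where group (1, 1) is all 'A' and group (1, 2) is mixed:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table MyTable (ID1 int, ID2 int, Type text)")
con.executemany("insert into MyTable values (?, ?, ?)",
                [(1, 1, 'A'), (1, 1, 'A'),   # every row is 'A'
                 (1, 2, 'A'), (1, 2, 'B')])  # mixed types

rows = con.execute("""
select ID1, ID2
from MyTable
group by ID1, ID2
having count(Type) = sum(case when Type = 'A' then 1 else 0 end)
""").fetchall()
print(rows)  # [(1, 1)]
```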

Lookup rows using pairs of values and find exact matching groups

You can use SQL relational division logic as described in this answer. You're interested in the part that says exact division/no remainder:

with project_list as (
    select id
    from project
    where exists (
        select *
        from (values
            ('pete', '0.0.1'),
            ('swag', '0.0.1')
        ) as user_input(name, version)
        where project.name = user_input.name and project.version = user_input.version
    )
), person_project_copy as (
    select person_id, case when project_list.id is not null then 1 end as is_required
    from person_project
    left join project_list on person_project.project_id = project_list.id
)
select person_id
from person_project_copy
group by person_id
having count(is_required) = (select count(*) from project_list)
   and count(*) = (select count(*) from project_list)

DB<>Fiddle for all three examples
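A runnable sqlite3 sketch of the exact-division query; the `project` and `person_project` rows are invented (person 10 has exactly the two requested projects, person 20 has an extra one), and the `values ... as user_input(name, version)` alias is moved into a named CTE because SQLite does not accept a column list on a derived-table alias:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table project (id int, name text, version text);
create table person_project (person_id int, project_id int);
insert into project values
    (1, 'pete', '0.0.1'), (2, 'swag', '0.0.1'), (3, 'misc', '1.0.0');
-- person 10 matches the input exactly; person 20 has a remainder
insert into person_project values
    (10, 1), (10, 2),
    (20, 1), (20, 2), (20, 3);
""")

rows = con.execute("""
with user_input(name, version) as (
    values ('pete', '0.0.1'), ('swag', '0.0.1')
), project_list as (
    select id
    from project
    where exists (
        select * from user_input
        where project.name = user_input.name
          and project.version = user_input.version
    )
), person_project_copy as (
    select person_id,
           case when project_list.id is not null then 1 end as is_required
    from person_project
    left join project_list on person_project.project_id = project_list.id
)
select person_id
from person_project_copy
group by person_id
having count(is_required) = (select count(*) from project_list)
   and count(*) = (select count(*) from project_list)
""").fetchall()
print(rows)  # [(10,)]
```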

Get top 1 row of each group

;WITH cte AS
(
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
    FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

If you expect 2 entries per day, this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead.
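A minimal sqlite3 sketch of the ROW_NUMBER pattern, with invented log rows where document 1 has two statuses and the latest wins:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""create table DocumentStatusLogs
               (DocumentID int, Status text, DateCreated text)""")
con.executemany("insert into DocumentStatusLogs values (?, ?, ?)",
                [(1, 'Created',  '2024-01-01'),
                 (1, 'Approved', '2024-01-03'),
                 (2, 'Created',  '2024-01-02')])

rows = con.execute("""
with cte as (
    select *,
           row_number() over (partition by DocumentID
                              order by DateCreated desc) as rn
    from DocumentStatusLogs
)
select DocumentID, Status
from cte
where rn = 1
order by DocumentID
""").fetchall()
print(rows)  # [(1, 'Approved'), (2, 'Created')]
```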

As for normalised or not, it depends if you want to:

  • maintain status in 2 places
  • preserve status history
  • ...

As it stands, you preserve status history. If you want the latest status in the parent table too (which is denormalisation), you'd need a trigger to maintain "status" in the parent, or you'd drop this status history table.

SQL Weird Grouping - Matching rows sharing common values for either of two columns?

I believe this does what was originally asked in the question, with one caveat: orders not repeated in either col1 or col2 are grouped together (uncomment the line union all select 8, 'PO25303311', 331503 in the CTE for an example).

;with cache_resale_tbl as (
    select 1 KeyID, 'PO25303309' col1, 255207 col2
    union all select 2, 'PO25303304', 257459
    union all select 3, 'PO25303305', 257459
    union all select 4, 'PO25303306', 257459
    union all select 5, 'PO25303307', 257459
    union all select 6, 'PO25303309', 257459
    union all select 7, 'PO25303310', 331502
    --union all select 8, 'PO25303311', 331503
)
, CountRepeatVal as (
    select cache_resale_tbl.*
        , CntCol1 = COUNT(*) over (partition by col1)
        , CntCol2 = COUNT(*) over (partition by col2)
    from cache_resale_tbl
)
, Grouped as (
    select CountRepeatVal.*
        , Groups = case when (CountRepeatVal.CntCol1 > 1 OR CountRepeatVal.CntCol2 > 1) then 0 else 1 end
    from CountRepeatVal
)
select
    --Grouped.*
    Grouped.KeyID
    , Grouped.col1
    , Grouped.col2
    , GROUP_ID = DENSE_RANK() OVER (ORDER BY Groups)
FROM Grouped

Here's the db<>fiddle
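The same idea ported to sqlite3 (the T-SQL `alias = expr` syntax becomes `expr as alias`, and the union-all seed data becomes a VALUES CTE): rows 1 through 6 share a value in col1 or col2 and land in one group, row 7 shares nothing and lands in another.

```python
import sqlite3

con = sqlite3.connect(":memory:")
rows = con.execute("""
with cache_resale_tbl(KeyID, col1, col2) as (
    values (1, 'PO25303309', 255207),
           (2, 'PO25303304', 257459),
           (3, 'PO25303305', 257459),
           (4, 'PO25303306', 257459),
           (5, 'PO25303307', 257459),
           (6, 'PO25303309', 257459),
           (7, 'PO25303310', 331502)
),
CountRepeatVal as (
    select cache_resale_tbl.*,
           count(*) over (partition by col1) as CntCol1,
           count(*) over (partition by col2) as CntCol2
    from cache_resale_tbl
),
Grouped as (
    select CountRepeatVal.*,
           case when CntCol1 > 1 or CntCol2 > 1 then 0 else 1 end as Groups
    from CountRepeatVal
)
select KeyID, col1, col2,
       dense_rank() over (order by Groups) as GROUP_ID
from Grouped
order by KeyID
""").fetchall()
print(rows)
```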

SQL Server find contained groups

I have tested with the set of data that you have provided and it works.
First one, groups not contained in another group:

SELECT DISTINCT Group_Number FROM #T
WHERE NOT EXISTS (SELECT Group_Number G2
                  FROM #T AS T2
                  WHERE T2.Group_Number <> #T.Group_Number
                  AND T2.ID = #T.ID)

And... the other way is very easy having this one:

SELECT DISTINCT Group_Number FROM #T WHERE NOT Group_Number IN (
    SELECT DISTINCT Group_Number FROM #T
    WHERE NOT EXISTS (SELECT Group_Number G2
                      FROM #T AS T2
                      WHERE T2.Group_Number <> #T.Group_Number
                      AND T2.ID = #T.ID)
)

Questioning my own answer, I realised that my response is not fully accurate.
First, I realised that adding:

INSERT INTO #t 
VALUES (6, 50),
(7, 60),
(8, 50),
(8, 60)

Group 8 did not appear in the result, because one of its items is present in group 6 and the other in group 7, even though no single group contains both.
So, after a lot of checks, I concluded that the following code guarantees the results and also gives traceability to verify whether the response is correct:

SELECT DISTINCT Group_Number FROM
(
    SELECT T1.Group_Number, T1.Rows, T2.Group_Number AS Comparing_With_Other_Group,
           COUNT(DISTINCT T2.ID) AS Rows_On_Other_Group
    FROM (
        SELECT Group_Number, COUNT(DISTINCT ID) AS Rows
        FROM #T
        GROUP BY Group_Number
    ) T1
    INNER JOIN #T AS T2
        ON T1.Group_Number <> T2.Group_Number
        AND EXISTS (SELECT 1 FROM #T WHERE #T.Group_Number = T1.Group_Number AND #T.ID = T2.ID)
    GROUP BY T1.Group_Number, T2.Group_Number, T1.Rows
) SubQry
WHERE Rows = Rows_On_Other_Group

If you run SubQry on its own you will see the traceability; the full query shows the groups for which there is another group that, when its IDs are filtered to the ones in the group being searched, yields the same number of IDs.
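A sqlite3 sketch of that final query with invented data exercising the tricky case (the temp table `#T` becomes a plain table `T`, and the alias `Rows` is renamed `RowCnt` since ROWS is a keyword in SQLite): group 1 is contained in group 2, groups 6 and 7 are each contained in group 8, and group 8 itself is correctly reported as not contained.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table T (Group_Number int, ID int)")
con.executemany("insert into T values (?, ?)",
                [(1, 10), (1, 20),
                 (2, 10), (2, 20), (2, 30),
                 (6, 50), (7, 60),
                 (8, 50), (8, 60)])

rows = con.execute("""
select distinct Group_Number from (
    select T1.Group_Number, T1.RowCnt,
           T2.Group_Number as Comparing_With_Other_Group,
           count(distinct T2.ID) as Rows_On_Other_Group
    from (
        select Group_Number, count(distinct ID) as RowCnt
        from T
        group by Group_Number
    ) T1
    inner join T as T2
        on T1.Group_Number <> T2.Group_Number
        and exists (select 1 from T as T3
                    where T3.Group_Number = T1.Group_Number
                      and T3.ID = T2.ID)
    group by T1.Group_Number, T2.Group_Number, T1.RowCnt
) SubQry
where RowCnt = Rows_On_Other_Group
order by Group_Number
""").fetchall()
print(rows)  # [(1,), (6,), (7,)]
```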
