SQL to Find Duplicate Entries (Within a Group)

SQL to find duplicate entries (within a group)

You can get the answer with a join instead of a subquery

select
a.*
from
event as a
inner join
(select groupid
from event
group by groupid
having count(*) <> 5) as b
on a.groupid = b.groupid

This is a fairly common way of obtaining the all the information out of the rows in a group.

Like your suggested answer and the other responses, this will run a lot faster with an index on groupid. It's up to the DBA to balance the benefit of making your query run a lot faster against the cost of maintaining yet another index.

If the DBA decides against the index, make sure the appropriate people understand that its the index strategy and not the way you wrote the query that is slowing things down.

Finding Duplicates: GROUP BY and DISTINCT giving different answers

Your counting logic is off, and mine was too, until I came up with a simple example to better understand your question. Imagine a simple table with only one column, text:

text
----
A
B
B
C
C
C

Running SELECT COUNT(*) just yields 6 records, as expected. SELECT DISTINCT text returns 3 records, for A,B,C. Finally, SELECT text with HAVING COUNT(*) > 1 returns only two records, for the B and C groups.

None of these numbers add up at all. The issue here is that a distinct select also returns records which are not duplicate, in addition to records which are duplicate. Also, a given duplicate record could occur more than two times. Your current comparison is somewhat apples to oranges.

Edit:

If you want to remove all duplicates in your six-column table, leaving only one distinct record from all columns, then try using a deletable CTE:

WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ScanNumber, DB_ID, PluginID,
PluginID_Version, Result, ActualValue
ORDER BY (SELECT NULL)) rn
FROM DBAScanResults
)

DELETE
FROM cte
WHERE rn > 1;

Finding duplicate values in a SQL table

SELECT
name, email, COUNT(*)
FROM
users
GROUP BY
name, email
HAVING
COUNT(*) > 1

Simply group on both of the columns.

Note: the older ANSI standard is to have all non-aggregated columns in the GROUP BY but this has changed with the idea of "functional dependency":

In relational database theory, a functional dependency is a constraint between two sets of attributes in a relation from a database. In other words, functional dependency is a constraint that describes the relationship between attributes in a relation.

Support is not consistent:

  • Recent PostgreSQL supports it.
  • SQL Server (as at SQL Server 2017) still requires all non-aggregated columns in the GROUP BY.
  • MySQL is unpredictable and you need sql_mode=only_full_group_by:

    • GROUP BY lname ORDER BY showing wrong results;
    • Which is the least expensive aggregate function in the absence of ANY() (see comments in accepted answer).
  • Oracle isn't mainstream enough (warning: humour, I don't know about Oracle).

Duplicate rows showing while using group by clause?

It depends on the output you want to produce.

If you want one row per business, but also want to return the additional columns (Start Period Date, End Period Date, brand, Committed Transaction Fee), you probably won't be able to use a GROUP BY since you'll need to include those columns in your GROUP BY clause.

You can use a QUALIFY function to do this and still return the columns, but you have to decide how to "pick" each row to be returned. For example:

QUALIFY ROW_NUMBER() OVER(
PARTITION BY [bg].BusinessRefId
ORDER BY Generate_Dates.Start_Date DESC
) = 1

This does the following:

  1. Split the result set rows into groups based on [bg].BusinessRefId
  2. Order the rows within those groups based on Generate_Dates.Start_Date
  3. Return the first row of each group

You can replace your GROUP BY clause with the QUALIFY.

WITH Generate_Dates(Start_Date, End_Date) AS (
SELECT
CAST(StartDateUtc AS Date),
CAST(
CASE
WHEN [bg].EndDateUtc IS NULL THEN GETDATE()
ELSE CAST([bg].EndDateUtc AS Date)
END AS DATE
)
FROM BusinessGoal AS [bg] WITH (NOLOCK)
WHERE [bg].Period = 'Year'

UNION ALL

SELECT
DATEADD(yy, 1, Start_Date),
End_Date
FROM Generate_Dates
WHERE DATEADD(yy, 1, Start_Date) < End_Date
)
SELECT DISTINCT
Generate_Dates.Start_Date AS 'Start Period Date',
Generate_Dates.End_Date AS 'End Period Date',
[bg].BusinessRefId AS 'brand',
CAST ([bg].Amount AS DECIMAL) AS 'Committed Transaction Fee',
(
SELECT
TOP 1 CAST (
SUM (
CASE
WHEN [s].Credit = 0 THEN - [s].SettlementAmount
ELSE [s].SettlementAmount
END
) AS DECIMAL
)
) AS 'Actual Transaction Fee',
(
SELECT
TOP 1 (
CASE [s].Culture
WHEN 'en-US' THEN 'USD'
WHEN 'en-CA' THEN 'CAD'
WHEN 'en-IE' THEN 'EUR'
WHEN 'en-AU' THEN 'AUD'
WHEN 'en-NZ' THEN 'NZD'
WHEN 'en-Gbg' THEN 'GbgP'
WHEN 'es-MX' THEN 'MXN'
WHEN 'jp-JP' THEN 'JPY'
WHEN 'nbg-NO' THEN 'NOK'
END
)
FROM Settlement AS [s] WITH (NOLOCK)
) AS 'Currency'
FROM BusinessGoal AS [bg] WITH (NOLOCK)
INNER JOIN Settlement AS [s] WITH (NOLOCK) ON [bg].BusinessRefId = [s].CompanyRefId
INNER JOIN OrderProductVariant AS [opv] WITH (NOLOCK) ON
(
[opv].MRefId = [s].CompanyRefId
AND [opv].Id = [s].OrderProductVariantId
)
INNER JOIN [Order] AS [o] WITH (NOLOCK) ON [o].Id = [opv].OrderId
INNER JOIN Generate_Dates ON GoalType = 'CommittedTransactionFee'
WHERE StartDateUtc <= [s].CreatedOnUtc
AND (EndDateUtc >= [s].CreatedOnUtc OR EndDateUtc IS NULL)
QUALIFY ROW_NUMBER() OVER(
PARTITION BY [bg].BusinessRefId
ORDER BY Generate_Dates.Start_Date DESC
) = 1
--GROUP BY Start_Date, End_Date, BusinessRefId, Amount, [s].Culture

Also, like the previous poster mentioned, you don't need the DISTINCT in there. If you're doing a GROUP BY without any aggregate functions, it's logically equivalent to doing a DISTINCT.

I haven't tested it, so not sure if it works...but something to try.

How to group by a column with duplicate values while pulling all associated columns in table with SQL?

So, not sure if you want to end up with a completely deduped table or not, but this window function + QUALIFY allow you to choose a record to keep for each dupe:

SELECT *
FROM F1_TABLE
QUALIFY row_number() OVER (PARTITION BY Group_Num ORDER BY Party_ID) = 1;

This example chooses the first Party_ID, but you can choose whatever you want (or let Snowflake choose for you by excluding the ORDER BY clause.

If this isn't what you are looking for, can you please provide what your resulting table would look like in your question?

How to count duplicates based on group by as well as see if values are present in the same field values?

You can count the number of different groups an id appears in using window functions and then aggregate. Here is an example counting singletons and pairs:

select group,
count(*) as num_elements,
count_if(num_groups = 1) as num_onesies,
count_if(num_groups = 2) as num_twosies
from (select t.*,
count(*) over (partition by id) as num_groups
from table1 t
) t
group by group;

EDIT:

If you have duplicates, it is probably best to remove them before doing the above processing:

select group,
count(*) as num_elements,
count_if(num_groups = 1) as num_onesies,
count_if(num_groups = 2) as num_twosies
from (select t.*,
count(*) over (partition by id) as num_groups
from (select distinct group, id
from table1 t
) t
) t
group by group

SQL: Find duplicates and for each duplicate group assign value of first duplicate of that group

Window function min can be used here:

select min(id) over (partition by first_name, last_name, company) id,
category
from t;

Find duplicate groups of rows in SQL Server

Build a XML string in a CTE that contains all constituents and use that string to figure out what materials is duplicate.

SQL Fiddle

MS SQL Server 2008 Schema Setup:

create table Materials
(
material_id int,
constituent_id int,
constituent_wt_pct decimal(10, 2)
);

insert into Materials values
(1, 1, 10.5),
(1, 2, 89.5),
(2, 1, 10.5),
(2, 5, 15.5),
(2, 7, 74),
(3, 1, 10.5),
(3, 2, 89.5);

Query 1:

with C as
(
select M1.material_id,
(
select M2.constituent_id as I,
M2.constituent_wt_pct as P
from Materials as M2
where M1.material_id = M2.material_id
order by M2.constituent_id,
M2.material_id
for xml path('')
) as constituents
from Materials as M1
group by M1.material_id
)
select row_number() over(order by 1/0) as ID,
stuff((
select ','+cast(C2.material_id as varchar(10))
from C as C2
where C1.constituents = C2.constituents
for xml path('')
), 1, 1, '') as MaterialIDs
from C as C1
group by C1.constituents
having count(*) > 1

Results:

| ID | MATERIALIDS |
--------------------
| 1 | 1,3 |

Duplicate rows that belong to multiple groups before group by

Consider using conditional aggregation:

select
avg(case when first_name like 'A%' then age end) avg_age_first_name_A
avg(case when last_name like 'A%' then age end) avg_age_last_name_A
from mytable
where first_name like 'A%' or last_name like 'A%'

Edit: on the other hand if you are just looking to generate two groups, one option is union all:

select t.*, 'group 1' grp from mytable t where first_name like 'A%'
union all select t.*, 'group 2' from mytable t where last_name like 'A%'


Related Topics



Leave a reply



Submit