SQL to find duplicate entries (within a group)
You can get the answer with a join instead of a subquery
select
a.*
from
event as a
inner join
(select groupid
from event
group by groupid
having count(*) <> 5) as b
on a.groupid = b.groupid
This is a fairly common way of obtaining the all the information out of the rows in a group.
Like your suggested answer and the other responses, this will run a lot faster with an index on groupid. It's up to the DBA to balance the benefit of making your query run a lot faster against the cost of maintaining yet another index.
If the DBA decides against the index, make sure the appropriate people understand that its the index strategy and not the way you wrote the query that is slowing things down.
Finding Duplicates: GROUP BY and DISTINCT giving different answers
Your counting logic is off, and mine was too, until I came up with a simple example to better understand your question. Imagine a simple table with only one column, text
:
text
----
A
B
B
C
C
C
Running SELECT COUNT(*)
just yields 6 records, as expected. SELECT DISTINCT text
returns 3 records, for A,B,C
. Finally, SELECT text
with HAVING COUNT(*) > 1
returns only two records, for the B
and C
groups.
None of these numbers add up at all. The issue here is that a distinct select also returns records which are not duplicate, in addition to records which are duplicate. Also, a given duplicate record could occur more than two times. Your current comparison is somewhat apples to oranges.
Edit:
If you want to remove all duplicates in your six-column table, leaving only one distinct record from all columns, then try using a deletable CTE:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ScanNumber, DB_ID, PluginID,
PluginID_Version, Result, ActualValue
ORDER BY (SELECT NULL)) rn
FROM DBAScanResults
)
DELETE
FROM cte
WHERE rn > 1;
Finding duplicate values in a SQL table
SELECT
name, email, COUNT(*)
FROM
users
GROUP BY
name, email
HAVING
COUNT(*) > 1
Simply group on both of the columns.
Note: the older ANSI standard is to have all non-aggregated columns in the GROUP BY but this has changed with the idea of "functional dependency":
In relational database theory, a functional dependency is a constraint between two sets of attributes in a relation from a database. In other words, functional dependency is a constraint that describes the relationship between attributes in a relation.
Support is not consistent:
- Recent PostgreSQL supports it.
- SQL Server (as at SQL Server 2017) still requires all non-aggregated columns in the GROUP BY.
- MySQL is unpredictable and you need
sql_mode=only_full_group_by
:- GROUP BY lname ORDER BY showing wrong results;
- Which is the least expensive aggregate function in the absence of ANY() (see comments in accepted answer).
- Oracle isn't mainstream enough (warning: humour, I don't know about Oracle).
Duplicate rows showing while using group by clause?
It depends on the output you want to produce.
If you want one row per business, but also want to return the additional columns (Start Period Date, End Period Date, brand, Committed Transaction Fee
), you probably won't be able to use a GROUP BY
since you'll need to include those columns in your GROUP BY
clause.
You can use a QUALIFY
function to do this and still return the columns, but you have to decide how to "pick" each row to be returned. For example:
QUALIFY ROW_NUMBER() OVER(
PARTITION BY [bg].BusinessRefId
ORDER BY Generate_Dates.Start_Date DESC
) = 1
This does the following:
- Split the result set rows into groups based on
[bg].BusinessRefId
- Order the rows within those groups based on
Generate_Dates.Start_Date
- Return the first row of each group
You can replace your GROUP BY
clause with the QUALIFY
.
WITH Generate_Dates(Start_Date, End_Date) AS (
SELECT
CAST(StartDateUtc AS Date),
CAST(
CASE
WHEN [bg].EndDateUtc IS NULL THEN GETDATE()
ELSE CAST([bg].EndDateUtc AS Date)
END AS DATE
)
FROM BusinessGoal AS [bg] WITH (NOLOCK)
WHERE [bg].Period = 'Year'
UNION ALL
SELECT
DATEADD(yy, 1, Start_Date),
End_Date
FROM Generate_Dates
WHERE DATEADD(yy, 1, Start_Date) < End_Date
)
SELECT DISTINCT
Generate_Dates.Start_Date AS 'Start Period Date',
Generate_Dates.End_Date AS 'End Period Date',
[bg].BusinessRefId AS 'brand',
CAST ([bg].Amount AS DECIMAL) AS 'Committed Transaction Fee',
(
SELECT
TOP 1 CAST (
SUM (
CASE
WHEN [s].Credit = 0 THEN - [s].SettlementAmount
ELSE [s].SettlementAmount
END
) AS DECIMAL
)
) AS 'Actual Transaction Fee',
(
SELECT
TOP 1 (
CASE [s].Culture
WHEN 'en-US' THEN 'USD'
WHEN 'en-CA' THEN 'CAD'
WHEN 'en-IE' THEN 'EUR'
WHEN 'en-AU' THEN 'AUD'
WHEN 'en-NZ' THEN 'NZD'
WHEN 'en-Gbg' THEN 'GbgP'
WHEN 'es-MX' THEN 'MXN'
WHEN 'jp-JP' THEN 'JPY'
WHEN 'nbg-NO' THEN 'NOK'
END
)
FROM Settlement AS [s] WITH (NOLOCK)
) AS 'Currency'
FROM BusinessGoal AS [bg] WITH (NOLOCK)
INNER JOIN Settlement AS [s] WITH (NOLOCK) ON [bg].BusinessRefId = [s].CompanyRefId
INNER JOIN OrderProductVariant AS [opv] WITH (NOLOCK) ON
(
[opv].MRefId = [s].CompanyRefId
AND [opv].Id = [s].OrderProductVariantId
)
INNER JOIN [Order] AS [o] WITH (NOLOCK) ON [o].Id = [opv].OrderId
INNER JOIN Generate_Dates ON GoalType = 'CommittedTransactionFee'
WHERE StartDateUtc <= [s].CreatedOnUtc
AND (EndDateUtc >= [s].CreatedOnUtc OR EndDateUtc IS NULL)
QUALIFY ROW_NUMBER() OVER(
PARTITION BY [bg].BusinessRefId
ORDER BY Generate_Dates.Start_Date DESC
) = 1
--GROUP BY Start_Date, End_Date, BusinessRefId, Amount, [s].Culture
Also, like the previous poster mentioned, you don't need the DISTINCT
in there. If you're doing a GROUP BY
without any aggregate functions, it's logically equivalent to doing a DISTINCT
.
I haven't tested it, so not sure if it works...but something to try.
How to group by a column with duplicate values while pulling all associated columns in table with SQL?
So, not sure if you want to end up with a completely deduped table or not, but this window function + QUALIFY
allow you to choose a record to keep for each dupe:
SELECT *
FROM F1_TABLE
QUALIFY row_number() OVER (PARTITION BY Group_Num ORDER BY Party_ID) = 1;
This example chooses the first Party_ID, but you can choose whatever you want (or let Snowflake choose for you by excluding the ORDER BY
clause.
If this isn't what you are looking for, can you please provide what your resulting table would look like in your question?
How to count duplicates based on group by as well as see if values are present in the same field values?
You can count the number of different groups an id appears in using window functions and then aggregate. Here is an example counting singletons and pairs:
select group,
count(*) as num_elements,
count_if(num_groups = 1) as num_onesies,
count_if(num_groups = 2) as num_twosies
from (select t.*,
count(*) over (partition by id) as num_groups
from table1 t
) t
group by group;
EDIT:
If you have duplicates, it is probably best to remove them before doing the above processing:
select group,
count(*) as num_elements,
count_if(num_groups = 1) as num_onesies,
count_if(num_groups = 2) as num_twosies
from (select t.*,
count(*) over (partition by id) as num_groups
from (select distinct group, id
from table1 t
) t
) t
group by group
SQL: Find duplicates and for each duplicate group assign value of first duplicate of that group
Window function min
can be used here:
select min(id) over (partition by first_name, last_name, company) id,
category
from t;
Find duplicate groups of rows in SQL Server
Build a XML string in a CTE that contains all constituents and use that string to figure out what materials is duplicate.
SQL Fiddle
MS SQL Server 2008 Schema Setup:
create table Materials
(
material_id int,
constituent_id int,
constituent_wt_pct decimal(10, 2)
);
insert into Materials values
(1, 1, 10.5),
(1, 2, 89.5),
(2, 1, 10.5),
(2, 5, 15.5),
(2, 7, 74),
(3, 1, 10.5),
(3, 2, 89.5);
Query 1:
with C as
(
select M1.material_id,
(
select M2.constituent_id as I,
M2.constituent_wt_pct as P
from Materials as M2
where M1.material_id = M2.material_id
order by M2.constituent_id,
M2.material_id
for xml path('')
) as constituents
from Materials as M1
group by M1.material_id
)
select row_number() over(order by 1/0) as ID,
stuff((
select ','+cast(C2.material_id as varchar(10))
from C as C2
where C1.constituents = C2.constituents
for xml path('')
), 1, 1, '') as MaterialIDs
from C as C1
group by C1.constituents
having count(*) > 1
Results:
| ID | MATERIALIDS |
--------------------
| 1 | 1,3 |
Duplicate rows that belong to multiple groups before group by
Consider using conditional aggregation:
select
avg(case when first_name like 'A%' then age end) avg_age_first_name_A
avg(case when last_name like 'A%' then age end) avg_age_last_name_A
from mytable
where first_name like 'A%' or last_name like 'A%'
Edit: on the other hand if you are just looking to generate two groups, one option is union all
:
select t.*, 'group 1' grp from mytable t where first_name like 'A%'
union all select t.*, 'group 2' from mytable t where last_name like 'A%'
Related Topics
Redshift Split Single Dynamic Column into Multiple Rows in New Table
Mssql Dynamic Pivot Column Values to Column Header
Timezone Date Format in Oracle
How to Implement Logging and Error Reporting in SQL Stored Procedures
How to Get the Nth Row in a SQL Server Table
SQL Server Audit Logout Creates Huge Number of Reads
SQL Query Where Date = Today Minus 7 Days
What Timezone Does MySQL's Now() Follow
How to Get the Date and Time from Timestamp in Postgresql Select Query
Least Value But Not Null in Oracle SQL
Is It Better to Do an Equi Join in the from Clause or Where Clause
Extract Characters to the Right of a Delimited Value in a Select Statement