SQL Query - Delete Duplicates If More Than 3 Dups

SQL Query - Delete duplicates if more than 3 dups?

with cte as (
select row_number() over (partition by dupcol1, dupcol2 order by ID) as rn
from table)
delete from cte
where rn > 2; -- or >3 etc

The query is manufacturing a 'row number' for each record, grouped by the (dupcol1, dupcol2) and ordered by ID. In effect this row number counts 'duplicates' that have the same dupcol1 and dupcol2 and assigns then the number 1, 2, 3.. N, order by ID. If you want to keep just 2 'duplicates', then you need to delete those that were assigned the numbers 3,4,.. N and that is the part taken care of by the DELLETE.. WHERE rn > 2;

Using this method you can change the ORDER BY to suit your preferred order (eg.ORDER BY ID DESC), so that the LATEST has rn=1, then the next to latest is rn=2 and so on. The rest stays the same, the DELETE will remove only the oldest ones as they have the highest row numbers.

Unlike this closely related question, as the condition becomes more complex, using CTEs and row_number() becomes simpler. Performance may be problematic still if no proper access index exists.

Delete duplicate records from a table only if the count is greater than 3

If I understand your question clearly, why not restrict the deletion to those rows with the desired values using subquery. Something like this:

DELETE FROM test5 x
WHERE x.ROWID >
ANY( select y.ROWID
FROM test5 y
WHERE X.AA = Y.AA
AND
X.BB = Y.BB
AND
X.CC = Y.CC)
AND (X.AA, X.BB, X.CC) IN (
SELECT AA, BB, CC FROM TEST5
GROUP BY AA, BB, CC
HAVING COUNT(AA) > 3);

This should delete duplicates of the first two tuples only.

Removing duplicate rows (based on values from multiple columns) from SQL table

Sample SQL FIDDLE

1) Use CTE to get max ship code value record based on ARDivisionNo, CustomerNo
for each Customers

WITH cte AS (
SELECT*,
row_number() OVER(PARTITION BY ARDivisionNo, CustomerNo ORDER BY ShipToCode desc) AS [rn]
FROM t
)
Select * from cte WHERE [rn] = 1

2) To Delete the record use Delete query instead of Select and change Where Clause to rn > 1. Sample SQL FIDDLE

WITH cte AS (
SELECT*,
row_number() OVER(PARTITION BY ARDivisionNo, CustomerNo ORDER BY ShipToCode desc) AS [rn]
FROM t
)
Delete from cte WHERE [rn] > 1;

select * from t;

How can I remove duplicate rows?

Assuming no nulls, you GROUP BY the unique columns, and SELECT the MIN (or MAX) RowId as the row to keep. Then, just delete everything that didn't have a row id:

DELETE FROM MyTable
LEFT OUTER JOIN (
SELECT MIN(RowId) as RowId, Col1, Col2, Col3
FROM MyTable
GROUP BY Col1, Col2, Col3
) as KeepRows ON
MyTable.RowId = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL

In case you have a GUID instead of an integer, you can replace

MIN(RowId)

with

CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))

How to delete duplicate records in SQL?

You can delete duplicates using i.e. ROW_NUMBER():

with duplicates as
(
select
*
,ROW_NUMBER() OVER (PARTITION BY FirstName, LastName, age ORDER BY FirstName) AS number
from yourTable
)
delete
from duplicates
where number > 1

Each row where number is bigger than 1 is a duplicate.

MySQL delete duplicate records but keep latest

Imagine your table test contains the following data:

  select id, email
from test;

ID EMAIL
---------------------- --------------------
1 aaa
2 bbb
3 ccc
4 bbb
5 ddd
6 eee
7 aaa
8 aaa
9 eee

So, we need to find all repeated emails and delete all of them, but the latest id.

In this case, aaa, bbb and eee are repeated, so we want to delete IDs 1, 7, 2 and 6.

To accomplish this, first we need to find all the repeated emails:

      select email 
from test
group by email
having count(*) > 1;

EMAIL
--------------------
aaa
bbb
eee

Then, from this dataset, we need to find the latest id for each one of these repeated emails:

  select max(id) as lastId, email
from test
where email in (
select email
from test
group by email
having count(*) > 1
)
group by email;

LASTID EMAIL
---------------------- --------------------
8 aaa
4 bbb
9 eee

Finally we can now delete all of these emails with an Id smaller than LASTID. So the solution is:

delete test
from test
inner join (
select max(id) as lastId, email
from test
where email in (
select email
from test
group by email
having count(*) > 1
)
group by email
) duplic on duplic.email = test.email
where test.id < duplic.lastId;

I don't have mySql installed on this machine right now, but should work

Update

The above delete works, but I found a more optimized version:

 delete test
from test
inner join (
select max(id) as lastId, email
from test
group by email
having count(*) > 1) duplic on duplic.email = test.email
where test.id < duplic.lastId;

You can see that it deletes the oldest duplicates, i.e. 1, 7, 2, 6:

select * from test;
+----+-------+
| id | email |
+----+-------+
| 3 | ccc |
| 4 | bbb |
| 5 | ddd |
| 8 | aaa |
| 9 | eee |
+----+-------+

Another version, is the delete provived by Rene Limon

delete from test
where id not in (
select max(id)
from test
group by email)

How to delete duplicate rows without unique identifier

I like @erwin-brandstetter 's solution, but wanted to show a solution with the USING keyword:

DELETE   FROM table_with_dups T1
USING table_with_dups T2
WHERE T1.ctid < T2.ctid -- delete the "older" ones
AND T1.name = T2.name -- list columns that define duplicates
AND T1.address = T2.address
AND T1.zipcode = T2.zipcode;

If you want to review the records before deleting them, then simply replace DELETE with SELECT * and USING with a comma ,, i.e.

SELECT * FROM table_with_dups T1
, table_with_dups T2
WHERE T1.ctid < T2.ctid -- select the "older" ones
AND T1.name = T2.name -- list columns that define duplicates
AND T1.address = T2.address
AND T1.zipcode = T2.zipcode;

Update: I tested some of the different solutions here for speed. If you don't expect many duplicates, then this solution performs much better than the ones that have a NOT IN (...) clause as those generate a lot of rows in the subquery.

If you rewrite the query to use IN (...) then it performs similarly to the solution presented here, but the SQL code becomes much less concise.

Update 2: If you have NULL values in one of the key columns (which you really shouldn't IMO), then you can use COALESCE() in the condition for that column, e.g.

  AND COALESCE(T1.col_with_nulls, '[NULL]') = COALESCE(T2.col_with_nulls, '[NULL]')

T-SQL: Deleting all duplicate rows but keeping one

You didn't say what version you were using, but in SQL 2005 and above, you can use a common table expression with the OVER Clause. It goes a little something like this:

WITH cte AS (
SELECT[foo], [bar],
row_number() OVER(PARTITION BY foo, bar ORDER BY baz) AS [rn]
FROM TABLE
)
DELETE cte WHERE [rn] > 1

Play around with it and see what you get.

(Edit: In an attempt to be helpful, someone edited the ORDER BY clause within the CTE. To be clear, you can order by anything you want here, it needn't be one of the columns returned by the cte. In fact, a common use-case here is that "foo, bar" are the group identifier and "baz" is some sort of time stamp. In order to keep the latest, you'd do ORDER BY baz desc)

Eliminating duplicate values based on only one column of the table

This is where the window function row_number() comes in handy:

SELECT s.siteName, s.siteIP, h.date
FROM sites s INNER JOIN
(select h.*, row_number() over (partition by siteName order by date desc) as seqnum
from history h
) h
ON s.siteName = h.siteName and seqnum = 1
ORDER BY s.siteName, h.date


Related Topics



Leave a reply



Submit