Delete Duplicate Rows from Small Table

DELETE FROM dupes a
WHERE a.ctid <> (SELECT min(b.ctid)
FROM dupes b
WHERE a.key = b.key);

How to delete duplicate rows without unique identifier

I like @erwin-brandstetter's solution, but wanted to show a solution with the USING keyword:

DELETE FROM table_with_dups T1
USING table_with_dups T2
WHERE T1.ctid < T2.ctid -- delete the "older" ones
AND T1.name = T2.name -- list columns that define duplicates
AND T1.address = T2.address
AND T1.zipcode = T2.zipcode;

If you want to review the records before deleting them, simply replace DELETE with SELECT * and USING with a comma (,), i.e.:

SELECT * FROM table_with_dups T1
, table_with_dups T2
WHERE T1.ctid < T2.ctid -- select the "older" ones
AND T1.name = T2.name -- list columns that define duplicates
AND T1.address = T2.address
AND T1.zipcode = T2.zipcode;
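
The preview-then-delete workflow can be tried without a PostgreSQL server. Below is a minimal sketch using Python's built-in sqlite3 module. Two things are assumptions and not part of the answer above: SQLite's rowid stands in for ctid (they are different mechanisms), and an EXISTS subquery stands in for DELETE ... USING, which SQLite does not support. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_with_dups (name TEXT, address TEXT, zipcode TEXT)")
conn.executemany(
    "INSERT INTO table_with_dups VALUES (?, ?, ?)",
    [
        ("Ann", "1 Main St", "10001"),
        ("Bob", "2 Oak Ave", "20002"),
        ("Ann", "1 Main St", "10001"),  # duplicate of the first row
    ],
)

# Preview: the same self-join, listing rows that have a "newer" twin.
# rowid is used here as a stand-in for PostgreSQL's ctid.
preview = conn.execute(
    """
    SELECT T1.rowid, T1.name
    FROM table_with_dups T1, table_with_dups T2
    WHERE T1.rowid < T2.rowid
      AND T1.name = T2.name
      AND T1.address = T2.address
      AND T1.zipcode = T2.zipcode
    """
).fetchall()
print(preview)

# Delete: EXISTS replaces DELETE ... USING, which SQLite lacks.
conn.execute(
    """
    DELETE FROM table_with_dups
    WHERE EXISTS (
        SELECT 1
        FROM table_with_dups T2
        WHERE table_with_dups.rowid < T2.rowid
          AND table_with_dups.name = T2.name
          AND table_with_dups.address = T2.address
          AND table_with_dups.zipcode = T2.zipcode
    )
    """
)
remaining = conn.execute(
    "SELECT rowid, name FROM table_with_dups ORDER BY rowid"
).fetchall()
print(remaining)
```

The preview lists exactly the rows that the delete then removes, so the row with the highest rowid in each duplicate group is the one that survives.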

Update: I tested some of the different solutions here for speed. If you don't expect many duplicates, then this solution performs much better than the ones that have a NOT IN (...) clause as those generate a lot of rows in the subquery.

If you rewrite the query to use IN (...) then it performs similarly to the solution presented here, but the SQL code becomes much less concise.

Update 2: If you have NULL values in one of the key columns (which you really shouldn't IMO), then you can use COALESCE() in the condition for that column, e.g.

  AND COALESCE(T1.col_with_nulls, '[NULL]') = COALESCE(T2.col_with_nulls, '[NULL]')
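
To see why the COALESCE() is needed, here is a small sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption; both follow the same SQL rule that NULL = NULL is not true). The sentinel '[NULL]' is taken from the condition above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NULL = NULL evaluates to NULL (unknown), so a join on this column never matches.
plain = conn.execute("SELECT NULL = NULL").fetchone()[0]

# With a sentinel, both sides become the same non-NULL string and compare equal.
with_sentinel = conn.execute(
    "SELECT COALESCE(NULL, '[NULL]') = COALESCE(NULL, '[NULL]')"
).fetchone()[0]

print(plain, with_sentinel)
```

The sentinel must be a value that cannot occur as real data in that column, otherwise genuine values would be treated as duplicates of NULLs. In PostgreSQL, IS NOT DISTINCT FROM is another way to compare NULLs as equal.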

Removing duplicate rows from table in Oracle

Use the rowid pseudocolumn.

DELETE FROM your_table
WHERE rowid not in
(SELECT MIN(rowid)
FROM your_table
GROUP BY column1, column2, column3);

Here column1, column2, and column3 make up the identifying key for each record. You might need to list all your columns.
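
SQLite happens to expose a rowid as well, so the same statement can be tried locally. This is only a sketch with Python's built-in sqlite3 module and made-up rows; SQLite's rowid is an integer key and not the same thing as Oracle's rowid pseudocolumn, so treat it as an illustration of the GROUP BY / MIN pattern only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (column1 TEXT, column2 TEXT, column3 TEXT)")
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?, ?)",
    [
        ("a", "b", "c"),
        ("a", "b", "c"),  # duplicate
        ("x", "y", "z"),
        ("a", "b", "c"),  # duplicate
    ],
)

# Keep the lowest rowid of each (column1, column2, column3) group, delete the rest.
conn.execute(
    """
    DELETE FROM your_table
    WHERE rowid NOT IN (
        SELECT MIN(rowid)
        FROM your_table
        GROUP BY column1, column2, column3
    )
    """
)
kept = conn.execute("SELECT rowid, column1 FROM your_table ORDER BY rowid").fetchall()
print(kept)
```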

SQL Delete duplicate records based on two columns

In PostgreSQL you can use distinct on to select one row per (car, shop):

select distinct on (car, shop) t.*
from t
order by car, shop, day;

If you want to actually delete records, the following removes the row(s) with the earliest day for each (car, shop) pair:

delete from t
where t.day = (select min(t2.day)
from t t2
where t2.car = t.car and t2.shop = t.shop
);

SQL Delete duplicate rows with lowest number

Query to remove duplicates in SQL Server:

;with c as
(
select *, row_number() over(partition by [Key] order by Number desc) as n
from YouTable
)
delete from c
where n > 1
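
The row_number() idea can be tried outside SQL Server too. Below is a rough sketch with Python's built-in sqlite3 module (window functions need SQLite 3.25 or later). SQLite cannot delete through a CTE, so this variant selects the rowids to remove in a subquery; that substitution, and the sample data, are assumptions and not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YouTable ([Key] TEXT, Number INTEGER)")
conn.executemany(
    "INSERT INTO YouTable VALUES (?, ?)",
    [("A", 1), ("A", 5), ("A", 3), ("B", 2)],
)

# Number the rows of each Key from highest Number (n = 1) downwards,
# then delete every row that is not first in its group.
conn.execute(
    """
    DELETE FROM YouTable
    WHERE rowid IN (
        SELECT rid
        FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (PARTITION BY [Key] ORDER BY Number DESC) AS n
            FROM YouTable
        )
        WHERE n > 1
    )
    """
)
survivors = conn.execute("SELECT [Key], Number FROM YouTable ORDER BY [Key]").fetchall()
print(survivors)
```

Only the row with the highest Number per Key is left, which matches deleting the duplicates with the lowest numbers.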

How to delete duplicate rows from table if they are written in a minute period?

The four columns which define the duplicates are: TableName, EmployerCode, UserInfo, OperationName. Something like this:

delete t
from tTest t
where exists (select 1
from tTest t2
where t2.EmployerCode = t.EmployerCode and
t2.TableName = t.TableName and
t2.UserInfo=t.UserInfo and
t2.OperationName=t.OperationName and
t2.CreatedDate >= dateadd(minute, -1, t.CreatedDate) and
t2.CreatedDate < t.CreatedDate);
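
To check the one-minute window logic on a few rows, here is a sketch using Python's built-in sqlite3 module. It is an adaptation, not the T-SQL above: SQLite's datetime(x, '-1 minute') replaces dateadd(minute, -1, x), timestamps are stored as text in 'YYYY-MM-DD HH:MM:SS' form so they compare correctly as strings, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tTest (TableName TEXT, EmployerCode TEXT, UserInfo TEXT, "
    "OperationName TEXT, CreatedDate TEXT)"
)
conn.executemany(
    "INSERT INTO tTest VALUES (?, ?, ?, ?, ?)",
    [
        ("users", "E1", "alice", "INSERT", "2022-04-04 15:15:00"),
        ("users", "E1", "alice", "INSERT", "2022-04-04 15:15:30"),  # 30 s later: duplicate
        ("users", "E1", "alice", "INSERT", "2022-04-04 15:20:00"),  # 5 min later: kept
        ("users", "E1", "alice", "UPDATE", "2022-04-04 15:15:10"),  # other operation: kept
    ],
)

# Delete a row if an identical row exists in the minute before it.
conn.execute(
    """
    DELETE FROM tTest
    WHERE EXISTS (
        SELECT 1
        FROM tTest t2
        WHERE t2.EmployerCode = tTest.EmployerCode
          AND t2.TableName = tTest.TableName
          AND t2.UserInfo = tTest.UserInfo
          AND t2.OperationName = tTest.OperationName
          AND t2.CreatedDate >= datetime(tTest.CreatedDate, '-1 minute')
          AND t2.CreatedDate < tTest.CreatedDate
    )
    """
)
kept_rows = conn.execute(
    "SELECT OperationName, CreatedDate FROM tTest ORDER BY CreatedDate"
).fetchall()
print(kept_rows)
```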

How to delete duplicate rows in SQL (ClickHouse)?

First of all, the answer depends on the table engine you used.
The most common in ClickHouse is the MergeTree family.

If you use any MergeTree family tables, MaterializedView or Buffer engines, you can use an OPTIMIZE query:

OPTIMIZE TABLE table DEDUPLICATE BY name -- you can put any expression here

https://clickhouse.com/docs/en/sql-reference/statements/optimize/

Before you consider the above query as the answer, you must understand why it works and why it's not the right way to do it.

In ClickHouse it's normal to have multiple rows for the same primary key. Unlike most DB engines, there is no check at all when inserting a row, which allows very fast insertion into tables.

The name "MergeTree" is no accident: the tables are "optimized" (merged) automatically when ClickHouse thinks it's necessary and/or when it has the time.

What does OPTIMIZE mean in ClickHouse?
This operation simply forces the table to merge its data. Depending on how you built your table, ClickHouse will look for duplicated rows based on your settings and apply the function you asked for.

Two examples:

  • ReplacingMergeTree: here the optional parameter is set to datetime and tells ClickHouse which row is the most recent. On duplicates, the most recent row is kept over the others.
create table radios
(
id UInt64,
datetime DateTime,
name Nullable(String) default NULL
)
engine = ReplicatedReplacingMergeTree(datetime)
ORDER BY id -- it's the primary key
-- example
INSERT INTO radios VALUES (1, now(), 'Some name'), (1, now(), 'New name')
-- after merging:
id, datetime, name
1, '2022-04-04 15:15:00', 'New name'
  • AggregatingMergeTree: here a function is applied to compute the final row. This is the closest thing you will find to an UPDATE statement.
create table radio_data
(
datetime DateTime,
id UInt64,
power SimpleAggregateFunction(anyLast, Nullable(Float64)) default NULL,
access SimpleAggregateFunction(sum, Nullable(UInt64)) default NULL
)
engine = ReplicatedAggregatingMergeTree()
ORDER BY (id, datetime) -- the primary key

-- example
INSERT INTO radio_data VALUES ('2022-04-04 15:15:00', 1, NULL, 1), ('2022-04-04 15:15:00', 1, 12, 2)
-- will give after merging :
datetime , id, power, access
2022-04-04 15:15:00, 1, 12, 3
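
To make the two merge behaviours concrete without a ClickHouse server, here is a rough pure-Python simulation of the outcome for rows that share a sorting key. It only illustrates the results described above and is not ClickHouse's actual merge algorithm; all function names (replacing_merge, aggregating_merge, any_last, null_sum) are hypothetical.

```python
def replacing_merge(rows, key, version):
    """Keep, per key, the row with the highest version (a later row wins ties)."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[version] >= latest[k][version]:
            latest[k] = row
    return list(latest.values())


def aggregating_merge(rows, key_cols, aggregates):
    """Fold rows that share the same key columns, one function per value column."""
    merged = {}
    for row in rows:
        k = tuple(row[c] for c in key_cols)
        if k not in merged:
            merged[k] = dict(row)
        else:
            for col, fn in aggregates.items():
                merged[k][col] = fn(merged[k][col], row[col])
    return list(merged.values())


def any_last(old, new):
    # Imitates anyLast on a Nullable column: NULL (None) inputs are skipped.
    return new if new is not None else old


def null_sum(old, new):
    # Imitates sum on a Nullable column: NULL (None) inputs are skipped.
    if old is None:
        return new
    if new is None:
        return old
    return old + new


ts = "2022-04-04 15:15:00"

radios = replacing_merge(
    [
        {"id": 1, "datetime": ts, "name": "Some name"},
        {"id": 1, "datetime": ts, "name": "New name"},
    ],
    key="id",
    version="datetime",
)

radio_data = aggregating_merge(
    [
        {"datetime": ts, "id": 1, "power": None, "access": 1},
        {"datetime": ts, "id": 1, "power": 12, "access": 2},
    ],
    key_cols=("id", "datetime"),
    aggregates={"power": any_last, "access": null_sum},
)

print(radios)
print(radio_data)
```

Until a merge actually happens, both input rows are still stored in the table, which is why duplicates can remain visible for a while after an insert.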

The table engine and the functions you choose must closely match what you ultimately want to do with your data. Do you replace the whole row on update? Then ReplacingMergeTree is the best choice. Do you partially update a row and apply some function to it? Then AggregatingMergeTree is the best choice, etc.

That said, there will be cases where you need your data "fresh" and not duplicated.
When your table is well configured, a simple OPTIMIZE TABLE ... is enough. But this is expensive and must be done carefully if you don't want to ruin your server's performance.
You can also merge the data on the fly with FINAL, but again, this is expensive and should only be done on a small subset of data; otherwise it's better to do an OPTIMIZE.

SELECT * FROM radio_data FINAL WHERE id = 1

For instance, we run an OPTIMIZE on all the unmerged partitions that are "in the past", for example those of the previous day. The goal is to run as few OPTIMIZE operations as possible.

My last words are about the ALTER TABLE statement. It allows DELETE and UPDATE, but these are mutations (https://clickhouse.com/docs/en/sql-reference/statements/alter/#mutations) and are not synchronous! Don't rely on them if you require fresh data.

You can find more material here:

https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#mergetree
https://clickhouse.com/docs/en/sql-reference/statements/optimize/
https://clickhouse.com/docs/en/sql-reference/statements/alter/

How can I remove duplicate rows?

Assuming no nulls, you GROUP BY the unique columns and SELECT the MIN (or MAX) RowId as the row to keep. Then just delete every row whose RowId is not among the rows to keep:

DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
SELECT MIN(RowId) as RowId, Col1, Col2, Col3
FROM MyTable
GROUP BY Col1, Col2, Col3
) as KeepRows ON
MyTable.RowId = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL

In case you have a GUID instead of an integer, you can replace

MIN(RowId)

with

CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))

