How to Delete Duplicate Entries

How to delete duplicate entries?

For example you could:

CREATE TABLE tmp ...
INSERT INTO tmp SELECT DISTINCT * FROM t;
DROP TABLE t;
ALTER TABLE tmp RENAME TO t;

How to delete duplicate rows in SQL ( Clickhouse)?

First of all, the answer depends on the table engine you used.
The most common on ClickHouse is the MergeTree family.

If you use any MergeTree family tables, MaterializedView or Buffer engines, you can use an OPTIMIZE query:

OPTIMIZE TABLE table DEDUPLICATE BY name -- you can put any expression here

https://clickhouse.com/docs/en/sql-reference/statements/optimize/

Before you consider the above query as the answer, you must understand why and why it's not the right way to do it.

In Clickhouse it's normal to have multiple lines for the same primary key, un-like most DB engine, there is no check at all when inserting a line. This allow very fast insertion in tables.

The name "MergeTree" is not here for nothing, in fact the tables are "OPTIMIZED" automatically when Clickhouse thinks its necessary or/and if it have the time for.

What means OPTIMIZE in ClickHouse ?
This operation just force the table to merge it's data. Depending on how you build your table. ClickHouse will look for duplicated line, based on your settings and apply the function you asked for.

Two example :

  • ReplacingMergeTree, here the optional parameter is set to datetime, and give the hint to ClickHouse which line is the most recent. Then on duplicates, the most recent is kept over the others.
create table radios
(
id UInt64,
datetime DateTime,
name Nullable(String) default NULL
)
engine = ReplicatedReplacingMergeTree(datetime)
ORDER BY id -- it's the primary key
-- example
INSERT INTO radios VALUES (1, now(), 'Some name'), (1, now(), 'New name')
-- after merging:
id, datetime, name
1, '2022-04-04 15:15:00', 'New name'
  • AggregatingMergeTree, here a function is applied the compute the final line. This is what you will find the closest to a UPDATE statement.
create table radio_data
(
datetime DateTime,
id UInt64,
power SimpleAggregateFunction(anyLast, Nullable(Float64)) default NULL,
access SimpleAggregateFunction(sum, Nullable(UInt64)) default NULL
)
engine = ReplicatedAggregatingMergeTree()
ORDER BY (id, datetime) -- the primary key

-- example
INSERT INTO radio_data VALUES ('2022-04-04 15:15:00', 1, NULL, 1), ('2022-04-04 15:15:00', 1, 12, 2)
-- will give after merging :
datetime , id, power, access
2022-04-04 15:15:00, 1, 12, 3

The table you choose, the functions you choose, must be really close to what you finally want to do with you data. Do you replace all the line on update ? Then ReplacingMergeTree is the best, do you update partially a line and apply some function on it ? Then AggregatingMergeTree is the best... ect.

This said, you will have some cases where you need to have your data "fresh" and not duplicated.
When your table is well configured, a simple OPTIMIZE TABLE ... is enough. BUT this is expensive, and must be done smartly if you don't want to ruins your server performance.
You can also merge the data on the fly, but again, this is expensive and must be done a small subset of data, otherwise it's better to do an OPTIMIZE.

SELECT * FROM radio_data FINAL WHERE id = 1

For instance, we do an OPTIMIZE on all the un-merged partition that are "in the past", for example on the previous day. The goal is to do it the least as possible OPTOIMIZE operation.

My last words will be on the usage of ALTER TABLE statement. It allows DELETE and UPDATE. But they are mutations (https://clickhouse.com/docs/en/sql-reference/statements/alter/#mutations) and are not synchronous ! Don't rely on them if you require fresh data.

You can find more material here :

https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#mergetree
https://clickhouse.com/docs/en/sql-reference/statements/optimize/
https://clickhouse.com/docs/en/sql-reference/statements/alter/

Delete duplicate rows from a BigQuery table

You can remove duplicates by running a query that rewrites your table (you can use the same table as the destination, or you can create a new table, verify that it has what you want, and then copy it over the old table).

A query that should work is here:

SELECT *
FROM (
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY Fixed_Accident_Index)
row_number
FROM Accidents.CleanedFilledCombined
)
WHERE row_number = 1

How can I remove duplicate rows?

Assuming no nulls, you GROUP BY the unique columns, and SELECT the MIN (or MAX) RowId as the row to keep. Then, just delete everything that didn't have a row id:

DELETE FROM MyTable
LEFT OUTER JOIN (
SELECT MIN(RowId) as RowId, Col1, Col2, Col3
FROM MyTable
GROUP BY Col1, Col2, Col3
) as KeepRows ON
MyTable.RowId = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL

In case you have a GUID instead of an integer, you can replace

MIN(RowId)

with

CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))

Delete Duplicate rows in several Postgresql tables

As you want to automate your deduplication of all table, you need to use plpgsql function where you can write dynamic queries to achieve it.

Try This function:

create or replace function func_dedup(_schemaname varchar) returns void as
$$
declare
_rec record;
begin

for _rec in select table_name from information_schema. tables where table_schema=_schemaname

loop
execute format('CREATE TEMP TABLE tab_temp as select DISTINCT * from '||_rec.table_name);
execute format('truncate '||_rec.table_name);
execute format('insert into '||_rec.table_name||' select * from tab_temp');
execute format('drop table tab_temp');
end loop;

end;
$$
language plpgsql

Now call your function like below:

select * from func_dedup('your_schema'); --

demo

Steps:

  1. Get the list of all tables in your schema by using below query and loop it for each table.
select table_name from information_schema. tables where table_schema=_schemaname

  1. Insert all distinct records in a TEMP TABLE.
  2. Truncate your main table.
  3. Insert all your data from TEMP TABLE to main table.
  4. Drop the TEMP TABLE. (here dropping temp table is important we have to reuse it for next loop cycle.)

Note - if your tables are very large in size the consider using Regular Table instead of TEMP TABLE.

Which way is faster to delete duplicate rows in sql?

Not having a primary key for your table is a general bad idea. Here is one way you can delete duplicates, with the record retained per 23 columns is arbitrary:

WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY col1, col2, col3, ..., col22, col23
ORDER BY (SELECT NULL)) rn
FROM yourTable
)

DELETE
FROM cte
WHERE rn > 1;


Related Topics



Leave a reply



Submit