Deleting Duplicate Records

delete duplicate records with in

If you want to delete older duplicate values, you can use:

delete from foo
where foo.id < (select max(foo2.id)
from foo foo2
where foo2.a = foo.a and foo2.b = foo.b
);

Note that an index on (a, b, id) would help performance.
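
As a rough check of the correlated-subquery delete above, here is a minimal sketch that runs the same statement against an in-memory SQLite database from Python. The sample rows and the index name foo_a_b_id are made up for illustration, and SQLite is only a stand-in for whatever engine you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (id INTEGER PRIMARY KEY, a TEXT, b TEXT);
    -- Hypothetical index name; covers the (a, b, id) lookup done by the subquery.
    CREATE INDEX foo_a_b_id ON foo (a, b, id);
    INSERT INTO foo (id, a, b) VALUES
        (1, 'x', 'y'), (2, 'x', 'y'), (3, 'x', 'y'),
        (4, 'x', 'z'), (5, 'p', 'q'), (6, 'p', 'q');
""")

# Delete every row that has a newer (higher id) row with the same (a, b).
conn.execute("""
    DELETE FROM foo
    WHERE foo.id < (SELECT max(foo2.id)
                    FROM foo AS foo2
                    WHERE foo2.a = foo.a AND foo2.b = foo.b)
""")

remaining = conn.execute("SELECT id, a, b FROM foo ORDER BY id").fetchall()
print(remaining)
```

Only the row with the highest id in each (a, b) group should survive; the row (4, 'x', 'z') has no duplicate and is left alone.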

You can also phrase this as a join:

delete from foo
using (select a, b, max(id) as max_id
from foo
group by a, b
) ab
where foo.a = ab.a and foo.b = ab.b and foo.id < ab.max_id;

How to delete duplicate rows in SQL (ClickHouse)?

First of all, the answer depends on the table engine you use.
The most common engines in ClickHouse are those of the MergeTree family.

If you use any MergeTree family tables, MaterializedView or Buffer engines, you can use an OPTIMIZE query:

OPTIMIZE TABLE table DEDUPLICATE BY name -- you can put any expression here

https://clickhouse.com/docs/en/sql-reference/statements/optimize/

Before you take the above query as the answer, you must understand why it works and why it is not necessarily the right way to do it.

In ClickHouse it is normal to have multiple rows for the same primary key. Unlike most database engines, ClickHouse performs no uniqueness check when a row is inserted, which allows very fast insertion into tables.

The name "MergeTree" is no accident: the tables are "optimized" (merged) automatically whenever ClickHouse decides it is necessary or has the time to do it.

What does OPTIMIZE mean in ClickHouse?
This operation simply forces the table to merge its data. Depending on how you built your table, ClickHouse looks for duplicate rows according to your settings and applies the function you asked for.

Two examples:

  • ReplacingMergeTree: here the optional parameter is set to datetime, which tells ClickHouse which row is the most recent. On duplicates, the most recent row is kept and the others are dropped.
create table radios
(
id UInt64,
datetime DateTime,
name Nullable(String) default NULL
)
engine = ReplicatedReplacingMergeTree(datetime)
ORDER BY id -- it's the primary key
-- example
INSERT INTO radios VALUES (1, now(), 'Some name'), (1, now(), 'New name')
-- after merging:
id, datetime, name
1, '2022-04-04 15:15:00', 'New name'
  • AggregatingMergeTree: here a function is applied to compute the final row. This is the closest thing you will find to an UPDATE statement.
create table radio_data
(
datetime DateTime,
id UInt64,
power SimpleAggregateFunction(anyLast, Nullable(Float64)) default NULL,
access SimpleAggregateFunction(sum, Nullable(UInt64)) default NULL
)
engine = ReplicatedAggregatingMergeTree()
ORDER BY (id, datetime) -- the primary key

-- example
INSERT INTO radio_data VALUES ('2022-04-04 15:15:00', 1, NULL, 1), ('2022-04-04 15:15:00', 1, 12, 2)
-- will give after merging :
datetime , id, power, access
2022-04-04 15:15:00, 1, 12, 3
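
To make the two merge behaviours above more concrete, here is a small pure-Python model of what a merge does to rows that share a sorting key. It is only an illustration under simplifying assumptions, not how ClickHouse implements merges: the function names replacing_merge, aggregating_merge, any_last and null_sum are invented for this sketch, and real merges happen in the background at times you do not control.

```python
def replacing_merge(rows, key, version):
    """Keep, per key, the row with the highest version (ties: last inserted wins)."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[version] >= latest[k][version]:
            latest[k] = row
    return list(latest.values())


def any_last(old, new):
    """Model of anyLast on a Nullable column: NULL (None) values are skipped."""
    return old if new is None else new


def null_sum(old, new):
    """Model of sum on a Nullable column: NULL (None) values are skipped."""
    if old is None:
        return new
    if new is None:
        return old
    return old + new


def aggregating_merge(rows, key_cols, aggregates):
    """Collapse rows with the same key, combining each aggregated column."""
    merged = {}
    for row in rows:
        k = tuple(row[c] for c in key_cols)
        if k not in merged:
            merged[k] = dict(row)
            continue
        for col, combine in aggregates.items():
            merged[k][col] = combine(merged[k][col], row[col])
    return list(merged.values())


radios = [
    {"id": 1, "datetime": "2022-04-04 15:15:00", "name": "Some name"},
    {"id": 1, "datetime": "2022-04-04 15:15:00", "name": "New name"},
]
replaced = replacing_merge(radios, key="id", version="datetime")

radio_data = [
    {"datetime": "2022-04-04 15:15:00", "id": 1, "power": None, "access": 1},
    {"datetime": "2022-04-04 15:15:00", "id": 1, "power": 12, "access": 2},
]
aggregated = aggregating_merge(
    radio_data,
    key_cols=("id", "datetime"),
    aggregates={"power": any_last, "access": null_sum},
)
print(replaced)
print(aggregated)
```

With the sample rows from the two examples, the model keeps 'New name' for the radios table and produces power 12 and access 3 for radio_data, matching the merged results shown above.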

The table engine and the functions you choose must closely match what you ultimately want to do with your data. Do you replace the whole row on update? Then ReplacingMergeTree is the best choice. Do you update a row partially and apply some function to it? Then AggregatingMergeTree is the best choice, and so on.

That said, there will be cases where you need your data to be "fresh" and free of duplicates.
When your table is well configured, a simple OPTIMIZE TABLE ... is enough. But it is expensive and must be done carefully if you don't want to ruin your server's performance.
You can also merge the data on the fly, but again this is expensive and should only be done on a small subset of data; otherwise it is better to run an OPTIMIZE.

SELECT * FROM radio_data FINAL WHERE id = 1

For instance, we run an OPTIMIZE on all the unmerged partitions that are "in the past", for example those of the previous day. The goal is to run as few OPTIMIZE operations as possible.

My last words are about the ALTER TABLE statement. It allows DELETE and UPDATE, but these are mutations (https://clickhouse.com/docs/en/sql-reference/statements/alter/#mutations) and are not synchronous! Don't rely on them if you require fresh data.

You can find more material here:

https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#mergetree
https://clickhouse.com/docs/en/sql-reference/statements/optimize/
https://clickhouse.com/docs/en/sql-reference/statements/alter/

Removing duplicate rows from table in Oracle

Use the rowid pseudocolumn.

DELETE FROM your_table
WHERE rowid not in
(SELECT MIN(rowid)
FROM your_table
GROUP BY column1, column2, column3);

Where column1, column2, and column3 make up the identifying key for each record. You might list all your columns.
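
The same "keep the MIN(rowid) per group" pattern can be tried quickly in SQLite, which also exposes a rowid for ordinary tables. This is only a sketch with made-up sample data: SQLite's rowid is an integer key, whereas Oracle's ROWID is a physical row address, so treat it as an illustration of the pattern rather than of Oracle itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE your_table (column1 TEXT, column2 TEXT, column3 INTEGER);
    INSERT INTO your_table VALUES
        ('a', 'b', 1), ('a', 'b', 1), ('a', 'b', 2),
        ('c', 'd', 3), ('c', 'd', 3);
""")

# Keep the first physical row of each (column1, column2, column3) group.
deleted = conn.execute("""
    DELETE FROM your_table
    WHERE rowid NOT IN (SELECT MIN(rowid)
                        FROM your_table
                        GROUP BY column1, column2, column3)
""").rowcount

rows = conn.execute(
    "SELECT column1, column2, column3 FROM your_table ORDER BY rowid"
).fetchall()
print(deleted, rows)
```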

How to delete duplicate records in SQL?

You can delete duplicates using, for example, ROW_NUMBER():

with duplicates as
(
select
*
,ROW_NUMBER() OVER (PARTITION BY FirstName, LastName, age ORDER BY FirstName) AS number
from yourTable
)
delete
from duplicates
where number > 1

Each row where number is bigger than 1 is a duplicate.
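
Deleting directly from a CTE as above is not accepted by every engine. Below is a hedged variant that runs in SQLite (3.25 or later, for window functions) and deletes by primary key instead. The id column and the sample rows are assumptions that are not in the original table; ordering by id rather than FirstName also makes it deterministic which row of each group survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The id column is an assumption; the original table definition is not shown.
    CREATE TABLE yourTable (
        id INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT, age INTEGER
    );
    INSERT INTO yourTable (FirstName, LastName, age) VALUES
        ('Ann', 'Lee', 30), ('Ann', 'Lee', 30), ('Bob', 'Ray', 41),
        ('Ann', 'Lee', 31), ('Bob', 'Ray', 41), ('Bob', 'Ray', 41);
""")

# Number the rows inside each duplicate group, then delete everything after the first.
conn.execute("""
    WITH duplicates AS (
        SELECT id,
               ROW_NUMBER() OVER (
                   PARTITION BY FirstName, LastName, age ORDER BY id
               ) AS number
        FROM yourTable
    )
    DELETE FROM yourTable
    WHERE id IN (SELECT id FROM duplicates WHERE number > 1)
""")

rows = conn.execute(
    "SELECT id, FirstName, LastName, age FROM yourTable ORDER BY id"
).fetchall()
print(rows)
```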

Remove old duplicate rows in BQ based on timestamp

You can try this script. It uses COUNT() with HAVING to find duplicate records and TIMESTAMP_DIFF to restrict the delete to those with a timestamp older than 120 minutes from the current time.

DELETE
FROM `table_full_name`
WHERE ad_id in (SELECT ad_id
FROM `table_full_name`
GROUP BY ad_id
HAVING COUNT(ad_id) > 1)
AND TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), timestamp, MINUTE) > 120
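
To see how this filter behaves, here is a small SQLite sketch of the same logic. It is an approximation under stated assumptions: SQLite has no TIMESTAMP_DIFF, so julianday() arithmetic stands in for it, the current time is a fixed value passed as a parameter, and the sample rows are invented.

```python
import sqlite3

NOW = "2024-01-01 12:00:00"  # fixed stand-in for CURRENT_TIMESTAMP()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_full_name (ad_id INTEGER, timestamp TEXT);
    INSERT INTO table_full_name VALUES
        (1, '2024-01-01 07:00:00'),  -- duplicate, 300 minutes old
        (1, '2024-01-01 11:50:00'),  -- duplicate, 10 minutes old
        (2, '2024-01-01 03:40:00'),  -- not a duplicate, 500 minutes old
        (3, '2024-01-01 08:40:00'),  -- duplicate, 200 minutes old
        (3, '2024-01-01 05:20:00');  -- duplicate, 400 minutes old
""")

# Same shape as the BigQuery statement: duplicated ad_id AND older than 120 minutes.
conn.execute("""
    DELETE FROM table_full_name
    WHERE ad_id IN (SELECT ad_id
                    FROM table_full_name
                    GROUP BY ad_id
                    HAVING COUNT(ad_id) > 1)
      AND (julianday(:now) - julianday(timestamp)) * 24 * 60 > 120
""", {"now": NOW})

rows = conn.execute(
    "SELECT ad_id, timestamp FROM table_full_name ORDER BY ad_id"
).fetchall()
print(rows)
```

Note what happens to ad_id 3 in this sample: both of its copies are older than 120 minutes, so both are removed. If you need to keep at least one row per ad_id, the age filter alone is not enough.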


Remove duplicate records in MySQL

You could use a delete join:

DELETE t1
FROM yourTable t1
INNER JOIN yourTable t2
ON t2.account_id = t1.account_id AND
t2.campaign_id <> 51
WHERE
t1.campaign_id = 51;
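
The delete join above removes the campaign 51 row of every account that also has a row for some other campaign. The multi-table DELETE syntax is MySQL-specific; as a sketch, the same condition can be written with EXISTS and checked in SQLite. The sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE yourTable (account_id INTEGER, campaign_id INTEGER);
    INSERT INTO yourTable VALUES
        (1, 51), (1, 52),   -- account 1 has another campaign: its 51 row goes
        (2, 51),            -- account 2 only has campaign 51: kept
        (3, 50), (3, 51),   -- account 3 has another campaign: its 51 row goes
        (4, 60);
""")

# EXISTS form of the self-join: delete campaign 51 only where another campaign exists.
conn.execute("""
    DELETE FROM yourTable
    WHERE campaign_id = 51
      AND EXISTS (SELECT 1
                  FROM yourTable AS t2
                  WHERE t2.account_id = yourTable.account_id
                    AND t2.campaign_id <> 51)
""")

rows = conn.execute(
    "SELECT account_id, campaign_id FROM yourTable ORDER BY account_id, campaign_id"
).fetchall()
print(rows)
```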

Delete duplicate records leaving only the latest one

You can use COUNT with PARTITION BY to find the duplicate records and insert them into the answersArchive table, as follows.

1- Find Duplicate and Insert into answersArchive table

--copy the duplicate records
;WITH cte
AS (SELECT id,
answer,
country_id,
question_id,
updated,
Count(*)
OVER(
partition BY question_id ) ct
FROM answers
WHERE country_id = 15)
INSERT INTO answersarchive
SELECT id,
answer,
country_id,
question_id,
updated
FROM cte
WHERE ct > 1 --Give you duplicate records

2- Delete all duplicates except the latest one.

You can use a CTE to delete the records. To find the duplicate records, use ROW_NUMBER() with PARTITION BY question_id, as in the following query.

;WITH cte 
AS (SELECT id,
answer,
country_id,
question_id,
updated,
Row_number()
OVER(
partition BY question_id
ORDER BY updated DESC) RN
FROM answers
WHERE country_id = 15)

DELETE FROM cte
WHERE rn > 1
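
The two steps can also be run together in one transaction, so that rows are never deleted without having been archived. The following is a sketch in SQLite (3.25 or later) with invented sample data. Unlike step 1 above, it archives only the rows that are about to be deleted, not the latest one, and it adds id as a tie-breaker when two rows share the same updated value; both choices are assumptions of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE answers (
        id INTEGER PRIMARY KEY, answer TEXT, country_id INTEGER,
        question_id INTEGER, updated TEXT
    );
    CREATE TABLE answersarchive (
        id INTEGER, answer TEXT, country_id INTEGER,
        question_id INTEGER, updated TEXT
    );
    INSERT INTO answers VALUES
        (1, 'A', 15, 100, '2022-01-01'),
        (2, 'B', 15, 100, '2022-02-01'),
        (3, 'C', 15, 101, '2022-01-15'),
        (4, 'D', 15, 100, '2022-03-01'),
        (5, 'E', 16, 100, '2022-04-01'),
        (6, 'F', 16, 100, '2022-05-01');
""")

# Rank rows per question for country 15; rn = 1 is the latest answer.
RANKED = """
    SELECT id, answer, country_id, question_id, updated,
           ROW_NUMBER() OVER (
               PARTITION BY question_id ORDER BY updated DESC, id DESC
           ) AS rn
    FROM answers
    WHERE country_id = 15
"""

with conn:  # commits on success, rolls back if either statement fails
    conn.execute(f"""
        INSERT INTO answersarchive
        SELECT id, answer, country_id, question_id, updated
        FROM ({RANKED}) AS ranked
        WHERE rn > 1
    """)
    conn.execute(f"""
        DELETE FROM answers
        WHERE id IN (SELECT id FROM ({RANKED}) AS ranked WHERE rn > 1)
    """)

kept = [r[0] for r in conn.execute("SELECT id FROM answers ORDER BY id")]
archived = [r[0] for r in conn.execute("SELECT id FROM answersarchive ORDER BY id")]
print(kept, archived)
```

Rows for country_id 16 are untouched because of the WHERE country_id = 15 filter, even though they share a question_id.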

