delete duplicate records with in
If you want to delete older duplicate values, you can use:
delete from foo
where foo.id < (select max(foo2.id)
from foo foo2
where foo2.a = foo.a and foo2.b = foo.b
);
Note that an index on (a, b, id) would help performance.
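If you want to try the correlated delete without touching a real database, here is a minimal sketch using Python's built-in sqlite3 module. The table contents and the index name foo_a_b_id are made up for illustration, and SQLite is only standing in for whatever engine you actually use.

```python
import sqlite3


def delete_older_duplicates(conn):
    """Keep only the row with the largest id for each (a, b) pair."""
    conn.execute(
        """
        DELETE FROM foo
        WHERE id < (SELECT MAX(foo2.id)
                    FROM foo AS foo2
                    WHERE foo2.a = foo.a AND foo2.b = foo.b)
        """
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, a TEXT, b TEXT)")
# The index suggested above; the name is an arbitrary choice.
conn.execute("CREATE INDEX foo_a_b_id ON foo (a, b, id)")
conn.executemany(
    "INSERT INTO foo (id, a, b) VALUES (?, ?, ?)",
    [(1, "x", "y"), (2, "x", "y"), (3, "x", "z"), (4, "x", "y")],
)

delete_older_duplicates(conn)
remaining = conn.execute("SELECT id, a, b FROM foo ORDER BY id").fetchall()
print(remaining)
```

Rows 1 and 2 share (a, b) with row 4 and have smaller ids, so they are the ones removed.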
You can also phrase this as a join:
delete from foo
using (select a, b, max(id) as max_id
from foo
group by a, b
) ab
where foo.a = ab.a and foo.b = ab.b and foo.id < ab.max_id;
How to delete duplicate rows in SQL (ClickHouse)?
First of all, the answer depends on the table engine you used.
The most common on ClickHouse is the MergeTree family.
If you use any MergeTree family tables, MaterializedView or Buffer engines, you can use an OPTIMIZE query:
OPTIMIZE TABLE table DEDUPLICATE BY name -- you can put any expression here
https://clickhouse.com/docs/en/sql-reference/statements/optimize/
Before you treat the query above as the answer, you should understand why it works and why it is not always the right way to do it.
In ClickHouse it is normal to have multiple rows for the same primary key. Unlike most database engines, ClickHouse performs no uniqueness check when a row is inserted, which allows very fast insertion into tables.
The name "MergeTree" is no accident: the tables are "optimized" (merged) automatically when ClickHouse thinks it is necessary and/or when it has the time.
What does OPTIMIZE mean in ClickHouse?
This operation simply forces the table to merge its data. Depending on how you built your table, ClickHouse will look for duplicated rows based on your settings and apply the function you asked for.
Two examples:
- ReplacingMergeTree: here the optional parameter is set to datetime, which tells ClickHouse which row is the most recent. On duplicates, the most recent row is kept over the others.
create table radios
(
id UInt64,
datetime DateTime,
name Nullable(String) default NULL
)
engine = ReplicatedReplacingMergeTree(datetime)
ORDER BY id -- it's the primary key
-- example
INSERT INTO radios VALUES (1, now(), 'Some name'), (1, now(), 'New name')
-- after merging:
id, datetime, name
1, '2022-04-04 15:15:00', 'New name'
- AggregatingMergeTree: here a function is applied to compute the final row. This is the closest thing you will find to an UPDATE statement.
create table radio_data
(
datetime DateTime,
id UInt64,
power SimpleAggregateFunction(anyLast, Nullable(Float64)) default NULL,
access SimpleAggregateFunction(sum, Nullable(UInt64)) default NULL
)
engine = ReplicatedAggregatingMergeTree()
ORDER BY (id, datetime) -- the primary key
-- example
INSERT INTO radio_data VALUES ('2022-04-04 15:15:00', 1, NULL, 1), ('2022-04-04 15:15:00', 1, 12, 2)
-- will give after merging :
datetime , id, power, access
2022-04-04 15:15:00, 1, 12, 3
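To make the two merge rules concrete, here is a small Python sketch that imitates what a merge does to the example rows above. It is only a simulation, not ClickHouse code: replacing_merge and aggregating_merge are hypothetical helpers, and the sketch assumes that NULL values are skipped by anyLast and sum, which is what the merged result shown above implies.

```python
def replacing_merge(rows):
    """Imitate ReplacingMergeTree(datetime) ordered by id.

    One row survives per id: the one with the highest datetime.
    On equal datetimes the row inserted last wins.
    """
    kept = {}
    for row in rows:
        current = kept.get(row["id"])
        if current is None or row["datetime"] >= current["datetime"]:
            kept[row["id"]] = row
    return list(kept.values())


def aggregating_merge(rows):
    """Imitate AggregatingMergeTree ordered by (id, datetime).

    power behaves like anyLast and access like sum; None values
    are skipped (an assumption of this sketch).
    """
    merged = {}
    for row in rows:
        key = (row["id"], row["datetime"])
        acc = merged.setdefault(
            key,
            {"datetime": row["datetime"], "id": row["id"],
             "power": None, "access": None},
        )
        if row["power"] is not None:
            acc["power"] = row["power"]
        if row["access"] is not None:
            acc["access"] = (acc["access"] or 0) + row["access"]
    return list(merged.values())


radios = replacing_merge([
    {"id": 1, "datetime": "2022-04-04 15:15:00", "name": "Some name"},
    {"id": 1, "datetime": "2022-04-04 15:15:00", "name": "New name"},
])
radio_data = aggregating_merge([
    {"datetime": "2022-04-04 15:15:00", "id": 1, "power": None, "access": 1},
    {"datetime": "2022-04-04 15:15:00", "id": 1, "power": 12, "access": 2},
])
print(radios)
print(radio_data)
```

Until a merge actually happens, both versions of a row are still visible to a plain SELECT, which is why the rest of this answer talks about OPTIMIZE and FINAL.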
The table engine and the functions you choose must closely match what you finally want to do with your data. Do you replace the whole row on update? Then ReplacingMergeTree is the best fit. Do you update part of a row and apply some function to it? Then AggregatingMergeTree is the best fit, etc.
That said, there will be cases where you need your data "fresh" and not duplicated.
When your table is well configured, a simple OPTIMIZE TABLE ... is enough. BUT this is expensive and must be done smartly if you don't want to ruin your server's performance.
You can also merge the data on the fly, but again, this is expensive and must be done on a small subset of data; otherwise it's better to do an OPTIMIZE.
SELECT * FROM radio_data FINAL WHERE id = 1
For instance, we run an OPTIMIZE on all the un-merged partitions that are "in the past", for example those of the previous day. The goal is to run as few OPTIMIZE operations as possible.
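As a rough sketch of that "previous day" routine, the helper below only builds the statement text; sending it to the server is left to whatever client you use. It assumes the table is partitioned by toYYYYMMDD(datetime), which the table definitions above do not show, and optimize_statement and previous_day are hypothetical names.

```python
from datetime import date, timedelta


def previous_day(today):
    """Return the calendar day before `today`."""
    return today - timedelta(days=1)


def optimize_statement(table, day):
    """Build an OPTIMIZE statement for a single daily partition.

    Assumes PARTITION BY toYYYYMMDD(datetime), so the partition
    value is the day written as YYYYMMDD. The table name is
    interpolated as-is and must come from trusted code.
    """
    partition = day.strftime("%Y%m%d")
    return f"OPTIMIZE TABLE {table} PARTITION {partition} FINAL"


statement = optimize_statement("radio_data", previous_day(date(2022, 4, 5)))
print(statement)
```

Limiting the statement to one partition keeps the forced merge small, which is the point of doing it only for days that no longer receive inserts.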
My last words are on the ALTER TABLE statement. It allows DELETE and UPDATE, but these are mutations (https://clickhouse.com/docs/en/sql-reference/statements/alter/#mutations) and are not synchronous! Don't rely on them if you require fresh data.
You can find more material here:
https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#mergetree
https://clickhouse.com/docs/en/sql-reference/statements/optimize/
https://clickhouse.com/docs/en/sql-reference/statements/alter/
Removing duplicate rows from table in Oracle
Use the rowid pseudocolumn.
DELETE FROM your_table
WHERE rowid not in
(SELECT MIN(rowid)
FROM your_table
GROUP BY column1, column2, column3);
Where column1, column2, and column3 make up the identifying key for each record. You might list all your columns.
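SQLite happens to expose a rowid as well, so the same statement can be tried locally with Python's sqlite3 module. This is only a sketch with made-up data; Oracle's rowid is a different thing internally, but the delete reads the same.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE your_table (column1 TEXT, column2 TEXT, column3 TEXT)")
db.executemany(
    "INSERT INTO your_table VALUES (?, ?, ?)",
    [("a", "b", "c"), ("a", "b", "c"), ("a", "b", "d"), ("a", "b", "c")],
)

# Keep the first physical row of every (column1, column2, column3) group.
db.execute(
    """
    DELETE FROM your_table
    WHERE rowid NOT IN (SELECT MIN(rowid)
                        FROM your_table
                        GROUP BY column1, column2, column3)
    """
)

rows = db.execute(
    "SELECT column1, column2, column3 FROM your_table ORDER BY rowid"
).fetchall()
print(rows)
```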
How to delete duplicate records in SQL?
You can delete duplicates using, for example, ROW_NUMBER():
with duplicates as
(
select
*
,ROW_NUMBER() OVER (PARTITION BY FirstName, LastName, age ORDER BY FirstName) AS number
from yourTable
)
delete
from duplicates
where number > 1
Each row where number is bigger than 1 is a duplicate.
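Deleting straight from a CTE like this is not supported by every engine. Here is a sketch of the same ROW_NUMBER() idea in SQLite (via Python's sqlite3, window functions need SQLite 3.25 or newer), where the numbered rows are matched back through a key instead. The id column and the sample people are assumptions added for the demo.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE yourTable ("
    "id INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT, age INTEGER)"
)
db.executemany(
    "INSERT INTO yourTable (FirstName, LastName, age) VALUES (?, ?, ?)",
    [("Ann", "Lee", 30), ("Ann", "Lee", 30), ("Bob", "Ray", 41), ("Ann", "Lee", 31)],
)

# Number the rows inside each (FirstName, LastName, age) group and
# delete everything after the first one, matching back on id.
db.execute(
    """
    DELETE FROM yourTable
    WHERE id IN (
        SELECT id
        FROM (SELECT id,
                     ROW_NUMBER() OVER (PARTITION BY FirstName, LastName, age
                                        ORDER BY id) AS number
              FROM yourTable)
        WHERE number > 1
    )
    """
)

people = db.execute(
    "SELECT id, FirstName, LastName, age FROM yourTable ORDER BY id"
).fetchall()
print(people)
```

Only the second ("Ann", "Lee", 30) row is removed; the 31-year-old Ann Lee is a different group and stays.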
Remove old duplicate rows in BQ based on timestamp
You can try this script. It uses COUNT() with HAVING to pull duplicate records, and TIMESTAMP_DIFF to keep only those with a timestamp older than 120 minutes from the current time.
DELETE
FROM `table_full_name`
WHERE ad_id in (SELECT ad_id
FROM `table_full_name`
GROUP BY ad_id
HAVING COUNT(ad_id) > 1)
AND TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), timestamp, MINUTE) > 120
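To see which rows that condition actually hits, here is a sketch of the same logic in SQLite through Python's sqlite3. BigQuery is not involved: the table name ads, the ts column holding epoch seconds, and the fixed "current time" are all assumptions made so the example is reproducible.

```python
import sqlite3

NOW = 1_700_000_000  # fixed "current time" in epoch seconds, for the demo only


def minutes_ago(minutes):
    """Return an epoch timestamp `minutes` before NOW."""
    return NOW - minutes * 60


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ads (ad_id INTEGER, ts INTEGER)")
db.executemany(
    "INSERT INTO ads VALUES (?, ?)",
    [
        (1, minutes_ago(300)),  # duplicate, old    -> deleted
        (1, minutes_ago(10)),   # duplicate, recent -> kept
        (2, minutes_ago(200)),  # not a duplicate   -> kept
        (3, minutes_ago(300)),  # duplicate, old    -> deleted
        (3, minutes_ago(250)),  # duplicate, old    -> deleted
    ],
)

db.execute(
    """
    DELETE FROM ads
    WHERE ad_id IN (SELECT ad_id FROM ads GROUP BY ad_id HAVING COUNT(ad_id) > 1)
      AND (? - ts) / 60 > 120
    """,
    (NOW,),
)

ads_left = db.execute("SELECT ad_id, ts FROM ads ORDER BY ad_id").fetchall()
print(ads_left)
```

Note that when every copy of an ad_id is older than the cutoff (ad_id 3 here), all of its copies are deleted, so check that this is what you want before running it on real data.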
Remove duplicate records in MySQL
You could use a delete join:
DELETE t1
FROM yourTable t1
INNER JOIN yourTable t2
ON t2.account_id = t1.account_id AND
t2.campaign_id <> 51
WHERE
t1.campaign_id = 51;
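The multi-table DELETE syntax above is specific to MySQL. For a quick local check of what it removes, here is a sketch of the same condition written with EXISTS and run in SQLite via Python's sqlite3; the sample rows are made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE yourTable (account_id INTEGER, campaign_id INTEGER)")
db.executemany(
    "INSERT INTO yourTable VALUES (?, ?)",
    [(1, 51), (1, 52), (2, 51), (3, 60)],
)

# Delete a campaign-51 row only when the same account also has a row
# for some other campaign (the self-join condition from the answer).
db.execute(
    """
    DELETE FROM yourTable
    WHERE campaign_id = 51
      AND EXISTS (SELECT 1
                  FROM yourTable AS t2
                  WHERE t2.account_id = yourTable.account_id
                    AND t2.campaign_id <> 51)
    """
)

campaigns = db.execute(
    "SELECT account_id, campaign_id FROM yourTable ORDER BY account_id, campaign_id"
).fetchall()
print(campaigns)
```

Account 2 keeps its campaign-51 row because it has no other campaign to fall back on.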
Delete duplicate records leaving only the latest one
You can use COUNT with PARTITION BY to find the duplicate records and insert them into the answersArchive table, as follows.
1- Find duplicates and insert them into the answersArchive table
--copy the duplicate records
;WITH cte
AS (SELECT id,
answer,
country_id,
question_id,
updated,
Count(*)
OVER(
partition BY question_id ) ct
FROM answers
WHERE country_id = 15)
INSERT INTO answersarchive
SELECT id,
answer,
country_id,
question_id,
updated
FROM cte
WHERE ct > 1 --Give you duplicate records
2- Delete all duplicates except the latest one.
You can use a CTE to delete the records. To find the duplicate records, you can use ROW_NUMBER() with PARTITION BY question_id, as in the following query.
;WITH cte
AS (SELECT id,
answer,
country_id,
question_id,
updated,
Row_number()
OVER(
partition BY question_id
ORDER BY updated DESC) RN
FROM answers
WHERE country_id = 15)
DELETE FROM cte
WHERE rn > 1
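Both steps can be tried end to end with Python's sqlite3 (SQLite 3.25 or newer for the window functions). This is only a sketch: SQLite cannot delete from a CTE, so the numbered rows are matched back on id, and the sample answers are invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
columns = "id INTEGER PRIMARY KEY, answer TEXT, country_id INTEGER, question_id INTEGER, updated TEXT"
db.execute(f"CREATE TABLE answers ({columns})")
db.execute(f"CREATE TABLE answersArchive ({columns})")
db.executemany(
    "INSERT INTO answers VALUES (?, ?, ?, ?, ?)",
    [
        (1, "a", 15, 100, "2020-01-01"),
        (2, "b", 15, 100, "2020-02-01"),
        (3, "c", 15, 200, "2020-01-05"),
        (4, "d", 20, 100, "2020-03-01"),
    ],
)

# Step 1: copy every row of a duplicated question (country 15) to the archive.
db.execute(
    """
    INSERT INTO answersArchive
    SELECT id, answer, country_id, question_id, updated
    FROM (SELECT *, COUNT(*) OVER (PARTITION BY question_id) AS ct
          FROM answers
          WHERE country_id = 15)
    WHERE ct > 1
    """
)

# Step 2: delete all but the most recently updated row per question.
db.execute(
    """
    DELETE FROM answers
    WHERE id IN (
        SELECT id
        FROM (SELECT id,
                     ROW_NUMBER() OVER (PARTITION BY question_id
                                        ORDER BY updated DESC) AS rn
              FROM answers
              WHERE country_id = 15)
        WHERE rn > 1
    )
    """
)

archived_ids = [r[0] for r in db.execute("SELECT id FROM answersArchive ORDER BY id")]
kept_ids = [r[0] for r in db.execute("SELECT id FROM answers ORDER BY id")]
print(archived_ids, kept_ids)
```

Row 4 belongs to another country, so it is neither archived nor considered when ranking the duplicates.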