What's the Best Way to Dedupe a Table

Deduping SQL Server table

A 2 billion row table is quite big. Let me assume that first, last, and dob constitutes a "person". My suggestion is to build an index on the "person" and then do the truncate/re-insert approach.

In practice, this looks like:

create index idx_pf_main_first_last_dob on pf_main(first, last, dob);

select m.*
into temp_pf_main
from pf_main m
where not exists (select 1
from pf_main m2
where m2.first = m.first and m2.last = m.last and m2.dob = m.dob and
m2.id < m.id
);

truncate table pf_main;

insert into pf_main
select *
from temp_pf_main;

Best way to combine two tables, remove duplicates, but keep all other non-duplicate values in SQL

If I understand your question correctly you want to join two large tables with thousands of columns that (hopefully) are the same between the two tables using the email column as the join condition and replacing duplicate records between the two tables with the records from Table 2.

I had to do something similar a few days ago so maybe you can modify my query for your purposes:

WITH only_in_table_1 AS(
SELECT *
FROM table_1 A
WHERE NOT EXISTS
(SELECT * FROM table_2 B WHERE B.email_field = A.email_field))
SELECT * FROM table_2
UNION ALL
SELECT * FROM only_in_table_1

If the columns/fields aren't the same between tables you can use a full outer join on only_in_table_1 and table_2

How to delete duplicate rows in SQL ( Clickhouse)?

First of all, the answer depends on the table engine you used.
The most common on ClickHouse is the MergeTree family.

If you use any MergeTree family tables, MaterializedView or Buffer engines, you can use an OPTIMIZE query:

OPTIMIZE TABLE table DEDUPLICATE BY name -- you can put any expression here

https://clickhouse.com/docs/en/sql-reference/statements/optimize/

Before you consider the above query as the answer, you must understand why and why it's not the right way to do it.

In Clickhouse it's normal to have multiple lines for the same primary key, un-like most DB engine, there is no check at all when inserting a line. This allow very fast insertion in tables.

The name "MergeTree" is not here for nothing, in fact the tables are "OPTIMIZED" automatically when Clickhouse thinks its necessary or/and if it have the time for.

What means OPTIMIZE in ClickHouse ?
This operation just force the table to merge it's data. Depending on how you build your table. ClickHouse will look for duplicated line, based on your settings and apply the function you asked for.

Two example :

  • ReplacingMergeTree, here the optional parameter is set to datetime, and give the hint to ClickHouse which line is the most recent. Then on duplicates, the most recent is kept over the others.
create table radios
(
id UInt64,
datetime DateTime,
name Nullable(String) default NULL
)
engine = ReplicatedReplacingMergeTree(datetime)
ORDER BY id -- it's the primary key
-- example
INSERT INTO radios VALUES (1, now(), 'Some name'), (1, now(), 'New name')
-- after merging:
id, datetime, name
1, '2022-04-04 15:15:00', 'New name'
  • AggregatingMergeTree, here a function is applied the compute the final line. This is what you will find the closest to a UPDATE statement.
create table radio_data
(
datetime DateTime,
id UInt64,
power SimpleAggregateFunction(anyLast, Nullable(Float64)) default NULL,
access SimpleAggregateFunction(sum, Nullable(UInt64)) default NULL
)
engine = ReplicatedAggregatingMergeTree()
ORDER BY (id, datetime) -- the primary key

-- example
INSERT INTO radio_data VALUES ('2022-04-04 15:15:00', 1, NULL, 1), ('2022-04-04 15:15:00', 1, 12, 2)
-- will give after merging :
datetime , id, power, access
2022-04-04 15:15:00, 1, 12, 3

The table you choose, the functions you choose, must be really close to what you finally want to do with you data. Do you replace all the line on update ? Then ReplacingMergeTree is the best, do you update partially a line and apply some function on it ? Then AggregatingMergeTree is the best... ect.

This said, you will have some cases where you need to have your data "fresh" and not duplicated.
When your table is well configured, a simple OPTIMIZE TABLE ... is enough. BUT this is expensive, and must be done smartly if you don't want to ruins your server performance.
You can also merge the data on the fly, but again, this is expensive and must be done a small subset of data, otherwise it's better to do an OPTIMIZE.

SELECT * FROM radio_data FINAL WHERE id = 1

For instance, we do an OPTIMIZE on all the un-merged partition that are "in the past", for example on the previous day. The goal is to do it the least as possible OPTOIMIZE operation.

My last words will be on the usage of ALTER TABLE statement. It allows DELETE and UPDATE. But they are mutations (https://clickhouse.com/docs/en/sql-reference/statements/alter/#mutations) and are not synchronous ! Don't rely on them if you require fresh data.

You can find more material here :

https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#mergetree
https://clickhouse.com/docs/en/sql-reference/statements/optimize/
https://clickhouse.com/docs/en/sql-reference/statements/alter/

Fastest technique to deleting duplicate data

What about EXISTS:

DELETE FROM sourceTable
WHERE EXISTS(SELECT NULL
FROM #dupTemp dt
WHERE sourceTable.a = dt.a
AND sourceTable.b = dt.b
AND sourceTable.c = dt.c
AND sourceTable.d = dt.d
AND sourceTable.e = dt.e
AND sourceTable.f = dt.f
AND sourceTable.g = dt.g
AND sourceTable.h = dt.h
AND sourceTable.i = dt.i
AND sourceTable.PriKey < dt.MaxPriKey)

Best way to remove duplicate entries from a data table

Remove Duplicates

public DataTable RemoveDuplicateRows(DataTable dTable, string colName)
{
Hashtable hTable = new Hashtable();
ArrayList duplicateList = new ArrayList();

//Add list of all the unique item value to hashtable, which stores combination of key, value pair.
//And add duplicate item value in arraylist.
foreach (DataRow drow in dTable.Rows)
{
if (hTable.Contains(drow[colName]))
duplicateList.Add(drow);
else
hTable.Add(drow[colName], string.Empty);
}

//Removing a list of duplicate items from datatable.
foreach (DataRow dRow in duplicateList)
dTable.Rows.Remove(dRow);

//Datatable which contains unique records will be return as output.
return dTable;
}

Here Links below

http://www.dotnetspider.com/resources/4535-Remove-duplicate-records-from-table.aspx

http://www.dotnetspark.com/kb/94-remove-duplicate-rows-value-from-datatable.aspx

For remove duplicates in column

http://dotnetguts.blogspot.com/2007/02/removing-duplicate-records-from.html

Remove duplicate rows from a big table

Thanks to "Kazi Mohammad Ali Nur" and "eshirvana". I've combine there solutions. At first I created index on Field1.

CREATE CLUSTERED INDEX Index_Name   
ON MyTable(Field1);

and then I executed following query to insert unique records into a new table and deleted original table.

WITH CTE(Field1, Field2, Field3, Field4, DuplicateCount)
AS (SELECT *,
ROW_NUMBER() OVER(PARTITION BY Field1 ORDER BY Field1) AS DuplicateCount
FROM MyTable)
select * into TempTable FROM CTE
WHERE DuplicateCount = 1;

and it worked.

Thanks to all.



Related Topics



Leave a reply



Submit