Deduping SQL Server table
A 2 billion row table is quite big. Let me assume that first, last, and dob together constitute a "person". My suggestion is to build an index on the "person" columns and then use the truncate/re-insert approach.
In practice, this looks like:
create index idx_pf_main_first_last_dob on pf_main(first, last, dob);
select m.*
into temp_pf_main
from pf_main m
where not exists (select 1
from pf_main m2
where m2.first = m.first and m2.last = m.last and m2.dob = m.dob and
m2.id < m.id
);
truncate table pf_main;
insert into pf_main
select *
from temp_pf_main;
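As an alternative sketch (assuming the same pf_main(id, first, last, dob) schema as above), SQL Server also lets you delete the duplicates in place through an updatable CTE with a window function, keeping the lowest id per person:

    with ranked as (
        select id,
               row_number() over (partition by first, last, dob
                                  order by id) as rn
        from pf_main
    )
    delete from ranked   -- deleting through the CTE removes the underlying rows
    where rn > 1;

Note that on a 2 billion row table a single DELETE like this generates an enormous transaction log, which is exactly why the truncate/re-insert approach above (or deleting in batches) is usually the cheaper option.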
Best way to combine two tables, remove duplicates, but keep all other non-duplicate values in SQL
If I understand your question correctly, you want to combine two large tables with thousands of columns that (hopefully) are the same between the two tables, using the email column as the join condition and replacing duplicate records between the two tables with the records from Table 2.
I had to do something similar a few days ago so maybe you can modify my query for your purposes:
WITH only_in_table_1 AS(
SELECT *
FROM table_1 A
WHERE NOT EXISTS
(SELECT * FROM table_2 B WHERE B.email_field = A.email_field))
SELECT * FROM table_2
UNION ALL
SELECT * FROM only_in_table_1
If the columns/fields aren't the same between the tables, you can use a full outer join between only_in_table_1 and table_2.
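A sketch of that full outer join variant, reusing the CTE from above. Here col_only_in_1 and col_only_in_2 are illustrative column names, not from the original question; since only_in_table_1 and table_2 share no email values by construction, the join effectively stacks the two row sets with NULLs in the other side's columns:

    WITH only_in_table_1 AS(
    SELECT *
    FROM table_1 A
    WHERE NOT EXISTS
    (SELECT * FROM table_2 B WHERE B.email_field = A.email_field))
    SELECT COALESCE(B.email_field, A.email_field) AS email_field,
           A.col_only_in_1,   -- illustrative: columns present only in table_1
           B.col_only_in_2    -- illustrative: columns present only in table_2
    FROM only_in_table_1 A
    FULL OUTER JOIN table_2 B
        ON A.email_field = B.email_field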
How to delete duplicate rows in SQL (ClickHouse)?
First of all, the answer depends on the table engine you used.
The most common on ClickHouse is the MergeTree family.
If you use any MergeTree family tables, MaterializedView or Buffer engines, you can use an OPTIMIZE query:
OPTIMIZE TABLE table DEDUPLICATE BY name -- you can put any expression here
https://clickhouse.com/docs/en/sql-reference/statements/optimize/
Before you take the above query as the answer, you should understand why it works and why it's often not the right way to do it.
In ClickHouse it's normal to have multiple rows for the same primary key: unlike most database engines, there is no uniqueness check at all when inserting a row. This allows very fast insertion into tables.
The name "MergeTree" is not there for nothing; tables are in fact "optimized" (merged) automatically when ClickHouse thinks it's necessary and/or has the time for it.
What does OPTIMIZE mean in ClickHouse?
This operation simply forces the table to merge its data. Depending on how you built your table, ClickHouse will look for duplicated rows, based on your settings, and apply the function you asked for.
Two examples:
- ReplacingMergeTree: here the optional parameter is set to datetime, which gives ClickHouse a hint as to which row is the most recent. On duplicates, the most recent row is then kept over the others.
create table radios
(
id UInt64,
datetime DateTime,
name Nullable(String) default NULL
)
engine = ReplacingMergeTree(datetime)
ORDER BY id -- it's the primary key
-- example
INSERT INTO radios VALUES (1, now(), 'Some name'), (1, now(), 'New name')
-- after merging:
id, datetime, name
1, '2022-04-04 15:15:00', 'New name'
- AggregatingMergeTree: here a function is applied to compute the final row. This is the closest you will find to an UPDATE statement.
create table radio_data
(
datetime DateTime,
id UInt64,
power SimpleAggregateFunction(anyLast, Nullable(Float64)) default NULL,
access SimpleAggregateFunction(sum, Nullable(UInt64)) default NULL
)
engine = AggregatingMergeTree()
ORDER BY (id, datetime) -- the primary key
-- example
INSERT INTO radio_data VALUES ('2022-04-04 15:15:00', 1, NULL, 1), ('2022-04-04 15:15:00', 1, 12, 2)
-- will give after merging :
datetime , id, power, access
2022-04-04 15:15:00, 1, 12, 3
The table you choose and the functions you choose must match closely what you finally want to do with your data. Do you replace the whole row on update? Then ReplacingMergeTree is best. Do you update a row partially and apply some function to it? Then AggregatingMergeTree is best, etc.
That said, you will have some cases where you need your data to be "fresh" and deduplicated.
When your table is well configured, a simple OPTIMIZE TABLE ... is enough. BUT this is expensive, and must be done smartly if you don't want to ruin your server's performance.
You can also merge the data on the fly, but again, this is expensive and should only be done on a small subset of data; otherwise it's better to do an OPTIMIZE.
SELECT * FROM radio_data FINAL WHERE id = 1
For instance, we run an OPTIMIZE on all the un-merged partitions that are "in the past", for example those of the previous day. The goal is to run as few OPTIMIZE operations as possible.
My last words concern the ALTER TABLE statement. It allows DELETE and UPDATE, but these are mutations (https://clickhouse.com/docs/en/sql-reference/statements/alter/#mutations) and are not synchronous! Don't rely on them if you require fresh data.
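For completeness, a mutation on the radios table from above might look like the following sketch; the statement returns before the rows are actually rewritten, and progress can be watched in system.mutations:

    -- asynchronous: marks the affected parts for rewriting, returns immediately
    ALTER TABLE radios DELETE WHERE name = 'Some name';
    -- check whether the mutation has finished
    SELECT command, is_done FROM system.mutations WHERE table = 'radios';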
You can find more material here :
https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#mergetree
https://clickhouse.com/docs/en/sql-reference/statements/optimize/
https://clickhouse.com/docs/en/sql-reference/statements/alter/
Fastest technique to deleting duplicate data
What about EXISTS:
DELETE FROM sourceTable
WHERE EXISTS(SELECT NULL
FROM #dupTemp dt
WHERE sourceTable.a = dt.a
AND sourceTable.b = dt.b
AND sourceTable.c = dt.c
AND sourceTable.d = dt.d
AND sourceTable.e = dt.e
AND sourceTable.f = dt.f
AND sourceTable.g = dt.g
AND sourceTable.h = dt.h
AND sourceTable.i = dt.i
AND sourceTable.PriKey < dt.MaxPriKey)
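On a very large sourceTable, a single DELETE like this can bloat the transaction log and hold long locks. A common SQL Server variant (a sketch, assuming the same column names as above) is to delete in batches until no rows remain:

    WHILE 1 = 1
    BEGIN
        DELETE TOP (100000) st
        FROM sourceTable st
        WHERE EXISTS(SELECT NULL
                     FROM #dupTemp dt
                     WHERE st.a = dt.a AND st.b = dt.b AND st.c = dt.c
                       AND st.d = dt.d AND st.e = dt.e AND st.f = dt.f
                       AND st.g = dt.g AND st.h = dt.h AND st.i = dt.i
                       AND st.PriKey < dt.MaxPriKey);
        IF @@ROWCOUNT = 0 BREAK;  -- stop once a pass deletes nothing
    END;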
Best way to remove duplicate entries from a data table
Remove Duplicates
public DataTable RemoveDuplicateRows(DataTable dTable, string colName)
{
    Hashtable hTable = new Hashtable();
    ArrayList duplicateList = new ArrayList();

    // Add each unique value to the hashtable (as a key/value pair),
    // and collect rows whose value was already seen in the arraylist.
    foreach (DataRow drow in dTable.Rows)
    {
        if (hTable.Contains(drow[colName]))
            duplicateList.Add(drow);
        else
            hTable.Add(drow[colName], string.Empty);
    }

    // Remove the collected duplicate rows from the datatable.
    foreach (DataRow dRow in duplicateList)
        dTable.Rows.Remove(dRow);

    // The datatable now contains only unique records.
    return dTable;
}
See the links below:
http://www.dotnetspider.com/resources/4535-Remove-duplicate-records-from-table.aspx
http://www.dotnetspark.com/kb/94-remove-duplicate-rows-value-from-datatable.aspx
For removing duplicates in a single column:
http://dotnetguts.blogspot.com/2007/02/removing-duplicate-records-from.html
Remove duplicate rows from a big table
Thanks to "Kazi Mohammad Ali Nur" and "eshirvana". I've combined their solutions. First I created an index on Field1:
CREATE CLUSTERED INDEX Index_Name
ON MyTable(Field1);
and then I executed the following query to insert the unique records into a new table and dropped the original table.
WITH CTE(Field1, Field2, Field3, Field4, DuplicateCount)
AS (SELECT *,
ROW_NUMBER() OVER(PARTITION BY Field1 ORDER BY Field1) AS DuplicateCount
FROM MyTable)
select * into TempTable FROM CTE
WHERE DuplicateCount = 1;
and it worked.
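The final swap (dropping the original table and renaming the new one) isn't shown above; on SQL Server it might look like this sketch, using the TempTable/MyTable names from the query:

    -- after verifying TempTable has the expected row count:
    DROP TABLE MyTable;
    EXEC sp_rename 'TempTable', 'MyTable';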
Thanks to all.