How to delete duplicates in SQL table based on multiple fields
You should be able to do a correlated subquery to delete the data. Find all rows that are duplicates and delete all but the one with the smallest id. For MYSQL, an inner join (functional equivalent of EXISTS) needs to be used, like so:
delete games from games inner join
(select min(id) minid, date, time,
hometeam_id, awayteam_id, locationcity, locationstate
from games
group by date, time, hometeam_id,
awayteam_id, locationcity, locationstate
having count(1) > 1) as duplicates
on (duplicates.date = games.date
and duplicates.time = games.time
and duplicates.hometeam_id = games.hometeam_id
and duplicates.awayteam_id = games.awayteam_id
and duplicates.locationcity = games.locationcity
and duplicates.locationstate = games.locationstate
and duplicates.minid <> games.id)
To test, replace delete games from games
with select * from games
. Don't just run a delete on your DB :-)
Eliminating duplicate values based on only one column of the table
This is where the window function row_number()
comes in handy:
SELECT s.siteName, s.siteIP, h.date
FROM sites s INNER JOIN
(select h.*, row_number() over (partition by siteName order by date desc) as seqnum
from history h
) h
ON s.siteName = h.siteName and seqnum = 1
ORDER BY s.siteName, h.date
remove duplicate rows based on one column value
This is similar to Gordon Linoff's query, but without the subquery:
DELETE t1 FROM table t1
JOIN table t2
ON t2.refID = t1.refID
AND t2.ID < t1.ID
This uses an inner join to only delete rows where there is another row with the same refID but lower ID.
The benefit of avoiding a subquery is being able to utilize an index for the search. This query should perform well with a multi-column index on refID + ID.
Delete duplicate rows in MySQL based on contents of another table
@MHardwick and @ShadowRay almost got it right. The following also checks to make sure the email exists more tan once in tb_email_to_members
DELETE FROM tb_email_to_members
WHERE email_id NOT IN (SELECT frn_email_id FROM tb_tx)
AND email_address IN (SELECT email_address FROM tb_email_to_members GROUP BY email_address HAVING COUNT(email_address) > 1);
And obviously changing DELETE
to SELECT *
will show you what exactly you're about to delete.
Bonus points for knowing tb
is short for tidbits?
Remove duplicates from table based on multiple criteria and persist to other table
If most of the rows are deleted (mostly dupes) and the table fits into RAM, consider this route:
SELECT
surviving rows into a temporary table.- Reroute FK references to survivors
DELETE
all rows from the base table.- Re-
INSERT
survivors.
1a. Distill surviving rows
CREATE TEMP TABLE tmp AS
SELECT DISTINCT ON (login_name, password) *
FROM (
SELECT DISTINCT ON (email) *
FROM taccounts
ORDER BY email, last_login DESC
) sub
ORDER BY login_name, password, last_login DESC;
About DISTINCT ON
:
- Select first row in each GROUP BY group?
To identify duplicates for two different criteria, use a subquery to apply the two rules one after the other. The first step preserves the account with the latest last_login
, so this is "serializable".
Inspect results and test for plausibility.
SELECT * FROM tmp;
Temporary tables are dropped automatically at the end of a session. In pgAdmin (which you seem to be using) the session lives as long as the editor window is open.
1b. Alternative query for updated definition of "duplicates"
SELECT *
FROM taccounts t
WHERE NOT EXISTS (
SELECT FROM taccounts t1
WHERE ( NULLIF(t1.email, '') = t.email
OR (NULLIF(t1.login_name, ''), NULLIF(t1.password, '')) = (t.login_name, t.password))
AND (t1.last_login, t1.account_id) > (t.last_login, t.account_id)
);
This doesn't treat NULL
or empty string (''
) as identical in any of the "duplicate" columns.
The row expression (t1.last_login, t1.account_id)
takes care of the possibility that two dupes could share the same last_login
. The one with the bigger account_id
is chosen in this case - which is unique, since it is the PK.
2a. How to identify all incoming FKs
SELECT c.confrelid::regclass::text AS referenced_table
, c.conname AS fk_name
, pg_get_constraintdef(c.oid) AS fk_definition
FROM pg_attribute a
JOIN pg_constraint c ON (c.conrelid, c.conkey[1]) = (a.attrelid, a.attnum)
WHERE c.confrelid = 'taccounts'::regclass -- (schema-qualified) table name
AND c.contype = 'f'
ORDER BY 1, contype DESC;
Only building on the first column of the foreign key. More about that:
- Find the referenced table name using table, field and schema name
Or inspect the Dependents
rider in the right hand window of the object browser of pgAdmin after selecting the table taccounts
.
2b. Reroute to new primary
If you have tables referencing taccounts
(incoming foreign keys to taccounts
) you will want to update all those fields, before you delete the dupes.
Reroute all of them to the new primary row:
UPDATE referencing_tbl r
SET referencing_column = tmp.reference_column
FROM tmp
JOIN taccounts t1 USING (email)
WHERE r.referencing_column = t1.referencing_column
AND referencing_column IS DISTINCT FROM tmp.reference_column;
UPDATE referencing_tbl r
SET referencing_column = tmp.reference_column
FROM tmp
JOIN taccounts t2 USING (login_name, password)
WHERE r.referencing_column = t1.referencing_column
AND referencing_column IS DISTINCT FROM tmp.reference_column;
3. & 4. Go in for the kill
Now, dupes are not referenced any more. Go in for the kill.
ALTER TABLE taccounts DISABLE TRIGGER ALL;
DELETE FROM taccounts;
VACUUM taccounts;
INSERT INTO taccounts
SELECT * FROM tmp;
ALTER TABLE taccounts ENABLE TRIGGER ALL;
Disable all triggers for the duration of the operation. This avoids checking for referential integrity during the operation. Everything should be fine once you re-activate triggers. We took care of all incoming FKs above. Outgoing FKs are guaranteed to be sound, since you have no concurrent write access and all values have been there before.
Related Topics
Counting Number of Joined Rows in Left Join
Jooq - Support for Update ... Set ... Query with Arbitrary Degree
SQL Server 2008: the Columns in Table Do Not Match an Existing Primary Key or Unique Constraint
Just Get Column Names from Hive Table
Get .SQL File from SQL Server 2012 Database
Postgresql - Using Subqueries with Alter Sequence Expressions
Rows to Comma Separated Values Using Xml Tag
Why Does No Database Fully Support Ansi or Iso SQL Standards
SQL Server Giving Logins(Users) Db_Owner Access to Database
Suppress Output of Variables Substitution in SQLplus
Sqlite: Count Slow on Big Tables
Why Does SQL Server Return 0 for 1/2
Trim Spaces in String - Ltrim Rtrim Not Working
SQL Full Text Search VS "Like"
Oracle Copy Data to Another Table
SQL Query Selecting Different Row Result in JSON_Modify Because of in Operator Provided Value