Deleting Duplicate Rows from Redshift

SQL - Remove all duplicates and retain just one

I don't think Redshift has a way to identify individual rows when all of the column values are the same. So I think your best bet is to recreate the table:

create table temp_mytable as
select distinct *
from mytable;

truncate table mytable;

insert into mytable
select *
from temp_mytable;

drop table temp_mytable;

If your table really did have a primary key, there would be alternative ways of deleting rows.
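For example, a minimal sketch, assuming a primary key column named id and duplicates defined as rows sharing the same values in column1 through column3 (all hypothetical names):

-- keep the row with the lowest id in each group of duplicates
delete from mytable
where id not in (select min(id)
                 from mytable
                 group by column1, column2, column3
                );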

Remove all duplicates from Redshift database

One thing to bear in mind with Redshift is that deleted records are actually only "soft" deleted until VACUUM is run.

- They remain in the table, marked as to-be-ignored

- They're only physically removed after a VACUUM

However, a VACUUM on a large table with deletes scattered through it is very often slower than a "Deep Copy". (Duplicate the data into another table, using GROUP BY or DISTINCT to eliminate the duplicates, then either TRUNCATE the original table and re-insert the data, or drop the original table and rename the new table.)
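For reference, a minimal sketch of the VACUUM step itself, assuming a table named mytable; the DELETE ONLY variant reclaims space from deleted rows without re-sorting:

-- reclaim space from soft-deleted rows only (no re-sort)
VACUUM DELETE ONLY mytable;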

This is the general rationale for why you may actually benefit from what feels like the "slow" process.


Also, if two rows really are identical then there is, by definition, no way to uniquely identify one of them. That being the case, you can't differentiate between the one to be kept and the ones to be deleted.

One "trick" in other RDBMS is to use ROW_NUMBER() inside of a Common Table Expression and then delete from that CTE. (With the CTE creating the unique identifiers, allowing you to identify individual rows to be kept or deleted.) Unfortunately Redshift doesn't currently support deleting from a CTE.

Until this changes, a Deep Copy (copying to a separate table while using GROUP BY or DISTINCT) is your only option.

Even so, the Deep Copy option may still be the better choice in Redshift even if deleting from a CTE does become possible.


EDIT:

Correction:

If any row in a Redshift table has been deleted, any subsequent VACUUM will reprocess the entire table (regardless of where the deleted rows are, or how many deleted rows there are).

(It's more sophisticated when VACUUMing following an INSERT, but downright ugly following a DELETE.)

I've also noticed that a Deep Copy uses less disk space than a VACUUM (which only came to my attention when we ran out of disk space...).


EDIT:

Code Example:

CREATE TABLE blah_temp (
    <Exactly the same DDL as the original table, especially Distribution and Sort keys>
);

INSERT INTO blah_temp
SELECT DISTINCT *
FROM blah;

DROP TABLE blah;

ALTER TABLE blah_temp RENAME TO blah;

Or...

CREATE TABLE blah_temp (
    <Exactly the same DDL as the original table, especially Distribution and Sort keys>
);

INSERT INTO blah_temp
SELECT *
FROM blah
GROUP BY a, b, c, d, e, f, g, etc;

TRUNCATE TABLE blah;

INSERT INTO blah
SELECT *
FROM blah_temp;

DROP TABLE blah_temp;


Related Link: https://docs.aws.amazon.com/redshift/latest/dg/performing-a-deep-copy.html
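Incidentally, rather than hand-copying the DDL, Redshift's CREATE TABLE ... (LIKE ...) can reproduce it; a minimal sketch, assuming blah is the original table (LIKE preserves the distribution style and sort keys; add INCLUDING DEFAULTS if you also need column defaults):

-- inherits distribution style and sort keys from blah
CREATE TABLE blah_temp (LIKE blah);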

Removing duplicates in all fields but one, from a table in Redshift

I would suggest recreating the table:

create table new_t as
    select <all columns>  -- list the columns explicitly so seqnum is not carried into new_t
    from (select t.*,
                 row_number() over (partition by <all other columns> order by mongo_id) as seqnum
          from t
         ) t
    where seqnum = 1;

If you must put the data back in place, you can truncate the existing table and then copy these results into it, as sketched below.
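A minimal sketch of that step, assuming new_t was built as above:

truncate table t;

insert into t
select *
from new_t;

drop table new_t;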

Deleting lots of rows in a table can be much more expensive than using a query and saving the results.

SQL - Redshift remove duplicate rows without primary key

In Postgres, you can do this using ctid. This is a system "column" that physically identifies each row.

The idea is:

delete from tablename
where ctid not in (select min(t2.ctid)
                   from tablename t2
                   group by column1, column2, column3
                  );

I am not sure if Redshift supports ctid. But then again, despite the tags, your question is explicitly about Postgres.
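For illustration, a hypothetical end-to-end run of the idea in Postgres (table and column names invented here):

create table tablename (column1 int, column2 text, column3 date);

insert into tablename values
    (1, 'a', '2020-01-01'),
    (1, 'a', '2020-01-01'),  -- exact duplicate
    (2, 'b', '2020-01-02');

delete from tablename
where ctid not in (select min(t2.ctid)
                   from tablename t2
                   group by column1, column2, column3
                  );

-- tablename now holds one (1, 'a') row and the (2, 'b') row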


