Efficiently Duplicate Some Rows in a PostgreSQL Table

PostgreSQL: how to duplicate a row

You need to create a new ID for the newly inserted row:

INSERT INTO web_book( 
id, page_count, year_published, file, image,
display_on_hp, name, description, name_cs,
name_en, description_cs, description_en
)
SELECT nextval('web_book_id_seq'),
page_count,
year_published,
file,
image,
display_on_hp,
name,
description,
name_cs,
name_en,
description_cs,
description_en
FROM web_book WHERE id=3;

As mentioned by ClodoaldoNeto, you can make things a bit easier by simply leaving out the ID column and letting the default definition do its job:

INSERT INTO web_book( 
page_count, year_published, file, image,
display_on_hp, name, description, name_cs,
name_en, description_cs, description_en
)
SELECT page_count,
year_published,
file,
image,
display_on_hp,
name,
description,
name_cs,
name_en,
description_cs,
description_en
FROM web_book WHERE id=3;

In this case you don't need to know the sequence name (but it is a bit less obvious what's going on).
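
If you need to know which id the copy received (for example to reference it afterwards), a RETURNING clause can report it. A minimal sketch of the second variant, using the same assumed columns:

INSERT INTO web_book(
    page_count, year_published, file, image,
    display_on_hp, name, description, name_cs,
    name_en, description_cs, description_en
)
SELECT page_count, year_published, file, image,
       display_on_hp, name, description, name_cs,
       name_en, description_cs, description_en
FROM web_book
WHERE id = 3
RETURNING id;  -- the id generated for the newly inserted copy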

Most efficient way to select duplicate rows with max timestamp

Use DISTINCT ON:

SELECT DISTINCT ON (id) id, content, time
FROM yourTable
ORDER BY id, time DESC;

On Postgres, this is usually the most performant way to write your query, and it should outperform ROW_NUMBER and other approaches.

The following index might speed up this query:

CREATE INDEX idx ON yourTable (id, time DESC, content);

This index, if used, would let Postgres rapidly find, for each id, the record having the latest time. This index also covers the content column.
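
For comparison, the ROW_NUMBER approach mentioned above would look roughly like this (same table and column names assumed); it produces the same result but needs an extra subquery around the window function:

SELECT id, content, time
FROM (
    SELECT id, content, time,
           row_number() OVER (PARTITION BY id ORDER BY time DESC) AS rn
    FROM yourTable
) t
WHERE rn = 1;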

Most efficient way to remove duplicates - Postgres

demo:db<>fiddle

Finding duplicates can easily be achieved by using the row_number() window function:

SELECT ctid
FROM (
    SELECT
        *,
        ctid,
        row_number() OVER (PARTITION BY col1, col2, col3 ORDER BY ctid)
    FROM test
) s
WHERE row_number >= 2

This groups identical rows together and adds a row counter. So every row with row_number > 1 is a duplicate that can be deleted:

DELETE
FROM test
WHERE ctid IN
(
    SELECT ctid
    FROM (
        SELECT
            *,
            ctid,
            row_number() OVER (PARTITION BY col1, col2, col3 ORDER BY ctid)
        FROM test
    ) s
    WHERE row_number >= 2
)

I don't know if this solution is faster than your attempts, but you could give it a try.

Furthermore - as @a_horse_with_no_name already stated - I would recommend using your own identifier (such as a serial primary key) instead of ctid, for performance reasons.
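
Assuming the table has (or is given) such a surrogate id column, the same delete could be written against that key instead of ctid; a sketch with the column names used above:

-- "id" is the assumed surrogate key column
DELETE
FROM test
WHERE id IN
(
    SELECT id
    FROM (
        SELECT
            id,
            row_number() OVER (PARTITION BY col1, col2, col3 ORDER BY id)
        FROM test
    ) s
    WHERE row_number >= 2
)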


Edit:

For my test data your first version seems to be a little bit faster than my solution. Your second version seems to be slower, and your third version does not work for me (after fixing the syntax errors it returns no result).

demo:db<>fiddle

How to drop duplicate rows from a PostgreSQL table

That is a lot of rows to delete. I would suggest just recreating the table:

create table new_classification as
select distinct c.*
from classification c;

After you have validated the data, you can reload it if you really want:

truncate table classification;

insert into classification
select *
from new_classification;

This process should be much faster than deleting 90% of the rows.
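
One caveat: create table ... as select copies only the data, so if you decide to keep new_classification instead of reloading, any indexes, constraints, and defaults from the original table have to be re-created by hand. A hedged sketch, with hypothetical column and index names:

-- only needed if new_classification replaces the original table;
-- the column and index names below are hypothetical
ALTER TABLE new_classification ADD PRIMARY KEY (id);
CREATE INDEX new_classification_name_idx ON new_classification (name);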

Efficient inserts with duplicate checks for large tables in Postgres

Your query idea is okay. I would try timing it for 100,000 rows in the batch, to start to get an idea of an optimal batch size.

However, the distinct on is slowing things down. Here are two ideas.

The first is to assume that duplicates in batches are quite rare. If this is true, try inserting the data without the distinct on. If that fails, then run the code again with the distinct on. This complicates the insertion logic, but it might make the average insertion much shorter.

The second is to build an index on temporary_readings(timestamp, modem_serial) (not a unique index). Postgres will take advantage of this index for the insertion logic -- and sometimes building an index and using it is faster than alternative execution plans. If this does work, you might try larger batch sizes.
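
That index could be created roughly like this (the table and column names are the ones mentioned in the question; the index name is made up):

CREATE INDEX temporary_readings_ts_serial_idx
    ON temporary_readings (timestamp, modem_serial);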

There is a third solution, which is to use on conflict. That would allow the insertion itself to ignore duplicate values. This is only available in Postgres 9.5 and later, though.
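
A rough sketch of that approach, assuming the target table is called readings, has a value column, and carries a unique constraint on (timestamp, modem_serial) -- all of these names are assumptions except the two columns taken from the question:

-- ON CONFLICT needs a unique index or constraint on (timestamp, modem_serial);
-- "readings" and "value" are hypothetical names
INSERT INTO readings (timestamp, modem_serial, value)
SELECT timestamp, modem_serial, value
FROM temporary_readings
ON CONFLICT (timestamp, modem_serial) DO NOTHING;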

Delete duplicate rows from a large (~100 million row) PostgreSQL table (truncate with condition?)

You could use the ctid column as a "replacement id":

DELETE FROM user_accounts
USING user_accounts ua2
WHERE user_accounts.email = ua2.email
AND user_accounts.ctid < ua2.ctid;

Although that raises another question: why doesn't your user_accounts table have a primary key?

But if you delete a substantial part of the rows in the table, then a delete will never be very efficient (and the comparison on ctid isn't a quick one either, because it cannot use an index). So the delete will most probably take a very long time.

For a one-time operation where you need to delete many rows, inserting the rows you want to keep into an intermediate table is going to be much faster.

That method can be improved by simply keeping the intermediate table instead of copying the rows back to the original table.

-- this will create the same table including indexes and not null constraints
-- but NOT foreign key constraints!
create table temp (like user_accounts including all);

insert into temp
select distinct ... -- this is your query that removes the duplicates
from user_accounts;

-- you might need cascade if the table is referenced by others
drop table user_accounts;

alter table temp rename to user_accounts;

commit;

The only drawback is that you have to re-create the foreign keys involving the original table (both foreign keys referencing it from other tables and foreign keys from it to other tables).
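
A sketch of re-creating such a foreign key after the rename; the orders table, its columns, and the constraint name are purely hypothetical, and the referenced column must carry a primary key or unique constraint:

-- hypothetical example: a table "orders" that referenced user_accounts
ALTER TABLE orders
    ADD CONSTRAINT orders_user_account_id_fkey
    FOREIGN KEY (user_account_id) REFERENCES user_accounts (id);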


