How to Bulk Insert Only New Rows in PostgreSQL

Import data

COPY everything to a temporary staging table and insert only new titles into your target table.

CREATE TEMP TABLE tmp(title text);

-- Server-side COPY needs a path readable by the server process;
-- use \copy in psql to read a file on the client instead.
COPY tmp FROM 'path/to/file.csv' (FORMAT csv);
ANALYZE tmp;

INSERT INTO tbl (title)
SELECT DISTINCT tmp.title
FROM tmp
LEFT JOIN tbl USING (title)
WHERE tbl.title IS NULL;

IDs are generated automatically by a serial column tbl_id in tbl.
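A minimal sketch of such a target table, using the names tbl and tbl_id from above (the UNIQUE constraint on title is an extra assumption that also guards against duplicates at the schema level):

CREATE TABLE tbl (
  tbl_id serial PRIMARY KEY,  -- ID generated automatically
  title  text NOT NULL UNIQUE -- assumed unique so existing titles cannot repeat
);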

The LEFT JOIN / IS NULL construct disqualifies titles that already exist in tbl. NOT EXISTS would be another possibility, as sketched below.
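A sketch of the NOT EXISTS variant against the same tables (equivalent here, since title values are non-null):

INSERT INTO tbl (title)
SELECT DISTINCT tmp.title
FROM tmp
WHERE NOT EXISTS (
  SELECT 1 FROM tbl
  WHERE tbl.title = tmp.title
);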

DISTINCT weeds out duplicate titles within the incoming data in the temporary table tmp.

ANALYZE makes sure the query planner picks a sensible plan; temporary tables are not analyzed by autovacuum, so you have to do it manually.

Since you have 3 million items, it might pay to raise the setting for temp_buffers (for this session only):

SET temp_buffers = '1000MB';

Or as much as you can afford, and enough to hold the temp table in RAM, which is much faster. Note: this must be done first in the session, before any temporary objects are created.
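To gauge how much that is, you can check the staging table's footprint after the COPY (pg_size_pretty() and pg_total_relation_size() are built-in Postgres functions; 'tmp' is the staging table from above):

SELECT pg_size_pretty(pg_total_relation_size('tmp'));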

Retrieve IDs

To see all IDs for the imported data:

SELECT tbl.tbl_id, tbl.title
FROM tbl
JOIN tmp USING (title);

In the same session! A temporary table is dropped automatically at the end of the session.

What's the fastest way to do a bulk insert into Postgres?

PostgreSQL has a guide on how best to populate a database initially ("Populating a Database" in the manual), and it suggests using the COPY command for bulk loading rows. The guide has some other good tips on speeding up the process, such as removing indexes and foreign keys before loading the data (and adding them back afterwards).
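A sketch of that drop-and-recreate pattern for an index (items, items_title_idx, and the file path are hypothetical names for illustration):

BEGIN;
-- hypothetical index name; drop it before the bulk load
DROP INDEX IF EXISTS items_title_idx;
COPY items FROM '/path/to/items.csv' (FORMAT csv);  -- bulk load
-- rebuild the index once, after all rows are in
CREATE INDEX items_title_idx ON items (title);
COMMIT;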

Insert multiple rows where not exists in PostgreSQL

Your select is not doing what you think it does.

The most compact version in PostgreSQL would be something like this:

with data (first_name, last_name, uid) as (
  values
    ('John', 'Doe', '3sldkjfksjd'),
    ('Jane', 'Doe', 'adslkejkdsjfds')
)
insert into users (first_name, last_name, uid)
select d.first_name, d.last_name, d.uid
from data d
where not exists (select 1
                  from users u2
                  where u2.uid = d.uid);

Which is pretty much equivalent to:

insert into users (first_name, last_name, uid)
select d.first_name, d.last_name, d.uid
from (
  select 'John' as first_name, 'Doe' as last_name, '3sldkjfksjd' as uid
  union all
  select 'Jane', 'Doe', 'adslkejkdsjfds'
) as d
where not exists (select 1
                  from users u2
                  where u2.uid = d.uid);

How to do PostgreSQL Bulk INSERT without Primary Key Violation

Generally for this type of situation I'd have a separate staging table that does not have the PK constraint, which I'd populate using COPY (assuming the data were in a format for which it makes sense to do a COPY). Then I'd do something like:

-- target_table stands in for your real table name ("table" itself is a reserved word)
insert into target_table
select a.*
from staging a
where not exists (select 1
                  from target_table b
                  where a.id = b.id);

That approach isn't too far off from your original design.

I don't totally understand this part of your question, though, which doesn't even seem entirely relevant to it:

this approach unfortunately still doesn't work - because every single statement in postgreSQL is committed separately.

That's not true at all, not for any RDBMS. Sure, auto-commit might be enabled on your client, but that doesn't mean Postgres commits every statement separately, or that you can't disable auto-commit. This approach would work:

begin;
insert into target_table (id) select 1 where not exists (select 1 from target_table where id = 1);
insert into target_table (id) select 2 where not exists (select 1 from target_table where id = 2);
insert into target_table (id) select 3 where not exists (select 1 from target_table where id = 3);
commit;

As you pointed out, however, if you've got more than a handful of such statements, you'll quickly run into performance concerns; one way around that is sketched below.
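A sketch that batches the rows into a single statement with a VALUES list, in the spirit of the earlier answer (target_table and id are the placeholder names from above):

insert into target_table (id)
select v.id
from (values (1), (2), (3)) as v(id)  -- all candidate rows in one statement
where not exists (select 1
                  from target_table t
                  where t.id = v.id);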


