Avoid Duplicates in Insert into Select Query in SQL Server

Using NOT EXISTS:

INSERT INTO TABLE_2
(id, name)
SELECT t1.id,
t1.name
FROM TABLE_1 t1
WHERE NOT EXISTS(SELECT id
FROM TABLE_2 t2
WHERE t2.id = t1.id)

Using NOT IN:

INSERT INTO TABLE_2
(id, name)
SELECT t1.id,
t1.name
FROM TABLE_1 t1
WHERE t1.id NOT IN (SELECT id
FROM TABLE_2)

Using LEFT JOIN/IS NULL:

INSERT INTO TABLE_2
(id, name)
SELECT t1.id,
t1.name
FROM TABLE_1 t1
LEFT JOIN TABLE_2 t2 ON t2.id = t1.id
WHERE t2.id IS NULL

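Of the three options, LEFT JOIN/IS NULL is the least efficient. Also be aware that NOT IN inserts nothing at all if the subquery returns a NULL, so NOT EXISTS is the safer choice when TABLE_2.id is nullable.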
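To see how the three forms behave on the same data, here is a minimal sketch in Python that runs them against an in-memory SQLite database. SQLite is only a stand-in for SQL Server here (an assumption made so the example can run anywhere), and the helper name copy_missing_rows and the sample rows are hypothetical. It also exercises the NULL case, where NOT IN behaves differently from the other two.

```python
import sqlite3

# The three anti-join inserts from above, written for SQLite (a stand-in for SQL Server).
ANTI_JOINS = {
    "not_exists": """
        INSERT INTO table_2 (id, name)
        SELECT t1.id, t1.name FROM table_1 t1
        WHERE NOT EXISTS (SELECT 1 FROM table_2 t2 WHERE t2.id = t1.id)""",
    "not_in": """
        INSERT INTO table_2 (id, name)
        SELECT t1.id, t1.name FROM table_1 t1
        WHERE t1.id NOT IN (SELECT id FROM table_2)""",
    "left_join": """
        INSERT INTO table_2 (id, name)
        SELECT t1.id, t1.name FROM table_1 t1
        LEFT JOIN table_2 t2 ON t2.id = t1.id
        WHERE t2.id IS NULL""",
}


def copy_missing_rows(variant, target_ids):
    """Seed table_2 with target_ids, run one anti-join insert,
    and return the non-NULL ids found in table_2 afterwards."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table_1 (id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE table_2 (id INTEGER, name TEXT)")
    # Hypothetical sample data: three source rows.
    conn.executemany("INSERT INTO table_1 VALUES (?, ?)",
                     [(1, "a"), (2, "b"), (3, "c")])
    conn.executemany("INSERT INTO table_2 VALUES (?, 'old')",
                     [(i,) for i in target_ids])
    conn.execute(ANTI_JOINS[variant])
    rows = conn.execute(
        "SELECT id FROM table_2 WHERE id IS NOT NULL ORDER BY id").fetchall()
    conn.close()
    return [r[0] for r in rows]
```

Calling copy_missing_rows("not_in", [2, None]) leaves table_2 with only id 2, because every NOT IN comparison against a NULL evaluates to unknown, while the NOT EXISTS and LEFT JOIN variants still copy ids 1 and 3.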

Avoid duplicates on INSERT INTO SELECT query in SQL Server

The problem is that select distinct is not sufficient. You still have duplicates in the underlying table, but with different names or descriptions.

I view this as a problem, but you can work around it by selecting one arbitrary row per cust_code using row_number():

insert into dbo.Entities (EntityId, [Name], [Description], [Type], Source)
select CUST_CODE, NAME, FULLDESCRIPTION, 'Agency' AS [Type], 'SunDbAgencies' AS Source
from (select a.*,
row_number() over (partition by cust_code order by cust_code) as seqnum
from dbo.VW_SUNDB_AGENCIES a
) a
where seqnum = 1 and
not exists (select 1 from dbo.Entities E where A.CUST_CODE = E.EntityId);
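The same pattern can be tried out locally with a small Python sketch against in-memory SQLite. This is a stand-in for SQL Server, not the original environment: it assumes SQLite 3.25 or later (for window functions), uses simplified hypothetical tables agencies and entities, and orders by name inside the partition so the surviving row is predictable rather than arbitrary.

```python
import sqlite3


def load_one_row_per_code(source_rows, existing_ids=()):
    """Insert one row per cust_code into entities, skipping codes already present.
    Returns the final contents of entities as (entity_id, name) tuples."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical, simplified versions of the view and the target table.
    conn.execute("CREATE TABLE agencies (cust_code TEXT, name TEXT)")
    conn.execute("CREATE TABLE entities (entity_id TEXT PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO agencies VALUES (?, ?)", source_rows)
    conn.executemany("INSERT INTO entities VALUES (?, 'existing')",
                     [(i,) for i in existing_ids])
    conn.execute("""
        INSERT INTO entities (entity_id, name)
        SELECT cust_code, name
        FROM (SELECT a.*,
                     ROW_NUMBER() OVER (PARTITION BY cust_code
                                        ORDER BY name) AS seqnum
              FROM agencies a) ranked
        WHERE seqnum = 1
          AND NOT EXISTS (SELECT 1 FROM entities e
                          WHERE e.entity_id = ranked.cust_code)""")
    rows = conn.execute(
        "SELECT entity_id, name FROM entities ORDER BY entity_id").fetchall()
    conn.close()
    return rows
```

With two source rows for the same code and different names, only the first by name is inserted, and a code that already exists in entities is left untouched.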

How to avoid Duplicate values for INSERT in SQL?

Before inserting, check whether a record with the same values already exists:

if not exists (select * from Delegates d where d.FromYr = @FromYr and d.MemNo = @MemNo)
INSERT INTO Delegates ([MemNo],[FromYr],[ToYr]) values(@MemNo, @FromYr,@ToYr)
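For anyone calling this from application code, the check-then-insert idea might look roughly like the following Python sketch. It uses in-memory SQLite as a stand-in for SQL Server, and the function names are hypothetical. Note that this is the simple single-connection version; without a unique constraint or explicit locking it is not protected against two writers running the check at the same moment.

```python
import sqlite3


def make_delegates_db():
    """Create an in-memory stand-in for the Delegates table (assumed column types)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Delegates (MemNo INTEGER, FromYr INTEGER, ToYr INTEGER)")
    return conn


def insert_delegate_if_absent(conn, mem_no, from_yr, to_yr):
    """Insert the row unless one with the same MemNo and FromYr already exists.
    Returns True if a row was inserted, False if it was skipped."""
    with conn:  # commits on success, rolls back if an exception is raised
        exists = conn.execute(
            "SELECT 1 FROM Delegates WHERE FromYr = ? AND MemNo = ?",
            (from_yr, mem_no),
        ).fetchone()
        if exists:
            return False
        conn.execute(
            "INSERT INTO Delegates (MemNo, FromYr, ToYr) VALUES (?, ?, ?)",
            (mem_no, from_yr, to_yr),
        )
        return True
```

The second call with the same MemNo and FromYr returns False and leaves the table unchanged, even if ToYr differs.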

Avoid duplicates in INSERT INTO in the same table

If the version of SQLite you use is 3.24.0+ and there is a unique constraint for the column city, you can use upsert which gives you an option to do NOTHING or UPDATE the table if a unique constraint violation occurs.

In this case:

String sql = 
"INSERT INTO weather VALUES(?,?,?,?,?,?,?) " +
"ON CONFLICT DO NOTHING";

If you try to insert a row with an existing city, the insert is silently skipped: no row is added and no error is raised.

But if the new row contains up-to-date data for the other columns and you want the existing row updated, you can do this:

String sql = 
"INSERT INTO weather VALUES(?,?,?,?,?,?,?) " +
"ON CONFLICT(city) DO UPDATE SET " +
"temp = excluded.temp, " +
"feels_like = excluded.feels_like, " +
"temp_min = excluded.temp_min, " +
"temp_max = excluded.temp_max, " +
"pressure = excluded.pressure, " +
"humidity = excluded.humidity";

and the other 6 columns will be overwritten by the new values you supplied.
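Both upsert statements can be checked quickly outside Java. The sketch below uses Python's built-in sqlite3 module and assumes the bundled SQLite is 3.24.0 or later. The column order of the weather table (city first, then the six measurements) and the sample values are assumptions, as are the helper names.

```python
import sqlite3

INSERT_OR_SKIP = (
    "INSERT INTO weather VALUES (?,?,?,?,?,?,?) "
    "ON CONFLICT DO NOTHING"
)

INSERT_OR_UPDATE = (
    "INSERT INTO weather VALUES (?,?,?,?,?,?,?) "
    "ON CONFLICT(city) DO UPDATE SET "
    "temp = excluded.temp, "
    "feels_like = excluded.feels_like, "
    "temp_min = excluded.temp_min, "
    "temp_max = excluded.temp_max, "
    "pressure = excluded.pressure, "
    "humidity = excluded.humidity"
)


def make_weather_db():
    """In-memory weather table with a unique constraint on city (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE weather (
        city TEXT UNIQUE, temp REAL, feels_like REAL, temp_min REAL,
        temp_max REAL, pressure INTEGER, humidity INTEGER)""")
    return conn


def save(conn, sql, row):
    """Run one of the upsert statements and return the table contents."""
    with conn:  # commit on success
        conn.execute(sql, row)
    return conn.execute("SELECT * FROM weather ORDER BY city").fetchall()
```

Saving a second row for the same city with INSERT_OR_SKIP leaves the stored row as it was, while INSERT_OR_UPDATE replaces the six measurement columns with the new values.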

If there isn't a unique constraint defined for city and you don't want to or can't define one, then you can avoid inserting the same city twice with NOT EXISTS like this:

String sql = 
"INSERT INTO weather SELECT ?,?,?,?,?,?,? " +
"WHERE NOT EXISTS (SELECT 1 FROM weather WHERE city = ?)";

In this case, your Java code has to bind the city value a second time, as an additional 8th parameter.
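The parameter binding is the part that is easy to get wrong, so here is a small Python sketch of the same statement. It assumes the city is the first of the seven values and that the table has no unique constraint; the helper names and sample data are hypothetical.

```python
import sqlite3

INSERT_IF_NEW_CITY = (
    "INSERT INTO weather SELECT ?,?,?,?,?,?,? "
    "WHERE NOT EXISTS (SELECT 1 FROM weather WHERE city = ?)"
)


def make_weather_db_without_constraint():
    """In-memory weather table with no unique constraint on city (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE weather (
        city TEXT, temp REAL, feels_like REAL, temp_min REAL,
        temp_max REAL, pressure INTEGER, humidity INTEGER)""")
    return conn


def insert_if_new_city(conn, row):
    """Insert the 7-value row unless its city is already present.
    Returns True if a row was inserted."""
    # row[0] is assumed to be the city; it is bound again as the 8th parameter.
    params = (*row, row[0])
    with conn:
        cursor = conn.execute(INSERT_IF_NEW_CITY, params)
    return cursor.rowcount == 1
```

The first call for a city returns True; a later call for the same city returns False and adds nothing.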

SQL INSERT but avoid duplicates

EDIT: to prevent race conditions in concurrent environments, use WITH (UPDLOCK) in the correlated subquery or EXCEPT'd SELECT. The test script I wrote below doesn't require it, since it uses temporary tables that are only visible to the current connection, but in a real environment, operating against user tables, it would be necessary.

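MERGE takes its own update locks, so it doesn't require UPDLOCK, but it can still race with concurrent inserts unless you add HOLDLOCK.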


Inspired by mcl's answer re: unique index & let the database throw an error, I decided to benchmark conditional inserts vs. try/catch.

The results appear to support the conditional insert over try/catch, but YMMV. It's a very simple scenario (one column, small table, etc), executed on one machine, etc.

Here are the results (SQL Server 2008, build 10.0.1600.2):

duplicates (short table)    
try/catch: 14440 milliseconds / 100000 inserts
conditional insert: 2983 milliseconds / 100000 inserts
except: 2966 milliseconds / 100000 inserts
merge: 2983 milliseconds / 100000 inserts

uniques
try/catch: 3920 milliseconds / 100000 inserts
conditional insert: 3860 milliseconds / 100000 inserts
except: 3873 milliseconds / 100000 inserts
merge: 3890 milliseconds / 100000 inserts

straight insert: 3173 milliseconds / 100000 inserts

duplicates (tall table)
try/catch: 14436 milliseconds / 100000 inserts
conditional insert: 3063 milliseconds / 100000 inserts
except: 3063 milliseconds / 100000 inserts
merge: 3030 milliseconds / 100000 inserts

Notice, even on unique inserts, there's slightly more overhead to try/catch than a conditional insert. I wonder if this varies by version, CPU, number of cores, etc.
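If you want to repeat this kind of comparison on your own setup, the measurement loop is easy to reproduce. The following Python sketch times the duplicate-insert case for two of the strategies against in-memory SQLite. It is not the script used for the numbers above: the database engine, the driver overhead, and the table name bench are all different, so its timings are not comparable with the SQL Server results.

```python
import sqlite3
import time


def time_duplicate_inserts(strategy, attempts=10000):
    """Repeatedly try to insert an existing key and return
    (elapsed milliseconds, final row count)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bench (col1 INTEGER PRIMARY KEY)")
    conn.execute("INSERT INTO bench VALUES (1)")
    start = time.perf_counter()
    for _ in range(attempts):
        if strategy == "try_catch":
            # Let the primary key reject the row and swallow the error.
            try:
                conn.execute("INSERT INTO bench VALUES (?)", (1,))
            except sqlite3.IntegrityError:
                pass
        elif strategy == "conditional":
            # Only insert when the key is not already there.
            conn.execute(
                "INSERT INTO bench SELECT ? "
                "WHERE NOT EXISTS (SELECT 1 FROM bench WHERE col1 = ?)",
                (1, 1),
            )
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    elapsed_ms = (time.perf_counter() - start) * 1000
    row_count = conn.execute("SELECT COUNT(*) FROM bench").fetchone()[0]
    conn.close()
    return elapsed_ms, row_count
```

Either strategy should leave exactly one row in the table; only the elapsed time differs, and which one wins will depend on the engine and driver.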

I did not benchmark the IF conditional inserts, just WHERE. I assume the IF variety would show more overhead, since a) you would have two statements, and b) you would need to wrap the two statements in a transaction and set the isolation level to serializable (!). If someone wanted to test this, they would need to change the temp table to a regular user table (serializable doesn't apply to local temp tables).

Here is the script:

-- tested on SQL 2008.
-- to run on SQL 2005, comment out the statements using MERGE
set nocount on

if object_id('tempdb..#temp') is not null drop table #temp
create table #temp (col1 int primary key)
go

-------------------------------------------------------

-- duplicate insert test against a table w/ 1 record

-------------------------------------------------------

insert #temp values (1)
go

declare @x int, @y int, @now datetime, @duration int
select @x = 1, @y = 0, @now = getdate()
while @y < 100000 begin
set @y = @y+1
begin try
insert #temp select @x
end try
begin catch end catch
end
set @duration = datediff(ms,@now,getdate())
raiserror('duplicates (short table), try/catch: %i milliseconds / %i inserts',-1,-1,@duration,@y) with nowait
go

declare @x int, @y int, @now datetime, @duration int
select @x = 1, @y = 0, @now = getdate()
while @y < 100000 begin
set @y = @y+1
insert #temp select @x where not exists (select * from #temp where col1 = @x)
end
set @duration = datediff(ms,@now,getdate())
raiserror('duplicates (short table), conditional insert: %i milliseconds / %i inserts',-1,-1,@duration, @y) with nowait
go

declare @x int, @y int, @now datetime, @duration int
select @x = 1, @y = 0, @now = getdate()
while @y < 100000 begin
set @y = @y+1
insert #temp select @x except select col1 from #temp
end
set @duration = datediff(ms,@now,getdate())
raiserror('duplicates (short table), except: %i milliseconds / %i inserts',-1,-1,@duration, @y) with nowait
go

-- comment this batch out for SQL 2005
declare @x int, @y int, @now datetime, @duration int
select @x = 1, @y = 0, @now = getdate()
while @y < 100000 begin
set @y = @y+1
merge #temp t using (select @x) s (col1) on t.col1 = s.col1 when not matched by target then insert values (col1);
end
set @duration = datediff(ms,@now,getdate())
raiserror('duplicates (short table), merge: %i milliseconds / %i inserts',-1,-1,@duration, @y) with nowait
go

-------------------------------------------------------

-- unique insert test against an initially empty table

-------------------------------------------------------

truncate table #temp
declare @x int, @now datetime, @duration int
select @x = 0, @now = getdate()
while @x < 100000 begin
set @x = @x+1
insert #temp select @x
end
set @duration = datediff(ms,@now,getdate())
raiserror('uniques, straight insert: %i milliseconds / %i inserts',-1,-1,@duration, @x) with nowait
go

truncate table #temp
declare @x int, @now datetime, @duration int
select @x = 0, @now = getdate()
while @x < 100000 begin
set @x = @x+1
begin try
insert #temp select @x
end try
begin catch end catch
end
set @duration = datediff(ms,@now,getdate())
raiserror('uniques, try/catch: %i milliseconds / %i inserts',-1,-1,@duration, @x) with nowait
go

truncate table #temp
declare @x int, @now datetime, @duration int
select @x = 0, @now = getdate()
while @x < 100000 begin
set @x = @x+1
insert #temp select @x where not exists (select * from #temp where col1 = @x)
end
set @duration = datediff(ms,@now,getdate())
raiserror('uniques, conditional insert: %i milliseconds / %i inserts',-1,-1,@duration, @x) with nowait
go

truncate table #temp
declare @x int, @now datetime, @duration int
select @x = 0, @now = getdate()
while @x < 100000 begin
set @x = @x+1
insert #temp select @x except select col1 from #temp
end
set @duration = datediff(ms,@now,getdate())
raiserror('uniques, except: %i milliseconds / %i inserts',-1,-1,@duration, @x) with nowait
go

-- comment this batch out for SQL 2005
truncate table #temp
declare @x int, @now datetime, @duration int
select @x = 0, @now = getdate()
while @x < 100000 begin
set @x = @x+1
merge #temp t using (select @x) s (col1) on t.col1 = s.col1 when not matched by target then insert values (col1);
end
set @duration = datediff(ms,@now,getdate())
raiserror('uniques, merge: %i milliseconds / %i inserts',-1,-1,@duration, @x) with nowait
go

-------------------------------------------------------

-- duplicate insert test against a table w/ 100000 records

-------------------------------------------------------

declare @x int, @y int, @now datetime, @duration int
select @x = 1, @y = 0, @now = getdate()
while @y < 100000 begin
set @y = @y+1
begin try
insert #temp select @x
end try
begin catch end catch
end
set @duration = datediff(ms,@now,getdate())
raiserror('duplicates (tall table), try/catch: %i milliseconds / %i inserts',-1,-1,@duration,@y) with nowait
go

declare @x int, @y int, @now datetime, @duration int
select @x = 1, @y = 0, @now = getdate()
while @y < 100000 begin
set @y = @y+1
insert #temp select @x where not exists (select * from #temp where col1 = @x)
end
set @duration = datediff(ms,@now,getdate())
raiserror('duplicates (tall table), conditional insert: %i milliseconds / %i inserts',-1,-1,@duration, @y) with nowait
go

declare @x int, @y int, @now datetime, @duration int
select @x = 1, @y = 0, @now = getdate()
while @y < 100000 begin
set @y = @y+1
insert #temp select @x except select col1 from #temp
end
set @duration = datediff(ms,@now,getdate())
raiserror('duplicates (tall table), except: %i milliseconds / %i inserts',-1,-1,@duration, @y) with nowait
go

-- comment this batch out for SQL 2005
declare @x int, @y int, @now datetime, @duration int
select @x = 1, @y = 0, @now = getdate()
while @y < 100000 begin
set @y = @y+1
merge #temp t using (select @x) s (col1) on t.col1 = s.col1 when not matched by target then insert values (col1);
end
set @duration = datediff(ms,@now,getdate())
raiserror('duplicates (tall table), merge: %i milliseconds / %i inserts',-1,-1,@duration, @y) with nowait
go

INSERT INTO SELECT with LEFT JOIN not preventing duplicates for simultaneous hits

The corrective action here depends on the behavior you want. If you intend to allow just a single horizontal instance of your application to execute this query, then you need to create a critical section which only one instance is allowed to enter. Since you are already using SQL Server, you could implement this by forcing each instance to get a lock on a certain table. Only the instance which gets the lock will execute the query, and the others will drop off.

If, on the other hand, you really want each instance to execute the query, then you should use a serializable transaction. Using a serializable transaction will ensure that only one instance can do the insert on the table at a given time. It would not be possible for two or more instances to interleave and execute the same insert.
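The first option (one instance wins the lock, the others drop off) can be sketched in a few lines of Python. This uses a file-based SQLite database and its BEGIN IMMEDIATE write lock purely as a stand-in; SQL Server has its own locking mechanisms, which are not shown here. The jobs table and all function names are hypothetical.

```python
import os
import sqlite3
import tempfile

INSERT_ONCE = (
    "INSERT INTO jobs (id) SELECT 1 "
    "WHERE NOT EXISTS (SELECT 1 FROM jobs WHERE id = 1)"
)


def open_instance(path):
    # isolation_level=None: transactions are managed explicitly below.
    # timeout=0: give up immediately instead of waiting for the lock.
    return sqlite3.connect(path, timeout=0, isolation_level=None)


def try_enter_critical_section(conn):
    """Try to take the database write lock; True if this connection got it."""
    try:
        conn.execute("BEGIN IMMEDIATE")
        return True
    except sqlite3.OperationalError:
        # Coarse check: in this sketch the only expected cause is a held lock.
        return False


def run_two_instances():
    """Simulate two application instances competing for the same insert."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "app.db")
        first, second = open_instance(path), open_instance(path)
        first.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY)")
        got_first = try_enter_critical_section(first)
        got_second = try_enter_critical_section(second)  # lock is taken
        if got_first:
            first.execute(INSERT_ONCE)
            first.execute("COMMIT")
        # Once the first instance has committed, the lock is free again.
        got_second_later = try_enter_critical_section(second)
        if got_second_later:
            second.execute(INSERT_ONCE)  # no-op: the row already exists
            second.execute("COMMIT")
        count = first.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
        first.close()
        second.close()
    return got_first, got_second, got_second_later, count
```

While the first connection holds the lock the second one is turned away, and when it does get in later the NOT EXISTS guard sees the committed row, so the table ends up with a single row.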


