IF EXISTS before INSERT, UPDATE, DELETE for optimization
I'm not completely sure, but I get the impression that this question is really about upsert, which is the following atomic operation:
- If the row exists in both the source and target, UPDATE the target;
- If the row only exists in the source, INSERT the row into the target;
- (Optionally) If the row exists in the target but not the source, DELETE the row from the target.
Developers-turned-DBAs often naïvely write it row-by-row, like this:
-- For each row in source
IF EXISTS(<target_expression>)
IF @delete_flag = 1
DELETE <target_expression>
ELSE
UPDATE target
SET <target_columns> = <source_values>
WHERE <target_expression>
ELSE
INSERT target (<target_columns>)
VALUES (<source_values>)
This is just about the worst thing you can do, for several reasons:
- It has a race condition. The row can disappear between IF EXISTS and the subsequent DELETE or UPDATE.
- It's wasteful. For every transaction you have an extra operation being performed; maybe it's trivial, but that depends entirely on how well you've indexed.
- Worst of all, it follows an iterative model, thinking about these problems at the level of a single row. This has the largest (worst) impact of all on overall performance.
One very minor (and I emphasize minor) optimization is to just attempt the UPDATE anyway; if the row doesn't exist, @@ROWCOUNT will be 0 and you can then "safely" insert:
-- For each row in source
BEGIN TRAN
UPDATE target
SET <target_columns> = <source_values>
WHERE <target_expression>
IF (@@ROWCOUNT = 0)
INSERT target (<target_columns>)
VALUES (<source_values>)
COMMIT
Worst-case, this will still perform two operations for every transaction, but at least there's a chance of only performing one, and it also eliminates the race condition (kind of).
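To actually close the remaining race window in SQL Server, the UPDATE can take an update/range lock that is held until the transaction commits, so no other session can sneak a row in or out between the two statements. A sketch of that pattern (the table and column placeholders are illustrative):

```sql
-- Hold an update + key-range lock so no concurrent session can
-- insert or delete the row between the UPDATE and the INSERT.
BEGIN TRAN;

UPDATE target WITH (UPDLOCK, HOLDLOCK)
SET    col1 = @val1
WHERE  id = @id;

IF (@@ROWCOUNT = 0)
    INSERT target (id, col1)
    VALUES (@id, @val1);

COMMIT;
```

HOLDLOCK gives SERIALIZABLE semantics for just this statement, which is what protects the not-yet-existing key.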
But the real issue is that this is still being done for each row in the source.
Before SQL Server 2008, you had to use an awkward 3-stage model to deal with this at the set level (still better than row-by-row):
BEGIN TRAN
INSERT target (<target_columns>)
SELECT <source_columns> FROM source s
WHERE s.id NOT IN (SELECT id FROM target)
UPDATE t SET <target_columns> = <source_columns>
FROM target t
INNER JOIN source s ON t.id = s.id
DELETE t
FROM target t
WHERE t.id NOT IN (SELECT id FROM source)
COMMIT
As I said, performance was pretty lousy on this, but still a lot better than the one-row-at-a-time approach. SQL Server 2008, however, finally introduced MERGE syntax, so now all you have to do is this:
MERGE target
USING source ON target.id = source.id
WHEN MATCHED THEN UPDATE SET <target_columns> = <source_columns>
WHEN NOT MATCHED THEN INSERT (<target_columns>) VALUES (<source_columns>)
WHEN NOT MATCHED BY SOURCE THEN DELETE;
That's it. One statement. If you're using SQL Server 2008 and need to perform any sequence of INSERT, UPDATE and DELETE depending on whether or not the row already exists - even if it's just one row - there is no excuse not to be using MERGE.
You can even OUTPUT the rows affected by a MERGE into a table variable if you need to find out afterward what was done. Simple, fast, and risk-free. Do it.
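A sketch of what that OUTPUT clause can look like (the table variable and column names are illustrative):

```sql
-- Capture what MERGE actually did ($action is 'INSERT',
-- 'UPDATE' or 'DELETE' for each affected row).
DECLARE @changes TABLE (action nvarchar(10), id int);

MERGE target
USING source ON target.id = source.id
WHEN MATCHED THEN UPDATE SET target.col1 = source.col1
WHEN NOT MATCHED THEN INSERT (id, col1) VALUES (source.id, source.col1)
WHEN NOT MATCHED BY SOURCE THEN DELETE
OUTPUT $action, COALESCE(inserted.id, deleted.id) INTO @changes;

-- Summarize afterward
SELECT action, COUNT(*) AS rows_affected
FROM   @changes
GROUP  BY action;
```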
Optimize INSERT / UPDATE / DELETE operation
Modified table definition
If you really need those columns to be NOT NULL and you really need the string 'default' as default for engine_slug, I would advise introducing column defaults:
Column          | Type                    | Modifiers
----------------+-------------------------+---------------------------
id              | integer                 | NOT NULL DEFAULT ...
engine_slug     | character varying(200)  | NOT NULL DEFAULT 'default'
content_type_id | integer                 | NOT NULL
object_id       | text                    | NOT NULL
object_id_int   | integer                 |
title           | character varying(1000) | NOT NULL
description     | text                    | NOT NULL DEFAULT ''
content         | text                    | NOT NULL
url             | character varying(1000) | NOT NULL DEFAULT ''
meta_encoded    | text                    | NOT NULL DEFAULT '{}'
search_tsv      | tsvector                | NOT NULL
...
The DDL statement would be:
ALTER TABLE watson_searchentry ALTER COLUMN engine_slug SET DEFAULT 'default';
Etc.
Then you don't have to insert those values manually every time.
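With those defaults in place, an INSERT can simply omit the defaulted columns; a sketch against the table above (the literal values are illustrative):

```sql
-- engine_slug, description, url and meta_encoded all fall back
-- to their column defaults and need not be listed.
INSERT INTO watson_searchentry
       (content_type_id, object_id, object_id_int, title, content, search_tsv)
VALUES (1, '42', 42, 'Some title', 'some content',
        to_tsvector('english', 'some content'));
```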
Also: object_id text NOT NULL, object_id_int integer? That's odd. I guess you have your reasons ...
I'll go with your updated requirement:

The main point is to update columns title and content in watson_searchentry
Of course, you must add a UNIQUE constraint to enforce your requirements:
ALTER TABLE watson_searchentry
ADD CONSTRAINT ws_uni UNIQUE (content_type_id, object_id_int)
The accompanying index will be used, by this query for starters.
BTW, I almost never use varchar(n) in Postgres, just text: in Postgres, varchar(n) buys you nothing over text except an extra length check.
Query with data-modifying CTEs
This could be rewritten as a single SQL query with data-modifying common table expressions, also called "writeable" CTEs. Requires Postgres 9.1 or later.
Additionally, this query only deletes what has to be deleted, and updates what can be updated.
WITH ctyp AS (
SELECT id AS content_type_id
FROM django_content_type
WHERE app_label = 'web'
AND model = 'member'
)
, sel AS (
SELECT ctyp.content_type_id
,m.id AS object_id_int
,m.id::text AS object_id -- explicit cast!
,m.name AS title
,concat_ws(' ', u.email,m.normalized_name,c.name) AS content
-- other columns have column default now.
FROM web_user u
JOIN web_member m ON m.user_id = u.id
JOIN web_country c ON c.id = m.country_id
CROSS JOIN ctyp
WHERE u.is_active
)
, del AS ( -- only if you want to del all other entries of same type
DELETE FROM watson_searchentry w
USING ctyp
WHERE w.content_type_id = ctyp.content_type_id
AND NOT EXISTS (
SELECT 1
FROM sel
WHERE sel.object_id_int = w.object_id_int
)
)
, up AS ( -- update existing rows
UPDATE watson_searchentry w
SET object_id = s.object_id
,title = s.title
,content = s.content
FROM sel s
WHERE w.content_type_id = s.content_type_id
AND w.object_id_int = s.object_id_int
)
-- insert new rows
INSERT INTO watson_searchentry (
content_type_id, object_id_int, object_id, title, content)
SELECT sel.* -- safe to use, because col list is defined accordingly above
FROM sel
LEFT JOIN watson_searchentry w1 USING (content_type_id, object_id_int)
WHERE w1.content_type_id IS NULL;
- The subquery on django_content_type always returns a single value? Otherwise, the CROSS JOIN might cause trouble.
- The first CTE sel gathers the rows to be inserted. Note how I pick matching column names to simplify things.
- In the CTE del I avoid deleting rows that can be updated.
- In the CTE up those rows are updated instead.
- Accordingly, I avoid inserting rows that were not deleted before in the final INSERT.
Can easily be wrapped into an SQL or PL/pgSQL function for repeated use.
Not secure for heavy concurrent use. Much better than the function you had, but still not 100% robust against concurrent writes. But that's not an issue according to your updated info.
Replacing the UPDATEs with DELETE and INSERT may or may not be a lot more expensive. Internally, every UPDATE results in a new row version anyway, due to the MVCC model.
Speed first
If you don't really care about preserving old rows, your simpler approach may be faster: delete everything and insert new rows. Also, wrapping this into a plpgsql function saves a bit of planning overhead. This is basically your function, with a couple of minor simplifications and observing the defaults added above:
CREATE OR REPLACE FUNCTION update_member_search_index()
RETURNS VOID AS
$func$
DECLARE
_ctype_id int := (
SELECT id
FROM django_content_type
WHERE app_label='web'
AND model = 'member'
); -- you can assign at declaration time. saves another statement
BEGIN
DELETE FROM watson_searchentry
WHERE content_type_id = _ctype_id;
INSERT INTO watson_searchentry
(content_type_id, object_id, object_id_int, title, content)
SELECT _ctype_id, m.id::text, m.id, m.name
,u.email || ' ' || m.normalized_name || ' ' || c.name
FROM web_member m
JOIN web_user u ON u.id = m.user_id
JOIN web_country c ON c.id = m.country_id
WHERE u.is_active;
END
$func$ LANGUAGE plpgsql;
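Once the function above is created, a full refresh then reduces to a single call:

```sql
-- Rebuild all 'member' search entries in one transaction.
SELECT update_member_search_index();
```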
I even refrain from using concat_ws(): it is safe against NULL values and simplifies code, but is a bit slower than simple concatenation.
Also:

There is a trigger on the table that sets the value of column search_tsv based on these columns.

It would be faster to incorporate the logic into this function - if this is the only time the trigger is needed. Else, it's probably not worth the fuss.
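A sketch of what incorporating that trigger logic could look like: compute search_tsv directly in the INSERT, so the per-row trigger can be disabled for this bulk load (the column weights and the 'english' configuration are assumptions):

```sql
-- Compute the tsvector inline instead of firing a trigger per row.
INSERT INTO watson_searchentry
       (content_type_id, object_id, object_id_int, title, content, search_tsv)
SELECT _ctype_id
      ,m.id::text
      ,m.id
      ,m.name
      ,u.email || ' ' || m.normalized_name || ' ' || c.name
      ,setweight(to_tsvector('english', m.name), 'A') ||
       setweight(to_tsvector('english', u.email || ' ' || m.normalized_name
                                     || ' ' || c.name), 'B')
FROM   web_member  m
JOIN   web_user    u ON u.id = m.user_id
JOIN   web_country c ON c.id = m.country_id
WHERE  u.is_active;
```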
Which is the best choice in delete-insert vs if-update else-insert?
The if-exists-then-update-else-insert approach may be more code than delete-insert, but (depending on how many and what kind of indexes are defined on the tables) it is a lot less work for the server.
- A DELETE or INSERT operation requires every index to be modified, period.
- An UPDATE operation only requires those indexes to be modified whose fields were actually updated in this instance.
So unless you modify every indexed field of the record with your updates, the longer approach is the more efficient one.
EDIT: Your question update says that currently you do not have any indexes apart from the primary key (which I assume is a clustered key). So in comparison:
When the row already exists, it's 1 clustered index seek (assuming you find records by their PK) and:
- delete-insert: 4 write operations (delete row, delete PK index row, insert row, insert PK index row)
- check-update/insert: 1 write operation (update row)
When the row does not yet exist, it's 1 clustered index seek and:
- delete-insert: 2 write operations (insert row, insert PK index row)
- check-update/insert: 2 write operations (insert row, insert PK index row)
The more indexes exist, the worse will delete/insert perform for rows that already exist. Plus it can cause non-clustered indexes to become unnecessarily fragmented due to avoidable write operations.
PostgreSQL, MySQL - redundant update/insert/delete optimization
In your examples, there will be no optimizations done; the databases will behave exactly as instructed (INSERT first, then DELETE).
SQL Server and Oracle support the MERGE command, which combines INSERT, UPDATE and DELETE, but it is currently supported neither by PostgreSQL nor by MySQL.
MySQL also supports INSERT ... ON DUPLICATE KEY UPDATE, which can help in certain cases.
MySQL Update if exists, insert if not, delete if not exists (week schedule profile)
The simplest thing is just to delete all instances and then create a new schedule for the updated entries. It sounds like you are trying to do some premature optimisation of the process, which IMHO will not be necessary. If you find that deleting all existing entries and replacing them is a performance bottleneck, then you might want to consider an approach that minimises the additional inserts; but I think that would probably end up being more work than the simpler option of just removing everything and replacing it.
SQL - IF EXISTS UPDATE ELSE INSERT INTO
Create a UNIQUE constraint on your subs_email column, if one does not already exist:

ALTER TABLE subs ADD UNIQUE (subs_email)

Use INSERT ... ON DUPLICATE KEY UPDATE:

INSERT INTO subs
(subs_name, subs_email, subs_birthday)
VALUES
(?, ?, ?)
ON DUPLICATE KEY UPDATE
subs_name = VALUES(subs_name),
subs_birthday = VALUES(subs_birthday)
You can use the VALUES(col_name) function in the UPDATE clause to refer to column values from the INSERT portion of the INSERT ... ON DUPLICATE KEY UPDATE - dev.mysql.com
- Note that I have used parameter placeholders in the place of string literals, as one really should be using parameterised statements to defend against SQL injection attacks.
INSERT new row or DELETE old row IF it already exists
You can use a stored procedure to SELECT and then DELETE if the row exists, or INSERT if it does not. But it would be much easier to just leave the data there and update a seen/unseen flag with ON DUPLICATE KEY UPDATE.
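A minimal sketch of that flag-based alternative, assuming a hypothetical notifications table with a UNIQUE key on (user_id, item_id) and a seen column:

```sql
-- Toggle instead of delete: re-inserting an existing row just
-- resets the flag, thanks to the unique key on (user_id, item_id).
INSERT INTO notifications (user_id, item_id, seen)
VALUES (?, ?, 0)
ON DUPLICATE KEY UPDATE seen = 0;
```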
Is inserting a new database entry faster than checking if the entry exists first?
If the insert is going to fail because of an index violation, it will be at most marginally slower than a check that the record exists. (Both require checking whether the index contains the value.) If the insert is going to succeed, then issuing two queries is significantly slower than issuing one.
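In practice that means attempting the insert and handling the duplicate-key case, rather than checking first. In MySQL, a sketch could be (the table and its unique key are assumptions):

```sql
-- Relies on a UNIQUE key on email; a duplicate is silently
-- skipped instead of raising an error. Note INSERT IGNORE
-- also suppresses other insert errors, so use with care.
INSERT IGNORE INTO users (email, name)
VALUES ('a@example.com', 'Alice');
```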