If Exists Before Insert, Update, Delete for Optimization

IF EXISTS before INSERT, UPDATE, DELETE for optimization

I'm not completely sure, but I get the impression that this question is really about upsert, which is the following atomic operation:

  • If the row exists in both the source and target, UPDATE the target;
  • If the row only exists in the source, INSERT the row into the target;
  • (Optionally) If the row exists in the target but not the source, DELETE the row from the target.

Developers-turned-DBAs often naïvely write it row-by-row, like this:

-- For each row in source
IF EXISTS (SELECT 1 FROM target WHERE <target_expression>)
BEGIN
    IF @delete_flag = 1
        DELETE FROM target WHERE <target_expression>
    ELSE
        UPDATE target
        SET <target_columns> = <source_values>
        WHERE <target_expression>
END
ELSE
    INSERT target (<target_columns>)
    VALUES (<source_values>)

This is just about the worst thing you can do, for several reasons:

  • It has a race condition. The row can disappear between IF EXISTS and the subsequent DELETE or UPDATE.

  • It's wasteful. For every transaction you have an extra operation being performed; maybe it's trivial, but that depends entirely on how well you've indexed.

  • Worst of all - it's following an iterative model, thinking about these problems at the level of a single row. This will have the largest (worst) impact of all on overall performance.

One very minor (and I emphasize minor) optimization is to just attempt the UPDATE anyway; if the row doesn't exist, @@ROWCOUNT will be 0 and you can then "safely" insert:

-- For each row in source
BEGIN TRAN

UPDATE target
SET <target_columns> = <source_values>
WHERE <target_expression>

IF (@@ROWCOUNT = 0)
    INSERT target (<target_columns>)
    VALUES (<source_values>)

COMMIT

Worst-case, this will still perform two operations for every transaction, but at least there's a chance of only performing one, and it also eliminates the race condition (kind of).
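If you need to close that gap completely under concurrency, the usual refinement - not part of the original pattern above, so treat it as a hedged sketch using the same placeholder syntax - is to take an update/key-range lock up front, so no other session can slip in between the UPDATE and the INSERT:

-- For each row in source
BEGIN TRAN

UPDATE target WITH (UPDLOCK, HOLDLOCK)   -- HOLDLOCK keeps the (range) lock until COMMIT,
SET <target_columns> = <source_values>   -- so no other session can insert the same key in between
WHERE <target_expression>

IF (@@ROWCOUNT = 0)
    INSERT target (<target_columns>)
    VALUES (<source_values>)

COMMIT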

But the real issue is that this is still being done for each row in the source.

Before SQL Server 2008, you had to use an awkward 3-stage model to deal with this at the set level (still better than row-by-row):

BEGIN TRAN

INSERT target (<target_columns>)
SELECT <source_columns> FROM source s
WHERE s.id NOT IN (SELECT id FROM target)

UPDATE t SET <target_columns> = <source_columns>
FROM target t
INNER JOIN source s ON t.id = s.id

DELETE t
FROM target t
WHERE t.id NOT IN (SELECT id FROM source)

COMMIT

As I said, performance was pretty lousy on this, but still a lot better than the one-row-at-a-time approach. SQL Server 2008, however, finally introduced MERGE syntax, so now all you have to do is this:

MERGE target
USING source ON target.id = source.id
WHEN MATCHED THEN UPDATE SET <target_columns> = <source_columns>
WHEN NOT MATCHED THEN INSERT (<target_columns>) VALUES (<source_columns>)
WHEN NOT MATCHED BY SOURCE THEN DELETE;

That's it. One statement. If you're using SQL Server 2008 (or later) and need to perform any sequence of INSERT, UPDATE and DELETE depending on whether or not the row already exists - even if it's just one row - there is no excuse not to be using MERGE.

You can even OUTPUT the rows affected by a MERGE into a table variable if you need to find out afterward what was done. Simple, fast, and risk-free. Do it.
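For example, here is a minimal sketch of capturing what MERGE did - with generic target/source tables keyed on id and a single col1 payload column, not a real schema:

DECLARE @changes TABLE (change_type nvarchar(10), new_id int, old_id int);

MERGE target
USING source ON target.id = source.id
WHEN MATCHED THEN
    UPDATE SET col1 = source.col1
WHEN NOT MATCHED THEN
    INSERT (id, col1) VALUES (source.id, source.col1)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE
OUTPUT $action, inserted.id, deleted.id INTO @changes (change_type, new_id, old_id);

-- $action reports 'INSERT', 'UPDATE' or 'DELETE' for each affected row
SELECT change_type, new_id, old_id FROM @changes;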

Optimize INSERT / UPDATE / DELETE operation

Modified table definition

If you really need those columns to be NOT NULL and you really need the string 'default' as default for engine_slug, I would advise introducing column defaults:

     Column      |           Type           |         Modifiers
-----------------+--------------------------+-----------------------------
 id              | integer                  | not null default ...
 engine_slug     | character varying(200)   | not null default 'default'
 content_type_id | integer                  | not null
 object_id       | text                     | not null
 object_id_int   | integer                  |
 title           | character varying(1000)  | not null
 description     | text                     | not null default ''
 content         | text                     | not null
 url             | character varying(1000)  | not null default ''
 meta_encoded    | text                     | not null default '{}'
 search_tsv      | tsvector                 | not null
 ...

The DDL statement would be:

ALTER TABLE watson_searchentry ALTER COLUMN engine_slug SET DEFAULT 'default';

Etc.

Then you don't have to insert those values manually every time.
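A quick illustration with made-up values: only the columns without a default need to be listed, everything else falls back to its default (search_tsv is assumed to be filled by the trigger mentioned further down):

INSERT INTO watson_searchentry (content_type_id, object_id, object_id_int, title, content)
VALUES (12, '42', 42, 'Some member', 'member@example.com some member sweden');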

Also: object_id text NOT NULL, object_id_int INTEGER? That's odd. I guess you have your reasons ...

I'll go with your updated requirement:

The main point is to update columns title and content in watson_searchentry

Of course, you must add a UNIQUE constraint to enforce your requirements:

ALTER TABLE watson_searchentry
ADD CONSTRAINT ws_uni UNIQUE (content_type_id, object_id_int);

The index backing the constraint will also be used - by this query, for starters.

BTW, I almost never use varchar(n) in Postgres, just text: the length modifier buys no performance and only causes churn when the limit has to change.

Query with data-modifying CTEs

This could be rewritten as a single SQL query with data-modifying common table expressions, also called "writeable" CTEs. Requires Postgres 9.1 or later.

Additionally, this query only deletes what has to be deleted, and updates what can be updated.

WITH ctyp AS (
   SELECT id AS content_type_id
   FROM   django_content_type
   WHERE  app_label = 'web'
   AND    model = 'member'
   )
, sel AS (
   SELECT ctyp.content_type_id
        , m.id       AS object_id_int
        , m.id::text AS object_id      -- explicit cast!
        , m.name     AS title
        , concat_ws(' ', u.email, m.normalized_name, c.name) AS content
          -- other columns have column defaults now.
   FROM   web_user    u
   JOIN   web_member  m ON m.user_id = u.id
   JOIN   web_country c ON c.id = m.country_id
   CROSS  JOIN ctyp
   WHERE  u.is_active
   )
, del AS (  -- only if you want to delete all other entries of the same type
   DELETE FROM watson_searchentry w
   USING  ctyp
   WHERE  w.content_type_id = ctyp.content_type_id
   AND    NOT EXISTS (
      SELECT 1
      FROM   sel
      WHERE  sel.object_id_int = w.object_id_int
      )
   )
, up AS (   -- update existing rows
   UPDATE watson_searchentry w
   SET    object_id = s.object_id
        , title     = s.title
        , content   = s.content
   FROM   sel s
   WHERE  w.content_type_id = s.content_type_id
   AND    w.object_id_int   = s.object_id_int
   )
-- insert new rows
INSERT INTO watson_searchentry
       (content_type_id, object_id_int, object_id, title, content)
SELECT sel.*  -- safe to use, because the column list above is defined accordingly
FROM   sel
LEFT   JOIN watson_searchentry w1 USING (content_type_id, object_id_int)
WHERE  w1.content_type_id IS NULL;

  • The subquery on django_content_type always returns a single value? Otherwise, the CROSS JOIN might cause trouble.

  • The CTE sel gathers the rows to be inserted. Note how I pick matching column names to simplify things.

  • In the CTE del I avoid deleting rows that can be updated.

  • In the CTE up those rows are updated instead.

  • Accordingly, I avoid inserting rows that were not deleted before in the final INSERT.

Can easily be wrapped into an SQL or PL/pgSQL function for repeated use.

Not secure for heavy concurrent use. Much better than the function you had, but still not 100% robust against concurrent writes. But that's not an issue according to your updated info.

Replacing the UPDATEs with DELETE plus INSERT may or may not be a lot more expensive: internally, every UPDATE results in a new row version anyway, due to the MVCC model.

Speed first

If you don't really care about preserving old rows, your simpler approach may be faster: delete everything and insert new rows. Also, wrapping it into a plpgsql function saves a bit of planning overhead. Your function, basically, with a couple of minor simplifications and making use of the defaults added above:

CREATE OR REPLACE FUNCTION update_member_search_index()
  RETURNS void AS
$func$
DECLARE
   _ctype_id int := (
      SELECT id
      FROM   django_content_type
      WHERE  app_label = 'web'
      AND    model = 'member'
      );  -- you can assign at declaration time; saves another statement
BEGIN
   DELETE FROM watson_searchentry
   WHERE  content_type_id = _ctype_id;

   INSERT INTO watson_searchentry
          (content_type_id, object_id, object_id_int, title, content)
   SELECT _ctype_id
        , m.id::text    -- explicit cast to text for object_id
        , m.id
        , m.name
        , u.email || ' ' || m.normalized_name || ' ' || c.name
   FROM   web_member  m
   JOIN   web_user    u ON u.id = m.user_id
   JOIN   web_country c ON c.id = m.country_id
   WHERE  u.is_active;
END
$func$ LANGUAGE plpgsql;

Here I even refrain from using concat_ws(): it is safe against NULL values and simplifies the code, but it's a bit slower than simple concatenation.
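To illustrate the trade-off (a standalone example, not from the original answer):

SELECT concat_ws(' ', 'a', NULL, 'c');    -- 'a c'  (NULL arguments are simply skipped)
SELECT 'a' || ' ' || NULL || ' ' || 'c';  -- NULL   (one NULL nullifies the whole result)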

Also:

There is a trigger on the table that sets value of column search_tsv
based on these columns.

It would be faster to incorporate the logic into this function - if this is the only time the trigger is needed. Else, it's probably not worth the fuss.
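For instance - a hedged sketch only, since the actual trigger logic (text search configuration, weighting) is not shown in the question - the INSERT inside the function could fill search_tsv directly, which only pays off if the trigger can then be dropped or disabled for this table:

   INSERT INTO watson_searchentry
          (content_type_id, object_id, object_id_int, title, content, search_tsv)
   SELECT _ctype_id
        , m.id::text
        , m.id
        , m.name
        , u.email || ' ' || m.normalized_name || ' ' || c.name
        , to_tsvector('english'   -- assumption: use whatever regconfig the trigger uses
                    , m.name || ' ' || u.email || ' ' || m.normalized_name || ' ' || c.name)
   FROM   web_member  m
   JOIN   web_user    u ON u.id = m.user_id
   JOIN   web_country c ON c.id = m.country_id
   WHERE  u.is_active;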

Which is the best choice in delete-insert vs if-update else-insert?

The if-exists-then-update-else-insert approach may be more code than delete-insert, but (depending on how many and what kind of indexes are defined on the tables) it is a lot less work for the server.

  • A DELETE or INSERT operation requires every index to be modified, period.
  • An UPDATE operation only requires those indexes to be modified whose columns were actually changed by that particular update.

So unless you modify every indexed field of the record with your updates, the longer approach is the more efficient one.


EDIT: Your question update says that currently you do not have any indexes apart from the primary key (which I assume is a clustered key). So in comparison:

When the row already exists, it's 1 clustered index seek (assuming you find records by their PK) and:

  • delete-insert: 4 write operations (delete row, delete PK index row, insert row, insert PK index row)
  • check-update/insert: 1 write operation (update row)

When the row does not yet exist, it's 1 clustered index seek and:

  • delete-insert: 2 write operations (insert row, insert PK index row)
  • check-update/insert: 2 write operations (insert row, insert PK index row)

The more indexes exist, the worse delete/insert will perform for rows that already exist. Plus it can cause non-clustered indexes to become unnecessarily fragmented due to avoidable write operations.

PostgreSQL, MySQL - redundant update/insert/delete optimization

In your examples, no optimization will be done; the databases will behave exactly as instructed (first the INSERT, then the DELETE).

SQL Server and Oracle support the MERGE command, which combines INSERT, UPDATE and DELETE; at the time of writing it was supported by neither PostgreSQL nor MySQL (PostgreSQL has since gained INSERT ... ON CONFLICT in 9.5 and MERGE in 15).

MySQL also supports INSERT … ON DUPLICATE KEY UPDATE which can help in certain cases.
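For completeness, the PostgreSQL equivalent that arrived later (9.5+), shown against a hypothetical table with a unique key on id:

INSERT INTO target (id, col1)
VALUES (1, 'value')
ON CONFLICT (id) DO UPDATE
SET col1 = EXCLUDED.col1;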

MySQL Update if exists, insert if not, delete if not exists (week schedule profile)

The simplest thing is just to delete all instances and then create a new schedule from the updated entries. It sounds like you are trying to do some premature optimisation of the process, which IMHO will not be necessary. If you find that deleting all existing entries and replacing them is a performance bottleneck, then you might want to consider an approach that tries to minimise the additional inserts, but that would probably end up being more work than the simpler option of just removing everything and replacing it.

SQL - IF EXISTS UPDATE ELSE INSERT INTO

  1. Create a UNIQUE constraint on your subs_email column, if one does not already exist:

    ALTER TABLE subs ADD UNIQUE (subs_email)
  2. Use INSERT ... ON DUPLICATE KEY UPDATE:

    INSERT INTO subs
    (subs_name, subs_email, subs_birthday)
    VALUES
    (?, ?, ?)
    ON DUPLICATE KEY UPDATE
    subs_name = VALUES(subs_name),
    subs_birthday = VALUES(subs_birthday)

You can use the VALUES(col_name) function in the UPDATE clause to refer to column values from the INSERT portion of the INSERT ... ON DUPLICATE KEY UPDATE statement. - dev.mysql.com


Note that I have used parameter placeholders in the place of string literals, as one really should be using parameterised statements to defend against SQL injection attacks.

INSERT new row or DELETE old row IF it already exists

You can use a stored procedure to SELECT, then DELETE the row if it exists or INSERT it if it does not. But it would be much easier to just leave the data there and update a seen/unseen flag with ON DUPLICATE KEY UPDATE.
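A minimal sketch of that flag approach - table and column names are made up, and it assumes a UNIQUE key on (user_id, item_id):

INSERT INTO notifications (user_id, item_id, seen)
VALUES (?, ?, 1)
ON DUPLICATE KEY UPDATE seen = VALUES(seen);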

Is inserting a new database entry faster than checking if the entry exists first?

If the insert is going to fail because of an index violation, it will be at most marginally slower than a check that the record exists. (Both require checking whether the index contains the value.) If the insert is going to succeed, then issuing two queries is significantly slower than issuing one.


