How to Mark Certain Nr of Rows in Table on Concurrent Access

In the related answer you are referring to:

  • Postgres UPDATE ... LIMIT 1

The objective is to lock one row at a time. This works fine with or without advisory locks, because there is no chance of a deadlock, as long as you don't try to lock more than one row in the same transaction.
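
For reference, a minimal sketch of that single-row pattern, reusing the column names from the query further down (not copied verbatim from the linked answer):

UPDATE cargo_item
SET    job_id = 'SOME_UUID', job_ts = now()
WHERE  id = (
   SELECT id
   FROM   cargo_item
   WHERE  state = 'NEW' AND job_id IS NULL
   LIMIT  1
   FOR    UPDATE
   );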

Your example is different in that you want to lock 3000 rows at a time. There is potential for deadlock unless all concurrent write operations lock rows in the same consistent order. Per the documentation:

The best defense against deadlocks is generally to avoid them by being
certain that all applications using a database acquire locks on
multiple objects in a consistent order.

Implement that with an ORDER BY in your subquery.

UPDATE cargo_item item
SET    job_id = 'SOME_UUID', job_ts = now()
FROM  (
   SELECT id
   FROM   cargo_item
   WHERE  state = 'NEW' AND job_id IS NULL
   ORDER  BY id
   LIMIT  3000
   FOR    UPDATE
   ) sub
WHERE  item.id = sub.id;

This is safe and reliable, as long as all transactions acquire locks in the same order and concurrent updates of the ordering columns are not to be expected. (Read the yellow "CAUTION" box at the end of this chapter in the manual.) So this should be safe in your case, since you are not going to update the id column.

Effectively only one client at a time can manipulate rows this way. Concurrent transactions would try to lock the same (locked) rows and wait for the first transaction to finish.

Advisory locks are useful if you have many or very long running concurrent transactions (it doesn't seem you do). With only a few, it will be cheaper overall to just use the above query and have concurrent transactions wait for their turn.
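
If you did want them, here is a sketch of the same query using transaction-level advisory locks (pg_try_advisory_xact_lock() is one possible choice of locking function); concurrent transactions then skip rows that are already claimed instead of waiting:

UPDATE cargo_item item
SET    job_id = 'SOME_UUID', job_ts = now()
FROM  (
   SELECT id
   FROM   cargo_item
   WHERE  state = 'NEW' AND job_id IS NULL
   AND    pg_try_advisory_xact_lock(id)  -- skip rows another transaction holds
   ORDER  BY id
   LIMIT  3000
   ) sub
WHERE  item.id = sub.id;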

All in one UPDATE

It seems concurrent access isn't a problem per se in your setup. Concurrency is an issue created by your current solution.

Instead, do it all in a single UPDATE. Assign batches of n numbers (3000 in the example) to each UUID and update all at once. Should be fastest.

UPDATE cargo_item c
SET    job_id = u.uuid_col
     , job_ts = now()
FROM  (
   SELECT row_number() OVER () AS rn, uuid_col
   FROM   uuid_tbl WHERE <some_criteria>  -- or see below
   ) u
JOIN  (
   SELECT ((row_number() OVER () - 1) / 3000) + 1 AS rn, id
   FROM  (
      SELECT id
      FROM   cargo_item
      WHERE  state = 'NEW' AND job_id IS NULL
      FOR    UPDATE   -- just to be sure; kept in an inner subquery,
      ) c1            -- since FOR UPDATE is not allowed with window functions
   ) c2 USING (rn)
WHERE  c2.id = c.id;

Major points

  • Integer division truncates. You get 1 for the first 3000 rows, 2 for the next 3000 rows, etc.

  • I pick rows arbitrarily; you could apply an ORDER BY in the window for row_number() to assign specific rows.

  • If you don't have a table of UUIDs to dispatch (uuid_tbl), use a VALUES expression to supply them (see the sketch after this list).

  • You get batches of 3000 rows. The last batch will fall short of 3000 if the number of qualifying rows is not a multiple of 3000.
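
A sketch of such a VALUES expression, as a drop-in replacement for the uuid_tbl subquery above (the UUID literals are placeholders):

FROM  (
   SELECT row_number() OVER () AS rn, uuid_col
   FROM  (
      VALUES ('11111111-1111-1111-1111-111111111111'::uuid)  -- one UUID per batch of 3000
           , ('22222222-2222-2222-2222-222222222222'::uuid)
      ) v(uuid_col)
   ) u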

Efficiently mark contiguous subsets of rows in postgres

After initializing max_id as max(id) - 1000 to leave room for 1000 rows, this should use the index:

UPDATE table
SET    test = true
FROM  (SELECT (random() * max_id)::bigint AS lower_bound) t
WHERE  id BETWEEN t.lower_bound AND t.lower_bound + 999;

  • No need for the complicated structure with a CTE and subqueries. Use a single subquery.

  • Your original calculation yields a numeric (or double precision), which may not go well with an index on a bigint column. Cast to bigint. (Should not be a problem in pg 9.3.)

  • BETWEEN includes lower and upper bound. Your upper bound should be lower_bound + 999, strictly speaking.

  • random() returns (per documentation) a random value in the range 0.0 <= x < 1.0. To be completely fair, your lower_bound should really be calculated like this (assuming no gaps):

      trunc(random() * max_id)::bigint + 1

If you need truly random numbers (or if your id has gaps), consider this related answer:

  • Best way to select random rows PostgreSQL

Advisory locks or a different approach may be useful. Compare this related, later answer:

  • How to mark certain nr of rows in table on concurrent access

How to add rows in an Access Table with parameters?

Actually, I often find it too much work to create a complex insert routine. And even worse, I often don't care about, or don't want to supply, all of the columns.

.NET, as a result, has what is called a command builder for you.

And this pretty much means you can write a lot of your code in a similar way to how VBA code in Access works.

So, say I want to add a new row - the table might have 50 columns, but I don't really care.

So, I can write the code this way:

Dim rstHotels As DataTable

rstHotels = MyRst("SELECT * FROM tblHotels WHERE ID = 0")

' now add 3 new hotels
For i = 1 To 3

    Dim OneRow = rstHotels.NewRow

    OneRow("HotelName") = "Hotel #" & i
    OneRow("City") = "City #" & i
    OneRow("FirstName") = "Test First name #" & i
    OneRow("Active") = True

    rstHotels.Rows.Add(OneRow)

Next

' ok, added rows to rstHotels - now write back to database
MyRstUpdate(rstHotels, "tblHotels")

' or update 5 rows and do complex processing on existing data
Dim rstFun As DataTable = MyRst("SELECT * from tblHotels where City = 'Banff'")

For Each MyRow As DataRow In rstFun.Rows

    MyRow("Active") = True
    ' more complex code here

Next

' now send data changes back to database
MyRstUpdate(rstFun, "tblHotels")

So, note how we don't have to have some complex insert statement, and we don't have to write some loop that gives one hoot about the number of columns. The .NET data operations have all this stuff built in for you - little or even next to no reason for you to try and re-invent the wheel here.

And the two handy dandy code routines I have? They are:

Public Function MyRst(strSQL As String) As DataTable

    Dim rstData As New DataTable
    Using conn As New OleDbConnection(My.Settings.AccessDB)
        Using cmdSQL As New OleDbCommand(strSQL, conn)
            conn.Open()
            rstData.Load(cmdSQL.ExecuteReader)
            rstData.TableName = strSQL
        End Using
    End Using

    Return rstData

End Function

Public Sub MyRstUpdate(rstData As DataTable, strTableName As String)

    Using conn As New OleDbConnection(My.Settings.AccessDB)
        Using cmdSQL As New OleDbCommand("SELECT * from " & strTableName, conn)

            Dim da As New OleDbDataAdapter(cmdSQL)
            ' the command builder generates the insert/update/delete commands for the adapter
            Dim daUP As New OleDbCommandBuilder(da)
            conn.Open()
            da.Update(rstData)

        End Using
    End Using

End Sub

Now, I am really rather free to just code out my general routines.

So, you need to say load up a grid, or even a combo box? You can now do this:

ListBox1.DataSource = MyRst("SELECT ID, Salutation from tblGender ORDER BY Salutation")

So, for a simple insert, or even edit of some rows? No need to create some monster huge insert statement with a boatload of parameters. Just create a data table, and then use a simple data row to either add new rows, or even update existing ones.

The beauty of the above is not only do you eliminate a boatload of parameters, but you also get parameter safety, and even type conversions. So, you can for example do this:

 OneRow("InvoiceDate") = Date.Today

Thus a strongly typed value such as money, integer, or datetime can be used in code - and no messy format conversions are required in most cases.

This so-called "database first" approach can be really handy, and often for some operations this is a lot less setup time and learning curve than, say, using EF, or even the previous dataset designer (EF = "Entity Framework", which works much like the older dataset designer system - but introducing these object model systems can be a big thing to chew on when you are just starting out).

But, no, don't write your own looping code to write out and create all the columns for an update command (or insert command - note how that ONE routine can handle both updates and inserts). And you can even use row.Delete and then call that update routine - it will also work!

If you think about this, that really amounts to a lot of work, and built-in systems exist for this purpose - it saves you having to re-invent the wheel.

Concurrent insert of keys into a table

With Firebird you can use the following statement:

UPDATE OR INSERT INTO MY_TABLE (MY_KEY) VALUES (:MY_KEY) MATCHING (MY_KEY) RETURNING MY_ID

  • assuming there is a BEFORE INSERT trigger which will generate the MY_ID if a NULL value is being inserted.

Here is the documentation.
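
A sketch of such a trigger, assuming a generator named GEN_MY_TABLE_ID (all names are placeholders):

CREATE GENERATOR GEN_MY_TABLE_ID;

SET TERM ^ ;
CREATE TRIGGER MY_TABLE_BI FOR MY_TABLE
ACTIVE BEFORE INSERT POSITION 0
AS
BEGIN
  IF (NEW.MY_ID IS NULL) THEN
    NEW.MY_ID = GEN_ID(GEN_MY_TABLE_ID, 1);
END^
SET TERM ; ^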

Update: The above statement will avoid exceptions and cause every statement to succeed. However, in case of many duplicate key values it will also cause many unnecessary updates.
This can be avoided by another approach: just handle the unique constraint exception on the client and ignore it. The details depend on which Delphi library you're using to work with Firebird but it should be possible to examine the SQLCode returned by the server and ignore only the specific case of unique constraint violation.

rowcount in 2 concurrent transactions

Yes, that's exactly what's happening. The timing has to be just right, which is why it's intermittent. You can verify this if you open two sqlplus sessions.

  1. Session A: Updates, no rows changed so SQL%ROWCOUNT = 0.
  2. Session B: Updates, no rows changed so SQL%ROWCOUNT = 0.
  3. Session A: Passes the IF condition, so does the insert.
  4. Session B: Passes the IF condition, so does the insert.
  5. Session A: Commits.
  6. Session B: Commits. Oops, there's a constraint violation because now I can see session A's changes.

Changing the isolation level will not help you. Oracle does not have any isolation level that lets you see uncommitted changes from another session (and that's a good thing).

The first thing to do is change your UPDATE and INSERT into a MERGE statement. That way, you have only a single statement that will succeed or fail. As far as I'm concerned, having an insert and update separated by a condition is an antipattern. Something like this:

MERGE INTO customer
USING ( SELECT customer_name, some_other_col FROM wherever ) source   -- some_other_col is a placeholder
ON ( source.customer_name = customer.customer_name )
WHEN NOT MATCHED THEN
  INSERT ( customer_name, some_other_col ) VALUES ( source.customer_name, source.some_other_col )
WHEN MATCHED THEN
  UPDATE SET customer.some_other_col = source.some_other_col;   -- Oracle does not allow updating columns used in the ON clause

The drawback to MERGE is that it doesn't have a RETURNING INTO clause, so if you need that, you do have to go with the select/insert.

Second, to stop two sessions from inserting at the same time, you need to do:

LOCK TABLE customer IN SHARE MODE;

I generally don't like making custom locks, but I don't know any other way around this. This will stop other sessions from modifying the table until the first session commits, though they can query it. This means access to the table is serialized, so if you have a lot of sessions trying to update this table, that may not be acceptable.

Interview task: Limit number of concurrent requests with tokens

What you need to do is to check whether there is a free token when a request starts, take a token if there is a free one, and release it when work is done. Since the question was about a database, I think it hints at how to achieve atomicity of check and take: use transactions.

For example, you could have a table with 500 rows for tokens, with a procedure that selects a free token from the database and, if there is one, updates its row to mark it as taken. This procedure would run in a transaction. If it ends with no free token available, the serving thread would wait and try again after a short period of time. Releasing the token is again a trivial update of a row (see the sketch below).

Another option would be starting with an empty table and taking a token by inserting a new row. Again, checking that there are fewer than 500 rows and inserting the new one must run in a single transaction to ensure atomicity.
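
A sketch of the first approach in SQL, assuming a pre-filled table of 500 token rows and Postgres-style locking (table and column names are made up; FOR UPDATE SKIP LOCKED needs Postgres 9.5+):

-- one row per token; taken_by is NULL while the token is free
CREATE TABLE request_token (
  token_id int PRIMARY KEY
, taken_by text
);
INSERT INTO request_token (token_id)
SELECT generate_series(1, 500);

-- take a token atomically; returns no row if all 500 are busy
BEGIN;
UPDATE request_token
SET    taken_by = 'request-42'
WHERE  token_id = (
   SELECT token_id
   FROM   request_token
   WHERE  taken_by IS NULL
   LIMIT  1
   FOR    UPDATE SKIP LOCKED
   )
RETURNING token_id;
COMMIT;

-- release the token when the work is done
UPDATE request_token SET taken_by = NULL WHERE token_id = 42;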

Update column to a value in joined column

Raw SQL should be faster by several orders of magnitude.

Step 1: INSERT

Insert all domain names into table domains unless they are already there:

INSERT INTO domains (name)
SELECT DISTINCT s.domain
FROM   stats s
LEFT   JOIN domains d ON d.name = s.domain
WHERE  d.name IS NULL;

  • Select rows which are not present in other table

There is a potential race condition if you have concurrent write access. The simplest solution is to lock the table domains exclusively for the transaction. Otherwise you might run into a unique violation halfway into the operation, because a concurrent transaction has committed the same domain name in between, and everything would be rolled back.

BEGIN;
LOCK TABLE domains IN EXCLUSIVE MODE;

INSERT INTO domains (name)
SELECT DISTINCT s.domain
FROM   stats s
LEFT   JOIN domains d ON d.name = s.domain
WHERE  d.name IS NULL;

COMMIT;

domains.name should be UNIQUE. That constraint is implemented with an index on the column which will help performance in the next step.

  • How does PostgreSQL enforce the UNIQUE constraint / what type of index does it use?

  • Does a Postgres UNIQUE constraint imply an index?
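
If the constraint is not there yet, a sketch for adding it (the constraint name is just an example); the backing index is created automatically:

ALTER TABLE domains ADD CONSTRAINT domains_name_key UNIQUE (name);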

Step 2: UPDATE

To update some rows but not all:

Update all domain_id values to turn the column into a foreign key to the domains table.
But don't use correlated subqueries; use an UPDATE with a FROM clause. Much faster here.

UPDATE stats s
SET    domain_id = d.id
FROM   domains d
WHERE  d.name = s.domain
AND    domain_id IS NULL;  -- assuming existing ids are correct

Then you can drop the now redundant column stats.domain:

ALTER TABLE stats DROP COLUMN domain;

That is extremely cheap. The column is marked dead in the system catalog. The actual column value is not removed until the row is updated or vacuumed.

To further improve performance, drop all indexes that are not directly needed for the operation and recreate them afterwards - all in the same transaction.
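
A sketch of that, assuming a hypothetical index stats_some_idx on a column that is not needed for this UPDATE:

BEGIN;
DROP INDEX IF EXISTS stats_some_idx;   -- hypothetical index not needed for the UPDATE

UPDATE stats s
SET    domain_id = d.id
FROM   domains d
WHERE  d.name = s.domain
AND    domain_id IS NULL;

CREATE INDEX stats_some_idx ON stats (some_col);   -- recreate afterwards (some_col is a placeholder)
COMMIT;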

Or, to update in batches of n rows:

  • How to mark certain nr of rows in table on concurrent access

Or, since you clarified in your comment that you are updating all rows, it would be substantially cheaper to create a new table like @Tim3880 also suggested - if constraints and access patterns allow for that.

Either create a completely new table, drop the old one and rename the new one:

  • Best way to populate a new column in a large table?

Or, if you need to keep the existing table in place (due to concurrent access or other constraints):

  • Optimizing bulk update performance in PostgreSQL
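
A sketch of the first option (build a new table, then rename), assuming stats has an id column and no other columns besides domain that need to be carried over (adapt the column list to your table):

BEGIN;
LOCK TABLE stats IN SHARE MODE;   -- only needed if concurrent writes are possible

CREATE TABLE stats_new AS
SELECT s.id, d.id AS domain_id    -- list all columns you want to keep, minus domain
FROM   stats s
LEFT   JOIN domains d ON d.name = s.domain;

-- recreate indexes, constraints and defaults on stats_new here, then:
DROP TABLE stats;
ALTER TABLE stats_new RENAME TO stats;
COMMIT;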


Aside: Never use non-descriptive terms like name or id as column names. That's a widespread anti-pattern. The schema should really be something like:

CREATE TABLE domain (
  domain_id serial PRIMARY KEY
, domain    text UNIQUE NOT NULL  -- guessing it should be UNIQUE
);

CREATE TABLE stats (
  stats_id  serial PRIMARY KEY
, domain_id int REFERENCES domain
-- , domain text  -- can be deleted after above normalization
);

