update x set y = null takes a long time
Summary
I think updating to null is slower because Oracle (incorrectly) tries to take advantage of the way it stores nulls, causing it to frequently re-organize the rows in the block ("heap block compress"), creating a lot of extra UNDO and REDO.
What's so special about null?
From the Oracle Database Concepts:
"Nulls are stored in the database if they fall between columns with data values. In these cases they require 1 byte to store the length of the column (zero).
Trailing nulls in a row require no storage because a new row header signals that the remaining columns in the previous row are null. For example, if the last three columns of a table are null, no information is stored for those columns. In tables with many columns,
the columns more likely to contain nulls should be defined last to conserve disk space."
Test
Benchmarking updates is very difficult because the true cost of an update cannot be measured just from the update statement. For example, log switches will
not happen with every update, and delayed block cleanout will happen later. To accurately test an update, there should be multiple runs,
objects should be recreated for each run, and the high and low values should be discarded.
For simplicity the script below does not throw out high and low results, and only tests a table with a single column. But the problem still occurs regardless of the number of columns, their data, and which column is updated.
I used the RunStats utility from http://www.oracle-developer.net/utilities.php to compare the resource consumption of updating-to-a-value with updating-to-a-null.
create table test1(col1 number);
BEGIN
    dbms_output.enable(1000000);
    runstats_pkg.rs_start;

    for i in 1 .. 10 loop
        execute immediate 'drop table test1 purge';
        execute immediate 'create table test1 (col1 number)';
        execute immediate 'insert /*+ append */ into test1
            select 1 col1 from dual connect by level <= 100000';
        commit;
        execute immediate 'update test1 set col1 = 1';
        commit;
    end loop;

    runstats_pkg.rs_pause;
    runstats_pkg.rs_resume;

    for i in 1 .. 10 loop
        execute immediate 'drop table test1 purge';
        execute immediate 'create table test1 (col1 number)';
        execute immediate 'insert /*+ append */ into test1
            select 1 col1 from dual connect by level <= 100000';
        commit;
        execute immediate 'update test1 set col1 = null';
        commit;
    end loop;

    runstats_pkg.rs_stop();
END;
/
Result
There are dozens of differences; these are the four I think are most relevant:
Type Name Run1 Run2 Diff
----- ---------------------------- ------------ ------------ ------------
TIMER elapsed time (hsecs) 1,269 4,738 3,469
STAT heap block compress 1 2,028 2,027
STAT undo change vector size 55,855,008 181,387,456 125,532,448
STAT redo size 133,260,596 581,641,084 448,380,488
Solutions?
The only possible solution I can think of is to enable table compression. The trailing-null storage trick doesn't happen for compressed tables.
So even though the "heap block compress" number gets even higher for Run2, from 2028 to 23208, I guess it doesn't actually do anything.
The redo, undo, and elapsed time between the two runs are almost identical with table compression enabled.
However, there are lots of potential downsides to table compression. Updating to a null will run much faster, but every other update will run at least slightly slower.
Possibly stalling UPDATE x SET y = NULL statement
We ended up copying and swapping the table.
The question linked by Stephan in the comments contains some useful pointers on how to keep the dataset online during the operation. In particular, see this answer by Mitch Schroeter, which essentially sets up a view unioning the old and new tables while the transfer takes place.
Because we didn't need to keep the dataset online, this was overkill (especially considering the rest of the dataset is pretty small). Instead:
CREATE TABLE _foobar (id INT IDENTITY PRIMARY KEY, foo INT, bar INT NULL);
SET IDENTITY_INSERT _foobar ON;
INSERT _foobar (id, foo, bar) SELECT id, foo, NULL FROM foobar;
SET IDENTITY_INSERT _foobar OFF;
DROP TABLE foobar;
EXECUTE sp_rename '_foobar', 'foobar';
The whole operation took 14 seconds, which seemed difficult to beat for our scenario.
Some tips/comments:
- Ensure the CREATE TABLE statement produces a schema that matches (e.g. using tools like VS or SSMS).
- Don't forget about IDENTITY columns. This means you need to write the column list explicitly for the INSERT statement, and of course set IDENTITY_INSERT for the table. See the MSDN documentation for details.
Conclusions:
- It would seem that according to this there is no easy way to split a normal UPDATE transaction into multiple transactions to manage consistency at a higher level. As suggested there and by HABO, all solutions seem to require either a scan for the predicate of the request on every batch, or the use of a temporary table to store the keys of rows matching the predicate in one go and use that for each batch (which should always be faster, since the PK is always indexed).
- It would seem that there is no easy way to do a copy/swap while keeping operations online either. Again, see this for an approach where you manually set up a unioned view.
- If the rest of your dataset is pretty small (fast to copy in its entirety) and you don't need to keep it online, you may use the more straightforward approach above. Disclaimer: check with your DBA if you have one; this may be dangerous if you're not 100% sure of what you're doing.
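To make the temporary-table-of-keys idea above concrete, here is a minimal sketch in Python against an in-memory SQLite database (the table, column names, and predicate are all hypothetical stand-ins; on SQL Server you would use a #temp table instead). The predicate is scanned once into a keys table, and each batch then updates a slice of those keys in its own transaction.

```python
import sqlite3

# Hypothetical schema: null out "bar" for rows matching a predicate, in batches.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foobar (id INTEGER PRIMARY KEY, foo INTEGER, bar INTEGER)")
conn.executemany("INSERT INTO foobar (id, foo, bar) VALUES (?, ?, ?)",
                 [(i, i % 10, i) for i in range(1, 1001)])
conn.commit()

# Scan the predicate once, storing only the matching keys.
conn.execute("CREATE TEMP TABLE keys AS SELECT id FROM foobar WHERE foo < 5")

batch_size = 100
while True:
    # Take the next slice of keys; subsequent lookups are cheap PK lookups.
    batch = [r[0] for r in conn.execute(
        "SELECT id FROM keys ORDER BY id LIMIT ?", (batch_size,))]
    if not batch:
        break
    placeholders = ",".join("?" * len(batch))
    conn.execute(f"UPDATE foobar SET bar = NULL WHERE id IN ({placeholders})", batch)
    conn.execute(f"DELETE FROM keys WHERE id IN ({placeholders})", batch)
    conn.commit()  # one transaction per batch keeps each unit of work small
```

The point of the keys table is that the (possibly expensive) predicate runs exactly once; every batch afterwards only touches indexed primary keys.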
How to prevent update statement from setting some records to null?
The best way is to use exists in the where clause:
update t1 x
set x.code = (select code
              from t2 y
              where x.address = y.address
                and x.city = y.city
                and x.state = y.state
                and x.flag = y.flag
                and rownum = 1)
where x.aflag like '%b%'
  and exists (select code
              from t2 y
              where x.address = y.address
                and x.city = y.city
                and x.state = y.state
                and x.flag = y.flag
                and rownum = 1);
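The difference the exists guard makes can be demonstrated with a small sketch in Python against an in-memory SQLite database (the t1/t2 schema is reduced to a single join column for brevity; table names mirror the answer above). Without the guard, every row with no match in t2 has its code overwritten with NULL.

```python
import sqlite3

# Reduced t1/t2 schema: match on a single "address" column for brevity.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (address TEXT, code TEXT)")
conn.execute("CREATE TABLE t2 (address TEXT, code TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?, ?)",
                 [("1 Main St", "old"), ("2 Oak Ave", "old")])
conn.execute("INSERT INTO t2 VALUES ('1 Main St', 'new')")

# Without the exists guard: the subquery finds no match for '2 Oak Ave',
# so that row's code is overwritten with NULL.
conn.execute("""UPDATE t1
                SET code = (SELECT y.code FROM t2 y WHERE t1.address = y.address)""")
print(conn.execute("SELECT address, code FROM t1 ORDER BY address").fetchall())
# [('1 Main St', 'new'), ('2 Oak Ave', None)]

# Reset, then add the exists guard: only rows with a match are touched.
conn.execute("UPDATE t1 SET code = 'old'")
conn.execute("""UPDATE t1
                SET code = (SELECT y.code FROM t2 y WHERE t1.address = y.address)
                WHERE EXISTS (SELECT 1 FROM t2 y WHERE t1.address = y.address)""")
print(conn.execute("SELECT address, code FROM t1 ORDER BY address").fetchall())
# [('1 Main St', 'new'), ('2 Oak Ave', 'old')]
```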
Update statement using a WHERE clause that contains columns with null Values
Since null = null evaluates to unknown (not true), you need to check whether two fields are both null in addition to the equality check:
UPDATE table_one SET table_one.x = table_two.y
FROM table_two
WHERE
(table_one.invoice_number = table_two.invoice_number
OR (table_one.invoice_number is null AND table_two.invoice_number is null))
AND
(table_one.submitted_by = table_two.submitted_by
OR (table_one.submitted_by is null AND table_two.submitted_by is null))
AND
-- etc
You could also use the coalesce function, which is more readable:
UPDATE table_one SET table_one.x = table_two.y
FROM table_two
WHERE
coalesce(table_one.invoice_number, '') = coalesce(table_two.invoice_number, '')
AND coalesce(table_one.submitted_by, '') = coalesce(table_two.submitted_by, '')
AND -- etc
But you need to be careful about the default value (the last argument to coalesce). Its data type should match the column type (so that you don't end up comparing dates with numbers, for example), and the default should be a value that doesn't appear in the data. E.g. coalesce(null, 1) = coalesce(1, 1) is a situation you'd want to avoid.
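A quick way to see all three behaviors (plain equality, the explicit is-null check, and the coalesce collision) is a few one-line queries. The sketch below uses Python's sqlite3, where the NULL comparison semantics are the same as in the SQL above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Plain equality: NULL = NULL is unknown (returned as NULL), not true,
# so a join on the raw columns silently drops NULL pairs.
print(conn.execute("SELECT NULL = NULL").fetchone())                    # (None,)

# The explicit both-null check does match NULL pairs.
print(conn.execute("SELECT NULL IS NULL AND NULL IS NULL").fetchone())  # (1,)

# The coalesce trick also matches them, but a default that occurs in the
# data makes a real value collide with NULL -- the case to avoid.
print(conn.execute("SELECT COALESCE(NULL, 1) = COALESCE(1, 1)").fetchone())  # (1,)
```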
Update (regarding performance):
Seq Scan on table_two - this suggests that you don't have any indexes on table_two. So if you update a row in table_one, then to find a matching row in table_two the database basically has to scan through all the rows one by one until it finds a match. The matching rows could be found much faster if the relevant columns were indexed. On the flip side, if table_one has any indexes, that slows down the update.
According to this performance guide:
Table constraints and indexes heavily delay every write. If possible, you should drop all the indexes, triggers and foreign keys while the update runs and recreate them at the end.
Another suggestion from the same guide that might be helpful is:
If you can segment your data using, for example, sequential IDs, you can update rows incrementally in batches.
So for example, if table_one has an id column, you could add something like and table_one.id between x and y to the where condition and run the query several times, changing the values of x and y so that all rows are covered.
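A minimal sketch of that incremental approach, using Python's sqlite3 with a made-up table_one (id, x) schema: the id range is walked in fixed-size slices, with one transaction per batch so no single transaction touches the whole table.

```python
import sqlite3

# Made-up table_one (id, x) schema; the update walks the id range in slices.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_one (id INTEGER PRIMARY KEY, x INTEGER)")
conn.executemany("INSERT INTO table_one (id, x) VALUES (?, 0)",
                 [(i,) for i in range(1, 1001)])
conn.commit()

batch = 250
lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM table_one").fetchone()
for start in range(lo, hi + 1, batch):
    # This is the "and table_one.id between x and y" condition from the
    # text, with x and y moving forward on each pass.
    conn.execute("UPDATE table_one SET x = 1 WHERE id BETWEEN ? AND ?",
                 (start, start + batch - 1))
    conn.commit()  # each slice commits separately, keeping transactions short
```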
The EXPLAIN ANALYZE option also took forever
You might want to be careful when using the ANALYZE option with EXPLAIN on statements with side effects.
According to the documentation:
Keep in mind that the statement is actually executed when the ANALYZE option is used. Although EXPLAIN will discard any output that a SELECT would return, other side effects of the statement will happen as usual.
PL/SQL: UPDATE inside CURSOR, but some data is NULL
If I'm understanding you correctly, you want to match lf with rm.lf, including when they're both null? If that's what you want, then this will do it:
...
AND (lf = rm.lf
OR (lf IS NULL AND rm.lf IS NULL)
)
...
It's comparing the values of lf and rm.lf, which evaluates to unknown (treated as not true) if either is null, so the OR condition is what returns true when they're both null.
Update data.table with mapply speed issue
A few notes here:
- As it stands now, using data.table for your need could be overkill (though not necessarily) and you could probably avoid it.
- You are growing objects in a loop (Column <- c(Column, x)) - don't do that. In your case there is no need: just create an empty vector of zeroes and you can get rid of most of your function.
- There is absolutely no need to create Column2 - it is just z - as R will automatically recycle it to fit the correct size.
- No need to recalculate nrow(addTable) per row either; that could just be an additional parameter.
- Your biggest overhead is calling data.table:::`[.data.table` per row - it is a very expensive function. The := function itself has very little overhead here. If you replace addTable[, First := First + Column] ; addTable[, Second := Second + Column2] with just addTable$First + Column ; addTable$Second + Column2, the run time is reduced from ~35 secs to ~2 secs. Another way to illustrate this is by replacing the two lines with set - e.g. set(addTable, j = "First", value = addTable[["First"]] + Column) ; set(addTable, j = "Second", value = addTable[["Second"]] + Column2) - which basically shares its source code with :=. This also runs in ~2 secs.
- Finally, it is better to reduce the number of operations per row. You could try accumulating the result using Reduce instead of updating the actual data set per row.
Let's see some examples
Your original function timings
library(data.table)
dt <- data.table(X= c(1:100), Y=c(.5, .7, .3, .4), Z=c(1:50000))
addTable <- data.table(First=0, Second=0, Term=c(1:50))
sample_fun <- function(x, y, z) {
    Column <- NULL
    while (x >= 1) {
        x <- x * y
        Column <- c(Column, x)
    }
    length(Column) <- nrow(addTable)
    Column[is.na(Column)] <- 0
    Column2 <- NULL
    Column2 <- rep(z, length(Column))
    addTable[, First := First + Column]
    addTable[, Second := Second + Column2]
}
system.time(mapply(sample_fun, dt$X, dt$Y, dt$Z))
# user system elapsed
# 30.71 0.00 30.78
30 secs is pretty slow...
1- Let's try removing the data.table:::`[.data.table` overhead
sample_fun <- function(x, y, z) {
    Column <- NULL
    while (x >= 1) {
        x <- x * y
        Column <- c(Column, x)
    }
    length(Column) <- nrow(addTable)
    Column[is.na(Column)] <- 0
    Column2 <- NULL
    Column2 <- rep(z, length(Column))
    addTable$First + Column
    addTable$Second + Column2
}
system.time(mapply(sample_fun, dt$X, dt$Y, dt$Z))
# user system elapsed
# 2.25 0.00 2.26
^ That was much faster but didn't update the actual data set.
2- Now let's try replacing it with set, which will have the same effect as := but without the data.table:::`[.data.table` overhead
sample_fun <- function(x, y, z) {
    Column <- NULL
    while (x >= 1) {
        x <- x * y
        Column <- c(Column, x)
    }
    length(Column) <- nrow(addTable)
    Column[is.na(Column)] <- 0
    Column2 <- NULL
    Column2 <- rep(z, length(Column))
    set(addTable, j = "First", value = addTable[["First"]] + Column)
    set(addTable, j = "Second", value = addTable[["Second"]] + Column2)
}
system.time(mapply(sample_fun, dt$X, dt$Y, dt$Z))
# user system elapsed
# 2.96 0.00 2.96
^ Well, that was also much faster than 30 secs and had the exact same effect as :=
3- Let's try it without using data.table at all
dt <- data.frame(X= c(1:100), Y=c(.5, .7, .3, .4), Z=c(1:50000))
addTable <- data.frame(First=0, Second=0, Term=c(1:50))
sample_fun <- function(x, y, z) {
    Column <- NULL
    while (x >= 1) {
        x <- x * y
        Column <- c(Column, x)
    }
    length(Column) <- nrow(addTable)
    Column[is.na(Column)] <- 0
    Column2 <- NULL
    Column2 <- rep(z, length(Column))
    return(list(Column, Column2))
}
system.time(res <- mapply(sample_fun, dt$X, dt$Y, dt$Z))
# user system elapsed
# 1.34 0.02 1.36
^ That's even faster
Now we can use Reduce combined with accumulate = TRUE in order to create those vectors
system.time(addTable$First <- Reduce(`+`, res[1, ], accumulate = TRUE)[[nrow(dt)]])
# user system elapsed
# 0.07 0.00 0.06
system.time(addTable$Second <- Reduce(`+`, res[2, ], accumulate = TRUE)[[nrow(dt)]])
# user system elapsed
# 0.07 0.00 0.06
Well, everything combined is now under 2 seconds (instead of 30 with your original function).
4- Further improvements could be to fix the other elements in your function (as pointed above), in other words, your function could be just
sample_fun <- function(x, y, n) {
    Column <- numeric(n)
    i <- 1L
    while (x >= 1) {
        x <- x * y
        Column[i] <- x
        i <- i + 1L
    }
    return(Column)
}
system.time(res <- Map(sample_fun, dt$X, dt$Y, nrow(addTable)))
# user system elapsed
# 0.72 0.00 0.72
^ A twofold improvement in speed
Now, we didn't even bother creating Column2 as we already have it in dt$Z. We also used Map instead of mapply as it will be easier for Reduce to work with a list than a matrix.
The next step is similar to as before
system.time(addTable$First <- Reduce(`+`, res, accumulate = TRUE)[[nrow(dt)]])
# user system elapsed
# 0.07 0.00 0.07
But we could improve this even further. Instead of using Map/Reduce we could create a matrix using mapply and then run matrixStats::rowCumsums over it (which is written in C++ internally) in order to calculate addTable$First.
system.time(res <- mapply(sample_fun, dt$X, dt$Y, nrow(addTable)))
# user system elapsed
# 0.76 0.00 0.76
system.time(addTable$First2 <- matrixStats::rowCumsums(res)[, nrow(dt)])
# user system elapsed
# 0 0 0
While the final step is simply summing dt$Z
system.time(addTable$Second <- sum(dt$Z))
# user system elapsed
# 0 0 0
So eventually we went from ~30 secs to less than a second.
Some final notes
- As it seems like the main overhead remained in the function itself, you could also try rewriting it using Rcpp, as loops seem inevitable in this case (though the remaining overhead is not so big).
Update Field When Not Null
Do this:
UPDATE newspapers
SET scan_notes = "data",
scan_entered_by = "some_name",
scan_modified_date = "current_unix_timestamp",
scan_created_date = COALESCE(scan_created_date, "current_unix_timestamp")
WHERE id = X
The COALESCE function picks the first non-null value. In this case, it will keep scan_created_date at its existing value if one exists; otherwise it will take whatever you replace "current_unix_timestamp" with.
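Here is a small runnable sketch of the same COALESCE pattern, using Python's sqlite3 with integer stand-ins for the unix timestamps (the newspapers schema is trimmed to the relevant columns): the created date is set only where it was still null, while the modified date is always overwritten.

```python
import sqlite3

# Trimmed "newspapers" schema with integer stand-ins for unix timestamps.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE newspapers
                (id INTEGER PRIMARY KEY, scan_notes TEXT,
                 scan_created_date INTEGER, scan_modified_date INTEGER)""")
conn.executemany("INSERT INTO newspapers VALUES (?, ?, ?, ?)",
                 [(1, None, None, None),    # never scanned: no created date yet
                  (2, "old", 1000, 1000)])  # already scanned at t=1000

now = 2000  # stand-in for the current unix timestamp
for row_id in (1, 2):
    conn.execute("""UPDATE newspapers
                    SET scan_notes = 'data',
                        scan_modified_date = ?,
                        scan_created_date = COALESCE(scan_created_date, ?)
                    WHERE id = ?""", (now, now, row_id))

# Row 1 gets now as its created date; row 2 keeps its original 1000.
print(conn.execute("SELECT id, scan_created_date, scan_modified_date "
                   "FROM newspapers ORDER BY id").fetchall())
# [(1, 2000, 2000), (2, 1000, 2000)]
```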