update x set y = null takes a long time
Summary
I think updating to null is slower because Oracle (incorrectly) tries to take advantage of the way it stores nulls, causing it to frequently re-organize the rows in the block ("heap block compress"), creating a lot of extra UNDO and REDO.
What's so special about null?
From the Oracle Database Concepts:
"Nulls are stored in the database if they fall between columns with data values. In these cases they require 1 byte to store the length of the column (zero).
Trailing nulls in a row require no storage because a new row header signals that the remaining columns in the previous row are null. For example, if the last three columns of a table are null, no information is stored for those columns. In tables with many columns,
the columns more likely to contain nulls should be defined last to conserve disk space."
Test
Benchmarking updates is very difficult because the true cost of an update cannot be measured just from the update statement. For example, log switches will
not happen with every update, and delayed block cleanout will happen later. To accurately test an update, there should be multiple runs,
objects should be recreated for each run, and the high and low values should be discarded.
For simplicity the script below does not throw out high and low results, and only tests a table with a single column. But the problem still occurs regardless of the number of columns, their data, and which column is updated.
I used the RunStats utility from http://www.oracle-developer.net/utilities.php to compare the resource consumption of updating-to-a-value with updating-to-a-null.
create table test1(col1 number);
BEGIN
    dbms_output.enable(1000000);
    runstats_pkg.rs_start;

    for i in 1 .. 10 loop
        execute immediate 'drop table test1 purge';
        execute immediate 'create table test1 (col1 number)';
        execute immediate 'insert /*+ append */ into test1
            select 1 col1 from dual connect by level <= 100000';
        commit;
        execute immediate 'update test1 set col1 = 1';
        commit;
    end loop;

    runstats_pkg.rs_pause;
    runstats_pkg.rs_resume;

    for i in 1 .. 10 loop
        execute immediate 'drop table test1 purge';
        execute immediate 'create table test1 (col1 number)';
        execute immediate 'insert /*+ append */ into test1
            select 1 col1 from dual connect by level <= 100000';
        commit;
        execute immediate 'update test1 set col1 = null';
        commit;
    end loop;

    runstats_pkg.rs_stop();
END;
/
Result
There are dozens of differences; these are the four I think are most relevant:
Type Name Run1 Run2 Diff
----- ---------------------------- ------------ ------------ ------------
TIMER elapsed time (hsecs) 1,269 4,738 3,469
STAT heap block compress 1 2,028 2,027
STAT undo change vector size 55,855,008 181,387,456 125,532,448
STAT redo size 133,260,596 581,641,084 448,380,488
Solutions?
The only possible solution I can think of is to enable table compression. The trailing-null storage trick doesn't happen for compressed tables.
So even though the "heap block compress" number gets even higher for Run2, from 2028 to 23208, I guess it doesn't actually do anything.
The redo, undo, and elapsed time between the two runs are almost identical with table compression enabled.
However, there are lots of potential downsides to table compression. Updating to a null will run much faster, but every other update will run at least slightly slower.
Possibly stalling UPDATE x SET y = NULL statement
We ended up copying and swapping the table.
The question linked by Stephan in the comments contains some useful pointers on how to keep the dataset online during the operation. In particular, see this answer by Mitch Schroeter, which essentially sets up a view unioning the old and new tables while the transfer takes place.
Because we didn't need to keep the dataset online, this was overkill (especially considering the rest of the dataset is pretty small). Instead:
CREATE TABLE _foobar (id INT IDENTITY PRIMARY KEY, foo INT, bar INT NULL);
SET IDENTITY_INSERT _foobar ON;
INSERT _foobar (id, foo, bar) SELECT id, foo, NULL FROM foobar;
SET IDENTITY_INSERT _foobar OFF;
DROP TABLE foobar;
EXECUTE sp_rename '_foobar', 'foobar';
The whole operation took 14 seconds, which seemed difficult to beat for our scenario.
Some tips/comments:
- Ensure the CREATE TABLE statement produces a schema that matches (e.g. using tools like VS or SSMS).
- Don't forget about IDENTITY columns. This means you need to write the column list explicitly for the INSERT statement, and of course set IDENTITY_INSERT for the table. See the MSDN documentation for details.
Conclusions:
- It would seem that according to this there is no easy way to split a normal UPDATE transaction into multiple transactions to manage consistency at a higher level. As suggested there and by HABO, all solutions seem to require either a scan for the predicate of the request on every batch, or the use of a temporary table to store the keys of rows matching the predicate in one go and use that for each batch (which should always be faster, since the PK is always indexed).
- It would seem that there is no easy way to do a copy/swap while keeping operations online either. Again, see this for an approach where you manually set up a unioned view.
- If the rest of your dataset is pretty small (fast to copy in its entirety) and you don't need to keep it online, you may use the more straightforward approach above. Disclaimer: check with your DBA if you have one; this may be dangerous if you're not 100% sure of what you're doing.
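To make the temporary-table-of-keys idea above concrete, here is a minimal sketch in Python against an in-memory SQLite database (the table, column names, and predicate are all hypothetical stand-ins; on SQL Server you would use a #temp table instead). The predicate is scanned once into a keys table, and each batch then updates a slice of those keys in its own transaction.

```python
import sqlite3

# Hypothetical schema: null out "bar" for rows matching a predicate, in batches.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foobar (id INTEGER PRIMARY KEY, foo INTEGER, bar INTEGER)")
conn.executemany("INSERT INTO foobar (id, foo, bar) VALUES (?, ?, ?)",
                 [(i, i % 10, i) for i in range(1, 1001)])
conn.commit()

# Scan the predicate once, storing only the matching keys.
conn.execute("CREATE TEMP TABLE keys AS SELECT id FROM foobar WHERE foo < 5")

batch_size = 100
while True:
    # Take the next slice of keys; subsequent lookups are cheap PK lookups.
    batch = [r[0] for r in conn.execute(
        "SELECT id FROM keys ORDER BY id LIMIT ?", (batch_size,))]
    if not batch:
        break
    placeholders = ",".join("?" * len(batch))
    conn.execute(f"UPDATE foobar SET bar = NULL WHERE id IN ({placeholders})", batch)
    conn.execute(f"DELETE FROM keys WHERE id IN ({placeholders})", batch)
    conn.commit()  # one transaction per batch keeps each unit of work small
```

The point of the keys table is that the (possibly expensive) predicate runs exactly once; every batch afterwards only touches indexed primary keys.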
How to prevent update statement from setting some records to null?
The best way is to use exists in the where clause:
update t1 x
set x.code = (select code
              from t2 y
              where x.address = y.address
                and x.city = y.city
                and x.state = y.state
                and x.flag = y.flag
                and rownum = 1)
where x.aflag like '%b%'
  and exists (select code
              from t2 y
              where x.address = y.address
                and x.city = y.city
                and x.state = y.state
                and x.flag = y.flag
                and rownum = 1);
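The difference the exists guard makes can be demonstrated with a small sketch in Python against an in-memory SQLite database (the t1/t2 schema is reduced to a single join column for brevity; table names mirror the answer above). Without the guard, every row with no match in t2 has its code overwritten with NULL.

```python
import sqlite3

# Reduced t1/t2 schema: match on a single "address" column for brevity.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (address TEXT, code TEXT)")
conn.execute("CREATE TABLE t2 (address TEXT, code TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?, ?)",
                 [("1 Main St", "old"), ("2 Oak Ave", "old")])
conn.execute("INSERT INTO t2 VALUES ('1 Main St', 'new')")

# Without the exists guard: the subquery finds no match for '2 Oak Ave',
# so that row's code is overwritten with NULL.
conn.execute("""UPDATE t1
                SET code = (SELECT y.code FROM t2 y WHERE t1.address = y.address)""")
print(conn.execute("SELECT address, code FROM t1 ORDER BY address").fetchall())
# [('1 Main St', 'new'), ('2 Oak Ave', None)]

# Reset, then add the exists guard: only rows with a match are touched.
conn.execute("UPDATE t1 SET code = 'old'")
conn.execute("""UPDATE t1
                SET code = (SELECT y.code FROM t2 y WHERE t1.address = y.address)
                WHERE EXISTS (SELECT 1 FROM t2 y WHERE t1.address = y.address)""")
print(conn.execute("SELECT address, code FROM t1 ORDER BY address").fetchall())
# [('1 Main St', 'new'), ('2 Oak Ave', 'old')]
```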
Update statement using a WHERE clause that contains columns with null Values
Since null = null evaluates to unknown (not true), you need to check whether two fields are both null in addition to the equality check:
UPDATE table_one SET table_one.x = table_two.y
FROM table_two
WHERE
(table_one.invoice_number = table_two.invoice_number
OR (table_one.invoice_number is null AND table_two.invoice_number is null))
AND
(table_one.submitted_by = table_two.submitted_by
OR (table_one.submitted_by is null AND table_two.submitted_by is null))
AND
-- etc
You could also use the coalesce function, which is more readable:
UPDATE table_one SET table_one.x = table_two.y
FROM table_two
WHERE
coalesce(table_one.invoice_number, '') = coalesce(table_two.invoice_number, '')
AND coalesce(table_one.submitted_by, '') = coalesce(table_two.submitted_by, '')
AND -- etc
But you need to be careful about the default value (the last argument to coalesce). Its data type should match the column type (so that you don't end up comparing dates with numbers, for example), and the default should be a value that doesn't appear in the data. E.g. coalesce(null, 1) = coalesce(1, 1) is a situation you'd want to avoid.
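A quick way to see all three behaviors (plain equality, the explicit is-null check, and the coalesce collision) is a few one-line queries. The sketch below uses Python's sqlite3, where the NULL comparison semantics are the same as in the SQL above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Plain equality: NULL = NULL is unknown (returned as NULL), not true,
# so a join on the raw columns silently drops NULL pairs.
print(conn.execute("SELECT NULL = NULL").fetchone())                    # (None,)

# The explicit both-null check does match NULL pairs.
print(conn.execute("SELECT NULL IS NULL AND NULL IS NULL").fetchone())  # (1,)

# The coalesce trick also matches them, but a default that occurs in the
# data makes a real value collide with NULL -- the case to avoid.
print(conn.execute("SELECT COALESCE(NULL, 1) = COALESCE(1, 1)").fetchone())  # (1,)
```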
Update (regarding performance):
Seq Scan on table_two - this suggests that you don't have any indexes on table_two. So if you update a row in table_one, then to find a matching row in table_two the database basically has to scan through all the rows one by one until it finds a match. The matching rows could be found much faster if the relevant columns were indexed. On the flip side, if table_one has any indexes, that slows down the update.
According to this performance guide:
Table constraints and indexes heavily delay every write. If possible, you should drop all the indexes, triggers and foreign keys while the update runs and recreate them at the end.
Another suggestion from the same guide that might be helpful is:
If you can segment your data using, for example, sequential IDs, you can update rows incrementally in batches.
So for example, if table_one has an id column, you could add something like and table_one.id between x and y to the where condition and run the query several times, changing the values of x and y so that all rows are covered.
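A minimal sketch of that incremental approach, using Python's sqlite3 with a made-up table_one (id, x) schema: the id range is walked in fixed-size slices, with one transaction per batch so no single transaction touches the whole table.

```python
import sqlite3

# Made-up table_one (id, x) schema; the update walks the id range in slices.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_one (id INTEGER PRIMARY KEY, x INTEGER)")
conn.executemany("INSERT INTO table_one (id, x) VALUES (?, 0)",
                 [(i,) for i in range(1, 1001)])
conn.commit()

batch = 250
lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM table_one").fetchone()
for start in range(lo, hi + 1, batch):
    # This is the "and table_one.id between x and y" condition from the
    # text, with x and y moving forward on each pass.
    conn.execute("UPDATE table_one SET x = 1 WHERE id BETWEEN ? AND ?",
                 (start, start + batch - 1))
    conn.commit()  # each slice commits separately, keeping transactions short
```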
The EXPLAIN ANALYZE option also took forever
You might want to be careful when using the ANALYZE option with EXPLAIN on statements with side effects.
According to the documentation:
Keep in mind that the statement is actually executed when the ANALYZE option is used. Although EXPLAIN will discard any output that a SELECT would return, other side effects of the statement will happen as usual.
PL/SQL: UPDATE inside CURSOR, but some data is NULL
If I'm understanding you correctly, you want to match lf with rm.lf, including when they're both null? If that's what you want, then this will do it:
...
AND (lf = rm.lf
OR (lf IS NULL AND rm.lf IS NULL)
)
...
It's comparing the values of lf and rm.lf, which evaluates to unknown (treated as not true) if either is null, so the OR condition is what returns true when they're both null.
Update data.table with mapply speed issue
A few notes here:
- As it stands now, using data.table for your need could be overkill (though not necessarily) and you could probably avoid it.
- You are growing objects in a loop (Column <- c(Column, x)) - don't do that. In your case there is no need: just create an empty vector of zeroes and you can get rid of most of your function.
- There is absolutely no need to create Column2 - it is just z - as R will automatically recycle it to fit the correct size.
- No need to recalculate nrow(addTable) per row either; that could just be an additional parameter.
- Your biggest overhead is calling data.table:::`[.data.table` per row - it is a very expensive function. The := function itself has very little overhead here. If you replace addTable[, First := First + Column] ; addTable[, Second := Second + Column2] with just addTable$First + Column ; addTable$Second + Column2, the run time is reduced from ~35 secs to ~2 secs. Another way to illustrate this is by replacing the two lines with set - e.g. set(addTable, j = "First", value = addTable[["First"]] + Column) ; set(addTable, j = "Second", value = addTable[["Second"]] + Column2) - which basically shares its source code with :=. This also runs in ~2 secs.
- Finally, it is better to reduce the number of operations per row. You could try accumulating the result using Reduce instead of updating the actual data set per row.
Let's see some examples
Your original function timings
library(data.table)
dt <- data.table(X= c(1:100), Y=c(.5, .7, .3, .4), Z=c(1:50000))
addTable <- data.table(First=0, Second=0, Term=c(1:50))
sample_fun <- function(x, y, z) {
    Column <- NULL
    while (x >= 1) {
        x <- x * y
        Column <- c(Column, x)
    }
    length(Column) <- nrow(addTable)
    Column[is.na(Column)] <- 0
    Column2 <- NULL
    Column2 <- rep(z, length(Column))
    addTable[, First := First + Column]
    addTable[, Second := Second + Column2]
}
system.time(mapply(sample_fun, dt$X, dt$Y, dt$Z))
# user system elapsed
# 30.71 0.00 30.78
30 secs is pretty slow...
1- Let's try removing the data.table:::`[.data.table` overhead
sample_fun <- function(x, y, z) {
    Column <- NULL
    while (x >= 1) {
        x <- x * y
        Column <- c(Column, x)
    }
    length(Column) <- nrow(addTable)
    Column[is.na(Column)] <- 0
    Column2 <- NULL
    Column2 <- rep(z, length(Column))
    addTable$First + Column
    addTable$Second + Column2
}
system.time(mapply(sample_fun, dt$X, dt$Y, dt$Z))
# user system elapsed
# 2.25 0.00 2.26
^ That was much faster but didn't update the actual data set.
2- Now let's try replacing it with set, which will have the same effect as := but without the data.table:::`[.data.table` overhead
sample_fun <- function(x, y, z) {
    Column <- NULL
    while (x >= 1) {
        x <- x * y
        Column <- c(Column, x)
    }
    length(Column) <- nrow(addTable)
    Column[is.na(Column)] <- 0
    Column2 <- NULL
    Column2 <- rep(z, length(Column))
    set(addTable, j = "First", value = addTable[["First"]] + Column)
    set(addTable, j = "Second", value = addTable[["Second"]] + Column2)
}
system.time(mapply(sample_fun, dt$X, dt$Y, dt$Z))
# user system elapsed
# 2.96 0.00 2.96
^ Well, that was also much faster than 30 secs and had the exact same effect as :=
3- Let's try it without using data.table at all
dt <- data.frame(X= c(1:100), Y=c(.5, .7, .3, .4), Z=c(1:50000))
addTable <- data.frame(First=0, Second=0, Term=c(1:50))
sample_fun <- function(x, y, z) {
    Column <- NULL
    while (x >= 1) {
        x <- x * y
        Column <- c(Column, x)
    }
    length(Column) <- nrow(addTable)
    Column[is.na(Column)] <- 0
    Column2 <- NULL
    Column2 <- rep(z, length(Column))
    return(list(Column, Column2))
}
system.time(res <- mapply(sample_fun, dt$X, dt$Y, dt$Z))
# user system elapsed
# 1.34 0.02 1.36
^ That's even faster
Now we can use Reduce combined with accumulate = TRUE in order to create those vectors
system.time(addTable$First <- Reduce(`+`, res[1, ], accumulate = TRUE)[[nrow(dt)]])
# user system elapsed
# 0.07 0.00 0.06
system.time(addTable$Second <- Reduce(`+`, res[2, ], accumulate = TRUE)[[nrow(dt)]])
# user system elapsed
# 0.07 0.00 0.06
Well, everything combined is now under 2 seconds (instead of 30 with your original function).
4- Further improvements could be to fix the other elements in your function (as pointed above), in other words, your function could be just
sample_fun <- function(x, y, n) {
    Column <- numeric(n)
    i <- 1L
    while (x >= 1) {
        x <- x * y
        Column[i] <- x
        i <- i + 1L
    }
    return(Column)
}
system.time(res <- Map(sample_fun, dt$X, dt$Y, nrow(addTable)))
# user system elapsed
# 0.72 0.00 0.72
^ A twofold improvement in speed
Now, we didn't even bother creating Column2 as we already have it in dt$Z. We also used Map instead of mapply as it will be easier for Reduce to work with a list than a matrix.
The next step is similar to as before
system.time(addTable$First <- Reduce(`+`, res, accumulate = TRUE)[[nrow(dt)]])
# user system elapsed
# 0.07 0.00 0.07
But we could improve this even further. Instead of using Map/Reduce we could create a matrix using mapply and then run matrixStats::rowCumsums over it (which is written in C++ internally) in order to calculate addTable$First.
system.time(res <- mapply(sample_fun, dt$X, dt$Y, nrow(addTable)))
# user system elapsed
# 0.76 0.00 0.76
system.time(addTable$First2 <- matrixStats::rowCumsums(res)[, nrow(dt)])
# user system elapsed
# 0 0 0
While the final step is simply summing dt$Z
system.time(addTable$Second <- sum(dt$Z))
# user system elapsed
# 0 0 0
So eventually we went from ~30 secs to less than a second.
Some final notes
- As it seems like the main overhead remained in the function itself, you could also try rewriting it using Rcpp, as loops seem inevitable in this case (though the remaining overhead is not so big).
Update Field When Not Null
Do this:
UPDATE newspapers
SET scan_notes = "data",
scan_entered_by = "some_name",
scan_modified_date = "current_unix_timestamp",
scan_created_date = COALESCE(scan_created_date, "current_unix_timestamp")
WHERE id = X
The COALESCE function picks the first non-null value. In this case, it will keep scan_created_date at its existing value if one exists; otherwise it will take whatever you replace "current_unix_timestamp" with.
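Here is a small runnable sketch of the same COALESCE pattern, using Python's sqlite3 with integer stand-ins for the unix timestamps (the newspapers schema is trimmed to the relevant columns): the created date is set only where it was still null, while the modified date is always overwritten.

```python
import sqlite3

# Trimmed "newspapers" schema with integer stand-ins for unix timestamps.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE newspapers
                (id INTEGER PRIMARY KEY, scan_notes TEXT,
                 scan_created_date INTEGER, scan_modified_date INTEGER)""")
conn.executemany("INSERT INTO newspapers VALUES (?, ?, ?, ?)",
                 [(1, None, None, None),    # never scanned: no created date yet
                  (2, "old", 1000, 1000)])  # already scanned at t=1000

now = 2000  # stand-in for the current unix timestamp
for row_id in (1, 2):
    conn.execute("""UPDATE newspapers
                    SET scan_notes = 'data',
                        scan_modified_date = ?,
                        scan_created_date = COALESCE(scan_created_date, ?)
                    WHERE id = ?""", (now, now, row_id))

# Row 1 gets now as its created date; row 2 keeps its original 1000.
print(conn.execute("SELECT id, scan_created_date, scan_modified_date "
                   "FROM newspapers ORDER BY id").fetchall())
# [(1, 2000, 2000), (2, 1000, 2000)]
```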