Is Substr or Like Faster in Oracle

Is substr or LIKE faster in Oracle?

Assuming maximum performance is the goal, I would ideally choose SUBSTR(my_field,1,6) and create a function-based index to support the query.

CREATE INDEX my_substr_idx
    ON my_table( substr( my_field,1,6 ) );

As others point out, SUBSTR(my_field,1,6) would not be able to use a regular index on MY_FIELD. The LIKE version might use the index, but the optimizer's cardinality estimates in that case are generally rather poor so it is quite likely to either not use an index when it would be helpful or to use an index when a table scan would be preferable. Indexing the actual expression will give the optimizer far more information to work with so it is much more likely to pick the index correctly. Someone smarter than I am may be able to suggest a way to use statistics on virtual columns in 11g to give the optimizer better information for the LIKE query.

If 6 is a variable (i.e. you sometimes want to search the first 6 characters and sometimes want to search a different number), you probably won't be able to come up with a function-based index to support that query. In that case, you're probably better off with the vagaries of the optimizer's decisions with the LIKE formulation.

Performance and Readability of REGEXP_SUBSTR vs INSTR and SUBSTR

I already posted an answer showing how to solve this problem using INSTR and SUBSTR the right way.

In this "Answer" I address the other question - which solution is more efficient. I will explain the test below, but here is the bottom line: the REGEXP solution takes 40 times longer than the INSTR/SUBSTR solution.

Setup: I created a table with 1.5 million random strings (all exactly eight characters long, all upper-case letters). Then I modified 10% of the strings to add the substring 'PLE', another 10% to add a '#' and another 10% to add 'ALL'. I did this by splitting an original string at position mod(rownum, 9) - that is a number between 0 and 8 - and concatenating 'PLE' or '#' or 'ALL' at that position. Granted, not the most efficient or elegant way to get the kind of test data we needed, but that is irrelevant - the point is just to create the test data and use it in our tests.

So: we now have a table with just one column, data1, with some random strings in 1.5 million rows. 10% each have the substring PLE or # or ALL in them.

The test consists in creating the new string data2 as in the original post. I am not inserting the result back in the table; regardless of how data2 is calculated, the time to insert it back in the table should be the same.

Instead, I put the main query inside an outer one that computes the sum of the lengths of the resulting data2 values. This way I guarantee the optimizer can't take shortcuts: all data2 values must be generated, their lengths must be measured, and then summed together.

Below are the statements needed to create the base table, which I called table_z, then the queries I ran.

create table table_z as
select dbms_random.string('U', 8) as data1 from dual
connect by level <= 1500000;

update table_z 
set data1 = case
when rownum between      1 and 150000 then substr(data1, 1, mod(rownum, 9)) 
                               || 'PLE' || substr(data1, mod(rownum, 9) + 1)
when rownum between 150001 and 300000 then substr(data1, 1, mod(rownum, 9)) 
                               || '#'   || substr(data1, mod(rownum, 9) + 1)
when rownum between 300001 and 450000 then substr(data1, 1, mod(rownum, 9)) 
                               || 'ALL' || substr(data1, mod(rownum, 9) + 1)
          end
where rownum <= 450000;

commit;

INSTR/SUBSTR solution

select sum(length(data2))
from (
select data1, 
       case 
         when instr(data1, 'PLE', 2) > 0 then substr(data1, 1, instr(data1, 'PLE', 2) - 1)
         when instr(data1, '#'  , 2) > 0 then substr(data1, 1, instr(data1, '#'  , 2) - 1)
         when instr(data1, 'ALL', 2) > 0 then substr(data1, 1, instr(data1, 'ALL', 2) - 1)
         else data1 end
       as data2
from   table_z
);

SUM(LENGTH(DATA2))
------------------
          10713352

1 row selected.

Elapsed: 00:00:00.73

REGEXP solution

select sum(length(data2))
from (
select data1, 
       COALESCE(REGEXP_SUBSTR(DATA1, '(.+?)PLE',1,1,null,1)
                            ,REGEXP_SUBSTR(DATA1, '(.+?)#',1,1,null,1)
                            ,REGEXP_SUBSTR(DATA1, '(.+?)ALL',1,1,null,1)
                            ,DATA1)
       as data2
from   table_z
);

SUM(LENGTH(DATA2))
------------------
          10713352

1 row selected.

Elapsed: 00:00:30.75

Before anyone suggests these things: I repeated both queries several times; the first solution always runs in 0.75 to 0.80 seconds, the second query runs in 30 to 35 seconds. More than 40 times slower. (So it is not a matter of the compiler/optimizer spending time to compile the query; it is really the execution time.) Also, this has nothing to do with reading the 1.5 million values from the base table - that is the same in both tests, and it takes far less time than the processing. In any case, I ran the INSTR/SUBSTR query first, so if there was any caching, the REGEXP query would have been the one to benefit.

Edit: I just figured out one inefficiency in the proposed REGEXP solution. If we anchor the search pattern to the beginning of the string (for example '^(.+?)PLE', notice the ^ anchor), the runtime for the REGEXP query drops from 30 seconds to 10 seconds. Apparently the Oracle implementation isn't smart enough to recognize this equivalence and tries searches from the second character, from the third, etc. Still the execution time is almost 15 times longer; 15 < 40 but that is still a very large difference.

Performance of SQL comparison using substring vs like with wildcard

I found this reference in an IBM redbook related to SQL performance. It sounds like the SUBSTR scalar function can be handled in an optimized manner by an iSeries.

If you search for the first character and want to use the SQE instead
of the CQE, you can use the scalar function substring on the left sign
of the equal sign. If you have to search for additional characters in
the string, you can additionally use the scalar function POSSTR. By
splitting the LIKE predicate into several scalar function, you can
affect the query optimizer to use the SQE.

http://publib-b.boulder.ibm.com/abstracts/sg246654.html?Open

Is there a SUBSTR alternative in Oracle SQL

You don't need regular expressions. INSTR and SUBSTR will work (and are faster):

Oracle Setup:

CREATE TABLE test_data ( str ) AS
  SELECT 'F-123-A123-B12'  FROM DUAL UNION ALL
  SELECT 'F-1234-A123-B12' FROM DUAL

Query 1:

SELECT SUBSTR( str, 1,              delimiter1 - 1              ) AS substr1,
       SUBSTR( str, delimiter1 + 1, delimiter2 - delimiter1 - 1 ) AS substr2,
       SUBSTR( str, delimiter2 + 1, delimiter3 - delimiter2 - 1 ) AS substr3,
       SUBSTR( str, delimiter3 + 1                              ) AS substr4
FROM   (
  SELECT str,
         INSTR( str, '-', 1, 1 ) AS delimiter1,
         INSTR( str, '-', 1, 2 ) AS delimiter2,
         INSTR( str, '-', 1, 3 ) AS delimiter3
  FROM   test_data
) s

Output:


SUBSTR1 | SUBSTR2 | SUBSTR3 | SUBSTR4
:------ | :------ | :------ | :------
F       | 123     | A123    | B12    
F       | 1234    | A123    | B12

If you do want to use regular expressions, then there is no need for a hierarchical query:

Query 2:

SELECT REGEXP_SUBSTR( str, '[^-]+', 1, 1 ) AS substr1,
       REGEXP_SUBSTR( str, '[^-]+', 1, 2 ) AS substr2,
       REGEXP_SUBSTR( str, '[^-]+', 1, 3 ) AS substr3,
       REGEXP_SUBSTR( str, '[^-]+', 1, 4 ) AS substr4
FROM   test_data

(Output as Query 1 above.)

Query 3

If you don't know how many delimited values there will be and want to parse them all to rows then you still don't need to use (slow) regular expressions or hierarchical queries and can just use a recursive sub-query factoring clause with simple string functions (and it works with zero-width/NULL sub-strings between delimiters):

WITH substr_bounds ( str, idx, startidx, endidx ) AS (
  SELECT str,
         1,
         1,
         INSTR( str, '-', 1 )
  FROM   test_data
UNION ALL
  SELECT str,
         idx + 1,
         endidx + 1,
         INSTR( str, '-', endidx + 1 )
  FROM   substr_bounds
  WHERE  endidx > 0
)
SELECT str,
       idx,
       CASE
       WHEN endidx = 0
       THEN SUBSTR( str, startidx )
       ELSE SUBSTR( str, startidx, endidx - startidx )
       END AS substr
FROM   substr_bounds
ORDER BY str, idx

Output:


STR             | IDX | SUBSTR
:-------------- | --: | :-----
F-123-A123-B12  |   1 | F     
F-123-A123-B12  |   2 | 123   
F-123-A123-B12  |   3 | A123  
F-123-A123-B12  |   4 | B12   
F-1234-A123-B12 |   1 | F     
F-1234-A123-B12 |   2 | 1234  
F-1234-A123-B12 |   3 | A123  
F-1234-A123-B12 |   4 | B12

db<>fiddle here

Oracle: instr+substr instead of regexp_substr

SELECT CASE 
       WHEN start_24 > 0
       THEN TRIM(
              SUBSTR(
                value,
                start_24 + 5,
                INSTR(value, '|', start_24 + 5) - (start_24+5)
              )
           )
       END AS "24",
       CASE 
       WHEN start_35 > 0
       THEN TRIM(
              SUBSTR(
                value,
                start_35 + 5,
                INSTR(value, '|', start_35 + 5) - (start_35+5)
              )
           )
       END AS "35",
       CASE 
       WHEN start_47a > 0
       THEN TRIM(
              SUBSTR(
                value,
                start_47a + 6,
                INSTR(value, '|', start_47a + 6) - (start_47a+6)
              )
           )
       END AS "47A",
       CASE 
       WHEN start_98a > 0
       THEN TRIM(
              SUBSTR(
                value,
                start_98a + 6,
                INSTR(value, '|', start_98a + 6) - (start_98a+6)
              )
           )
       END AS "98A"
FROM   (
  SELECT value,
         INSTR(value, '| 24=') AS start_24,
         INSTR(value, '| 35=') AS start_35,
         INSTR(value, '| 47A=') AS start_47a,
         INSTR(value, '| 98A=') AS start_98a
  FROM   (
    SELECT '| ' || value AS value FROM table_name
  )
);

Which, for your sample data, outputs:

24 35 47A 98A
88234.00 Shawn This is a comment
123.00 Derick This is a comment

What is the difference between INSTR and LIKE in Oracle?
That depends on the data and on the pattern. If you use like 'a%', then Oracle can use a BTree indexes to look up the matches because it can search the btree with the start of the pattern and then consider only the subtree.
This doesn't work for LIKE '%a' but you can work around this by creating a calculated column which reverses all values from the column you want to search (so you get the pattern above).
If you use hashed indexes, there is little that Oracle can do but scan the whole index. It might sill be faster when there are only few different values.
I'm not sure whether INSTR can ever use an index because it has no fixed anchor.
So like with all performance questions:
Fill the database with some realistic test data and run some tests.
Always write your code in a way that it can be optimized easily later, when you know about the bottlenecks
Never guess what might be slow. You'll be wrong 90% of the time. Always measure.

Which is faster — INSTR or LIKE?
FULLTEXT searches are absolutely going to be faster, as kibibu noted in the comments above.
However:
mysql> select COUNT(ID) FROM table WHERE INSTR(Name,'search') > 0;
+-----------+
| COUNT(ID) |
+-----------+
|     40735 | 
+-----------+
1 row in set (5.54 sec)

mysql> select COUNT(ID) FROM table WHERE Name LIKE '%search%';
+-----------+
| COUNT(ID) |
+-----------+
|     40735 | 
+-----------+
1 row in set (5.54 sec)
In my tests, they perform exactly the same. They are both case-insensitive, and generally they perform full-table scans, a general no-no when dealing with high-performance MySQL.
Unless you are doing a prefix search on an indexed column:
mysql> select COUNT(ID) FROM table WHERE Name LIKE 'search%';
+-----------+
| COUNT(ID) |
+-----------+
|         7 | 
+-----------+
1 row in set (3.88 sec)
In which case, the LIKE with only a suffix wildcard is much faster.

Related Topics

Cascade Update Oracle
Informix 7.3 Isql Insert Statement - Text/Blob/Clob Field Insert Error
Jooq - Support for Update ... Set ... Query with Arbitrary Degree
Index on Timestamp: Functions in Index Expression Must Be Marked as Immutable
Concatenate a Selected Column in a Single Query
T-SQL Query:Getting Child Nodes of a Parent
Mysql: Select N Rows, But with Only Unique Values in One Column
Text Was Truncated or One or More Characters Had No Match in the Target Code Page Including the Primary Key in an Unpivot
MySQL - Select All Except What Is in This Table
Efficiently Mapping One-To-Many Many-To-Many Database to Struct in Golang
Differencebetween ";" and "Go" in T-Sql
How to Select the Set of Rows Where Each Item Has the Greatest Timestamp
Resolve Hostnames with T-Sql
Oracle Insert into Two Tables in One Query
How to Convert SQL Unpivot Query to Hana SQL
Transposing Rows in to Colums in SQL Server 2005
Select All Records Don't Meet Certain Conditions in a Joined Table
What's the Best Way to Store Sort Order in SQL
Leave a reply

Submit