Any Performance Impact in Oracle for Using Like 'String' VS = 'String'

Any performance impact in Oracle for using LIKE 'string' vs = 'string'?

There is a clear difference when you use bind variables, which you should be using in Oracle for anything other than data warehousing or other bulk data operations.

Take the case of:

SELECT * FROM SOME_TABLE WHERE SOME_FIELD LIKE :b1

Oracle cannot know that the value of :b1 is '%some_value%', or 'some_value' etc. until execution time, so it will make an estimation of the cardinality of the result based on heuristics and come up with an appropriate plan that either may or may not be suitable for various values of :b, such as '%A','%', 'A' etc.

Similar issues can apply with an equality predicate but the range of cardinalities that might result is much more easily estimated based on column statistics or the presence of a unique constraint, for example.

So, personally I wouldn't start using LIKE as a replacement for =. The optimizer is pretty easy to fool sometimes.

SQL 'like' vs '=' performance

See https://web.archive.org/web/20150209022016/http://myitforum.com/cs2/blogs/jnelson/archive/2007/11/16/108354.aspx

Quote from there:

the rules for index usage with LIKE
are loosely like this:

  • If your filter criteria uses equals =
    and the field is indexed, then most
    likely it will use an INDEX/CLUSTERED
    INDEX SEEK

  • If your filter criteria uses LIKE,
    with no wildcards (like if you had a
    parameter in a web report that COULD
    have a % but you instead use the full
    string), it is about as likely as #1
    to use the index. The increased cost
    is almost nothing.

  • If your filter criteria uses LIKE, but
    with a wildcard at the beginning (as
    in Name0 LIKE '%UTER') it's much less
    likely to use the index, but it still
    may at least perform an INDEX SCAN on
    a full or partial range of the index.

  • HOWEVER, if your filter criteria uses
    LIKE, but starts with a STRING FIRST
    and has wildcards somewhere AFTER that
    (as in Name0 LIKE 'COMP%ER'), then SQL
    may just use an INDEX SEEK to quickly
    find rows that have the same first
    starting characters, and then look
    through those rows for an exact match.


(Also keep in mind, the SQL engine
still might not use an index the way
you're expecting, depending on what
else is going on in your query and
what tables you're joining to. The
SQL engine reserves the right to
rewrite your query a little to get the
data in a way that it thinks is most
efficient and that may include an
INDEX SCAN instead of an INDEX SEEK)

Equals(=) vs. LIKE

Different Operators

LIKE and = are different operators. Most answers here focus on the wildcard support, which is not the only difference between these operators!

= is a comparison operator that operates on numbers and strings. When comparing strings, the comparison operator compares whole strings.

LIKE is a string operator that compares character by character.

To complicate matters, both operators use a collation which can have important effects on the result of the comparison.

Motivating Example

Let us first identify an example where these operators produce obviously different results. Allow me to quote from the MySQL manual:

Per the SQL standard, LIKE performs matching on a per-character basis, thus it can produce results different from the = comparison operator:

mysql> SELECT 'ä' LIKE 'ae' COLLATE latin1_german2_ci;
+-----------------------------------------+
| 'ä' LIKE 'ae' COLLATE latin1_german2_ci |
+-----------------------------------------+
| 0 |
+-----------------------------------------+
mysql> SELECT 'ä' = 'ae' COLLATE latin1_german2_ci;
+--------------------------------------+
| 'ä' = 'ae' COLLATE latin1_german2_ci |
+--------------------------------------+
| 1 |
+--------------------------------------+

Please note that this page of the MySQL manual is called String Comparison Functions, and = is not discussed, which implies that = is not strictly a string comparison function.

How Does = Work?

The SQL Standard § 8.2 describes how = compares strings:

The comparison of two character strings is determined as follows:

a) If the length in characters of X is not equal to the length
in characters of Y, then the shorter string is effectively
replaced, for the purposes of comparison, with a copy of
itself that has been extended to the length of the longer
string by concatenation on the right of one or more pad
characters, where the pad character is chosen based on CS. If
CS has the NO PAD attribute, then the pad character is an
implementation-dependent character different from any
character in the character set of X and Y that collates less
than any string under CS. Otherwise, the pad character is a
<space>.

b) The result of the comparison of X and Y is given by the
collating sequence CS.

c) Depending on the collating sequence, two strings may
compare as equal even if they are of different lengths or
contain different sequences of characters. When the operations
MAX, MIN, DISTINCT, references to a grouping column, and the
UNION, EXCEPT, and INTERSECT operators refer to character
strings, the specific value selected by these operations from
a set of such equal values is implementation-dependent.

(Emphasis added.)

What does this mean? It means that when comparing strings, the = operator is just a thin wrapper around the current collation. A collation is a library that has various rules for comparing strings. Here is an example of a binary collation from MySQL:

static int my_strnncoll_binary(const CHARSET_INFO *cs __attribute__((unused)),
const uchar *s, size_t slen,
const uchar *t, size_t tlen,
my_bool t_is_prefix)
{
size_t len= MY_MIN(slen,tlen);
int cmp= memcmp(s,t,len);
return cmp ? cmp : (int)((t_is_prefix ? len : slen) - tlen);
}

This particular collation happens to compare byte-by-byte (which is why it's called "binary" — it doesn't give any special meaning to strings). Other collations may provide more advanced comparisons.

For example, here is a UTF-8 collation that supports case-insensitive comparisons. The code is too long to paste here, but go to that link and read the body of my_strnncollsp_utf8mb4(). This collation can process multiple bytes at a time and it can apply various transforms (such as case insensitive comparison). The = operator is completely abstracted from the vagaries of the collation.

How Does LIKE Work?

The SQL Standard § 8.5 describes how LIKE compares strings:

The <predicate>

M LIKE P

is true if there exists a partitioning of M into substrings
such that:

i) A substring of M is a sequence of 0 or more contiguous
<character representation>s of M and each <character
representation> of M is part of exactly one substring.

ii) If the i-th substring specifier of P is an arbitrary
character specifier, the i-th substring of M is any single
<character representation>.

iii) If the i-th substring specifier of P is an arbitrary string
specifier, then the i-th substring of M is any sequence of
0 or more <character representation>s.

iv) If the i-th substring specifier of P is neither an
arbitrary character specifier nor an arbitrary string specifier,
then the i-th substring of M is equal to that substring
specifier according to the collating sequence of
the <like predicate>, without the appending of <space>
characters to M, and has the same length as that substring
specifier.

v) The number of substrings of M is equal to the number of
substring specifiers of P.

(Emphasis added.)

This is pretty wordy, so let's break it down. Items ii and iii refer to the wildcards _ and %, respectively. If P does not contain any wildcards, then only item iv applies. This is the case of interest posed by the OP.

In this case, it compares each "substring" (individual characters) in M against each substring in P using the current collation.

Conclusions

The bottom line is that when comparing strings, = compares the entire string while LIKE compares one character at a time. Both comparisons use the current collation. This difference leads to different results in some cases, as evidenced in the first example in this post.

Which one should you use? Nobody can tell you that — you need to use the one that's correct for your use case. Don't prematurely optimize by switching comparison operators.

Is substr or LIKE faster in Oracle?

Assuming maximum performance is the goal, I would ideally choose SUBSTR(my_field,1,6) and create a function-based index to support the query.

CREATE INDEX my_substr_idx
ON my_table( substr( my_field,1,6 ) );

As others point out, SUBSTR(my_field,1,6) would not be able to use a regular index on MY_FIELD. The LIKE version might use the index, but the optimizer's cardinality estimates in that case are generally rather poor so it is quite likely to either not use an index when it would be helpful or to use an index when a table scan would be preferable. Indexing the actual expression will give the optimizer far more information to work with so it is much more likely to pick the index correctly. Someone smarter than I am may be able to suggest a way to use statistics on virtual columns in 11g to give the optimizer better information for the LIKE query.

If 6 is a variable (i.e. you sometimes want to search the first 6 characters and sometimes want to search a different number), you probably won't be able to come up with a function-based index to support that query. In that case, you're probably better off with the vagaries of the optimizer's decisions with the LIKE formulation.

Performance and Readability of REGEXP_SUBSTR vs INSTR and SUBSTR

I already posted an answer showing how to solve this problem using INSTR and SUBSTR the right way.

In this "Answer" I address the other question - which solution is more efficient. I will explain the test below, but here is the bottom line: the REGEXP solution takes 40 times longer than the INSTR/SUBSTR solution.

Setup: I created a table with 1.5 million random strings (all exactly eight characters long, all upper-case letters). Then I modified 10% of the strings to add the substring 'PLE', another 10% to add a '#' and another 10% to add 'ALL'. I did this by splitting an original string at position mod(rownum, 9) - that is a number between 0 and 8 - and concatenating 'PLE' or '#' or 'ALL' at that position. Granted, not the most efficient or elegant way to get the kind of test data we needed, but that is irrelevant - the point is just to create the test data and use it in our tests.

So: we now have a table with just one column, data1, with some random strings in 1.5 million rows. 10% each have the substring PLE or # or ALL in them.

The test consists in creating the new string data2 as in the original post. I am not inserting the result back in the table; regardless of how data2 is calculated, the time to insert it back in the table should be the same.

Instead, I put the main query inside an outer one that computes the sum of the lengths of the resulting data2 values. This way I guarantee the optimizer can't take shortcuts: all data2 values must be generated, their lengths must be measured, and then summed together.

Below are the statements needed to create the base table, which I called table_z, then the queries I ran.

create table table_z as
select dbms_random.string('U', 8) as data1 from dual
connect by level <= 1500000;

update table_z
set data1 = case
when rownum between 1 and 150000 then substr(data1, 1, mod(rownum, 9))
|| 'PLE' || substr(data1, mod(rownum, 9) + 1)
when rownum between 150001 and 300000 then substr(data1, 1, mod(rownum, 9))
|| '#' || substr(data1, mod(rownum, 9) + 1)
when rownum between 300001 and 450000 then substr(data1, 1, mod(rownum, 9))
|| 'ALL' || substr(data1, mod(rownum, 9) + 1)
end
where rownum <= 450000;

commit;

INSTR/SUBSTR solution

select sum(length(data2))
from (
select data1,
case
when instr(data1, 'PLE', 2) > 0 then substr(data1, 1, instr(data1, 'PLE', 2) - 1)
when instr(data1, '#' , 2) > 0 then substr(data1, 1, instr(data1, '#' , 2) - 1)
when instr(data1, 'ALL', 2) > 0 then substr(data1, 1, instr(data1, 'ALL', 2) - 1)
else data1 end
as data2
from table_z
);

SUM(LENGTH(DATA2))
------------------
10713352

1 row selected.

Elapsed: 00:00:00.73

REGEXP solution

select sum(length(data2))
from (
select data1,
COALESCE(REGEXP_SUBSTR(DATA1, '(.+?)PLE',1,1,null,1)
,REGEXP_SUBSTR(DATA1, '(.+?)#',1,1,null,1)
,REGEXP_SUBSTR(DATA1, '(.+?)ALL',1,1,null,1)
,DATA1)
as data2
from table_z
);

SUM(LENGTH(DATA2))
------------------
10713352

1 row selected.

Elapsed: 00:00:30.75

Before anyone suggests these things: I repeated both queries several times; the first solution always runs in 0.75 to 0.80 seconds, the second query runs in 30 to 35 seconds. More than 40 times slower. (So it is not a matter of the compiler/optimizer spending time to compile the query; it is really the execution time.) Also, this has nothing to do with reading the 1.5 million values from the base table - that is the same in both tests, and it takes far less time than the processing. In any case, I ran the INSTR/SUBSTR query first, so if there was any caching, the REGEXP query would have been the one to benefit.

Edit: I just figured out one inefficiency in the proposed REGEXP solution. If we anchor the search pattern to the beginning of the string (for example '^(.+?)PLE', notice the ^ anchor), the runtime for the REGEXP query drops from 30 seconds to 10 seconds. Apparently the Oracle implementation isn't smart enough to recognize this equivalence and tries searches from the second character, from the third, etc. Still the execution time is almost 15 times longer; 15 < 40 but that is still a very large difference.

Oracle query using 'like' on indexed number column, poor performance

LIKE pattern-matching condition expects to see character types as both left-side and right-side operands. When it encounters a NUMBER, it implicitly converts it to char. Your Query 1 is basically silently rewritten to this:

SELECT a1.*
FROM people a1
WHERE TO_CHAR(a1.id) LIKE '119%'
AND ROWNUM < 5

That happens in your case, and that is bad for 2 reasons:

  1. The conversion is executed for every row, which is slow;
  2. Because of a function (though implicit) in a WHERE predicate, Oracle is unable to use the index on A1.ID column.

To get around it, you need to do one of the following:

  1. Create a function-based index on A1.ID column:

    CREATE INDEX people_idx5 ON people (TO_CHAR(id));

  2. If you need to match records on first 3 characters of ID column, create another column of type NUMBER containing just these 3 characters and use a plain = operator on it.

  3. Create a separate column ID_CHAR of type VARCHAR2 and fill it with TO_CHAR(id). Index it and use instead of ID in your WHERE condition.

    Of course if you choose to create an additional column based on existing ID column, you need to keep those 2 synchronized.You can do that in batch as a single UPDATE, or in an ON-UPDATE trigger, or add that column to the appropriate INSERT and UPDATE statements in your code.

Performance of SQL comparison using substring vs like with wildcard

I found this reference in an IBM redbook related to SQL performance. It sounds like the SUBSTR scalar function can be handled in an optimized manner by an iSeries.

If you search for the first character and want to use the SQE instead
of the CQE, you can use the scalar function substring on the left sign
of the equal sign. If you have to search for additional characters in
the string, you can additionally use the scalar function POSSTR. By
splitting the LIKE predicate into several scalar function, you can
affect the query optimizer to use the SQE.

http://publib-b.boulder.ibm.com/abstracts/sg246654.html?Open

Optimization of text parsing - Oracle SQL

The CONTEXT index type is used to index long texts. You can use :

CREATE INDEX idx_TheTaskTxt ON example(TRIM(UPPER(TheTaskText))) INDEXTYPE IS CTXSYS.CONTEXT;

and collect statistics for the optimizer to take effect :

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'EXAMPLE', cascade=>TRUE);

and call

SELECT
TheTask AS Tasking,
SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'LONG', 1) > 0 THEN 1 ELSE 0 END) AS LongCount,
SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'TEXT', 1) > 0 THEN 1 ELSE 0 END) AS TextCount,
SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'ENGLISH', 1) > 0 THEN 1 ELSE 0 END) AS EnglishCount
FROM example
GROUP BY TheTask
HAVING
SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'LONG', 1) > 0 THEN 1 ELSE 0 END) *
SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'TEXT', 1) > 0 THEN 1 ELSE 0 END) *
SUM(CASE WHEN CONTAINS(TRIM(UPPER(TheTaskText)), 'ENGLISH', 1) > 0 THEN 1 ELSE 0 END)
IN (0,1)

Performance Difference in comparing integers and comparing strings

OK interesting question always took it as read that integer was quicker and never actually tested it. I took 1M Random Surnames and types from a list of Contacts from my data into a scratch database with no indices or primary key just raw data. No measurement was made over the range of my data in either of the columns has not been standardised so reflects the reality of my database rather than a pure statistical set.

select top 100 * from tblScratch where contactsurname = '<TestSurname>' order by NEWID()
select top 100 * from tblScratch where contacttyperef = 1-22 order by NEWID()

The Newid is there to randomise the data list out each time. Quickly ran this for 20 surnames and 20 types. Queries were run surname than ref then surname. Searching for the reference number was almost 4x quicker and used about 1/2 so the books were right all those years ago.

String -

SELECT TOP 100 * FROM tblScratch WHERE contactsurname = 'hoare' ORDER BY NEWID()

Duration 430ms
Reads 902
CPU 203

Integer -

SELECT TOP 100 * FROM tblScratch WHERE contacttyperef = 3 ORDER BY NEWID()

Duration 136ms
Reads 902
CPU 79


Related Topics



Leave a reply



Submit