Performance and Readability of Regexp_Substr VS Instr and Substr

Performance and Readability of REGEXP_SUBSTR vs INSTR and SUBSTR

I already posted an answer showing how to solve this problem using INSTR and SUBSTR the right way.

In this "Answer" I address the other question - which solution is more efficient. I will explain the test below, but here is the bottom line: the REGEXP solution takes 40 times longer than the INSTR/SUBSTR solution.

Setup: I created a table with 1.5 million random strings (all exactly eight characters long, all upper-case letters). Then I modified 10% of the strings to add the substring 'PLE', another 10% to add a '#' and another 10% to add 'ALL'. I did this by splitting an original string at position mod(rownum, 9) - that is a number between 0 and 8 - and concatenating 'PLE' or '#' or 'ALL' at that position. Granted, not the most efficient or elegant way to get the kind of test data we needed, but that is irrelevant - the point is just to create the test data and use it in our tests.

So: we now have a table with just one column, data1, with some random strings in 1.5 million rows. 10% each have the substring PLE or # or ALL in them.

The test consists in creating the new string data2 as in the original post. I am not inserting the result back in the table; regardless of how data2 is calculated, the time to insert it back in the table should be the same.

Instead, I put the main query inside an outer one that computes the sum of the lengths of the resulting data2 values. This way I guarantee the optimizer can't take shortcuts: all data2 values must be generated, their lengths must be measured, and then summed together.

Below are the statements needed to create the base table, which I called table_z, then the queries I ran.

create table table_z as
select dbms_random.string('U', 8) as data1 from dual
connect by level <= 1500000;

update table_z
set data1 = case
when rownum between 1 and 150000 then substr(data1, 1, mod(rownum, 9))
|| 'PLE' || substr(data1, mod(rownum, 9) + 1)
when rownum between 150001 and 300000 then substr(data1, 1, mod(rownum, 9))
|| '#' || substr(data1, mod(rownum, 9) + 1)
when rownum between 300001 and 450000 then substr(data1, 1, mod(rownum, 9))
|| 'ALL' || substr(data1, mod(rownum, 9) + 1)
end
where rownum <= 450000;

commit;

INSTR/SUBSTR solution

select sum(length(data2))
from (
select data1,
case
when instr(data1, 'PLE', 2) > 0 then substr(data1, 1, instr(data1, 'PLE', 2) - 1)
when instr(data1, '#' , 2) > 0 then substr(data1, 1, instr(data1, '#' , 2) - 1)
when instr(data1, 'ALL', 2) > 0 then substr(data1, 1, instr(data1, 'ALL', 2) - 1)
else data1 end
as data2
from table_z
);

SUM(LENGTH(DATA2))
------------------
10713352

1 row selected.

Elapsed: 00:00:00.73

REGEXP solution

select sum(length(data2))
from (
select data1,
COALESCE(REGEXP_SUBSTR(DATA1, '(.+?)PLE',1,1,null,1)
,REGEXP_SUBSTR(DATA1, '(.+?)#',1,1,null,1)
,REGEXP_SUBSTR(DATA1, '(.+?)ALL',1,1,null,1)
,DATA1)
as data2
from table_z
);

SUM(LENGTH(DATA2))
------------------
10713352

1 row selected.

Elapsed: 00:00:30.75

Before anyone suggests these things: I repeated both queries several times; the first solution always runs in 0.75 to 0.80 seconds, the second query runs in 30 to 35 seconds. More than 40 times slower. (So it is not a matter of the compiler/optimizer spending time to compile the query; it is really the execution time.) Also, this has nothing to do with reading the 1.5 million values from the base table - that is the same in both tests, and it takes far less time than the processing. In any case, I ran the INSTR/SUBSTR query first, so if there was any caching, the REGEXP query would have been the one to benefit.

Edit: I just figured out one inefficiency in the proposed REGEXP solution. If we anchor the search pattern to the beginning of the string (for example '^(.+?)PLE', notice the ^ anchor), the runtime for the REGEXP query drops from 30 seconds to 10 seconds. Apparently the Oracle implementation isn't smart enough to recognize this equivalence and tries searches from the second character, from the third, etc. Still the execution time is almost 15 times longer; 15 < 40 but that is still a very large difference.

Using REGEXP_SUBSTR with Strings Qualifier

The problem with your query is that if you use [^PLE] it would match any characters other than P or L or E. You are looking for an occurence of PLE consecutively. So, use

select REGEXP_SUBSTR(colname,'(.+)PLE',1,1,null,1) 
from tablename

This returns the substring up to the last occurrence of PLE in the string.

If the string contains multiple instances of PLE and only the substring up to the first occurrence needs to be extracted, use

select REGEXP_SUBSTR(colname,'(.+?)PLE',1,1,null,1) 
from tablename

Oracle Query on Regexp_substr

The [^\S] pattern matches any char but a \ and a capital letter S.

You need to capture 1+ digits followed with - and again 1+ digits after a # and extract that part of the match only with

regexp_substr(PROPERTY, '#(\d+-\d+)', 1, 1, NULL, 1)

The last 1 argument tells the regex engine to extract the contents of the capturing group with ID 1 (it is the only capturing group in the pattern anyway).

  • # - a literal #
  • (\d+-\d+) - Capturing group 1 matching:

    • \d+ - 1 or more digits
    • - - a hyphen
    • \d+ - 1 or more digits.

See the online demo:

select regexp_substr('#05-08 DOVER PARKVIEW', '#(\d+-\d+)', 1, 1, NULL, 1) as RESULT from dual

Sample Image

Looking for SQL query to split the value into another column

Basically, if the server name is always the last value in the attributes (comma-separated) list, then don't bother looking at the rest of the attributes at all.

SELECT 
upper(regexp_substr(db_attributes, '[^,]+$',1,1)) AS servername
, db_attributes
FROM
tablename

UPDATE

And, if you want to test it, here is an SQLFiddle link.

How to Select a substring in Oracle SQL up to a specific character?

Using a combination of SUBSTR, INSTR, and NVL (for strings without an underscore) will return what you want:

SELECT NVL(SUBSTR('ABC_blah', 0, INSTR('ABC_blah', '_')-1), 'ABC_blah') AS output
FROM DUAL

Result:

output
------
ABC

Use:

SELECT NVL(SUBSTR(t.column, 0, INSTR(t.column, '_')-1), t.column) AS output
FROM YOUR_TABLE t

Reference:

  • SUBSTR
  • INSTR

Addendum

If using Oracle10g+, you can use regex via REGEXP_SUBSTR.

SQL REGEX not working like its expected to

The problematic part of your query can be reduced to:

begin 
dbms_output.put_line(
REGEXP_SUBSTR(
'first@hotmail.net',
'\.([^.\n\s]*)$'
)
);
end;
/

The regular expression \.([^.\n\s]*)$ is looking for:

  1. A dot character \.; then
  2. Zero-or-more characters that are not a dot . or a slash \ or an n or a slash \ or an s; then
  3. The end-of-the-string.

The problem is that your string has an n character and the regular expression is excluding n as \n is interpreted as two characters and not the perl-like expression representing a single newline character. You want to replace \n with the CHR(10) character outside the string literal (or a newline inside it) and \s with the posix-expression [:space:].

What you want is:

begin 
dbms_output.put_line(
REGEXP_SUBSTR(
'first@hotmail.net',
'\.([^.' || CHR(10) || '[:space:]]*)$'
)
);
end;
/

or

begin 
dbms_output.put_line(
REGEXP_SUBSTR(
'first@hotmail.net',
'\.([^.
[:space:]]*)$'
)
);
end;
/

db<>fiddle here

how to get the string before first occurrence of a special character

SELECT REGEXP_SUBSTR(HOSTNAMES, '[^.]+', 1, 1) FROM MYTABLE;

Splitting string into 3 parts based on Space

regexp functions were very slow before Oracle 12. And even on Oracle 12+ they are slower then substr/instr functions, so it would be much faster to use substr/instr.
The only problem that you need to remove extra space characters in your strings, but there is very fast solution using replace function: replace(replace(Fullname,' ',' *'),'* ')

Example:

with input_table(Fullname) as (
select * from table(ku$_vcnt(
'Test1',
'Test1 Test2',
'Test1 Test2 Test3',
'Test1 Test2 Test3',
'Test1 Test2 Test3 Test4 Test5',
'Test1 Test2 Test3 Test4 Test5'
))
)
select
Fullname,
trim(replace(replace(replace(Fullname,' ',' *'),'* '),'*')) Fullname_2
from input_table;

Result:

FULLNAME                            FULLNAME_2
----------------------------------- -----------------------------------
Test1 Test1
Test1 Test2 Test1 Test2
Test1 Test2 Test3 Test1 Test2 Test3
Test1 Test2 Test3 Test1 Test2 Test3
Test1 Test2 Test3 Test4 Test5 Test1 Test2 Test3 Test4 Test5
Test1 Test2 Test3 Test4 Test5 Test1 Test2 Test3 Test4 Test5

6 rows selected.

So now we can easily split Fullname by ' ':

with input_table(Fullname) as (
select * from table(ku$_vcnt(
'Test1',
'Test1 Test2',
'Test1 Test2 Test3',
'Test1 Test2 Test3',
'Test1 Test2 Test3 Test4 Test5',
'Test1 Test2 Test3 Test4 Test5'
))
)
select
v.*,
substr(Fullname,1,instr(Fullname||' ',' ')) FNAME_substr,
substr(Fullname,instr(Fullname,' ')+1,instr(Fullname||' ',' ',1,2)-instr(Fullname,' ')) SNAME_substr,
substr(Fullname||' ',instr(Fullname||' ',' ',1,2)+1) LNAME_substr
from
(
select
Fullname as Fullname_old,
trim(replace(replace(replace(Fullname,' ',' *'),'* '),'*')) as Fullname
from input_table
) v;

Results:

FULLNAME_OLD                        FULLNAME                            FNAME_SUBS SNAME_SUBS LNAME_SUBSTR
----------------------------------- ----------------------------------- ---------- ---------- --------------------
Test1 Test1 Test1
Test1 Test2 Test1 Test2 Test1 Test2
Test1 Test2 Test3 Test1 Test2 Test3 Test1 Test2 Test3
Test1 Test2 Test3 Test1 Test2 Test3 Test1 Test2 Test3
Test1 Test2 Test3 Test4 Test5 Test1 Test2 Test3 Test4 Test5 Test1 Test2 Test3 Test4 Test5
Test1 Test2 Test3 Test4 Test5 Test1 Test2 Test3 Test4 Test5 Test1 Test2 Test3 Test4 Test5

6 rows selected.

Example with chr(0) instead of *:
https://dbfiddle.uk/?rdbms=oracle_18&fiddle=802e9b10e8a1089b148afbd13e5ef2eb

with input_table(Fullname) as (
select * from table(ku$_vcnt(
'Test1',
'Test1 Test2',
'Test1 Test2 Test3',
'Test1 Test2 Test3',
'Test1 Test2 Test3 Test4 Test5',
'Test1 Test2 Test3 Test4 Test5',
'Test1 Test2 Test3 Test4 Test5',
'Test1 Test2 Test3 Test4 Test5'
))
)
select
v.*,
substr(Fullname,1,instr(Fullname||' ',' ')) FNAME_substr,
substr(Fullname,instr(Fullname,' ')+1,instr(Fullname||' ',' ',1,2)-instr(Fullname,' ')) SNAME_substr,
substr(Fullname||' ',instr(Fullname||' ',' ',1,2)+1) LNAME_substr
from
(
select
Fullname as Fullname_old,
trim(replace(replace(replace(Fullname,' ',' '||chr(0)),chr(0)||' '),chr(0))) as Fullname
from input_table
) v;

Oracle SQL Get Last String in Parentheses (may include parentheses inside too)

The solution below uses plain SQL (no procedure/function); it works for any level of nested parentheses and "same-level" parentheses; and it returns null whenever the input is null, or it doesn't contain any right parentheses, or it contains a right parenthesis but the right-most right parenthesis is unbalanced (there is no left parenthesis, to the left of this right-most right parenthesis, so that the pair is balanced).

At the very bottom I will show the minor adjustment needed to return "the result" only if the right-most right parenthesis is the very last character in the input string, otherwise return null. This was the OP's edited requirement.

I created several more input strings for testing. Notice in particular id = 156, a case in which a smart parser wouldn't "count" parentheses that are within string literals, or in some other way aren't "normal" parentheses. My solution does NOT go that far - it treats all parentheses the same.

The strategy is to start from the position of the right-most right parenthesis (if there is at least one), and to move left from there, step by step, going only through left parentheses (if there are any) and to test if the parentheses are balanced. That is done easily by comparing the length of the "test string" after all ) are removed vs. length after all ( are removed.

Bonus: I was able to write the solution without regular expressions, using only "standard" (non-regexp) string functions. This should help keep it fast.

Query:

with
species_str ( id, name) as (
select 100, 'CfwHE3 (HH3d) Jt1 (CD-1)' from dual union all
select 101, '4GSdg-3t 22sfG/J (mdx (fq) KO)' from dual union all
select 102, 'Yf7mMjfel 7(tm1) (SCID)' from dual union all
select 103, 'B29fj;jfos x11 (tmos (line x11))' from dual union all
select 104, 'B29;CD (Atm (line G5))' from dual union all
select 105, 'Ifkso30 jel-3' from dual union all
select 106, '13GupSip (te3x) Blhas/J' from dual union all
select 151, '' from dual union all
select 152, 'try (this (and (this))) ok?' from dual union all
select 153, 'try (this (and (this)) ok?)' from dual union all
select 154, 'try (this (and) this (ok))?' from dual union all
select 155, 'try (this (and (this)' from dual union all
select 156, 'right grouping (includging ")")' from dual union all
select 157, 'try this out ) ( too' from dual
),
prep ( id, name, pos ) as (
select id, name, instr(name, ')', -1)
from species_str
),
rec ( id, name, str, len, prev_pos, new_pos, flag ) as (
select id, name, substr(name, 1, instr(name, ')', -1)),
pos, pos - 1, pos, null
from prep
union all
select id, name, str, len, new_pos,
instr(str, '(', -(len - new_pos + 2)),
case when length(replace(substr(str, new_pos), '(', '')) =
length(replace(substr(str, new_pos), ')', ''))
then 1 end
from rec
where prev_pos > 0 and flag is null
)
select id, name, case when flag = 1
then substr(name, prev_pos, len - prev_pos + 1) end as target
from rec
where flag = 1 or prev_pos <= 0 or name is null
order by id;

Output:

        ID NAME                             TARGET                         
---------- -------------------------------- --------------------------------
100 CfwHE3 (HH3d) Jt1 (CD-1) (CD-1)
101 4GSdg-3t 22sfG/J (mdx (fq) KO) (mdx (fq) KO)
102 Yf7mMjfel 7(tm1) (SCID) (SCID)
103 B29fj;jfos x11 (tmos (line x11)) (tmos (line x11))
104 B29;CD (Atm (line G5)) (Atm (line G5))
105 Ifkso30 jel-3
106 13GupSip (te3x) Blhas/J (te3x)
151
152 try (this (and (this))) ok? (this (and (this)))
153 try (this (and (this)) ok?) (this (and (this)) ok?)
154 try (this (and) this (ok))? (this (and) this (ok))
155 try (this (and (this) (this)
156 right grouping (includging ")")
157 try this out ) ( too

14 rows selected

Change needed to satisfy the OP's (edited) requirement:

In the outermost select (at the bottom of the code), where we have case when flag = 1 then... to define the target column, add a condition like so:

... , case when flag = 1 and len = length(name) then ...

Output with this modification:

        ID NAME                             TARGET                         
---------- -------------------------------- --------------------------------
100 CfwHE3 (HH3d) Jt1 (CD-1) (CD-1)
101 4GSdg-3t 22sfG/J (mdx (fq) KO) (mdx (fq) KO)
102 Yf7mMjfel 7(tm1) (SCID) (SCID)
103 B29fj;jfos x11 (tmos (line x11)) (tmos (line x11))
104 B29;CD (Atm (line G5)) (Atm (line G5))
105 Ifkso30 jel-3
106 13GupSip (te3x) Blhas/J
151
152 try (this (and (this))) ok?
153 try (this (and (this)) ok?) (this (and (this)) ok?)
154 try (this (and) this (ok))?
155 try (this (and (this) (this)
156 right grouping (includging ")")
157 try this out ) ( too

14 rows selected


Related Topics



Leave a reply



Submit