How to Improve Performance of This Regular Expression Further

How to improve the performance of this regular expression?

Replace each \s with [\t\f ] so that it no longer matches newlines. Newlines should then be consumed only by the non-capturing group as a whole: (?:[\t\f ]*(?:[\%\#].*)?\n).
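To illustrate the difference between the two character classes, here is a minimal sketch in Java (chosen to match the other code on this page; the original pattern is not shown here, so this only exercises the two classes in isolation). The class name BlankClassDemo and the helper consumed are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BlankClassDemo {
    // Returns how many leading characters of input the pattern consumes.
    static int consumed(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.end() : 0;
    }

    public static void main(String[] args) {
        String input = " \t\n\n"; // space, tab, two newlines
        System.out.println(consumed("\\s*", input));       // 4: newlines are swallowed too
        System.out.println(consumed("[\\t\\f ]*", input)); // 2: stops before the first newline
    }
}
```

With the narrower class, the newlines are left for the group that is supposed to match them, so the consumers no longer compete for the same characters.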

The problem is that you have three greedy consumers that all match '\n' (\s*, (...\n)* and again \s*).

In your last timing example, the engine will try every way of dividing the input among them: strings a, b and c (one per consumer) that together make up 25*'\n' or any prefix d of it. If e is the part that is left over, then d+e == 25*'\n'.

Now count all combinations of a, b, c and e such that a+b+c+e == d+e == 25*'\n', where any of the variables may be the empty string. It's too late for me to do the maths right now, but I bet the number is huge :D
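For the simplified model described above (three consumers plus a leftover part, each possibly empty), the count can be brute-forced. This is only a sketch: SplitCounter and countSplits are names invented here, and how many of these splits a real engine actually visits also depends on the rest of the pattern, which is not shown.

```java
final class SplitCounter {
    // Counts the ordered ways to cut n characters into `parts` pieces,
    // where any piece may be empty.
    static long countSplits(int n, int parts) {
        if (parts == 1) {
            return 1; // the last piece takes whatever is left
        }
        long total = 0;
        for (int first = 0; first <= n; first++) {
            total += countSplits(n - first, parts - 1);
        }
        return total;
    }

    public static void main(String[] args) {
        // a, b, c and e sharing 25 newlines
        System.out.println(countSplits(25, 4)); // prints 3276
    }
}
```

In this model the count is the binomial coefficient C(n+3, 3), so it grows with the cube of the input length, and every additional overlapping consumer raises the degree by one.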

By the way, regex101 is a great site for trying out regular expressions. It automatically breaks an expression into its parts, explains each one, and even provides a debugger.

What are some ways I can improve the performance of a regular expression query in PostgreSQL 8?

You cannot create an index that will speed up any generic regular expression; however, if you have one or a limited number of regular expressions that you are matching against, you have a few options.

As Paul Tomblin mentions, you can use an extra column or columns to indicate whether or not a given row matches that regex or regexes. That column can be indexed, and queried efficiently.

If you want to go further than that, this paper discusses an interesting-sounding technique for indexing against regular expressions: look for long literal substrings in the regex and index the text by whether those substrings are present, which yields a set of candidate matches. That cuts down the number of rows you actually need to check the regex against. You could probably implement this using GiST indexes, though that would be a non-trivial amount of work.
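The candidate-filtering idea can be sketched in memory, outside the database. The following Java example is an analogy only, not PostgreSQL code and not the paper's implementation: a cheap substring test stands in for the index lookup, and the regex runs only on rows that pass it. The class name, the sample rows and the pattern are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class LiteralPrefilter {
    // requiredLiteral must be a substring that every match of regex contains;
    // choosing it correctly is the caller's responsibility in this sketch.
    static List<String> matchingRows(List<String> rows, String requiredLiteral, Pattern regex) {
        List<String> hits = new ArrayList<>();
        for (String row : rows) {
            if (!row.contains(requiredLiteral)) {
                continue; // cannot match, so skip the expensive regex check
            }
            if (regex.matcher(row).find()) {
                hits.add(row);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "order-12-shipped", "order-x-shipped", "order-12-pending");
        Pattern regex = Pattern.compile("order-\\d+-shipped");
        System.out.println(matchingRows(rows, "-shipped", regex)); // [order-12-shipped]
    }
}
```

In a database the substring test would be answered by an index rather than a scan, which is where the saving comes from.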

How to improve the regex performance in java

The fastest way to get regex to work fast is to not use regex. Regex was never meant to be and almost never is a good choice for performance-sensitive operations. (Further reading: Why are regular expressions so controversial?)

Try using String class methods instead, or write a custom method that does what you want. Use a tokenizer with split on '=', and then use .toUpperCase() on the trailing part of each token (what comes after \n). Alternatively, just convert to char[] or use charAt() and traverse the string manually, switching to upper case after a newline and back to leaving characters unchanged after '='.

For example:

public static String changeCase( String s ) {
    // true while we are inside a field name (start of input or after '\n')
    boolean capitalize = true;
    int len = s.length();
    char[] output = new char[len];
    for( int i = 0; i < len; i++ ) {
        char input = s.charAt(i);
        if ( input == '\n' ) {
            // a new line starts a new field name
            capitalize = true;
            output[i] = input;
        } else if ( input == '=' ) {
            // everything after '=' is the value; leave it unchanged
            capitalize = false;
            output[i] = input;
        } else {
            output[i] = capitalize ? Character.toUpperCase(input) : input;
        }
    }
    return new String(output);
}

Method input:

field=value\n
field2=value2\n
field3=value3

Method output:

FIELD=value\n
FIELD2=value2\n
FIELD3=value3

Try it here: http://ideone.com/k0p67j

PS (by Jamie Zawinski):

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Why regular expression .* is slower at one place and faster at other

The way regex engines work with the * quantifier, aka greedy quantifier, is to consume everything in the input that matches, then:

  1. try the next term in the regex. If it matches, proceed.
  2. otherwise, "unconsume" one character (move the pointer back one), i.e. backtrack, and go to step 1.

Since . matches (almost) anything, the first thing the engine does on encountering .* is move the pointer to the end of the input. It then steps back through the input one char at a time, trying the next term until there's a match.

With \s*, only whitespace is consumed, so the pointer is initially moved exactly where you want it to be - no backtracking required to match the next term.

Something you should try is the reluctant quantifier .*?, which consumes one char at a time until the next term matches. That should have the same time complexity as \s*, and may be slightly cheaper per character because . accepts almost anything rather than testing for whitespace.

\s* and .* at the end of the expression will perform similarly, because both will consume everything at the end of the input that matches, which leaves the pointer in the same position for both expressions.
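The difference between the greedy and reluctant quantifiers described above is easiest to see in what each one captures. A minimal sketch, with a made-up class name and input:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsReluctant {
    // Returns the text captured by group 1 of the first match, or null.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "a=b=c";
        // Greedy: runs to the end of the input, then backs up to the last '='.
        System.out.println(firstGroup("(.*)=", input));  // a=b
        // Reluctant: grows one char at a time and stops at the first '='.
        System.out.println(firstGroup("(.*?)=", input)); // a
    }
}
```

The two forms are not interchangeable: besides the amount of backtracking, they can select different matches, so check that the reluctant version still captures what you intend.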

How to optimize regular expression performance?

You can greatly improve the performance of this regex by prepending \b at the beginning:

\b(ACS| ... |Z)

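With \b in front, the engine can reject any position that is not a word boundary before it tries the alternation, so the alternatives are tested once per word rather than at every character.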
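A small sketch of the effect, using a shortened stand-in for the alternation (only ACS and Z come from the pattern above; the elided alternatives are left out, and the class name and input are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordBoundaryDemo {
    // Counts the non-overlapping matches of regex in input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "ACS MACS Z";
        System.out.println(countMatches("(ACS|Z)", input));    // 3: also finds ACS inside MACS
        System.out.println(countMatches("\\b(ACS|Z)", input)); // 2: only at word starts
    }
}
```

Note that \b changes what matches as well as how fast: occurrences that begin in the middle of a word are no longer found, so add it only if that is the behaviour you want.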


