How to improve the performance of this regular expression?
Replace your \s with [\t\f ] so they don't catch newlines. Newlines should only be consumed by the whole non-capturing group (?:[\t\f ]*(?:[\%\#].*)?\n).
The problem is that you have three greedy consumers that all match '\n': \s*, (...\n)* and again \s*.
In your last timing example, the engine will try out all strings a, b and c (one for each consumer) that together make up 25*'\n' or any prefix d of it. If e is the remainder that is ignored, then d+e == 25*'\n'.
Now find all combinations of a, b, c and e so that a+b+c+e == d+e == 25*'\n', also allowing the empty string for one or more of the variables. It's too late for me to do the maths right now, but I bet the number is huge :D
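As a rough illustration of the counting argument, the sketch below enumerates only the length splits a+b+c+e for a run of n newlines. The class and method names are made up for this example. It assumes each consumer simply takes some number of newlines; if the repeated middle group can itself match a given run of newlines in several ways (as it can while it still contains \s), the number of paths a backtracking engine explores is considerably larger than this count.

```java
final class SplitCounter {
    // Counts the ways to write n as a + b + c + e with non-negative lengths.
    // a, b and c stand for the three greedy consumers, e for the ignored rest.
    static long countSplits(int n) {
        long count = 0;
        for (int a = 0; a <= n; a++) {
            for (int b = 0; a + b <= n; b++) {
                for (int c = 0; a + b + c <= n; c++) {
                    count++; // e is whatever remains: n - a - b - c
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countSplits(25)); // prints 3276
    }
}
```

Even this simplified count grows cubically with the length of the newline run, which is why removing the overlap between the consumers matters.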
By the way, regex101 is a great site to try out regular expressions. It automatically breaks up expressions, explains their parts, and even provides a debugger.
What are some ways I can improve the performance of a regular expression query in PostgreSQL 8?
You cannot create an index that will speed up any generic regular expression; however, if you have one or a limited number of regular expressions that you are matching against, you have a few options.
As Paul Tomblin mentions, you can use an extra column or columns to indicate whether or not a given row matches that regex or regexes. That column can be indexed, and queried efficiently.
If you want to go further than that, this paper discusses an interesting-sounding technique for indexing against regular expressions. It looks for long literal substrings in the regex and indexes the text by whether those substrings are present, which yields a set of candidate matches. That filters down the number of rows that you actually need to check the regex against. You could probably implement this using GiST indexes, though that would be a non-trivial amount of work.
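The candidate-filtering idea can be sketched outside the database as follows. This is not the paper's algorithm and not a PostgreSQL feature: the class, the method and the hand-picked literal are assumptions for illustration, and a plain String.contains check stands in for what would be an index lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class LiteralPrefilter {
    // Returns the rows matching the regex. The regex is only run on rows that
    // contain requiredLiteral, a substring every match must contain
    // (chosen by hand here; it is the caller's job to pick it correctly).
    static List<String> search(List<String> rows, String requiredLiteral, Pattern regex) {
        List<String> hits = new ArrayList<>();
        for (String row : rows) {
            // Cheap substring test first, expensive regex only on candidates.
            if (row.contains(requiredLiteral) && regex.matcher(row).find()) {
                hits.add(row);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("error 42 in module x", "error in module", "ok");
        Pattern regex = Pattern.compile("error \\d+ in module");
        System.out.println(search(rows, "in module", regex));
    }
}
```

The filter is only safe if the literal really is required by the regex; otherwise matching rows would be dropped before the regex ever sees them.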
How to improve the regex performance in java
The fastest way to get regex to work fast is to not use regex. Regex was never meant to be and almost never is a good choice for performance-sensitive operations. (Further reading: Why are regular expressions so controversial?)
Try using String class methods instead, or write a custom method that does what you want. Tokenize on '=' and then call .toUpperCase() on the part that follows each \n (the field name). Alternatively, convert to char[] or use charAt() and traverse the string manually, switching chars to upper case after a newline and leaving them unchanged after '='.
For example:
public static String changeCase( String s ) {
    // Upper-case characters from the start of each line up to the '='.
    boolean capitalize = true;
    int len = s.length();
    char[] output = new char[len];
    for( int i = 0; i < len; i++ ) {
        char input = s.charAt(i);
        if ( input == '\n' ) {
            // A new line starts: upper-case what follows.
            capitalize = true;
            output[i] = input;
        } else if ( input == '=' ) {
            // The value starts: copy what follows unchanged.
            capitalize = false;
            output[i] = input;
        } else {
            output[i] = capitalize ? Character.toUpperCase(input) : input;
        }
    }
    return new String(output);
}
Method input:
field=value\n
field2=value2\n
field3=value3
Method output:
FIELD=value\n
FIELD2=value2\n
FIELD3=value3
Try it here: http://ideone.com/k0p67j
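The other suggestion, relying on String class methods, might look like the sketch below. The class and method names are invented for this example, and it assumes lines are separated by '\n' and that a line without '=' should be upper-cased entirely, mirroring the character loop above.

```java
import java.util.Locale;

final class KeyUpperCaser {
    // Upper-cases the part of each line that precedes the first '='.
    static String upperCaseKeys(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int lineStart = 0;
        while (lineStart <= s.length()) {
            int lineEnd = s.indexOf('\n', lineStart);
            if (lineEnd < 0) {
                lineEnd = s.length(); // last line has no trailing newline
            }
            String line = s.substring(lineStart, lineEnd);
            int eq = line.indexOf('=');
            if (eq < 0) {
                out.append(line.toUpperCase(Locale.ROOT));
            } else {
                out.append(line.substring(0, eq).toUpperCase(Locale.ROOT))
                   .append(line.substring(eq)); // '=' and the value stay as-is
            }
            if (lineEnd < s.length()) {
                out.append('\n');
            }
            lineStart = lineEnd + 1;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(upperCaseKeys("field=value\nfield2=value2\nfield3=value3"));
    }
}
```

For ASCII input this should produce the same result as the loop version; String.toUpperCase can differ from Character.toUpperCase for a few non-ASCII characters.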
PS (by Jamie Zawinski):
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Why regular expression .* is slower at one place and faster at other
The way regex engines handle the * quantifier, aka the greedy quantifier, is to consume everything in the input that matches, then:
- try the next term in the regex. If it matches, proceed.
- otherwise "unconsume" one character (move the pointer back one), i.e. backtrack, and go back to the first step.
Since . matches (almost) anything, the first step after encountering .* is to move the pointer to the end of the input (or to the end of the line, because . does not match line terminators by default), then start moving back through the input one char at a time, trying the next term until there's a match.
With \s*, only whitespace is consumed, so the pointer is initially moved exactly where you want it to be; no backtracking is required to match the next term.
Something you should try is using the reluctant quantifier .*?, which will consume one char at a time until the next term matches. That should have the same time complexity as \s*, but be slightly more efficient because no check of the current char is required.
\s* and .* at the end of the expression will perform similarly, because both will consume everything at the end of the input that matches, which leaves the pointer in the same position for both expressions.
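The difference between the greedy and the reluctant quantifier is visible in what a capturing group ends up holding. A minimal sketch with made-up patterns and input (none of them come from the question):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsReluctant {
    // Returns the text captured by group 1 of the first match, or null.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Greedy: .* runs to the end, then backs up until \d can match.
        System.out.println(firstGroup("(.*)\\d", "abc123"));  // prints abc12
        // Reluctant: .*? grows one char at a time until \d can match.
        System.out.println(firstGroup("(.*?)\\d", "abc123")); // prints abc
        // \s* stops at the first non-whitespace char, so no backing up is needed.
        System.out.println("[" + firstGroup("(\\s*)\\d", "   7") + "]"); // prints [   ]
    }
}
```

In the greedy case the engine has to give characters back before the digit can match; in the other two cases the pointer is already where the next term needs it.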
How to optimize regular expression performance?
You can greatly improve the performance of this regex by prepending \b at the beginning:
\b(ACS| ... |Z)
The engine then attempts the alternation only at word boundaries instead of at every character.
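A small sketch of the effect, using a shortened alternation (ACS|Z) as a stand-in for the full list, which is elided above. The class name, method name and sample text are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordBoundaryDemo {
    // Counts how many times the regex matches in the text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "ACS and FUZZY Z";
        // Without \b the alternation is tried everywhere, including inside FUZZY.
        System.out.println(countMatches("(ACS|Z)", text));    // prints 4
        // With \b it is only tried where a word starts.
        System.out.println(countMatches("\\b(ACS|Z)", text)); // prints 2
    }
}
```

Note that \b changes what matches as well as how fast: occurrences in the middle of a word, such as the two Z characters in FUZZY, are no longer found, which is usually what is wanted when matching whole codes.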