java.util.regex - Importance of Pattern.compile()

java.util.regex - importance of Pattern.compile()?

The compile() method is always called at some point; it's the only way to create a Pattern object. So the question is really, why should you call it explicitly? One reason is that you need a reference to the Matcher object so you can use its methods, like group(int) to retrieve the contents of capturing groups. The only way to get ahold of the Matcher object is through the Pattern object's matcher() method, and the only way to get ahold of the Pattern object is through the compile() method. Then there's the find() method which, unlike matches(), is not duplicated in the String or Pattern classes.
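As a minimal sketch of why an explicit Matcher matters, the following collects key/value pairs with find() and group(int). The class name GroupDemo, the method parsePairs, and the key=value input format are assumptions made purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupDemo {
    // Compiled once; group 1 is the key, group 2 is the value.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=(\\d+)");

    // Collects every key=value pair found anywhere in the input.
    // Assumes each value fits in an int.
    static Map<String, Integer> parsePairs(String input) {
        Map<String, Integer> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {   // find() advances to the next match in the input
            result.put(m.group(1), Integer.parseInt(m.group(2)));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parsePairs("width=10, height=20"));
    }
}
```

Neither String nor the static Pattern.matches() exposes the groups or the find() loop, which is why the Pattern and Matcher objects are needed here.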

The other reason is to avoid creating the same Pattern object over and over. Every time you use one of the regex-powered methods in String (or the static matches() method in Pattern), it creates a new Pattern and a new Matcher. So this code snippet:

for (String s : myStringList) {
    if ( s.matches("\\d+") ) {
        doSomething();
    }
}

...is exactly equivalent to this:

for (String s : myStringList) {
    if ( Pattern.compile("\\d+").matcher(s).matches() ) {
        doSomething();
    }
}

Obviously, that's doing a lot of unnecessary work. In fact, it can easily take longer to compile the regex and instantiate the Pattern object than it does to perform an actual match. So it usually makes sense to pull that step out of the loop. You can create the Matcher ahead of time as well, though Matcher objects are not nearly as expensive to create:

Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher("");
for (String s : myStringList) {
    if ( m.reset(s).matches() ) {
        doSomething();
    }
}
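In code that is called repeatedly rather than in one loop, the same idea is often expressed by keeping the compiled Pattern in a static final field. The sketch below assumes a hypothetical DigitFilter class. Pattern instances are immutable and safe to share between threads, whereas a Matcher is not, so each iteration here creates its own Matcher.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class DigitFilter {
    // Compiled once when the class is initialized, then reused by every call.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Returns only the strings that consist entirely of digits.
    static List<String> onlyNumbers(List<String> input) {
        List<String> out = new ArrayList<>();
        for (String s : input) {
            if (DIGITS.matcher(s).matches()) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(onlyNumbers(Arrays.asList("42", "x1", "007", "")));
    }
}
```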

If you're familiar with .NET regexes, you may be wondering if Java's compile() method is related to .NET's RegexOptions.Compiled modifier; the answer is no. Java's Pattern.compile() method is merely equivalent to .NET's Regex constructor. When you specify the Compiled option:

Regex r = new Regex(@"\d+", RegexOptions.Compiled); 

...it compiles the regex directly to CIL byte code, allowing it to perform much faster, but at a significant cost in up-front processing and memory use--think of it as steroids for regexes. Java has no equivalent; there's no difference between a Pattern that's created behind the scenes by String#matches(String) and one you create explicitly with Pattern#compile(String).

(EDIT: I originally said that all .NET Regex objects are cached, which is incorrect. Since .NET 2.0, automatic caching occurs only with static methods like Regex.Matches(), not when you call a Regex constructor directly.)

Pattern.compile() with an argument

Short answer

No. It may be possible from a theoretical point of view, but the regex is not compiled at Java compile time; it is compiled again at runtime, each time Pattern.compile() is called.

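Pattern does use a Flyweight-style lookup table for some internal building blocks (named character properties), so those pieces are not rebuilt from scratch. It does not, however, keep whole compiled patterns in memory: every call to Pattern.compile() still parses the regex and builds a new Pattern object.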

Compile time

From a theoretical point of view, a compiler could perform what is known as constant propagation and resolve the call at compile time. That is only possible when the callee is known at compile time (for example, because the methods you call are final or static).

If, however, one compiles your method and inspects the Java bytecode, the result is:

public static boolean checkRegex(java.lang.String, java.lang.String);
Code:
0: aload_1
1: ifnonnull 6
4: iconst_0
5: ireturn
6: aload_0
7: invokestatic #15 // Method java/util/regex/Pattern.compile:(Ljava/lang/String;)Ljava/util/regex/Pattern;
10: astore_2
11: aload_2
12: aload_1
13: invokevirtual #16 // Method java/util/regex/Pattern.matcher:(Ljava/lang/CharSequence;)Ljava/util/regex/Matcher;
16: astore_3
17: aload_3
18: invokevirtual #17 // Method java/util/regex/Matcher.matches:()Z
21: ireturn

public static void main(java.lang.String[]);
Code:
0: ldc #18 // String ^.+@.+\..+$
2: ldc #19 // String email@example.com
4: invokestatic #20 // Method checkRegex:(Ljava/lang/String;Ljava/lang/String;)Z
7: pop
...

As you can see, the methods are simply translated into Java bytecode: main passes the two constant strings to checkRegex, and checkRegex still calls Pattern.compile at runtime (offset 7). Nothing has been precomputed.
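For reference, source along the following lines would compile to bytecode of that shape. This is a reconstruction from the listing, not the original code: the parameter and local variable names are guesses, and the class name RegexCheck is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexCheck {
    // Mirrors the listing: a null check on the second argument, then
    // Pattern.compile on the first argument at every single call.
    public static boolean checkRegex(String regex, String input) {
        if (input == null) {
            return false;
        }
        Pattern p = Pattern.compile(regex);   // offset 7 in the listing
        Matcher m = p.matcher(input);         // offset 13
        return m.matches();                   // offset 18
    }

    public static void main(String[] args) {
        // Both arguments are constants, yet the compile step happens at runtime.
        checkRegex("^.+@.+\\..+$", "email@example.com");
    }
}
```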

Some compilers (mostly ones for functional and logic programming languages) allow program specialization: if a certain call is made with a constant argument, the compiler can create a copy of that method and resolve whatever is known at compile time. It is, however, a trade-off, since introducing a large number of specialized methods will result in a large code base.

Flyweight pattern at runtime

If you dig into the Java bytecode of java.util.regex.Pattern, you will see:

private static final HashMap<String, CharPropertyFactory> map;

static CharProperty charPropertyFor(String name) {
// <editor-fold defaultstate="collapsed" desc="Compiled Code">
/* 0: getstatic java/util/regex/Pattern$CharPropertyNames.map:Ljava/util/HashMap;
* 3: aload_0
* 4: invokevirtual java/util/HashMap.get:(Ljava/lang/Object;)Ljava/lang/Object;
* 7: checkcast java/util/regex/Pattern$CharPropertyNames$CharPropertyFactory
* 10: astore_1
* 11: aload_1
* 12: ifnonnull 19
* 15: aconst_null
* 16: goto 23
* 19: aload_1
* 20: invokevirtual java/util/regex/Pattern$CharPropertyNames$CharPropertyFactory.make:()Ljava/util/regex/Pattern$CharProperty;
* 23: areturn
* */
// </editor-fold>
}

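Notice the HashMap: Pattern keeps a static table that maps the names of character properties (the names usable inside \p{...}) to factories for the corresponding matcher nodes. Sharing such small pieces through a lookup table is in the spirit of a Flyweight. It does not mean that a regex is compiled only once, though: the table is keyed by property name, not by regex, so each Pattern.compile() call still parses its argument and only saves some work on these building blocks. The lookups themselves are not free either.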

How to use the argument of Pattern.compile?

The first parameter of Pattern.compile() is a regular expression. It must conform to the regular expression language, a widely used notation for describing text patterns.

Although the fine details of regular expressions are very important to understand, and are often subjects of lengthy college courses, you can learn the basics by following a simple tutorial [link], following numerous examples, and trying your hand at writing regular expressions for your particular purposes.

There are many implementations of regular expression engines, with widely different capabilities. To learn the particulars of the Java "dialect" of regular expressions, follow the documentation of the Pattern class.
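As a small illustration of the Java dialect: the regex is written as a Java string literal, so each backslash in the regex has to be doubled in source code. The sketch below also uses the two-argument overload Pattern.compile(String, int), whose second argument is a bit mask of flags. The class name CompileArgs and the sample patterns are assumptions chosen for the example.

```java
import java.util.regex.Pattern;

final class CompileArgs {
    // The regex is \d{4}-\d{2}-\d{2}; each backslash is doubled in the literal.
    // It checks only the shape of the text, not whether the date is valid.
    static final Pattern ISO_DATE = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

    // Flags are constants defined on Pattern and can be combined with |.
    static final Pattern HELLO = Pattern.compile("hello", Pattern.CASE_INSENSITIVE);

    static boolean looksLikeDate(String s) {
        return ISO_DATE.matcher(s).matches();
    }

    static boolean isHello(String s) {
        return HELLO.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDate("2024-01-31"));
        System.out.println(isHello("HeLLo"));
    }
}
```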

What is the benefit of the design of java.util.regex.Pattern and java.util.regex.Matcher?

Why is Pattern created by a static factory method?

As per the documentation of Pattern,

A (Pattern) is a compiled representation of a regular expression.

A Pattern object is associated with one regular expression, and users of this object are supposed to create it once and use it many times. By providing a static factory, the Pattern class has the freedom to perform internal checks before returning the Pattern object. For example, it could (if it wished to) cache Pattern instances and return the cached instance if the same pattern string is provided in another call to compile. (Note: this is not how it is implemented; it merely has that freedom because it uses a static factory.)

Why is Matcher created through a factory method on Pattern?

A Matcher can be used for two purposes

(Below is a simplified perspective for the sake of discussion; refer to the Javadoc of Matcher for more details):

  1. Check whether given string matches a given regex.
  2. Match the given string against a given pattern, and return various match results.

For the first case, one can use the Pattern.matches(regex, string) form of method invocation. In this case, the regex is compiled and a boolean result is returned after the match. Note that this is a somewhat functional style of programming, and it works fine here because there is no matching state to be maintained.

For the second case, a match state has to be maintained, which the user can query after the matching is performed. Hence, in this case a Matcher object is used, which can maintain the state of the match results. Since a Matcher object cannot exist without a corresponding Pattern object, the API developer allows its creation only through an instance of Pattern; thus users invoke p.matcher("aaaaab"). Internally, the code in the Pattern class looks like this:

public Matcher matcher(CharSequence input) {
    if (!compiled) {
        synchronized(this) {
            if (!compiled)
                compile();
        }
    }
    Matcher m = new Matcher(this, input);
    return m;
}

As can be seen, Matcher takes the Pattern as a constructor parameter, so that it can consult the Pattern at various points to compute and maintain the match result.
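The split between a shared Pattern and a stateful Matcher can be seen in a short sketch. It assumes a hypothetical MatcherState class: one Pattern produces two Matchers, and each keeps its own match position and result.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatcherState {
    static final Pattern WORD = Pattern.compile("[a-z]+");

    // Advances the matcher once and reports "start-end" of the match, or "none".
    static String firstSpan(Matcher m) {
        return m.find() ? m.start() + "-" + m.end() : "none";
    }

    public static void main(String[] args) {
        // One immutable Pattern, two Matchers with separate state.
        Matcher a = WORD.matcher("12 abc");
        Matcher b = WORD.matcher("xy 34");
        System.out.println(firstSpan(a));
        System.out.println(firstSpan(b));
        // Matcher a still holds the result of its own last match.
        System.out.println(a.group());
    }
}
```

Because the state lives in the Matcher, the Pattern itself can be shared freely, while a Matcher should stay confined to one thread at a time.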

PS

Like any API, Pattern and Matcher could have been implemented a bit differently as well. Not all Java APIs are consistent in their design; I guess there is always some trait of the developers who built those APIs left behind. The above answer is my interpretation of the approach those developers took.

How to serialize a java.util.regex.Pattern using protobuf?

java.util.regex.Pattern does not have encode and decode proto functions implemented in itself. However, you can implement them yourself pretty easily (as Andy Turner suggests). Something like this:

Proto

syntax = "proto2";

package termin4t0r;
option java_package = "com.example.termin4t0r";

// Proto for java.util.regex.Pattern
message RegexPatternProto {
    // See Pattern.pattern()
    optional string pattern = 1;
    // See Pattern.flags()
    optional int64 flags = 2;
}

Java encode and decode functions

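class RegexpPatternProtos {
    // Stores the regex source and the match flags; together they are enough to rebuild the Pattern.
    public static RegexPatternProto encode(java.util.regex.Pattern pattern) {
        return RegexPatternProto.newBuilder()
            .setPattern(pattern.pattern())
            .setFlags(pattern.flags())
            .build();
    }

    // Recompiles the Pattern; flags is int64 in the proto, but Pattern.compile() takes an int.
    public static java.util.regex.Pattern decode(RegexPatternProto patternProto) {
        return java.util.regex.Pattern.compile(
            patternProto.getPattern(), (int) patternProto.getFlags());
    }
}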

I leave the unit tests as an exercise :) I even find serializing this way preferable, as protocol buffers have forward and backward compatibility, whereas Java serialization has problems with that.
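The approach relies on pattern() and flags() being enough to rebuild an equivalent Pattern. That part can be checked without protobuf at all; the sketch below assumes a hypothetical PatternRoundTrip class and uses plain JDK calls only.

```java
import java.util.regex.Pattern;

final class PatternRoundTrip {
    // Rebuilds a Pattern from the two values that the proto message stores.
    static Pattern copyOf(Pattern original) {
        return Pattern.compile(original.pattern(), original.flags());
    }

    public static void main(String[] args) {
        Pattern original = Pattern.compile("^ab+c$",
                Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        Pattern copy = copyOf(original);
        System.out.println(copy.pattern().equals(original.pattern()));
        System.out.println(copy.flags() == original.flags());
        System.out.println(copy.matcher("x\nABBC\ny").find());
    }
}
```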

Does Pattern.compile cache?

I don't believe the results are cached and there's no evidence of such behaviour in the code or the documentation. It would (of course) be relatively trivial to implement such a cache yourself, but I would be interested in a use case in which such caching is beneficial.
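A minimal sketch of such a cache, assuming a hypothetical PatternCache class built on ConcurrentHashMap. It is unbounded and keyed by the regex string only (flags are ignored), so treat it as an illustration rather than production code. The first line of main shows that two plain Pattern.compile() calls return distinct objects.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.regex.Pattern;

final class PatternCache {
    // Unbounded map from regex source to its compiled Pattern (illustrative only).
    private static final ConcurrentMap<String, Pattern> CACHE = new ConcurrentHashMap<>();

    // Compiles the regex on first use and returns the same instance afterwards.
    static Pattern get(String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile);
    }

    public static void main(String[] args) {
        // Two separate compile calls yield two separate objects.
        System.out.println(Pattern.compile("\\d+") == Pattern.compile("\\d+"));
        // The cache hands back the same object both times.
        System.out.println(get("\\d+") == get("\\d+"));
    }
}
```

In many programs a static final Pattern field gives the same benefit with less machinery; a map like this only pays off when the set of regex strings is not known in advance.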

Re. the comment below and String.split(), there's a different approach in that the code takes a distinct path for trivial 1 or 2 char patterns vs more complex regexps. But it still doesn't appear to cache.

Java regex performance

Hint: Don't use regexes for link extraction or other HTML "parsing" tasks!

Your regex has 6 (SIX) repeating groups in it. Executing it will entail a lot of backtracking. In the worst case, it could even approach O(N^6), where N is the number of input characters. You could ease this a bit by replacing greedy matching with lazy matching, but it is almost impossible to avoid pathological cases, e.g. when the input data is sufficiently malformed that the regex does not match.
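To make the terminology concrete: a greedy quantifier such as .+ takes as much input as it can and then backtracks, while its reluctant (lazy) form .+? takes as little as possible and extends only when needed. The sketch below is not the regex from the question; the class name GreedyVsLazy and the sample input are assumptions used only to show the difference in what gets matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsLazy {
    static final Pattern GREEDY = Pattern.compile("<.+>");
    static final Pattern LAZY = Pattern.compile("<.+?>");

    // Returns the first match of the pattern in the input, or null if there is none.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<a><b>";
        System.out.println(firstMatch(GREEDY, html)); // spans both tags
        System.out.println(firstMatch(LAZY, html));   // stops at the first '>'
    }
}
```

Switching quantifiers changes where the engine looks first, not the worst case: on input that cannot match, both forms still try every alternative before giving up.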

A far, far better solution is to use some existing strict or permissive HTML parser. Even writing an ad-hoc parser by hand is going to be better than using gnarly regexes.

This page lists various HTML parsers for Java. I've heard good things about TagSoup and HtmlCleaner.


