Difference Between matches() and find() in Java Regex

Difference between matches() and find() in Java Regex

matches() tries to match the expression against the entire string, as if a ^ were implicitly added at the start of your pattern and a $ at the end, meaning it will not look for a substring. Hence the output of this code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchesVsFind {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\d\\d\\d");
        Matcher m = p.matcher("a123b");
        System.out.println(m.find());    // a three-digit substring exists
        System.out.println(m.matches()); // the whole input is not three digits

        p = Pattern.compile("^\\d\\d\\d$");
        m = p.matcher("123");
        System.out.println(m.find());
        System.out.println(m.matches());
    }
}

/* output:
true
false
true
true
*/

123 is a substring of a123b, so find() returns true. matches() has to match the whole input a123b, which is not just three digits, so it returns false. In the second case the input is exactly 123, so both methods return true.

What's the difference between Matcher.lookingAt() and find()?

The documentation for Matcher.lookingAt clearly explains the region lookingAt tries to match:

Like the matches method, this method always starts at the beginning of the region; unlike that method, it does not require that the entire region be matched.

So no, lookingAt does not require matching the whole string. Then what's the difference between lookingAt and find? From the Matcher Javadoc overview:

  • The matches method attempts to match the entire input sequence against the pattern.
  • The lookingAt method attempts to match the input sequence, starting at the beginning, against the pattern.
  • The find method scans the input sequence looking for the next subsequence that matches the pattern.

lookingAt always starts at the beginning, but find will scan for a starting position.

Viewed another way, matches has a fixed start and end, lookingAt has a fixed start but a variable end, and find has a variable start and end.
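To make the fixed/variable start and end distinction concrete, here is a minimal sketch that runs all three methods against the same inputs. The class name AnchoringDemo, the \d+ pattern, and the sample inputs are assumptions chosen for illustration; only matches(), lookingAt(), and find() come from the Matcher API.

```java
import java.util.regex.Pattern;

final class AnchoringDemo {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Each helper uses a fresh Matcher so earlier calls cannot affect the result.
    static boolean matches(String input)   { return DIGITS.matcher(input).matches(); }
    static boolean lookingAt(String input) { return DIGITS.matcher(input).lookingAt(); }
    static boolean find(String input)      { return DIGITS.matcher(input).find(); }

    public static void main(String[] args) {
        for (String input : new String[] {"123", "123abc", "abc123"}) {
            System.out.println(input
                    + " matches=" + matches(input)
                    + " lookingAt=" + lookingAt(input)
                    + " find=" + find(input));
        }
    }
}
```

With these inputs, "123" satisfies all three methods, "123abc" fails only matches() (the digits start at the beginning but do not cover the whole input), and "abc123" satisfies only find() (the digits do not start at the beginning).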

Java regex, matches and find

When you call matches(), the Matcher has already searched for a match (the whole String). When you then call find(), the Matcher tries to find the pattern again after the current match, but since there are no characters left after a match that covers the entire String, find() returns false.

To search the String again, you'd need to create a new Matcher or call reset():

final Matcher subMatcher = Pattern.compile("\\d+").matcher("123");
System.out.println("Found : " + subMatcher.matches());
subMatcher.reset();
System.out.println("Found : " + subMatcher.find());
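As a side-by-side check of the two situations described above, the following sketch calls find() after matches() once without and once with a reset() in between. The class and method names (ResetDemo, findAfterMatches) are hypothetical; the pattern and input are the ones from the snippet above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ResetDemo {
    // Runs matches() first, optionally resets, then reports what find() returns.
    static boolean findAfterMatches(boolean resetFirst) {
        Matcher m = Pattern.compile("\\d+").matcher("123");
        m.matches();          // consumes the whole input "123"
        if (resetFirst) {
            m.reset();        // discard the match state and start over
        }
        return m.find();
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + findAfterMatches(false));
        System.out.println("with reset:    " + findAfterMatches(true));
    }
}
```

Without the reset, find() continues after the previous match and has nothing left to scan; with the reset, it starts again at the beginning of the input.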

What's the difference between String.matches and Matcher.matches?

The difference is mainly one of cost. A Matcher is created from a precompiled Pattern, while String.matches must recompile the regexp every time it executes, so it becomes more wasteful the more often you run that line of code.
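A minimal sketch of the two styles, assuming a hypothetical three-digit check applied to a list of inputs. The class name PrecompiledDemo, the constant THREE_DIGITS, and both method names are made up for illustration. Both methods give the same answer; the second one just compiles the regex once instead of once per element.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class PrecompiledDemo {
    // Compiled once and reused. Pattern instances are immutable and safe to share.
    private static final Pattern THREE_DIGITS = Pattern.compile("\\d{3}");

    // Compiles the regex again for every element.
    static long countWithStringMatches(List<String> inputs) {
        return inputs.stream().filter(s -> s.matches("\\d{3}")).count();
    }

    // Reuses the compiled Pattern and only creates a cheap Matcher per element.
    static long countWithPattern(List<String> inputs) {
        return inputs.stream().filter(s -> THREE_DIGITS.matcher(s).matches()).count();
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("123", "12a", "4567", "999");
        System.out.println(countWithStringMatches(inputs));
        System.out.println(countWithPattern(inputs));
    }
}
```

Whether the extra compilation matters depends on how hot the code path is; for a one-off check, String.matches is usually fine.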

Difference in results between Java matches vs JavaScript match

In JavaScript, match returns the substrings that match the regex. In Java, matches checks whether the entire string matches the regex.

If you want to find substrings that match a regex, use the Pattern and Matcher classes, like this:

Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(yourData);
while (m.find()) {
    m.group(); // returns the current match in each iteration
    // you can also access other groups by index, if the regex defines them
    m.group(2);
    // or by name, for groups declared as (?<groupName>...)
    m.group("groupName");
}
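For a concrete instance of that loop, here is a small sketch with a made-up pattern and input. The class name FindAllDemo, the PAIR pattern, the group name "key", and the sample string are all assumptions for illustration. It collects every key=number pair found anywhere in the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindAllDemo {
    // Group 1 is the named group "key"; group 2 is the run of digits after '='.
    private static final Pattern PAIR = Pattern.compile("(?<key>[a-z]+)=(\\d+)");

    static List<String> pairs(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {                       // advances to the next match each time
            out.add(m.group("key") + "->" + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        // "c=x" is skipped because x is not a digit.
        System.out.println(pairs("a=1, bb=22; c=x"));
    }
}
```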

Find ALL matches of a regex pattern in Java - even overlapping ones

By default, successive calls to Matcher.find() start at the end of the previous match.

To search from a specific location, pass find a start index; for overlapping matches, use one character past the start of the previous match.

In your case, once a first find() has succeeded, probably something like:

while (matcher.find(matcher.start()+1))

This works fine:

Pattern p = Pattern.compile("[0-9],[0-9],[0-9],[0-9]");

public void test(String[] args) throws Exception {
    String test = "0,1,2,3,4,5,6,7,8,9";
    Matcher m = p.matcher(test);
    if (m.find()) {
        do {
            System.out.println(m.group());
        } while (m.find(m.start() + 1));
    }
}

printing

0,1,2,3

1,2,3,4

...
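The same find(start + 1) idea can be wrapped in a reusable helper. This is only a sketch: the class name OverlappingMatches and the method overlapping are hypothetical. The extra bounds check is there because find(int) throws IndexOutOfBoundsException when the start index is greater than the input length, which could happen if a pattern matched the empty string at the very end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OverlappingMatches {
    // Collects all matches, letting each search start one character
    // after the start of the previous match so that matches may overlap.
    static List<String> overlapping(Pattern pattern, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        int from = 0;
        while (from <= input.length() && m.find(from)) { // find(int) resets the matcher first
            out.add(m.group());
            from = m.start() + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("[0-9],[0-9],[0-9],[0-9]");
        for (String match : overlapping(p, "0,1,2,3,4,5,6,7,8,9")) {
            System.out.println(match);
        }
    }
}
```

For the input above this yields seven matches, from 0,1,2,3 through 6,7,8,9.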

Why is find() with Regex ^[a-z]$ not equivalent to matches() with Regex [a-z]?

^ and $ mean different things depending on which mode you're running your regexp in. See the Pattern.MULTILINE flag's javadoc.

In any case, ^ and $ never consume anything.

The way regex engines work is that everything in the regexp can 'match' or 'not match', and usually, as part of matching, it also consumes characters.

You can think about it as a cursor that, just like your text cursor, is always in between characters. The regexp engine goes from left to right through your regexp, starting with the cursor at the beginning of input; each item in the regexp pattern either matches or fails, and usually, but not always, moves the cursor forward.

^ and $ can match or fail, but they cannot move the cursor. It's the same as e.g. \b (matches on a 'word break'), or (positive/negative) look-(ahead/behind) in that way. The relevant trickery here is that for the matches() case, every character must be consumed - the matching process must end such that the cursor is at the very end. Your pattern can only consume lowercase letters (only forward the cursor when there are lowercase letters), so the moment you toss any character in your string that isn't one of those (so even one \r or \n, in any position), it couldn't possibly match; there is no way to consume these non-lowercase characters.

With find(), on the other hand, you don't need to consume all characters; you merely need for a substring to match up, that is all.

Which then gets us to: which positions in the string are considered as matching ^, and which as matching $? The answer is partly dependent on whether MULTILINE mode is on. It's off in your code snippet; you can turn it on by making your regexes using Pattern.compile(patternString, Pattern.MULTILINE), or by tossing (?m) inside your regexp string. (A (?xyz) construct enables/disables flags from the point where it shows up in your pattern string and otherwise does nothing: it always matches and consumes nothing.)

Even UNIX_LINES has an effect on this: with UNIX_LINES mode on, only \n is considered a line terminator, and in MULTILINE mode ^/$ match at line terminators.

In multiline mode, all your examples trivially match. ^ is 'true' whenever the cursor is at start-of-input (i.e. before the first character), or in between a newline character and the thing that immediately follows it, as long as that thing isn't the end of the entire input. \r and \n both count (because UNIX_LINES is off).
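The effect of the flag can be seen with a small sketch. The class name MultilineDemo, the helper find, and the three-line sample input are assumptions for illustration. It shows the same ^[a-z]$ pattern with the flag off, with Pattern.MULTILINE, and with the inline (?m) form.

```java
import java.util.regex.Pattern;

final class MultilineDemo {
    // Compiles the regex with the given flags and reports whether find() succeeds.
    static boolean find(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "1\na\n2"; // three lines; only the middle one is a lowercase letter

        // Flag off: ^ can only match before the '1', so the pattern never matches.
        System.out.println(find("^[a-z]$", 0, input));
        // Flag on: ^ and $ also match around the inner line terminators.
        System.out.println(find("^[a-z]$", Pattern.MULTILINE, input));
        // Same thing using the inline flag.
        System.out.println(find("(?m)^[a-z]$", 0, input));
    }
}
```

With this input, only the two MULTILINE variants find the middle line.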

But you're not in MULTILINE mode, so what in the blazes is going on?

What's going on is that the docs are wrong, as @MartinDevillers' excellent digging around for the relevant bug entries shows.

The docs are only slightly wrong. Specifically, the regex engine is trying to be a little more intelligent than the rather rote:

From the javadoc of the regular expression package:

By default these expressions only match at the beginning and the end of the entire input sequence.

And that's just plain hogwash. It's more intelligent than that: $ also matches when your cursor is in between a character and exactly one newline (any of \r, \n, and \r\n is considered 'one newline'), as long as that one newline is the final thing in the entire input. In other words, given (where the spaces aren't real; I'm making room to show where cursors can be, which can only be between chars, so I can stick a marker below them to show where things match):

" h e l l o \r \n "
^ ^ ^

The matching system considers $ matched in any of the ^ places. Let's test that theory:

Pattern p = Pattern.compile("hello$");
System.out.println(p.matcher("hello\r\n\n").find());
System.out.println(p.matcher("hello\r\n").find());
System.out.println(p.matcher("hello\r").find());
System.out.println(p.matcher("hello\n").find());
System.out.println(p.matcher("hello\n\n").find());

This prints false, true, true, true, false. The middle 3 all end in a character (or characters) considered 'a single newline' on at least one major OS (\n is POSIX/Unix/Mac OS X, \r\n is Windows, \r is classic Mac, which I don't think ever ran a JVM and which nobody uses anymore, but it's still considered 'a newline' by most rules for grandfathering reasons, I guess).

That's all you're missing here.

CONCLUSION:

The docs are slightly wrong, and $ is smarter than merely 'matches at very end of input'; it acknowledges that sometimes input has a stray newline hanging off of the end of it, and $ won't get confused by this. matches() will get confused by a dangling newline at the very end, though - it has to consume everything or it isn't considered matched.
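To tie this back to the question in the heading, here is a minimal sketch comparing find() with ^[a-z]$ against matches() with [a-z]. The class name AnchorsVsMatches, the two helper names, and the sample inputs are assumptions for illustration.

```java
import java.util.regex.Pattern;

final class AnchorsVsMatches {
    // find() with explicit anchors: $ tolerates one trailing line terminator.
    static boolean findAnchored(String input) {
        return Pattern.compile("^[a-z]$").matcher(input).find();
    }

    // matches() without anchors: every character of the input must be consumed.
    static boolean matchesBare(String input) {
        return Pattern.compile("[a-z]").matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String input : new String[] {"a", "a\n", "ab"}) {
            System.out.println(input.replace("\n", "\\n")
                    + " find=" + findAnchored(input)
                    + " matches=" + matchesBare(input));
        }
    }
}
```

The two approaches agree on "a" (both succeed) and on "ab" (both fail), and differ only on "a\n": find() succeeds because $ accepts the position before the final newline, while matches() fails because nothing in the pattern can consume that newline.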


