How to Find Repeated Characters with a Regex in Java

How can I find repeated characters with a regex in Java?

Try "(\\w)\\1+"

The \\w matches any word character (letter, digit, or underscore) and the \\1+ matches whatever was in the first set of parentheses, one or more times. So you wind up matching any occurrence of a word character, followed immediately by one or more of the same word character again.

(Note that I gave the regex as a Java string literal, i.e. with the backslashes already doubled for you.)
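As a minimal sketch of how that pattern might be used, the following collects every run of two or more identical word characters. The class name RepeatedRuns and the method findRuns are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class RepeatedRuns {
    // A word character followed by one or more copies of itself.
    private static final Pattern RUN = Pattern.compile("(\\w)\\1+");

    // Returns every run of two or more identical word characters, in order.
    static List<String> findRuns(String input) {
        List<String> runs = new ArrayList<>();
        Matcher m = RUN.matcher(input);
        while (m.find()) {
            runs.add(m.group()); // the whole run, e.g. "oo"
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(findRuns("bookkeeper aaa_b")); // [oo, kk, ee, aaa]
    }
}
```

m.group() is the whole run, while m.group(1) would give just the repeated character.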

REGEX in java for extracting consecutive duplicate characters in a string

There is no single plain regex solution for extracting runs of exactly two identical characters, because that would need a lookbehind with a backreference inside, which the Java regex engine does not support.

What you can do is either get all (\w)\1+ matches and then check their length using common string methods:

String s = "aaabbaa";
Pattern pattern = Pattern.compile("(\\w)\\1+");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    // keep only runs that are exactly two characters long
    if (matcher.group().length() == 2) System.out.println(matcher.group(1));
}

(see the Java demo), or you can use an alternation that matches either 3 or more repetitions or exactly 2 repetitions, and only grab the match if Group 2 matched:

String s = "aaabbaa";
Pattern pattern = Pattern.compile("(\\w)\\1{2,}|(\\w)\\2");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    // Group 2 is set only when the second alternative (exactly two chars) matched
    if (matcher.group(2) != null)
        System.out.println(matcher.group(2));
}

See this Java demo. Regex details:

  • (\w)\1{2,} - a word char and two or more occurrences of the same char right after
  • | - or
  • (\w)\2 - a word char and the same char right after.
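Both snippets above hard-code a run length of two. The length check from the first one extends to any exact run length; here is a rough sketch, where ExactRuns and charsRepeatedExactly are illustrative names and not from the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper generalizing the length check to any n >= 2.
final class ExactRuns {
    private static final Pattern RUN = Pattern.compile("(\\w)\\1+");

    // Collects the character of every run whose length is exactly n.
    static List<String> charsRepeatedExactly(String input, int n) {
        List<String> result = new ArrayList<>();
        Matcher m = RUN.matcher(input);
        while (m.find()) {
            if (m.group().length() == n) {
                result.add(m.group(1));
            }
        }
        return result;
    }
}
```

For "aaabbaa", n = 2 yields b and a (the trailing "aa"), and n = 3 yields only the leading a.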

Regular expression find if same character repeats 3 or more no of times in Java

You can use the following regex for your problem:

^.*(.)\1\1.*$

Explanation

  1. ^ starting point of your string
  2. .* any char 0 to N times
  3. (.) one char in the capturing group that will be used by the backreference
  4. \1 back reference to the captured character (we call it twice to force your 3 times repetition constraint)
  5. .* any char 0 to N times
  6. $ end of the input string

I have tested it on:

hello -> false
ohhhhh -> true
whatsuppp -> true
aaa -> true
aaahhhahj -> true
abcdef -> false
abceeedef -> true

Last but not least, you have to escape each backslash in the regex with another backslash before you can use it in a Java string literal.

This gives you the following prototype Java code:

ArrayList<String> strVector = new ArrayList<String>();
strVector.add("hello");
strVector.add("ohhhhh");
strVector.add("whatsuppp");
strVector.add("aaa");
strVector.add("aaahhhahj");
strVector.add("abcdef");
strVector.add("abceeedef");

Pattern pattern = Pattern.compile("^.*(.)\\1\\1.*$");
Matcher matcher;

for (String elem : strVector) {
    System.out.println(elem);
    matcher = pattern.matcher(elem);
    if (matcher.find()) System.out.println("Found you!");
    else System.out.println("Not Found!");
}

which gives the following output when run:

hello
Not Found!
ohhhhh
Found you!
whatsuppp
Found you!
aaa
Found you!
aaahhhahj
Found you!
abcdef
Not Found!
abceeedef
Found you!
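Since the code above calls Matcher#find, which looks for the pattern anywhere in the input, the surrounding ^.* and .*$ are not strictly required there. A shorter sketch of the same check, with the class name TripleCheck and method hasTriple chosen only for illustration:

```java
import java.util.regex.Pattern;

// Hypothetical helper: true if some character occurs 3+ times in a row.
final class TripleCheck {
    // One captured character followed by two copies of itself.
    private static final Pattern TRIPLE = Pattern.compile("(.)\\1\\1");

    static boolean hasTriple(String s) {
        // find() scans the whole input, so no leading/trailing .* is needed
        return TRIPLE.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasTriple("whatsuppp")); // true
        System.out.println(hasTriple("hello"));     // false
    }
}
```

The anchored form is still the one to use with String#matches, which must match the entire string.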

Using Java+regex, I want to find repeating characters in a string and replace that substring(s) with character found and # of times it was found

You may wrap the quantified backreference in a capturing group so that you can access its value later, and use Matcher#appendReplacement to actually modify the matches inside the string:

String text = "fghhhhjkjkljhdd";
String regex = "(\\w)(\\1+)";
Pattern r = Pattern.compile(regex);
Matcher m = r.matcher(text);
StringBuffer sb = new StringBuffer();
while (m.find()) {
    m.appendReplacement(sb, m.group(1) + (m.group(2).length()+1));
}
m.appendTail(sb);
System.out.println(sb); // => fgh4jkjkljhd2

See the Java demo.
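On Java 9 or later, the same replacement can probably be written more compactly with the Matcher#replaceAll overload that takes a function. This is a sketch under that assumption; RunLengthReplace and compressRuns are hypothetical names, and the second capturing group is dropped because the whole match length already gives the count.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; requires Java 9+ for replaceAll(Function).
final class RunLengthReplace {
    private static final Pattern RUN = Pattern.compile("(\\w)\\1+");

    // Replaces each run of a repeated word character with "<char><run length>".
    static String compressRuns(String text) {
        return RUN.matcher(text).replaceAll(
            // quoteReplacement keeps '$' and '\' literal in the replacement text
            mr -> Matcher.quoteReplacement(mr.group(1) + mr.group().length()));
    }

    public static void main(String[] args) {
        System.out.println(compressRuns("fghhhhjkjkljhdd")); // fgh4jkjkljhd2
    }
}
```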

Find repeating characters in a string using regex

(.)\1{2}

(.) matches any char and captures it

\1 matches exactly that same char again

{2} requires the backreference to match twice, i.e. 2 more copies of that char, for 3 in a row
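One detail worth seeing in code: (.)\1{2} consumes exactly three characters per match, so a longer run produces several matches. A small sketch that records where each match starts, with TripleStarts and tripleStarts as illustrative names only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper listing the start index of every 3-char match.
final class TripleStarts {
    private static final Pattern TRIPLE = Pattern.compile("(.)\\1{2}");

    static List<Integer> tripleStarts(String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = TRIPLE.matcher(input);
        while (m.find()) {
            starts.add(m.start()); // index of the first char of this triple
        }
        return starts;
    }

    public static void main(String[] args) {
        // "aaa" at 0, then the six c's split into two matches at 4 and 7
        System.out.println(tripleStarts("aaabccccccd")); // [0, 4, 7]
    }
}
```

If each run should count once regardless of length, (.)\1{2,} would be the variant to try.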

Writing a regex to detect repeat-characters

You want to catch as many characters in your set as possible, so instead of (\\w) you should use (\\w+). You also want the sequence to be at the end, so you need to add $. I have removed the + after \\1 as well, since it is not needed to detect repetition: a single repetition is enough.

Pattern p = Pattern.compile("(\\w+)\\1$");

Your program then outputs An as expected.

Finally, if you only want to capture ASCII letters, you can use [a-zA-Z] instead of \\w:

Pattern p = Pattern.compile("([a-zA-Z]+)\\1$");

And if you want the repeated sequence to be at least 2 characters long:

Pattern p = Pattern.compile("([a-zA-Z]{2,})\\1$");
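To make the behaviour of the first pattern concrete, here is a small sketch that returns the repeated unit at the end of a string, if any. The inputs are made up (the original question's program is not shown here), and TrailingRepeat and trailingRepeat are hypothetical names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds a sequence that is immediately repeated at the end.
final class TrailingRepeat {
    private static final Pattern TAIL = Pattern.compile("(\\w+)\\1$");

    static Optional<String> trailingRepeat(String s) {
        Matcher m = TAIL.matcher(s);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(trailingRepeat("banana")); // Optional[na]
        System.out.println(trailingRepeat("abcabc")); // Optional[abc]
        System.out.println(trailingRepeat("hello"));  // Optional.empty
    }
}
```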

Regex to detect if character is repeated more than three times

I think there's a much simpler solution if you're looking for any character repeated more than 3 times:

String[] inputs = {
    "hello how are you...",           // -> VALID
    "hello how are you.............", // -> INVALID
    "hiii",                           // -> VALID
    "hiiiiii"                         // -> INVALID
};
//                           | group 1 - any character
//                           |  | back-reference
//                           |  |  | 4+ quantifier including previous instance
//                           |  |  |      | dot represents any character,
//                           |  |  |      | including whitespace and line feeds
//                           |  |  |      |
Pattern p = Pattern.compile("(.)\\1{3,}", Pattern.DOTALL);
// iterating test inputs
for (String s : inputs) {
    // matching
    Matcher m = p.matcher(s);
    // 4+ repeated character found
    if (m.find()) {
        System.out.printf(
            "Input '%s' not valid, character '%s' repeated more than 3 times%n",
            s,
            m.group(1)
        );
    }
}

Output

Input 'hello how are you.............' not valid, character '.' repeated more than 3 times
Input 'hiiiiii' not valid, character 'i' repeated more than 3 times

Regex to match four repeated letters in a string using a Java pattern

Not knowing about the finite repetition syntax, your own problem solving skill should lead you to this:

([a-z])\1\1\1

Obviously it's not pretty, but:

  • It works
  • It exercises your own problem solving skill
  • It may lead you to deeper understanding of concepts
    • In this case, knowing the desugared form of the finite repetition syntax


I have a concern:

  • "ffffffff".matches("([a-z])\\1{3,}") = true
  • "fffffasdf".matches("([a-z])\\1{3,}") = false
  • "asdffffffasdf".matches("([a-z])\\1{3,}") = false

What can I do for the bottom two?

The problem is that in Java, matches needs to match the whole string; it is as if the pattern were surrounded by ^ and $.

Unfortunately there is no String.containsPattern(String regex), but you can always use this trick of surrounding the pattern with .*:

"asdfffffffffasf".matches(".*([a-z])\\1{3,}.*") // true!
//                         ^^              ^^
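An alternative to the .* trick is to go through Pattern and Matcher, since Matcher#find looks for the pattern anywhere in the input. A minimal sketch, where the helper name containsPattern simply echoes the wished-for method above and is not a real String method:

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the missing String.containsPattern.
final class ContainsPattern {
    static boolean containsPattern(String input, String regex) {
        // find() succeeds if any substring matches, unlike matches()
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String regex = "([a-z])\\1{3,}";
        System.out.println(containsPattern("fffffasdf", regex));     // true
        System.out.println(containsPattern("asdffffffasdf", regex)); // true
        System.out.println("fffffasdf".matches(regex));              // false
    }
}
```

If the same regex is applied many times, compiling the Pattern once and reusing it avoids recompiling on every call.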

Regex to validate 3 repeating characters

The regex solution for this is very inefficient. Please treat this answer as being of purely academic interest.

The pattern that fails strings having 4 or more occurrences of the same char is

^(?!.*(.).*\1.*\1.*\1).*

The last .* may be replaced with a more restrictive pattern if you need to make this pattern more precise.

See the regex demo.

The main part here is the (?!.*(.).*\1.*\1.*\1) negative lookahead. It matches any 0+ chars (any char including newlines if Pattern.DOTALL is used), as many as possible, then matches and captures (with (.)) any char into Group 1, and then three times matches any 0+ chars followed by that same char. If this pattern is found (matched), the whole string match fails.

Why is it inefficient? The pattern relies heavily on backtracking. .* grabs all chars to the end of the string, then the engine backtracks, trying to accommodate some text for the subsequent subpatterns. You may see the backtracking steps here. The more .* there is, the more resource-consuming the pattern is.

Why is the lazy variant not any better? The ^(?!.*?(.).*?\1.*?\1.*?\1).* pattern looks faster with some strings, and it will be faster if the repeating chars appear close to each other and near the start of the string. If they are at the end of the string, the efficiency will degrade. So, if the previous regex matches 121212 in 77 steps, the current one will take the same number of steps. However, if you test it against 1212124444, you will see that the lazy variant fails after 139 steps, while the greedy variant fails after 58 steps. And vice versa, 4444121212 will cause the lazy regex to fail quicker: 14 steps vs. 211 steps with the greedy variant.

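In Java, you may use it as

s.matches("(?!.*(.).*\\1.*\\1.*\\1).*")

or

s.matches("(?!.*?(.).*?\\1.*?\\1.*?\\1).*")

Keep the trailing .* here: String#matches must consume the whole string, and a lookahead on its own consumes nothing.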

Use Jacob's solution in production.
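The answer referred to is not reproduced here, so the following is not necessarily that solution. It is just one plausible non-regex sketch of the same check, counting occurrences in a single pass. OccurrenceLimit and noCharMoreThan are hypothetical names, and unlike the regex without DOTALL it also counts line terminators.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-pass check: no character may occur more than `limit` times.
final class OccurrenceLimit {
    static boolean noCharMoreThan(String s, int limit) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i); // code points, so surrogate pairs count once
            if (counts.merge(cp, 1, Integer::sum) > limit) {
                return false; // this character has now appeared limit + 1 times
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(noCharMoreThan("121212", 3));     // true
        System.out.println(noCharMoreThan("1212124444", 3)); // false
    }
}
```

This runs in time linear in the input length and does no backtracking, which is the main reason to prefer something like it over the lookahead pattern.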


