Match Sequences of Consecutive Characters in a String

Using regex in Ruby 1.8.7+:

p s.scan(/((\d)\2*)/).map(&:first)
#=> ["111", "22", "1"]

This works because (\d) captures any digit, and then \2* matches zero or more repetitions of whatever that group (the one opened by the second parenthesis) captured. The outer (…) is needed so that scan returns the entire run as a result. On its own, scan returns:

[["111", "1"], ["22", "2"], ["1", "1"]]

…so we need to run through and keep just the first item in each array. In Ruby 1.8.6+ (which doesn't have Symbol#to_proc for convenience):

p s.scan(/((\d)\2*)/).map{ |x| x.first }
#=> ["111", "22", "1"]

With no Regex, here's a fun one (matching any char) that works in Ruby 1.9.2:

p s.chars.chunk{|c|c}.map{ |n,a| a.join }
#=> ["111", "22", "1"]

Here's another version that should work even in Ruby 1.8.6:

p s.scan(/./).inject([]){|a,c| (a.last && a.last[0]==c[0] ? a.last : a)<<c; a }
# => ["111", "22", "1"]
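For comparison, here is a rough Python sketch of the same two ideas: a backreference-based regex and a regex-free grouping of adjacent equal characters. The function names and the input "111221" are assumptions, picked only to reproduce the output shown above.

```python
import re
from itertools import groupby

def runs_regex(s):
    # (\d) captures one digit; \1* then matches further copies of that same digit.
    # finditer yields match objects, so group(0) is the whole run and no
    # outer capturing group is needed.
    return [m.group(0) for m in re.finditer(r"(\d)\1*", s)]

def runs_groupby(s):
    # groupby clusters adjacent equal characters, so this works for any character.
    return ["".join(group) for _, group in groupby(s)]

s = "111221"  # assumed input
print(runs_regex(s))    # ['111', '22', '1']
print(runs_groupby(s))  # ['111', '22', '1']
```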

How to match sequences of consecutive money-like characters in a string in Dart?

You can use

final text = '000000000012735';
print(text.replaceFirstMapped(RegExp(r'^0*(\d+)(\d{2})$'), (Match m) =>
"${m[1]}.${m[2]}"));

The output is 127.35.

The regex matches

  • ^ - start of string
  • 0* - zero or more 0 chars
  • (\d+) - Group 1: one or more digits
  • (\d{2}) - Group 2: two digits
  • $ - end of string.

Note that since only one replacement is expected, there is no need to use replaceAllMapped; replaceFirstMapped will do.
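The same pattern carries over to other regex flavours. As a hedged illustration, this Python sketch applies it with re.sub; the helper name format_amount is hypothetical, and count=1 plays the role of replaceFirstMapped.

```python
import re

def format_amount(text):
    # The pattern is anchored with ^ and $, so it can match at most once;
    # count=1 just makes the "single replacement" intent explicit.
    return re.sub(r"^0*(\d+)(\d{2})$", r"\1.\2", text, count=1)

print(format_amount("000000000012735"))  # 127.35
print(format_amount("000012"))           # 0.12
```

Inputs with fewer than three digits do not match and are returned unchanged.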

How to match sequences of consecutive date-like characters in a string in Dart?

You can use

text.replaceAllMapped(RegExp(r'\b(?:((?:19|20)\d{2})(0?[1-9]|1[0-2])(0?[1-9]|[12][0-9]|3[01])|(0?[1-9]|[12][0-9]|3[01])(0?[1-9]|1[0-2])((?:19|20)\d{2}))\b'), (Match m) => m[4] == null ? "${m[1]}.${m[2]}.${m[3]}" : "${m[4]}.${m[5]}.${m[6]}")

The \b(?:((?:19|20)\d{2})(0?[1-9]|1[0-2])(0?[1-9]|[12][0-9]|3[01])|(0?[1-9]|[12][0-9]|3[01])(0?[1-9]|1[0-2])((?:19|20)\d{2}))\b regex matches

  • \b - a word boundary
  • (?: - start of a non-capturing group:
    • ((?:19|20)\d{2}) - year from 20th and 21st centuries
    • (0?[1-9]|1[0-2]) - month
    • (0?[1-9]|[12][0-9]|3[01]) - day
  • | - or
    • (0?[1-9]|[12][0-9]|3[01]) - day
    • (0?[1-9]|1[0-2]) - month
    • ((?:19|20)\d{2}) - year
  • ) - end of the group
  • \b - word boundary.

See a Dart demo:


void main() {
  final text = '13022020 and 20200213 20111919';
  print(text.replaceAllMapped(
      RegExp(r'\b(?:((?:19|20)\d{2})(0?[1-9]|1[0-2])(0?[1-9]|[12][0-9]|3[01])|(0?[1-9]|[12][0-9]|3[01])(0?[1-9]|1[0-2])((?:19|20)\d{2}))\b'),
      (Match m) => m[4] == null
          ? "${m[1]}.${m[2]}.${m[3]}"
          : "${m[4]}.${m[5]}.${m[6]}"));
}

Returning 13.02.2020 and 2020.02.13 20.11.1919.

If Group 4 is null, the first alternative matched, so we join Groups 1, 2 and 3 with a dot. Otherwise, we join Groups 4, 5 and 6 with a dot.
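A Python version of the same substitution might look like the sketch below. DATE_RE and dotted are illustrative names, not part of the original answer; the regex itself is unchanged, only split over two lines for readability.

```python
import re

DATE_RE = re.compile(
    r"\b(?:((?:19|20)\d{2})(0?[1-9]|1[0-2])(0?[1-9]|[12][0-9]|3[01])"
    r"|(0?[1-9]|[12][0-9]|3[01])(0?[1-9]|1[0-2])((?:19|20)\d{2}))\b"
)

def dotted(m):
    # Groups 1-3 are set for year-month-day input; when group 4 took part
    # in the match instead, groups 4-6 hold day-month-year.
    parts = m.group(1, 2, 3) if m.group(4) is None else m.group(4, 5, 6)
    return ".".join(parts)

text = "13022020 and 20200213 20111919"
print(DATE_RE.sub(dotted, text))  # 13.02.2020 and 2020.02.13 20.11.1919
```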

Python regex to match 3 consecutive characters in the alphabet but not necessarily side by side

You ask if it is possible to do that with a regular expression? Certainly! Is it pretty? That's in the eye of the beholder.

You need a regular expression that looks like this (with the case-insensitive flag set):

^(?=.*\d)(?=.*[a-z])(?=.*[<special symbols here>])(?!<no 3 digits that are consecutive>)(?!<no three letters that are consecutive>).{8}

Let's look at the negative lookahead

(?!<no 3 digits that are consecutive>) 

We can write that as follows.

(?!(?:(?=.*0)(?=.*1)(?=.*2))|(?:(?=.*1)(?=.*2)(?=.*3))|(?:(?=.*2)(?=.*3)(?=.*4))|(?:(?=.*3)(?=.*4)(?=.*5))|(?:(?=.*4)(?=.*5)(?=.*6))|(?:(?=.*5)(?=.*6)(?=.*7))|(?:(?=.*6)(?=.*7)(?=.*8))|(?:(?=.*7)(?=.*8)(?=.*9)))

The expression can be written in verbose mode (re.X or re.VERBOSE) to make it self-documenting.

(?!            # begin negative lookahead
  (?:          # begin non-capture group
    (?=.*0)    # match >= 0 characters followed by 0 (positive lookahead)
    (?=.*1)    # match >= 0 characters followed by 1
    (?=.*2)    # match >= 0 characters followed by 2
  )            # end non-capture group
  |            # or
  ... similar for (?:(?=.*1)(?=.*2)(?=.*3))
  ...
)              # end negative lookahead

The construction of

(?!<no three letters that are consecutive>)

is similar (containing an alternation with 24 elements, from a, b and c up to x, y and z).
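Since both lookaheads follow the same mechanical pattern, they can be generated rather than typed out. The sketch below is one way to do that in Python; forbid_three_consecutive and checker are hypothetical names, and the checker applies only these two lookaheads (the digit, letter, special-symbol and length requirements from the full expression are left out).

```python
import re
import string

def forbid_three_consecutive(alphabet):
    # One alternative per window of three neighbours: 012, 123, ... or abc, bcd, ...
    windows = [alphabet[i:i + 3] for i in range(len(alphabet) - 2)]
    alternatives = [
        "(?:" + "".join(f"(?=.*{re.escape(ch)})" for ch in window) + ")"
        for window in windows
    ]
    return "(?!" + "|".join(alternatives) + ")"

no_digit_run = forbid_three_consecutive(string.digits)
no_letter_run = forbid_three_consecutive(string.ascii_lowercase)

# Anchored at the start so each .* scans the whole string.
checker = re.compile("^" + no_digit_run + no_letter_run, re.IGNORECASE)

print(bool(checker.search("q1w3e5")))  # True: no three neighbours present
print(bool(checker.search("1x2y3")))   # False: contains 1, 2 and 3
print(bool(checker.search("cXbYa")))   # False: contains a, b and c
```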

Match two consecutive sequences with fixed overall length

You can use the following trick to avoid writing out an alternation for every possible letter/digit split.

\b[a-z]{1,4}\d{1,4}(?<=\b[a-z\d]{5})

  • \b - assert position at a word boundary
  • [a-z]{1,4} - a lowercase letter, between 1 and 4 times
  • \d{1,4} - a digit, between 1 and 4 times
  • (?<=\b[a-z\d]{5}) - positive lookbehind ensuring that exactly 5 lowercase letters and digits, starting at a word boundary, precede the current position.
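To see the lookbehind enforcing the overall length, here is a small Python check. The sample tokens are made up for illustration. Python accepts this lookbehind because its contents have a fixed width of 5.

```python
import re

PATTERN = re.compile(r"\b[a-z]{1,4}\d{1,4}(?<=\b[a-z\d]{5})")

# Made-up tokens: three valid ones, then letters only, digits only, too short.
sample = "ab123 abcd1 a1234 abcde 12345 ab12"
print(PATTERN.findall(sample))  # ['ab123', 'abcd1', 'a1234']

# The pattern has no trailing \b, so it also accepts the first five
# characters of a longer token.
print(PATTERN.findall("abc12345"))  # ['abc12']
```

If matches inside longer tokens are unwanted, appending \b to the pattern is one possible adjustment; that goes beyond what the original answer shows.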

Find consecutive characters in a string + their start and end indices (python)

One way using re.finditer:

[(*m.span(), len(m.group(0))) for m in re.finditer("-+", s)]

Output:

[(1, 4, 3), (5, 6, 1), (10, 15, 5)]
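The question's input string is not shown here, so the runnable version below uses a hypothetical s that was constructed to reproduce the output above.

```python
import re

s = "a---b-cdef-----g"  # hypothetical input, not from the original question

# m.span() is (start, end) with end exclusive, so the run length is end - start,
# which equals len(m.group(0)).
runs = [(*m.span(), len(m.group(0))) for m in re.finditer("-+", s)]
print(runs)  # [(1, 4, 3), (5, 6, 1), (10, 15, 5)]
```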

Regex to match two or more consecutive characters

You can use a lookahead and a backreference to solve this. But note that right now you are requiring at least 2 characters: the starting letter and another one (due to the +). You probably want to change that + to a * so that the second character class can be repeated 0 or more times:

^(?!.*(.)\1)[a-zA-Z][a-zA-Z\d._-]*$

How does the lookahead work? First, it is a negative lookahead: if the pattern inside finds a match, the lookahead makes the entire pattern fail, and vice versa. So the pattern inside should match exactly when the string does contain two consecutive identical characters. We first skip to an arbitrary position in the string (.*), then match a single arbitrary character (.) and capture it with the parentheses, so that character goes into capturing group 1. Then we require the captured character to be followed by itself (referencing it with \1). Through backtracking, the inner pattern therefore tries every position in the string, checking whether some character is followed by itself. If two such consecutive characters are found, the pattern fails. If they cannot be found, the engine jumps back to where the lookahead started (the beginning of the string) and continues matching the actual pattern.

Alternatively you can split this up into two separate checks. One for valid characters and the starting letter:

^[a-zA-Z][a-zA-Z\d._-]*$

And one for the consecutive characters (where you can invert the match result):

(.)\1

This would greatly increase the readability of your code (because it's less obscure than that lookahead), and it would also let you detect which rule the input actually violated and return an appropriate and helpful error message.
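As a rough sketch of the two-check approach in Python (check_name and the message wording are assumptions made for illustration), each rule gets its own pattern and its own error message. The combined lookahead pattern is included only to show that both approaches agree.

```python
import re

VALID_CHARS = re.compile(r"^[a-zA-Z][a-zA-Z\d._-]*$")
REPEATED = re.compile(r"(.)\1")
COMBINED = re.compile(r"^(?!.*(.)\1)[a-zA-Z][a-zA-Z\d._-]*$")

def check_name(name):
    """Return None if the name is acceptable, else a message naming the failed rule."""
    if not VALID_CHARS.match(name):
        return "must start with a letter and contain only letters, digits, '.', '_' or '-'"
    hit = REPEATED.search(name)
    if hit:
        return f"character {hit.group(1)!r} is repeated at index {hit.start()}"
    return None

for name in ["a.b-c", "1abc", "ab..c"]:
    print(name, "->", check_name(name))
```

The separate REPEATED search can report which character was doubled and where, something the single combined pattern cannot do.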

Using Regex to find longest consecutive match in a string

You should look for all matches of one or more consecutive occurrences of the string 'APPLE', which the following regex will do:

(?:APPLE)+

Then you sort them in descending order by length. Take the longest match (i.e., the first one after sorting) and divide its length by 5 (the number of characters in 'APPLE'); that tells you how many consecutive occurrences of 'APPLE' were found in the longest match:

import re

s = "APPLEORANGEORANGEAPPLEAPPLEAPPLEBANANABANANABANANAAPPLEBANANA"
# Sort explicitly by length, longest run first.
matches = sorted(re.findall(r'(?:APPLE)+', s), key=len, reverse=True)
if matches:
    print(len(matches[0]) // 5)
else:
    print(0)

Prints:

3
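If the word is not fixed, the same idea can be wrapped in a small helper. This is only a sketch: longest_repeat is a hypothetical name, it assumes a non-empty word, and it uses max with key=len instead of a full sort.

```python
import re

def longest_repeat(text, word):
    # re.escape keeps the word literal even if it contains regex metacharacters.
    runs = re.findall(f"(?:{re.escape(word)})+", text)
    # default="" covers the no-match case, which then yields 0 repetitions.
    return len(max(runs, key=len, default="")) // len(word)

s = "APPLEORANGEORANGEAPPLEAPPLEAPPLEBANANABANANABANANAAPPLEBANANA"
print(longest_repeat(s, "APPLE"))   # 3
print(longest_repeat(s, "ORANGE"))  # 2
print(longest_repeat(s, "KIWI"))    # 0
```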

