Regex Match Count of Characters That Are Separated by Non-Matching Characters

Regex match count of characters that are separated by non-matching characters

Hey, I think this would be a simple but working one:

( *?[0-9a-zA-Z] *?){10,}

Breaking the regex down:

  1. ( *? --------It can start with space(s)
  2. [0-9a-zA-Z] -Followed by one alphanumeric character
  3. *?) ---------It can end with space(s)
  4. {10,} -------Matches this pattern 10 or more times

Key: The count in a quantifier such as {10,} applies to the whole group, i.e., everything inside the parentheses "()". In this case, any number of spaces, followed by ONE alphanumeric character, followed by any number of spaces still counts as one match. Hope it helps. :)
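To see the group-level counting in action, here is a minimal Java sketch. The class name AlnumRun and the method hasAtLeastTen are made up for illustration, and it assumes you want the whole input to match (Matcher.matches) rather than a substring.

```java
import java.util.regex.Pattern;

final class AlnumRun {
    // Pattern from the answer: each repetition of the group consumes exactly
    // one alphanumeric character plus any spaces around it.
    static final Pattern TEN_OR_MORE = Pattern.compile("( *?[0-9a-zA-Z] *?){10,}");

    // Hypothetical helper: true when the whole input is made of at least
    // 10 alphanumerics, with optional spaces between them.
    static boolean hasAtLeastTen(String input) {
        return TEN_OR_MORE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasAtLeastTen("123 456 7890")); // true: 10 alphanumerics
        System.out.println(hasAtLeastTen("12 34 56"));     // false: only 6
    }
}
```

The spaces never add to the count, because every pass through the group has to consume one character from [0-9a-zA-Z].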

Count number of character matches in a string (Regex only)?

Use separate look aheads for each assertion:

^(?=(([^ac]*[ac]){2})*[^ac]*$)(?=(([^bd]*[bd]){2})*[^bd]*$).*$

This works because (([^ac]*[ac]){2})* matches pairs of [ac], so each lookahead only succeeds when the string contains an even number of characters from its class. The rest is relatively simple.
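A quick way to check the pairing logic is to run the pattern against a few strings. This is only a sketch in Java; the class name EvenCounts and the method bothEven are assumptions for illustration, and the inputs are kept free of line breaks because the trailing .* does not cross them.

```java
import java.util.regex.Pattern;

final class EvenCounts {
    // First lookahead: even number of a/c characters.
    // Second lookahead: even number of b/d characters.
    static final Pattern EVEN_AC_AND_BD = Pattern.compile(
            "^(?=(([^ac]*[ac]){2})*[^ac]*$)(?=(([^bd]*[bd]){2})*[^bd]*$).*$");

    // Hypothetical helper: true when both counts are even (zero counts as even).
    static boolean bothEven(String s) {
        return EVEN_AC_AND_BD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(bothEven("acbd")); // true: two of [ac], two of [bd]
        System.out.println(bothEven("abc"));  // false: only one of [bd]
    }
}
```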

Regular Expressions and negating a whole character group

Use negative lookahead:

^(?!.*ab).*$

UPDATE: In the comments below, I stated that this approach is slower than the one given in Peter's answer. I've run some tests since then, and found that it's actually slightly faster. However, the reason to prefer this technique over the other is not speed, but simplicity.

The other technique, described here as a tempered greedy token, is suitable for more complex problems, like matching delimited text where the delimiters consist of multiple characters (like HTML, as Luke commented below). For the problem described in the question, it's overkill.

For anyone who's interested, I tested with a large chunk of Lorem Ipsum text, counting the number of lines that don't contain the word "quo". These are the regexes I used:

(?m)^(?!.*\bquo\b).+$

(?m)^(?:(?!\bquo\b).)+$

Whether I search for matches in the whole text, or break it up into lines and match them individually, the anchored lookahead consistently outperforms the floating one.
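For anyone who wants to repeat the comparison, the two regexes can be dropped into a small Java harness like the one below. It is a sketch only: the class name LinesWithoutWord, the method countLines, and the four-line sample text are made up, and it checks that both patterns agree rather than measuring speed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinesWithoutWord {
    // Anchored lookahead: one check per line, then consume the line.
    static final Pattern ANCHORED = Pattern.compile("(?m)^(?!.*\\bquo\\b).+$");
    // Tempered greedy token: the lookahead is re-checked before every character.
    static final Pattern TEMPERED = Pattern.compile("(?m)^(?:(?!\\bquo\\b).)+$");

    // Hypothetical helper: counts how many lines the pattern matches.
    static int countLines(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "quote" contains the letters q-u-o but not the whole word "quo".
        String text = "lorem ipsum\nquo vadis\nstatus quo ante\nquote me\n";
        System.out.println(countLines(ANCHORED, text)); // 2
        System.out.println(countLines(TEMPERED, text)); // 2
    }
}
```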

Alphanumeric regex pattern with definite number of characters separated by hyphen

You may use the following regex to get all occurrences and then filter out those that do not contain a letter with /[A-Z]/ regex:

/(?:^|\s)(?=\S{6,})(?=\S*[A-Z])([A-Z0-9./+~]+(?:-[A-Z0-9./+~]+)+=*)(?!\S)/g

Details

  • (?:^|\s) - a start of string or whitespace
  • (?=\S{6,}) - immediately followed by 6 or more non-whitespace chars
  • (?=\S*[A-Z]) - there must be at least 1 uppercase ASCII letter after 0+ non-whitespace chars
  • ([A-Z0-9./+~]+(?:-[A-Z0-9./+~]+)+=*) - Group 1:

    • [A-Z0-9./+~]+ - 1+ uppercase ASCII letters, digits, ., /, +, ~
    • (?:-[A-Z0-9./+~]+)+ - 1+ occurrences of:

      • - - a - char
      • [A-Z0-9./+~]+ - 1+ uppercase ASCII letters, digits, ., /, +, ~
    • =* - 0+ = symbols
  • (?!\S) - a whitespace or end of string.

See the JS demo:

var s = "1-2-444555656-54545 800-CVB-4=\r\nThe  ABC-CD40N=  is also supported onslots GH-K on the 4000 Series ISRs using the \r\nXYZ-X-THM . This SM-X-NIM-REW34= information is not captured in the table above  \r\nTERMS WITH ONLY DIGITS SHOUD NOT MATCH --> 1-800-553-6387 \r\nNumber of chars less than 6 SHOULD NOT match ---> IP-IP \r\nGH-K\r\nVA-V etc\r\n\r\nFollowing Should match\r\nYUIO-10GB-BG4: Supports JK-X6824-UIO-XK++=  U-VI1.1-100-WX-Y9\r\nXX-123-UVW-3456\r\nVA-V-W-K9\r\nVA-V-W\r\n\r\nThe following term is not matching as there is no Alphabet in first term-----------> 800-CVB-4=        \r\nThis should match\r\n\r\nCD-YT-GH-40G-R9(=) \r\nCRT7.0-TPS8K-F\r\nJ-SMBYTRAS-SUB=\r\n===============================\r\n\r\nBelow terms should NOT match\r\nGH-K\r\nVA-V-W\r\nST-M UCS T-Series <-- Should NOT match\r\n\r\n";
var m, res = [];
var rx = /(?:^|\s)(?=\S{6,})(?=\S*[A-Z])([A-Z0-9./+~]+(?:-[A-Z0-9./+~]+)+=*)(?!\S)/g;
// Collect Group 1 of every match.
while ((m = rx.exec(s))) {
    res.push(m[1]);
}
console.log(res);

find consequent row of numbers (separated by non alphabetic characters) and count them

You could use something like:

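import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Matches a run of 4 or 5 digits separated only by non-alphanumeric characters.
// Group 2 captures the optional 5th digit together with its leading separators.
private static final Pattern p = Pattern
        .compile( "(?<!\\d[^a-z\\d]{0,10000})"
                + "\\d([^a-z\\d]*\\d){3}([^a-z\\d]*\\d)?"
                + "(?![^a-z\\d]*\\d)", Pattern.CASE_INSENSITIVE);

public static String replaceSpecial(String text) {
    StringBuffer sb = new StringBuffer();
    Matcher m = p.matcher(text);
    while (m.find()) {
        // Group 2 is null when only 4 digits were matched.
        m.appendReplacement(sb, m.group(2) == null ? "****" : "*****");
    }
    // Copy whatever follows the last match.
    m.appendTail(sb);
    return sb.toString();
}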

Usage demo:

System.out.println(replaceSpecial("foo 123 56 78 bar 12 32 abc 000_00"));
System.out.println(replaceSpecial("0000"));
System.out.println(replaceSpecial("any text 00 00 more texts"));
System.out.println(replaceSpecial("any text 000 00 more texts 00"));
System.out.println(replaceSpecial("any text 000 00 more texts 00 00"));
System.out.println(replaceSpecial("any text 00-00 more texts 00_00"));

Result:

foo 123 56 78 bar **** abc *****
****
any text **** more texts
any text ***** more texts 00
any text ***** more texts ****
any text **** more texts ****

Idea/explanation:

We want to find a series of digits that have zero or more characters between them that are neither digits nor letters (we can represent these with [^\\da-z], but IMO [^a-z\\d] looks better, so I will use this form). The length of this series is 4 or 5, which we can write as

digit([validSeparator]*digit){3,4} //1 digit + (3 OR 4 digits) => 4 OR 5 digits

but we need some way to recognize whether we matched 4 or 5 digits, because that decides whether the match is replaced with 4 or 5 asterisks.

For this purpose I will put the 5th digit in a separate group and test whether that group took part in the match (it is null if it did not). So I will try to create something like dddd(d)?.

And that is how I came up with

  "\\d([^a-z\\d]*\\d){3}([^a-z\\d]*\\d)?"
//                      ^^^^^^^^^^^^^^^ possible 5th digit

Now we need to make sure that our regex only matches dddd(d) sequences that are not surrounded by any digit on the left or right, because we don't want to match any of the cases like

d ddddd
dddddd
ddddd d

So we need to add tests that check that there is no digit (optionally with valid separators in between) before or after our match. We can use negative look-around mechanisms here, like

  • "(?<!\\d[^a-z\\d]{0,10000})" - I used {0,10000} instead of * because look-behind needs to have a bounded maximal length, which rules out *.

  • "(?![^a-z\\d]*\\d)"

So now all we need to do is combine these regexes (and make the result case-insensitive, or use a-zA-Z instead of a-z)

Pattern p = Pattern.compile( "(?<!\\d[^a-z\\d]{0,10000})"
+ "\\d([^a-z\\d]*\\d){3}([^a-z\\d]*\\d)?"
+ "(?![^a-z\\d]*\\d)", Pattern.CASE_INSENSITIVE);

The rest is simple usage of the appendTail and appendReplacement methods from the Matcher class, which let us decide dynamically what to use as the replacement for each match found (I tried to explain it better here: https://stackoverflow.com/a/25081783/1393766)

Regular Expression to find a string included between two characters while EXCLUDING the delimiters

Easily done:

(?<=\[)(.*?)(?=\])

Technically that's using lookaheads and lookbehinds. See Lookahead and Lookbehind Zero-Width Assertions. The pattern consists of:

  • is preceded by a [ that is not captured (lookbehind);
  • a non-greedy captured group. It's non-greedy to stop at the first ]; and
  • is followed by a ] that is not captured (lookahead).

Alternatively you can just capture what's between the square brackets:

\[(.*?)\]

and return the first captured group instead of the entire match.
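Both variants can be compared side by side. The Java sketch below assumes made-up names (BracketContents, viaLookaround, viaCapture) and a made-up sample string; it reads the entire match for the look-around pattern and group 1 for the capturing pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BracketContents {
    static final Pattern LOOKAROUND = Pattern.compile("(?<=\\[)(.*?)(?=\\])");
    static final Pattern CAPTURE = Pattern.compile("\\[(.*?)\\]");

    // The brackets are only asserted, so the entire match is the content.
    static List<String> viaLookaround(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = LOOKAROUND.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // The brackets are consumed, so the content is read from group 1.
    static List<String> viaCapture(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = CAPTURE.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "x [one] y [two] z";
        System.out.println(viaLookaround(input)); // [one, two]
        System.out.println(viaCapture(input));    // [one, two]
    }
}
```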

Regular expression to match a line that doesn't contain a word

The notion that regex doesn't support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:

^((?!hede).)*$

Non-capturing variant:

^(?:(?!hede).)*$

The regex above will match any string, or line without a line break, not containing the (sub)string 'hede'. As mentioned, this is not something regex is "good" at (or should do), but still, it is possible.

And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):

/^((?!hede).)*$/s

or use it inline:

/(?s)^((?!hede).)*$/

(where the /.../ are the regex delimiters, i.e., not part of the pattern)

If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:

/^((?!hede)[\s\S])*$/

Explanation

A string is just a list of n characters. Before, and after each character, there's an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":

    ┌──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┐
S = │e1│ A │e2│ B │e3│ h │e4│ e │e5│ d │e6│ e │e7│ C │e8│ D │e9│
    └──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┘

index    0      1      2      3      4      5      6      7

where the e's are the empty strings. The regex (?!hede). looks ahead to see if there's no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width-assertions because they don't consume any characters. They only assert/validate something.

So, in my example, every empty string is first validated to see if there's no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group, and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$

As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).
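The behaviour described above is easy to confirm. Here is a small Java sketch; the class name NoHede and both method names are assumptions for illustration, and Pattern.DOTALL plays the role of the s modifier.

```java
import java.util.regex.Pattern;

final class NoHede {
    // The lookahead is checked before every character the dot consumes.
    static final Pattern NO_HEDE = Pattern.compile("^((?!hede).)*$");
    // Same pattern with DOT-ALL, so the dot also matches line breaks.
    static final Pattern NO_HEDE_DOTALL = Pattern.compile("^((?!hede).)*$", Pattern.DOTALL);

    // Hypothetical helper: true when a single line does not contain "hede".
    static boolean lacksHede(String line) {
        return NO_HEDE.matcher(line).matches();
    }

    // Hypothetical helper: same check for input that may span several lines.
    static boolean lacksHedeMultiline(String text) {
        return NO_HEDE_DOTALL.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(lacksHede("ABhedeCD"));        // false: fails at e3
        System.out.println(lacksHede("ABCD"));            // true
        System.out.println(lacksHede("AB\nCD"));          // false: dot stops at the line break
        System.out.println(lacksHedeMultiline("AB\nCD")); // true
    }
}
```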

Which regular expression operator means 'Don't' match this character?

You can use negated character classes to exclude certain characters: for example, [^abcde] will match any character except a, b, c, d, and e.

Instead of specifying all the characters literally, you can use shorthands inside character classes: [\w] (lowercase) will match any "word character" (letters, digits, and underscore), while [\W] (uppercase) will match anything but word characters; similarly, [\d] will match the digits 0-9, while [\D] matches anything but the digits 0-9, and so on.
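As an illustration, here is a short Java sketch that uses each negated form to strip characters from a string. The class and method names are made up, and it relies on Java's default (ASCII) meaning of \w and \d.

```java
final class NegatedClasses {
    // Removes every character that is NOT one of a, b, c, d, e.
    static String keepAbcde(String s) {
        return s.replaceAll("[^abcde]", "");
    }

    // Removes every non-word character (anything outside [a-zA-Z0-9_]).
    static String keepWordChars(String s) {
        return s.replaceAll("\\W", "");
    }

    // Removes every non-digit character.
    static String keepDigits(String s) {
        return s.replaceAll("\\D", "");
    }

    public static void main(String[] args) {
        System.out.println(keepAbcde("badge 42"));       // bade
        System.out.println(keepWordChars("a_b-c 1!"));   // a_bc1
        System.out.println(keepDigits("tel: 555-0199")); // 5550199
    }
}
```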

If you use PHP you can take a look at the regex character classes documentation.