When Not to Use Regex in C# (Or Java, C++, etc.)

Don't try to use regex to parse hierarchical text like program source (or nested XML): regular expressions are provably not powerful enough for that. For example, given a string of parentheses, they can't determine whether the parentheses are balanced.

Use parser generators (or similar technologies) for that.
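
To illustrate the point, checking balanced parentheses needs a counter (or a stack, once several bracket kinds are involved), which is the kind of unbounded memory a classical regular expression lacks. The following is only a minimal sketch in plain Java; the class and method names are made up for this example and a real parser would do far more.

```java
/** Hypothetical helper: checks paren balance with a counter instead of a regex. */
class ParenCheck {
    static boolean isBalanced(String text) {
        int depth = 0; // number of currently open parentheses
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth < 0) {
                    return false; // a ')' appeared before its '('
                }
            }
        }
        return depth == 0; // every '(' must have been closed
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("(a(b)c)")); // true
        System.out.println(isBalanced("(()"));     // false
        System.out.println(isBalanced(")("));      // false
    }
}
```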

Also, I wouldn't recommend using regex to validate data with strict formal standards, like e-mail addresses.
They're harder than you'd expect, and you'll end up with either an inaccurate regex or a very long one.

Are Java and C# regular expressions compatible?

There are quite a lot of differences.

Character Class

  1. Character classes subtraction [abc-[cde]]
    • .NET YES (2.0)
    • Java: Emulated via character class intersection and negation: [abc&&[^cde]]
  2. Character classes intersection [abc&&[cde]]
    • .NET: Emulated via character class subtraction and negation: [abc-[^cde]]
    • Java YES
  3. \p{Alpha} POSIX character class

    • .NET NO
    • Java YES (US-ASCII)
  4. In (?x) mode (COMMENTS / IgnorePatternWhitespace), a space (U+0020) inside a character class is significant.

    • .NET YES
    • Java NO
  5. Unicode Category (L, M, N, P, S, Z, C)

    • .NET YES: \p{L} form only
    • Java YES:

      • From Java 5: \pL, \p{L}, \p{IsL}
      • From Java 7: \p{general_category=L}, \p{gc=L}
  6. Unicode Category (Lu, Ll, Lt, ...)

    • .NET YES: \p{Lu} form only
    • Java YES:

      • From Java 5: \p{Lu}, \p{IsLu}
      • From Java 7: \p{general_category=Lu}, \p{gc=Lu}
  7. Unicode Block

    • .NET YES: \p{IsBasicLatin} only. (Supported Named Blocks)
    • Java YES: (name of the block is free-casing)

      • From Java 5: \p{InBasicLatin}
      • From Java 7: \p{block=BasicLatin}, \p{blk=BasicLatin}
  8. Spaces and underscores are allowed in all long block names (e.g. BasicLatin can be written as Basic_Latin or Basic Latin)

    • .NET NO
    • Java YES (Java 5)
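
A few of the Java-side entries above can be tried out directly. The sketch below uses only java.util.regex.Pattern; the class name and the helper method are invented for illustration, and the sample inputs are arbitrary.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Java has no [abc-[cde]] subtraction; intersecting with a negated class yields the same set {a, b}.
    static final Pattern A_OR_B = Pattern.compile("[abc&&[^cde]]");

    // Unicode general category L (any letter), in the \p{L} form that .NET also accepts.
    static final Pattern LETTER = Pattern.compile("\\p{L}");

    // POSIX class: US-ASCII letters only by default.
    static final Pattern ASCII_ALPHA = Pattern.compile("\\p{Alpha}");

    // Unicode block, using Java's In prefix (.NET would write \p{IsBasicLatin}).
    static final Pattern BASIC_LATIN = Pattern.compile("\\p{InBasicLatin}");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(A_OR_B, "a"));           // true
        System.out.println(matches(A_OR_B, "c"));           // false
        System.out.println(matches(LETTER, "\u00e9"));      // true  (e with acute accent)
        System.out.println(matches(ASCII_ALPHA, "\u00e9")); // false (not US-ASCII)
        System.out.println(matches(BASIC_LATIN, "a"));      // true
    }
}
```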

Quantifier

  1. ?+, *+, ++ and {m,n}+ (possessive quantifiers)

    • .NET NO
    • Java YES
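
As a small illustration of what a possessive quantifier changes, the Java sketch below compares a greedy and a possessive version of the same pattern. The class and method names are invented for this example.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // Greedy: a* first takes every 'a', then gives one back so the final 'a' can match.
    static boolean greedyMatches(String s) {
        return Pattern.matches("a*a", s);
    }

    // Possessive: a*+ takes every 'a' and never gives any back, so the final 'a' has nothing left.
    static boolean possessiveMatches(String s) {
        return Pattern.matches("a*+a", s);
    }

    public static void main(String[] args) {
        System.out.println(greedyMatches("aaa"));     // true
        System.out.println(possessiveMatches("aaa")); // false
    }
}
```

Because the possessive form refuses to backtrack, it can fail faster on non-matching input, at the cost of rejecting strings the greedy form accepts.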

Quotation

  1. \Q...\E escapes a string of metacharacters

    • .NET NO
    • Java YES
  2. \Q...\E escapes a string of character class metacharacters (in character sets)

    • .NET NO
    • Java YES
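
For reference, this is roughly how \Q...\E behaves in Java outside a character class. Pattern.quote is the standard library method that produces such a quoted literal; the class name and sample strings below are just for illustration.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Everything between \Q and \E is taken literally, so '+' is not a quantifier here.
    static final Pattern LITERAL = Pattern.compile("\\Q1+1=2\\E");

    // Pattern.quote builds an equivalent literal pattern for an arbitrary string.
    static final Pattern QUOTED = Pattern.compile(Pattern.quote("1+1=2"));

    // Without quoting, "1+" means "one or more 1s".
    static final Pattern UNQUOTED = Pattern.compile("1+1=2");

    public static void main(String[] args) {
        System.out.println(LITERAL.matcher("1+1=2").matches());  // true
        System.out.println(QUOTED.matcher("1+1=2").matches());   // true
        System.out.println(UNQUOTED.matcher("1+1=2").matches()); // false
        System.out.println(UNQUOTED.matcher("11=2").matches());  // true
    }
}
```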

Matching construct

  1. Conditional matching (?(?=regex)then|else), (?(regex)then|else), (?(1)then|else) or (?(group)then|else)
    • .NET YES
    • Java NO
  2. Named capturing group and named backreference

    • .NET YES:

      • Capturing group: (?<name>regex) or (?'name'regex)
      • Backreference: \k<name> or \k'name'
    • Java YES (Java 7):

      • Capturing group: (?<name>regex)
      • Backreference: \k<name>
  3. Multiple capturing groups can have the same name

    • .NET YES
    • Java NO (Java 7)
  4. Balancing group definition (?<name1-name2>regex) or (?'name1-name2'subexpression)
    • .NET YES
    • Java NO
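
The Java column for named groups can be sketched as follows (Java 7 or later). The patterns, class name, and helper methods are hypothetical and exist only to show the syntax, including the fact that reusing a group name is rejected at compile time.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class NamedGroupDemo {
    // Named capturing groups: (?<name>regex)
    static final Pattern DATE = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})");

    // Named backreference: \k<name> must repeat whatever the group captured.
    static final Pattern DOUBLED = Pattern.compile("(?<word>\\w+) \\k<word>");

    static String yearOf(String text) {
        Matcher m = DATE.matcher(text);
        return m.find() ? m.group("year") : null;
    }

    static boolean hasDoubledWord(String text) {
        return DOUBLED.matcher(text).find();
    }

    // Returns false when Pattern.compile rejects the regex.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(yearOf("released 2013-07"));      // 2013
        System.out.println(hasDoubledWord("it is is fine")); // true
        System.out.println(compiles("(?<a>x)(?<a>y)"));      // false: duplicate group name
    }
}
```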

Assertions

  1. (?<=text) (positive lookbehind)

    • .NET Variable-width
    • Java Obvious width
  2. (?<!text) (negative lookbehind)

    • .NET Variable-width
    • Java Obvious width
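
In practice, "obvious width" means Java accepts a lookbehind whose length varies as long as the engine can see a maximum. The sketch below shows one such bounded case; the pattern, the class name, and the sample text are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindDemo {
    // The lookbehind body is "USD" plus 1 to 3 whitespace characters:
    // variable width, but with an obvious maximum of 6 characters.
    static final Pattern AMOUNT = Pattern.compile("(?<=USD\\s{1,3})\\d+");

    static String amountOf(String text) {
        Matcher m = AMOUNT.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(amountOf("total: USD  42")); // 42
        System.out.println(amountOf("total: EUR 42"));  // null
    }
}
```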

Mode Options/Flags

  1. ExplicitCapture option (?n)
    • .NET YES
    • Java NO

Miscellaneous

  1. (?#comment) inline comments

    • .NET YES
    • Java NO

References

  • regular-expressions.info - Comparison of Different Regex Flavors
  • MSDN Library Reference - .NET Framework 4.5 - Regular Expression Language
  • Pattern (Java Platform SE 7)

Should I avoid regular expressions?

Don't avoid them. They're an excellent tool, and when used appropriately can save you a lot of time and effort. Moreover, a good implementation used carefully should not be particularly CPU-intensive.

Are Regular Expressions a must for programming?

One could easily go without them but one should (IMHO) know the basics, for 2 reasons.

1) There may come a time when RegEx is the best solution to the problem at hand.

2) When you see a Regex in someone else's code it shouldn't be 100% mystical.

preg_match('/summarycount">.*?([,\d]+)<\/div>.*?Reputation/s', $page, $rep);

This code is simple enough, but if you don't know RegEx, the first parameter may as well be written in Martian. The RegEx used here is actually pretty simple once you learn the basics. To get that far, head over to http://www.regular-expressions.info/: it has a lot of information about RegEx and its various implementations on different platforms and languages, plus a great tutorial to get started with. After that, check out RegexBuddy. It helps you build RegExes, and watching what it does as you build them helps you learn. It was by far the best $39.95 I've ever spent.
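
For readers following along in Java rather than PHP, the same pattern could look roughly like this. Pattern.DOTALL plays the role of the /s modifier; the class name, method name, and the sample page fragment are invented for illustration and are not the markup of any real page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReputationScraper {
    // Same regex as the preg_match call; DOTALL lets '.' match line breaks like PHP's /s.
    static final Pattern REPUTATION = Pattern.compile(
            "summarycount\">.*?([,\\d]+)</div>.*?Reputation", Pattern.DOTALL);

    // Returns the first captured group (digits and commas), or null if nothing matches.
    static String extractReputation(String page) {
        Matcher m = REPUTATION.matcher(page);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical page fragment, made up for this example.
        String page = "<div class=\"summarycount\">\n 12,345</div>\n<div>Reputation</div>";
        System.out.println(extractReputation(page)); // 12,345
    }
}
```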






Regex Captures in Java like in C#

As I mentioned in the comments, Java only returns the last value captured by a group that matches multiple times. So you should first use a regex to isolate the last part of your string, the part containing the values:

String strg = "0.478\t0.209\t0.211\t0.211\t0.205\t-0.462\t0.203\t0.202\t0.212";

and then just split around the tabs:

String[] values = strg.split("\\t");
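
The behaviour is easy to reproduce. In the sketch below, a repeated group is matched against the whole string and only its final capture survives (in .NET, Group.Captures would expose all of them), while split returns every value. The regex and the class name are assumptions made for this demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastCaptureDemo {
    static final String STRG =
            "0.478\t0.209\t0.211\t0.211\t0.205\t-0.462\t0.203\t0.202\t0.212";

    // Group 1 is repeated once per number; Java keeps only the last repetition.
    static String lastCapture(String text) {
        Matcher m = Pattern.compile("(?:(-?\\d+\\.\\d+)\\t?)+").matcher(text);
        return m.matches() ? m.group(1) : null;
    }

    // Splitting around the tabs recovers every value.
    static String[] allValues(String text) {
        return text.split("\\t");
    }

    public static void main(String[] args) {
        System.out.println(lastCapture(STRG));      // 0.212
        System.out.println(allValues(STRG).length); // 9
    }
}
```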

Capture a part of a string that does not match another group (C# Regex)

I think trying to parse and validate the entire text in one regular expression is likely to give you problems. The text you are parsing is not a regular language, so regular expressions are not well suited to this purpose.

Instead I would recommend that you first tokenize the input to single tags and text between the tags. You can use a simple regular expression to find single tags - this is a much simpler problem that regular expressions can handle quite well. Once you have tokenized it, you can iterate over the tokens with an ordinary loop and apply formatting to the text as appropriate.
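
The tokenize-then-loop approach might be sketched like this in Java. The tag syntax here (angle-bracket tags such as <b> and </b>) is an assumption, since the original input format is not shown, and the class and method names are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagTokenizer {
    // Either one complete tag such as <b> or </b>, or a run of text containing no '<'.
    static final Pattern TOKEN = Pattern.compile("<[^>]+>|[^<]+");

    // Step 1: a simple regex splits the input into tags and the text between them.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Step 2: an ordinary loop with a stack handles the nesting the regex cannot.
    static boolean isWellNested(List<String> tokens) {
        Deque<String> open = new ArrayDeque<>();
        for (String token : tokens) {
            if (token.startsWith("</")) {
                String name = token.substring(2, token.length() - 1);
                if (open.isEmpty() || !open.pop().equals(name)) {
                    return false; // closing tag does not match the innermost open tag
                }
            } else if (token.startsWith("<")) {
                open.push(token.substring(1, token.length() - 1));
            }
            // Plain text tokens are where formatting would be applied.
        }
        return open.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a<b>c</b>"));                    // [a, <b>, c, </b>]
        System.out.println(isWellNested(tokenize("<b><i>x</i></b>"))); // true
        System.out.println(isWellNested(tokenize("<b><i>x</b></i>"))); // false
    }
}
```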

Are regexes really maintainable?

If regexes are long and impenetrable, which makes them hard to maintain, then they should be commented.

A lot of regex implementations allow you to pad regexes with whitespace and comments.

See https://www.regular-expressions.info/freespacing.html#parenscomment

and Coding Horror: Regular Expressions: Now You Have Two Problems
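
In Java, for instance, the Pattern.COMMENTS flag ignores whitespace in the pattern and treats # as the start of a comment that runs to the end of the line. The date pattern and class name below are only an illustrative sketch, not a recommended date validator.

```java
import java.util.regex.Pattern;

class CommentedRegexDemo {
    // Each line of the pattern carries its own explanation.
    static final Pattern ISO_DATE = Pattern.compile(
            "(\\d{4})  # year: four digits\n"
          + "-         # literal separator\n"
          + "(\\d{2})  # month: two digits\n"
          + "-         # literal separator\n"
          + "(\\d{2})  # day: two digits\n",
            Pattern.COMMENTS);

    public static void main(String[] args) {
        System.out.println(ISO_DATE.matcher("2024-01-31").matches()); // true
        System.out.println(ISO_DATE.matcher("2024-1-31").matches());  // false
    }
}
```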

"Any code I've seen that uses Regexes tends to use them as a black box:"

If by black box you mean abstraction, that's what all programming is, trying to abstract away the difficult part (parsing strings) so that you can concentrate on the problem domain (what kind of strings do I want to match).

"even a small change can often result in a completely different regex."

That's true of any code. As long as you are testing your regex to make sure it matches the strings you expect, ideally with unit tests, you should be confident in changing it.
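
A regex test can be as small as a table of inputs and expected results. The sketch below uses plain Java with a hand-rolled check method so it stays self-contained; in a real project a test framework such as JUnit would be the usual choice. The hex-colour pattern and all names are hypothetical.

```java
import java.util.regex.Pattern;

class RegexChecks {
    // Hypothetical regex under test: a six-digit hex colour such as #ff8800.
    static final Pattern HEX_COLOR = Pattern.compile("#[0-9a-fA-F]{6}");

    // Fails loudly if the regex disagrees with the expectation.
    static void check(String input, boolean expected) {
        boolean actual = HEX_COLOR.matcher(input).matches();
        if (actual != expected) {
            throw new AssertionError("'" + input + "': expected " + expected + " but got " + actual);
        }
    }

    static void runAll() {
        check("#ff8800", true);
        check("#FF8800", true);
        check("ff8800", false);  // missing '#'
        check("#ff88", false);   // too short
        check("#ff88zz", false); // not hex digits
    }

    public static void main(String[] args) {
        runAll();
        System.out.println("all regex checks passed");
    }
}
```

With checks like these in place, a change to the pattern that breaks an expected case is caught immediately rather than in production.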

Edit: please also read Jeff's comment to this answer about production code.


