Are Java and C# Regular Expressions Compatible

Are Java and C# regular expressions compatible?

There are quite a lot of differences.

Character Class

  1. Character classes subtraction [abc-[cde]]
    • .NET YES (2.0)
    • Java: Emulated via character class intersection and negation: [abc&&[^cde]]
  2. Character classes intersection [abc&&[cde]]
    • .NET: Emulated via character class subtraction and negation: [abc-[^cde]]
    • Java YES
  3. \p{Alpha} POSIX character class

    • .NET NO
    • Java YES (US-ASCII)
  4. Under (?x) mode COMMENTS/IgnorePatternWhitespace, space (U+0020) in character class is significant.

    • .NET YES
    • Java NO
  5. Unicode Category (L, M, N, P, S, Z, C)

    • .NET YES: \p{L} form only
    • Java YES:

      • From Java 5: \pL, \p{L}, \p{IsL}
      • From Java 7: \p{general_category=L}, \p{gc=L}
  6. Unicode Category (Lu, Ll, Lt, ...)

    • .NET YES: \p{Lu} form only
    • Java YES:

      • From Java 5: \p{Lu}, \p{IsLu}
      • From Java 7: \p{general_category=Lu}, \p{gc=Lu}
  7. Unicode Block

    • .NET YES: \p{IsBasicLatin} only. (Supported Named Blocks)
    • Java YES: (name of the block is free-casing)

      • From Java 5: \p{InBasicLatin}
      • From Java 7: \p{block=BasicLatin}, \p{blk=BasicLatin}
  8. Spaces, and underscores allowed in all long block names (e.g. BasicLatin can be written as Basic_Latin or Basic Latin)

    • .NET NO
    • Java YES (Java 5)
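
The character class entries above can be checked from Java with a short sketch. The class and helper names (CharClassDemo, isAOrB, full) are made up for illustration; only java.util.regex.Pattern is real API. It shows the Java emulation of .NET's subtraction and the default US-ASCII meaning of \p{Alpha} next to the Unicode category forms.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Java emulation of the .NET subtraction [abc-[cde]]:
    // intersect [abc] with the negation of [cde], leaving only 'a' and 'b'.
    static final Pattern A_OR_B = Pattern.compile("[abc&&[^cde]]");

    static boolean isAOrB(String s) {
        return A_OR_B.matcher(s).matches();
    }

    // True when the whole input matches the given regex.
    static boolean full(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(isAOrB("a")); // true
        System.out.println(isAOrB("c")); // false: 'c' is excluded by [^cde]

        String eAcute = "\u00E9"; // small e with acute accent, a non-ASCII letter
        System.out.println(full("\\p{Alpha}", eAcute));     // false: US-ASCII only by default
        System.out.println(full("\\p{L}", eAcute));         // true: Unicode category Letter
        System.out.println(full("\\pL", eAcute));           // true: single-letter shorthand
        System.out.println(full("\\p{IsL}", eAcute));       // true: optional Is prefix
        System.out.println(full("\\p{InBasicLatin}", "a")); // true: block form with In prefix
    }
}
```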

Quantifier

  1. ?+, *+, ++ and {m,n}+ (possessive quantifiers)

    • .NET NO
    • Java YES
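
As a rough illustration of what a possessive quantifier changes, the sketch below compares a greedy and a possessive star on the same input. PossessiveDemo and full are hypothetical names used only for this example.

```java
import java.util.regex.Pattern;

public class PossessiveDemo {
    // True when the whole input matches the given regex.
    static boolean full(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Greedy a* gives back one 'a' so that the trailing 'a' can match.
        System.out.println(full("a*a", "aaa"));   // true
        // Possessive a*+ keeps every 'a' and never backtracks,
        // so nothing is left for the trailing 'a'.
        System.out.println(full("a*+a", "aaa"));  // false
        // When no backtracking is needed, the result is the same as greedy.
        System.out.println(full("a*+b", "aaab")); // true
    }
}
```

A pattern relying on this behaviour has no direct quantifier equivalent in .NET, per the entry above, so it would need restructuring when ported.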

Quotation

  1. \Q...\E escapes a string of metacharacters

    • .NET NO
    • Java YES
  2. \Q...\E escapes a string of character class metacharacters (in character sets)

    • .NET NO
    • Java YES
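
A small sketch of the Java side of the quotation entries, using a made-up class QuoteDemo. Pattern.quote is the standard library helper that wraps a string in \Q...\E for you.

```java
import java.util.regex.Pattern;

public class QuoteDemo {
    // True when the whole input matches the given regex.
    static boolean full(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String literal = "1+1=2";
        // Unquoted, '+' is a quantifier, so the pattern does not match itself.
        System.out.println(full(literal, literal));                // false
        // \Q...\E turns every metacharacter inside it into a literal.
        System.out.println(full("\\Q1+1=2\\E", literal));          // true
        // Pattern.quote builds the quoted form programmatically.
        System.out.println(full(Pattern.quote(literal), literal)); // true
        // Inside a character class, \Q...\E quotes class metacharacters too.
        System.out.println(full("[\\Q^-]\\E]+", "^-]"));           // true
    }
}
```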

Matching construct

  1. Conditional matching (?(?=regex)then|else), (?(regex)then|else), (?(1)then|else) or (?(group)then|else)
    • .NET YES
    • Java NO
  2. Named capturing group and named backreference

    • .NET YES:

      • Capturing group: (?<name>regex) or (?'name'regex)
      • Backreference: \k<name> or \k'name'
    • Java YES (Java 7):

      • Capturing group: (?<name>regex)
      • Backreference: \k<name>
  3. Multiple capturing groups can have the same name

    • .NET YES
    • Java NO (Java 7)
  4. Balancing group definition (?<name1-name2>regex) or (?'name1-name2'subexpression)
    • .NET YES
    • Java NO

Assertions

  1. (?<=text) (positive lookbehind)

    • .NET Variable-width
    • Java Bounded width only (the lookbehind must have an obvious maximum width)
  2. (?<!text) (negative lookbehind)

    • .NET Variable-width
    • Java Bounded width only (the lookbehind must have an obvious maximum width)
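
The lookbehind difference can be probed with a sketch like the one below. LookbehindDemo, found, and compiles are hypothetical names. The bounded form is safe in Java; whether a particular JVM accepts an unbounded lookbehind is deliberately only reported, not assumed.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class LookbehindDemo {
    // b{1,3} gives the lookbehind an obvious maximum width (4 characters).
    static final Pattern BOUNDED = Pattern.compile("(?<=ab{1,3})c");

    static boolean found(String input) {
        return BOUNDED.matcher(input).find();
    }

    // Reports whether this JVM accepts the pattern at all.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(found("abbc")); // true: 'c' is preceded by "abb"
        System.out.println(found("xc"));   // false
        // (?<=ab+)c is fine in .NET; print what the current JVM makes of it.
        System.out.println(compiles("(?<=ab+)c"));
    }
}
```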

Mode Options/Flags

  1. ExplicitCapture option (?n)
    • .NET YES
    • Java NO

Miscellaneous

  1. (?#comment) inline comments

    • .NET YES
    • Java NO

References

  • regular-expressions.info - Comparison of Different Regex Flavors
  • MSDN Library Reference - .NET Framework 4.5 - Regular Expression Language
  • Pattern (Java Platform SE 7)

Differences in regex between java .net and javascript?

Simple regex patterns will work across different implementations; however, there are differences for more complex matches. Still, there is only a limited number of common variations, namely POSIX and Perl regex, so you can have two columns for them in the DB, which will be enough in most cases.

Modifying PCRE Regex to C# or Java Supported Regex

There are lots of things to bear in mind.

  • Named capturing groups: Their syntax in Java is (?<name>pattern), and the names can only consist of ASCII digits or letters (see I can't use a group name like this "abc_def" using Patterns). Replace all (?P<name_parts>...) with (?<nameparts>...).
  • Use of #: In many flavors other than Java, the free-spacing mode allows using a literal # inside character classes unescaped. In Java, any meaningful whitespace and # MUST be escaped EVEN inside character classes (replace every literal # with \\# both inside character classes and elsewhere in the pattern).
  • Pattern.COMMENTS is used in Java to enable free-spacing / comment mode. Alternatively, add (?x) at the pattern start.

Here is your code fix:

String line = "Bygholm Søpark 21B";
String pattern = "\\A\\s*\r\n" +
"(?: #########################################################################\r\n" +
" # Option A: [<Addition to address 1>] <House number> <Street name> #\r\n" +
" # [<Addition to address 2>] #\r\n" +
" #########################################################################\r\n" +
" (?:(?<AAdditiontoaddress1>.*?),\\s*)? # Addition to address 1\r\n" +
"(?:No\\.\\s*)?\r\n" +
" (?<AHousenumber1>\\pN+[a-zA-Z]?(?:\\s*[-/\\pP]\\s*\\pN+[a-zA-Z]?)*) # House number\r\n" +
"\\s*,?\\s*\r\n" +
" (?<AStreetname1>(?:[a-zA-Z]\\s*|\\pN\\pL{2,}\\s\\pL)\\S[^,\\#]*?(?<!\\s)) # Street name\r\n" +
"\\s*(?:(?:[,/]|(?=\\#))\\s*(?!\\s*No\\.)\r\n" +
" (?<AAdditiontoaddress2>(?!\\s).*?))? # Addition to address 2\r\n" +
"| #########################################################################\r\n" +
" # Option B: [<Addition to address 1>] <Street name> <House number> #\r\n" +
" # [<Addition to address 2>] #\r\n" +
" #########################################################################\r\n" +
" (?:(?<BAdditiontoaddress1>.*?),\\s*(?=.*[,/]))? # Addition to address 1\r\n" +
" (?!\\s*No\\.)(?<BStreetname>\\S\\s*\\S(?:[^,\\#](?!\\b\\pN+\\s))*?(?<!\\s)) # Street name\r\n" +
"\\s*[/,]?\\s*(?:\\sNo\\.)?\\s+\r\n" +
" (?<BHousenumber>\\pN+\\s*-?[a-zA-Z]?(?:\\s*[-/\\pP]?\\s*\\pN+(?:\\s*[-a-zA-Z])?)*|[IVXLCDM]+(?!.*\\b\\pN+\\b))(?<!\\s) # House number\r\n" +
"\\s*(?:(?:[,/]|(?=\\#)|\\s)\\s*(?!\\s*No\\.)\\s*\r\n" +
" (?<BAdditiontoaddress2>(?!\\s).*?))? # Addition to address 2\r\n" +
")\r\n" +
"\\s*\\Z";

// Create a Pattern object (needs java.util.regex.Pattern and java.util.regex.Matcher)
Pattern r = Pattern.compile(pattern, Pattern.COMMENTS);
// Now create a matcher object.
Matcher m = r.matcher(line);
if (m.find()) {
    // Option B matched: street name first, then house number
    System.out.println("B_Street_name: " + m.group("BStreetname"));
    System.out.println("B_House_number: " + m.group("BHousenumber"));
    System.out.println("B_Addition_to_address_2: " + m.group("BAdditiontoaddress2"));
} else {
    System.out.println("NO MATCH");
}

See the Java demo online.

Output:

B_Street_name: Bygholm Søpark
B_House_number: 21B
B_Addition_to_address_2: null
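
The escaping rules for Java's free-spacing mode described in the bullets above can be seen in isolation in a much smaller sketch. FreeSpacingDemo and full are names invented for this example; Pattern.COMMENTS is the real flag.

```java
import java.util.regex.Pattern;

public class FreeSpacingDemo {
    // Whole-input match with free-spacing / comment mode enabled.
    static boolean full(String regex, String input) {
        return Pattern.compile(regex, Pattern.COMMENTS).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Whitespace and the "# ..." comment up to the line break are ignored.
        System.out.println(full("\\d+  # digits\n - \\d+", "12-34")); // true
        // Java ignores an unescaped space even inside a character class.
        System.out.println(full("[a b]", " "));   // false
        // Escaping makes the space and the '#' meaningful again.
        System.out.println(full("[a\\ b]", " ")); // true
        System.out.println(full("[a\\#b]", "#")); // true
    }
}
```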

C# equivalent of java Matcher.hitEnd()

Built-in .NET alternative

It seems there is no direct built-in .NET alternative (within the System.Text.RegularExpressions namespace) to the Java java.util.regex.Matcher.hitEnd() method.
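
For readers who have not used it, here is a minimal Java sketch of what hitEnd() is typically used for: telling a hopeless input apart from one that is merely incomplete. The class name, the phone-number pattern, and isViablePrefix are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HitEndDemo {
    static final Pattern PHONE = Pattern.compile("\\d{3}-\\d{4}");

    // True when the input does not match yet, but the matcher ran out of
    // input while trying, so appending more characters could still succeed.
    static boolean isViablePrefix(String input) {
        Matcher m = PHONE.matcher(input);
        return !m.matches() && m.hitEnd();
    }

    public static void main(String[] args) {
        System.out.println(isViablePrefix("555-12"));   // true: more digits could complete it
        System.out.println(isViablePrefix("555x"));     // false: fails before the end of input
        System.out.println(isViablePrefix("555-1234")); // false: already a full match
    }
}
```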

Alternative libraries

An alternative library may provide the required functionality.

PCRE.NET

For example, a quick search revealed the library ltrzesniewski/pcre-net: PCRE.NET - Perl Compatible Regular Expressions for .NET. As per its documentation (README.md), the library supports partial matching:

Example usage


<…>

  • Partial matching:

    var regex = new PcreRegex(@"(?<=abc)123");
    var match = regex.Match("xyzabc12", PcreMatchOptions.PartialSoft);
    // result: match.IsPartialMatch == true

.net regex into java @(?!\\)(?'M'[^|%])?

This regex consists of two groups.

The first one, (?<!\\), is a negative lookbehind assertion. It will match only if the previous character is not a backslash. The second one, (?'M'[^|%]), is a named capturing group (called M) that matches any character except "|" and "%".

I.e. the regex will match "a", and not match "\a" or "%".

Java before version 7 does not support named capture at all, and Java 7 and later accept only the (?<M>...) form, not .NET's (?'M'...), but

(?<!\\)([^|%])

should work fine for you. You'd reference the first group by number, instead of by name then.

Note that you may have to escape backslashes leading to (?<!\\\\) for the first part.
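
On Java 7 or later (see the named capturing group entry in the comparison at the top of this page), the group can keep its name if the .NET-only quote syntax is swapped for angle brackets. A sketch, with NamedGroupPort and firstM as made-up names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupPort {
    // .NET:    (?<!\\)(?'M'[^|%])
    // Java 7+: (?<!\\)(?<M>[^|%])
    // Each regex backslash is doubled once more inside the Java string literal.
    static final Pattern P = Pattern.compile("(?<!\\\\)(?<M>[^|%])");

    // Returns the text captured by group M for the first match, or null.
    static String firstM(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group("M") : null;
    }

    public static void main(String[] args) {
        System.out.println(firstM("a"));  // a
        System.out.println(firstM("%"));  // null: '%' is excluded by the class
        System.out.println(firstM("|%")); // null
    }
}
```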

Are Regular Expressions a must for programming?

One could easily go without them but one should (IMHO) know the basics, for 2 reasons.

1) There may come a time where RegEx is the best solution to the problem at hand (see image below)

2) When you see a Regex in someone else's code it shouldn't be 100% mystical.

preg_match('/summarycount">.*?([,\d]+)<\/div>.*?Reputation/s', $page, $rep);

This code is simple enough, but if you don't know RegEx then the stuff that's in the first parameter may as well be a Martian language. The RegEx that's used here is actually pretty simple once you learn the basics, and to get you that far head over to http://www.regular-expressions.info/. They have a lot of info about RegEx and its various implementations on the different platforms/languages, and they also have a great tutorial to get started with. After that check out RegexBuddy: it can help you build RegExs, and if you watch what it does while you build them, it can help you learn. It was by far the best $39.95 I've ever spent.

[Image: Original Comic]

Removing all punctuation using POSIX in Java and C# produce different output

"\\p{P}" means the same in both Java and C#, i.e. match Unicode Category P (Punctuation).

Java's "\\p{Punct}" means something else, and is documented as:

Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

So, the equivalent C# is "[!\"#$%&'()*+,\\-./:;<=>?@\\[\\\\\\]^_`{|}~]"
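
The difference between the two classes on the Java side can be seen with a short sketch. PunctDemo and its two helper methods are names invented here; the sample string is an assumption chosen to contain one character from each group.

```java
public class PunctDemo {
    // Removes every character in Unicode category P (Punctuation).
    static String stripUnicodePunctuation(String s) {
        return s.replaceAll("\\p{P}", "");
    }

    // Removes the ASCII characters listed for Java's \p{Punct}.
    static String stripPosixPunct(String s) {
        return s.replaceAll("\\p{Punct}", "");
    }

    public static void main(String[] args) {
        // '$' is a currency symbol (category Sc), '!' is ASCII punctuation,
        // and U+00AB is a non-ASCII quotation mark (category Pi).
        String s = "a$b!c\u00ABd";
        System.out.println(stripUnicodePunctuation(s)); // keeps '$', drops '!' and U+00AB
        System.out.println(stripPosixPunct(s));         // drops '$' and '!', keeps U+00AB
    }
}
```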

How to add features missing from the Java regex implementation?

From your edited example, I can now see what you would like. And you have my sympathies in this, too. Java’s regexes are a long, long, long ways from the convenience you find in Ruby or Perl. And they pretty much always will be; this cannot be fixed, so we’re stuck with this mess forever — at least in Java. Other JVM languages do a better job at this, especially Groovy. But they still suffer some of the inherent flaws, and can only go so far.

Where to begin? There are the so-called convenience methods of the String class: matches, replaceAll, replaceFirst, and split. These can sometimes be ok in small programs, depending how you use them. However, they do indeed have several problems, which it appears you have discovered. Here’s a partial list of those problems, and what can and cannot be done about them.

  1. The inconvenience method is very bizarrely named “matches” but it requires you to pad your regex on both sides to match the entire string. This counter-intuitive sense is contrary to any sense of the word match as used in any previous language, and constantly bites people. Patterns passed into the other 3 inconvenience methods work very unlike this one, because in the other 3, they work like normal patterns work everywhere else; just not in matches. This means you can’t just copy your patterns around, even within methods in the same darned class for goodness’ sake! And there is no find convenience method to do what every other matcher in the world does. The matches method should have been called something like FullMatch, and there should have been a PartialMatch or find method added to the String class.

  2. There is no API that allows you to pass in Pattern.compile flags along with the strings you use for the 4 pattern-related convenience methods of the String class. That means you have to rely on string versions like (?i) and (?x), but those do not exist for all possible Pattern compilation flags. This is highly inconvenient to say the least.

  3. The split method does not return the same result in edge cases as split returns in the languages that Java borrowed split from. This is a sneaky little gotcha. How many elements do you think you should get back in the return list if you split the empty string, eh? Java manufactures a fake return element where there should be none, which means you can’t distinguish between legit results and bogus ones. It is a serious design flaw that, when splitting on a ":", you cannot tell the difference between inputs of "" vs of ":". Aw, gee! Don’t people ever test this stuff? And again, the broken and fundamentally unreliable behavior is unfixable: you must never change things, even broken things. It’s not ok to break broken things in Java the way it is anywhere else. Broken is forever here.

  4. The backslash notation of regexes conflicts with the backslash notation used in strings. This makes it superduper awkward, and error-prone, too, because you have to constantly add lots of backslashes to everything, and it’s too easy to forget one and get neither warning nor success. Simple patterns like \b\w+\b become nightmares in typographical excess: "\\b\\w+\\b". Good luck with reading that. Some people use a slash-inverter function on their patterns so that they can write that as "/b/w+/b" instead. Other than reading in your patterns from a string, there is no way to construct your pattern in a WYSIWYG literal fashion; it’s always heavy-laden with backslashes. Did you get them all, and enough, and in the right places? If so, it makes it really really hard to read. If it isn’t, you probably haven’t gotten them all. At least JVM languages like Groovy have figured out the right answer here: give people 1st-class regexes so you don’t go nuts. Here’s a fair collection of Groovy regex examples showing how simple it can and should be.

  5. The (?x) mode is deeply flawed. It doesn’t take comments in the Java style of // COMMENT but rather in the shell style of # COMMENT. It doesn’t work with multiline strings. It doesn’t accept literals as literals, forcing the backslash problems listed above, which fundamentally compromises any attempt at lining things up, like having all comments begin on the same column. Because of the backslashes, you either make them begin on the same column in the source code string and screw them up if you print them out, or vice versa. So much for legibility!

  6. It is incredibly difficult (and indeed, fundamentally unfixably broken) to enter Unicode characters in a regex. There is no support for symbolically named characters like \N{QUOTATION MARK}, \N{LATIN SMALL LETTER E WITH GRAVE}, or \N{MATHEMATICAL BOLD CAPITAL C}. That means you’re stuck with unmaintainable magic numbers. And you cannot even enter them by code point, either. You cannot use \u0022 for the first one because the Java preprocessor makes that a syntax error. So then you move to \\u0022 instead, which works until you get to the next one, \\u00E8, which cannot be entered that way or it will break the CANON_EQ flag. And the last one is a pure nightmare: its code point is U+1D402, but Java does not support the full Unicode set using their code point numbers in regexes, forcing you to get out your calculator to figure out that that is \uD835\uDC02 or \\uD835\\uDC02 (but not \\uD835\uDC02), madly enough. But you cannot use those in character classes due to a design bug, making it impossible to match say, [\N{MATHEMATICAL BOLD CAPITAL A}-\N{MATHEMATICAL BOLD CAPITAL Z}] because the regex compiler screws up on the UTF-16. Again, this can never be fixed or it will change old programs. You cannot even get around the bug by using the normal workaround to Java’s Unicode-in-source-code troubles by compiling with javac -encoding UTF-8, because the stupid thing stores the strings as nasty UTF-16, which necessarily breaks them in character classes. OOPS!

  7. Many of the regex things we’ve come to rely on in other languages are missing from Java. There are no named groups, for example, nor even relatively-numbered ones. This makes constructing larger patterns out of smaller ones fundamentally error prone. There is a front-end library that allows you to have simple named groups, and indeed this will finally arrive in production JDK7. But even so there is no mechanism for what to do with more than one group by the same name. And you still don’t have relatively numbered buffers, either. We’re back to the Bad Old Days again, stuff that was solved aeons ago.

  8. There is no support for a linebreak sequence, which is one of the only two “Strongly Recommended” parts of the standard, which suggests that \R be used for such. This is awkward to emulate because of its variable-length nature and Java’s lack of support for graphemes.

  9. The character class escapes do not work on Java’s native character set! Yes, that’s right: routine stuff like \w and \s (or rather, "\\w" and "\\s") does not work on Unicode in Java! This is not the cool sort of retro. To make matters worse, Java’s \b (make that "\\b", which isn’t the same as "\b") does have some Unicode sensibility, although not what the standard says it must have. So for example a string like "élève" will never in Java match the pattern \b\w+\b, and not merely in entirety per Pattern.matches, but indeed at no point whatsoever as you might get from Matcher.find. This is just so screwed up as to beggar belief. They’ve broken the inherent connection between \w and \b, then misdefined them to boot!! It doesn’t even know what Unicode Alphabetic code points are. This is supremely broken, and they can never fix it because that would change the behavior of existing code, which is strictly forbidden in the Java Universe. The best you can do is create a rewrite library that acts as a front end before it gets to the compile phase; that way you can forcibly migrate your patterns from the 1960s into the 21st century of text processing.

  10. The only two Unicode properties supported are the General Categories and the Block properties. The general category properties only support the abbreviations like \p{Sk}, contrary to the standard’s Strong Recommendation to also allow \p{Modifier Symbol}, \p{Modifier_Symbol}, etc. You don’t even get the required aliases the standard says you should. That makes your code even more unreadable and unmaintainable. You will finally get support for the Script property in production JDK7, but that is still seriously short of the minimum set of 11 essential properties that the Standard says you must provide for even the minimal level of Unicode support.

  11. Some of the meagre properties that Java does provide are faux amis: they have the same names as official Unicode property names, but they do something altogether different. For example, Unicode requires that \p{alpha} be the same as \p{Alphabetic}, but Java makes it the archaic and no-longer-quaint 7-bit alphabetics only, which is more than 4 orders of magnitude too few. Whitespace is another flaw: if you use the Java version that masquerades as Unicode whitespace, your UTF-8 parsers will break because of their NO-BREAK SPACE code points, which Unicode normatively requires be deemed whitespace; Java ignores that requirement, and so breaks your parser.

  12. There is no support for graphemes, the way \X normally provides. That renders impossible innumerably many common tasks that you need and want to do with regexes. Not only are extended grapheme clusters out of your reach, because Java supports almost none of the Unicode properties, you cannot even approximate the old legacy grapheme clusters using the standard (?:\p{Grapheme_Base}\p{Grapheme_Extend}*). Not being able to work with graphemes makes even the simplest sorts of Unicode text processing impossible. For example, you cannot match a vowel irrespective of diacritic in Java. The way you do this in a language with grapheme support varies, but at the very least you should be able to throw the thing into NFD and match (?:(?=[aeiou])\X). In Java, you cannot do even that much: graphemes are beyond your reach. And that means Java cannot even handle its own native character set. It gives you Unicode and then makes it impossible to work with it.

  13. The convenience methods in the String class do not cache the compiled regex. In fact, there is no such thing as a compile-time pattern that gets syntax-checked at compile time, which is when syntax checking is supposed to occur. That means your program, which uses nothing but constant regexes fully understood at compile time, will bomb out with an exception in the middle of its run if you forget a little backslash here or there as one is wont to do due to the flaws previously discussed. Even Groovy gets this part right. Regexes are far too high-level a construct to be dealt with by Java’s unpleasant after-the-fact, bolted-on-the-side model, and they are far too important to routine text processing to be ignored. Java is much too low-level a language for this stuff, and it fails to provide the simple mechanics out of which you might yourself build what you need: you can’t get there from here.

  14. The String and Pattern classes are marked final in Java. That completely kills any possibility of using proper OO design to extend those classes. You can’t create a better version of a matches method by subclassing and replacement. Heck, you can’t even subclass! Final is not a solution; final is a death sentence from which there is no appeal.
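
Several of the points in the list above (full-string matches versus find, flags that need Pattern.compile, the lack of caching in the String convenience methods, and the split edge cases) can be tried out with a sketch like the following. All class, field, and method names here are made up for illustration; the printed split results are left for you to inspect on your own JVM.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class ConvenienceMethodsDemo {
    // Compiled once and reused, unlike String.matches, which recompiles on
    // every call. Flags go in here when no embedded (?x)-style form exists.
    static final Pattern HAS_B = Pattern.compile("b", Pattern.CASE_INSENSITIVE);

    // String.matches must match the entire input.
    static boolean wholeMatch(String input, String regex) {
        return input.matches(regex);
    }

    // Matcher.find is the "match anywhere" operation of most other languages.
    static boolean partialMatch(String input) {
        return HAS_B.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("abc", "b"));     // false
        System.out.println(wholeMatch("abc", ".*b.*")); // true: padded on both sides
        System.out.println(partialMatch("aBc"));        // true

        // Edge cases of split: print them rather than take anyone's word.
        System.out.println(Arrays.toString("".split(":")));
        System.out.println(Arrays.toString(":".split(":")));
        System.out.println(Arrays.toString("a:b".split(":")));
    }
}
```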

Finally, to show you just how brain-damaged Java’s regexes truly are, consider this multiline pattern, which shows many of the flaws already described:

   String rx =
"(?= ^ \\p{Lu} [_\\pL\\pM\\d\\-] + $)\n"
+ " # next is a big can't-have set \n"
+ "(?! ^ .* \n"
+ " (?: ^ \\d+ $ \n"
+ " | ^ \\p{Lu} - \\p{Lu} $ \n"
+ " | Invitrogen \n"
+ " | Clontech \n"
+ " | L-L-X-X # dashes ok \n"
+ " | Sarstedt \n"
+ " | Roche \n"
+ " | Beckman \n"
+ " | Bayer \n"
+ " ) # end alternatives \n"
+ " \\b # only on a word boundary \n"
+ ") # end negated lookahead \n"
;

Do you see how unnatural that is? You have to put literal newlines in your strings; you have to use non-Java comments; you cannot make anything line up because of the extra backslashes; you have to use definitions of things that don’t work right on Unicode. There are many more problems beyond that.

Not only are there no plans to fix almost any of these grievous flaws, it is indeed impossible to fix almost any of them at all, because doing so would change old programs. Even the normal tools of OO design are forbidden to you because it’s all locked down with the finality of a death sentence, and it cannot be fixed.

So Alireza Noori, if you feel Java’s clumsy regexes are too hosed for reliable and convenient regex processing ever to be possible in Java, I cannot gainsay you. Sorry, but that’s just the way it is.

“Fixed in the Next Release!”

Just because some things can never be fixed does not mean that nothing can ever be fixed. It just has to be done very carefully. Here are the things I know of which are already fixed in current JDK7 or proposed JDK8 builds:

  1. The Unicode Script property is now supported. You may use any of the equivalent forms \p{Script=Greek}, \p{sc=Greek}, \p{IsGreek}, or \p{Greek}. This is inherently superior to the old clunky block properties. It means you can do things like [\p{Latin}\p{Common}\p{Inherited}], which is quite important.

  2. The UTF-16 bug has a workaround. You may now specify any Unicode code point by its number using the \x{⋯} notation, such as \x{1D402}. This works even inside character classes, finally allowing [\x{1D400}-\x{1D419}] to work properly. You still must double backslash it though, and it only works in regexes, not strings in general as it really ought to.

  3. Named groups are now supported via the standard notation (?<NAME>⋯) to create it and \k<NAME> to backreference it. These still contribute to numeric group numbers, too. However, you cannot get at more than one of them in the same pattern, nor can you use them for recursion.

  4. A new Pattern compile flag, Pattern.UNICODE_CHARACTER_CLASSES and associated embeddable switch, (?U), will now swap around all the definitions of things like \w, \b, \p{alpha}, and \p{punct}, so that they now conform to the definitions of those things required by The Unicode Standard.

  5. The missing or misdefined binary properties \p{IsLowercase}, \p{IsUppercase}, and \p{IsAlphabetic} will now be supported, and these correspond to methods in the Character class. This is important because Unicode makes a significant and pervasive distinction between mere letters and cased or alphabetic code points. These key properties are among those 11 essential properties that are absolutely required for Level 1 compliance with UTS#18, “Unicode Regular Expressions”, without which you really cannot work with Unicode.
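
A sketch that exercises the items in the list above on a Java 7 or later runtime. Java7RegexDemo, full, and repeatedYear are hypothetical names; the script examples use only the Is-prefixed and script= forms.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Java7RegexDemo {
    // Whole-input match with explicit compile flags (0 means none).
    static boolean full(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // Named group plus named backreference: the same year twice.
    static String repeatedYear(String input) {
        Matcher m = Pattern.compile("(?<year>\\d{4})-\\k<year>").matcher(input);
        return m.matches() ? m.group("year") : null;
    }

    public static void main(String[] args) {
        String eleve = "\u00E9l\u00E8ve"; // the word from the list above, with accents
        // Unicode-aware \w and \b, via the flag or the embedded (?U) switch.
        System.out.println(full("\\b\\w+\\b", Pattern.UNICODE_CHARACTER_CLASSES, eleve)); // true
        System.out.println(full("(?U)\\b\\w+\\b", 0, eleve));                             // true

        // Supplementary code point by number, inside a character class.
        String boldC = new String(Character.toChars(0x1D402));
        System.out.println(full("[\\x{1D400}-\\x{1D419}]", 0, boldC)); // true

        // Script property.
        String greek = "\u03B1\u03B2\u03B3"; // alpha, beta, gamma
        System.out.println(full("\\p{IsGreek}+", 0, greek));     // true
        System.out.println(full("\\p{script=Greek}+", 0, greek)); // true

        System.out.println(repeatedYear("2011-2011")); // 2011
    }
}
```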

These enhancements and fixes are very important to finally have, and so I am glad, even excited, to have them.

But for industrial-strength, state-of-the-art regex and/or Unicode work, I will not be using Java. There’s just too much missing from Java’s still-patchy-after-20-years Unicode model to get real work done if you dare to use the character set that Java gives. And the bolted-on-the-side model never works, which is all Java regexes are. You have to start over from first principles, the way Groovy did.

Sure, it might work for very limited applications whose small customer base is limited to English-language monoglots in rural Iowa with no external interactions or any need for characters beyond what an old-style telegraph could send. But for how many projects is that really true? Fewer even than you think, it turns out.

It is for this reason that a certain (and obvious) multi-billion-dollar company just recently cancelled the international deployment of an important application. Java’s Unicode support, not just in regexes but throughout, proved to be too weak for the needed internationalization to be done reliably in Java. Because of this, they have been forced to scale back from their originally planned worldwide deployment to a merely U.S. deployment. It’s positively parochial. And no, they are Nᴏᴛ Hᴀᴘᴘʏ; would you be?

Java has had 20 years to get it right, and they demonstrably have not done so thus far, so I wouldn’t hold my breath. Or throw good money after bad; the lesson here is to ignore the hype and instead apply due diligence to make very sure that all the necessary infrastructure support is there before you invest too much. Otherwise you too may get stuck without any real options once you’re too far into it to rescue your project.

Caveat Emptor


