When not to use Regex in C# (or Java, C++, etc.)
Don't try to use regex to parse hierarchical text such as program source (or nested XML). Regular expressions in the formal sense are provably not powerful enough for that: for example, given a string of parentheses, they can't determine whether the parentheses are balanced.
Use parser generators (or similar technologies) for that.
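To make the balanced-parentheses point concrete, here is a minimal sketch of the kind of check that needs a counter rather than a pattern. The class and method names are made up for illustration, and it only handles round parentheses:

```java
final class BalancedParens {
    // Returns true when every '(' has a matching ')' and no ')' closes before its '('.
    static boolean isBalanced(String s) {
        int depth = 0; // number of currently open parentheses
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth < 0) {
                    return false; // a ')' with nothing open
                }
            }
        }
        return depth == 0; // anything still open means unbalanced
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("(()())")); // true
        System.out.println(isBalanced("(()"));    // false
        System.out.println(isBalanced(")("));     // false
    }
}
```

The counter is exactly the unbounded memory that a plain regular expression lacks. Real source code or XML needs a stack and a grammar on top of this, which is where a parser generator comes in.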
Also, I wouldn't recommend using regex to validate data with strict formal standards, like e-mail addresses.
They're harder than you'd expect, and you'll end up with either an inaccurate regex or a very long one.
Are Java and C# regular expressions compatible?
There are quite a lot of differences.
Character Class
- Character classes subtraction [abc-[cde]]
  - .NET YES (2.0)
  - Java: Emulated via character class intersection and negation: [abc&&[^cde]]
- Character classes intersection [abc&&[cde]]
  - .NET: Emulated via character class subtraction and negation: [abc-[^cde]]
  - Java YES
- \p{Alpha} POSIX character class
  - .NET NO
  - Java YES (US-ASCII)
- Under (?x) mode (COMMENTS/IgnorePatternWhitespace), space (U+0020) in character class is significant.
  - .NET YES
  - Java NO
- Unicode Category (L, M, N, P, S, Z, C)
  - .NET YES: \p{L} form only
  - Java YES:
    - From Java 5: \pL, \p{L}, \p{IsL}
    - From Java 7: \p{general_category=L}, \p{gc=L}
- Unicode Category (Lu, Ll, Lt, ...)
  - .NET YES: \p{Lu} form only
  - Java YES:
    - From Java 5: \p{Lu}, \p{IsLu}
    - From Java 7: \p{general_category=Lu}, \p{gc=Lu}
- Unicode Block
  - .NET YES: \p{IsBasicLatin} only. (Supported Named Blocks)
  - Java YES: (name of the block is free-casing)
    - From Java 5: \p{InBasicLatin}
    - From Java 7: \p{block=BasicLatin}, \p{blk=BasicLatin}
- Spaces and underscores allowed in all long block names (e.g. BasicLatin can be written as Basic_Latin or Basic Latin)
  - .NET NO
  - Java YES (Java 5)
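A quick way to see the Java side of the character class entries is to run a few of them. This is only a sketch: the class name and sample strings are mine, and it covers just the Java spellings listed above.

```java
import java.util.regex.Pattern;

final class CharClassDemo {
    // Java spelling of "a, b or c, except c, d or e"; .NET would write [abc-[cde]].
    static final Pattern SUBTRACT = Pattern.compile("[abc&&[^cde]]");
    // Intersection: only 'c' is in both sets.
    static final Pattern INTERSECT = Pattern.compile("[abc&&[cde]]");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(SUBTRACT, "a"));  // true
        System.out.println(matches(SUBTRACT, "c"));  // false
        System.out.println(matches(INTERSECT, "c")); // true
        System.out.println(matches(INTERSECT, "a")); // false

        String eAcute = "\u00e9"; // a non-ASCII lowercase letter
        // \p{Alpha} is US-ASCII only by default; \p{L} is any Unicode letter.
        System.out.println(eAcute.matches("\\p{Alpha}")); // false
        System.out.println(eAcute.matches("\\p{L}"));     // true

        // Subcategory and block syntax available since Java 5.
        System.out.println("A".matches("\\p{IsLu}"));         // true
        System.out.println("A".matches("\\p{InBasicLatin}")); // true
    }
}
```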
Quantifier
- ?+, *+, ++ and {m,n}+ (possessive quantifiers)
  - .NET NO
  - Java YES
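If possessive quantifiers are new to you, the difference from the ordinary greedy ones shows up as soon as the rest of the pattern needs a character back. A small sketch (class and method names are mine):

```java
final class PossessiveDemo {
    // Greedy: a* first takes every 'a', then backtracks one so the final 'a' can match.
    static boolean greedy(String s) {
        return s.matches("a*a");
    }

    // Possessive: a*+ takes every 'a' and never gives one back, so the final 'a' cannot match.
    static boolean possessive(String s) {
        return s.matches("a*+a");
    }

    public static void main(String[] args) {
        System.out.println(greedy("aaa"));     // true
        System.out.println(possessive("aaa")); // false
    }
}
```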
Quotation
- \Q...\E escapes a string of metacharacters
  - .NET NO
  - Java YES
- \Q...\E escapes a string of character class metacharacters (in character sets)
  - .NET NO
  - Java YES
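On the Java side, quoting looks roughly like this. The sample strings are made up; Pattern.quote is the standard library helper for quoting a whole literal for you.

```java
import java.util.regex.Pattern;

final class QuoteDemo {
    static final String LITERAL = "1+1=2";

    // Unquoted, "1+" means "one or more 1s", so the literal text does not match itself.
    static boolean unquoted() {
        return Pattern.matches(LITERAL, LITERAL);
    }

    // \Q...\E turns everything in between into literal characters.
    static boolean quoted() {
        return Pattern.matches("\\Q" + LITERAL + "\\E", LITERAL);
    }

    // Pattern.quote builds the quoted form without writing \Q and \E by hand.
    static boolean quotedByHelper() {
        return Pattern.matches(Pattern.quote(LITERAL), LITERAL);
    }

    // Inside a character class, \Q...\E makes '^', '-' and ']' plain members of the set.
    static boolean inCharacterClass(String s) {
        return s.matches("[\\Q^-]\\E]+");
    }

    public static void main(String[] args) {
        System.out.println(unquoted());              // false
        System.out.println(quoted());                // true
        System.out.println(quotedByHelper());        // true
        System.out.println(inCharacterClass("^-]")); // true
    }
}
```

In .NET the nearest equivalent is escaping the literal up front with Regex.Escape, since the pattern syntax itself has no \Q...\E.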
Matching construct
- Conditional matching (?(?=regex)then|else), (?(regex)then|else), (?(1)then|else) or (?(group)then|else)
  - .NET YES
  - Java NO
- Named capturing group and named backreference
  - .NET YES:
    - Capturing group: (?<name>regex) or (?'name'regex)
    - Backreference: \k<name> or \k'name'
  - Java YES (Java 7):
    - Capturing group: (?<name>regex)
    - Backreference: \k<name>
- Multiple capturing groups can have the same name
  - .NET YES
  - Java NO (Java 7)
- Balancing group definition (?<name1-name2>regex) or (?'name1-name2'subexpression)
  - .NET YES
  - Java NO
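Named groups and named backreferences are the one construct in this part of the list that both flavors share in the same (?<name>...) and \k<name> spelling. A minimal Java 7+ sketch, with a made-up group name and sample text:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedGroupDemo {
    // (?<word>...) names the group; \k<word> must then match the same text again.
    static final Pattern DOUBLED = Pattern.compile("\\b(?<word>\\w+) \\k<word>\\b");

    // Returns the first word that appears twice in a row, or null if there is none.
    static String firstDoubledWord(String text) {
        Matcher m = DOUBLED.matcher(text);
        return m.find() ? m.group("word") : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDoubledWord("it is is fine")); // is
        System.out.println(firstDoubledWord("it is fine"));    // null
    }
}
```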
Assertions
- (?<=text) (positive lookbehind)
  - .NET Variable-width
  - Java Obvious width
- (?<!text) (negative lookbehind)
  - .NET Variable-width
  - Java Obvious width
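"Obvious width" on the Java side means, roughly, that the engine has to be able to work out a bounded length for the lookbehind. Fixed text and bounded repetition such as {1,3} are fine. The sketch below sticks to those two cases; the names and sample strings are mine.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookbehindDemo {
    // Fixed-width lookbehind: digits that directly follow a dollar sign.
    static final Pattern AFTER_DOLLAR = Pattern.compile("(?<=\\$)\\d+");
    // Bounded-width lookbehind: "USD" followed by one to three whitespace characters.
    static final Pattern AFTER_USD = Pattern.compile("(?<=USD\\s{1,3})\\d+");

    // Returns the first match of the pattern in the text, or null if there is none.
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(AFTER_DOLLAR, "price: $42, qty 7")); // 42
        System.out.println(firstMatch(AFTER_USD, "total USD  15"));        // 15
    }
}
```

In .NET the same lookbehinds could also contain unbounded quantifiers such as \s+, which is what "variable-width" refers to.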
Mode Options/Flags
- ExplicitCapture option (?n)
  - .NET YES
  - Java NO
Miscellaneous
- (?#comment) inline comments
  - .NET YES
  - Java NO
References
- regular-expressions.info - Comparison of Different Regex Flavors
- MSDN Library Reference - .NET Framework 4.5 - Regular Expression Language
- Pattern (Java Platform SE 7)
Should I avoid regular expressions?
Don't avoid them. They're an excellent tool, and when used appropriately can save you a lot of time and effort. Moreover, a good implementation used carefully should not be particularly CPU-intensive.
Are Regular Expressions a must for programming?
One could easily go without them but one should (IMHO) know the basics, for 2 reasons.
1) There may come a time where RegEx is the best solution to the problem at hand (see image below)
2) When you see a Regex in someone else's code it shouldn't be 100% mystical.
preg_match('/summarycount">.*?([,\d]+)<\/div>.*?Reputation/s', $page, $rep);
This code is simple enough, but if you don't know RegEx then the stuff that's in the first parameter may as well be a Martian language. The RegEx that's used here is actually pretty simple once you learn the basics, and to get you that far, head over to http://www.regular-expressions.info/. They have a lot of info about RegEx and its various implementations on the different platforms/languages, and they also have a great tutorial to get started with. After that, check out RegexBuddy. It can help you build RegExes, and if you watch what it does while you build them, it can help you learn. It was by far the best $39.95 I've ever spent.
Original Comic
Regex Captures in Java like in C#
As I mentioned in the comments, when a capturing group matches several times, Java only returns the value from its last iteration. So you should first use regex to isolate the last part of your string with the values:
String strg = "0.478\t0.209\t0.211\t0.211\t0.205\t-0.462\t0.203\t0.202\t0.212";
and then just split around the tabs:
String[] values = strg.split("\\t");
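Put together, the two steps might look like the sketch below. The line layout (a label, then tab-separated decimals) and all names are assumptions for illustration, since the original input format isn't shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatedGroupDemo {
    // A repeated group: Java keeps only the value from its last iteration.
    static final Pattern REPEATED = Pattern.compile("(?:(-?\\d+\\.\\d+)\\t?)+");
    // Assumed layout: a label, a tab, then the tab-separated values captured as one block.
    static final Pattern LINE = Pattern.compile("^\\w+\\t((?:-?\\d+\\.\\d+\\t?)+)$");

    // Returns what group 1 holds after the repeated group has matched the whole input.
    static String lastIterationOnly(String values) {
        Matcher m = REPEATED.matcher(values);
        return m.matches() ? m.group(1) : null;
    }

    // Isolates the value block with the regex, then splits it around the tabs.
    static String[] values(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? m.group(1).split("\\t") : new String[0];
    }

    public static void main(String[] args) {
        System.out.println(lastIterationOnly("0.478\t0.209\t0.211"));       // 0.211
        System.out.println(values("sample1\t0.478\t0.209\t-0.462").length); // 3
    }
}
```

In C#, the earlier iterations would still be reachable through the group's Captures collection, which is the behavior this workaround stands in for.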
Capture a part of a string that does not match another group (C# Regex)
I think trying to parse and validate the entire text in one regular expression is likely to give you problems. The text you are parsing is not a regular language, so regular expressions are not well designed for this purpose.
Instead I would recommend that you first tokenize the input to single tags and text between the tags. You can use a simple regular expression to find single tags - this is a much simpler problem that regular expressions can handle quite well. Once you have tokenized it, you can iterate over the tokens with an ordinary loop and apply formatting to the text as appropriate.
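As a rough sketch of that approach in Java: the tag syntax below (<b>, </b> and so on) is only an assumption, since the actual input format isn't shown here, and the loop just checks nesting where real code would apply formatting.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagTokenizer {
    // One token is a single tag, a run of text without '<', or a stray '<'.
    private static final Pattern TOKEN = Pattern.compile("</?\\w+>|[^<]+|<");

    // Step 1: the regex only has to recognize one tag at a time.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Step 2: an ordinary loop with a stack handles the nesting the regex cannot.
    static boolean isWellNested(List<String> tokens) {
        Deque<String> open = new ArrayDeque<>();
        for (String t : tokens) {
            if (t.matches("<\\w+>")) {
                open.push(t.substring(1, t.length() - 1)); // remember the tag name
            } else if (t.matches("</\\w+>")) {
                String name = t.substring(2, t.length() - 1);
                if (open.isEmpty() || !open.pop().equals(name)) {
                    return false; // closing tag does not match the innermost open tag
                }
            }
            // Text tokens would be formatted here according to the tags currently open.
        }
        return open.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a <b>bold</b> c"));               // [a , <b>, bold, </b>,  c]
        System.out.println(isWellNested(tokenize("<b>x<i>y</i></b>"))); // true
        System.out.println(isWellNested(tokenize("<b>x<i>y</b></i>"))); // false
    }
}
```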
Are regexes really maintainable?
If regexes are long and impenetrable, making them hard to maintain, then they should be commented.
A lot of regex implementations allow you to pad regexes with whitespace and comments.
See https://www.regular-expressions.info/freespacing.html#parenscomment
and Coding Horror: Regular Expressions: Now You Have Two Problems
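In Java, for instance, the free-spacing mode is switched on with the Pattern.COMMENTS flag. A small sketch (the date pattern is just an illustration, not a full validator):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommentedRegexDemo {
    // With Pattern.COMMENTS, whitespace is ignored and '#' starts a comment up to the end of the line.
    static final Pattern ISO_DATE = Pattern.compile(
            "(\\d{4})  # year\n"
          + "-         # separator\n"
          + "(\\d{2})  # month\n"
          + "-         # separator\n"
          + "(\\d{2})  # day",
            Pattern.COMMENTS);

    // Returns the month part of a yyyy-mm-dd string, or null if the input has another shape.
    static String month(String date) {
        Matcher m = ISO_DATE.matcher(date);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(month("2024-05-17")); // 05
        System.out.println(month("17/05/2024")); // null
    }
}
```

.NET offers the same idea through RegexOptions.IgnorePatternWhitespace.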
"Any code I've seen that uses Regexes tends to use them as a black box:"
If by black box you mean abstraction, that's what all programming is, trying to abstract away the difficult part (parsing strings) so that you can concentrate on the problem domain (what kind of strings do I want to match).
"even a small change can often result in a completely different regex."
That's true of any code. As long as you are testing your regex to make sure it matches the strings you expect, ideally with unit tests, then you should be confident in changing them.
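Such a test doesn't need much. Here is a framework-free sketch that lists the strings a pattern must accept and must reject; the order ID format is hypothetical, and in a real project the same lists would sit in a JUnit test or similar.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class RegexChecks {
    // Hypothetical format, used only for illustration: two capitals, a dash, four digits.
    static final Pattern ORDER_ID = Pattern.compile("[A-Z]{2}-\\d{4}");

    // Returns a description of every expectation the pattern fails; empty means all good.
    static List<String> failures(Pattern p, List<String> accept, List<String> reject) {
        List<String> failed = new ArrayList<>();
        for (String s : accept) {
            if (!p.matcher(s).matches()) {
                failed.add("should match: " + s);
            }
        }
        for (String s : reject) {
            if (p.matcher(s).matches()) {
                failed.add("should not match: " + s);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<String> accept = Arrays.asList("AB-1234", "ZZ-0000");
        List<String> reject = Arrays.asList("ab-1234", "AB-123", "AB-12345", "AB1234");
        System.out.println(failures(ORDER_ID, accept, reject)); // []
    }
}
```

When the regex changes, rerunning the lists shows immediately which expected strings the new version breaks.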
Edit: please also read Jeff's comment to this answer about production code.