Translate Perl regular expressions to .NET

There is a big comparison table in http://www.regular-expressions.info/refflavors.html.


Most of the basic elements are the same; the differences are:

Minor differences:

  • Unicode escape sequences. In .NET it is \u200A, in Perl it is \x{200A}.
  • \v in .NET is just the vertical tab (U+000B); in Perl it stands for the "vertical whitespace" class. Perl accordingly also has \V, the negation of that class.
  • The conditional expression for named reference in .NET is (?(name)yes|no), but (?(<name>)yes|no) in Perl.

Some elements are Perl-only:

  • Possessive quantifiers (x?+, x*+, x++ etc). Use a non-backtracking subexpression (atomic group), (?>…), instead.
  • Named Unicode escape sequences \N{LATIN SMALL LETTER X}, \N{U+200A}.
  • Case folding and escaping

    • \l (lower case next char), \u (upper case next char).
    • \L (lower case), \U (upper case), \Q (quote meta characters) until \E.
  • Shorthand notation for Unicode property \pL and \PL. You have to include the braces in .NET e.g. \p{L}.
  • Odd things like \X, \C.
  • Special character classes like \v, \V, \h, \H, \N, \R
  • Backreference to a specific or previous group \g1, \g{-1}. You can only use an absolute group index in .NET, e.g. \1.
  • Named backreference \g{name}. Use \k<name> instead.
  • POSIX character class [[:alpha:]].
  • Branch-reset pattern (?|…)
  • \K. Use look-behind ((?<=…)) instead.
  • Code evaluation assertion (?{…}), postponed subexpression (??{…}).
  • Subexpression reference (recursive pattern) (?0), (?R), (?1), (?-1), (?+1), (?&name).
  • Some conditional-expression predicates are Perl-specific:

    • code (?{…})
    • recursive (R), (R1), (R&name)
    • define (DEFINE).
  • Special Backtracking Control Verbs (*VERB:ARG)
  • Python syntax

    • (?P<name>…). Use (?<name>…) instead.
    • (?P=name). Use \k<name> instead.
    • (?P>name). No equivalent in .NET.

Some elements are .NET only:

  • Variable length look-behind. In Perl, for positive look-behind, use \K instead.
  • Arbitrary regular expression in conditional expression (?(pattern)yes|no).
  • Character class subtraction [a-z-[d-w]].
  • Balancing Group (?<-name>…). This could be simulated with code evaluation assertion (?{…}) followed by a (?&name).
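
A few of the differences listed above are pure respellings, so they can be rewritten mechanically. The following is a minimal sketch of that idea, not a complete translator. It is written in JavaScript purely as a string-processing tool, and the function name perlToDotNet is hypothetical. It assumes the pattern does not hide these constructs inside a character class, and it leaves code points above U+FFFF and every Perl-only feature without a direct .NET spelling untouched.

```javascript
// Hypothetical helper: respells a handful of Perl constructs in .NET syntax.
// Every backslash escape is consumed as one token, so an escaped backslash
// followed by "x{...}" is not mistaken for a Unicode escape.
function perlToDotNet(pattern) {
  const token = new RegExp(
    [
      "\\\\x\\{([0-9A-Fa-f]{1,4})\\}",   // 1: \x{200A}          -> \u200A
      "\\\\([pP])([A-Za-z])",            // 2,3: \pL, \PL        -> \p{L}, \P{L}
      "\\\\g\\{([A-Za-z_]\\w*)\\}",      // 4: \g{name}          -> \k<name>
      "\\\\[\\s\\S]",                    // any other escape: left as-is
      "\\(\\?P<([A-Za-z_]\\w*)>",        // 5: (?P<name>         -> (?<name>
      "\\(\\?P=([A-Za-z_]\\w*)\\)",      // 6: (?P=name)         -> \k<name>
      "\\(\\?\\(<([A-Za-z_]\\w*)>\\)",   // 7: (?(<name>)        -> (?(name)
    ].join("|"),
    "g"
  );

  return pattern.replace(token, (m, hex, p, prop, gName, pName, pRef, cond) => {
    if (hex !== undefined) return "\\u" + hex.padStart(4, "0");
    if (prop !== undefined) return "\\" + p + "{" + prop + "}";
    if (gName !== undefined) return "\\k<" + gName + ">";
    if (pName !== undefined) return "(?<" + pName + ">";
    if (pRef !== undefined) return "\\k<" + pRef + ">";
    if (cond !== undefined) return "(?(" + cond + ")";
    return m; // untouched escape such as \d, \. or an escaped backslash
  });
}

console.log(perlToDotNet("(?P<word>\\pL+)\\x{200A}(?P=word)"));
```

Anything the sketch does not recognise passes through unchanged, so the result still needs to be checked against the Perl-only list by hand.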

References:

  • .NET Framework 4: Regular Expression Language Elements
  • perlre

.NET equivalent to Perl regular expressions

In Perl, you can think of the slashes as something like double-quotes with the added meaning of "between these slashes is a regex-string". The first block of code is a Perl find/replace regular expression:

$stringvar =~ s/findregex/replaceregex/;

It takes findregex and replaces it with replaceregex, in-place. The given example is a very simple search, so the .NET Regex class would be overkill. The String.Replace() method will do the job:

letter = letter.Replace("Users ", "")
letter = letter.Replace("Mailboxes ", "")

The second part is Perl for find only. It returns true if the findregex string is found and leaves the actual string itself untouched.

$stringvar =~ /findregex/;

String.Contains() can handle this in .NET:

If Not (storegroup.Contains("Recovery") _
    Or storegroup.Contains("Users U V W X Y Z") _
    Or storegroup.Contains("you get the idea")) Then
...
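
One caveat when swapping s/// for a plain string replace: Perl's s/// without the /g modifier changes only the first occurrence, whereas .NET's String.Replace() changes every occurrence. The sketch below illustrates the two behaviours side by side. It uses JavaScript only because a first-only and an every-occurrence replace are both easy to show there, and the sample strings are made up for illustration.

```javascript
// Hypothetical sample value; the original post does not show the data.
const letter = "Users A B Users C";

// First occurrence only, like Perl's s/Users // without the /g modifier.
const firstOnly = letter.replace("Users ", "");

// Every occurrence, like .NET's String.Replace("Users ", "").
const everyOccurrence = letter.split("Users ").join("");

// Plain substring test, like /Recovery/ in Perl or String.Contains() in .NET.
const storegroup = "Recovery Storage Group";
const excluded = ["Recovery", "Users U V W X Y Z"].some((s) => storegroup.includes(s));

console.log(firstOnly);       // A B Users C
console.log(everyOccurrence); // A B C
console.log(excluded);        // true
```

If the text can contain the search string more than once, the two translations are therefore not interchangeable.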

How could I translate regular expressions in Javascript syntax to .NET syntax

Although it is commercial (i.e. non-free, but cheap), I could not recommend "RegexBuddy" http://www.regexbuddy.com/ highly enough.

Using a "standard" regex syntax (which you can interactively build and test), it will then generate the source code in the correct syntax for use in several environments and many "scenarios", including .NET, JavaScript, Perl, PHP, Python etc.

With my lacklustre knowledge of Regex, this program is a lifesaver.

* disclaimer: No affiliation whatsoever - just a very happy multi-year customer

** Extra note -- I just noticed that Jeff Atwood has a testimonial on their homepage!

  • Just for fun: here is the RFC 2822 email verification source generated by RegexBuddy for both .NET (C#) and JavaScript

JavaScript:

if (/(?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])/im.test(subject)) {
// Successful match
} else {
// Match attempt failed
}

.NET (C#):

try {
if (Regex.IsMatch(subjectString, @"(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|""(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*"")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])", RegexOptions.IgnoreCase | RegexOptions.Multiline)) {
// Successful match
} else {
// Match attempt failed
}
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}

Regular expression to match any vertical whitespace

As you say, the Perl character class \v matches [\x0A-\x0D] (line feed, vertical tab, form feed and carriage return, although I would dispute that CR is vertical whitespace) in addition to the non-ASCII code points [\x{2028}\x{2029}] (line separator and paragraph separator). It also matches U+0085 (next line).

You can hand-build this character class in .NET like this (note that \u takes exactly four hex digits):

[\u000A-\u000D\u0085\u2028\u2029]
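
To sanity-check a hand-built class, it helps to run it over the code points in question. The sketch below does that in JavaScript, which spells these escapes as \uXXXX with four hex digits just as .NET does. It is an illustration rather than .NET code, and U+0085 (next line) is included on the assumption that you want to mirror Perl's \v fully.

```javascript
// Hand-built "vertical whitespace" class, written with four-digit \u escapes.
const verticalWhitespace = /[\u000A-\u000D\u0085\u2028\u2029]/;

// Code points to probe: the seven expected matches, then tab and space,
// which are horizontal whitespace and should not match.
const samples = [0x0a, 0x0b, 0x0c, 0x0d, 0x85, 0x2028, 0x2029, 0x09, 0x20];

for (const cp of samples) {
  const label = "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
  console.log(label, verticalWhitespace.test(String.fromCharCode(cp)));
}
```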

Regular expressions - C# behaves differently than Perl / Python

In your example the difference seems to be in the semantics of the 'replace' function rather than in the regular expression processing itself.

.NET is doing a "global" replace, i.e. it is replacing all matches rather than just the first match.

Global Replace in Perl

(notice the small 'g' at the end of the =~s line)

$a="This is a test";
$a=~s/(.*)/George/g;
print $a;

which produces

GeorgeGeorge

Single Replace in .NET

var re = new Regex("(.*)");
var replacePattern = "George";
var newValue = re.Replace("This is nice", replacePattern, 1) ;
Console.WriteLine(newValue);

which produces

George

since it stops after the first replacement.
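
The same first-match versus all-matches distinction can be reproduced within one language, which also shows where the doubled output comes from. A sketch in JavaScript, where the g flag plays the role of Perl's /g (the variable names are illustrative):

```javascript
const input = "This is a test";

// Without the g flag only the first match is replaced.
const replacedFirst = input.replace(/(.*)/, "George");

// With the g flag every match is replaced, as with Perl's s///g.
const replacedAll = input.replace(/(.*)/g, "George");

// (.*) matches twice: once for the whole string, then once more as an
// empty match at the very end, which is why "George" appears twice.
const matches = Array.from(input.matchAll(/(.*)/g), (m) => m[0]);

console.log(replacedFirst); // George
console.log(replacedAll);   // GeorgeGeorge
console.log(matches);       // [ 'This is a test', '' ]
```

Anchoring the pattern, for example as ^(.*)$, leaves only one possible match, so a global replace then produces a single "George".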


