.NET Regex: What Is the Word Character \w

.Net regex: what is the word character \w?

From the documentation:

Word Character: \w

\w matches any word character. A word character is a member of any of the Unicode categories listed in the following table.

  • Ll (Letter, Lowercase)
  • Lu (Letter, Uppercase)
  • Lt (Letter, Titlecase)
  • Lo (Letter, Other)
  • Lm (Letter, Modifier)
  • Mn (Mark, Nonspacing)
  • Nd (Number, Decimal Digit)
  • Pc (Punctuation, Connector)
    • This category includes ten characters, the most commonly used of which is the LOWLINE character (_), U+005F.

If ECMAScript-compliant behavior is specified, \w is equivalent to [a-zA-Z_0-9].
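A minimal sketch of the difference between the default behavior and RegexOptions.ECMAScript. The sample characters are picked for illustration only and are not taken from the documentation.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample characters: ASCII letter, e-circumflex, Arabic-Indic digit three (Nd),
// underscore, undertie (another Pc character), and a hyphen.
string[] samples = { "a", "\u00EA", "\u0663", "_", "\u203F", "-" };

foreach (string s in samples)
{
    bool unicodeMatch = Regex.IsMatch(s, @"^\w$");
    bool ecmaMatch = Regex.IsMatch(s, @"^\w$", RegexOptions.ECMAScript);
    Console.WriteLine($"U+{(int)s[0]:X4}  default: {unicodeMatch}  ECMAScript: {ecmaMatch}");
}
```

With the default options, everything except the hyphen should match. With RegexOptions.ECMAScript, only the ASCII letter and the underscore should.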

See also

  • Unicode Character Database
  • Unicode Characters in the 'Punctuation, Connector' Category

Regex \w matches ê

In .NET, as well as in XML Schema, Python 3 (but not Python 2), and ICU (Android, R stringr / stringi functions), \w is Unicode-aware by default.

It is not Unicode-aware by default in PCRE and Java, but you can turn it on with the right flag: /u in PCRE (PHP) and (?U) / Pattern.UNICODE_CHARACTER_CLASS in Java.

See the Shorthand Character Classes reference:

\w stands for “word character”. It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits. In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren’t digits may or may not be included. XML Schema and XPath even include all symbols in \w. Again, Java, JavaScript, and PCRE match only ASCII characters with \w.

The Unicode-aware \w meanings:

  • c# - [\p{L}\p{Nd}\p{Mn}\p{Pc}] (source)
  • python - [\p{L}\p{Mn}\p{Nd}_] (source) (Note: this is an approximate pattern that can only be used with the PyPI regex module, since re does not support Unicode property classes, so it helps that \w is Unicode-aware in Python 3)
  • android - [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}] (source)
  • icu - [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\u200c\u200d] (source)
  • xsd - [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] (source)
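For the C# entry in the list above, a quick sanity check is to compare \w against the spelled-out class on a few probe characters. This is only a sketch: the probes are hand-picked assumptions, and agreement on six characters is not a proof of equivalence.

```csharp
using System;
using System.Text.RegularExpressions;

var shorthand = new Regex(@"^\w$");
var spelledOut = new Regex(@"^[\p{L}\p{Nd}\p{Mn}\p{Pc}]$");

// Letter, decimal digit, nonspacing mark (Mn), connector punctuation,
// then two characters expected to fall outside \w in .NET:
// a spacing mark (Mc, U+093F) and a Roman numeral (Nl, U+2167).
char[] probes = { '\u00E9', '7', '\u0301', '_', '\u093F', '\u2167' };

foreach (char c in probes)
{
    string s = c.ToString();
    Console.WriteLine($"U+{(int)c:X4}  \\w: {shorthand.IsMatch(s)}  explicit: {spelledOut.IsMatch(s)}");
}
```

Both columns should agree on every row, with the last two probes rejected by both patterns.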

When \w is made Unicode-aware:

  • pcre - (With /u in PHP or (*UCP) / (*UTF)(*UCP)) - [\p{L}\p{N}_] ("\w any character that matches \p{L} or \p{N}, plus underscore")
  • java - (With (?U) or Pattern.UNICODE_CHARACTER_CLASS) - [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}] (same as Android, source)
  • perl - (when the string is treated as Unicode, see Does \w match all alphanumeric characters defined in the Unicode standard?) - [\p{Alphabetic}\p{GC=Mark}\p{GC=Connector_Punctuation}\p{GC=Decimal_Number}]

In JavaScript, there is no way to make \w itself Unicode-aware, so use [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}] instead (the \p{...} escapes require the u flag).

Regex doesn't match all foreign characters

It is not enough to use \p{L} to match words; you also need to match diacritics. That can be done by adding \p{M} to your regex. Note that even the \w shorthand "word" character class in .NET regex matches one set of diacritics by default, \p{Mn} (the Mark, Nonspacing Unicode category), see this .NET regex reference. However, here you need \p{M} to allow any diacritics.

Note that | inside a character class matches a literal | char, so you need to remove the | from your pattern.

It seems to me you can use

@"^[\p{L}\p{M}0-9_-]+$"

It will match any string of one or more letters, diacritics, ASCII digits, _ or - chars.

See the regex demo.

Note that in case you want to allow any Unicode digit chars, you may even use

@"^[\w\p{M}-]+$"

See another demo
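To see why \p{M} matters, here is a small sketch using a Devanagari word. The sample is an assumption chosen for illustration: it contains spacing marks (category Mc), which \w alone does not cover in .NET.

```csharp
using System;
using System.Text.RegularExpressions;

// The Hindi word for "Hindi", written with escapes.
// U+093F and U+0940 are spacing marks (Mc); U+094D is a nonspacing mark (Mn).
string hindi = "\u0939\u093F\u0928\u094D\u0926\u0940";

Console.WriteLine(Regex.IsMatch(hindi, @"^\w+$"));                 // False: \w has no Mc
Console.WriteLine(Regex.IsMatch(hindi, @"^[\p{L}\p{M}0-9_-]+$"));  // True
Console.WriteLine(Regex.IsMatch(hindi, @"^[\w\p{M}-]+$"));         // True
```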

.Net Regular Expression matching the string C#

The \b does not match between the pound sign and a space because both are non-word characters, but it does match between the pound sign and the d char.

Instead of a second word boundary \b, you could assert that what is on the right is not a non-whitespace character \S, using a negative lookahead (?!...):

\bC#(?!\S)

Regex demo

As pointed out in the comments by @elgonzo, to prevent breaking the match when a non-word char follows C#, you could use a positive lookahead to assert that what is on the right is either a non-word char \W or the end of the string $:

\bC#(?=\W|$)

Regex demo
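The three variants can be compared side by side. The inputs below are made up for illustration and are not from the original question.

```csharp
using System;
using System.Text.RegularExpressions;

string[] inputs = { "I like C# a lot", "C#, Java", "C#d" };
string[] patterns = { @"\bC#\b", @"\bC#(?!\S)", @"\bC#(?=\W|$)" };

foreach (string input in inputs)
{
    foreach (string pattern in patterns)
    {
        // \bC#\b only succeeds when a word char follows the #
        Console.WriteLine($"{pattern,-14} on \"{input}\": {Regex.IsMatch(input, pattern)}");
    }
}
```

Only the last pattern should accept "C#, Java", since the comma is a non-whitespace character but also a non-word character.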

Why \b does not match word using .net regex

The C# language and .NET regular expressions each have their own set of backslash escape sequences. The C# compiler intercepts the "\b" in your string literal and converts it into an ASCII backspace character, so the Regex class never sees the backslash. Either make the string verbatim (prefix it with an at-sign) or escape the backslash itself so that it is passed to Regex, like so:

@"\bCOMPILATION UNIT";

Or

"\\bCOMPILATION UNIT"

I'll say the .NET RegEx documentation does not make this clear. It took me a while to figure this out at first too.

Fun fact: the \r and \n characters (carriage return and line feed respectively) and some others are recognized by both Regex and the C# language, so the end result is the same even though the string passed to the regex engine differs.
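A short sketch that makes the difference visible; the variable names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "COMPILATION UNIT";

string regular = "\bCOMPILATION UNIT";    // C# turns \b into U+0008 (backspace)
string verbatim = @"\bCOMPILATION UNIT";  // backslash and b reach the regex engine
string escaped = "\\bCOMPILATION UNIT";   // same pattern text as the verbatim form

Console.WriteLine((int)regular[0]);                 // 8
Console.WriteLine(verbatim == escaped);             // True
Console.WriteLine(Regex.IsMatch(input, regular));   // False: the input has no backspace char
Console.WriteLine(Regex.IsMatch(input, verbatim));  // True
```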

C# Regular expression to squeeze word where every character is separated by a space

I suggest matching chunks of single word chars separated by single whitespaces and then removing the spaces inside each match with a match evaluator.

The regex is

(?<!\S)\w(?:\s\w){2,}(?!\S)

See its demo at RegexStorm. The (?<!\S) and (?!\S) make sure these chunks are enclosed with whitespaces (or are at string start/end).

Details:

  • (?<!\S) - a negative lookbehind making sure there is a whitespace or start of string immediately before the current location
  • \w - a word char (letter/digit/underscore, to match a letter, use \p{L} instead)
  • (?:\s\w){2,} - 2 or more sequences of:
    • \s - a whitespace
    • \w - a word char
  • (?!\S) - a negative lookahead making sure there is a whitespace or end of string immediately after the current location

See the C# demo:

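// Requires: using System.Linq; using System.Text.RegularExpressions;
var res = Regex.Replace(s, @"(?<!\S)\w(?:\s\w){2,}(?!\S)", m =>
    // Drop every whitespace char inside the matched chunk
    new string(m.Value
        .Where(c => !char.IsWhiteSpace(c))
        .ToArray()));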
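A self-contained version of the same idea, wrapped in a helper so it can be run directly. Squeeze is a hypothetical name and the sample inputs are assumptions.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Collapse runs of three or more single word chars separated by single whitespaces.
static string Squeeze(string s) =>
    Regex.Replace(s, @"(?<!\S)\w(?:\s\w){2,}(?!\S)",
        m => new string(m.Value.Where(c => !char.IsWhiteSpace(c)).ToArray()));

Console.WriteLine(Squeeze("T h i s is a test"));   // This is a test
Console.WriteLine(Squeeze("a b stays as is"));     // a b stays as is
```

Because of the {2,} quantifier, a pair such as "a b" is left alone. Only runs of at least three spaced-out characters are squeezed.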

.NET RegEx for letters and spaces

If you just need English, try this regex:

"^[A-Za-z ]+$"

The brackets specify a set of characters

A-Z: All capital letters

a-z: All lowercase letters

' ': Spaces

If you need unicode / internationalization, you can try this regex:

@"^[\p{L}\s]+$"

See https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#word-character-w

This regex will match all unicode letters and spaces, which may be more than you need, so if you just need English / basic Roman letters, the first regex will be simpler and faster to execute.

Note that both regexes include the ^ and $ anchors, which match at the start and end of the string. If you need to pull matches out of a longer string rather than validate the entire string, you can remove those two anchors.
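A small comparison of the two approaches, assuming the Unicode pattern is written as @"^[\p{L}\s]+$" in a verbatim string. The sample strings are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

var asciiOnly = new Regex("^[A-Za-z ]+$");
var anyLetter = new Regex(@"^[\p{L}\s]+$");

// The second sample uses precomposed e-acute (U+00E9) and o-diaeresis (U+00F6).
string[] samples = { "Hello World", "H\u00E9llo w\u00F6rld", "Hello1" };

foreach (string s in samples)
{
    Console.WriteLine($"{s}: ASCII={asciiOnly.IsMatch(s)}, Unicode={anyLetter.IsMatch(s)}");
}
```

The accented sample should pass only the \p{L} version, and the sample with a digit should fail both.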

RegEx get words with special character

You can use the following regex:

@"_[^\W_]+"

The [^\W_] negated character class will match any character other than a non-word character (so, it will match all \ws) except _.

See the regex demo

A more .NET-ish regex will be an expression with character class subtraction:

_[\w-[_]]+

See another demo

Here, with [\w-[_]], we match all \ws with the exception of _.

Use the first suggestion if you need a more portable solution, and the second one if you only plan to use the regex in a .NET environment.
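Both patterns can be checked against the same input; the sample string is made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "_abc __def _gh1 xyz";

// Portable form: negated class excluding non-word chars and the underscore
var portable = Regex.Matches(input, @"_[^\W_]+").Cast<Match>().Select(m => m.Value);
// .NET-only form: character class subtraction
var subtraction = Regex.Matches(input, @"_[\w-[_]]+").Cast<Match>().Select(m => m.Value);

Console.WriteLine(string.Join(", ", portable));      // _abc, _def, _gh1
Console.WriteLine(string.Join(", ", subtraction));   // _abc, _def, _gh1
```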

Get the middle part of a filename using regex

Using replace with the alternation removes either of the alternatives from the start and the end of the string, but it also works when the extension is not present, and it does not take the number of chars in the middle into account.

If the file extension should be present you might use a capturing group and make msl_ optional at the beginning.

Then match a word character other than _ 1 to 10 times, followed by optional word characters up to the dot:

^(?:msl_)?([^\W_]{1,10})\w*\.[^\W_]{2,}$

.NET regex demo (Click on the table tab)
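A quick way to try the stricter pattern above. The third file name is a hypothetical example added to show a case it rejects.

```csharp
using System;
using System.Text.RegularExpressions;

string strict = @"^(?:msl_)?([^\W_]{1,10})\w*\.[^\W_]{2,}$";
string[] names = { "msl_0123456789_otherstuff.csv", "msl_test.xml", "msl_ab-cd.csv" };

foreach (string name in names)
{
    Match m = Regex.Match(name, strict);
    // Group 1 is the 1-10 char middle part; the hyphenated name is not matched at all
    Console.WriteLine(m.Success ? m.Groups[1].Value : $"(no match: {name})");
}
```

The hyphen is not a word character, so the last name fails here.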


A bit broader match could be using \S instead of \w and match until the last dot:

^(?:msl_)?(\S{1,10})\S*\.[^\W_]{2,}$

See another regex demo | C# demo

string[] strings = { "msl_0123456789_otherstuff.csv", "msl_test.xml", "anythingShort.w1", "123456testxxxxxxxx" };
string pattern = @"^(?:msl_)?(\S{1,10})\S*\.[^\W_]{2,}$";
foreach (string s in strings)
{
    Match match = Regex.Match(s, pattern);
    if (match.Success)
    {
        // Group 1 holds the first 1-10 characters after the optional msl_ prefix
        Console.WriteLine(match.Groups[1]);
    }
}

Output

0123456789
test
anythingSh

Unicode characters in Regex

Just for reference, you don't need to escape ' , and . inside a character class [], and you can avoid escaping the dash - by placing it at the beginning or end of the character class.

You can use \p{L} which matches any kind of letter from any language. See the example below:

string[] names = { "Brendán", "Jóhn", "Jason" };
Regex rgx = new Regex(@"^\p{L}+$");
foreach (string name in names)
    Console.WriteLine("{0} {1} a valid name.", name, rgx.IsMatch(name) ? "is" : "is not");

// Brendán is a valid name.
// Jóhn is a valid name.
// Jason is a valid name.

Or simply add the characters you want to allow to your character class [].

@"^[a-zA-Z0-9áéíóú@#%&',.\s-]+$"

