What Does the Regular Expression [\W-] Mean

In regex, what does [\w*] mean?

Quick answer: ^[\w*]$ will match a string consisting of a single character, where that character is alphanumeric (a letter or digit), an underscore (_), or an asterisk (*).

Details:

  • The "\w" means "any word character" which usually means alphanumeric (letters, numbers, regardless of case) plus underscore (_)
  • The "^" "anchors" to the beginning of a string, and the "$" "anchors" to the end of a string, which means that, in this case, the match must start at the beginning of the string and end at the end of the string.
  • The [] means a character class, which means "match any character contained in the character class".
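As a quick check of the points above, here is a minimal Python sketch. The sample strings are made up for illustration; only the pattern comes from the question.

```python
import re

# One character from the class: a word character or a literal asterisk.
pattern = re.compile(r'^[\w*]$')

samples = ['a', '7', '_', '*', 'ab', '', '-']
results = {s: bool(pattern.match(s)) for s in samples}
print(results)
# {'a': True, '7': True, '_': True, '*': True, 'ab': False, '': False, '-': False}

# In Python, "$" also matches just before a trailing newline.
trailing_newline = bool(pattern.match('a\n'))   # True
strict = bool(pattern.fullmatch('a\n'))         # False
```

If the trailing-newline case matters, re.fullmatch (or \Z in place of $) is the stricter choice in Python.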

It is also worth mentioning that ordinary string-literal escaping makes regular expressions awkward to type, because every backslash would have to be escaped with another backslash. Python's raw string notation avoids this: in a literal prefixed with "r", backslashes are left as typed, so they reach the regex engine intact. That is what the "r" at the beginning of the pattern is for.
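To illustrate the quoting point, this small sketch compares an ordinary string literal with a raw one. The variable names are only for illustration.

```python
escaped = '^[\\w*]$'   # ordinary literal: the backslash has to be doubled
raw = r'^[\w*]$'       # raw literal: the backslash is kept as typed

same_pattern = (escaped == raw)
print(same_pattern)    # True

# A raw r'\n' is two characters (backslash, n); a plain '\n' is one newline.
raw_len, plain_len = len(r'\n'), len('\n')
print(raw_len, plain_len)   # 2 1
```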

Note: Normally an asterisk (*) means "0 or more of the previous thing" but in the example above, it does not have that meaning, since the asterisk is inside of the character class, so it loses its "special-ness".
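The difference between an asterisk inside and outside a character class can be seen in a short sketch (the test strings are assumptions chosen for illustration):

```python
import re

# Outside a class, "*" is a quantifier: zero or more of the previous item.
quantifier_run = bool(re.fullmatch(r'a*', 'aaaa'))     # True
quantifier_star = bool(re.fullmatch(r'a*', '*'))       # False

# Inside a class, "*" is just a literal character to choose from.
class_run = bool(re.fullmatch(r'[a*]', 'aaaa'))        # False: only one character allowed
class_star = bool(re.fullmatch(r'[a*]', '*'))          # True
```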

For more information on regular expressions in Python, the two official references are the re module documentation and the Regular Expression HOWTO.

What does this pattern (?<=\w)\W+(?=\w) mean in a Python regular expression?

Here's a breakdown of the elements:

  • \w means a word character (alphanumeric or underscore)
  • \W+ is the opposite of \w; with the + it means one or more non-word characters (loosely, non-alphanumeric characters)
  • ?<= is called a "lookbehind assertion"
  • ?= is a "lookahead assertion"

So this re.sub statement means "if there are one or more non-alphanumeric characters with an alphanumeric character before and after, replace the non-alphanumeric character(s) with a space".
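A minimal sketch of that substitution follows. The original re.sub call is not shown here, so the input text and the single-space replacement are assumptions based on the description.

```python
import re

text = '--hello--world! foo_bar...baz!'

# Replace runs of non-word characters only when a word character
# sits immediately before and after the run.
cleaned = re.sub(r'(?<=\w)\W+(?=\w)', ' ', text)
print(cleaned)   # --hello world foo_bar baz!
```

The leading "--" and the trailing "!" are left alone because they have a word character on only one side, and the underscore in foo_bar is itself a word character.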

And by the way, the third argument to re.sub must be a string (or bytes-like object); it can't be a list.
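If the data is a list of strings, one option is to apply the substitution to each element. This is a sketch with a hypothetical list named lines, not the code from the question.

```python
import re

pattern = re.compile(r'(?<=\w)\W+(?=\w)')
lines = ['a--b', 'c...d']   # hypothetical input

# Passing the list itself is rejected with a TypeError.
try:
    pattern.sub(' ', lines)
    list_accepted = True
except TypeError:
    list_accepted = False

# Apply the substitution string by string instead.
cleaned = [pattern.sub(' ', line) for line in lines]
print(list_accepted, cleaned)   # False ['a b', 'c d']
```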

Regex: Does \w mean [a-zA-Z] or [a-zA-Z0-9_]? Most tutorials only say that \w matches "word characters".

It means [a-zA-Z0-9_]. According to the Java summary of regular expression constructs found here: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html

\d  A digit: [0-9]
\w A word character: [a-zA-Z_0-9]

So (\w|\d|_) is equivalent to ([a-zA-Z_0-9]|[0-9]|_), where the extra underscore and the \d are both redundant, since each is already included in \w.

(\w|\d|_) is equivalent to (\w)
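The equivalence can be checked mechanically. The answer above is about Java; this sketch uses Python instead and passes re.ASCII so that \w and \d are limited to the ASCII ranges listed in the Java table (by default Python's \w is Unicode-aware for str patterns).

```python
import re

long_form = re.compile(r'(\w|\d|_)', re.ASCII)
short_form = re.compile(r'(\w)', re.ASCII)
explicit = re.compile(r'[a-zA-Z_0-9]')

# Compare the three patterns on every ASCII character.
chars = [chr(i) for i in range(128)]
agree = all(
    bool(long_form.fullmatch(c)) == bool(short_form.fullmatch(c)) == bool(explicit.fullmatch(c))
    for c in chars
)
word_char_count = sum(1 for c in chars if short_form.fullmatch(c))
print(agree, word_char_count)   # True 63
```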

How to interpret this regular expression /[\W_]/g

/ ... /g It's a global regex. So it'll operate on multiple matches in the string.

[ ... ] This creates a character set. Basically it'll match any single character within the listed set of characters.

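\W_ This matches any non-word character (\W, the inverse of \w) plus the underscore, which \w would otherwise count as a word character.

Then you have a few one-off replacements for comma and period. Honestly, if that's the complete code, /[\W_,.]/g, omitting the two other replaces, would work just as well. Comma and period are already non-word characters, so \W covers them even without listing them.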
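The question is about JavaScript, but the same character class can be tried in Python, where re.sub replaces every match by default (comparable to the g flag). The input string and the empty replacement are assumptions, since the original code is not shown.

```python
import re

text = 'A man, a plan, a canal: Panama_'

# Remove every non-word character and every underscore.
stripped = re.sub(r'[\W_]', '', text)
print(stripped)   # AmanaplanacanalPanama

# Listing "," and "." explicitly changes nothing: \W already covers them.
same_with_extras = (re.sub(r'[\W_,.]', '', text) == stripped)
print(same_with_extras)   # True
```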

What is the meaning of [\w\-] regular expression in PHP

Regex 101

\w explained

\w match any word character [a-zA-Z0-9_]

\w\- explained

\w\-
\w match any word character [a-zA-Z0-9_]
\- matches the character - literally
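Here is a short sketch of the class in use. It is written in Python rather than PHP, and the + quantifier and the sample string are additions for illustration.

```python
import re

sample = 'my-slug_name v2.0 (beta)!'

# Runs of word characters and hyphens.
tokens = re.findall(r'[\w\-]+', sample)
print(tokens)   # ['my-slug_name', 'v2', '0', 'beta']

# A hyphen placed last in the class is literal even without the backslash.
same_unescaped = (re.findall(r'[\w-]+', sample) == tokens)
print(same_unescaped)   # True
```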

Matching Email Addresses Simple, not future proof

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b
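The pattern lists only uppercase ranges, so it presumably expects a case-insensitive flag. The sketch below assumes that and uses re.IGNORECASE in Python; the addresses are made up.

```python
import re

email = re.compile(r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b', re.IGNORECASE)

found = email.findall('Contact: jane.doe@example.com, or bob@test.museum')
print(found)   # ['jane.doe@example.com', 'bob@test.museum']

# "Not future proof": a top-level domain longer than 6 letters is rejected.
long_tld = email.findall('x@site.photography')
print(long_tld)   # []
```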

Difference between \w and \b regular expression meta characters

The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.

There are three different positions that qualify as word boundaries:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.
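The three kinds of position can be listed with a small Python sketch (the sample text is an assumption):

```python
import re

text = 'hi, there'

# Every \b match is zero-length, so start() alone identifies the position.
matches = list(re.finditer(r'\b', text))
positions = [m.start() for m in matches]
all_zero_length = all(m.start() == m.end() for m in matches)
print(positions, all_zero_length)   # [0, 2, 4, 9] True
```

Position 0 is before the first character, 9 is after the last, and 2 and 4 lie between a word character and a non-word character.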

Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters".
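A "whole words only" search might look like this in Python; the word cat and the sample sentence are chosen only for illustration.

```python
import re

text = 'cat concat cats cat.'

# With boundaries: only standalone occurrences of "cat".
whole = [m.start() for m in re.finditer(r'\bcat\b', text)]
# Without boundaries: every occurrence, including inside other words.
loose = [m.start() for m in re.finditer(r'cat', text)]
print(whole)   # [0, 16]
print(loose)   # [0, 7, 11, 16]
```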

In all flavors, the characters [a-zA-Z0-9_] are word characters. These are also matched by the short-hand character class \w. Flavors showing "ascii" for word boundaries in the flavor comparison recognize only these as word characters.

\w stands for "word character", usually [A-Za-z0-9_]. Notice the inclusion of the underscore and digits.

\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters.

\W is short for [^\w], the negated version of \w.
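Both negated forms can be checked with a brief Python sketch (sample strings are assumptions):

```python
import re

# \B: "cat" only where it does NOT start at a word boundary.
inner = [m.start() for m in re.finditer(r'\Bcat', 'concat cat')]
print(inner)   # [3]

# \W and [^\w] select the same characters.
sample = 'a-b c_d!'
non_word = re.findall(r'\W', sample)
same_as_negated_class = (non_word == re.findall(r'[^\w]', sample))
print(non_word, same_as_negated_class)   # ['-', ' ', '!'] True
```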


