Regex Word Boundary Expressions

What is a word boundary in regex?

A word boundary, in most regex dialects, is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).

So, in the string "-12", it would match before the 1 or after the 2. The dash is not a word character.
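A quick way to check these positions is to ask the engine where \b matches. The snippet below is a minimal sketch that assumes Python's re module (the answer above does not name a language):

```python
import re

# Collect every position at which the zero-width \b assertion succeeds in "-12".
positions = [m.start() for m in re.finditer(r"\b", "-12")]

print(positions)  # [1, 3]: before the 1 and after the 2
```

Position 0 is not reported because the dash is not a word character.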

How do I use regular expressions to match a word with boundaries?

Try this regex: (?<=[\W_])work(?=[\W_])

This uses positive look-ahead and look-behind assertions to respect enclosing characters but without including them in the match.

This regex matches work

  1. if it follows a \W character or an underscore, AND

  2. if it is followed by a \W character or an underscore.

\b cannot be used for the boundary here: _ counts as a word character (\w), so \b sees no boundary between _ and a letter, whereas this pattern needs _ to act as a delimiter.
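The difference between the two approaches can be sketched as follows, assuming Python's re module. The sample text is made up for illustration:

```python
import re

text = "my_work_item, work! homework"  # hypothetical sample text

# \b treats "_" as a word character, so "work" inside "my_work_item" is skipped.
with_b = [m.start() for m in re.finditer(r"\bwork\b", text)]

# The lookaround version accepts "_" as a delimiter as well.
with_lookarounds = [m.start() for m in re.finditer(r"(?<=[\W_])work(?=[\W_])", text)]

print(with_b)            # [14]
print(with_lookarounds)  # [3, 14]

# Caveat: the lookbehind needs an actual character, so "work" at the very
# start of the string is not matched by this pattern.
at_start = re.search(r"(?<=[\W_])work(?=[\W_])", "work now")
print(at_start)  # None
```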


Further examples:

  • Matching multiple words:
    (?<=[\W_])(work|job)(?=[\W_])

  • Same as above but without creating submatches:
    (?<=[\W_])(?:work|job)(?=[\W_])

  • Also respecting line end:
    (?<=[\W_])(?:work|job)(?=[\W_]|$)
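A short sketch of how these variants behave in Python's re module. The last pattern, which also accepts the start of the string, is not from the answer above; it is an assumed extension that uses (?:^|...) because Python does not allow alternatives of different widths inside a lookbehind:

```python
import re

# Capturing group: the matched word is available as a submatch.
capturing = re.search(r"(?<=[\W_])(work|job)(?=[\W_])", " job ")
# Non-capturing group: same match, but no submatch is created.
non_capturing = re.search(r"(?<=[\W_])(?:work|job)(?=[\W_])", " job ")
print(capturing.groups())      # ('job',)
print(non_capturing.groups())  # ()

text = "work or a_job"  # hypothetical sample text

# |$ in the lookahead lets "job" match at the end of the string.
end_aware = re.findall(r"(?<=[\W_])(?:work|job)(?=[\W_]|$)", text)
print(end_aware)  # ['job']

# Assumed extension: also accept the start of the string.
both_ends = re.findall(r"(?:^|(?<=[\W_]))(?:work|job)(?=[\W_]|$)", text)
print(both_ends)  # ['work', 'job']
```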


Some useful notes regarding regex syntax:

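  • \w matches all alphanumeric characters and underscore; for ASCII text this is equivalent to [a-zA-Z0-9_] (Unicode-aware engines, such as Python 3 with str patterns, also match non-ASCII letters and digits)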

  • \W matches the exact opposite of \w

  • \b matches boundaries between a \w and a \W character (or vice versa)

  • Positive look-ahead assertion:
    foo(?=bar) matches foo followed by bar, without including bar in the match.

  • Positive look-behind assertion:
    (?<=foo)bar matches bar if it follows foo, without including foo in the match.
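The two assertions from the notes above can be tried directly. This is a minimal sketch in Python's re module:

```python
import re

ahead = re.search(r"foo(?=bar)", "foobar")
behind = re.search(r"(?<=foo)bar", "foobar")

# The asserted context is checked but not included in the match.
print(ahead.group(), ahead.span())    # foo (0, 3)
print(behind.group(), behind.span())  # bar (3, 6)

# Without the required context there is no match at all.
no_context = re.search(r"foo(?=bar)", "foobaz")
print(no_context)  # None
```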

For further information on Python regex syntax, see the Python regex docs or the Perl regex docs. The web-based Python Regex Tool is also handy for testing.

Regular Expression to match the word boundary but exclude the word if it has any prefix or suffix

You can use lookarounds to make the boundaries more precise:

\b(?<![./])findword\b(?![./])
  ^^^^^^^^^          ^^^^^^^^

The (?<![./]) lookbehind will fail the match if there is a . or / before the word, and the (?![./]) lookahead will fail the match if there is a . or / after the word.
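To see which inputs pass, here is a small sketch in Python's re module. The sample strings are invented for illustration and are not from the original question:

```python
import re

pattern = re.compile(r"\b(?<![./])findword\b(?![./])")

# Hypothetical inputs: plain occurrences, then prefixed and suffixed ones.
samples = [
    "findword",
    "x findword y",
    "a.findword",
    "dir/findword",
    "findword.txt",
    "findword/x",
    "findwords",
]
results = {s: bool(pattern.search(s)) for s in samples}

for sample, matched in results.items():
    print(f"{sample!r:16} {matched}")
```

Only the first two samples match. Note that a sentence-final period directly after the word also blocks the match, which may or may not be what you want.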


Regular Expression Word Boundary and Special Characters

\b is a zero-width assertion: it doesn't consume any characters, it just asserts that a certain condition holds at a given position. A word boundary asserts that the position is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one. (A "word character" is a letter, a digit, or an underscore.) In your string:

add +

...there's a word boundary at the beginning because the a is not preceded by a word character, and there's one after the second d because it's not followed by a word character. The \b in your regex (/\b\+/) is trying to match between the space and the +, which doesn't work because neither of those is a word character.
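The same reasoning can be checked in code. The sketch below assumes Python's re module rather than the JavaScript-style /\b\+/ literal from the question; \b behaves the same way for this input:

```python
import re

text = "add +"

# Word boundaries exist only around "add".
boundaries = [m.start() for m in re.finditer(r"\b", text)]
print(boundaries)  # [0, 3]

# \b\+ needs a boundary directly before "+", and there is none.
failed = re.search(r"\b\+", text)
print(failed)  # None

# Dropping \b, or asserting a non-boundary with \B, finds the plus sign.
plain = re.search(r"\+", text)
non_boundary = re.search(r"\B\+", text)
print(plain.start(), non_boundary.start())  # 4 4
```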

Regex anchor \< versus \b for word boundary

"Does not work" is not correct; one works in some regex dialects, the other in others.

Most "modern" regex dialects (Python, Perl, Ruby, etc.) use \b as the word boundary, on both sides.

More traditional regex dialects, like the original egrep, use \< as the left word boundary operator, and \> on the right.

(Strictly speaking, Al Aho's original egrep did not have word boundaries; this feature was added later. Maybe see https://stackoverflow.com/a/39367415/874188 for a one-minute summary of regex history.)
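As an illustration of the dialect difference, the sketch below uses Python's re module, where \< and \> are not operators: a backslash before punctuation simply matches that character literally. The START and END names are hypothetical helpers that emulate the one-sided boundaries with \b plus a lookaround:

```python
import re

# In Python, \< and \> match a literal "<" and ">".
literal_miss = re.search(r"\<cat\>", "a cat sat")
literal_hit = re.search(r"\<cat\>", "a <cat> sat")
print(literal_miss)         # None
print(literal_hit.group())  # <cat>

# Hypothetical emulation of the one-sided boundaries.
START = r"\b(?=\w)"   # boundary with a word character to the right, like \<
END = r"\b(?<=\w)"    # boundary with a word character to the left, like \>

starts = [m.start() for m in re.finditer(START, "add +")]
ends = [m.start() for m in re.finditer(END, "add +")]
print(starts, ends)  # [0] [3]
```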

JavaScript regular expression for word boundaries, tolerating in-word hyphens and apostrophes

You can organize your word-boundary characters into two groups.

  1. Characters that cannot be alone.
  2. Characters that can be alone.

A regex that works with your example would be:

[\s.,'-]{2,}|[\s.]


Now all that's left is to keep adding all non-word characters into those two groups until it fits all of your needs. So you might start adding symbols and more punctuation to those character classes.
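Since the original example is not shown here, the following sketch uses an invented sentence, and it uses Python's re.split instead of JavaScript; the pattern itself is unchanged and behaves the same for this input:

```python
import re

# Two or more separator characters in a row, or a single whitespace or period.
SEPARATORS = re.compile(r"[\s.,'-]{2,}|[\s.]")

sentence = "It's a well-known fact, isn't it."  # hypothetical input

# Splitting can leave empty strings at the edges, so filter them out.
words = [w for w in SEPARATORS.split(sentence) if w]
print(words)  # ["It's", 'a', 'well-known', 'fact', "isn't", 'it']
```

A lone apostrophe or hyphen inside a word does not split it, because those characters only count as separators when at least two separator characters occur together.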

Why is this word boundary regex not matching

. is not a word character. \b matches at word boundaries, i.e. between a word character and a character that is not considered part of a word. You therefore cannot expect the . in 1. to be inside the "word": 1 and . do not form a word together.


The quick reference document describes \b as:

The match must occur on a boundary between a \w (alphanumeric) and a \W (nonalphanumeric) character.

And \w is described as:

Matches any word character.

If you check what a word character is, you will find it includes these Unicode classes:

  • Ll [Letter, Lowercase]
  • Lu [Letter, Uppercase]
  • Lt [Letter, Titlecase]
  • Lo [Letter, Other]
  • Lm [Letter, Modifier]
  • Mn [Mark, Nonspacing]
  • Nd [Number, Decimal Digit]
  • Pc [Punctuation, Connector]

But . has Unicode class Po [Punctuation, Other] which is not listed above.

So if you expect \b to match a word boundary in 1., that boundary lies between the 1 and the . character, not after the period. This answers your question of why.
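The boundary positions can be listed explicitly. This sketch uses Python's re module rather than .NET, and both the input string and the pattern \b1\.\b are assumptions, since the original question is not reproduced here; for this ASCII input the two engines agree on what a word character is:

```python
import re

text = "1. item"  # hypothetical input

boundaries = [m.start() for m in re.finditer(r"\b", text)]
print(boundaries)  # [0, 1, 3, 7]

# There is no boundary after the period (position 2), so a trailing \b fails.
with_trailing_b = re.search(r"\b1\.\b", text)
print(with_trailing_b)  # None

# Without the trailing \b the match succeeds.
without_trailing_b = re.search(r"\b1\.", text)
print(without_trailing_b.group())  # 1.
```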

Note: .NET regular expressions are best tested on sites dedicated to them, for example Regex Storm. If you test your regex with the PCRE flavour (as on the site you linked), you can get results that differ from .NET.

reg-expression : word boundary with \t \n \r

Python interprets escape sequences for control characters (such as \n and \t) in a string literal unless it is a raw string.

So this is the result when the literal is not a raw string (this one is single-quoted):

>>> print ('HelloWorld is a beautiful word\nHelloWorld\t\t\tHelloWorld HelloWorld \t HelloWorld nopHelloWorld  HelloWorldnop \tnopHelloWorld ...')
HelloWorld is a beautiful word
HelloWorld HelloWorld HelloWorld HelloWorld nopHelloWorld HelloWorldnop nopHelloWorld ...

This matches 5 occurrences of HelloWorld, as expected: https://regex101.com/r/8TwxCO/1

But if the original string is a raw string, only 3 occurrences match: https://regex101.com/r/nUdSZQ/1

>>> print (r'HelloWorld is a beautiful word\nHelloWorld\t\t\tHelloWorld HelloWorld \t HelloWorld nopHelloWorld  HelloWorldnop \tnopHelloWorld ...')
HelloWorld is a beautiful word\nHelloWorld\t\t\tHelloWorld HelloWorld \t HelloWorld nopHelloWorld HelloWorldnop \tnopHelloWorld ...
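The two counts can be reproduced locally. The pattern \bHelloWorld\b is an assumption, since the regex from the question is not shown here, but it yields the 5 and 3 matches described above:

```python
import re

PATTERN = re.compile(r"\bHelloWorld\b")  # assumed pattern, not from the question

# Normal literal: \n and \t become real newline and tab characters.
interpreted = 'HelloWorld is a beautiful word\nHelloWorld\t\t\tHelloWorld HelloWorld \t HelloWorld nopHelloWorld  HelloWorldnop \tnopHelloWorld ...'

# Raw literal: \n and \t stay as a backslash followed by a letter,
# so "n" or "t" sits directly in front of some HelloWorld occurrences.
raw = r'HelloWorld is a beautiful word\nHelloWorld\t\t\tHelloWorld HelloWorld \t HelloWorld nopHelloWorld  HelloWorldnop \tnopHelloWorld ...'

interpreted_count = len(PATTERN.findall(interpreted))
raw_count = len(PATTERN.findall(raw))
print(interpreted_count, raw_count)  # 5 3
```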

Combining parens and word boundaries in a regular expression

You may use:

import re

text = 'Hello my name is Tom and I love Tomcat. My email address is tom@foo.bar and my phone number is (201) 5550123.'
values = ['Tom', 'tom@foo.bar', '(201) 5550123']

# Escape each value so metacharacters such as ( ) and . are matched literally.
escaped_values = [re.escape(value) for value in values]

# Lookarounds stand in for \b: they only require that no word character is adjacent.
combined_pattern = r'(?<!\w)(?:' + '|'.join(escaped_values) + r')(?!\w)'
combined_regex = re.compile(combined_pattern)

print(combined_pattern)
print()
print(combined_regex.sub('', text))

Output:

(?<!\w)(?:Tom|tom@foo\.bar|\(201\)\ 5550123)(?!\w)

Hello my name is  and I love Tomcat. My email address is  and my phone number is .

Take note of the combined regex in use here:

(?<!\w)(?:Tom|tom@foo\.bar|\(201\)\ 5550123)(?!\w)


RegEx Explained:

  • (?<!\w): Negative lookbehind to assert that we don't have a word character before the current position
  • (?:: Start non-capture group
    • Tom|tom@foo\.bar|\(201\)\ 5550123: Match one of these substrings separated with | (alternation)
  • ): End non-capture group
  • (?!\w): Negative lookahead to assert that we don't have a word character after the current position
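To see why plain \b does not work for a value that starts with a parenthesis, compare the two approaches on the phone number alone. This is a minimal sketch with a shortened version of the text above:

```python
import re

text = "my phone number is (201) 5550123."
phone = re.escape("(201) 5550123")

# \b fails: a space followed by "(" is not a word boundary.
with_b = re.search(r"\b" + phone + r"\b", text)
print(with_b)  # None

# The lookarounds only require that no word character is adjacent.
with_lookarounds = re.search(r"(?<!\w)" + phone + r"(?!\w)", text)
print(with_lookarounds.group())  # (201) 5550123

# They still protect against partial words such as "Tom" in "Tomcat".
partial = re.search(r"(?<!\w)Tom(?!\w)", "I love Tomcat")
print(partial)  # None
```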

Match star * character at end of word boundary \b

The * is not a word character, so there is no match when it is followed by \b and then a non-word character: \b needs a word character on exactly one side.

Assuming the initial word boundary is fine, but you want to match sh*t and not sh*t*, or match f***! and not f***a, you can simulate your own word boundary with a negative lookahead:

\b(...)(?![\w*])


If needed, the opening word boundary \b can be replaced by a negative lookbehind: (?<![\w*])
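With concrete words substituted for the (...) placeholder, the behaviour can be sketched in Python's re module as follows. The specific patterns are filled in for illustration only:

```python
import re

# A closing \b fails after "*" when the next character is also a non-word character.
closing_b = re.search(r"\bf\*\*\*\b", "f***!")
print(closing_b)  # None

# Negative lookahead as the closing "boundary": no word character or "*" may follow.
censored = re.compile(r"\bf\*\*\*(?![\w*])")
print(bool(censored.search("f***!")))  # True
print(bool(censored.search("f***a")))  # False

masked = re.compile(r"\bsh\*t(?![\w*])")
starts = [m.start() for m in masked.finditer("sh*t sh*t* sh*ts")]
print(starts)  # [0]

# Opening \b replaced by a negative lookbehind: a leading "*" now blocks the match.
strict = re.compile(r"(?<![\w*])sh\*t(?![\w*])")
print(bool(masked.search("*sh*t")))  # True
print(bool(strict.search("*sh*t")))  # False
```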


