How Exactly Do Regular Expression Word Boundaries Work in PHP

How exactly do Regular Expression word boundaries work in PHP?

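The word boundary \b matches at a position where a \w (a word character) sits next to a \W (a non-word character), in either order, and also at the start or end of the string when the adjacent character is a word character. Your pattern requires a \b directly before your @, and @ itself is a \W character. So to get a match you need a word character immediately before your @.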

something@nimal
        ^^

==> Match because of the word boundary between g and @.

something!@nimal
         ^^

==> NO match because between ! and @ there is no word boundary, both characters are \W
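
A quick way to check this in PHP is sketched below. The pattern /\b@nimal/ is only an assumed stand-in for the pattern from the question, which is not reproduced here.

```php
<?php
// Assumed stand-in pattern: a word boundary directly before a literal "@".
$pattern = '/\b@nimal/';

// "g" is \w and "@" is \W, so there is a boundary between them.
$withWordChar = preg_match($pattern, 'something@nimal');

// "!" and "@" are both \W, so there is no boundary between them.
$withPunctuation = preg_match($pattern, 'something!@nimal');

var_dump($withWordChar, $withPunctuation); // int(1), int(0)
```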

With word boundaries (\b) in RegEx do I need to have it before AND after the word, or just before?

You would need to use both. Without the last \b (for example, with a pattern like \bfoo) you would get a match on strings such as:

"I love football"
"You foolishly left off your second word boundary"

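As a rough illustration, the sketch below compares a pattern with only the leading \b against one with \b on both sides. The search word foo and the third test string are assumptions made for this example; they are chosen to fit the two strings above.

```php
<?php
// Hypothetical search word "foo", chosen to fit the example strings.
$onlyLeading = '/\bfoo/';
$bothSides   = '/\bfoo\b/';

$subjects = [
    'I love football',
    'You foolishly left off your second word boundary',
    'The foo bar', // assumed extra case where "foo" is a whole word
];

$results = [];
foreach ($subjects as $subject) {
    $results[$subject] = [
        'leading' => preg_match($onlyLeading, $subject),
        'both'    => preg_match($bothSides, $subject),
    ];
}

// Only "The foo bar" matches when \b is used on both sides.
print_r($results);
```
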
Regex PHP word boundaries?

A word boundary is not a character

A word boundary is \b. A word boundary is not a space, or any character at all. It is the transition between a word and a non-word, so it's actually a point between characters, not a character itself.

If you want to match 123 Main street, you will have to match a sequence of digits, followed by a space, followed by (I think) one or more words. So something like

/^\w{2,5}(\s[a-zA-Z]+\b)+/

So the second group matches a space (which comes after the street number or the previous word of the name), a sequence of alphabetical characters, and a word boundary. It will match '123 main street', but also just 'main street', because \w matches letters as well as digits.

Greedy/ungreedy

By default, the quantifiers in a regular expression are greedy and will match as many characters as possible. So in this case you won't actually need the word boundary at all: the pattern won't stop at str if it can match street. So the following regular expression will have the same effect on this input as the one above (unless you add an ungreedy modifier).

/^\w{2,5}(\s[a-zA-Z]+)+/

But for an ungreedy regular expression it is important. Compare

^\w{2,5}(\s[a-zA-Z]+?)+

and

^\w{2,5}(\s[a-zA-Z]+?\b)+

The first one will match 123 M, while the second one will match 123 Main street again.
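
To see the difference for yourself, here is a minimal sketch that runs all four patterns from this answer against the same subject. The array keys are just labels made up for this example.

```php
<?php
$subject = '123 Main street';

// The four patterns discussed above; the keys are illustrative labels only.
$patterns = [
    'greedy_boundary'    => '/^\w{2,5}(\s[a-zA-Z]+\b)+/',
    'greedy_no_boundary' => '/^\w{2,5}(\s[a-zA-Z]+)+/',
    'lazy_no_boundary'   => '/^\w{2,5}(\s[a-zA-Z]+?)+/',
    'lazy_boundary'      => '/^\w{2,5}(\s[a-zA-Z]+?\b)+/',
];

$matched = [];
foreach ($patterns as $label => $pattern) {
    $m = [];
    preg_match($pattern, $subject, $m);
    // $m[0] holds the full match; null means the pattern did not match.
    $matched[$label] = $m[0] ?? null;
}

print_r($matched);
```

Only the lazy pattern without \b stops early, at "123 M"; the other three consume the whole address.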

Test your regexes

If you'd like to test this or other regular expressions, you can visit http://www.phpliveregex.com/. It allows you to test regular expressions to see how they work with a couple of preg_* functions.

PHP regex with word boundary after escaped character

The recommended way to solve this is to use lookarounds that assert the absence of word characters, instead of word boundaries, e.g. (?<!\w)c\+\+(?!\w):

$string = 'I don\'t know C e C++ so well, but I can code in PHP.';
// Language name => URI path used to build the link.
$languages = [
    'PHP' => '/php/',
    'C++' => '/cpp/',
    'C' => '/c/',
];

foreach ($languages as $name => $uri) {
    // preg_quote() escapes the "+" signs; the lookarounds stand in for \b.
    $regex = '/(?<!\w)' . preg_quote($name, '/') . '(?!\w)/';
    if (preg_match($regex, $string)) {
        echo "For {$name} information refer to http://foo.bar{$uri}" . PHP_EOL;
    }
}

Output:

For PHP information refer to http://foo.bar/php/
For C++ information refer to http://foo.bar/cpp/
For C information refer to http://foo.bar/c/
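
For comparison, this small sketch shows why plain \b does not work here. The subject string is made up for the example.

```php
<?php
// Made-up subject string for this comparison.
$subject = 'I know C++ and PHP.';

// "+" and the following space are both \W, so the trailing \b cannot match.
$withBoundaries = preg_match('/\bC\+\+\b/', $subject);

// The lookarounds only require that no word character touches the term.
$withLookarounds = preg_match('/(?<!\w)C\+\+(?!\w)/', $subject);

var_dump($withBoundaries, $withLookarounds); // int(0), int(1)
```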

PHP regex Word Boundaries doesnt match end of string

I thought both regexes should work, but I got the same problem as you on regex101. To fix this, you can add an explicit end-of-string alternative next to the trailing \b:

$row = "TEDARİKÇİ,MÜŞTERİ";
var_dump( preg_match('#\bMÜŞTERİ(\b|$)#iu', $row));

php regex word boundary matching in utf-8

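In PCRE, UTF-8 mode on its own does not make the standard shorthands like \w and \b Unicode-aware; that additionally requires the UCP option, so depending on your PHP and PCRE build they may still treat only ASCII letters as word characters. Using the Unicode shorthands explicitly, as you worked out, sidesteps the problem, and you can make it a little less ugly by using lookarounds instead of alternations: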

/(?<!\pL)weiß(?!\pL)/u

Notice also how I left the curly braces out of the Unicode class shorthands; you can do that when the class name consists of a single letter.
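
Here is a short sketch of that pattern in use. The German sample strings are made up for the example, and the file is assumed to be saved as UTF-8.

```php
<?php
// Lookarounds on \pL (any Unicode letter) instead of \b; /u enables UTF-8 mode.
$pattern = '/(?<!\pL)weiß(?!\pL)/u';

// Made-up sample strings.
$standalone = preg_match($pattern, 'Der Schnee ist weiß.'); // whole word
$asPrefix   = preg_match($pattern, 'ein weißer Hund');      // followed by a letter
$asSuffix   = preg_match($pattern, 'Edelweiß');             // preceded by a letter

var_dump($standalone, $asPrefix, $asSuffix); // int(1), int(0), int(0)
```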

word boundaries preg_replace

Try

/\B%[a-z0-9]+\b/

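You don't have a word boundary \b between a space and the %, because both are non-word characters, but you do have one between a word character such as s and the %.

\B is the opposite of \b: it matches at any position that is not a word boundary.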
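
A small sketch with preg_replace shows the effect. The subject string and the replacement text are made up for the example.

```php
<?php
$pattern = '/\B%[a-z0-9]+\b/';

// Made-up subject: one "%" after a space, one "%" directly after a digit.
$subject = 'Hello %name, I am 100%sure';

// Only "%name" is replaced: before it sits a space (no boundary, so \B matches).
// In "100%sure" there is a boundary between "0" and "%", so \B fails there.
$result = preg_replace($pattern, '[placeholder]', $subject);

echo $result, PHP_EOL; // Hello [placeholder], I am 100%sure
```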

Using word boundary instead of spaces in regex

Here are the word boundaries:

 s h e   h a t e s   m y   g u t s 
^     ^ ^         ^ ^   ^ ^       ^

So your pattern matches like this:

 s h e   h a t e s   m y   g u t s 
^\_______________________________/^
|                |                |
\b               .+               \b
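
You can confirm both pictures in PHP. This is a minimal sketch that assumes the pattern in question is \b.+\b, as drawn above.

```php
<?php
$subject = 'she hates my guts';

// \b is zero-width, so every match is an empty string; only the offsets matter.
preg_match_all('/\b/', $subject, $m, PREG_OFFSET_CAPTURE);
$offsets = array_column($m[0], 1);
print_r($offsets); // 0, 3, 4, 9, 10, 12, 13, 17

// Greedy .+ runs from the first boundary to the last one.
preg_match('/\b.+\b/', $subject, $whole);
echo $whole[0], PHP_EOL; // she hates my guts
```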

If you want to get rid of the first and last word, I'd just replace these with an empty string, using the following pattern:

^\W*?\w+\s*|\s*\w+\W*$

Both \W* are there to account for possible punctuation (e.g. she hates my guts.), but you can remove them if they're not needed.
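
Applied with preg_replace, that could look like the following sketch. The delimiters are added here, and the variable names are made up for the example.

```php
<?php
// First alternative strips the first word, second alternative strips the last one.
$pattern = '/^\W*?\w+\s*|\s*\w+\W*$/';

$plain      = preg_replace($pattern, '', 'she hates my guts');
$punctuated = preg_replace($pattern, '', 'she hates my guts.');

var_dump($plain, $punctuated); // both: string(8) "hates my"
```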
