How exactly do Regular Expression word boundaries work in PHP?
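The word boundary \b matches at any position where a word character (\w) sits next to a non-word character (\W), in either order, or next to the start or end of the string. You want a match when there is a \b before your @, but @ is itself a \W character, so a boundary exists there only if the character before the @ is a word character.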
something@nimal
        ^^
==> Match, because there is a word boundary between g and @.
something!@nimal
         ^^
==> NO match, because there is no word boundary between ! and @; both characters are \W.
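The two cases above can be checked directly with preg_match. This is a minimal sketch; the pattern \b@nimal is an assumption made for illustration, since the question's actual regex is not shown here.

```php
<?php
// Assumed pattern for illustration: a word boundary immediately before "@nimal".
$pattern = '/\b@nimal/';

// "g" is \w and "@" is \W, so a boundary sits between them.
$afterWordChar = preg_match($pattern, 'something@nimal');

// "!" and "@" are both \W, so there is no boundary before "@".
$afterPunctuation = preg_match($pattern, 'something!@nimal');

var_dump($afterWordChar);    // int(1)
var_dump($afterPunctuation); // int(0)
```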
With word boundaries (\b) in RegEx do I need to have it before AND after the word, or just before?
You would need to use both. Without the last \b you would get a match on strings such as:
"I love football"
"You foolishly left off your second word boundary"
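A short sketch of the difference, assuming the word being searched for is foo (a hypothetical choice suggested by the example strings; the question's actual word is not shown here):

```php
<?php
$text = 'I love football';

// Leading boundary only: "foo" at the start of "football" still matches.
$leadingOnly = preg_match('/\bfoo/', $text);

// Boundaries on both sides: there is no boundary between "foo" and "t".
$bothSides = preg_match('/\bfoo\b/', $text);

// The standalone word matches with both boundaries in place.
$standalone = preg_match('/\bfoo\b/', 'I love foo');

var_dump($leadingOnly, $bothSides, $standalone); // int(1), int(0), int(1)
```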
Regex PHP word boundaries?
A word boundary is not a character
A word boundary is written \b. It is not a space, or any character at all. It is the transition between a word character and a non-word character, so it is actually a position between characters, not a character itself.
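One way to see that \b consumes nothing is to collect every place it matches. The sketch below uses an arbitrary sample string, 'ab cd', and records the offset and text of each match; every matched text is empty.

```php
<?php
// Find every position where \b matches in a sample string.
preg_match_all('/\b/', 'ab cd', $matches, PREG_OFFSET_CAPTURE);

// Each entry of $matches[0] is [matched text, byte offset].
$texts = array_column($matches[0], 0);
$offsets = array_column($matches[0], 1);

print_r($offsets); // offsets 0, 2, 3 and 5: the edges of "ab" and "cd"
print_r($texts);   // four empty strings, since a boundary has zero width
```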
If you want to match 123 Main street, you will have to match a sequence of numbers, followed by a space, followed by (I think) one or more words. So something like
/^\w{2,5}(\s[a-zA-Z]+\b)+/
The group matches a space (that comes after the street number or the previous word of the name), a sequence of alphabetical characters, and a word boundary. It will match '123 main street', and also just 'main street', because \w{2,5} accepts letters as well as digits.
Greedy/ungreedy
By default, quantifiers in a regular expression are greedy and will match as many characters as possible. So in this case you won't actually need the word boundary at all. It won't match str if it can match street. So the following regular expression will have the same effect as the one above (unless you add an ungreedy modifier).
/^\w{2,5}(\s[a-zA-Z]+)+/
But for an ungreedy regular expression it is important. Compare
^\w{2,5}(\s[a-zA-Z]+?)+
and
^\w{2,5}(\s[a-zA-Z]+?\b)+
The first one will match 123 M, while the second one will match 123 Main street again.
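The comparison can be reproduced with preg_match. This is a small sketch using the patterns above wrapped in / delimiters; the variable names are only illustrative.

```php
<?php
$address = '123 Main street';

// Lazy quantifier, no boundary: each word stops after one letter.
preg_match('/^\w{2,5}(\s[a-zA-Z]+?)+/', $address, $lazy);

// Lazy quantifier with \b: each word must run up to a boundary.
preg_match('/^\w{2,5}(\s[a-zA-Z]+?\b)+/', $address, $lazyWithBoundary);

// Greedy quantifier, no boundary: already consumes whole words.
preg_match('/^\w{2,5}(\s[a-zA-Z]+)+/', $address, $greedy);

echo $lazy[0], PHP_EOL;             // 123 M
echo $lazyWithBoundary[0], PHP_EOL; // 123 Main street
echo $greedy[0], PHP_EOL;           // 123 Main street
```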
Test your regexes
If you would like to test this or other regular expressions, you can visit http://www.phpliveregex.com/. It lets you see how they behave with a couple of the preg_* functions.
PHP regex with word boundary after escaped character
A recommended way to solve this is to use lookarounds that assert the absence of adjacent word characters instead of word boundaries, e.g. (?<!\w)c\+\+(?!\w):
$string = 'I don\'t know C e C++ so well, but I can code in PHP.';
$languages = [
    'PHP' => '/php/',
    'C++' => '/cpp/',
    'C' => '/c/',
];
foreach ($languages as $name => $uri) {
    // preg_quote() escapes the "+" signs; the lookarounds reject adjacent word characters
    $regex = '/(?<!\w)' . preg_quote($name, '/') . '(?!\w)/';
    if (preg_match($regex, $string)) {
        echo "For {$name} information refer to http://foo.bar{$uri}" . PHP_EOL;
    }
}
Output:
For PHP information refer to http://foo.bar/php/
For C++ information refer to http://foo.bar/cpp/
For C information refer to http://foo.bar/c/
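To see why plain \b is not enough here, the sketch below runs both variants against a sample sentence (the sentence is made up for illustration). After the final + comes a space, and both are non-word characters, so no boundary exists at that point.

```php
<?php
$text = 'I can read C++ code.';
$quoted = preg_quote('C++', '/'); // yields C\+\+

// \b after the last "+" fails: "+" and " " are both \W.
$withBoundaries = preg_match('/\b' . $quoted . '\b/', $text);

// The lookarounds only require that no word character is adjacent.
$withLookarounds = preg_match('/(?<!\w)' . $quoted . '(?!\w)/', $text);

var_dump($withBoundaries);  // int(0)
var_dump($withLookarounds); // int(1)
```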
PHP regex Word Boundaries doesnt match end of string
I thought both regexes should work, but I got the same problem as you in regex101. To fix this, you can change your regex to:
$row = "TEDARİKÇİ,MÜŞTERİ";
var_dump( preg_match('#\bMÜŞTERİ(\b|$)#iu', $row));
php regex word boundary matching in utf-8
Even in UTF-8 mode, the standard shorthands like \w and \b are not necessarily Unicode-aware; in PCRE that depends on whether Unicode character properties are also enabled for them. You can use the Unicode shorthands instead, as you worked out, and make the pattern a little less ugly by using lookarounds instead of alternations:
/(?<!\pL)weiß(?!\pL)/u
Notice also how I left the curly braces out of the Unicode class shorthands; you can do that when the class name consists of a single letter.
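A minimal sketch of that pattern in use, assuming the script file itself is saved as UTF-8. The sample sentences are made up for illustration.

```php
<?php
// Assumes this source file is UTF-8 encoded.
$pattern = '/(?<!\pL)weiß(?!\pL)/u';

// Standalone word, followed by a full stop.
$standalone = preg_match($pattern, 'Das Haus ist weiß.');

// Preceded by a letter ("Schweiß"): the lookbehind rejects it.
$letterBefore = preg_match($pattern, 'Der Schweiß');

// Followed by a letter ("weißt"): the lookahead rejects it.
$letterAfter = preg_match($pattern, 'Du weißt es');

var_dump($standalone, $letterBefore, $letterAfter); // int(1), int(0), int(0)
```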
word boundaries preg_replace
Try
/\B%[a-z0-9]+\b/
You don't have a word boundary \b between a space and the %, but you have one between s and %. \B is the opposite of \b: it matches wherever there is not a word boundary.
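As a rough illustration with preg_replace (the input string and the [X] replacement are assumptions, since the original question's data is not shown here): a placeholder after a space is replaced, while one glued to a preceding letter is left alone.

```php
<?php
// Hypothetical input: one %token after a space, one directly after the letter "s".
$input = 'items %foo and items%bar';

// \B requires that there is NO word boundary immediately before "%".
$result = preg_replace('/\B%[a-z0-9]+\b/', '[X]', $input);

echo $result, PHP_EOL; // items [X] and items%bar
```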
Using word boundary instead of spaces in regex
Here are the word boundaries:
 s h e   h a t e s   m y   g u t s
^     ^ ^         ^ ^   ^ ^       ^
So your pattern matches like this:
 s h e   h a t e s   m y   g u t s
^\_______________________________/^
|                |                |
\b               .+               \b
If you want to get rid of the first and last word, I'd just replace these with an empty string, using the following pattern:
^\W*?\w+\s*|\s*\w+\W*$
Both \W* are there to account for possible punctuation (e.g. she hates my guts.), but you can remove them if they're not needed.
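Both behaviours can be sketched in a few lines: \b.+\b spans the whole sentence, while the replacement pattern strips the first and last word. The variable names are only illustrative.

```php
<?php
$sentence = 'she hates my guts';

// \b.+\b is greedy, so it runs from the first boundary to the last one.
preg_match('/\b.+\b/', $sentence, $whole);

// Remove the first and the last word, with or without trailing punctuation.
$trim = '/^\W*?\w+\s*|\s*\w+\W*$/';
$inner = preg_replace($trim, '', $sentence);
$innerWithDot = preg_replace($trim, '', 'she hates my guts.');

echo $whole[0], PHP_EOL;     // she hates my guts
echo $inner, PHP_EOL;        // hates my
echo $innerWithDot, PHP_EOL; // hates my
```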