php regex word boundary matching in utf-8
Even in UTF-8 mode, standard class shorthands like \w and \b may not be Unicode-aware, depending on the PHP version and on how PCRE was compiled. You can always fall back on the Unicode property shorthands, as you worked out, and you can make the pattern a little less ugly by using lookarounds instead of alternations:
/(?<!\pL)weiß(?!\pL)/u
Notice also how I left the curly braces out of the Unicode class shorthands; you can do that when the class name consists of a single letter.
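A minimal sketch of how that lookaround pattern could be wrapped for arbitrary words. The helper name containsWholeWord is hypothetical, not part of PHP; preg_quote() is used so that a word containing regex metacharacters is still matched literally.

```php
<?php
// Hypothetical helper: true if $word occurs in $text with no Unicode
// letter directly before or after it.
function containsWholeWord(string $word, string $text): bool
{
    $pattern = '/(?<!\pL)' . preg_quote($word, '/') . '(?!\pL)/u';
    return preg_match($pattern, $text) === 1;
}

var_dump(containsWholeWord('weiß', 'Der Schnee ist weiß.')); // bool(true)
var_dump(containsWholeWord('weiß', 'Schweiß'));              // bool(false)
var_dump(containsWholeWord('weiß', 'weißen'));               // bool(false)
```

Because the lookarounds only test \pL, a digit or underscore next to the word still counts as a boundary here; widen the class to [\pL\pN_] if that is not what you want.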
regex word boundary does not work in utf8 on some servers
A comment here seems to suggest that PCRE needs to be compiled with --enable-unicode-properties.
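One way to check this on a given server is to try compiling a pattern that needs Unicode properties. This is only a sketch: the function name is made up for illustration, and it relies on preg_match() returning false when a pattern cannot be compiled.

```php
<?php
// Hypothetical probe: does this PCRE build accept \pL together with the u modifier?
function pcreSupportsUnicodeProperties(): bool
{
    // @ hides the warning PHP raises when the pattern fails to compile.
    return @preg_match('/^\pL$/u', 'ß') === 1;
}

echo 'PCRE version: ', PCRE_VERSION, PHP_EOL;
var_dump(pcreSupportsUnicodeProperties());
```

Running this on both the working and the failing server should show quickly whether the PCRE build is the difference.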
Regexp and pspell_check with UTF-8 (Umlaute)
To match a chunk of Unicode letters, you can use
'/\p{L}+/u'
The \p{L} matches any Unicode letter, + matches one or more occurrences of the preceding subpattern, and the u modifier treats the pattern and the subject string as UTF-8.
To only match whole words, use word boundaries:
'/\b\p{L}+\b/u'
If the text contains combining diacritical marks (letters stored as a base letter plus a separate accent), also add \p{M}:
'/\b[\p{M}\p{L}]+\b/u'
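As a rough illustration, the pattern can be used with preg_match_all() to pull the words out of a string, for example before passing each one to a spell checker. The function name extractWords is an assumption for this sketch.

```php
<?php
// Hypothetical helper: returns every run of Unicode letters/marks in $text.
function extractWords(string $text): array
{
    preg_match_all('/\b[\p{M}\p{L}]+\b/u', $text, $matches);
    return $matches[0];
}

print_r(extractWords('Grüße aus Köln, 2024!'));
// Grüße, aus, Köln
```

Digits and punctuation are skipped because they are neither \p{L} nor \p{M}.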
Regular expression - PCRE (PHP) - word boundary (\b) and accent characters
It will work when you add the u modifier to your regex:
/\b(cum)\b/iu
PHP word boundary /b regex not working with French
It seems like it's a nightmare to fix the ?? display in the French locale, but I was able to fix this problem another way, by modifying the regex pattern instead. Adding 'u' as a modifier in the pattern let it detect the French character ç in ça, and everything works properly with no need to change the locale.
From this:
$pattern = "/\b" . $value . "\b/";
to this:
$pattern = "/\b" . $value . "\b/u";
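A small sketch of the fixed pattern in use. The sample $value and $text are made up for illustration, and preg_quote() is an addition not in the original answer; it keeps any regex metacharacters in $value from being interpreted.

```php
<?php
$value = 'ça';
$text  = 'Comment ça va ?';

// Build the pattern with the u modifier so \b treats ç as a letter.
$pattern = '/\b' . preg_quote($value, '/') . '\b/u';

$found = preg_match($pattern, $text);
var_dump($found); // int(1) on a PCRE build with Unicode property support
```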
Regexp word boundaries in non-ASCII situations
You could use zero-width lookahead/lookbehind assertions to assert that the characters to the left and right of what you're matching are non-letters.
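For instance, the same idea works for replacing as well as matching. The following is a sketch under the assumption that "letter" means \pL; the function name markWholeWord is hypothetical.

```php
<?php
// Hypothetical helper: wraps whole-word occurrences of $word in brackets.
function markWholeWord(string $word, string $text): string
{
    // No letter may sit directly before or after the match.
    $pattern = '/(?<!\pL)' . preg_quote($word, '/') . '(?!\pL)/u';
    return preg_replace($pattern, '[$0]', $text) ?? $text;
}

echo markWholeWord('été', "L'été a été répété"), PHP_EOL;
// L'[été] a [été] répété
```

The "été" inside "répété" is left alone because it is preceded by the letter p.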
Regex to match with comma and equal characters in utf8
My guess is that this expression might work:
(?!.*,$)([\p{L}\p{N}]+=[\p{L}\p{N}]+),?
Here,
(?!.*,$)
is a negative lookahead that rejects any line ending with a comma.
Test
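// m makes $ match at each line end; u treats pattern and subject as UTF-8.
$re = '/(?!.*,$)([\p{L}\p{N}]+=[\p{L}\p{N}]+),?/mu';
// The first line has no trailing comma, so its pairs match; the second line does, so it yields nothing.
$str = 'a1=q1,a2=q2,a3=q3,a4=q4
a1=q1,a2=q2,a3=q3,a4=q4,';
preg_match_all($re, $str, $matches, PREG_SET_ORDER);
var_dump($matches);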
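If the goal is to get the pairs as keys and values, the same expression can be split into two capture groups. This is a sketch, not part of the original answer; parsePairs is a hypothetical name, and it assumes one line of input at a time.

```php
<?php
// Hypothetical helper: turns "k1=v1,k2=v2" into ['k1' => 'v1', 'k2' => 'v2'].
// A line that ends with a comma is rejected and yields an empty array.
function parsePairs(string $line): array
{
    $re = '/(?!.*,$)([\p{L}\p{N}]+)=([\p{L}\p{N}]+),?/u';
    preg_match_all($re, $line, $matches, PREG_SET_ORDER);

    $pairs = [];
    foreach ($matches as $m) {
        $pairs[$m[1]] = $m[2]; // group 1 is the key, group 2 the value
    }
    return $pairs;
}

print_r(parsePairs('ä1=ö1,b=2'));
print_r(parsePairs('ä1=ö1,b=2,'));
```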
match word that contains XYZ and does not start with colon
You can have a regex like below:
/\b((?<!:)\w*XYZ\w*)\b/ui
The \b before and after just matches a word boundary. In ((?<!:)\w*XYZ\w*), we look for any word that has XYZ in it, with zero or more word characters before it and zero or more after it. With the help of the negative lookbehind (?<!:), we make sure that the word is not preceded by a colon. As mentioned by @unclexo in the comments, you can add the u modifier at the end to support UTF-8 sequence matching. You can also add the i flag for case-insensitive matching.
Snippet:
<?php
$tests = [
    'This isXYZ a :exampleXYZa',
    'isXYZ a :exampleXYZa abcXYZ',
    'isXYZ a :exampleXYZXYZa abcXYZ',
    'XYZ',
    'XYZjdhf',
    'This isXYZ a example:XYZa',
    'äöüéèXYZ :äöüéèXYZäöüéè'
];
foreach ($tests as $test) {
    if (preg_match_all('/\b((?<!:)\w*XYZ\w*)\b/ui', $test, $matches)) {
        print_r($matches[0]);
    }
}
Demo: https://3v4l.org/Y8SMj
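The same pattern can be made reusable for any search term. A minimal sketch follows; findWordsContaining is a hypothetical name, and it assumes a PCRE build where \w and \b are Unicode-aware under the u modifier.

```php
<?php
// Hypothetical helper: all words that contain $needle and do not start
// right after a colon.
function findWordsContaining(string $needle, string $text): array
{
    $quoted = preg_quote($needle, '/');
    preg_match_all('/\b((?<!:)\w*' . $quoted . '\w*)\b/ui', $text, $matches);
    return $matches[0];
}

print_r(findWordsContaining('XYZ', 'This isXYZ a :exampleXYZa'));
print_r(findWordsContaining('XYZ', 'äöüéèXYZ :äöüéèXYZäöüéè'));
```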