PHP Regex Word Boundary Matching in Utf-8

php regex word boundary matching in utf-8

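Even in UTF-8 mode, standard class shorthands like \w and \b are not necessarily Unicode-aware: PCRE only makes them so when its UCP option is set, which older PHP builds may not enable under /u. Using the Unicode property shorthands directly, as you worked out, avoids the problem, and you can make it a little less ugly by using lookarounds instead of alternations: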

/(?<!\pL)weiß(?!\pL)/u

Notice also how I left the curly braces out of the Unicode class shorthands; you can do that when the class name consists of a single letter.
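
As a quick check of the lookaround pattern above, here is a minimal sketch. The sample strings are made up for illustration, and the file is assumed to be saved as UTF-8:

```php
<?php
// Lookarounds act as a "word boundary" that treats any Unicode letter as a word character.
$pattern = '/(?<!\pL)weiß(?!\pL)/u';

$samples = [
    'Der Schnee ist weiß.', // standalone word
    'weißer Schnee',        // followed by a letter
    'Schneeweiß',           // preceded by a letter
];

$results = [];
foreach ($samples as $sample) {
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    $results[$sample] = preg_match($pattern, $sample) === 1;
}

var_dump($results);
```

Only the first sample should match, since the other two have a letter directly next to weiß.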

regex word boundary does not work in utf8 on some servers

One comment suggests that PCRE needs to be compiled with --enable-unicode-properties.
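
One way to find out whether a given server has that support is to try compiling a pattern that needs it. This is only a rough sketch: pcreSupportsUnicodeProperties() is a hypothetical helper name, and it relies on preg_match() returning false when a pattern cannot be compiled.

```php
<?php
// Hypothetical helper: returns true if \pL works under the /u modifier on this build.
function pcreSupportsUnicodeProperties(): bool
{
    // @ silences the warning emitted when the pattern fails to compile;
    // in that case preg_match() returns false rather than 1.
    return @preg_match('/^\pL$/u', 'ß') === 1;
}

// PCRE_VERSION is a built-in constant holding the PCRE library version string.
echo 'PCRE version: ', PCRE_VERSION, PHP_EOL;
echo 'Unicode properties: ', pcreSupportsUnicodeProperties() ? 'yes' : 'no', PHP_EOL;
```

Running this on both the working and the failing server should show whether the builds actually differ.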

Regexp and pspell_check with UTF-8 (Umlaute)

To match a chunk of Unicode letters, you can use

'/\p{L}+/u'

The \p{L} matches any Unicode letter, + matches one or more occurrences of the preceding subpattern, and the /u modifier treats the pattern and the subject string as UTF-8.

To only match whole words, use word boundaries:

'/\b\p{L}+\b/u'

If you have diacritics, also add \p{M}:

'/\b[\p{M}\p{L}]+\b/u'
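
A small sketch of the last pattern in use, assuming PHP 7+ (for the "\u{...}" escape) and a made-up sample string. The word "résumés" is written with combining accents on purpose, so that \p{M} has something to match:

```php
<?php
// "\u{0301}" is a combining acute accent, so "résumés" is stored in decomposed form here.
$text = "Straße re\u{0301}sume\u{0301}s 42 Größe";

// Collect every run of letters and combining marks that sits between word boundaries.
preg_match_all('/\b[\p{M}\p{L}]+\b/u', $text, $matches);
$words = $matches[0];

print_r($words);
```

The digits are skipped because neither \p{L} nor \p{M} matches them; add \p{N} to the class if numbers should count as part of a word.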

Regular expression - PCRE (PHP) - word boundary (\b) and accent characters

It will work when you add the u modifier to your regex:

/\b(cum)\b/iu

PHP word boundary /b regex not working with French

It seems like it's a nightmare to fix the ?? display in the French locale, but I was able to fix this problem another way, by modifying the regex pattern instead. Adding 'u' as a modifier to the pattern let it detect the French character ç in ça, and everything works properly with no need to change the locale.

From this:

$pattern = "/\b" . $value . "\b/";

to this:

$pattern = "/\b" . $value . "\b/u";
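
If $value comes from user input, it is probably worth escaping it as well. The following is a sketch only: the search term and text are invented, and it assumes a PHP build where /u makes \b Unicode-aware.

```php
<?php
$value = 'ça';              // hypothetical search term
$text  = 'Comment ça va ?'; // hypothetical subject string

// preg_quote() escapes regex metacharacters in $value;
// its second argument also escapes the "/" delimiter.
$pattern = '/\b' . preg_quote($value, '/') . '\b/u';

$found = preg_match($pattern, $text) === 1;

var_dump($found);
```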

Regexp word boundaries in non-ASCII situations

You could use zero-width lookahead/lookbehind assertions to assert that the characters to the left and right of what you're matching are non-letters.
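
That idea could be wrapped in a small function along these lines. containsWholeWord() is a hypothetical name, and "letter" is taken to mean \pL, which is an assumption about what should count as part of a word:

```php
<?php
// Hypothetical helper: true if $word occurs in $text with no Unicode letter on either side.
function containsWholeWord(string $word, string $text): bool
{
    // (?<!\pL) = not preceded by a letter, (?!\pL) = not followed by a letter.
    $pattern = '/(?<!\pL)' . preg_quote($word, '/') . '(?!\pL)/u';

    return preg_match($pattern, $text) === 1;
}

var_dump(containsWholeWord('été', "L'été arrive")); // apostrophe and space around the word
var_dump(containsWholeWord('été', 'répétées'));     // "été" appears only inside a longer word
```

Unlike \b, this does not depend on how the PCRE build classifies non-ASCII characters for \w, only on Unicode property support.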

Regex to match with comma and equal characters in utf8

My guess is that this expression might work:

(?!.*,$)([\p{L}\p{N}]+=[\p{L}\p{N}]+),?

The negative lookahead at the start,

(?!.*,$)

rejects any line that ends with a comma.


Test

$re = '/(?!.*,$)([\p{L}\p{N}]+=[\p{L}\p{N}]+),?/mu'; // u: treat pattern and subject as UTF-8
$str = 'a1=q1,a2=q2,a3=q3,a4=q4
a1=q1,a2=q2,a3=q3,a4=q4,';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

var_dump($matches);
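
If the goal is to get the pairs out as keys and values, one possible variation is to capture the two sides separately. This is a sketch rather than part of the original expression: the split into two capture groups, the u modifier, and the sample input are all assumptions.

```php
<?php
// Same idea as above, but with the key and the value in separate capture groups.
$re  = '/(?!.*,$)([\p{L}\p{N}]+)=([\p{L}\p{N}]+),?/mu';

// The first line is well formed; the second ends with a comma and should be skipped entirely.
$str = "größe=42,farbe=weiß\nbreite=10,höhe=20,";

preg_match_all($re, $str, $matches, PREG_SET_ORDER);

$pairs = [];
foreach ($matches as $m) {
    $pairs[$m[1]] = $m[2]; // $m[1] is the key, $m[2] is the value
}

print_r($pairs);
```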

match word that contains XYZ and does not start with colon

You can use a regex like this:

/\b((?<!:)\w*XYZ\w*)\b/ui

The \b before and after matches a word boundary.
In ((?<!:)\w*XYZ\w*), we match any word that contains XYZ, with zero or more word characters before it and zero or more after it. The negative lookbehind (?<!:) makes sure the word is not preceded by a :.
As mentioned by @unclexo in the comments, the u modifier at the end adds support for matching UTF-8 sequences.
The i flag makes the match case-insensitive.

Snippet:

<?php

$tests = [
    'This isXYZ a :exampleXYZa',
    'isXYZ a :exampleXYZa abcXYZ',
    'isXYZ a :exampleXYZXYZa  abcXYZ',
    'XYZ',
    'XYZjdhf',
    'This isXYZ a example:XYZa',
    'äöüéèXYZ :äöüéèXYZäöüéè',
];

foreach ($tests as $test) {
    // Print every word that contains XYZ and is not directly preceded by a colon.
    if (preg_match_all('/\b((?<!:)\w*XYZ\w*)\b/ui', $test, $matches)) {
        print_r($matches[0]);
    }
}

Demo: https://3v4l.org/Y8SMj
