Utf-8 Characters in Preg_Match_All (Php)

UTF-8 characters in preg_match_all (PHP)

PHP doesn't support unicode very well, so a lot of string functions, including preg_*, still count bytes instead of characters.

I tried finding a solution by encoding and decoding strings, but ultimately it all came down to the preg_match_all function.

About the python thing: a python regex matchobject contains the match position by default mo.start() and mo.end(). See: http://docs.python.org/library/re.html#finding-all-adverbs-and-their-positions

preg_match and UTF-8 in PHP

Looks like this is a "feature", see
http://bugs.php.net/bug.php?id=37391

'u' switch only makes sense for pcre, PHP itself is unaware of it.

From PHP's point of view, strings are byte sequences and returning byte offset seems logical (i don't say "correct").

preg_match_all return proper offset with utf-8 in PHP

Solved it myself, used a roundabout method, but it works, the key is this regex:

/[一-龠]|[ぁ-ゔ]|[ァ-ヴー]|[a-zA-Z0-9]|[a-zA-Z0-9][々〆〤]/u

I used that to preg_replace any character with a single digit number and then found offsets in the new string.

preg_match rule for utf-8

This will give "0", cos مسیح ارسطوئی is not containing only 3-10 chars;

$x = preg_match('~^([\pL]{3,10})$~u', 'مسیح ارسطوئی');
echo $x ? 1 : 0;

But this gives a result in your case;

preg_match('~([\pL]+)~u', 'مسیح ارسطوئی', $m);
print_r($m);

Array
(
[0] => مسیح
[1] => مسیح
)

See more details here: PHP: Unicode character properties

A quick way to preg_match a utf-8 string

Try adding a UTF8 sequence to the beginning of the pattern:

$input = "žąsis su šešiolika žąsyčių";
preg_match_all("/(*UTF8)(žąs\S*)/iu", $input, $output_array);
print_r($output_array);

EDIT:

I tested this on PHP 5.2.17 and 5.3.20... I don't seem to have any problems while using 5.3.20 but I do get the same empty output while using 5.2.17. While I couldn't find any documentation that addressed why this happens, the problem seems to go away when removing the first \b (word boundary). Here's a screenshot with the output, PHP version, loaded extensions, and source code (if this doesn't help, make sure you're saving your documents in UTF8 instead of whatever Windows likes to save them as):

Sample Image

Matching UTF Characters with preg_match in PHP: (*UTF8) Works on Windows but not Linux

Try it by describing the characters by its Unicode character properties:

preg_match('/^\p{L}[\p{L} _.-]+$/u', $username)

preg_match does not find a UTF-8 character at the beginning of a binary string which contain non-UTF8 characters

I think after a long search I found an answer myself.

The modifier u works only if the entire string is a valid UTF-8 string.
Even if only the first character is to be found, the entire string is checked first.
The modifier u can not be used for this problem. However, regular expressions can be used.

function utf8Char($string){
$ok = preg_match(
'/^[\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF]
|^[\xE0-\xEF][\x80-\xBF][\x80-\xBF]
|^[\xC0-\xDF][\x80-\xBF]
|^[\x00-\x7f]/sx',
$string,
$match);
return $ok ? $match[0] : false;
}

var_dump(utf8char("€a\xc3def")); //string(3) "€"
var_dump(utf8char("a\xc3def")); //string(1) "a"
var_dump(utf8char("\xc3def")); //bool(false)

The non-UTF8-bytes can be retrieved using the substr function.

var_dump(substr("\xc3def",0,1)); //string(1) "�"

PHP Regex - To accept all UTF-8 characters, No trailing spaces, Excluding a symbol, with a range between 2-16 length

You can use this lookahead based regex to satisfy all the conditions:

/^(?=.{2,16}$)[^@\s]+(?:\h[^@\s]+)*$/gum

RegEx Demo

preg_match_all() & UTF8 characters issue - the easiest way to go around

Try...

preg_match_all('#[a-z\x{0105}\x{015B}\x{0107}\x{0142}\x{00F3}\x{017C}\x{017A}\x{0144}]{3,}#uis', $text, $matches);

understanding preg_match_all use and regex '/./u' pattern

This is just exploding the string into one-character array.

You get 2 arrays of characters, and then combine them into a key=>value pairs array.

Which in turn is used for strtr character replacement -> strange UTF8 characters are replaced with ASCII ones.

Why we explode it with preg_match_all()? Why using regular expressions at all?

I guess, because of the /u key, which makes it work with UTF8 characters. If using normal PHP string functions like str_split(), it would explode them in bytes not characters, and it will be a mess, because of multi-byte structure of UTF8. Like, the letter Å takes 2 bytes in UTF8 string.

Basically, what you get is:

$mapping = ['Ę' => 'E',  'Ó' => 'Q', 'Ą' => 'A', ... 'ń' => 'n'];

You could also use multi-byte strings library functions, like this:

str_replace(mb_str_split($from), mb_str_split($to), $str);


Related Topics



Leave a reply



Submit