Preg_Match and Utf-8 in PHP

preg_match and UTF-8 in PHP

Looks like this is a "feature", see
http://bugs.php.net/bug.php?id=37391

'u' switch only makes sense for pcre, PHP itself is unaware of it.

From PHP's point of view, strings are byte sequences and returning byte offset seems logical (i don't say "correct").

A quick way to preg_match a utf-8 string

Try adding a UTF8 sequence to the beginning of the pattern:

$input = "žąsis su šešiolika žąsyčių";
preg_match_all("/(*UTF8)(žąs\S*)/iu", $input, $output_array);
print_r($output_array);

EDIT:

I tested this on PHP 5.2.17 and 5.3.20... I don't seem to have any problems while using 5.3.20 but I do get the same empty output while using 5.2.17. While I couldn't find any documentation that addressed why this happens, the problem seems to go away when removing the first \b (word boundary). Here's a screenshot with the output, PHP version, loaded extensions, and source code (if this doesn't help, make sure you're saving your documents in UTF8 instead of whatever Windows likes to save them as):

Sample Image

preg_match rule for utf-8

This will give "0", cos مسیح ارسطوئی is not containing only 3-10 chars;

$x = preg_match('~^([\pL]{3,10})$~u', 'مسیح ارسطوئی');
echo $x ? 1 : 0;

But this gives a result in your case;

preg_match('~([\pL]+)~u', 'مسیح ارسطوئی', $m);
print_r($m);

Array
(
    [0] => مسیح
    [1] => مسیح
)

See more details here: PHP: Unicode character properties

preg_match doesn't work in .php file with UTF-8 encoding

You need to tell it the search subject is UTF8.

function  checkEmail($email) {
    if (!preg_match("/^( [a-zA-Z0-9] )+( [a-zA-Z0-9\._-] )*@( [a-zA-Z0-9_-] )+( [a-zA-Z0-9\._-] +)+$/u" , $email)) {
          return false;
    }
          return true;
}

In that regex I added the u at the end which makes the pattern and the string utf8.

Pattern and subject strings are treated as UTF-8.

-Full documentation (and other modifiers)

UTF-8 characters in preg_match_all (PHP)

PHP doesn't support unicode very well, so a lot of string functions, including preg_*, still count bytes instead of characters.

I tried finding a solution by encoding and decoding strings, but ultimately it all came down to the preg_match_all function.

About the python thing: a python regex matchobject contains the match position by default mo.start() and mo.end(). See: http://docs.python.org/library/re.html#finding-all-adverbs-and-their-positions

Matching UTF Characters with preg_match in PHP: (*UTF8) Works on Windows but not Linux

Try it by describing the characters by its Unicode character properties:

preg_match('/^\p{L}[\p{L} _.-]+$/u', $username)

UTF Regex with preg_match in PHP

The problem occurs because your csv file starts with a UTF-8 BOM. If you remove this, the regex works perfectly. I have confirmed it with this code:

<html>
<head>
<meta charset="utf-8" /> 
</head>
<body>
<?php
function remove_utf8_bom($text)
{
    $bom = pack('H*','EFBBBF');
    $text = preg_replace("/^$bom/", '', $text);
    return $text;
}

$csvContents = remove_utf8_bom(file_get_contents('udfser_new.csv'));
$lines = str_getcsv($csvContents, "\n"); //parse the rows

foreach ($lines as &$row) {
    $row = str_getcsv($row, ";");

    $firstName = $row[0];
    $lastName = $row[1];
    echo 'First name: ' . $firstName . ' - Matches regex: ' . (preg_match("/^[\p{L}]+$/u", $firstName) ? 'yes' : 'no') . '<br>';
    echo 'Last name: ' . $lastName . ' - Matches regex: ' . (preg_match("/^[\p{L}]+$/u", $lastName) ? 'yes' : 'no') . '<br>';
}
?>
</body>
</html>

The regex match the text successfully, and the ü in Glückmann is shown correctly on the page.

Result

preg_match does not find a UTF-8 character at the beginning of a binary string which contain non-UTF8 characters

I think after a long search I found an answer myself.

The modifier u works only if the entire string is a valid UTF-8 string.
Even if only the first character is to be found, the entire string is checked first.
The modifier u can not be used for this problem. However, regular expressions can be used.

function utf8Char($string){
    $ok = preg_match(
      '/^[\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF]
      |^[\xE0-\xEF][\x80-\xBF][\x80-\xBF]
      |^[\xC0-\xDF][\x80-\xBF]
      |^[\x00-\x7f]/sx',
      $string,
      $match);
    return $ok ? $match[0] : false;      
}


var_dump(utf8char("€a\xc3def"));  //string(3) "€"
var_dump(utf8char("a\xc3def"));  //string(1) "a"
var_dump(utf8char("\xc3def"));  //bool(false)

The non-UTF8-bytes can be retrieved using the substr function.

var_dump(substr("\xc3def",0,1)); //string(1) "�"

UTF-8 validation in PHP without using preg_match()

You can always using the Multibyte String Functions:

If you want to use it a lot and possibly change it at sometime:

1) First set the encoding you want to use in your config file

/* Set internal character encoding to UTF-8 */
mb_internal_encoding("UTF-8");

2) Check the String

if(mb_check_encoding($string))
{
    // do something
}

Or, if you don't plan on changing it, you can always just put the encoding straight into the function:

if(mb_check_encoding($string, 'UTF-8'))
{
    // do something
}

preg_match_all return proper offset with utf-8 in PHP

Solved it myself, used a roundabout method, but it works, the key is this regex:

/[一-龠]|[ぁ-ゔ]|[ァ-ヴー]|[a-zA-Z0-9]|[ａ-ｚＡ-Ｚ０-９][々〆〤]/u

I used that to preg_replace any character with a single digit number and then found offsets in the new string.

Preg_Match and Utf-8 in PHP