preg_match and UTF-8 in PHP
Looks like this is a "feature", see
http://bugs.php.net/bug.php?id=37391
'u' switch only makes sense for pcre, PHP itself is unaware of it.
From PHP's point of view, strings are byte sequences and returning byte offset seems logical (i don't say "correct").
A quick way to preg_match a utf-8 string
Try adding a UTF8 sequence to the beginning of the pattern:
$input = "žąsis su šešiolika žąsyčių";
preg_match_all("/(*UTF8)(žąs\S*)/iu", $input, $output_array);
print_r($output_array);
EDIT:
I tested this on PHP 5.2.17 and 5.3.20... I don't seem to have any problems while using 5.3.20 but I do get the same empty output while using 5.2.17. While I couldn't find any documentation that addressed why this happens, the problem seems to go away when removing the first \b
(word boundary). Here's a screenshot with the output, PHP version, loaded extensions, and source code (if this doesn't help, make sure you're saving your documents in UTF8 instead of whatever Windows likes to save them as):
preg_match rule for utf-8
This will give "0", cos مسیح ارسطوئی
is not containing only 3-10 chars;
$x = preg_match('~^([\pL]{3,10})$~u', 'مسیح ارسطوئی');
echo $x ? 1 : 0;
But this gives a result in your case;
preg_match('~([\pL]+)~u', 'مسیح ارسطوئی', $m);
print_r($m);
Array
(
[0] => مسیح
[1] => مسیح
)
See more details here: PHP: Unicode character properties
preg_match doesn't work in .php file with UTF-8 encoding
You need to tell it the search subject is UTF8.
function checkEmail($email) {
if (!preg_match("/^( [a-zA-Z0-9] )+( [a-zA-Z0-9\._-] )*@( [a-zA-Z0-9_-] )+( [a-zA-Z0-9\._-] +)+$/u" , $email)) {
return false;
}
return true;
}
In that regex I added the u
at the end which makes the pattern and the string utf8.
Pattern and subject strings are treated as UTF-8.
-Full documentation (and other modifiers)
UTF-8 characters in preg_match_all (PHP)
PHP doesn't support unicode very well, so a lot of string functions, including preg_*, still count bytes instead of characters.
I tried finding a solution by encoding and decoding strings, but ultimately it all came down to the preg_match_all function.
About the python thing: a python regex matchobject contains the match position by default mo.start() and mo.end(). See: http://docs.python.org/library/re.html#finding-all-adverbs-and-their-positions
Matching UTF Characters with preg_match in PHP: (*UTF8) Works on Windows but not Linux
Try it by describing the characters by its Unicode character properties:
preg_match('/^\p{L}[\p{L} _.-]+$/u', $username)
UTF Regex with preg_match in PHP
The problem occurs because your csv file starts with a UTF-8 BOM. If you remove this, the regex works perfectly. I have confirmed it with this code:
<html>
<head>
<meta charset="utf-8" />
</head>
<body>
<?php
function remove_utf8_bom($text)
{
$bom = pack('H*','EFBBBF');
$text = preg_replace("/^$bom/", '', $text);
return $text;
}
$csvContents = remove_utf8_bom(file_get_contents('udfser_new.csv'));
$lines = str_getcsv($csvContents, "\n"); //parse the rows
foreach ($lines as &$row) {
$row = str_getcsv($row, ";");
$firstName = $row[0];
$lastName = $row[1];
echo 'First name: ' . $firstName . ' - Matches regex: ' . (preg_match("/^[\p{L}]+$/u", $firstName) ? 'yes' : 'no') . '<br>';
echo 'Last name: ' . $lastName . ' - Matches regex: ' . (preg_match("/^[\p{L}]+$/u", $lastName) ? 'yes' : 'no') . '<br>';
}
?>
</body>
</html>
The regex match the text successfully, and the ü in Glückmann is shown correctly on the page.
preg_match does not find a UTF-8 character at the beginning of a binary string which contain non-UTF8 characters
I think after a long search I found an answer myself.
The modifier u works only if the entire string is a valid UTF-8 string.
Even if only the first character is to be found, the entire string is checked first.
The modifier u can not be used for this problem. However, regular expressions can be used.
function utf8Char($string){
$ok = preg_match(
'/^[\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF]
|^[\xE0-\xEF][\x80-\xBF][\x80-\xBF]
|^[\xC0-\xDF][\x80-\xBF]
|^[\x00-\x7f]/sx',
$string,
$match);
return $ok ? $match[0] : false;
}
var_dump(utf8char("€a\xc3def")); //string(3) "€"
var_dump(utf8char("a\xc3def")); //string(1) "a"
var_dump(utf8char("\xc3def")); //bool(false)
The non-UTF8-bytes can be retrieved using the substr function.
var_dump(substr("\xc3def",0,1)); //string(1) "�"
UTF-8 validation in PHP without using preg_match()
You can always using the Multibyte String Functions:
If you want to use it a lot and possibly change it at sometime:
1) First set the encoding you want to use in your config file
/* Set internal character encoding to UTF-8 */
mb_internal_encoding("UTF-8");
2) Check the String
if(mb_check_encoding($string))
{
// do something
}
Or, if you don't plan on changing it, you can always just put the encoding straight into the function:
if(mb_check_encoding($string, 'UTF-8'))
{
// do something
}
preg_match_all return proper offset with utf-8 in PHP
Solved it myself, used a roundabout method, but it works, the key is this regex:
/[一-龠]|[ぁ-ゔ]|[ァ-ヴー]|[a-zA-Z0-9]|[a-zA-Z0-9][々〆〤]/u
I used that to preg_replace any character with a single digit number and then found offsets in the new string.
Related Topics
How to Echo or Print an Array in PHP
Get Table Column Names in MySQL
MySQLi_Real_Escape_String() Expects Exactly 2 Parameters, 1 Given
PHP: Is MySQL_Real_Escape_String Sufficient For Cleaning User Input
"The Page Has Expired Due to Inactivity" - Laravel 5.5
How to Do This in Laravel, Subquery Where In
How Safe Are PHP Session Variables
How to Check Whether Mod_Rewrite Is Enable on Server
Example of How to Use Bind_Result VS Get_Result
What Kind of String Is This? How to Unserialize This String
Convert Flat Array to a Delimited String to Be Saved in the Database
PHP.Ini & Smtp= - How to Pass Username & Password