UTF-8 characters in preg_match_all (PHP)
PHP doesn't support unicode very well, so a lot of string functions, including preg_*, still count bytes instead of characters.
I tried finding a solution by encoding and decoding strings, but ultimately it all came down to the preg_match_all function.
About the python thing: a python regex matchobject contains the match position by default mo.start() and mo.end(). See: http://docs.python.org/library/re.html#finding-all-adverbs-and-their-positions
preg_match and UTF-8 in PHP
Looks like this is a "feature", see
http://bugs.php.net/bug.php?id=37391
'u' switch only makes sense for pcre, PHP itself is unaware of it.
From PHP's point of view, strings are byte sequences and returning byte offset seems logical (i don't say "correct").
preg_match_all return proper offset with utf-8 in PHP
Solved it myself, used a roundabout method, but it works, the key is this regex:
/[一-龠]|[ぁ-ゔ]|[ァ-ヴー]|[a-zA-Z0-9]|[a-zA-Z0-9][々〆〤]/u
I used that to preg_replace any character with a single digit number and then found offsets in the new string.
preg_match rule for utf-8
This will give "0", cos مسیح ارسطوئی
is not containing only 3-10 chars;
$x = preg_match('~^([\pL]{3,10})$~u', 'مسیح ارسطوئی');
echo $x ? 1 : 0;
But this gives a result in your case;
preg_match('~([\pL]+)~u', 'مسیح ارسطوئی', $m);
print_r($m);
Array
(
[0] => مسیح
[1] => مسیح
)
See more details here: PHP: Unicode character properties
A quick way to preg_match a utf-8 string
Try adding a UTF8 sequence to the beginning of the pattern:
$input = "žąsis su šešiolika žąsyčių";
preg_match_all("/(*UTF8)(žąs\S*)/iu", $input, $output_array);
print_r($output_array);
EDIT:
I tested this on PHP 5.2.17 and 5.3.20... I don't seem to have any problems while using 5.3.20 but I do get the same empty output while using 5.2.17. While I couldn't find any documentation that addressed why this happens, the problem seems to go away when removing the first \b
(word boundary). Here's a screenshot with the output, PHP version, loaded extensions, and source code (if this doesn't help, make sure you're saving your documents in UTF8 instead of whatever Windows likes to save them as):
Matching UTF Characters with preg_match in PHP: (*UTF8) Works on Windows but not Linux
Try it by describing the characters by its Unicode character properties:
preg_match('/^\p{L}[\p{L} _.-]+$/u', $username)
preg_match does not find a UTF-8 character at the beginning of a binary string which contain non-UTF8 characters
I think after a long search I found an answer myself.
The modifier u works only if the entire string is a valid UTF-8 string.
Even if only the first character is to be found, the entire string is checked first.
The modifier u can not be used for this problem. However, regular expressions can be used.
function utf8Char($string){
$ok = preg_match(
'/^[\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF]
|^[\xE0-\xEF][\x80-\xBF][\x80-\xBF]
|^[\xC0-\xDF][\x80-\xBF]
|^[\x00-\x7f]/sx',
$string,
$match);
return $ok ? $match[0] : false;
}
var_dump(utf8char("€a\xc3def")); //string(3) "€"
var_dump(utf8char("a\xc3def")); //string(1) "a"
var_dump(utf8char("\xc3def")); //bool(false)
The non-UTF8-bytes can be retrieved using the substr function.
var_dump(substr("\xc3def",0,1)); //string(1) "�"
PHP Regex - To accept all UTF-8 characters, No trailing spaces, Excluding a symbol, with a range between 2-16 length
You can use this lookahead based regex to satisfy all the conditions:
/^(?=.{2,16}$)[^@\s]+(?:\h[^@\s]+)*$/gum
RegEx Demo
preg_match_all() & UTF8 characters issue - the easiest way to go around
Try...
preg_match_all('#[a-z\x{0105}\x{015B}\x{0107}\x{0142}\x{00F3}\x{017C}\x{017A}\x{0144}]{3,}#uis', $text, $matches);
understanding preg_match_all use and regex '/./u' pattern
This is just exploding the string into one-character array.
You get 2 arrays of characters, and then combine them into a key=>value
pairs array.
Which in turn is used for strtr
character replacement -> strange UTF8 characters are replaced with ASCII ones.
Why we explode it with preg_match_all()
? Why using regular expressions at all?
I guess, because of the /u
key, which makes it work with UTF8 characters. If using normal PHP string functions like str_split()
, it would explode them in bytes
not characters
, and it will be a mess, because of multi-byte structure of UTF8. Like, the letter Å
takes 2 bytes in UTF8 string.
Basically, what you get is:
$mapping = ['Ę' => 'E', 'Ó' => 'Q', 'Ą' => 'A', ... 'ń' => 'n'];
You could also use multi-byte strings library functions, like this:
str_replace(mb_str_split($from), mb_str_split($to), $str);
Related Topics
How to Detect If a User Uploaded a File Larger Than Post_Max_Size
Does PHP Feature Short Hand Syntax for Objects
Can Not Increase File Upload Size Wamp
How to Store Variable Values Over Multiple Page Loads
Share Session Between Two Websites
Upload Images Through PHP Using Unique File Names
File_Get_Contents Is Not Working for Some Url
Converting Dates with PHP for Datetime in SQL
Php's Simplexml: How to Use Colons in Names
PHP Flush Stopped Flushing in Iis7.5
PHP Replace All Spaces with Hyphens
Regular Expression to Remove CSS Comments
How to Start Local Server with Symfony 5 or Downgrade Version to 4.4
Difference(When Being Applied to My Code) Between Int(10) and Int(12)