UTF-8 in PHP Regular Expressions

PHP Regex - To accept all UTF-8 characters, No trailing spaces, Excluding a symbol, with a range between 2-16 length

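You can use this lookahead-based regex to satisfy all the conditions (the g and m flags used for online testing are dropped here: PHP's preg_* functions reject g, and m is only needed for multi-line test input):

/^(?=.{2,16}$)[^@\s]+(?:\h[^@\s]+)*$/u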
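
As a rough sketch of how this pattern might be called from PHP, the helper below wraps it in preg_match. The function name isValidName is hypothetical, PHP 7 or later is assumed, and the pattern is used with the u modifier only.

```php
<?php
// Hypothetical helper: 2-16 characters, no "@", no leading, trailing or
// doubled whitespace. With /u the dot counts characters, not bytes.
function isValidName(string $s): bool
{
    return preg_match('/^(?=.{2,16}$)[^@\s]+(?:\h[^@\s]+)*$/u', $s) === 1;
}

var_dump(isValidName('Jörg Müller'));        // bool(true)
var_dump(isValidName('a'));                  // bool(false): too short
var_dump(isValidName('joe@example'));        // bool(false): contains "@"
var_dump(isValidName('trailing '));          // bool(false): trailing space
var_dump(isValidName(str_repeat('é', 16)));  // bool(true): 16 characters, 32 bytes
var_dump(isValidName(str_repeat('é', 17)));  // bool(false): too long
```

The last two calls show why the u modifier matters here: without it the length check would count bytes, so sixteen two-byte characters would be rejected.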


When do I need u-modifier in PHP regex?

There is no problem with the first expression. The characters being quantified are explicitly single-byte, and cannot occur in a UTF-8 multibyte sequence.

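The second expression may produce more replacements than you expect, because without /u every byte of a multibyte character is replaced separately; for example:

// Sample subject (assumed): "😀", a single character encoded as four bytes
echo preg_replace('/[^a-zA-Z0-9]/', "0", "😀");
// => 0000 (one replacement per byte)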

The third expression also does not pose a problem, as the repeated character is limited by parentheses (which is ASCII-safe).

This is more dangerous:

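// Sample subject (assumed): "😀", a single character encoded as four bytes
echo preg_replace('/^(.)/', "0", "😀");
// => 0??? ("0" followed by three orphaned bytes that are no longer valid UTF-8)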

Typically, without knowing more about how UTF-8 works, it may be tricky to predict which regexps are safe, and which are not, so using /u for all text that might contain a character above U+007F is the best practice.
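
To make the byte-versus-character difference concrete, here is a minimal sketch that counts what the dot matches with and without /u. The subject is written with byte escapes so the example does not depend on the encoding of the source file; the variable names are illustrative only.

```php
<?php
$s = "\xC3\xA9";  // "é": one character, two bytes in UTF-8

// Without /u the dot matches one byte at a time.
$bytes = preg_match_all('/./', $s);
// With /u the dot matches one whole character.
$chars = preg_match_all('/./u', $s);

printf("without /u: %d, with /u: %d\n", $bytes, $chars);  // without /u: 2, with /u: 1

// Replacing the first "character" without /u splits the two-byte sequence.
$broken = preg_replace('/^(.)/', '0', $s);
echo bin2hex($broken), "\n";  // 30a9: "0" plus an orphaned continuation byte
```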

Non-ASCII characters in UTF-8 mode regular expression

Because the documentation is broken. And it's not the only place where it is so, unfortunately.

PHP uses PCRE under the hood to implement its preg_* functions. PCRE's documentation is thus authoritative there. PHP's documentation is based on PCRE's, but it looks like you found yet another mistake.

Here's what you can read in PCRE's docs (emphasis mine):

By default, characters with values greater than 128 do not match any of the POSIX character classes. However, if the PCRE_UCP option is passed to pcre_compile(), some of the classes are changed so that Unicode character properties are used. This is achieved by replacing certain POSIX classes by other sequences, as follows:

[:alnum:] becomes \p{Xan}
[:alpha:] becomes \p{L}
[:blank:] becomes \h
[:digit:] becomes \p{Nd}
[:lower:] becomes \p{Ll}
[:space:] becomes \p{Xps}
[:upper:] becomes \p{Lu}
[:word:] becomes \p{Xwd}

If you dig further in PHP's docs, you'll find the following:

u (PCRE_UTF8)

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.

This is, unfortunately, a lie. The u modifier in PHP means PCRE_UTF8 | PCRE_UCP (UCP stands for Unicode Character Properties). The PCRE_UCP flag is the one that changes the meaning of \d, \w and the like, as you can see from the docs above. Your tests confirm that.
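
A quick way to check this yourself is sketched below. It assumes PHP 7 or later (for the "\u{...}" string escapes), the default C locale, and a PCRE build with Unicode property support, which is the usual case; the array keys are just labels.

```php
<?php
$word  = "\u{E9}t\u{E9}";  // "été"
$digit = "\u{0663}";       // ARABIC-INDIC DIGIT THREE

$results = [
    // Without /u, \w only knows ASCII word characters.
    'w_plain'   => preg_match('/^\w+$/', $word),            // 0
    // With /u, \w, \d and POSIX classes use Unicode properties.
    'w_u'       => preg_match('/^\w+$/u', $word),           // 1
    'd_u'       => preg_match('/^\d$/u', $digit),           // 1
    'alpha_u'   => preg_match('/^[[:alpha:]]+$/u', $word),  // 1
    // An explicit ASCII range is unaffected by /u.
    'ascii_d_u' => preg_match('/^[0-9]$/u', $digit),        // 0
];
print_r($results);
```

If only ASCII digits should be accepted in a /u pattern, spelling out [0-9] instead of \d avoids the surprise.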


As a side note, don't infer properties of one regex flavor from another. It doesn't always work (heh, even this chart forgot about the PCRE_UCP option).

Regular expression for UTF-8 words

Add \p{L}\p{M} for the Unicode categories Letter and Mark (combining diacritical marks). Zero-width marks such as accents should not be forgotten, because é can be written as a single letter, but also as the letter e followed by a combining acute accent. And some alphabets have more than one accent per letter.

As commented by @MeriaonosNikos, do not forget the Unicode modifier /u at the end of the regex.
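
The effect of leaving out \p{M} can be seen in a small sketch like the following, which assumes PHP 7 or later for the "\u{...}" escape. The sample text and variable names are made up for illustration.

```php
<?php
// "café" written with a decomposed é: "e" followed by U+0301 COMBINING ACUTE ACCENT.
$text = "cafe\u{0301} noir";

preg_match_all('/\p{L}+/u', $text, $lettersOnly);
preg_match_all('/[\p{L}\p{M}]+/u', $text, $lettersAndMarks);

// Letters only: the combining accent is cut off the first word.
echo implode('|', $lettersOnly[0]), "\n";    // cafe|noir
// Letters plus marks: the accent (bytes cc 81) stays attached to the word.
echo bin2hex($lettersAndMarks[0][0]), "\n";  // 63616665cc81
```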

Regexp and pspell_check with UTF-8 (Umlaute)

To match a chunk of Unicode letters, you can use

'/\p{L}+/u'

The \p{L} matches any Unicode letter, + matches one or more occurrences of the preceding subpattern, and the /u modifier treats the pattern and string as Unicode strings.

To only match whole words, use word boundaries:

'/\b\p{L}+\b/u'

If you have diacritics, also add \p{M}:

'/\b[\p{M}\p{L}]+\b/u'
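
As an illustration of what the word boundaries change, the sketch below runs the plain and the \b-delimited pattern over a made-up German sentence. It assumes the source file is saved as UTF-8.

```php
<?php
$text = 'Die Mädchen hören Glück2 zu';

// Runs of letters anywhere, including the letter part of "Glück2".
preg_match_all('/\p{L}+/u', $text, $runs);
// Whole words only: "Glück2" is skipped because the digit is also a word character.
preg_match_all('/\b\p{L}+\b/u', $text, $words);

echo implode(',', $runs[0]), "\n";   // Die,Mädchen,hören,Glück,zu
echo implode(',', $words[0]), "\n";  // Die,Mädchen,hören,zu
```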

PHP - regex to allow unicode characters

From http://php.net/manual/en/reference.pcre.pattern.modifiers.php

u (PCRE_UTF8) This modifier turns on additional functionality of PCRE
that is incompatible with Perl. Pattern and subject strings are
treated as UTF-8. An invalid subject will cause the preg_* function to
match nothing; an invalid pattern will trigger an error of level
E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid
since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been
regarded as valid UTF-8.

That means that first you have to make sure the input string is proper UTF-8 text.
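
A minimal sketch of such a check, using only preg_match and preg_last_error: the helper name isUtf8 is hypothetical, and the byte strings are sample data.

```php
<?php
$valid   = "gr\xC3\xBC\xC3\x9F";  // "grüß" encoded as UTF-8
$invalid = "gr\xFC\xDF";          // the same text in ISO-8859-1: not valid UTF-8

$ok = preg_match('/\p{L}+/u', $valid);
var_dump($ok);                            // int(1)

$result = preg_match('/\p{L}+/u', $invalid);
$error  = preg_last_error();
var_dump($result);                        // bool(false)
var_dump($error === PREG_BAD_UTF8_ERROR); // bool(true)

// An empty pattern with /u works as a cheap validity check before further matching.
function isUtf8(string $s): bool  // hypothetical helper name
{
    return preg_match('//u', $s) === 1;
}
```

Checking for a strict false (rather than a falsy value) distinguishes "invalid input" from "no match".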

Secondly, have you heard of Unicode categories? If not, head to http://www.regular-expressions.info/unicode.html and search for Unicode categories. For example, you could use \p{Sc} to match all currency symbols (\p{S} matches symbols of every kind), or \p{L} for all letters. Your regex could (probably) be written as follows: /[^\p{L}\p{P}\p{N}\p{S}\p{M}]/u.

This will match next to nothing, though, because the class allows almost every character: ^ at the start of a character class (the part between [ and ]) means "match any character that is not listed in this class".

On top of that, the regex matches only a single character at a time. To match a whole run of such characters, add a + after the closing ] so that matching continues until the pattern fails.
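
To show what a negated class with + actually picks up, here is a small sketch over a made-up string. The $allowed variable is only a convenience for building both patterns, and the source file is assumed to be UTF-8.

```php
<?php
$allowed = '\p{L}\p{P}\p{N}\p{S}\p{M}';
$text    = "Hello, wörld!\t42  €";

// Negated class plus "+": each match is a run of characters outside the allowed set.
preg_match_all('/[^' . $allowed . ']+/u', $text, $m);
echo json_encode($m[0]), "\n";  // [" ","\t","  "]

// The positive form, anchored, validates a whole string instead.
$isClean = preg_match('/^[' . $allowed . ']+$/u', 'wörld!') === 1;
var_dump($isClean);             // bool(true)
```

Only whitespace is reported here, which fits the point above: letters, punctuation, numbers, symbols and marks together cover nearly everything else.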

So, for that sake, what exactly are you trying to achieve? Maybe we can suggest you some more regex improvements if we know what you're trying to do.

Store UTF-8 character in ANSI file [PHP][REGEX]

The fullwidth colon (：) is code point U+FF1A.

In a PHP regex, you can refer to a code point with the \x{<HEX>} notation, together with the /u modifier.

Thus, use

\x{FF1A}

to match a single fullwidth colon.

Here is a short demo:

$re = '/\x{FF1A}\w+/u';
preg_match($re, "：here 123", $m);
print_r($m); // => [0] => ：here
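
If the PHP file itself must stay ASCII-only, the subject can be written with byte escapes as well. The following is a sketch under that assumption; the sample string and variable names are illustrative.

```php
<?php
// "\xEF\xBC\x9A" is the UTF-8 encoding of U+FF1A, spelled with ASCII-only escapes.
$subject = "key\xEF\xBC\x9Avalue";

// The pattern is single-quoted, so \x{FF1A} reaches PCRE untouched.
$normalized = preg_replace('/\x{FF1A}/u', ':', $subject);
echo $normalized, "\n";  // key:value
```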

Regex to allow a limited number of UTF characters i.e. allows only 2, 25 utf8 characters

You can use anchors ^ and $, they will make sure matching starts at the beginning and ends at the end of the string:

preg_match('/^\p{L}{2,25}$/u', $post);
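
The sketch below exercises that pattern on a few made-up inputs; the function names are hypothetical. It also shows one caveat that is an addition here rather than part of the answer above: $ tolerates a single trailing newline unless the D modifier is set.

```php
<?php
function isLetterRun(string $s): bool  // hypothetical helper name
{
    return preg_match('/^\p{L}{2,25}$/u', $s) === 1;
}

function isLetterRunStrict(string $s): bool  // same pattern with the D modifier
{
    return preg_match('/^\p{L}{2,25}$/uD', $s) === 1;
}

var_dump(isLetterRun('é'));                  // bool(false): one character
var_dump(isLetterRun(str_repeat('я', 25)));  // bool(true): 25 characters, 50 bytes
var_dump(isLetterRun(str_repeat('я', 26)));  // bool(false): too long
var_dump(isLetterRun('ab1'));                // bool(false): digit is not a letter
var_dump(isLetterRun("ab\n"));               // bool(true): $ matches before a final newline
var_dump(isLetterRunStrict("ab\n"));         // bool(false)
```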


Regular expressions for a range of unicode points PHP

You can use:

$foo = preg_replace('/[^\w$\x{0080}-\x{FFFF}]+/u', '', $foo);
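  • \w - matches [a-zA-Z0-9_]; with /u, PHP also enables Unicode properties, so it matches Unicode letters and digits as well
  • $ - inside a character class, a literal dollar sign
  • \x{0080}-\x{FFFF} - matches characters between code points U+0080 and U+FFFF
  • /u - for Unicode support in the regex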
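
For a concrete picture of what gets stripped, here is a short sketch with a made-up input, assuming PHP 7 or later for the "\u{...}" escapes.

```php
<?php
// Sample input: ASCII letters, punctuation, "$", "é" (U+00E9) and an emoji (U+1F600).
$foo = "a b-c: \$5 \u{E9} \u{1F600}";

// Everything outside \w, "$" and U+0080..U+FFFF is removed.
$clean = preg_replace('/[^\w$\x{0080}-\x{FFFF}]+/u', '', $foo);
echo $clean, "\n";  // abc$5é
```

The emoji is dropped because U+1F600 lies above U+FFFF; widening the range to \x{10FFFF} would keep such characters.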

