Matching Unicode Letter Characters in PCRE/PHP

Matching Unicode letter characters in PCRE/PHP

I think the problem is much simpler than that: You forgot to specify the u modifier. The Unicode character properties are only available in UTF-8 mode.

Your regex should be:

// unicode letters, apostrophe, hyphen, space
$namePattern = '/^[-\' \p{L}]+$/u';
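
To check the pattern quickly, something like the following sketch should do. The helper name isValidName and the sample inputs are my own, chosen purely for illustration.

```php
<?php
// Hypothetical helper: true if $name consists only of Unicode letters,
// apostrophes, hyphens, and spaces.
function isValidName(string $name): bool
{
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match('/^[-\' \p{L}]+$/u', $name) === 1;
}

var_dump(isValidName("Zoë O'Brien-Smith")); // letters, space, apostrophe, hyphen
var_dump(isValidName('René2'));             // contains a digit
```

Comparing against 1 keeps an error result (false) from being mistaken for "no match".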

PHP - regex to allow Unicode characters

From http://php.net/manual/en/reference.pcre.pattern.modifiers.php

u (PCRE_UTF8) This modifier turns on additional functionality of PCRE
that is incompatible with Perl. Pattern and subject strings are
treated as UTF-8. An invalid subject will cause the preg_* function to
match nothing; an invalid pattern will trigger an error of level
E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid
since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been
regarded as valid UTF-8.

That means that first you have to make sure the input string is proper UTF-8 text.
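
One way to check that up front is to run an empty pattern with the u modifier, since preg_match() fails on a subject that is not well-formed UTF-8. This is only a sketch; the helper name isValidUtf8 and the sample strings are assumptions of mine.

```php
<?php
// Hypothetical helper: reports whether $text is well-formed UTF-8.
// With the u modifier, preg_match() returns false on an invalid subject.
function isValidUtf8(string $text): bool
{
    return preg_match('//u', $text) === 1;
}

$valid   = "caf\u{E9}"; // "café" encoded as UTF-8 (PHP 7+ escape)
$invalid = "caf\xE9";   // the same word in ISO-8859-1: a lone 0xE9 byte

var_dump(isValidUtf8($valid));
var_dump(isValidUtf8($invalid));

// After a failed match, preg_last_error() tells you why it failed.
preg_match('/\p{L}/u', $invalid);
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);
```

If the mbstring extension is available, mb_check_encoding($text, 'UTF-8') is an alternative.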

Secondly, have you heard of Unicode categories? If not, head to http://www.regular-expressions.info/unicode.html and search for "Unicode categories". For example, you could use \p{Sc} to match all currency symbols (\p{S} matches symbols of any kind), or \p{L} for all letters. Your regex could (probably) be written as follows: /[^\p{L}\p{P}\p{N}\p{S}\p{M}]/.

Note, though, that this pattern will match little besides whitespace and control characters, because the listed categories cover nearly every visible character, and a ^ at the start of a character class (the part between [ and ]) means "match any character that is not in this class".

On top of that, a character class on its own matches exactly one character at a time. If you want to match a whole run of characters, add a + after the closing ] so the regex keeps consuming characters until the class no longer matches.
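
A small sketch of both points, negation and the + quantifier. The helper firstMatch and the sample strings are hypothetical, only there to make the behaviour visible.

```php
<?php
// Hypothetical helper: returns the first match of $pattern in $subject,
// or null if there is none.
function firstMatch(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

$text = 'abc 123';

// Without a quantifier, the class matches a single character.
var_dump(firstMatch('/[\p{L}\p{N}]/u', $text));

// With +, it matches a run of consecutive letters or numbers.
var_dump(firstMatch('/[\p{L}\p{N}]+/u', $text));

// A leading ^ negates the class: the first match is the space.
var_dump(firstMatch('/[^\p{L}\p{N}]/u', $text));

// The broad negated class finds nothing in a string made only of
// letters and punctuation, so the helper returns null.
var_dump(firstMatch('/[^\p{L}\p{P}\p{N}\p{S}\p{M}]/u', 'Hello,world!'));
```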

So, what exactly are you trying to achieve? If we know what you're trying to do, we can suggest further improvements to the regex.

Regular expressions for a range of unicode points PHP

You can use:

$foo = preg_replace('/[^\w$\x{0080}-\x{FFFF}]+/u', '', $foo);
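  • \w - equivalent to [a-zA-Z0-9_] in byte mode; with the u modifier, PHP also lets it match Unicode letters and digits
  • \x{0080}-\x{FFFF} - matches characters between code points U+0080 and U+FFFF
  • /u - enables Unicode (UTF-8) support in the regex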
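
To see what the replacement keeps and drops, here is a minimal sketch. The wrapper name stripDisallowed and the sample input are assumptions for illustration only.

```php
<?php
// Hypothetical wrapper around the replacement shown above.
// preg_replace() returns null on error (e.g. invalid UTF-8 input).
function stripDisallowed(string $text): ?string
{
    // Single quotes: PHP does not interpolate the $ or touch \x{...}.
    return preg_replace('/[^\w$\x{0080}-\x{FFFF}]+/u', '', $text);
}

// Word characters, $ and é survive; spaces and punctuation are removed.
var_dump(stripDisallowed('prix: 5$ (été)!'));

// U+1F600 lies above U+FFFF and is not a word character, so it is removed.
var_dump(stripDisallowed("a\u{1F600}b"));
```

Note that the range stops at U+FFFF, so anything outside the Basic Multilingual Plane (most emoji, for instance) is stripped as well.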

PHP: How to match a range of unicode paired surrogates emoticons/emoji?

revo's comment above was very helpful to find a solution:

If your PHP isn't shipped with a PCRE build for UTF-16 then you can't perform such a match. From PHP 7.0 on, you're able to use Unicode code points following this syntax \u{XXXX} e.g. preg_replace("~\u{1F600}~", '', $str); (Mind the double quotes)

Since I am using PHP 7, echo "\u{1F602}"; outputs 😂, according to this PHP RFC page on Unicode escapes. This proposal was in essence:

A new escape sequence is added for double-quoted strings and heredocs.

  • \u{ codepoint-digits } where codepoint-digits is composed of hexadecimal digits.

This implies that the pattern string passed to preg_replace (normally single-quoted, to avoid the variable expansion of double-quoted strings) now has to be double-quoted and needs some preg_quote magic. This is the solution I came up with:

preg_replace(
    // single point unicode list
    // http://www.fileformat.info/info/unicode/block/miscellaneous_symbols/list.htm
    "/[\x{2600}-\x{26FF}" .
    // concatenates with paired surrogates
    // https://www.fileformat.info/info/unicode/block/emoticons/list.htm
    preg_quote("\u{1F600}", '/') . "-" . preg_quote("\u{1F64F}", '/') .
    "]/u",
    '',
    $str
);

Here's the proof of the above in 3v4l.

EDIT: a simpler solution

In another comment made by revo, it seems that by placing unicode characters directly into the regex character class, single-quoted strings and previous PHP versions (e.g. 4.3.4) are supported:

preg_replace('/[☀-⛿😀-🙏]/u','YOINK',$str);

For using PHP 7's new feature though, you still need double-quotes:

preg_replace("/[\u{2600}-\u{26FF}\u{1F600}-\u{1F64F}]/u",'YOINK',$str);
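
For comparison, here is a sketch that wraps the double-quoted pattern in a function and puts it next to a single-quoted variant using PCRE's own \x{...} escape, which also accepts code points above U+FFFF in UTF-8 mode. The function names and the sample string are hypothetical.

```php
<?php
// Hypothetical helper: removes characters in the Miscellaneous Symbols
// (U+2600 to U+26FF) and Emoticons (U+1F600 to U+1F64F) blocks.
function stripSymbols(string $text): ?string
{
    // Double quotes: PHP 7+ expands \u{...} to UTF-8 before PCRE sees it.
    return preg_replace("/[\u{2600}-\u{26FF}\u{1F600}-\u{1F64F}]/u", '', $text);
}

// Same ranges in a single-quoted pattern: PHP leaves \x{...} alone and
// PCRE resolves the escape itself in UTF-8 mode.
function stripSymbolsPcreEscape(string $text): ?string
{
    return preg_replace('/[\x{2600}-\x{26FF}\x{1F600}-\x{1F64F}]/u', '', $text);
}

$sample = "a\u{2600}b\u{1F602}c";
var_dump(stripSymbols($sample));
var_dump(stripSymbolsPcreEscape($sample));
```

The single-quoted form does not depend on the PHP 7 escape syntax, which may matter if the code has to run on older versions.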

Here's revo's proof in 3v4l.

Regex to match letters, numbers and space, including non-ascii characters

You can use the Unicode letter and number properties for this:

preg_match('/^([-_ \p{L}\p{N}])+$/iu', $string)

Update: You may not need a capturing group here:

preg_match('/^[-_ \p{L}\p{N}]+$/iu', $string)

Weird behaviour with multibyte strings and php regex

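You have to add the UTF-8 flag (the u modifier) for tests like these, e.g. '/[£]/u'. Without it, the pattern is treated as a sequence of bytes, so a multibyte character such as £ inside [ ] becomes a class of its individual bytes.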

From the PHP docs:

u (PCRE_UTF8) This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.
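
A minimal sketch of what goes wrong without the flag. The helper containsPound and the sample characters are assumptions of mine; the strings are written with \u{...} escapes (PHP 7+) so the example does not depend on the file's encoding.

```php
<?php
// Hypothetical helper: does the class [£] match anywhere in $subject?
// "\u{A3}" is "£"; in UTF-8 it is the two bytes 0xC2 0xA3.
function containsPound(string $subject, bool $utf8Mode): bool
{
    $pattern = "/[\u{A3}]/" . ($utf8Mode ? 'u' : '');
    return preg_match($pattern, $subject) === 1;
}

$copyright = "\u{A9}"; // "©", UTF-8 bytes 0xC2 0xA9

// Byte mode: the class holds the bytes 0xC2 and 0xA3, and "©" starts
// with 0xC2, so the test reports a false positive.
var_dump(containsPound($copyright, false));

// UTF-8 mode: the class holds the single character U+00A3.
var_dump(containsPound($copyright, true));
var_dump(containsPound("\u{A3}5", true));
```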


