Are the PHP Preg_Functions Multibyte Safe

Are the PHP preg_functions multibyte safe?

PCRE can support UTF-8 and other Unicode encodings, but it has to be specified at compile time. From the man page for PCRE 8.0:

The current implementation of PCRE corresponds approximately with Perl 5.10, including support for UTF-8 encoded strings and Unicode general category properties. However, UTF-8 and Unicode support has to be explicitly enabled; it is not the default. The Unicode tables correspond to Unicode release 5.1.

PHP currently uses PCRE 7.9; your system might have an older version.

Taking a look at the PCRE lib that comes with PHP 5.2, it appears that it's configured to support Unicode properties and UTF-8. Same for the 5.3 branch.
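
One way to check what a particular PHP build supports is to probe it at runtime. The following is a minimal sketch, assuming the pcre extension is loaded; pcre_supports() is a hypothetical helper name, and the probe is a quick sanity check rather than an authoritative capability test.

```php
<?php
// Hypothetical helper: returns true if the pattern compiles and matches.
// preg_match() returns false (with a warning) when the pattern cannot be
// compiled, e.g. when /u is used against a PCRE built without UTF-8 support.
function pcre_supports($pattern, $subject)
{
    return @preg_match($pattern, $subject) === 1;
}

$eAcute = "\xC3\xA9"; // U+00E9 encoded as UTF-8 (two bytes)

$hasUtf8       = pcre_supports('/^.$/u', $eAcute);     // one character, not two bytes
$hasProperties = pcre_supports('/^\p{L}$/u', $eAcute); // Unicode "letter" property

echo 'PCRE version: ', PCRE_VERSION, "\n";
echo 'UTF-8 mode: ', $hasUtf8 ? 'yes' : 'no', "\n";
echo 'Unicode properties: ', $hasProperties ? 'yes' : 'no', "\n";
```

PCRE_VERSION reports the library PHP was built against, which may differ from the system-wide PCRE.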

Are PHP's mb_ereg functions safe to use (given that ereg is deprecated)?

I wouldn't depend on them. The preg functions are faster, more efficient, much more powerful, and natively support UTF-8. I would recommend using the preg functions for all of your regex needs.

But to directly answer your question, it does not appear that mb_ereg is deprecated...

multi-byte function to replace preg_match_all?

Have you taken a look into mb_ereg?

Additionally, you can pass a UTF-8 encoded string into preg_match using the u modifier, which might be the kind of multi-byte support you need. The other option, for input in a different encoding, is to convert it to UTF-8, run the match, and then convert the results back to the original encoding.
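
Both options can be sketched as follows. This assumes the mbstring extension is available for mb_convert_encoding(), and the ISO-8859-1 input is purely an illustrative choice of source encoding.

```php
<?php
// "héllo" as UTF-8: the é occupies two bytes (0xC3 0xA9).
$utf8 = "h\xC3\xA9llo";

// Without /u, "." matches single bytes; with /u it matches whole characters.
$byteCount = preg_match_all('/./', $utf8, $bytes);  // 6 matches
$charCount = preg_match_all('/./u', $utf8, $chars); // 5 matches

// Round trip for a non-UTF-8 input (assumed here to be ISO-8859-1).
$latin1 = "caf\xE9 cr\xE8me"; // "café crème" in ISO-8859-1
$asUtf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

preg_match_all('/\p{L}+/u', $asUtf8, $words);

$results = array();
foreach ($words[0] as $word) {
    // Convert each match back to the original encoding.
    $results[] = mb_convert_encoding($word, 'ISO-8859-1', 'UTF-8');
}
```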

See also the answer to a related question: Are the PHP preg_functions multibyte safe?

Multibyte trim in PHP?

The standard trim function trims a handful of space and space-like characters. These are all ASCII characters, which means certain specific single bytes in the range 0000 0000 to 0010 0000 (0x00 to 0x20).

Proper UTF-8 input will never contain multibyte characters that are made up of bytes of the form 0xxx xxxx. Every byte of a proper UTF-8 multibyte character has the form 1xxx xxxx.

This means that in a proper UTF-8 sequence, the bytes 0xxx xxxx can only refer to single-byte characters. PHP's trim function will therefore never trim away "half a character" assuming you have a proper UTF-8 sequence. (Be very very careful about improper UTF-8 sequences.)
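
The byte-level claim can be checked directly. This is a small sketch; all_bytes_high() is a hypothetical helper written for illustration, and the sample character is ホ (U+30DB).

```php
<?php
// Hypothetical helper: true if every byte of $char has its high bit set,
// i.e. has the form 1xxx xxxx.
function all_bytes_high($char)
{
    foreach (str_split($char) as $byte) {
        if ((ord($byte) & 0x80) === 0) {
            return false;
        }
    }
    return true;
}

$ho = "\xE3\x83\x9B"; // ホ (U+30DB) as UTF-8, three bytes

// Default trim() only removes bytes " ", \t, \n, \r, \0 and \x0B,
// none of which can occur inside a multibyte UTF-8 character.
$trimmed = trim(" \t" . $ho . " \n");

echo bin2hex($trimmed), "\n"; // prints "e3839b"
```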

In a regular expression without the /u modifier, \s matches mostly the same characters as trim.

The preg functions with the /u modifier only work on UTF-8 encoded patterns and subject strings, and /\s/u also matches the UTF-8 encoded non-breaking space (NBSP). This behaviour with non-breaking spaces is the only advantage of using it.

If you want to replace space characters in other, non-ASCII-compatible encodings, neither method will work.

In other words, if you're trying to trim the usual spaces from an ASCII-compatible string, just use trim. When using /\s/u, be careful about what NBSP means in your text.

Take care:

  $s1 = html_entity_decode(" Hello &nbsp; "); // the NBSP
  $s2 = " exotic test ホ ";

  echo "\nCORRECT trim: [". trim($s1) ."], [".  trim($s2) ."]";
  echo "\nSAME: [". trim($s1) ."] == [". preg_replace('/^\s+|\s+$/','',$s1) ."]";
  echo "\nBUT: [". trim($s1) ."] != [". preg_replace('/^\s+|\s+$/u','',$s1) ."]";

  echo "\n!INCORRECT trim: [". trim($s2,' ') ."]"; // DANGER! not UTF8 safe!
  echo "\nSAFE ONLY WITH preg: [". 
       preg_replace('/^[\s]+|[\s]+$/u', '', $s2) ."]";
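
If NBSP should always be trimmed, the preg approach can be wrapped in a small function. This is a sketch only: utf8_trim() is a hypothetical name, not a PHP built-in, and it assumes the input is meant to be UTF-8.

```php
<?php
// Hypothetical helper, not a PHP built-in. Trims ASCII whitespace plus
// NBSP (U+00A0) from both ends of a UTF-8 string.
function utf8_trim($str)
{
    // NBSP is listed explicitly so the result does not depend on whether
    // \s matches it in the PCRE build at hand.
    $result = preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $str);

    // preg_replace() returns null on error, e.g. for invalid UTF-8 input.
    return $result === null ? false : $result;
}

$a = utf8_trim("\xC2\xA0 Hello \xC2\xA0\t");        // NBSP on both sides
$b = utf8_trim(" exotic test \xE3\x83\x9B ");       // ends with ホ
$c = utf8_trim("\xFF abc");                         // invalid UTF-8 input
```

Returning false for invalid input is a design choice made here so that callers can tell a failure from an empty result.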

preg_match multi-byte characters by length

Use the multibyte-safe functions mb_regex_encoding() and mb_ereg_replace(). (I'm not convinced the first one is mandatory; try without it and see if that is sufficient.)
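
A minimal sketch of matching by character length, assuming the mbstring extension was built with its regex (mb_ereg*) support. The sample string and the length of 3 are illustrative, and the preg calls are included only for comparison.

```php
<?php
// Assumes the mbstring extension with mb_ereg* support enabled.
mb_regex_encoding('UTF-8');

$s = "\xE3\x83\x9B\xE3\x83\x9B\xE3\x83\x9B"; // ホホホ: 3 characters, 9 bytes

// Character-based length check with mb_ereg().
// The return type on a match varies by PHP version, so compare against false.
$mbMatches = mb_ereg('^.{3}$', $s) !== false;   // true

// Same check with preg: only the /u variant counts characters.
$pregUtf8  = preg_match('/^.{3}$/u', $s) === 1; // true
$pregBytes = preg_match('/^.{3}$/', $s) === 1;  // false: sees 9 bytes

// mb_ereg_replace() also works per character: drop the first one.
$rest = mb_ereg_replace('^.', '', $s);
```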

preg_match and UTF-8 in PHP

Looks like this is a "feature", see
http://bugs.php.net/bug.php?id=37391

'u' switch only makes sense for pcre, PHP itself is unaware of it.

From PHP's point of view, strings are byte sequences, so returning a byte offset seems logical (I don't say "correct").
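
If a character offset is needed, one workaround is to count the characters in the bytes that precede the match. A minimal sketch, assuming the mbstring extension is available and the subject is valid UTF-8:

```php
<?php
$s = "\xC3\xA9t\xC3\xA9 x"; // "été x" as UTF-8

// PREG_OFFSET_CAPTURE reports offsets in bytes, even with /u.
preg_match('/x/u', $s, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1]; // 6

// Count the characters in the bytes that precede the match.
$charOffset = mb_strlen(substr($s, 0, $byteOffset), 'UTF-8'); // 4
```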
