How to Detect Invalid Utf8 Unicode/Binary in a Text File

Example invalid utf8 string?

Take a look at Markus Kuhn's UTF-8 decoder capability and stress test file

You'll find examples of many UTF-8 irregularities, including lonely start bytes, continuation bytes missing, overlong sequences, etc.

Regex to detect invalid UTF-8 string

You can use this PCRE regular expression to check for a valid UTF-8 in a string. If the regex matches, the string contains invalid byte sequences. It's 100% portable because it doesn't rely on PCRE_UTF8 to be compiled in.

$regex = '/(
    [\xC0-\xC1] # Invalid UTF-8 Bytes
    | [\xF5-\xFF] # Invalid UTF-8 Bytes
    | \xE0[\x80-\x9F] # Overlong encoding of prior code point
    | \xF0[\x80-\x8F] # Overlong encoding of prior code point
    | [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
    | [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
    | [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
    | (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
    | (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF] # Overlong Sequence
    | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte sequence
    | (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte sequence
    | (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte sequence (2)
)/x';

We can test it by creating a few variations of text:

// Overlong encoding of code point 0
$text = chr(0xC0) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// Overlong encoding of 5 byte encoding
$text = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// Overlong encoding of 6 byte encoding
$text = chr(0xFC) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);        
var_dump(preg_match($regex, $text)); // int(1)
// High code-point without trailing characters
$text = chr(0xD0) . chr(0x01);
var_dump(preg_match($regex, $text)); // int(1)

etc...

In fact, since this matches invalid bytes, you could then use it in preg_replace to replace them away:

preg_replace($regex, '', $text); // Remove all invalid UTF-8 code-points

How can I detect and report Unicode code points that aren't legal for interchange using Perl?

First, find encoding errors, then find undesired code points.

The latter is easy since there are Unicode properties to identify them. (See below)

To report the errors precisely, you might want to write your own decoder to find the UTF-8 errors.

sub bytes_to_hex { join ' ', map { sprintf '%02X', $_ } unpack 'C*', $_[0] }

my @errors;
my @warns;
my $output = '';
for ($input) {
   while (!/\G \z /xgc) {
      my $pos = pos;

      if (/\G (
         (?: [\x00-\x7F]
         |   [\xC0-\xDF][\x80-\xBF]
         |   [\xE0-\xEF][\x80-\xBF]{2}
         |   [\xF0-\xF7][\x80-\xBF]{3}
         |   [\xF8-\xFB][\x80-\xBF]{4}
         |   [\xFC-\xFD][\x80-\xBF]{5}
         )
      ) /xgc) {
         my $bytes = $1;
         my @bytes = unpack 'C*', $bytes;
         my $hex_bytes = bytes_to_hex($bytes);

         if ($bytes =~ /^
            (?: [\xC0-\xC1]
            |   \xE0[\x80-\x9F]
            |   \xF0[\x80-\x8F]
            |   \xF8[\x80-\x87]
            |   \xFC[\x80-\x83]
            )
         /x) {
            push @warns, "Overlong encoding $hex_bytes at pos $pos";
         }

         if ($bytes =~ /^[\xF8-\xFD]/) {
            push @warns, "Defunct 5 or 6 byte sequence $hex_bytes at pos $pos";
         }

         my $code_point_ord = @bytes == 1
            ? $bytes[0]
            : $bytes[0] & ( 0x7F >> @bytes );
         $code_point_ord = ( $code_point_ord << 6 ) | ( $_ & 0x3F )
            for @bytes[ 1..$#bytes ];
         my $code_point_hex = sprintf('U+%05X', $code_point_ord);
         my $code_point = chr($code_point_ord);

         if ($code_point_ord >= 0x110000) {
            push @errors, "Non-Unicode $code_point_hex at pos $pos";
         } else {
            push @warns, "Surrogate $code_point_hex at pos $pos"
               if $code_point =~ /\p{Cs}/;
            push @warns, "Private use $code_point_hex at pos $pos"
               if $code_point =~ /\p{Co}/;
            push @warns, "Unassigned $code_point_hex at pos $pos"
               if $code_point =~ /\p{Cn}/;

            $output .= $code_point;
         }
      }

      elsif (/\G (
         (?: [\xC0-\xDF]
         |   [\xE0-\xEF][\x80-\xBF]{0,1}
         |   [\xF0-\xF7][\x80-\xBF]{0,2}
         |   [\xF8-\xFB][\x80-\xBF]{0,3}
         |   [\xFC-\xFD][\x80-\xBF]{0,4}
         )
      ) /xgc) {
         my $bytes = $1;
         my $hex_bytes = bytes_to_hex($bytes);
         push @errors, "Incomplete sequence $hex_bytes at pos $pos";
      }

      elsif (/\G ( [\x80-\xBF] ) /xgc) {
         my $byte = $1;
         my $hex_byte = bytes_to_hex($byte);
         push @errors, "Unexpected continuation byte $hex_byte at pos $pos";
      }

      elsif (/\G ( [\xFE-\xFF] ) /xgc) {
         my $byte = $1;
         my $hex_byte = bytes_to_hex($byte);
         push @errors, "Invalid byte $hex_byte at pos $pos";
      }
      else {
         die "Bug";
      }
   }
}

How to Detect Invalid Utf8 Unicode/Binary in a Text File

Example invalid utf8 string?

Regex to detect invalid UTF-8 string

How can I detect and report Unicode code points that aren't legal for interchange using Perl?

Related Topics

Leave a reply