Example invalid utf8 string?
Take a look at Markus Kuhn's UTF-8 decoder capability and stress test file
You'll find examples of many UTF-8 irregularities, including lonely start bytes, continuation bytes missing, overlong sequences, etc.
Regex to detect invalid UTF-8 string
You can use this PCRE regular expression to check for a valid UTF-8 in a string. If the regex matches, the string contains invalid byte sequences. It's 100% portable because it doesn't rely on PCRE_UTF8 to be compiled in.
$regex = '/(
[\xC0-\xC1] # Invalid UTF-8 Bytes
| [\xF5-\xFF] # Invalid UTF-8 Bytes
| \xE0[\x80-\x9F] # Overlong encoding of prior code point
| \xF0[\x80-\x8F] # Overlong encoding of prior code point
| [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
| [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
| [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
| (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
| (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF] # Overlong Sequence
| (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte sequence
| (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte sequence
| (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte sequence (2)
)/x';
We can test it by creating a few variations of text:
// Overlong encoding of code point 0
$text = chr(0xC0) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// Overlong encoding of 5 byte encoding
$text = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// Overlong encoding of 6 byte encoding
$text = chr(0xFC) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// High code-point without trailing characters
$text = chr(0xD0) . chr(0x01);
var_dump(preg_match($regex, $text)); // int(1)
etc...
In fact, since this matches invalid bytes, you could then use it in preg_replace to replace them away:
preg_replace($regex, '', $text); // Remove all invalid UTF-8 code-points
How can I detect and report Unicode code points that aren't legal for interchange using Perl?
First, find encoding errors, then find undesired code points.
The latter is easy since there are Unicode properties to identify them. (See below)
To report the errors precisely, you might want to write your own decoder to find the UTF-8 errors.
sub bytes_to_hex { join ' ', map { sprintf '%02X', $_ } unpack 'C*', $_[0] }
my @errors;
my @warns;
my $output = '';
for ($input) {
while (!/\G \z /xgc) {
my $pos = pos;
if (/\G (
(?: [\x00-\x7F]
| [\xC0-\xDF][\x80-\xBF]
| [\xE0-\xEF][\x80-\xBF]{2}
| [\xF0-\xF7][\x80-\xBF]{3}
| [\xF8-\xFB][\x80-\xBF]{4}
| [\xFC-\xFD][\x80-\xBF]{5}
)
) /xgc) {
my $bytes = $1;
my @bytes = unpack 'C*', $bytes;
my $hex_bytes = bytes_to_hex($bytes);
if ($bytes =~ /^
(?: [\xC0-\xC1]
| \xE0[\x80-\x9F]
| \xF0[\x80-\x8F]
| \xF8[\x80-\x87]
| \xFC[\x80-\x83]
)
/x) {
push @warns, "Overlong encoding $hex_bytes at pos $pos";
}
if ($bytes =~ /^[\xF8-\xFD]/) {
push @warns, "Defunct 5 or 6 byte sequence $hex_bytes at pos $pos";
}
my $code_point_ord = @bytes == 1
? $bytes[0]
: $bytes[0] & ( 0x7F >> @bytes );
$code_point_ord = ( $code_point_ord << 6 ) | ( $_ & 0x3F )
for @bytes[ 1..$#bytes ];
my $code_point_hex = sprintf('U+%05X', $code_point_ord);
my $code_point = chr($code_point_ord);
if ($code_point_ord >= 0x110000) {
push @errors, "Non-Unicode $code_point_hex at pos $pos";
} else {
push @warns, "Surrogate $code_point_hex at pos $pos"
if $code_point =~ /\p{Cs}/;
push @warns, "Private use $code_point_hex at pos $pos"
if $code_point =~ /\p{Co}/;
push @warns, "Unassigned $code_point_hex at pos $pos"
if $code_point =~ /\p{Cn}/;
$output .= $code_point;
}
}
elsif (/\G (
(?: [\xC0-\xDF]
| [\xE0-\xEF][\x80-\xBF]{0,1}
| [\xF0-\xF7][\x80-\xBF]{0,2}
| [\xF8-\xFB][\x80-\xBF]{0,3}
| [\xFC-\xFD][\x80-\xBF]{0,4}
)
) /xgc) {
my $bytes = $1;
my $hex_bytes = bytes_to_hex($bytes);
push @errors, "Incomplete sequence $hex_bytes at pos $pos";
}
elsif (/\G ( [\x80-\xBF] ) /xgc) {
my $byte = $1;
my $hex_byte = bytes_to_hex($byte);
push @errors, "Unexpected continuation byte $hex_byte at pos $pos";
}
elsif (/\G ( [\xFE-\xFF] ) /xgc) {
my $byte = $1;
my $hex_byte = bytes_to_hex($byte);
push @errors, "Invalid byte $hex_byte at pos $pos";
}
else {
die "Bug";
}
}
}
Related Topics
Rename All Files in a Folder With a Prefix in a Single Command
How to Simulate Just One Enter in Command Line After Executing a Jar File
Curl: (6) Could Not Resolve Host: Google.Com; Name or Service Not Known
How to Get the Bssid of Currently Connected Network Through Bash
Error:13 - Permission Denied Android Studio
Sort a List With Lowercase First
Uninstall Node.Js Using Linux Command Line
Curl Command to Repeat Url Request
How to Remove a Close_Wait Socket Connection
Mount Smb/Cifs Share Within a Docker Container
Format and Then Convert Txt to CSV Using Shell Script and Awk
How to Turn Off Echo While Executing a Shell Script Linux
How to Display Number to Two Decimal Places in Bash Function
Convert String to Hexadecimal on Command Line
Linux: Find File Names With 4 or 5 Characters