Delete Every Non-UTF-8 Symbol from a String

Try the line below (this is Python 2, where str holds bytes) instead of your last two lines:

line=line.decode('utf-8','ignore').encode("utf-8")

In Python 3, how do you remove all non-UTF8 characters from a string?

You're starting with a string. You can't decode a str (it's already decoded text, you can only encode it to binary data again). UTF-8 encodes almost any valid Unicode text (which is what str stores) so this shouldn't come up much, but if you're encountering surrogate characters in your input, you could just reverse the directions, changing:

x.decode('utf-8','ignore').encode("utf-8")

to:

x.encode('utf-8','ignore').decode("utf-8")

where you encode any UTF-8 encodable thing, discarding the unencodable stuff, then decode the now clean UTF-8 bytes.
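For example, a lone surrogate is legal inside a Python 3 str but cannot be encoded as UTF-8, so the reversed encode/decode round-trip strips it (a minimal sketch):

```python
# A lone surrogate (U+D800) can sit in a str but is not UTF-8 encodable.
x = "abc\ud800def"
clean = x.encode("utf-8", "ignore").decode("utf-8")
print(clean)  # abcdef
```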

Remove non-utf8 characters from string

Using a regex approach:

$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]               # single-byte sequences   0xxxxxxx
      | [\xC0-\xDF][\x80-\xBF]    # double-byte sequences   110xxxxx 10xxxxxx
      | [\xE0-\xEF][\x80-\xBF]{2} # triple-byte sequences   1110xxxx 10xxxxxx * 2
      | [\xF0-\xF7][\x80-\xBF]{3} # quadruple-byte sequence 11110xxx 10xxxxxx * 3
    ){1,100}                      # ...one or more times (capped at 100 to limit backtracking)
  )
| .                               # anything else
/x
END;
$text = preg_replace($regex, '$1', $text);

It searches for UTF-8 sequences, and captures those into group 1. It also matches single bytes that could not be identified as part of a UTF-8 sequence, but does not capture those. Replacement is whatever was captured into group 1. This effectively removes all invalid bytes.
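The same effect (keep valid UTF-8 sequences, drop stray bytes) can be sketched in Python, where the UTF-8 codec does the sequence matching for you:

```python
# Invalid bytes (\xff, \xfe) are dropped; valid sequences survive.
raw = b"valid \xc3\xa9 text \xff\xfe end"   # \xc3\xa9 is "é"
clean = raw.decode("utf-8", "ignore").encode("utf-8")
print(clean)  # b'valid \xc3\xa9 text  end'
```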

It is possible to repair the string, by encoding the invalid bytes as UTF-8 characters. But if the errors are random, this could leave some strange symbols.

$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]               # single-byte sequences   0xxxxxxx
      | [\xC0-\xDF][\x80-\xBF]    # double-byte sequences   110xxxxx 10xxxxxx
      | [\xE0-\xEF][\x80-\xBF]{2} # triple-byte sequences   1110xxxx 10xxxxxx * 2
      | [\xF0-\xF7][\x80-\xBF]{3} # quadruple-byte sequence 11110xxx 10xxxxxx * 3
    ){1,100}                      # ...one or more times (capped at 100 to limit backtracking)
  )
| ( [\x80-\xBF] )                 # invalid byte in range 10000000 - 10111111
| ( [\xC0-\xFF] )                 # invalid byte in range 11000000 - 11111111
/x
END;
function utf8replacer($captures) {
    if ($captures[1] != "") {
        // Valid byte sequence. Return unmodified.
        return $captures[1];
    }
    elseif ($captures[2] != "") {
        // Invalid byte of the form 10xxxxxx.
        // Encode as 11000010 10xxxxxx.
        return "\xC2".$captures[2];
    }
    else {
        // Invalid byte of the form 11xxxxxx.
        // Encode as 11000011 10xxxxxx.
        return "\xC3".chr(ord($captures[3])-64);
    }
}
$text = preg_replace_callback($regex, "utf8replacer", $text);
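The byte arithmetic above is exactly Latin-1-to-UTF-8 transcoding of the stray bytes (0x80-0xFF map to U+0080-U+00FF). A Python sketch of the same repair, assuming input arrives as bytes (the helper name is mine, not from the original answer):

```python
def repair_utf8(raw: bytes) -> str:
    # Decode valid UTF-8 sequences normally; reinterpret each invalid
    # byte as Latin-1, which matches the \xC2/\xC3 prefixing in PHP.
    out = []
    i = 0
    while i < len(raw):
        for length in (4, 3, 2, 1):  # try the longest sequence first
            chunk = raw[i:i + length]
            try:
                out.append(chunk.decode("utf-8"))
                i += len(chunk)
                break
            except UnicodeDecodeError:
                continue
        else:
            # Invalid lead or continuation byte: repair via Latin-1.
            out.append(raw[i:i + 1].decode("latin-1"))
            i += 1
    return "".join(out)

print(repair_utf8(b"caf\xe9"))  # b'\xe9' alone is invalid UTF-8 -> café
```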

EDIT:

  • !empty(x) will match non-empty values ("0" is considered empty).
  • x != "" will match non-empty values, including "0".
  • x !== "" will match anything except "".

x != "" seem the best one to use in this case.

I have also sped up the match a little. Instead of matching each character separately, it matches sequences of valid UTF-8 characters.

Remove non-UTF8 characters from file contents

Maybe something like this:

with open('text.txt', encoding='utf-8', errors='ignore') as f:
    content = f.read().splitlines()
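To see the effect end to end, here is a self-contained sketch that writes a file containing one stray non-UTF-8 byte and reads it back with errors='ignore' (the temp-file handling is mine, just to make it runnable):

```python
import os
import tempfile

# Write a line containing one invalid byte (0xFF).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"good line\xffstill good\n")

# The invalid byte is silently dropped on read.
with open(path, encoding="utf-8", errors="ignore") as f:
    content = f.read().splitlines()
print(content)  # ['good linestill good']

os.remove(path)
```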

PHP remove all non UTF-8 characters from string

If I understand correctly, this will do what you want:

$result = preg_replace('/(?:^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$)/u', '', $input);

Where

\p{L} stands for any character that is a Unicode letter.

\p{N} stands for any character that is a Unicode digit.

[^\p{L}\p{N}] is a negated character class that matches any character that is not a letter or a digit.
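Python's stdlib re module has no \p{...} classes, but \w is a rough stand-in: with str patterns it matches Unicode letters and digits, plus underscore (which \p{L}\p{N} would not include). A sketch of the same leading/trailing trim (the function name is mine):

```python
import re

def trim_non_alnum(s):
    # [^\w] = anything that is not a Unicode letter, digit, or underscore.
    return re.sub(r"^[^\w]+|[^\w]+$", "", s)

print(trim_non_alnum("--héllo42!!"))  # héllo42
```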

How do I remove non UTF-8 characters from a String?

We have a few problems.

The biggest is that a Ruby String stores arbitrary bytes along with a supposed encoding, with no guarantee that the bytes are valid in that encoding and with no obvious reason for that encoding to have been chosen. (I might be biased as a heavy user of Python 3. We would never speak of "changing a string from one encoding to another".)

Fortunately, the editor did not eat your post, but it's hard to see that. I'm guessing that you decoded the string as Windows-1252 in order to display it, which only obscures the issue.

Here's your string of bytes as I see it:

>> s = "\x06-~$A\xA7ruG\xF9\"\x9A\f\xB6/K".b
=> "\x06-~$A\xA7ruG\xF9\"\x9A\f\xB6/K"
>> s.bytes
=> [6, 45, 126, 36, 65, 167, 114, 117, 71, 249, 34, 154, 12, 182, 47, 75]

And it does contain bytes that are not valid UTF-8.

>> s.encoding
=> #<Encoding:ASCII-8BIT>
>> String::new(s).force_encoding(Encoding::UTF_8).valid_encoding?
=> false

We can ask to decode this as UTF-8 and insert � where we encounter bytes that are not valid UTF-8:

>> s.encode('utf-8', 'binary', :undef => :replace)
=> "\u0006-~$A�ruG�\"�\f�/K"

How to remove non UTF-8 characters from text

The signature of gsub is:

gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

Not sure what you wanted to do with

gsub("’","‘","",txt)

but that line is probably not doing what you want it to do...

See here for a previous SO question on gsub and non-ascii symbols.

Edit:

Suggested solution using iconv:

Removing all non-ASCII characters:

txt <- "’xxx‘"

iconv(txt, "latin1", "ASCII", sub="")

Returns:

[1] "xxx"    

