Delete every non-UTF-8 symbol from a string
Try the line below in place of your last two lines. It calls decode, so it applies to Python 2 strings or to bytes objects in Python 3:
line=line.decode('utf-8','ignore').encode("utf-8")
In Python 3, how do you remove all non-UTF8 characters from a string?
You're starting with a string. You can't decode a str (it's already decoded text; you can only encode it to binary data again). UTF-8 encodes almost any valid Unicode text (which is what str stores), so this shouldn't come up much, but if you're encountering surrogate characters in your input, you could just reverse the directions, changing:
x.decode('utf-8','ignore').encode("utf-8")
to:
x.encode('utf-8','ignore').decode("utf-8")
where you encode any UTF-8 encodable thing, discarding the unencodable stuff, then decode the now clean UTF-8 bytes.
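As a minimal sketch of both directions in Python 3 (the function names are made up for illustration): the first helper cleans a str that contains a lone surrogate, the second handles the case where you actually hold raw bytes.

```python
def strip_unencodable(text: str) -> str:
    """Drop characters (such as lone surrogates) that UTF-8 cannot encode."""
    return text.encode("utf-8", "ignore").decode("utf-8")


def strip_invalid_bytes(raw: bytes) -> str:
    """Decode bytes as UTF-8, silently skipping invalid byte sequences."""
    return raw.decode("utf-8", "ignore")


# "\udc80" is a lone surrogate; b"\xff" can never appear in valid UTF-8.
print(strip_unencodable("abc\udc80def"))   # abcdef
print(strip_invalid_bytes(b"abc\xffdef"))  # abcdef
```

Ordinary non-ASCII text such as "héllo" passes through both helpers unchanged, since it is valid Unicode and encodes cleanly.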
Remove non-utf8 characters from string
Using a regex approach:
$regex = <<<'END'
/
(
(?: [\x00-\x7F] # single-byte sequences 0xxxxxxx
| [\xC0-\xDF][\x80-\xBF] # double-byte sequences 110xxxxx 10xxxxxx
| [\xE0-\xEF][\x80-\xBF]{2} # triple-byte sequences 1110xxxx 10xxxxxx * 2
| [\xF0-\xF7][\x80-\xBF]{3} # quadruple-byte sequence 11110xxx 10xxxxxx * 3
){1,100} # ...one or more times
)
| . # anything else
/x
END;
preg_replace($regex, '$1', $text);
It searches for UTF-8 sequences, and captures those into group 1. It also matches single bytes that could not be identified as part of a UTF-8 sequence, but does not capture those. Replacement is whatever was captured into group 1. This effectively removes all invalid bytes.
It is possible to repair the string, by encoding the invalid bytes as UTF-8 characters. But if the errors are random, this could leave some strange symbols.
$regex = <<<'END'
/
(
(?: [\x00-\x7F] # single-byte sequences 0xxxxxxx
| [\xC0-\xDF][\x80-\xBF] # double-byte sequences 110xxxxx 10xxxxxx
| [\xE0-\xEF][\x80-\xBF]{2} # triple-byte sequences 1110xxxx 10xxxxxx * 2
| [\xF0-\xF7][\x80-\xBF]{3} # quadruple-byte sequence 11110xxx 10xxxxxx * 3
){1,100} # ...one or more times
)
| ( [\x80-\xBF] ) # invalid byte in range 10000000 - 10111111
| ( [\xC0-\xFF] ) # invalid byte in range 11000000 - 11111111
/x
END;
function utf8replacer($captures) {
    if ($captures[1] != "") {
        // Valid byte sequence. Return unmodified.
        return $captures[1];
    }
    elseif ($captures[2] != "") {
        // Invalid byte of the form 10xxxxxx.
        // Encode as 11000010 10xxxxxx.
        return "\xC2".$captures[2];
    }
    else {
        // Invalid byte of the form 11xxxxxx.
        // Encode as 11000011 10xxxxxx.
        return "\xC3".chr(ord($captures[3])-64);
    }
}
preg_replace_callback($regex, "utf8replacer", $text);
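For comparison, here is a rough Python 3 sketch of the same repair idea. Both PHP branches amount to reading each invalid byte as its Latin-1 code point, so the sketch registers a decode error handler that does exactly that. The names latin1_fallback and repair_utf8 are made up for illustration. Python's decoder is also stricter than the regex above (it rejects overlong forms and lead bytes above 0xF4), so results can differ on such input.

```python
import codecs


def latin1_fallback(err):
    """Error handler: reinterpret the undecodable bytes as Latin-1 text."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position where decoding resumes.
    return bad.decode("latin-1"), err.end


# Hypothetical handler name, chosen for this example only.
codecs.register_error("latin1_fallback", latin1_fallback)


def repair_utf8(raw: bytes) -> str:
    """Decode UTF-8, mapping each invalid byte to U+0080..U+00FF."""
    return raw.decode("utf-8", errors="latin1_fallback")


print(repair_utf8(b"caf\xe9"))  # café
```

As with the PHP version, random corruption will come out as odd Latin-1 symbols rather than disappear.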
EDIT:
!empty(x) will match non-empty values ("0" is considered empty).
x != "" will match non-empty values, including "0".
x !== "" will match anything except "".
x != "" seems the best one to use in this case.
I have also sped up the match a little. Instead of matching each character separately, it matches sequences of valid UTF-8 characters.
Remove non-UTF8 characters from file contents
Maybe something like this:
with open('text.txt', encoding='utf-8', errors='ignore') as f:
    content = f.read().splitlines()
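To see what errors='ignore' does in practice, here is a small self-contained sketch. It writes a temporary file containing one invalid byte and reads it back. The helper name and the sample contents are assumptions for the demo.

```python
import os
import tempfile


def read_lines_ignoring_bad_bytes(path):
    """Read a file as UTF-8 and return its lines, skipping undecodable bytes."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return f.read().splitlines()


# Demo: b"\xff" is never valid in UTF-8, so it is dropped on read.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"first\xff line\nsecond line\n")
    demo_path = tmp.name
try:
    lines = read_lines_ignoring_bad_bytes(demo_path)
finally:
    os.remove(demo_path)

print(lines)  # ['first line', 'second line']
```

If you would rather see where the bad bytes were, errors='replace' inserts U+FFFD instead of dropping them.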
PHP remove all non UTF-8 characters from string
If I understand correctly, this will do what you want:
$result = preg_replace('/(?:^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$)/u', '', $input);
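Where
\p{L} stands for any character that is a letter (Unicode),
\p{N} stands for any character that is a number (Unicode),
[^\p{L}\p{N}] is a negated character class that matches any character that is not a letter or a digit.
Because the two alternatives are anchored with ^ and $, the pattern strips such characters only from the start and the end of the string.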
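A rough Python analogue, in case it helps to see the trimming behaviour: the standard re module has no \p{L} or \p{N}, so this sketch assumes [\W_] is a close enough stand-in for "not a letter or digit". The function name is hypothetical.

```python
import re

# [\W_] approximates [^\p{L}\p{N}]: \w covers Unicode letters, digits and "_".
EDGE_NON_ALNUM = re.compile(r"^[\W_]+|[\W_]+$")


def trim_non_alnum(text: str) -> str:
    """Strip leading and trailing characters that are not letters or digits."""
    return EDGE_NON_ALNUM.sub("", text)


print(trim_non_alnum("  ¡¡Hola, señor 123!!  "))  # Hola, señor 123
```

Note that punctuation inside the string (the comma above) is left alone. Only the edges are trimmed.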
How do I remove non UTF-8 characters from a String?
We have a few problems.
The biggest is that a Ruby String stores arbitrary bytes along with a supposed encoding, with no guarantee that the bytes are valid in that encoding and with no obvious reason for that encoding to have been chosen. (I might be biased as a heavy user of Python 3. We would never speak of "changing a string from one encoding to another".)
Fortunately, the editor did not eat your post, but it's hard to see that. I'm guessing that you decoded the string as Windows-1252 in order to display it, which only obscures the issue.
Here's your string of bytes as I see it:
>> s = "\x06-~$A\xA7ruG\xF9\"\x9A\f\xB6/K".b
=> "\x06-~$A\xA7ruG\xF9\"\x9A\f\xB6/K"
>> s.bytes
=> [6, 45, 126, 36, 65, 167, 114, 117, 71, 249, 34, 154, 12, 182, 47, 75]
And it does contain bytes that are not valid UTF-8.
>> s.encoding
=> #<Encoding:ASCII-8BIT>
>> String::new(s).force_encoding(Encoding::UTF_8).valid_encoding?
=> false
We can ask to decode this as UTF-8 and insert � where we encounter bytes that are not valid UTF-8:
>> s.encode('utf-8', 'binary', :undef => :replace)
=> "\u0006-~$A�ruG�\"�\f�/K"
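For readers coming from Python 3, the same replacement can be sketched on the byte values listed above (the helper name is made up for illustration):

```python
# The same 16 bytes as s.bytes in the Ruby session.
RAW = bytes([6, 45, 126, 36, 65, 167, 114, 117, 71, 249, 34, 154, 12, 182, 47, 75])


def decode_with_replacement(raw: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for each invalid byte."""
    return raw.decode("utf-8", errors="replace")


decoded = decode_with_replacement(RAW)
print(ascii(decoded))
print(decoded.count("\ufffd"))  # 4
```

The four bytes 167, 249, 154 and 182 each become one U+FFFD, which lines up with the four � in the Ruby output.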
How to remove non UTF-8 characters from text
The signature of gsub is:
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
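Not sure what you wanted to do with gsub("’","‘","",txt), but that line is probably not doing what you want it to do: the third positional argument is x, so the call operates on the empty string "" and passes txt as ignore.case.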
See here for a previous SO question on gsub and non-ascii symbols.
Edit: Suggested solution using iconv:
Removing all non-ASCII characters:
txt <- "’xxx‘"
iconv(txt, "latin1", "ASCII", sub="")
Returns:
[1] "xxx"
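The same "drop everything outside ASCII" idea can be sketched in Python 3 for comparison (the function name is hypothetical):

```python
def strip_non_ascii(text: str) -> str:
    """Remove every character that has no ASCII encoding."""
    return text.encode("ascii", errors="ignore").decode("ascii")


print(strip_non_ascii("’xxx‘"))  # xxx
```

Like the iconv call, this also deletes legitimate accented letters, so it only suits text that is expected to be plain ASCII.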