Python: Converting from ISO-8859-1/latin1 to UTF-8
Try decoding it first, then encoding (apple must be a byte string: str in Python 2, bytes in Python 3):
apple.decode('iso-8859-1').encode('utf8')
How to convert ISO-8859-1 to UTF-8 using Python 3.7.4
decode is a member of the bytes type:
>>> help(bytes.decode)
Help on method_descriptor:

decode(self, /, encoding='utf-8', errors='strict')
    Decode the bytes using the codec registered for encoding.

    encoding
      The encoding with which to decode the bytes.
    errors
      The error handling scheme to use for the handling of decoding errors.
      The default is 'strict' meaning that decoding errors raise a
      UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
      as well as any other name registered with codecs.register_error that
      can handle UnicodeDecodeErrors.
So inputText needs to be of type bytes, not str:
>>> inputText = b"\xC4pple"
>>> inputText.decode('iso-8859-1')
'Äpple'
>>> inputText.decode('iso-8859-1').encode('utf8')
b'\xc3\x84pple'
Note that the result of decode is of type str, and the result of encode is of type bytes.
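As a minimal sketch of the same decode-then-encode step wrapped for reuse: the helper names latin1_to_utf8 and recode_file below are hypothetical, not part of any library, and the file variant assumes the whole input really is ISO-8859-1.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8 bytes."""
    if not isinstance(data, (bytes, bytearray)):
        # In Python 3, str has no decode(); fail with a clear message.
        raise TypeError("expected bytes, got " + type(data).__name__)
    return data.decode("iso-8859-1").encode("utf-8")


def recode_file(src_path: str, dst_path: str) -> None:
    """Copy a text file, reading ISO-8859-1 and writing UTF-8."""
    # newline="" leaves the original line endings untouched.
    with open(src_path, encoding="iso-8859-1", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


print(latin1_to_utf8(b"\xc4pple"))  # b'\xc3\x84pple'
```

Every byte value is valid ISO-8859-1, so the decode step never raises; a file in a different encoding would be converted silently into the wrong characters.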
Use iconv or python3 to recode utf-8 to Latin-1 (ISO-8859-1) preserving accented characters
There are two ways to encode A WITH ACUTE ACCENT in Unicode.
One is to use a single precomposed character, as illustrated here with Python's built-in ascii function:
>>> ascii('á')
"'\\xe1'"
But you can also use a combining accent following an unaccented letter a:
>>> ascii('a\u0301')
"'a\\u0301'"
Depending on the displaying application, the two variants may look indistinguishable (in my terminal, the latter looks a bit odd, with the accent being too large).
Now, Latin-1 has an accented letter a (á), but no combining accents, so that's why the acute becomes a question mark when encoding with errors="replace".
Fortunately, you can automatically switch between the two variants.
Without going into details (there are many details here), Unicode defines two normalization forms that matter here, called composed and decomposed, abbreviated NFC and NFD, respectively.
In Python, you can use the standard-library module unicodedata:
>>> import unicodedata as ud
>>> ascii(ud.normalize('NFD', 'á'))
"'a\\u0301'"
>>> ascii(ud.normalize('NFC', 'á'))
"'\\xe1'"
In your specific case, you can convert the input strings to NFC form, which will increase coverage of Latin-1 characters:
>>> n = 'Gonza\u0301lez, M.'
>>> print(n)
González, M.
>>> n.encode('latin1', errors='replace')
b'Gonza?lez, M.'
>>> ud.normalize('NFC', n).encode('latin1', errors='replace')
b'Gonz\xe1lez, M.'
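The normalize-then-encode step can be sketched as a small helper. The name to_latin1 is hypothetical; the example also shows that the two spellings compare unequal until normalized, and that NFC only helps when a precomposed character exists.

```python
import unicodedata as ud


def to_latin1(text: str, errors: str = "replace") -> bytes:
    """Compose characters (NFC) first, then encode as ISO-8859-1."""
    return ud.normalize("NFC", text).encode("latin-1", errors=errors)


composed = "\u00e1"      # precomposed LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # letter a followed by COMBINING ACUTE ACCENT

# The two spellings render alike but are different strings
# until they are normalized to the same form.
print(composed == decomposed)                       # False
print(ud.normalize("NFC", decomposed) == composed)  # True

print(to_latin1("Gonza\u0301lez, M."))  # b'Gonz\xe1lez, M.'

# No precomposed "x with acute" exists, so NFC cannot help here and
# the combining accent is still replaced.
print(to_latin1("x\u0301"))             # b'x?'
```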
How to Convert from ISO-8859-1 format to UTF-8 and then into hex format string python
This works for me in Python 3.7:
a = '戧'
encoded_bytes = a.encode(encoding='utf-8')
print(' '.join([hex(b) for b in encoded_bytes]))
# Output: 0xe6 0x88 0xa7
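If the starting point is ISO-8859-1 bytes rather than a str, the full chain might look like the sketch below. The function name latin1_to_utf8_hex is hypothetical. It uses zero-padded formatting, since hex() drops leading zeros (hex(10) gives '0xa'), which matters when the hex string has to be parsed again.

```python
def latin1_to_utf8_hex(data: bytes, sep: str = " ") -> str:
    """Decode ISO-8859-1 bytes, re-encode as UTF-8, return hex digits."""
    utf8 = data.decode("iso-8859-1").encode("utf-8")
    # "02x" pads every byte to two lowercase hex digits.
    return sep.join(format(b, "02x") for b in utf8)


print(latin1_to_utf8_hex(b"\xc4pple"))      # c3 84 70 70 6c 65
print(latin1_to_utf8_hex(b"\xc4pple", ""))  # c38470706c65

# For a plain unseparated string, bytes.hex() does the same job.
print("戧".encode("utf-8").hex())            # e688a7
```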
UTF-8 to ISO-8859-1 encoding: replace special characters with closest equivalent
Since the first 256 code points of Unicode match ISO-8859-1, it is possible to attempt encoding to ISO-8859-1, which will take care of all characters 0 to 255 without errors. For the characters leading to encoding errors, unidecode can be used.
The following works on Python 2 and 3 (unidecode is a third-party package; on Python 2, builtins comes from the future package):
import codecs
from builtins import str

import unidecode

def unidecode_fallback(e):
    # Transliterate only the slice that failed to encode.
    part = e.object[e.start:e.end]
    replacement = str(unidecode.unidecode(part) or '?')
    return (replacement, e.start + len(part))

codecs.register_error('unidecode_fallback', unidecode_fallback)

s = u'abcdé–fghijkl'.encode('iso-8859-1', errors='unidecode_fallback')
print(s.decode('iso-8859-1'))
Result (a character with no unidecode transliteration comes out as ?):
abcdé-fgh?ijkl
This, however, converts non-ISO-8859-1 characters into an ASCII equivalent, while sometimes it may be better to have a non-ASCII, ISO-8859-1 equivalent.