Python: Converting from ISO-8859-1/Latin-1 to UTF-8

Try decoding it first, then encoding:

apple.decode('iso-8859-1').encode('utf8')
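
A fuller sketch of the round trip, assuming apple holds ISO-8859-1 encoded bytes (the variable name is taken from the snippet above):

```python
# Assumed input: "Äpple" encoded as ISO-8859-1 bytes.
apple = b"\xc4pple"

text = apple.decode('iso-8859-1')   # bytes -> str
utf8_bytes = text.encode('utf-8')   # str -> UTF-8 bytes

print(text)        # Äpple
print(utf8_bytes)  # b'\xc3\x84pple'
```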

How to convert ISO-8859-1 to UTF-8 using Python 3.7.4

decode is a method of the bytes type:

>>> help(bytes.decode)
Help on method_descriptor:

decode(self, /, encoding='utf-8', errors='strict')
    Decode the bytes using the codec registered for encoding.

    encoding
      The encoding with which to decode the bytes.
    errors
      The error handling scheme to use for the handling of decoding errors.
      The default is 'strict' meaning that decoding errors raise a
      UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
      as well as any other name registered with codecs.register_error that
      can handle UnicodeDecodeErrors.

So inputText needs to be of type bytes, not str:

>>> inputText = b"\xC4pple"
>>> inputText.decode('iso-8859-1')
'Äpple'
>>> inputText.decode('iso-8859-1').encode('utf8')
b'\xc3\x84pple'

Note that the result of decode is type str and of encode is type bytes.
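
The same pattern extends to whole files; a minimal sketch, with hypothetical filenames, since open() can do the decoding and encoding for you:

```python
# Hypothetical filenames; a small Latin-1 file is created first so the
# sketch is self-contained.
with open('input.txt', 'wb') as f:
    f.write(b'\xc4pple')                # "Äpple" in ISO-8859-1

with open('input.txt', encoding='iso-8859-1') as src:
    content = src.read()                # decoded to str on read

with open('output.txt', 'w', encoding='utf-8') as dst:
    dst.write(content)                  # encoded to UTF-8 on write
```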

Use iconv or python3 to recode utf-8 to Latin-1 (ISO-8859-1) preserving accented characters

There are two ways to encode the letter á (A WITH ACUTE) in Unicode.

One is to use a single precomposed character, as illustrated here with Python's built-in ascii function:

>>> ascii('á')
"'\\xe1'"

But you can also use a combining accent following an unaccented letter a:

>>> ascii('á')
"'a\\u0301'"

Depending on the displaying application, the two variants may look indistinguishable (in my terminal, the latter looks a bit odd, with the accent rendered too large).

Now, Latin-1 has an accented letter a, but no combining accents, which is why the acute becomes a question mark when encoding with errors="replace".

Fortunately, you can automatically convert between the two variants.
Without going into details (there are many details here), Unicode defines two normalization forms, called composed and decomposed, abbreviated NFC and NFD, respectively.
In Python, you can use the standard-library module unicodedata:

>>> import unicodedata as ud
>>> ascii(ud.normalize('NFD', 'á'))
"'a\\u0301'"
>>> ascii(ud.normalize('NFC', 'á'))
"'\\xe1'"

In your specific case, you can convert the input strings to NFC form, which increases coverage of Latin-1 characters:

>>> n = 'Gonza\u0301lez, M.'
>>> print(n)
González, M.
>>> n.encode('latin1', errors='replace')
b'Gonza?lez, M.'
>>> ud.normalize('NFC', n).encode('latin1', errors='replace')
b'Gonz\xe1lez, M.'

How to Convert from ISO-8859-1 format to UTF-8 and then into hex format string python

This works for me in Python 3.7

a = '戧'
encoded_bytes = a.encode(encoding='utf-8')
print(' '.join([hex(b) for b in encoded_bytes]))

0xe6 0x88 0xa7
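
On Python 3.8+, bytes.hex() also accepts a separator, which gives the hex pairs without the 0x prefix; a small variant (assuming 3.8+):

```python
a = '戧'
encoded_bytes = a.encode('utf-8')

print(encoded_bytes.hex(' '))                         # e6 88 a7
print(' '.join(f'0x{b:02x}' for b in encoded_bytes))  # 0xe6 0x88 0xa7
```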

UTF-8 to ISO-8859-1 encoding: replace special characters with closest equivalent

Since the first 256 code points of Unicode match ISO-8859-1, it is possible to attempt encoding to ISO-8859-1, which will take care of all characters 0 to 255 without errors. For the characters leading to encoding errors, unidecode can be used.
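
That 1:1 correspondence can be checked directly (a quick sanity check, not part of the original answer):

```python
# Every code point 0-255 encodes to exactly that byte value in Latin-1.
assert all(chr(i).encode('iso-8859-1') == bytes([i]) for i in range(256))
print('all 256 code points map 1:1')
```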

The following works on Python 2 and 3:

from builtins import str  # from the python-future package, for Python 2
import codecs
import unidecode

def unidecode_fallback(e):
    # Replace the span that failed to encode with unidecode's closest
    # ASCII approximation, or '?' if unidecode returns nothing.
    part = e.object[e.start:e.end]
    replacement = str(unidecode.unidecode(part) or '?')
    return replacement, e.end

codecs.register_error('unidecode_fallback', unidecode_fallback)

s = u'abcdé–fghijkl'.encode('iso-8859-1', errors='unidecode_fallback')
print(s.decode('iso-8859-1'))

Result:

abcdé-fghijkl

This, however, converts non-ISO-8859-1 characters into an ASCII equivalent; sometimes it may be better to have a non-ASCII, ISO-8859-1 equivalent instead.
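
If installing unidecode is not an option, the standard library also ships error handlers that preserve the information instead of approximating it; 'backslashreplace' and 'namereplace' both work when encoding (a sketch of the trade-off):

```python
s = 'abcd\xe9\u2013fghijkl'   # abcdé–fghijkl

print(s.encode('iso-8859-1', errors='backslashreplace'))
# b'abcd\xe9\\u2013fghijkl'
print(s.encode('iso-8859-1', errors='namereplace'))
# b'abcd\xe9\\N{EN DASH}fghijkl'
```

The result is not pretty, but it is reversible, unlike a '?' or an ASCII lookalike.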
