Backporting Python 3 open(encoding=utf-8) to Python 2

1. To get an encoding parameter in Python 2:

If you only need to support Python 2.6 and 2.7, you can use io.open instead of open. io is the new I/O subsystem for Python 3, and it exists in Python 2.6 and 2.7 as well. Please be aware that in Python 2.6 (as well as 3.0) it's implemented purely in Python and very slow, so if you need speed when reading files, it's not a good option (see the sketch below).

If you need speed, and you need to support Python 2.6 or earlier, you can use codecs.open instead. It also has an encoding parameter, and is quite similar to io.open except it handles line-endings differently.
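
For reference, a minimal sketch of both options; the filename here is just a placeholder:

import io
import codecs

# io.open is the same function as the built-in open on Python 3
with io.open('example.txt', 'r', encoding='utf-8') as f:
    text = f.read()   # unicode on Python 2, str on Python 3

# codecs.open also takes an encoding, but leaves line endings untranslated
with codecs.open('example.txt', 'r', encoding='utf-8') as f:
    text = f.read()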

2. To get a Python 3 open()-style file object that streams bytestrings:

open(filename, 'rb')

Note the 'b', meaning 'binary'.
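
For example, a quick sketch of streaming the raw bytes and decoding them yourself (the filename is a placeholder):

with open('example.txt', 'rb') as f:
    raw = f.read()               # bytes on Python 3, str (a bytestring) on Python 2
    text = raw.decode('utf-8')   # decode explicitly when you need text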

Is it possible for str.encode(encoding='utf-8', errors='strict') to raise UnicodeError?

Yes, it's possible:

import six

content = ''.join(map(chr, range(0x110000)))
if isinstance(content, six.string_types):
    content = content.encode(encoding='utf-8', errors='strict')

Result (using Python 3.7.4):

Traceback (most recent call last):
File ".code.tio", line 5, in <module>
content = content.encode(encoding='utf-8', errors='strict')
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 55296-57343: surrogates not allowed

And UnicodeEncodeErrors are UnicodeErrors.
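
For reference, a quick check of the hierarchy on Python 3 (the snippet is just an illustration):

>>> issubclass(UnicodeEncodeError, UnicodeError)
True
>>> try:
...     u'\ud800'.encode('utf-8')          # a lone surrogate cannot be encoded
... except UnicodeError as exc:
...     print(type(exc).__name__)
...
UnicodeEncodeError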

How am I supposed to fix this utf-8 encoding error?

When trying to untangle a string that contains doubly encoded sequences which were intended to be escape sequences (i.e. \\ instead of \), the special text encoding codec unicode_escape may be used to turn them back into the expected characters for further processing. However, given that the input is already of type str, it first needs to be turned into bytes - assuming the entire string consists of valid ASCII code points, ascii may be the codec for that initial conversion of the str input into bytes. The utf8 codec may be used should there be standard Unicode code points represented inside the str, as the unicode_escape sequences wouldn't affect those code points. Examples:

>>> broken_string = 'La funci\\xc3\\xb3n est\\xc3\\xa1ndar datetime.'
>>> broken_string2 = 'La funci\\xc3\\xb3n estándar datetime.'
>>> broken_string.encode('ascii').decode('unicode_escape')
'La funciÃ³n estÃ¡ndar datetime.'
>>> broken_string2.encode('utf8').decode('unicode_escape')
'La funciÃ³n estÃ¡ndar datetime.'

Given that the unicode_escape codec effectively assumes latin1 when decoding, this intermediate string may simply be encoded back to bytes using the latin1 codec after decoding, before turning that back into a unicode str type through the utf8 (or whatever the appropriate target) codec:

>>> broken_string.encode('ascii').decode('unicode_escape').encode('latin1').decode('utf8')
'La función estándar datetime.'
>>> broken_string2.encode('utf8').decode('unicode_escape').encode('latin1').decode('utf8')
'La función estándar datetime.'

As requested, an addendum to clarify the partially mangled string. Note that attempting to encode broken_string2 using the ascii codec will not work, due to the presence of the unescaped á character.

>>> broken_string2.encode('ascii').decode('unicode_escape').encode('latin1').decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 21: ordinal not in range(128)

'encoding' is an invalid keyword argument for this function

If you are using Python 2, then try:

from io import open

with open(filename + '.txt', 'a+', encoding='utf-8') as f:
    for tweet in list_of_tweets:
        print(tweet.text.replace('\r', '').replace('\n', '') + '|')
        f.write(tweet.text.replace('\r', '').replace('\n', '') + '|')

The built-in open in Python 2 does not accept an encoding argument.

Switching to Python 3 causing UnicodeDecodeError

Python 3 decodes text files when reading and encodes them when writing. The default encoding is taken from locale.getpreferredencoding(False), which evidently for your setup returns 'ASCII'. See the open() function documentation:

In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
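
You can verify what your platform reports, and hence what Python will use by default; the exact value depends on your locale settings:

import locale
print(locale.getpreferredencoding(False))   # e.g. 'UTF-8', or an ASCII locale on a misconfigured system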

Instead of relying on a system setting, you should open your text files using an explicit codec:

currentFile = open(filename, 'rt', encoding='latin1')

where you set the encoding parameter to match the file you are reading.

Python 3 uses UTF-8 as the default only for source code; that default does not apply to reading and writing files.

The same applies to writing to a writable text file; data written will be encoded, and if you rely on the system encoding you are liable to get UnicodeEncodeError exceptions unless you explicitly set a suitable codec. Which codec to use when writing depends on what text you are writing and what you plan to do with the file afterward.
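
As a sketch, with a placeholder filename, writing with an explicit codec looks like this:

with open('output.txt', 'w', encoding='utf-8') as out:   # 'output.txt' is a placeholder
    out.write(u'Voilà\n')                                 # encoded as UTF-8, independent of the locale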

You may want to read up on Python 3 and Unicode in the Unicode HOWTO, which explains both about source code encoding and reading and writing Unicode data.

Exact equivalent of `b'...'.decode('utf-8', 'backslashreplace')` in Python 2

I attempted a more complete backport of the CPython implementation.

This handles both UnicodeDecodeError (from .decode()) as well as UnicodeEncodeError from .encode() and UnicodeTranslateError from .translate():

from __future__ import unicode_literals

import codecs

def _bytes_repr(c):
    """py2: bytes, py3: int"""
    if not isinstance(c, int):
        c = ord(c)
    return '\\x{:x}'.format(c)

def _text_repr(c):
    d = ord(c)
    if d >= 0x10000:
        return '\\U{:08x}'.format(d)
    else:
        return '\\u{:04x}'.format(d)

def backslashescape_backport(ex):
    s, start, end = ex.object, ex.start, ex.end
    c_repr = _bytes_repr if isinstance(ex, UnicodeDecodeError) else _text_repr
    return ''.join(c_repr(c) for c in s[start:end]), end

codecs.register_error('backslashescape_backport', backslashescape_backport)

print(b'\xc2\xa1\xa1after'.decode('utf-8', 'backslashescape_backport'))
print(u'\u2603'.encode('latin1', 'backslashescape_backport'))

with path.open('r', encoding=utf-8) as file: AttributeError: 'generator' object has no attribute 'open'

Instead of:

with path.open('r', encoding="utf-8") as file:
    tree = etree.parse(file)

You can pass a filename (string) directly to parse:

tree = etree.parse(path)

path in your example is a string so it doesn't have an open function.

Maybe you meant:

with open(path, 'r', encoding="utf-8") as file:
    tree = etree.parse(file)

If you are trying to find XML file names in the current directory:

[f for f in os.listdir('.') if f.endswith('.xml')]

Opening a file with universal newlines in binary mode in python 3

TL;DR: Use ASCII with surrogate escapes on Python 3:

def text_open(*args, **kwargs):
    return open(*args, encoding='ascii', errors='surrogateescape', **kwargs)

The recommended approach if you know only a partial encoding (e.g. ASCII \r and \n) is to use surrogate escapes for unknown code points:

What can you do if you need to make a change to a file, but don't know the file's encoding? If you know the encoding is ASCII-compatible and only want to examine or modify the ASCII parts, you can open the file with the surrogateescape error handler:

This uses reserved placeholders to embed the unknown bytes in your text stream. For example, the byte b'\x99' becomes the "unicode" code point '\udc99'. This works for both reading and writing, allowing you to preserve arbitrary embedded data.
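
A small round trip showing that behaviour:

>>> b'\x99'.decode('ascii', errors='surrogateescape')
'\udc99'
>>> '\udc99'.encode('ascii', errors='surrogateescape')
b'\x99'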

The common line endings (\n, \r, \r\n) are all well-defined in ASCII. It is thus sufficient to use ASCII encoding with surrogate escapes.

For compatibility code, it is easiest to provide separate Python 2 and Python 3 versions of the divergent functionality. open is sufficiently similar that for most use cases, you just need to insert the surrogate escape handling.

import sys

if sys.version_info[0] == 3:
    def text_open(*args, **kwargs):
        return open(*args, encoding='ascii', errors='surrogateescape', **kwargs)
else:
    text_open = open

This allows using universal newlines without knowing the exact encoding. You can use this to directly read or transcribe files:

with text_open(IN_PATH, 'rU') as in_csv:
    with text_open(OUT_PATH, 'w') as out_csv:
        for line in in_csv:
            out_csv.write(line)

If you need the fuller parsing offered by the csv module, the text stream provided by text_open is sufficient for that as well. To handle non-ASCII delimiters/padding/quotes, translate them from a bytestring to the appropriate surrogates:

if sys.version_info[0] == 3:
    def surrogate_escape(symbol):
        return symbol.decode(encoding='ascii', errors='surrogateescape')
else:
    surrogate_escape = lambda x: x

Dezimeter = surrogate_escape(b'\xA9\x87')
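
For instance, a hypothetical CSV whose delimiter is the single non-ASCII byte b'\xa9' could then be parsed through the same text stream. The filename and delimiter byte below are made up for illustration; note that the csv module only accepts one-character delimiters, and one escaped byte maps to exactly one surrogate code point:

import csv

delimiter = surrogate_escape(b'\xa9')        # one byte -> one surrogate code point on Python 3
with text_open('data.csv', 'r') as stream:   # 'data.csv' is a placeholder
    for row in csv.reader(stream, delimiter=delimiter):
        print(row)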

