u'\ufeff' in Python string

The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell the difference between big- and little-endian UTF-16 encoding. If you decode the web page using the right codec, Python will remove it for you. Examples:

#!python2
#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8') # encode without BOM
e8s = u.encode('utf-8-sig') # encode with BOM
e16 = u.encode('utf-16') # encode with BOM
e16le = u.encode('utf-16le') # encode without BOM
e16be = u.encode('utf-16be') # encode without BOM
print 'utf-8 %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16 %r' % e16
print 'utf-16le %r' % e16le
print 'utf-16be %r' % e16be
print
print 'utf-8 w/ BOM decoded with utf-8 %r' % e8s.decode('utf-8')
print 'utf-8 w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16 %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le %r' % e16.decode('utf-16le')

Note that EF BB BF is a UTF-8-encoded BOM. It is not required for UTF-8, but serves only as a signature (usually on Windows).

Output:

utf-8     'ABC'
utf-8-sig '\xef\xbb\xbfABC'
utf-16    '\xff\xfeA\x00B\x00C\x00' # adds a BOM and encodes using the native processor endianness
utf-16le  'A\x00B\x00C\x00'
utf-16be  '\x00A\x00B\x00C'

utf-8 w/ BOM decoded with utf-8     u'\ufeffABC' # doesn't remove the BOM if present
utf-8 w/ BOM decoded with utf-8-sig u'ABC'       # removes the BOM if present
utf-16 w/ BOM decoded with utf-16   u'ABC'       # *requires* the BOM to be present
utf-16 w/ BOM decoded with utf-16le u'\ufeffABC' # treats the BOM as a normal character

Note that the utf-16 codec requires the BOM to be present; without it, Python can't tell whether the data is big- or little-endian.
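For comparison, here is a Python 3 sketch of the same round trips; the stdlib codecs module also exposes the BOM byte sequences as constants:

```python
import codecs

u = 'ABC'  # Python 3: str is already Unicode, no u prefix needed

# The BOM byte sequences are available as constants
assert codecs.BOM_UTF8 == b'\xef\xbb\xbf'
assert codecs.BOM_UTF16_LE == b'\xff\xfe'
assert codecs.BOM_UTF16_BE == b'\xfe\xff'

# utf-8-sig prepends the BOM when encoding and strips it when decoding
assert u.encode('utf-8-sig') == codecs.BOM_UTF8 + b'ABC'
assert u.encode('utf-8-sig').decode('utf-8') == '\ufeffABC'    # plain utf-8 keeps it
assert u.encode('utf-8-sig').decode('utf-8-sig') == 'ABC'      # utf-8-sig removes it

# utf-16 writes a BOM and uses it when decoding; utf-16le/utf-16be do not
assert u.encode('utf-16le') == b'A\x00B\x00C\x00'
assert u.encode('utf-16').decode('utf-16') == 'ABC'
```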

How to get rid of \ufeff in parsed html page

I've tried your code using Python 3.6.1 with a simple str.replace(u'\ufeff', '') and it seems to work.

Code tested:

import os
from bs4 import BeautifulSoup

os.system('wget -q -O "boroughs.html" "https://en.wikipedia.org/wiki/List_of_London_boroughs"')

with open('boroughs.html', encoding='utf-8-sig') as fp:
    soup = BeautifulSoup(fp, "lxml")

data = []
table = soup.find("table", { "class" : "wikitable sortable" })
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [col.text.strip() for col in cols]
    data.append([col.replace(u'\ufeff', '') for col in cols])
print(data)

Output before replace:

[[], ['Barking and Dagenham [note 1]', '', '', 'Barking and Dagenham
London Borough Council', 'Labour', 'Town Hall, 1 Town Square',
'13.93', '194,352', '51°33′39″N 0°09′21″E\ufeff / \ufeff51.5607°N
0.1557°E\ufeff / 51.5607; 0.1557\ufeff (Barking and Dagenham)', '25'], ... ]

Output after replace:

[[], ['Barking and Dagenham [note 1]', '', '', 'Barking and Dagenham
London Borough Council', 'Labour', 'Town Hall, 1 Town Square',
'13.93', '194,352', '51°33′39″N 0°09′21″E / 51.5607°N0.1557°E /
51.5607; 0.1557 (Barking and Dagenham)', '25'], ... ]
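Note that these remaining U+FEFF characters sit in the middle of the cell text (they come from Wikipedia's coordinate markup, where U+FEFF acts as a zero-width no-break space), so opening the file with encoding='utf-8-sig' only strips a leading BOM and can't catch them. The per-cell cleanup in isolation, with a shortened sample value:

```python
# A cell value as extracted from the table (shortened for the example)
cell = '51°33′39″N 0°09′21″E\ufeff / \ufeff51.5607°N 0.1557°E'

clean = cell.replace('\ufeff', '')
assert '\ufeff' not in clean
assert clean == '51°33′39″N 0°09′21″E / 51.5607°N 0.1557°E'
```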

UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to undefined

Your data is encoded with the "UTF-8-SIG" codec, which is sometimes used in Microsoft environments.

This variant of UTF-8 prefixes encoded text with a byte order mark '\xef\xbb\xbf', to make it easier for applications to detect UTF-8 encoded text vs other encodings.

You can decode such bytestrings like this:

>>> bs = b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
>>> text = bs.decode('utf-8-sig')
>>> print(text)
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām

To read such data from a file:

with open('myfile.txt', 'r', encoding='utf-8-sig') as f:
    text = f.read()

Note that even after decoding from UTF-8-SIG, you may still be unable to print your data, because your console's default code page may not be able to encode other non-ASCII characters in it. In that case you will need to adjust your console settings to support UTF-8.
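If you just need to see the data, one stopgap (an addition of mine, not part of the original answer) is to encode with errors='replace', which substitutes '?' for characters the target code page can't represent instead of raising UnicodeEncodeError:

```python
text = 'pudgala-dharma-nairātmyayoḥ'

# cp1252, a common Windows console code page, has no 'ā' or 'ḥ';
# errors='replace' turns each unmappable character into '?'
safe = text.encode('cp1252', errors='replace').decode('cp1252')
print(safe)  # pudgala-dharma-nair?tmyayo?
```

On Python 3.7+ you can instead call sys.stdout.reconfigure(encoding='utf-8') to switch the output stream itself to UTF-8.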

How to remove this special character?

U+FEFF is the BYTE ORDER MARK character, which should only occur at the start of a document; anywhere else in a document it is treated as a ZERO WIDTH NO-BREAK SPACE. If it causes issues, you can remove it like any other character:

>>> s = u'word1 \ufeffword2'
>>> s = s.replace(u'\ufeff', '')
>>> s
u'word1 word2'

(In Python 3, drop the u in front of the strings; in Python 3.0–3.2 you must, since the u'' prefix was only re-added in Python 3.3.)
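A leading BOM (a file signature) and an embedded U+FEFF can also be handled separately — a sketch, in Python 3 syntax:

```python
s = '\ufeffword1 \ufeffword2'

# Strip only a leading BOM, leaving embedded occurrences alone
assert s.lstrip('\ufeff') == 'word1 \ufeffword2'

# Or remove every occurrence, as in the answer above
assert s.replace('\ufeff', '') == 'word1 word2'
```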

Unknown value keeps getting included in unique vector

First, the fix: specify encoding='utf-8-sig' when reading the file.

Now, the explanation:

\ufeff is the Unicode BOM (Byte Order Mark) character. Whenever one tool writes a file with a BOM, and another tool reads the file using an explicit encoding like UTF-16-LE instead of a BOM-switching version like UTF-16, the BOM is treated as a normal character, so \ufeff shows up in your string. Outside of Microsoft-land, this specific issue (reading UTF-16 as UTF-16-LE) is by far the most common version of this problem.

But if one of the tools is from Microsoft, it's more commonly UTF-8. The Unicode standard recommends never using a BOM with UTF-8 (because bytes don't need a byte-order mark), but doesn't quite forbid it, so many Microsoft tools keep doing it. Every other tool, including Python (and Pandas), then reads the file as UTF-8 without a BOM, and an extra \ufeff shows up in the text. (Older, non-Unicode-friendly tools will read the same three bytes \xef\xbb\xbf as something like ï»¿, which you may have seen a few times.)
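That mojibake is easy to reproduce: decoding the three BOM bytes with a legacy 8-bit codec such as cp1252 yields ï»¿, while decoding them correctly yields the single invisible character:

```python
bom = b'\xef\xbb\xbf'

# Each byte maps to its own character in cp1252, so the BOM becomes ï»¿
assert bom.decode('cp1252') == 'ï»¿'

# The same bytes decoded as UTF-8 are one invisible character, U+FEFF
assert bom.decode('utf-8') == '\ufeff'
```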

But while Python (and Pandas) defaults to UTF-8, it does let you specify an encoding manually, and one of the bundled codecs is called UTF-8-sig, which means "UTF-8, with a (useless) BOM at the start": it strips a leading BOM when decoding and writes one when encoding.
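A quick way to see the difference, using a throwaway temp file to stand in for output from a Microsoft tool:

```python
import os
import tempfile

# Simulate a file written by a Microsoft tool: UTF-8 with a leading BOM
fd, path = tempfile.mkstemp(suffix='.csv')
os.close(fd)
with open(path, 'w', encoding='utf-8-sig') as f:
    f.write('name\nAlice\n')

# Read as plain UTF-8: the BOM survives as \ufeff on the first line
with open(path, encoding='utf-8') as f:
    assert f.readline() == '\ufeffname\n'

# Read as utf-8-sig: the BOM is stripped transparently
with open(path, encoding='utf-8-sig') as f:
    assert f.readline() == 'name\n'

os.remove(path)
```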


