u'\ufeff' in Python string
The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell the difference between big- and little-endian UTF-16 encoding. If you decode the web page using the right codec, Python will remove it for you. Examples:
#!python2
#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8') # encode without BOM
e8s = u.encode('utf-8-sig') # encode with BOM
e16 = u.encode('utf-16') # encode with BOM
e16le = u.encode('utf-16le') # encode without BOM
e16be = u.encode('utf-16be') # encode without BOM
print 'utf-8 %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16 %r' % e16
print 'utf-16le %r' % e16le
print 'utf-16be %r' % e16be
print
print 'utf-8 w/ BOM decoded with utf-8 %r' % e8s.decode('utf-8')
print 'utf-8 w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16 %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le %r' % e16.decode('utf-16le')
Note that EF BB BF is a UTF-8-encoded BOM. It is not required for UTF-8 and serves only as a signature (usually on Windows).
Output:
utf-8 'ABC'
utf-8-sig '\xef\xbb\xbfABC'
utf-16 '\xff\xfeA\x00B\x00C\x00' # Adds BOM and encodes using native processor endian-ness.
utf-16le 'A\x00B\x00C\x00'
utf-16be '\x00A\x00B\x00C'
utf-8 w/ BOM decoded with utf-8 u'\ufeffABC' # doesn't remove BOM if present.
utf-8 w/ BOM decoded with utf-8-sig u'ABC' # removes BOM if present.
utf-16 w/ BOM decoded with utf-16 u'ABC' # uses the BOM to pick the byte order, then removes it.
utf-16 w/ BOM decoded with utf-16le u'\ufeffABC' # doesn't remove BOM if present.
Note that the utf-16 codec relies on the BOM to tell whether the data is big- or little-endian. If the BOM is missing, Python falls back to the platform's native byte order, which may not match the data.
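The example above is Python 2. As a rough Python 3 equivalent, the sketch below builds BOM-prefixed bytes by hand using the standard `codecs.BOM_UTF16_LE` and `codecs.BOM_UTF8` constants, so the result does not depend on the machine's byte order. The variable names are only illustrative.

```python
import codecs

text = "ABC"

# Build little-endian UTF-16 with an explicit BOM so the bytes are the
# same on every platform.
utf16_with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
utf8_with_bom = text.encode("utf-8-sig")

# The generic codec reads the BOM, picks the byte order, and drops the BOM.
decoded_generic = utf16_with_bom.decode("utf-16")
# A codec with a fixed byte order keeps the BOM as an ordinary U+FEFF.
decoded_explicit = utf16_with_bom.decode("utf-16-le")

print(repr(decoded_generic))   # 'ABC'
print(repr(decoded_explicit))  # '\ufeffABC'
print(utf8_with_bom.startswith(codecs.BOM_UTF8))  # True
```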
How to get rid of \ufeff in parsed html page
I've tried your code using Python 3.6.1 with a simple str.replace(u'\ufeff', '') and it seems to work.
Code tested:
import os
from bs4 import BeautifulSoup
os.system('wget -q -O "boroughs.html" "https://en.wikipedia.org/wiki/List_of_London_boroughs"')
with open('boroughs.html', encoding='utf-8-sig') as fp:
    soup = BeautifulSoup(fp,"lxml")
data = []
table = soup.find("table", { "class" : "wikitable sortable" })
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [col.text.strip() for col in cols]
    data.append([col.replace(u'\ufeff', '') for col in cols])
print(data)
Output before replace:
[[], ['Barking and Dagenham [note 1]', '', '', 'Barking and Dagenham
London Borough Council', 'Labour', 'Town Hall, 1 Town Square',
'13.93', '194,352', '51°33′39″N 0°09′21″E\ufeff / \ufeff51.5607°N
0.1557°E\ufeff / 51.5607; 0.1557\ufeff (Barking and Dagenham)', '25'], ... ]
Output after replace:
[[], ['Barking and Dagenham [note 1]', '', '', 'Barking and Dagenham
London Borough Council', 'Labour', 'Town Hall, 1 Town Square',
'13.93', '194,352', '51°33′39″N 0°09′21″E / 51.5607°N0.1557°E /
51.5607; 0.1557 (Barking and Dagenham)', '25'], ... ]
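One plausible reason the replace is still needed even though the file is opened with utf-8-sig: that codec only drops a BOM at the very start of the data, while the U+FEFF characters in the output sit in the middle of the text. A small sketch with made-up sample bytes (not the actual page content):

```python
# Hypothetical sample: one leading BOM plus one U+FEFF embedded in the text.
raw = "\ufeff0.1557\u00b0E\ufeff / 51.5607".encode("utf-8")

# utf-8-sig strips only the BOM at the start of the data.
decoded = raw.decode("utf-8-sig")
# str.replace removes the embedded occurrences as well.
cleaned = decoded.replace("\ufeff", "")

print(repr(decoded))
print(repr(cleaned))
```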
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to undefined
Your data is encoded with the "UTF-8-SIG" codec, which is sometimes used in Microsoft environments.
This variant of UTF-8 prefixes encoded text with a byte order mark '\xef\xbb\xbf', to make it easier for applications to detect UTF-8 encoded text vs other encodings.
You can decode such bytestrings like this:
>>> bs = b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
>>> text = bs.decode('utf-8-sig')
>>> print(text)
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
To read such data from a file:
with open('myfile.txt', 'r', encoding='utf-8-sig') as f:
    text = f.read()
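If you are not sure whether a file carries the signature, utf-8-sig is still a safe choice for reading, because it also decodes plain UTF-8 without a BOM. The sketch below checks this with two temporary files. The helper name `read_utf8_any` and the file names are hypothetical.

```python
import os
import tempfile

def read_utf8_any(path):
    """Read UTF-8 text, dropping a leading BOM if one is present."""
    with open(path, "r", encoding="utf-8-sig") as f:
        return f.read()

with tempfile.TemporaryDirectory() as tmp:
    with_bom = os.path.join(tmp, "with_bom.txt")
    without_bom = os.path.join(tmp, "without_bom.txt")
    # Write raw bytes so we control exactly whether the BOM is there.
    with open(with_bom, "wb") as f:
        f.write(b"\xef\xbb\xbfhello")
    with open(without_bom, "wb") as f:
        f.write(b"hello")
    text_with = read_utf8_any(with_bom)
    text_without = read_utf8_any(without_bom)

print(text_with == text_without == "hello")  # True
```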
Note that even after decoding from UTF-8-SIG, you may still be unable to print your data because your console's default code page may not be able to encode other non-ascii characters in the data. In that case you will need to adjust your console settings to support UTF-8.
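If changing the console settings is not an option, one workaround is to escape whatever the console encoding cannot represent before printing. This is only a sketch: the helper `safe_for_console` is hypothetical, and "cp1252" stands in for whatever code page your console actually uses.

```python
import sys

def safe_for_console(text, encoding=None):
    """Escape characters that the target encoding cannot represent."""
    # Fall back to the current stdout encoding, then to ASCII.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

sample = "\ufeffnair\u0101tmya"
# Assumed console code page for the demonstration.
printable = safe_for_console(sample, "cp1252")
print(printable)  # \ufeffnair\u0101tmya
```

On Python 3.7 and later, calling `sys.stdout.reconfigure(encoding='utf-8')` or setting the `PYTHONIOENCODING` environment variable are other options, provided the terminal itself can display UTF-8.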
How to remove this special character?
U+FEFF is the Byte Order Mark character, which should only occur at the start of a document. Anywhere else in a document, it should be treated as a ZERO WIDTH NON-BREAKING SPACE. If this causes issues, you can remove it like any other character:
>>> s = u'word1 \ufeffword2'
>>> s = s.replace(u'\ufeff', '')
>>> s
u'word1 word2'
(In Python 3.0 to 3.2, drop the u in front of strings; Python 3.3 and later accept the prefix again.)
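If you only want to drop a BOM at the start of the text and keep any U+FEFF that appears later, a prefix check is enough. A minimal Python 3 sketch, with hypothetical helper names:

```python
BOM = "\ufeff"

def strip_leading_bom(text):
    """Remove a single U+FEFF at the start of the text, if present."""
    return text[len(BOM):] if text.startswith(BOM) else text

def strip_all_bom(text):
    """Remove every U+FEFF, wherever it appears."""
    return text.replace(BOM, "")

s = "\ufeffword1 \ufeffword2"
print(repr(strip_leading_bom(s)))  # 'word1 \ufeffword2'
print(repr(strip_all_bom(s)))      # 'word1 word2'
```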
Unknown value keeps getting included in unique vector
First, the fix: Specify encoding='UTF-8-sig' when reading the file.
Now, the explanation:
\ufeff is the Unicode BOM (Byte Order Mark) character. Whenever one tool writes a file with a BOM, and another tool reads the file using an explicit encoding like UTF-16-LE instead of a BOM-switching version like UTF-16, the BOM is treated as a normal character, so \ufeff shows up in your string. Outside of Microsoft-land, this specific issue (reading UTF-16 as UTF-16-LE) is by far the most common version of this problem.
But if one of the tools is from Microsoft, it's more commonly UTF-8. The Unicode standard recommends never using a BOM with UTF-8 (because bytes don't need a byte-order mark), but doesn't quite forbid it, so many Microsoft tools keep doing it. And then every other tool, including Python (and Pandas), just reads it as UTF-8 without a BOM, causing an extra \ufeff to show up. (Older, non-Unicode-friendly tools will read the same three bytes \xef\xbb\xbf as something like ï»¿, which you may have seen a few times.)
But while Python (and Pandas) defaults to UTF-8, it does let you specify an encoding manually, and one of the encodings it comes with is called UTF-8-sig, which means UTF-8 with a useless BOM at the start.
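To see the effect on column names without needing Pandas, the sketch below parses the same BOM-prefixed bytes with the standard csv module under both encodings. The sample data and the helper `header_names` are made up for illustration. `pandas.read_csv` takes the same `encoding` argument.

```python
import csv
import io

# Hypothetical CSV bytes as a BOM-writing tool might produce them.
raw = b"\xef\xbb\xbfname,value\r\nalpha,1\r\n"

def header_names(data, encoding):
    """Decode the bytes with the given encoding and return the header row."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return next(csv.reader(text))

print(header_names(raw, "utf-8"))      # ['\ufeffname', 'value']
print(header_names(raw, "utf-8-sig"))  # ['name', 'value']
```

With plain UTF-8 the first column is literally named '\ufeffname', so lookups by 'name' fail and the stray value shows up among the unique entries.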