How to Correctly Parse UTF-8 Encoded HTML to Unicode Strings with BeautifulSoup

How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?

As justhalf points out above, my question here is essentially a duplicate of this question.

The HTML content reported itself as UTF-8 encoded and, for the most part, it was, except for one or two rogue invalid UTF-8 characters.

This apparently confuses BeautifulSoup about which encoding is in use. When first decoding the content as UTF-8 before passing it to BeautifulSoup, like this:

soup = BeautifulSoup(response.read().decode('utf-8'))

I would get the error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 186812-186813: 
invalid continuation byte

Looking more closely at the output, there was an instance of the character Ü which was wrongly encoded as the invalid byte sequence 0xe3 0x9c, rather than the correct 0xc3 0x9c.

As the currently highest-rated answer on that question suggests, the invalid UTF-8 characters can be removed while parsing, so that only valid data is passed to BeautifulSoup:

soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))
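
For illustration, here is a minimal sketch (the byte string is made up to mimic the rogue sequence described above) showing why the strict decode fails and what the 'ignore' handler does with it:

raw = b'F\xe3\x9cR'   # made-up bytes: 0xe3 0x9c where 0xc3 0x9c (the UTF-8 for Ü) was intended

try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)                          # ... invalid continuation byte
print(raw.decode('utf-8', 'ignore'))  # -> FR, the invalid bytes are simply dropped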

Python correct encoding of Website (Beautiful Soup)

You are making two mistakes: you are mishandling the encoding, and you are treating a result list as something that can safely be converted to a string without loss of information.

First of all, don't use response.text! It is not BeautifulSoup that is at fault here; you are re-encoding a mojibake. The requests library will default to Latin-1 encoding for text/* content types when the server doesn't explicitly specify an encoding, because the HTTP standard states that that is the default.

See the Encoding section of the Advanced documentation:

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

Bold emphasis mine.
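
You can see this fallback in action with a short sketch (assuming r is a requests.Response for such a page):

# If the Content-Type header is text/* with no charset parameter, requests
# falls back to ISO-8859-1 rather than guessing from the body:
print(r.headers.get('content-type'))  # e.g. 'text/html' with no charset=
print(r.encoding)                     # 'ISO-8859-1' in that case
print(r.apparent_encoding)            # the guess made from the raw bytes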

Pass in the response.content raw data instead:

soup = BeautifulSoup(r.content)

I see that you are using BeautifulSoup 3. You really want to upgrade to BeautifulSoup 4 instead; version 3 was discontinued in 2012 and contains several bugs. Install the beautifulsoup4 project and use from bs4 import BeautifulSoup.

BeautifulSoup 4 usually does a great job of figuring out the right encoding to use when parsing, either from an HTML <meta> tag or from statistical analysis of the bytes provided. If the server does declare a character set, you can still pass it into BeautifulSoup from the response, but do test first whether requests fell back to a default:

encoding = r.encoding if 'charset' in r.headers.get('content-type', '').lower() else None
parser = 'html.parser' # or lxml or html5lib
soup = BeautifulSoup(r.content, parser, from_encoding=encoding)
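
For what it's worth, the detection described above is done by BeautifulSoup's UnicodeDammit helper, which you can also call directly; a small sketch (the sample bytes are invented):

from bs4 import UnicodeDammit

# Invented sample: valid UTF-8 bytes containing Ü, with no declared encoding.
dammit = UnicodeDammit(b'<p>F\xc3\x9cR</p>')
print(dammit.original_encoding)  # the encoding bs4 guessed, typically 'utf-8' here
print(dammit.unicode_markup)     # the markup decoded to a Unicode string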

Last but not least, with BeautifulSoup 4, you can extract all text from a page using soup.get_text():

text = soup.get_text()
print text

You are instead converting a result list (the return value of soup.findAll()) to a string. This can never work, because containers in Python use repr() on each element to produce a debugging string, and for strings that means you get escape sequences for anything that is not a printable ASCII character.
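
To see that difference in isolation, here is a small Python 2 sketch (matching the Python 2 flavour of this answer) with made-up data:

# Converting a list to a string goes through repr() of each element,
# so Unicode strings show up as escape sequences instead of readable text.
chunks = [u'\u00dcber', u'caf\u00e9']

print str(chunks)        # [u'\xdcber', u'caf\xe9']  -- escape sequences
print u' '.join(chunks)  # Über café                 -- readable text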

Python and BeautifulSoup encoding issues

Could you try:

r = urllib.urlopen('http://www.elnorte.ec/')
x = BeautifulSoup.BeautifulSoup(r.read())
r.close()

print x.prettify('latin-1')

I get the correct output.
Oh, and in this special case you could also use x.__str__(encoding='latin1').

I guess this is because the content is in ISO-8859-1(5) and the meta http-equiv content-type incorrectly says "UTF-8".

Could you confirm?

Issue with parsing special characters in a utf-8 encoded page with bs4

The requests library takes a strict approach to the decoding of web pages. On the other hand, BeautifulSoup has powerful tools for determining the encoding of text. So it's better to pass the raw response from the request to BeautifulSoup, and let BeautifulSoup try to determine the encoding.

>>> r = requests.get('https://www.registreentreprises.gouv.qc.ca/RQEntrepriseGRExt/GR/GR99/GR99A2_05A_PIU_AfficherMessages_PC/ActiEcon.html')
>>> soup = BeautifulSoup(r.content, 'lxml')
>>> list_all_domain = soup.find_all('th')
>>> [e.get_text() for e in list_all_domain]
['Agriculture', "Services relatifs à l'agriculture", 'Pêche et piégeage', ...]

How to avoid printing utf-8 characters in BeautifulSoup with replace_with

There is no need to get into encoding here. You can replace the text content of a Beautiful Soup element by setting the element.string as follows:

from bs4 import BeautifulSoup

html_sample = """
<!DOCTYPE html>
<html><head lang="en"><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1"></head>
<body>
<div class="date">LAST UPDATE</div>
</body>
</html>
"""

soup = BeautifulSoup(html_sample)
forecast = soup.find("div", {"class": "date"})
forecast.string = 'TEST'
print(soup.prettify())

Output

<!DOCTYPE html>
<html>
<head lang="en">
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
</head>
<body>
<div class="date">
TEST
</div>
</body>
</html>
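
Since the question asks about replace_with, note that the same edit can be made by replacing the existing text node directly; a small sketch along the lines of the example above:

# Equivalent approach: replace the NavigableString inside the div instead of
# assigning to .string (same soup object as in the example above).
forecast = soup.find("div", {"class": "date"})
forecast.string.replace_with("TEST")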

BeautifulSoup parser and Cyrillic characters

I found this:

import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen('http://python.org/') as response:
    html = response.read()

soup = BeautifulSoup(html.decode('utf-8', 'ignore').encode("utf-8"))

From:

How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?

Also:

Delete every non utf-8 symbols from string

Hope it helps ;)

BeautifulSoup 4 converting HTML entities to unicode, but getting junk characters when using print

The problem has nothing to do with bs4, or HTML entities, or anything else. You could reproduce the exact same behavior on most Windows systems with a one-liner program that prints the same character that is coming out as garbage here, like this:

print u'\u2019'.encode('UTF-8')

The problem here is that, as on the vast majority of Windows systems (and on nothing else anyone uses in 2013), your default character set is not UTF-8 but something like CP1252.

So, when you encode your Unicode strings to UTF-8 and print those bytes to the console, the console interprets them as CP1252. Which, in this case, means you get â€™ instead of ’.

Changing fonts won't help. The UTF-8 encoding of \u2019 is the three bytes \xe2, \x80, and \x99, and the CP1252 meaning of those three bytes is â, €, and ™.
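
You can reproduce that mojibake directly; a quick sketch:

# Encode U+2019 as UTF-8, then interpret the same three bytes as CP1252,
# which is exactly what the console is doing here.
raw = u'\u2019'.encode('utf-8')   # b'\xe2\x80\x99'
print(raw.decode('cp1252'))       # -> â€™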

If you want to encode manually for the console, you need to encode to the right character set, the one your console actually uses. You may be able to get that as sys.stdout.encoding.

Of course you may get an exception trying to encode things for the right character set, because 8-bit character sets like CP1252 can only handle about 240 of the 110K characters in Unicode. The only way to handle that is to use the errors argument to encode to either ignore them or replace them with replacement characters.
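
Putting that together, a minimal sketch of encoding for whatever the console reports (Python 2, as in this answer):

import sys

# Encode for the console's actual character set, replacing anything the
# codec can't represent (unmapped characters come out as '?').
text = u'curly quote: \u2019, dash: \u2013'
console_encoding = sys.stdout.encoding or 'ascii'
print text.encode(console_encoding, 'replace')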

Meanwhile, if you haven't read the Unicode HOWTO, you really need to. Especially if you plan to stick with Python 2.x and Windows.


If you're wondering why a few command-line programs seem to be able to get around these problems: Microsoft's solution to the character set problem is to create a whole parallel set of APIs that use 16-bit characters instead of 8-bit, and those APIs always use UTF-16. Unfortunately, many things, like the portable stdio wrappers that Microsoft provides for talking to the console and that Python 2.x relies on, only have the 8-bit API. Which means the problem isn't solved at all. Python 3.x no longer uses those wrappers, and there have been recurring discussions on making some future version talk UTF-16 to the console. But even if that happens in 3.4 (which seems very unlikely), that won't help you as long as you're using 2.x.


