Python and BeautifulSoup Encoding Issues

Python and BeautifulSoup encoding issues

could you try:

r = urllib.urlopen('http://www.elnorte.ec/')
x = BeautifulSoup.BeautifulSoup(r.read())
r.close()

print x.prettify('latin-1')

I get the correct output.
Oh, in this special case you could also call x.__str__(encoding='latin-1').

I guess this is because the content is in ISO-8859-1(5) and the meta http-equiv content-type incorrectly says "UTF-8".

Could you confirm?
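That guess can be checked without BeautifulSoup at all. Here is a stdlib-only sketch (the MetaCharsetFinder class is my own illustration, not part of any library) that reads the declared <meta> charset and then shows that the declaration can contradict the actual bytes:

```python
# Stdlib-only sketch: compare the charset a page declares in <meta> with
# what its bytes actually are. MetaCharsetFinder is a made-up name.
from html.parser import HTMLParser

class MetaCharsetFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta':
            return
        d = dict(attrs)
        if 'charset' in d:                                   # <meta charset="...">
            self.charset = d['charset']
        elif d.get('http-equiv', '').lower() == 'content-type':
            content = d.get('content', '')
            if 'charset=' in content:
                self.charset = content.split('charset=')[1].strip()

# A page that claims UTF-8 but ends with the ISO-8859-1 byte for 'é' (0xE9)
raw = b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\xe9'
finder = MetaCharsetFinder()
finder.feed(raw.decode('latin-1'))   # latin-1 maps every byte, so feeding never fails
print(finder.charset)                # UTF-8  -- the declared encoding
print(raw[-1:].decode('latin-1'))    # é      -- but the byte is really Latin-1
```

Trying raw.decode('utf-8') on such a page raises UnicodeDecodeError, which is exactly the "meta says UTF-8 but content is ISO-8859-1" situation described above.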

Python correct encoding of Website (Beautiful Soup)

You are making two mistakes; you are mis-handling encoding, and you are treating a result list as something that can safely be converted to a string without loss of information.

First of all, don't use response.text! BeautifulSoup is not at fault here; you are re-encoding a Mojibake. The requests library defaults to Latin-1 for text/* content types when the server doesn't explicitly specify an encoding, because the HTTP standard states that this is the default.

See the Encoding section of the Advanced documentation:

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

Bold emphasis mine.

Pass in the response.content raw data instead:

soup = BeautifulSoup(r.content)

I see that you are using BeautifulSoup 3. You really want to upgrade to BeautifulSoup 4 instead; version 3 was discontinued in 2012 and contains several bugs. Install the beautifulsoup4 project, and use from bs4 import BeautifulSoup.

BeautifulSoup 4 usually does a great job of figuring out the right encoding to use when parsing, either from an HTML <meta> tag or statistical analysis of the bytes provided. If the server does provide a character set, you can still pass this into BeautifulSoup from the response, but do test first whether requests used a default:

encoding = r.encoding if 'charset' in r.headers.get('content-type', '').lower() else None
parser = 'html.parser' # or lxml or html5lib
soup = BeautifulSoup(r.content, parser, from_encoding=encoding)

Last but not least, with BeautifulSoup 4, you can extract all text from a page using soup.get_text():

text = soup.get_text()
print text

You are instead converting a result list (the return value of soup.findAll()) to a string. This can never work, because containers in Python use repr() on each element in the list to produce a debugging string, and for strings that means you get escape sequences for anything that is not a printable ASCII character.
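The effect is easy to reproduce in isolation. A minimal Python 3 illustration with byte strings (the Python 2 behaviour for plain str was analogous):

```python
# str() on a container calls repr() on each element; for byte strings that
# means escape sequences instead of the characters they encode.
items = ['caf\xe9'.encode('utf-8')]     # [b'caf\xc3\xa9'], the UTF-8 bytes of 'café'
print(str(items))                       # shows \xc3\xa9 escapes, not 'é'
print(b''.join(items).decode('utf-8'))  # café -- decode first, then print
```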

Python and BeautifulSoup encoding issue from UTF-8

Your doRequest() function returns a BeautifulSoup object, you cannot decode that object. Just use it directly:

soup = doRequest(request)

You don't need to decode the response at all; BeautifulSoup uses both hints in the HTML (<meta> headers) as well as statistical analysis to determine the correct input encoding.

In this case the HTML document claims it is Latin-1:

<meta name="content-type" content="text/html; charset=iso-8859-1">

The response doesn't include a character set in the Content-Type header either, so this is a case of a misconfigured server. You can force BeautifulSoup to ignore the <meta> header with:

soup = BeautifulSoup(requestResult, from_encoding='utf8')
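As a side note, whether the server declared a charset at all can be checked with the standard library alone; this sketch uses email.message.Message, whose header parsing matches the Content-Type syntax HTTP uses:

```python
# Checking a Content-Type header for a charset parameter (stdlib only)
from email.message import Message

msg = Message()
msg['Content-Type'] = 'text/html'          # the misconfigured server's header
print(msg.get_content_charset())           # None -- nothing declared

msg.replace_header('Content-Type', 'text/html; charset=iso-8859-1')
print(msg.get_content_charset())           # iso-8859-1
```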

Encoding problem with Beautiful Soup + Python

This was not a Beautiful Soup issue but an issue with requests.

page = requests.get("https://www.formula1.com/en/drivers/kimi-raikkonen.html")

This was the first line inside my scraper, and it was not returning the content in the proper encoding. This solution might be considered hacky, but I just added the following to fix the issue:

page.encoding = 'utf-8'
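It works because the server sends UTF-8 bytes without declaring a charset, so requests falls back to ISO-8859-1 and produces Mojibake. A minimal sketch of the mechanism (the driver name is just sample data):

```python
# UTF-8 bytes decoded with the Latin-1 default produce Mojibake.
raw = 'Räikkönen'.encode('utf-8')   # what the server actually sends
print(raw.decode('iso-8859-1'))     # RÃ¤ikkÃ¶nen -- the Latin-1 default
print(raw.decode('utf-8'))          # Räikkönen   -- after page.encoding = 'utf-8'
```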

BeautifulSoup chinese character encoding error

decode using unicode-escape:

In [6]: from bs4 import BeautifulSoup

In [7]: h = """<h3>\u73af\u5883\u6c61\u67d3\u6700\u5c0f\u5316 \u8d44\u6e90\u5229\u7528\u6700\u5927\u5316</h3>, <h1>\u5929\u6d25\u6ee8\u6d77\u65b0\u533a\uff1a\u697c\u5728\u666f\u4e2d \u5382\u5728\u7eff\u4e2d</h1>, <h2></h2>"""

In [8]: soup = BeautifulSoup(h, 'lxml')

In [9]: print(soup.h3.text.decode("unicode-escape"))
环境污染最小化 资源利用最大化

If you look at the source you can see the data is utf-8 encoded:

<meta http-equiv="content-language" content="utf-8" />

For me, using bs4 4.4.1, just decoding what urllib returns works fine as well:

In [1]: from bs4 import BeautifulSoup

In [2]: import urllib

In [3]: url = urllib.urlopen('http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm').read()

In [4]: soup = BeautifulSoup(url.decode("utf-8"), 'lxml')

In [5]: print(soup.h3.text)
环境污染最小化 资源利用最大化

When you are writing to a csv you will want to encode the data to a utf-8 str:

 .decode("unicode-escape").encode("utf-8")

You can do the encode when you save the data in your dict.
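Note that the snippets above are Python 2; in Python 3 a str has no .decode() method. A rough equivalent, assuming the scraped text contains literal backslash-u sequences:

```python
# Python 3 equivalent of .decode("unicode-escape") on literal \uXXXX text
s = '\\u73af\\u5883'                                  # the characters \u73af\u5883
decoded = s.encode('ascii').decode('unicode-escape')
print(decoded)                                        # 环境
csv_bytes = decoded.encode('utf-8')                   # utf-8 bytes for a csv writer
```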

Encoding Error with Beautiful Soup: Character Maps to Undefined (Python)

In your code you open the file in text mode, but then you attempt to write bytes (str.encode returns bytes) and so Python throws an exception:

TypeError: write() argument must be str, not bytes

If you want to write bytes, you should open the file in binary mode.

BeautifulSoup detects the document’s encoding (if it is bytes) and converts it to a string automatically. We can access the encoding with .original_encoding, and use it to encode the content when writing to a file. For example,

soup = BeautifulSoup(b'<tag>ascii characters</tag>', 'html.parser')
data = soup.tag.text
encoding = soup.original_encoding or 'utf-8'
print(encoding)
#ascii

with open('my.file', 'wb+') as file:
    file.write(data.encode(encoding))

In order for this to work you should pass your html as bytes to BeautifulSoup, so don't decode the response content.

If BeautifulSoup fails to detect the correct encoding for some reason, then you could try a list of possible encodings, like you have done in your code.

data = 'Somé téxt'
encodings = ['ascii', 'utf-8', 'cp1252']

with open('my.file', 'wb+') as file:
    for encoding in encodings:
        try:
            file.write(data.encode(encoding))
            break
        except UnicodeEncodeError:
            print(encoding + ' failed.')

Alternatively, you could open the file in text mode and set the encoding in open() (instead of encoding the content), but note that this option is not available in Python 2.
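A sketch of that text-mode alternative (the file name is just an example):

```python
# Python 3: open in text mode and let open() do the encoding.
import os
import tempfile

data = 'Somé téxt'
path = os.path.join(tempfile.gettempdir(), 'my.file')
with open(path, 'w', encoding='utf-8') as f:
    f.write(data)
with open(path, encoding='utf-8') as f:
    print(f.read())   # Somé téxt
```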


