Python and BeautifulSoup encoding issues
could you try:
r = urllib.urlopen('http://www.elnorte.ec/')
x = BeautifulSoup.BeautifulSoup(r.read())
r.close()
print x.prettify('latin-1')
I get the correct output.
Oh, in this special case you could also use x.__str__(encoding='latin1').
I guess this is because the content is in ISO-8859-1(5) and the meta http-equiv content-type incorrectly says "UTF-8".
Could you confirm?
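One way to confirm the suspicion is to try decoding the raw bytes both ways: bytes that are valid ISO-8859-1 are often invalid UTF-8. A minimal standalone sketch, using b'caf\xe9' as a stand-in for the page's bytes:

```python
# 0xE9 is 'é' in ISO-8859-1/Latin-1; in UTF-8 it starts a 3-byte sequence,
# so with no continuation bytes after it the byte string is invalid UTF-8.
raw = b'caf\xe9'

try:
    raw.decode('utf-8')          # what the (incorrect) meta tag claims
except UnicodeDecodeError:
    print('not valid UTF-8')

print(raw.decode('latin-1'))     # café — decodes cleanly as Latin-1
```

If the real page's bytes behave the same way, the meta charset is lying.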
Python correct encoding of Website (Beautiful Soup)
You are making two mistakes: you are mishandling encoding, and you are treating a result list as something that can safely be converted to a string without loss of information.
First of all, don't use response.text! It is not BeautifulSoup at fault here; you are re-encoding a Mojibake. The requests library will default to Latin-1 encoding for text/* content types when the server doesn't explicitly specify an encoding, because the HTTP standard states that that is the default.
See the Encoding section of the Advanced documentation:
The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.
Bold emphasis mine.
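To make "re-encoding a Mojibake" concrete, here is a minimal standalone sketch of how the garbage arises: UTF-8 bytes decoded with the Latin-1 default.

```python
# 'café' encoded as UTF-8 gives b'caf\xc3\xa9'; decoding those two bytes
# as Latin-1 (the requests fallback) produces the classic mojibake.
raw = 'café'.encode('utf-8')
mojibake = raw.decode('latin-1')

print(mojibake)               # cafÃ©
print(raw.decode('utf-8'))    # café — decoding with the right codec is lossless
```

Every non-ASCII character turns into two (or more) Latin-1 characters, which is exactly the pattern seen in mangled scraper output.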
Pass in the response.content raw data instead:
soup = BeautifulSoup(r.content)
I see that you are using BeautifulSoup 3. You really want to upgrade to BeautifulSoup 4 instead; version 3 was discontinued in 2012 and contains several bugs. Install the beautifulsoup4 project, and use from bs4 import BeautifulSoup.
BeautifulSoup 4 usually does a great job of figuring out the right encoding to use when parsing, either from an HTML <meta> tag or statistical analysis of the bytes provided. If the server does provide a character set, you can still pass it into BeautifulSoup from the response, but do test first whether requests used a default:
encoding = r.encoding if 'charset' in r.headers.get('content-type', '').lower() else None
parser = 'html.parser' # or lxml or html5lib
soup = BeautifulSoup(r.content, parser, from_encoding=encoding)
Last but not least, with BeautifulSoup 4 you can extract all text from a page using soup.get_text():
text = soup.get_text()
print text
You are instead converting a result list (the return value of soup.findAll()) to a string. This can never work, because containers in Python use repr() on each element in the list to produce a debugging string, and for strings that means you get escape sequences for anything that is not a printable ASCII character.
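The repr() behaviour is easy to demonstrate. In Python 3, repr() keeps printable non-ASCII characters (unlike the Python 2 of the original answer), but the same escaping shows up with any non-printable character, such as a newline:

```python
text = 'line one\nline two'
results = [text]             # stand-in for a findAll() result list

print(text)                  # the newline is actually rendered
# str() on a container calls repr() on each element, so the newline
# is shown as the two-character escape sequence \n, wrapped in quotes.
print(str(results))
```

This is why stringifying the whole result list produces escape codes instead of the text you expected.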
Python and BeautifulSoup encoding issue from UTF-8
Your doRequest() function returns a BeautifulSoup object; you cannot decode that object. Just use it directly:
soup = doRequest(request)
You don't need to decode the response at all; BeautifulSoup uses both hints in the HTML (<meta> headers) as well as statistical analysis to determine the correct input encoding.
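A small standalone sketch of that detection, feeding BeautifulSoup raw bytes that are not valid UTF-8 (the exact guessed codec can vary with the detection libraries installed, so treat the printed encoding as illustrative):

```python
from bs4 import BeautifulSoup

# 0xE9 is 'é' in Latin-1/windows-1252 but is not valid UTF-8 on its own,
# so detection should fall back to a Latin-style encoding.
html_bytes = b'<html><body><p>caf\xe9</p></body></html>'

soup = BeautifulSoup(html_bytes, 'html.parser')
print(soup.original_encoding)  # whatever encoding BeautifulSoup guessed
print(soup.p.text)             # café
```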
In this case the HTML document claims it is Latin-1:
<meta name="content-type" content="text/html; charset=iso-8859-1">
The response doesn't include a character set in the Content-Type header either, so this is a case of a misconfigured server. You can force BeautifulSoup to ignore the <meta> header with:
soup = BeautifulSoup(requestResult, from_encoding='utf8')
Encoding problem with Beautiful Soup + Python
This was not a Beautiful Soup issue but an issue with requests.
page = requests.get("https://www.formula1.com/en/drivers/kimi-raikkonen.html")
This was the first line I had inside my scraper, and it was not returning the proper encoding. This solution might be considered hacky, but I just added the following to fix the issue:
page.encoding = 'utf-8'
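Why this works can be shown without hitting the network. The sketch below hand-builds a requests Response object purely for illustration (requests.get() would normally fill these fields in from the real HTTP exchange):

```python
import requests

resp = requests.models.Response()
resp._content = 'Kimi Räikkönen'.encode('utf-8')   # UTF-8 body bytes
resp.headers['Content-Type'] = 'text/html'         # no charset given
resp.encoding = 'ISO-8859-1'                       # what requests falls back to

print(resp.text)          # mojibake: Kimi RÃ¤ikkÃ¶nen

resp.encoding = 'utf-8'   # the one-line fix from the answer above
print(resp.text)          # Kimi Räikkönen
```

Instead of hard-coding 'utf-8', you can also set page.encoding = page.apparent_encoding, which lets requests guess the encoding from the body bytes.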
BeautifulSoup chinese character encoding error
Decode using unicode-escape:
In [6]: from bs4 import BeautifulSoup
In [7]: h = """<h3>\u73af\u5883\u6c61\u67d3\u6700\u5c0f\u5316 \u8d44\u6e90\u5229\u7528\u6700\u5927\u5316</h3>, <h1>\u5929\u6d25\u6ee8\u6d77\u65b0\u533a\uff1a\u697c\u5728\u666f\u4e2d \u5382\u5728\u7eff\u4e2d</h1>, <h2></h2>"""
In [8]: soup = BeautifulSoup(h, 'lxml')
In [9]: print(soup.h3.text.decode("unicode-escape"))
环境污染最小化 资源利用最大化
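The transcript above is Python 2, where str objects have a .decode() method. In Python 3, str has no .decode(), so the equivalent (a sketch, using the first four escapes from the snippet above) round-trips through bytes:

```python
# Literal backslash-u sequences, as they appear in the scraped markup.
raw = r'\u73af\u5883\u6c61\u67d3'

# Encode to bytes first, then interpret the \uXXXX escapes.
text = raw.encode('ascii').decode('unicode-escape')
print(text)  # 环境污染
```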
If you look at the source you can see the data is utf-8 encoded:
<meta http-equiv="content-language" content="utf-8" />
For me, using bs4 4.4.1, just decoding what urllib returns also works fine:
In [1]: from bs4 import BeautifulSoup
In [2]: import urllib
In [3]: url = urllib.urlopen('http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm').read()
In [4]: soup = BeautifulSoup(url.decode("utf-8"), 'lxml')
In [5]: print(soup.h3.text)
环境污染最小化 资源利用最大化
When you are writing to a CSV you will want to encode the data to a UTF-8 str: .decode("unicode-escape").encode("utf-8"). You can do the encoding when you save the data in your dict.
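That manual .encode("utf-8") step is a Python 2 concern. In Python 3 the csv module works with str rows and open() handles the encoding; a minimal sketch (the file path is illustrative):

```python
import csv
import os
import tempfile

rows = [['环境污染最小化', '资源利用最大化']]
path = os.path.join(tempfile.mkdtemp(), 'out.csv')  # illustrative filename

# Let open() do the encoding; csv then writes and reads plain str rows.
with open(path, 'w', encoding='utf-8', newline='') as f:
    csv.writer(f).writerows(rows)

with open(path, encoding='utf-8', newline='') as f:
    print(next(csv.reader(f)))  # ['环境污染最小化', '资源利用最大化']
```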
Encoding Error with Beautiful Soup: Character Maps to Undefined (Python)
In your code you open the file in text mode, but then you attempt to write bytes (str.encode returns bytes), so Python throws an exception:
TypeError: write() argument must be str, not bytes
If you want to write bytes, you should open the file in binary mode.
BeautifulSoup detects the document's encoding (if it is given bytes) and converts it to a string automatically. We can access that encoding with .original_encoding, and use it to encode the content when writing to a file. For example,
soup = BeautifulSoup(b'<tag>ascii characters</tag>', 'html.parser')
data = soup.tag.text
encoding = soup.original_encoding or 'utf-8'
print(encoding)
#ascii
with open('my.file', 'wb+') as file:
    file.write(data.encode(encoding))
In order for this to work, you should pass your html as bytes to BeautifulSoup, so don't decode the response content.
If BeautifulSoup fails to detect the correct encoding for some reason, then you could try a list of possible encodings, like you have done in your code.
data = 'Somé téxt'
encodings = ['ascii', 'utf-8', 'cp1252']
with open('my.file', 'wb+') as file:
    for encoding in encodings:
        try:
            file.write(data.encode(encoding))
            break
        except UnicodeEncodeError:
            print(encoding + ' failed.')
Alternatively, you could open the file in text mode and set the encoding in open() (instead of encoding the content), but note that this option is not available in Python 2.
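A sketch of that alternative in Python 3 (the file path is illustrative; tempfile is used only to keep the example self-contained):

```python
import os
import tempfile

data = 'Somé téxt'
path = os.path.join(tempfile.mkdtemp(), 'my.file')  # stand-in for the real path

# Text mode with an explicit encoding: open() encodes, so we write str, not bytes.
with open(path, 'w', encoding='utf-8') as file:
    file.write(data)

with open(path, encoding='utf-8') as file:
    print(file.read())  # Somé téxt
```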