How to Set the Mechanize Page Encoding

Fix Character encoding of webpage using python Mechanize

The problem is a set of broken HTML comment tags, which make the page invalid so that mechanize's default parser can't read it. You can use the included BeautifulSoup-based parser instead (selected here with RobustFactory), which works in my case (Python 2.7.9, mechanize 0.2.5):

#!/usr/bin/env python
#-*- coding: utf-8 -*-
import mechanize

br = mechanize.Browser(factory=mechanize.RobustFactory())
br.open('http://mspc.bii.a-star.edu.sg/tankp/run_depth.html')
br.select_form(nr=0)
br['pdb_id'] = '1atp'
response = br.submit()

mechanize submit form character encoding problem

OK, found it. BeautifulSoup converts the page to unicode, and prettify() returns UTF-8 by default.
If the page is Latin-1, you should use:

response.set_data(soup.prettify(encoding='latin-1'))
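
The mismatch is easy to reproduce without mechanize or BeautifulSoup. This is a minimal sketch, assuming a page that declares Latin-1; the sample string is made up for illustration:

```python
# Hypothetical sample text; any non-ASCII character shows the effect.
text = u"caf\u00e9"

utf8_bytes = text.encode("utf-8")      # UTF-8, the prettify() default described above
latin1_bytes = text.encode("latin-1")  # what a Latin-1 page expects

# Reading UTF-8 bytes as Latin-1 turns one accented character into two.
mojibake = utf8_bytes.decode("latin-1")

print(repr(utf8_bytes), repr(latin1_bytes), repr(mojibake))
```

Encoding to the charset the page declares keeps the bytes and the declaration consistent, which is what the encoding='latin-1' argument achieves.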

How to fix encoding in Python Mechanize?

Fixed by setting the following, where enc is the encoding to use:

br._factory.encoding = enc
br._factory._forms_factory.encoding = enc
br._factory._links_factory._encoding = enc

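Note the underscores: these are private attributes, and _links_factory uses _encoding where the other two use encoding. Set them after calling br.open().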
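
The answer does not say where enc comes from. One option, sketched here as an assumption rather than as mechanize's own behaviour, is to read the charset from the Content-Type header. The helper name charset_from_content_type and its fallback value are hypothetical:

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, lowercased."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the fallback when no charset is present.
    return msg.get_content_charset(default)


enc = charset_from_content_type("text/html; charset=ISO-8859-1")
print(enc)
```

In mechanize the header value would come from the response headers, which response.info() returns.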

Encoding problem downloading HTML using mechanize and Python 2.6

The response was gzipped, so it has to be decompressed before it is read:

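import gzip

def ungzipResponse(r, b):
    # Decompress a gzipped response in place and hand it back to the browser.
    headers = r.info()
    # .get() avoids a KeyError when the header is absent.
    if headers.get('Content-Encoding') == 'gzip':
        gz = gzip.GzipFile(fileobj=r, mode='rb')
        html = gz.read()
        gz.close()
        headers["Content-type"] = "text/html; charset=utf-8"
        r.set_data(html)
        b.set_response(r)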

response = browser.open(url)
ungzipResponse(response, browser)
html = response.read()
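
The decompression step can be tried on its own with only the standard library. This is a sketch, assuming the headers are available as a plain dict; the function name maybe_gunzip is hypothetical and not part of mechanize:

```python
import gzip
import io


def maybe_gunzip(body, headers):
    """Return body decompressed if the headers mark it as gzip, else unchanged."""
    if headers.get("Content-Encoding", "").lower() != "gzip":
        return body
    with gzip.GzipFile(fileobj=io.BytesIO(body), mode="rb") as gz:
        return gz.read()


# Build a gzipped payload to stand in for a server response.
raw = u"<html>caf\u00e9</html>".encode("utf-8")
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(raw)
compressed = buf.getvalue()

html = maybe_gunzip(compressed, {"Content-Encoding": "gzip"})
print(html == raw)
```

Checking the header first means an uncompressed response passes through untouched, mirroring the if test in ungzipResponse.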

How to get Mechanize to auto-convert body to UTF8?

Since Mechanize 2.0, the arguments passed to pre_connect_hooks() and post_connect_hooks() have changed.

See the Mechanize documentation:

pre_connect_hooks()

A list of hooks to call before retrieving a response. Hooks are called with the agent, the URI, the response, and the response body.

 

post_connect_hooks()

A list of hooks to call after retrieving a response. Hooks are called with the agent, the URI, the response, and the response body.

You can no longer change the internal response body from a hook, because the body is not passed in an array that you could modify in place. The next best approach is to replace the internal parser with your own:

class MyParser
  def self.parse(thing, url = nil, encoding = nil, options = Nokogiri::XML::ParseOptions::DEFAULT_HTML, &block)
    # insert your conversion code here. For example:
    # thing = NKF.nkf("-wm0X", thing).sub(/Shift_JIS/,"utf-8") # you need to rewrite content charset if it exists.
    Nokogiri::HTML::Document.parse(thing, url, encoding, options, &block)
  end
end

agent = Mechanize.new
agent.html_parser = MyParser
page = agent.get('http://somewhere.com/')
...

mechanize: first form works, then unknown GET form encoding type 'utf-8'

How I got past this problem:

I changed the mechanize source and then re-installed it.

Line 3233 of _form.py now reads:

if (self.enctype != "application/x-www-form-urlencoded") and (self.enctype != "utf-8"):

This is probably very wrong and likely only handles my case, but in my specific case it works.


