Fix Character encoding of webpage using python Mechanize
Your problem is caused by some broken HTML comment tags, which make the page invalid HTML that mechanize's default parser can't read. You can use the included BeautifulSoup parser instead (selected here with RobustFactory), which works in my case (Python 2.7.9, mechanize 0.2.5):
#!/usr/bin/env python
#-*- coding: utf-8 -*-
import mechanize
br = mechanize.Browser(factory=mechanize.RobustFactory())
br.open('http://mspc.bii.a-star.edu.sg/tankp/run_depth.html')
br.select_form(nr=0)
br['pdb_id'] = '1atp'
response = br.submit()
mechanize submit form character encoding problem
OK, found it. BeautifulSoup converts the document to Unicode, and prettify() returns UTF-8 by default.
You should use:
response.set_data(soup.prettify(encoding='latin-1'))
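To illustrate why the target encoding matters, here is a minimal sketch using only the standard library (no BeautifulSoup or mechanize involved). The sample string and variable names are made up for illustration. It shows that the same text serializes to different bytes under UTF-8 and Latin-1, so a consumer expecting Latin-1 will misread UTF-8 output.

```python
# Illustration only: the same text yields different bytes per encoding.
text = u"caf\u00e9"  # hypothetical sample containing one accented letter

utf8_bytes = text.encode("utf-8")      # accented letter becomes two bytes
latin1_bytes = text.encode("latin-1")  # accented letter becomes one byte

# Reading UTF-8 bytes as if they were Latin-1 produces garbled text
garbled = utf8_bytes.decode("latin-1")

print(repr(utf8_bytes), repr(latin1_bytes), repr(garbled))
```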
How to fix encoding in Python Mechanize?
Fixed by setting the following after br.open() (note the underscores, which differ between the attributes):
br._factory.encoding = enc
br._factory._forms_factory.encoding = enc
br._factory._links_factory._encoding = enc
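If you need this in more than one place, the three assignments could be wrapped in a small helper. This is only a sketch: force_encoding is a hypothetical name, and it assumes the browser object exposes exactly the private attributes named in the answer above. These are internals, so they may differ between mechanize versions.

```python
def force_encoding(br, enc):
    """Set the encoding on a mechanize Browser's internal factories.

    Assumption: `br` has the private attributes used in the answer above;
    they are internals and may not exist in every mechanize version.
    """
    factory = br._factory
    factory.encoding = enc
    factory._forms_factory.encoding = enc
    factory._links_factory._encoding = enc  # note the leading underscore
    return br
```

It would be called right after br.open(), for example force_encoding(br, 'latin-1').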
Encoding problem downloading HTML using mechanize and Python 2.6
The response was gzipped, so it has to be decompressed before reading:
def ungzipResponse(r, b):
    headers = r.info()
    # .get() is safe when the Content-Encoding header is absent
    if headers.get('Content-Encoding') == 'gzip':
        import gzip
        gz = gzip.GzipFile(fileobj=r, mode='rb')
        html = gz.read()
        gz.close()
        headers["Content-type"] = "text/html; charset=utf-8"
        r.set_data(html)
        b.set_response(r)

response = browser.open(url)
ungzipResponse(response, browser)
html = response.read()
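The decompression step can be tried on its own, without mechanize or a network connection. The following is a minimal sketch in Python 3 syntax using only the standard library. The helper name maybe_gunzip and the sample data are assumptions for illustration, not part of mechanize.

```python
import gzip
import io


def maybe_gunzip(body, content_encoding):
    """Return the decompressed body if it is gzip-encoded, else unchanged."""
    if (content_encoding or "").strip().lower() != "gzip":
        return body
    with gzip.GzipFile(fileobj=io.BytesIO(body), mode="rb") as gz:
        return gz.read()


# Round-trip demo with locally compressed sample data
original = u"<html>caf\u00e9</html>".encode("utf-8")
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(original)
compressed = buf.getvalue()

restored = maybe_gunzip(compressed, "gzip")
```

Passing the header value in separately keeps the helper independent of any particular response class.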
How to get Mechanize to auto-convert body to UTF8?
Since Mechanize 2.0, the arguments of pre_connect_hooks() and post_connect_hooks() have changed.
See the Mechanize documentation:
pre_connect_hooks()
A list of hooks to call before retrieving a response. Hooks are called with the agent, the URI, the response, and the response body.
post_connect_hooks()
A list of hooks to call after retrieving a response. Hooks are called with the agent, the URI, the response, and the response body.
You can no longer change the internal response body from a hook, because the argument is not an array. So the next best way is to replace the internal parser with your own:
class MyParser
  def self.parse(thing, url = nil, encoding = nil, options = Nokogiri::XML::ParseOptions::DEFAULT_HTML, &block)
    # insert your conversion code here. For example:
    # thing = NKF.nkf("-wm0X", thing).sub(/Shift_JIS/,"utf-8") # you need to rewrite content charset if it exists.
    Nokogiri::HTML::Document.parse(thing, url, encoding, options, &block)
  end
end

agent = Mechanize.new
agent.html_parser = MyParser
page = agent.get('http://somewhere.com/')
...
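The conversion step hinted at in the commented NKF line (transcode the body, then rewrite the declared charset) is not specific to Ruby. Here is a rough sketch of the same idea in Python, matching the other examples on this page. The function name to_utf8 and the sample markup are hypothetical, and it mirrors only the transcoding and charset rewrite, not the other NKF options.

```python
import re


def to_utf8(raw, source_encoding="shift_jis"):
    """Transcode raw bytes to UTF-8 and update a declared charset, if any."""
    text = raw.decode(source_encoding, errors="replace")
    # Rewrite the first charset declaration so it matches the new encoding
    text = re.sub(r"(?i)(charset\s*=\s*[\"']?)[\w-]+", r"\1utf-8", text, count=1)
    return text.encode("utf-8")


# Hypothetical sample page encoded as Shift_JIS
sample = u'<meta charset="Shift_JIS"><p>\u65e5\u672c\u8a9e</p>'.encode("shift_jis")
converted = to_utf8(sample)
```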
mechanize: first form works, then unknown GET form encoding type 'utf-8'
How I got past this problem: I changed the mechanize source and re-installed it. At line 3233 of _form.py:
if (self.enctype != "application/x-www-form-urlencoded") and (self.enctype != "utf-8"):
This is probably very wrong and likely only handles my case, but in my specific case it works.