How to Use Mechanize to Process JavaScript

How do I use Mechanize to process JavaScript?

If you need to handle pages with JavaScript, try WATIR or Selenium: both drive a real web browser and can therefore handle any JavaScript. WATIR Classic requires either IE or Firefox with a particular extension installed, and you will see the pages flash on the screen as it works.

Your other option is to work out what the JavaScript on the offending page does and bypass it manually, but that can be onerous.

How to properly use mechanize to scrape AJAX sites

From the mechanize FAQ: "mechanize does not provide any support for JavaScript". It then elaborates on your options (none of which are great).

If you've got something working then great, but Selenium WebDriver is a much better solution for scraping AJAX sites than mechanize.
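For instance, a minimal Selenium sketch for an AJAX-heavy page can wait for the dynamically loaded content instead of grabbing the raw HTML. The URL and element id below are placeholders, not taken from the question:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox()
browser.get("http://example.com/ajax-page")  # hypothetical AJAX-driven page

# wait until the JavaScript has filled in the element we care about
element = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, "results"))
)
print(element.text)
browser.quit()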

How to emulate a browser with JavaScript support via Mechanize?

For Linux

First of all, I know some people don't just want a suggestion to switch to another tool. However, if you want to access the page fully after logging in (which currently fails due to the lack of JavaScript support), you should look into using Selenium.

You can grab it with a quick sudo pip install selenium.

Accessing a webpage is as easy as declaring your browser, then telling it to go to the desired page. Below is a basic sample that makes your browser visit a page; the page I'm using relies heavily on JavaScript:

import selenium
from selenium import webdriver

try:
    browser = webdriver.Firefox()
    browser.get('http://mikekus.com')  # webdriver needs the full URL, including the scheme
except KeyboardInterrupt:
    browser.quit()

This works because Selenium actually opens a browser. However, you may wish to hide the browser so you don't have to see it sitting in your taskbar.

In that case I recommend the following setup using pyvirtualdisplay, which hides the browser with visible=0. Note that pyvirtualdisplay is a wrapper for Xvfb, so you need to install that as well; you can get it with sudo apt-get install xvfb:

import selenium
from selenium import webdriver
from pyvirtualdisplay import Display

try:
    # start the virtual display first so the browser opens inside it, invisibly
    display = Display(visible=0, size=(800, 600))
    display.start()
    browser = webdriver.Firefox()
    browser.get('http://mikekus.com')
except KeyboardInterrupt:
    browser.quit()
    display.stop()

I will leave filling in login forms, etc. to you, as it's quite simple if you read the docs, as everyone should: Navigating With Selenium.
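As a rough illustration, a minimal sketch of filling a login form with Selenium might look like the following. The URL, field names, and credentials are placeholders (not taken from the question), and it uses the older find_element_by_name style API to match the rest of this answer:

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://example.com/login')  # hypothetical login page

# locate the form fields by name (adjust the selectors to the real page)
browser.find_element_by_name('username').send_keys('my_user')
browser.find_element_by_name('password').send_keys('my_pass')
browser.find_element_by_name('submit').click()

print(browser.page_source)  # the page as rendered after logging in
browser.quit()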

Granted, in your situation you are trying to access the proxy and then another site through it. This method implies directing the proxy to the target webpage from the proxy's own page, by filling in the fields on that page. I'm sure that with a bit of time and research you could continue navigating through multiple pages and page elements.

I hope this helps. Good luck.

Mechanize + Python: how to follow a link in a simple javascript?

I solved it this way:

import mechanize
import cookielib
import urllib2

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

...

# follow the first link, then request the second page with the first one as Referer,
# reusing the same cookie jar (the addresses below are placeholders)
br.follow_link(url="http://www.address1.com")
refe = br.geturl()
req = urllib2.Request(url='http://www.site2.com')
req.add_header('Referer', refe)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
f = opener.open(req)
htm = f.read()
print "\n\n", htm

Python mechanize javascript

If you know the station IDs, it is easier to POST the request yourself:

import mechanize
import urllib

post_url = 'http://as0.mta.info/mnr/fares/get_fares.cfm'

orig = 295 #BEACON FALLS
dest = 292 #ANSONIA

params = urllib.urlencode({'dest_stat':dest, 'orig_stat':orig })
rq = mechanize.Request(post_url, params)

fares_page = mechanize.urlopen(rq)

print fares_page.read()

If you have the code to find the list of destination IDs for a given starting ID (i.e. a variant of refillList()), you can then run this request for each combination:

import mechanize
import urllib, urllib2
from bs4 import BeautifulSoup

url = 'http://as0.mta.info/mnr/fares/choosestation.cfm'
post_url = 'http://as0.mta.info/mnr/fares/get_fares.cfm'

def get_fares(orig, dest):
    params = urllib.urlencode({'dest_stat': dest, 'orig_stat': orig})
    rq = mechanize.Request(post_url, params)

    fares_page = mechanize.urlopen(rq)
    print(fares_page.read())

pool = BeautifulSoup(urllib2.urlopen(url).read())

# let's keep our stations organised
stations = {}

# dict by station id
for option in pool.find('select', {'name': 'orig_stat'}).findChildren():
    stations[option['value']] = {'name': option.string}

# iterate over all routes
for origin in stations:
    destinations = get_list_of_dests(origin)  # use your code for this
    stations[origin]['dests'] = destinations

    for destination in destinations:
        print('Processing from %s to %s' % (origin, destination))
        get_fares(origin, destination)

Can mechanize support ajax / filling out forms via javascript?

AJAX calls are performed by JavaScript, and mechanize has no way to run JavaScript. Mechanize only looks at the form fields on a static HTML page and lets you fill in and submit those. This is why your research is pointing you towards tools like Selenium or Ghost, which drive a real browser that can execute JavaScript.

There is a simpler way to do this, though! If you use your browser's developer tools (e.g. the Network tab in Firefox or Chrome) and fill out the form, you can see the request your browser makes behind the scenes, even with AJAX:

[Screenshot: Network tab in Firefox]

This tells you:

  • The browser made a POST request
  • To this URL: https://interactive.web.insurance.ca.gov/survey/survey?type=homeownerSurvey&event=HOMEOWNERS
  • With the following form params:

    • location=ALAMEDA+ALAMEDA
    • coverageType=HOMEOWNERS
    • coverageAmount=150000
    • homeAge=New

You can use this information to make the same POST request in Python:

import urllib.parse, urllib.request

url = "https://interactive.web.insurance.ca.gov/survey/survey?type=homeownerSurvey&event=HOMEOWNERS"
data = urllib.parse.urlencode(dict(
    location="ALAMEDA ALAMEDA",
    coverageType="HOMEOWNERS",
    coverageAmount="150000",
    homeAge="New",
))
res = urllib.request.urlopen(url, data.encode("utf8"))

print(res.read())

This is Python 3. The requests library provides an even nicer API for making HTTP requests.
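For comparison, a roughly equivalent sketch using requests (assuming requests is installed; this block is not part of the original answer) might look like this:

import requests

url = "https://interactive.web.insurance.ca.gov/survey/survey?type=homeownerSurvey&event=HOMEOWNERS"
payload = {
    "location": "ALAMEDA ALAMEDA",
    "coverageType": "HOMEOWNERS",
    "coverageAmount": "150000",
    "homeAge": "New",
}

# requests handles the form encoding and Content-Type header for you
res = requests.post(url, data=payload)
print(res.text)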


Edit: In response to your three questions:

is it possible for the dictionary that you've created to have more than 1 location and cycle through them using a for loop?

Yes, just add a loop around the code and pass a different value for location each time. I would put this code into a function to make the code cleaner, like this:

https://gist.github.com/lost-theory/08786e3a27c8d8ce3839
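A minimal sketch of that idea (my own illustration, not necessarily what the linked gist contains; the extra location values are hypothetical):

import urllib.parse, urllib.request

url = "https://interactive.web.insurance.ca.gov/survey/survey?type=homeownerSurvey&event=HOMEOWNERS"

def get_survey_results(location):
    # POST the same form params as before, varying only the location
    data = urllib.parse.urlencode(dict(
        location=location,
        coverageType="HOMEOWNERS",
        coverageAmount="150000",
        homeAge="New",
    ))
    res = urllib.request.urlopen(url, data.encode("utf8"))
    return res.read()

# hypothetical list of locations to cycle through
locations = ["ALAMEDA ALAMEDA", "ALAMEDA BERKELEY", "ALAMEDA OAKLAND"]
for location in locations:
    html = get_survey_results(location)
    print(location, len(html))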

the results are in a lot of jibberish, so I'd have to find a way to sift through it huh. Like pick out which is which

Yes, the gibberish is HTML that you will need to parse to collect the data you're looking for. Look at HTMLParser in the Python standard library, or install a library like lxml or BeautifulSoup, which have a slightly nicer API. You can also just try parsing the text by hand using str.split.

If you want to convert the table's rows into python lists you'll need to find all the rows, which look like this:

  <tr Valign="top">
<td align="left">Bankers Standard <a href='http://interactive.web.insurance.ca.gov/companyprofile/companyprofile?event=companyProfile&doFunction=getCompanyProfile&eid=5906'><small>(Info)</small></a></td>
<td align="left"><div align="right">          N/A</td>
<td align="left"><div align="right">250</div></td>
<td align="left"> </td>
<td align="left">Bankers Standard <a href='http://interactive.web.insurance.ca.gov/companyprofile/companyprofile?event=companyProfile&doFunction=getCompanyProfile&eid=5906'><small>(Info)</small></a></td>
<td align="left"><div align="right">         1255</td>
<td align="left"><div align="right">500</div></td>
</tr>

You want to loop over all the <tr> (row) elements, grabbing all the <td> (column) elements inside each row, then clean up the text in each column (removing those &nbsp; spaces, etc.).
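As a rough sketch of that loop using BeautifulSoup (assuming it is installed, and that the variable html below holds the response body you fetched earlier; this is an illustration, not the original answer's code):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    # collect the text of each cell, collapsing whitespace and &nbsp; padding
    cells = [" ".join(td.get_text().split()) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

for row in rows:
    print(row)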

There are lots of questions on Stack Overflow and tutorials on the internet about how to parse or scrape HTML in Python, like this or this.

could you explain why we had to do the data.encode line

Sure! In the documentation for urlopen, it says:

data must be a bytes object specifying additional data to be sent to the server, or None if no such data is needed.

The urlencode function returns a unicode string, and if we try to pass that into urlopen, we get this error:

TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str.

So we use data.encode('utf8') to convert the unicode string to bytes. You typically need to use bytes for input and output: reading from or writing to files on disk, sending or receiving data over the network in HTTP requests, and so on. This presentation has a good explanation of bytes vs. unicode strings in Python and why you need to decode/encode when doing I/O.
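As a quick illustration (a minimal sketch, not from the original answer):

import urllib.parse

data = urllib.parse.urlencode({"homeAge": "New"})
print(type(data))                 # <class 'str'>  - unicode text
print(type(data.encode("utf8")))  # <class 'bytes'> - what urlopen wants for POST data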


