How do I use Mechanize to process JavaScript?
If you need to handle pages with JavaScript, try WATIR or Selenium; both drive a real web browser and can therefore handle any JavaScript. WATIR Classic requires either IE or Firefox with a certain extension installed, and you will see the pages flash on the screen as it works.
Your other option would be understanding what the JavaScript on the offending page does and replicating it manually, but that seems onerous.
How to properly use mechanize to scrape AJAX sites
From the mechanize FAQ: "mechanize does not provide any support for JavaScript". It then elaborates on your options (the choices aren't great).
If you've got something working, then great, but Selenium WebDriver is a much better solution for scraping AJAX sites than mechanize.
How to emulate a browser with JavaScript support via Mechanize?
For Linux:
Foremost, I know some people don't just want a suggestion to switch to another option. However, I believe that if you want to access the page entirely after logging in (which currently fails due to the lack of JavaScript support), you should look into using Selenium.
You can grab it with a quick sudo pip install selenium.
Accessing a webpage is as easy as declaring your browser, then telling it to go to the desired webpage. Here is a basic sample that makes your browser open a webpage; the page I'm using relies heavily on JavaScript:
import selenium
from selenium import webdriver

try:
    browser = webdriver.Firefox()
    browser.get('http://mikekus.com')  # Selenium needs a full URL, including the scheme
except KeyboardInterrupt:
    browser.quit()
This works because Selenium actually opens a browser. However, if you wish to hide the browser so you don't have to see it sitting in your taskbar, I recommend the following setup using pyvirtualdisplay, which hides the browser via visible=0. It is worth noting that pyvirtualdisplay is a wrapper for Xvfb, so you need to install that as well; you can get it with sudo apt-get install xvfb:
import selenium
from selenium import webdriver
from pyvirtualdisplay import Display

try:
    display = Display(visible=0, size=(800, 600))
    display.start()
    browser = webdriver.Firefox()
    browser.get('http://mikekus.com')
except KeyboardInterrupt:
    browser.quit()
    display.stop()
I will leave filling in login forms, etc. to you, as it's quite simple if you read the docs, as everyone should: Navigating with Selenium.
Granted, in your situation you are trying to access the proxy, then access another site. This method implies you would direct the proxy to the target webpage from the proxy's own page, by filling in fields on that page. I'm sure that with a bit of time and research you could continue navigating to multiple pages and page elements.
I hope this helps. Good luck.
Mechanize + Python: how to follow a link in a simple javascript?
I solved it this way:
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
...
br.follow_link(url="www.address1.com")  # follow_link takes keyword args such as url= or text=
refe = br.geturl()                      # remember the current URL to use as the Referer

req = urllib2.Request(url='www.site2.com')
req.add_header('Referer', refe)
# reuse the same cookie jar so the session carries over
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
f = opener.open(req)
htm = f.read()
print "\n\n", htm
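For reference, the same Referer-plus-cookie-jar technique can be sketched in Python 3 with only the standard library. The URLs here are placeholders, as in the snippet above, and the request is only built, not sent:

```python
import http.cookiejar
import urllib.request

# Shared cookie jar, as in the mechanize example above
cj = http.cookiejar.LWPCookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# Build the second request by hand and attach the Referer header,
# exactly as the urllib2 snippet above does
referer = "http://www.address1.com/"                   # placeholder URL
req = urllib.request.Request("http://www.site2.com/")  # placeholder URL
req.add_header("Referer", referer)

print(req.get_header("Referer"))  # the header is now set on the request
# sending it would be: opener.open(req)
```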
Python mechanize javascript
If you know the station IDs, it is easier to POST the request yourself:
import mechanize
import urllib
post_url = 'http://as0.mta.info/mnr/fares/get_fares.cfm'
orig = 295 #BEACON FALLS
dest = 292 #ANSONIA
params = urllib.urlencode({'dest_stat':dest, 'orig_stat':orig })
rq = mechanize.Request(post_url, params)
fares_page = mechanize.urlopen(rq)
print fares_page.read()
If you have the code to find the list of destination IDs for a given starting ID (i.e. a variant of refillList()), you can then run this request for each combination:
import mechanize
import urllib, urllib2
from bs4 import BeautifulSoup

url = 'http://as0.mta.info/mnr/fares/choosestation.cfm'
post_url = 'http://as0.mta.info/mnr/fares/get_fares.cfm'

def get_fares(orig, dest):
    params = urllib.urlencode({'dest_stat': dest, 'orig_stat': orig})
    rq = mechanize.Request(post_url, params)
    fares_page = mechanize.urlopen(rq)
    print(fares_page.read())

pool = BeautifulSoup(urllib2.urlopen(url).read())

# let's keep our stations organised
stations = {}  # dict by station id
for option in pool.find('select', {'name': 'orig_stat'}).findChildren():
    stations[option['value']] = {'name': option.string}

# iterate over all routes
for origin in stations:
    destinations = get_list_of_dests(origin)  # use your code for this
    stations[origin]['dests'] = destinations
    for destination in destinations:
        print('Processing from %s to %s' % (origin, destination))
        get_fares(origin, destination)
Can mechanize support ajax / filling out forms via javascript?
AJAX calls are performed by javascript, and mechanize has no way to run javascript. Mechanize only looks at form fields on a static HTML page and allows you to fill & submit those. This is why your research is pointing you towards things like Selenium or Ghost, which run on top of a real browser that can execute javascript.
There is a simpler way to do this though! If you use the developer tools in your browser (e.g. the Network tab in Firefox or Chrome) and fill out the form, you can see the request your browser is making behind the scenes, even with AJAX.
This tells you:
- The browser made a POST request
- To this URL: https://interactive.web.insurance.ca.gov/survey/survey?type=homeownerSurvey&event=HOMEOWNERS
- With the following form params:
  - location=ALAMEDA+ALAMEDA
  - coverageType=HOMEOWNERS
  - coverageAmount=150000
  - homeAge=New
You can use this information to make the same POST request in Python:
import urllib.parse, urllib.request

url = "https://interactive.web.insurance.ca.gov/survey/survey?type=homeownerSurvey&event=HOMEOWNERS"
data = urllib.parse.urlencode(dict(
    location="ALAMEDA ALAMEDA",
    coverageType="HOMEOWNERS",
    coverageAmount="150000",
    homeAge="New",
))
res = urllib.request.urlopen(url, data.encode("utf8"))
print(res.read())
This is Python 3. The requests library provides an even nicer API for making HTTP requests.
Edit: In response to your three questions:
is it possible for the dictionary that you've created to have more than 1 location and cycle through them using a for loop?
Yes, just add a loop around the code and pass a different value for location each time. I would put this code into a function to make it cleaner, like this:
https://gist.github.com/lost-theory/08786e3a27c8d8ce3839
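As a minimal sketch of that loop, assuming a hypothetical list of location names, here is how the POST body could be built for each location without actually sending the requests:

```python
import urllib.parse

# Hypothetical location names; the real values come from the site's form
locations = ["ALAMEDA ALAMEDA", "ORANGE ANAHEIM"]

def build_post_data(location):
    # Same params as the single-request example, with location swapped in
    return urllib.parse.urlencode(dict(
        location=location,
        coverageType="HOMEOWNERS",
        coverageAmount="150000",
        homeAge="New",
    ))

for loc in locations:
    data = build_post_data(loc)
    print(data)  # pass data.encode("utf8") to urlopen to send it
```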
the results are in a lot of jibberish, so I'd have to find a way to sift through it huh. Like pick out which is which
Yes, the jibberish is HTML that you will need to parse to collect the data you're looking for. Look at HTMLParser in the Python standard library, or install a library like lxml or BeautifulSoup, which have a nicer API. You can also just try parsing the text by hand using str.split.
If you want to convert the table's rows into Python lists, you'll need to find all the rows, which look like this:
<tr Valign="top">
<td align="left">Bankers Standard <a href='http://interactive.web.insurance.ca.gov/companyprofile/companyprofile?event=companyProfile&doFunction=getCompanyProfile&eid=5906'><small>(Info)</small></a></td>
<td align="left"><div align="right"> N/A</td>
<td align="left"><div align="right">250</div></td>
<td align="left"> </td>
<td align="left">Bankers Standard <a href='http://interactive.web.insurance.ca.gov/companyprofile/companyprofile?event=companyProfile&doFunction=getCompanyProfile&eid=5906'><small>(Info)</small></a></td>
<td align="left"><div align="right"> 1255</td>
<td align="left"><div align="right">500</div></td>
</tr>
You want to loop over all the <tr> (row) elements, grabbing all the <td> (column) elements inside each row, then clean up the text in each column (removing those non-breaking &nbsp; spaces, etc.).
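As a sketch of that loop, here is one way to pull the cell text out of a row using only the standard library's html.parser, fed a trimmed version of the sample row above (BeautifulSoup or lxml would make this shorter):

```python
from html.parser import HTMLParser

# Trimmed version of the sample row; "\xa0" is a rendered &nbsp;
sample_row = """
<tr Valign="top">
<td align="left">Bankers Standard <a href='#'><small>(Info)</small></a></td>
<td align="left"><div align="right">\xa0N/A</td>
<td align="left"><div align="right">250</div></td>
</tr>
"""

class RowParser(HTMLParser):
    """Collects the text of each <td> in a row, including nested tags."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data

parser = RowParser()
parser.feed(sample_row)
# Clean up: turn non-breaking spaces into spaces, then strip
cells = [c.replace("\xa0", " ").strip() for c in parser.cells]
print(cells)  # one cleaned string per <td>
```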
There are lots of questions on StackOverflow and tutorials on the internet on how to parse or scrape HTML in python, like this or this.
could you explain why we had to do the data.encode line
Sure! In the documentation for urlopen, it says:
data must be a bytes object specifying additional data to be sent to the server, or None if no such data is needed.
The urlencode function returns a unicode string, and if we try to pass that into urlopen, we get this error:
TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str.
So we use data.encode('utf8') to convert the unicode string to bytes. You typically need to use bytes for input & output: reading from or writing to files on disk, sending or receiving data over the network in HTTP requests, etc. This presentation has a good explanation of bytes vs. unicode strings in Python and why you need to decode/encode when doing I/O.
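The distinction is easy to see in a short sketch (the sample strings here are arbitrary):

```python
# str -> bytes: encode before sending over the network
text = "coverageAmount=150000"
payload = text.encode("utf8")
print(type(payload))  # bytes, which urlopen's data argument requires

# bytes -> str: decode after reading a response body
body = b"<html>...</html>"
print(body.decode("utf8"))
```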