Get Page Generated With JavaScript in Python

Get page generated with Javascript in Python

You could use Selenium Webdriver:

#!/usr/bin/env python
from contextlib import closing
from selenium.webdriver import Firefox # pip install selenium
from selenium.webdriver.support.ui import WebDriverWait

# use firefox to get page with javascript generated content
with closing(Firefox()) as browser:
browser.get(url)
button = browser.find_element_by_name('button')
button.click()
# wait for the page to load
WebDriverWait(browser, timeout=10).until(
lambda x: x.find_element_by_id('someId_that_must_be_on_new_page'))
# store it to string variable
page_source = browser.page_source
print(page_source)

scrape html generated by javascript with python

In Python, I think Selenium 1.0 is the way to go. It’s a library that allows you to control a real web browser from your language of choice.

You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.

Python Scraping JavaScript page without the need of an installed browser

Aside from automating a browser your other 2 options are as follows:

  1. try find the backend query that loads the data via javascript. It's not a guarantee that it will exist but open your browser's Developer Tools - Network tab - fetch/Xhr and then refresh the page, hopefully you'll see requests to a backend api that loads the data you want. If you do find a request click on it and explore the endpoint, headers and possibly the payload that is sent to get the response you are looking for, these can all be recreated in python using requests to that hidden endpoint.

  2. the other possiblility is that the data hidden in the HTML within a script tag possibly in a json file... Open the Elements tab of your developer tools where you can see the HTML of the page, right click on the tag and click "expand recursively" this will open every tag (it might take a second) and you'll be able to scroll down and search for the data you want. Ignore the regular HTML tags, we know it is loaded by javascript so look through any "script" tag. If you do find it then you can hopefully find it in your script with a combination of Beautiful Soup to get the script tag and string slicing to just get out the json.

If neither of those produce results then try requests_html package, and specifically the "render" method. It automatically installs a headless browser when you first run the render method in your script.

What site is it, perhaps I can offer more help if I can see it?

Scraping elements generated by javascript queries using python

Working example :

import urllib
import requests
import json

url = "https://daphnecaruanagalizia.com/2017/10/crook-schembri-court-today-pleading-not-crook/"

encoded = urllib.parse.quote_plus(url)
# encoded = urllib.quote_plus(url) # for python 2 replace previous line by this
j = requests.get('https://count-server.sharethis.com/v2.0/get_counts?url=%s' % encoded).text
obj = json.loads(j)
print(obj['clicks']['twitter'] + obj['shares']['twitter'])

# => 5008

Explanation :

Inspecting the webpage, you can see that it does a request to this :

https://count-server.sharethis.com/v2.0/get_counts?url=https%3A%2F%2Fdaphnecaruanagalizia.com%2F2017%2F10%2Fcrook-schembri-court-today-pleading-not-crook%2F&cb=stButtons.processCB&wd=true

If you paste it in your browser you'll have all your answers. Then playing a bit with the url, you can see that removing extra parameters will give you a nice json.

So as you can see, you just have to replace the url parameter of the request with the url of the page you want to get the twitter counts.

How to scrape a javascript website in Python?

You can access data via API (check out the Network tab):

JavaScript Sample Image 13


For example,

import requests
url = "https://www.example.com/api/v3/news_feed/7"
data = requests.get(url).json()

How to get Javascript generated content

Try changing

html = driver.execute_script("return document.getElementsByTagName('table')[38].innerHTML")
print(html)

To:

target = driver.find_element_by_xpath('//table[@class="ACA_GridView ACA_Grid_Caption"]')
print(target.text)

Output:

Showing 1-13 of 13
License Type
License Number
First Name
Middle Initial
Last Name
Organization Name
DBA/Trade Name
License Status
License Expiration Date
Pharmacist
5302017621
Arthur
James

etc.

Using python Requests with javascript pages

You are going to have to make the same request (using the Requests library) that the javascript is making. You can use any number of tools (including those built into Chrome and Firefox) to inspect the http request that is coming from javascript and simply make this request yourself from Python.



Related Topics



Leave a reply



Submit