scrape html generated by javascript with python
In Python, I think Selenium is the way to go. It's a library that lets you control a real web browser from your language of choice.
You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.
Python Scraping JavaScript page without the need of an installed browser
Aside from automating a browser, your other two options are as follows:
try to find the backend query that loads the data via JavaScript. There's no guarantee it exists, but open your browser's developer tools (Network tab -> Fetch/XHR) and refresh the page; hopefully you'll see requests to a backend API that loads the data you want. If you do find a request, click on it and explore the endpoint, the headers, and possibly the payload that is sent to get the response you're looking for. These can all be recreated in Python using requests to that hidden endpoint.
the other possibility is that the data is hidden in the HTML within a script tag, possibly as JSON. Open the Elements tab of your developer tools, where you can see the HTML of the page, right-click on the html tag and click "Expand recursively". This opens every tag (it might take a second), and you'll be able to scroll down and search for the data you want. Ignore the regular HTML tags; we know the content is loaded by JavaScript, so look through the "script" tags. If you do find it, you can extract it in your script with a combination of Beautiful Soup (to get the script tag) and string slicing (to pull out just the JSON).
If neither of those produces results, try the requests-html package, specifically its "render" method. It automatically downloads a headless browser the first time you call render in your script.
What site is it, perhaps I can offer more help if I can see it?
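The script-tag option above can be sketched with the standard library alone. The HTML below is a made-up example (real pages will name their data variable differently), but the pattern of regex-matching the embedded object and feeding it to json.loads is the same:

```python
import json
import re

# Made-up example of a page that embeds its data as JSON inside a script tag.
html = """
<html><body>
<script>window.__DATA__ = {"products": [{"name": "widget", "price": 9.99}]};</script>
</body></html>
"""

# Grab the object literal assigned to the (hypothetical) window.__DATA__ variable.
match = re.search(r'window\.__DATA__\s*=\s*(\{.*?\});', html, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    print(data["products"][0]["name"])  # -> widget
```

In practice you'd first locate the right script tag with Beautiful Soup, then apply a regex or string slicing like this to its contents.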
How to scrape a javascript website in Python?
You can access data via API (check out the Network tab):
For example,
import requests
url = "https://www.todayonline.com/api/v3/news_feed/7"
data = requests.get(url).json()
Scraping elements generated by javascript queries using python
Working example:
import urllib.parse
import requests
import json
url = "https://daphnecaruanagalizia.com/2017/10/crook-schembri-court-today-pleading-not-crook/"
encoded = urllib.parse.quote_plus(url)
# encoded = urllib.quote_plus(url)  # for Python 2, replace the previous line with this
j = requests.get('https://count-server.sharethis.com/v2.0/get_counts?url=%s' % encoded).text
obj = json.loads(j)
print(obj['clicks']['twitter'] + obj['shares']['twitter'])
# => 5008
Explanation:
Inspecting the webpage, you can see that it makes a request to this URL:
https://count-server.sharethis.com/v2.0/get_counts?url=https%3A%2F%2Fdaphnecaruanagalizia.com%2F2017%2F10%2Fcrook-schembri-court-today-pleading-not-crook%2F&cb=stButtons.processCB&wd=true
If you paste it into your browser, you'll have all your answers. Playing with the URL a bit, you can see that removing the extra parameters gives you a nice JSON response.
So, as you can see, you just have to replace the url parameter of the request with the URL of the page whose Twitter counts you want.
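That substitution can be wrapped in a small helper. The function name here is mine, and only the URL-building step runs without network access; fetching and parsing the JSON works as in the example above:

```python
import urllib.parse

# Hypothetical helper: build the ShareThis counts URL for any page.
def build_counts_url(page_url):
    encoded = urllib.parse.quote_plus(page_url)
    return 'https://count-server.sharethis.com/v2.0/get_counts?url=%s' % encoded

url = build_counts_url('https://example.com/some-post/')
print(url)
# -> https://count-server.sharethis.com/v2.0/get_counts?url=https%3A%2F%2Fexample.com%2Fsome-post%2F
```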
Web scraping with python in javascript dynamic website
The website makes 3 API calls in order to get its data.
The code below does the same and gets the data.
(In the browser, press F12 -> Network -> XHR to see the API calls.)
import requests

payload1 = {'language': 'ca', 'documentId': 680124}
r1 = requests.post('https://portaldogc.gencat.cat/eadop-rest/api/pjc/getListTraceabilityStandard', data=payload1)
if r1.status_code == 200:
    print(r1.json())

print('------------------')

payload2 = {'documentId': 680124, 'orderBy': 'DESC', 'language': 'ca', 'traceability': '02'}
r2 = requests.post('https://portaldogc.gencat.cat/eadop-rest/api/pjc/getListValidityByDocument', data=payload2)
if r2.status_code == 200:
    print(r2.json())

print('------------------')

payload3 = {'documentId': 680124, 'traceabilityStandard': '02', 'language': 'ca'}
r3 = requests.post('https://portaldogc.gencat.cat/eadop-rest/api/pjc/documentPJC', data=payload3)
if r3.status_code == 200:
    print(r3.json())
Scraping dynamically generated html by JavaScript with Python and Selenium
The advantage with Selenium is that you can actually start a browser session from your program and trigger a JavaScript event (like the scroll in this case):
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://cavemendev.com')
# Trigger the scroll first, so the JavaScript-loaded content is in the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
# Then grab the rendered HTML
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all('title'):
    print(tag.text)
Let me know if this doesn't make sense.