Using Python Requests with JavaScript pages
You are going to have to make the same request (using the Requests library) that the JavaScript is making. You can use any number of tools (including those built into Chrome and Firefox) to inspect the HTTP request coming from the JavaScript, and simply make that request yourself from Python.
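For example, if the Network tab shows the page pulling its data from a JSON endpoint, that call can be reproduced directly. The endpoint, headers, and parameters below are placeholders; copy the real ones from DevTools:

```python
import requests

def fetch_api_data(api_url, headers=None, params=None):
    """Replicate an XHR/fetch request observed in the browser's Network tab.

    api_url, headers, and params should be whatever DevTools shows for the
    real request -- the values in the usage example are illustrative only.
    """
    response = requests.get(api_url, headers=headers, params=params, timeout=10)
    response.raise_for_status()
    return response.json()  # most XHR endpoints return JSON

# Usage (hypothetical endpoint copied from DevTools):
# data = fetch_api_data(
#     "https://example.com/api/items",
#     headers={"User-Agent": "Mozilla/5.0", "X-Requested-With": "XMLHttpRequest"},
#     params={"page": 1},
# )
```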
Python Scraping JavaScript page without the need of an installed browser
Aside from automating a browser, your other two options are as follows:
- Try to find the backend query that loads the data via JavaScript. There's no guarantee it exists, but open your browser's Developer Tools, go to the Network tab, filter by Fetch/XHR, and refresh the page; hopefully you'll see requests to a backend API that load the data you want. If you do find such a request, click on it and explore the endpoint, the headers, and possibly the payload that is sent to get the response you are looking for; these can all be recreated in Python by pointing requests at that hidden endpoint.
- The other possibility is that the data is hidden in the HTML within a script tag, often as JSON. Open the Elements tab of your developer tools, where you can see the HTML of the page; right-click a tag and choose "Expand recursively". This opens every tag (it might take a second), and you'll be able to scroll down and search for the data you want. Ignore the regular HTML tags; we know the data is loaded by JavaScript, so look through the script tags. If you do find it, you can usually extract it with a combination of Beautiful Soup (to get the script tag) and string slicing (to pull out just the JSON).
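As a sketch of that second option: the HTML below is a made-up stand-in for a real response.text (Beautiful Soup is skipped here so the example stays dependency-free, but on a real page you would use it to isolate the script tag first), and the find/slice/json.loads pattern is the same either way:

```python
import json

# Fake page standing in for a real response.text; the data blob is usually
# assigned to a JS variable inside a <script> tag like this one.
html = """
<html><body>
<script>window.__DATA__ = {"jobs": [{"title": "Inside Sales Manager"}]};</script>
</body></html>
"""

# Slice out everything between the assignment and the closing semicolon,
# then parse it as JSON.
marker = "window.__DATA__ = "
start = html.find(marker) + len(marker)
end = html.find(";</script>", start)
data = json.loads(html[start:end])

print(data["jobs"][0]["title"])  # Inside Sales Manager
```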
If neither of those produces results, try the requests_html package, specifically its render method. It automatically installs a headless browser the first time you run render in your script.
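A minimal sketch of that fallback (assuming requests_html is installed via pip install requests-html; the import is kept inside the function so nothing heavy happens until you actually call it):

```python
def fetch_rendered_html(url):
    """Fetch a page and execute its JavaScript with requests_html's render().

    The first call to render() downloads a headless Chromium, so expect a
    one-time delay. The import is local so requests_html is only needed
    when this function is actually called.
    """
    from requests_html import HTMLSession

    session = HTMLSession()
    r = session.get(url)
    r.html.render()       # runs the page's JavaScript in headless Chromium
    return r.html.html    # the HTML after JS execution

# html = fetch_rendered_html("https://example.com")  # hypothetical URL
```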
What site is it? Perhaps I can offer more help if I can see it.
Python requests.get(url) returning JavaScript code instead of the page HTML
Some websites present different content depending on the type of browser that is accessing the site; LinkedIn is a perfect example of such behavior. If the browser has advanced capabilities, the website may present "richer" content, something more dynamic and styled, and a plain scripted request won't see that version of the site.
To solve this problem, you need to follow these steps:
- Download chromedriver from here. Choose the one that matches your OS.
- Extract the driver and put it in some directory, for example /usr.
- Install Selenium, which is a Python module, by running pip install selenium. Note that Selenium depends on another package called msgpack, so you should install that first with pip install msgpack.
- Now we are ready to run the following code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

def create_browser(webdriver_path):
    # create a Selenium object that mimics the browser
    browser_options = Options()
    # the headless flag creates an invisible browser
    browser_options.add_argument("--headless")
    browser_options.add_argument("--no-sandbox")
    # Selenium 4 style: the driver path is wrapped in a Service object
    browser = webdriver.Chrome(service=Service(webdriver_path), options=browser_options)
    print("Done Creating Browser")
    return browser

url = "https://www.linkedin.com/jobs/view/inside-sales-manager-at-stericycle-1089095836/"
browser = create_browser('/usr/chromedriver')  # DON'T FORGET TO CHANGE THIS TO YOUR DIRECTORY
browser.get(url)
page_html = browser.page_source
print(page_html[-10:])  # prints dy></html>
Now, you have the whole page. I hope this answers your question!!
Python Requests run JS file from GET
Alright, I figured this one out, despite it fighting me the whole way. I don't know why dtPC wasn't showing up in s.cookies like it should, but it turned out I wasn't using the script keyword quite right. Apparently, whatever JS you pass it will be executed after everything else has rendered, as if you had opened the console in your browser and pasted it in there. When I actually tried that in Chrome, I got some errors. Eventually I realized I could just run a simple JS snippet to return the cookies generated by the other JS.
import urllib.parse
from requests_html import HTMLSession

s = HTMLSession()
r = s.get(url, headers=headers)  # url and headers as defined earlier
print(r.status_code)
c = r.html.render(script='document.cookie')
c = urllib.parse.unquote(c)
# strip the space after each ';' and split on the first '=' only,
# in case a cookie value itself contains '='
c = [x.strip().split('=', 1) for x in c.split(';')]
c = {x[0]: x[1] for x in c}
print(c)
At this point, c will be a dict with 'dtPC' as a key and the corresponding value.