Using Python Requests with JavaScript Pages

You are going to have to make the same request (using the Requests library) that the JavaScript is making. You can use any number of tools (including those built into Chrome and Firefox) to inspect the HTTP request that the JavaScript sends, and then simply make that request yourself from Python.
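For example, if the Network tab shows the page pulling its data from a JSON endpoint, that request can usually be replayed directly. A minimal sketch, assuming a hypothetical endpoint and headers (copy the real ones from your browser's developer tools):

import requests

# Hypothetical endpoint copied from the browser's Network tab
api_url = "https://example.com/api/items?page=1"
headers = {
    "User-Agent": "Mozilla/5.0",   # some backends reject non-browser user agents
    "Accept": "application/json",  # many JSON APIs expect this
}

resp = requests.get(api_url, headers=headers)
resp.raise_for_status()
data = resp.json()  # the same JSON the page's JavaScript receives
print(data)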

Python scraping a JavaScript page without needing an installed browser

Aside from automating a browser, your other two options are as follows:

  1. Try to find the backend query that loads the data via JavaScript. There's no guarantee it exists, but open your browser's Developer Tools → Network tab → Fetch/XHR and refresh the page; with luck you'll see requests to a backend API that loads the data you want. If you do find one, click on it and explore the endpoint, the headers, and possibly the payload that is sent to get the response you're looking for. These can all be recreated in Python using requests against that hidden endpoint (the sketch in the previous answer shows the pattern).

  2. The other possibility is that the data is hidden in the HTML inside a script tag, often as JSON. Open the Elements tab of your developer tools, where you can see the HTML of the page, right-click the top tag, and choose "Expand recursively"; this opens every tag (it might take a second), and you can then scroll down and search for the data you want. Ignore the regular HTML tags: we know the data is loaded by JavaScript, so look through the "script" tags. If you do find it, you can extract it in your own script with a combination of Beautiful Soup to get the script tag and string slicing to pull out just the JSON, as in the sketch after this list.
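For option 2, a minimal sketch of digging JSON out of a script tag; the URL and the window.__DATA__ marker are hypothetical placeholders, so adapt them to whatever you actually find in the Elements tab:

import json
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/page").text  # hypothetical URL
soup = BeautifulSoup(html, "html.parser")

data = None
for script in soup.find_all("script"):
    text = script.string or ""
    # Hypothetical marker: many pages embed state as `window.__DATA__ = {...};`
    if "window.__DATA__" in text:
        start = text.index("{")             # first brace of the JSON object
        end = text.rindex("}") + 1          # last closing brace
        data = json.loads(text[start:end])  # slice out and parse just the JSON
        break

print(data)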

If neither of those produces results, try the requests_html package, specifically its "render" method. It automatically downloads a headless browser the first time you run the render method in your script.
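A minimal sketch of that fallback (the URL is a placeholder, and the first call to render downloads Chromium, which takes a while):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://example.com/page")  # hypothetical URL
r.html.render()  # runs the page's JavaScript in headless Chromium
print(r.html.find("title", first=True).text)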

What site is it? Perhaps I can offer more help if I can see it.

Python requests.get(url) returning JavaScript code instead of the page HTML

Some websites present different content depending on the type of browser that is accessing the site. LinkedIn is a perfect example of such behavior. If the browser has advanced capabilities, the website may present “richer” content, something more dynamic and styled. A simple bot like Requests won't get to see those versions of the site.

To solve this problem, you need to follow these steps:

  1. Download ChromeDriver; choose the build that matches your OS.
  2. Extract the driver and put it in a directory of your choice, for example /usr.
  3. Install Selenium, a Python module, by running pip install selenium; pip will pull in Selenium's own dependencies automatically.
  4. Now we are ready to run the following code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

def create_browser(webdriver_path):
    # Create a Selenium object that mimics the browser
    browser_options = Options()
    # The headless flag creates an invisible browser
    browser_options.add_argument("--headless")
    browser_options.add_argument("--no-sandbox")
    # Selenium 4 style: pass the driver path via Service and the flags via options
    browser = webdriver.Chrome(service=Service(webdriver_path), options=browser_options)
    print("Done creating browser")
    return browser

url = "https://www.linkedin.com/jobs/view/inside-sales-manager-at-stericycle-1089095836/"
browser = create_browser('/usr/chromedriver')  # don't forget to change this to your directory
browser.get(url)
page_html = browser.page_source
print(page_html[-10:])  # prints dy></html>

Now you have the whole page. I hope this answers your question!
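From there, the rendered HTML can be fed to the usual parsing tools. A minimal sketch with Beautiful Soup (the h1 lookup is just an illustration; inspect the real page for the elements you need):

from bs4 import BeautifulSoup

soup = BeautifulSoup(page_html, "html.parser")
title = soup.find("h1")  # illustrative only; pick the tags you actually need
print(title.get_text(strip=True) if title else "no <h1> found")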

Python Requests run JS file from GET

Alright, I figured this one out, despite it fighting me the whole way. I don't know why dtPC wasn't showing up in s.cookies like it should, but I wasn't using the script keyword argument quite right. Apparently, whatever JS you pass it will be executed after everything else has rendered, as if you had opened the console in your browser and pasted it there. When I actually tried that in Chrome, I got some errors. Eventually I realized I could just run a simple JS snippet to return the cookies generated by the other JS.

from requests_html import HTMLSession
import urllib.parse

s = HTMLSession()
r = s.get(url, headers=headers)  # url and headers as defined earlier
print(r.status_code)

# render() runs the page's JS; the script kwarg executes afterwards,
# and whatever it evaluates to is what render() returns
c = r.html.render(script='document.cookie')

# turn the "name=value; name2=value2" string into a dict
c = urllib.parse.unquote(c)
c = [x.split('=', 1) for x in c.split('; ')]  # split on '; ' so keys keep no leading space
c = {x[0]: x[1] for x in c}
print(c)

At this point, c will be a dict with 'dtPC' as a key and the corresponding value.
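If a later plain-requests call needs that cookie, the dict can be passed straight through; a sketch, assuming the server accepts the replayed value:

import requests

# url and headers are the same values used above; dtPC came from the rendered JS
resp = requests.get(url, headers=headers, cookies=c)
print(resp.status_code)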


