How to save all the network traffic (both request and response headers) from a website using python
You can use the selenium-wire
library if you want to use Selenium
to work with this. However, if you're only concerned for a specific API, then rather than using Selenium, you can use the requests
library for hitting the API and then print the results of the request
and response
headers.
Considering you're looking for the earlier, using the Selenium way, one way to achieve this is using selenium-wire
library. However, it will give the result for all the background API's/requests being hit - which you can then easily filter after either piping the result to a text file or in terminal itself
Install using pip install selenium-wire
Install webdriver-manager
using pip install webdriver-manager
Install Selenium 4 using pip install selenium==4.0.0.b4
Use this code
from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
svc= Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=svc)
driver.maximize_window()
# To use firefox browser
driver.get("https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en")
for request in driver.requests:
if request.response:
print(
request.url,
request.response.status_code,
request.headers,
request.response.headers
)
which gives a detailed output of all the requests - copying the relavent one -
https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en 200
Host: epco.taleo.net
Connection: keep-alive
sec-ch-ua: "Chromium";v="94", "Google Chrome";v="94", ";Not A Brand";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en-US;q=0.9,en;q=0.8
Date: Tue, 28 Sep 2021 11:14:14 GMT
Server: Taleo Web Server 8
Cache-Control: private
P3P: CP="CAO PSA OUR"
Content-Encoding: gzip
Vary: Accept-Encoding
X-Content-Type-Options: nosniff
Set-Cookie: locale=en; path=/careersection/; secure; HttpOnly
Content-Security-Policy: frame-ancestors 'self'
X-XSS-Protection: 1
X-UA-Compatible: IE=edge
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html;charset=UTF-8
Scrape JSON response with Selenium Browser [closed]
The website you are trying to scrape has a dynamically generated content by JavaScript .
You have two options to work your way around that
Simulate a human browser interaction using selenium and open the website then wait till all the content is rendered and then use selenium to Extract the data you seek . this approach deals with the Elements tab. you just use css or xpath selectors to get the tags you want
instead of finding a way to make selenium go to network tab and save the content ( which you will find extremely hard to do ) you should get the URL of the XHR request and build the same request with the same headers and parameters if any exists and then use
requests
to send that request and you can save the content easily .
Let's try to scrape Home | Microsoft Academic
First approach :
from selenium import webdriver
driver = webdriver.Chrome() # Launch the browser
driver.get("https://academic.microsoft.com/home") # Go to the given url
authors = driver.find_elements_by_xpath('//a[@data-appinsights-action="TopAuthorSelected"]') # get the elements using selectors
for author in authors: # loop through them
print(author.text)
Output :
1. Yoshua Bengio
2. Geoffrey E. Hinton
3. Andrew Zisserman
4. Ilya Sutskever
5. Jian Sun
6. Trevor Darrell
7. Scott Shenker
8. Jiawei Han
9. Kaiming He
10. Ross Girshick
11. Ion Stoica
12. Hari Balakrishnan
13. R Core Team
14. Jitendra Malik
15. Jeffrey Dean
Second approach :
import requests
res = requests.get('https://academic.microsoft.com/api/analytics/authors/topauthors?topicId=41008148&take=15&filter=1&dateRange=1').json()
#The XHR Response is Usually in Json format
#res = [{'name': 'Yoshua Bengio', 'id': '161269817', 'lat': 0.0, 'lon': 0.0}, {'name': 'Geoffrey E. Hinton', 'id': '563069026', 'lat': 0.0, 'lon': 0.0}, {'name': 'Andrew Zisserman', 'id': '2469405535', 'lat': 0.0, 'lon': 0.0}, {'name': 'Ilya Sutskever', 'id': '215131072', 'lat': 0.0, 'lon': 0.0}, {'name': 'Jian Sun', 'id': '2200192130', 'lat': 0.0, 'lon': 0.0}, {'name': 'Trevor Darrell', 'id': '2174985400', 'lat': 0.0, 'lon': 0.0}, {'name': 'Scott Shenker', 'id': '719828399', 'lat': 0.0, 'lon': 0.0}, {'name': 'Jiawei Han', 'id': '2121939561', 'lat': 0.0, 'lon': 0.0}, {'name': 'Kaiming He', 'id': '2164292938', 'lat': 0.0, 'lon': 0.0}, {'name': 'Ross Girshick', 'id': '2473549963', 'lat': 0.0, 'lon': 0.0}, {'name': 'Ion Stoica', 'id': '2161479384', 'lat': 0.0, 'lon': 0.0}, {'name': 'Hari Balakrishnan', 'id': '1998464616', 'lat': 0.0, 'lon': 0.0}, {'name': 'R Core Team', 'id': '2976715238', 'lat': 0.0, 'lon': 0.0}, {'name': 'Jitendra Malik', 'id': '2136556746', 'lat': 0.0, 'lon': 0.0}, {'name': 'Jeffrey Dean', 'id': '2429370538', 'lat': 0.0, 'lon': 0.0}]
for author in res:
print(author['name'])
Output:
Yoshua Bengio
Geoffrey E. Hinton
Andrew Zisserman
Ilya Sutskever
Jian Sun
Trevor Darrell
Scott Shenker
Jiawei Han
Kaiming He
Ross Girshick
Ion Stoica
Hari Balakrishnan
R Core Team
Jitendra Malik
Jeffrey Dean
Second approach saves time , resources and straight forward .
Using First approach Image
Using Second approach Image
Capture AJAX response with selenium and python
I was unable to capture AJAX response with selenium but here is what works, although without selenium:
1- Find out the XML request by monitoring the network analyzing tools in your browser
2= Once you've identified the request, regenerate it using Python's requests or urllib2 modules. I personally recommend requests because of its additional features, most important to me was requests.Session.
You can find plenty of help and relevant posts regarding these two steps.
Hope it will help someone someday.
Related Topics
Python: Editing List While Iterating Over It
How to Normalize a Numpy Array to Within a Certain Range
How to Plot Pandas Dataframe With Date (Year/Month)
Converting Pandas Column of Comma-Separated Strings into Integers
How to Specify File Path in Jupyter Notebook
Fast Way to Split Column into Multiple Rows in Pandas
How to Count the Total Number of Words in a Pandas Dataframe Cell and Add Those to a New Column
How to Delete the Words Between Two Delimiters
Python Pandas Dataframe Get All Combinations of Column Values
How to Convert Datetime by Removing Nanoseconds
Django: Check Whether an Object Already Exists Before Adding
How to Get All Users in a Telegram Channel Using Telethon
Passing a List of Values from Python to the in Clause of an SQL Query
How to Make Python Code to Execute Only Once
How to Repeatedly Execute a Function Every X Seconds
How to Display Last 2 Digits from a Number in Python
How to Store Python Dictionary in to MySQL Db Through Python