How to use Python requests to fake a browser visit, a.k.a. generate a User-Agent?
Provide a User-Agent header:
import requests
url = 'http://www.ichangtou.com/#company:data_000008.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response.content)
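To check that the header really reaches the server, here is a minimal sketch that runs offline using only the standard library. A throwaway local server echoes back whatever User-Agent it receives. The names EchoUserAgent and fetch_seen_user_agent are made up for this example, and it uses urllib.request instead of requests, but the headers dict has the same shape as the one passed to requests.get above.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoUserAgent(BaseHTTPRequestHandler):
    """Hypothetical test handler: replies with the User-Agent it was sent."""

    def do_GET(self):
        body = self.headers.get("User-Agent", "").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_seen_user_agent(headers):
    """Start a local server, send one GET with `headers`, return what it saw."""
    server = HTTPServer(("127.0.0.1", 0), EchoUserAgent)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_port
        request = urllib.request.Request(url, headers=headers)
        # An empty ProxyHandler bypasses any proxy configured in the environment.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=5) as response:
            return response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()


browser_ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36")

seen_default = fetch_seen_user_agent({})
seen_custom = fetch_seen_user_agent({"User-Agent": browser_ua})
print("default:", seen_default)  # the library's own identifier
print("custom: ", seen_custom)   # the browser-like string
```

Without the header, the server sees the HTTP library's own identifier, which is what many sites use to reject scripted clients.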
FYI, here is a list of User-Agent strings for different browsers:
- List of all Browsers
As a side note, there is a pretty useful third-party package called fake-useragent that provides a nice abstraction layer over user agents:
fake-useragent: "Up to date simple useragent faker with real world database"
Demo:
>>> from fake_useragent import UserAgent
>>> ua = UserAgent()
>>> ua.chrome
u'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'
>>> ua.random
u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36'
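If installing a third-party package is not an option, the same idea can be sketched with a small hand-maintained pool and the standard library. This is only an illustration: USER_AGENT_POOL and random_headers are hypothetical names, and the two strings are simply the ones already shown in this thread.

```python
import random

# Hypothetical pool; extend it with the browsers you want to imitate.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36",
]


def random_headers(pool=USER_AGENT_POOL, rng=random):
    """Return a headers dict whose User-Agent is picked at random from pool."""
    if not pool:
        raise ValueError("user agent pool is empty")
    return {"User-Agent": rng.choice(pool)}


headers = random_headers()
print(headers["User-Agent"])
```

The resulting dict can be passed straight to requests.get(url, headers=headers). Unlike fake-useragent, a fixed pool like this goes stale as browsers update, so it needs refreshing by hand.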
Python requests get stuck when trying to get web content
It looks like that site loads the data for its charts dynamically with JavaScript, so requests, which only downloads the initial HTML and never executes scripts, won't return a usable result. You can use Selenium to drive an actual browser instance, which will run the JavaScript needed to get the data onto the page.
You'll need:
- Selenium, installed using: pip install selenium
- A browser driver binary in PATH or in the directory of your Python script. I suggest Mozilla's Geckodriver found here: https://github.com/mozilla/geckodriver/releases
Usage example:
from selenium import webdriver
from selenium.webdriver.common.by import By
options = webdriver.FirefoxOptions()
# options.add_argument("-headless")  # Uncomment to run without a visible browser window (the options.headless attribute is deprecated in Selenium 4).
driver = webdriver.Firefox(options=options)
# Grabbing a URL using the browser instance.
driver.get("URL")
# Finding an element by ID
example_element = driver.find_element(By.ID, "Element ID")
print(example_element.text)
# Closing the browser instance
driver.quit()
It'll take some experimenting to learn how to use all of Selenium's capabilities in your code, but the documentation (https://selenium-python.readthedocs.io) covers most of what you'll need.
How to extract all links from a website using python
The site blocks Python bots. A plain request gets this back:
<h1>Access denied</h1>
<p>This website is using a security service to protect itself from online attacks.</p>
You can try adding a user agent to your code, like below:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}
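web = requests.get("https://www.jarir.com/", headers=headers)
# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup(web.text, "html.parser")
# href=True skips <a> tags that have no href attribute.
for link in soup.find_all('a', href=True):
    print(link['href'])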
The output is something like:
https://www.jarir.com/wishlist/
https://www.jarir.com/sales/order/history/
https://www.jarir.com/afs/
https://www.jarir.com/contacts/
tel:+966920000089
/cdn-cgi/l/email-protection#6300021106230902110a114d000c0e
https://www.jarir.com/faq/
https://www.jarir.com/warranty_policy/
https://www.jarir.com/return_exchange/
https://www.jarir.com/contacts/
https://www.jarir.com/terms-of-service/
https://www.jarir.com/privacy-policy/
https://www.jarir.com/storelocator/
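As the output shows, the raw href values mix absolute URLs, site-relative paths, non-web schemes such as tel:, and duplicates. One possible cleanup, sketched here with the standard library only, resolves each href against the page URL and keeps unique http(s) links. normalize_links is a hypothetical helper name, and the sample list is copied from the output above.

```python
from urllib.parse import urljoin, urlparse


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url; keep unique http(s) URLs in order."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # relative paths become full URLs
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip tel:, mailto:, javascript: and similar
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


sample = [
    "https://www.jarir.com/contacts/",
    "tel:+966920000089",
    "/cdn-cgi/l/email-protection#6300021106230902110a114d000c0e",
    "https://www.jarir.com/contacts/",
]

links = normalize_links("https://www.jarir.com/", sample)
for link in links:
    print(link)
```

In the scraping loop, this would mean collecting the link['href'] values into a list first and passing that list, together with the page URL, to the helper.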