How to Use Python Requests to Fake a Browser Visit, a.k.a. Generate a User Agent

How to use Python requests to fake a browser visit, a.k.a. generate a User-Agent?

Provide a User-Agent header:

import requests

url = 'http://www.ichangtou.com/#company:data_000008.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
print(response.content)
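
If you make several requests to the same site, it can be cleaner to set the header once on a requests.Session so it is sent with every request. A minimal sketch, reusing the URL and User-Agent string from the example above:

import requests

session = requests.Session()
# Headers set on the session are sent with every request made through it.
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
})

response = session.get('http://www.ichangtou.com/#company:data_000008.html')
print(response.status_code)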

FYI, here is a list of User-Agent strings for different browsers:

  • List of all Browsers

As a side note, there is a pretty useful third-party package called fake-useragent that provides a nice abstraction layer over user agents:

fake-useragent

Up to date simple useragent faker with real world database

Demo:

>>> from fake_useragent import UserAgent
>>> ua = UserAgent()
>>> ua.chrome
'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'
>>> ua.random
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36'
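
To combine the two, you can plug a generated string straight into the headers dict. A minimal sketch, assuming fake-useragent is installed (pip install fake-useragent):

from fake_useragent import UserAgent
import requests

ua = UserAgent()

# Send a randomly chosen, real-world User-Agent string with the request.
headers = {'User-Agent': ua.random}
response = requests.get('http://www.ichangtou.com/#company:data_000008.html', headers=headers)
print(response.status_code)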

Python requests gets stuck when trying to get web content

It looks like the data on that site's charts is loaded dynamically using JavaScript, so requests won't return a usable result. You can use Selenium to drive an actual browser instance, which will run the JavaScript needed to render the page before you grab data off it.

You'll need:

  • Selenium installed using pip install selenium
  • A browser driver binary in PATH or in the directory of your Python script. I suggest Mozilla's Geckodriver found here: https://github.com/mozilla/geckodriver/releases

Usage example:

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.FirefoxOptions()
# options.add_argument("-headless")  # Uncomment to run Firefox without opening a window.
driver = webdriver.Firefox(options=options)

# Grabbing a URL using the browser instance.
driver.get("URL")

# Finding an element by ID
example_element = driver.find_element(By.ID, "Element ID")
print(example_element.text)

# Closing the browser instance
driver.quit()

It'll take some messing around to figure out how to utilize all of Selenium's capabilities in your code, but there's a lot of documentation (https://selenium-python.readthedocs.io) out there for figuring it all out.
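
For example, one capability you'll likely need on JavaScript-heavy pages is an explicit wait, since an element may not exist yet at the moment find_element runs. A minimal sketch using Selenium's WebDriverWait, with "chart-data" standing in as a hypothetical element ID:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("URL")

# Wait up to 10 seconds for the JavaScript-rendered element to appear,
# instead of failing immediately if it hasn't loaded yet.
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "chart-data"))
)
print(element.text)

driver.quit()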

How to extract all links from a website using Python

The site blocks requests that identify themselves as Python clients:

<h1>Access denied</h1>
<p>This website is using a security service to protect itself from online attacks.</p>

You can try adding a User-Agent header to your code, like below:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}
web = requests.get("https://www.jarir.com/", headers=headers)
soup = BeautifulSoup(web.text, "html.parser")

# Only iterate over <a> tags that actually have an href attribute.
for link in soup.find_all('a', href=True):
    print(link['href'])

The output is something like:

https://www.jarir.com/wishlist/
https://www.jarir.com/sales/order/history/
https://www.jarir.com/afs/
https://www.jarir.com/contacts/
tel:+966920000089
/cdn-cgi/l/email-protection#6300021106230902110a114d000c0e
https://www.jarir.com/faq/
https://www.jarir.com/warranty_policy/
https://www.jarir.com/return_exchange/
https://www.jarir.com/contacts/
https://www.jarir.com/terms-of-service/
https://www.jarir.com/privacy-policy/
https://www.jarir.com/storelocator/
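
Note that the output mixes absolute URLs, relative paths, and tel:/email links. If you only want crawlable absolute URLs, one option is to resolve each href against the page URL and filter by scheme. A minimal sketch building on the loop above:

from urllib.parse import urljoin, urlparse

base = "https://www.jarir.com/"
for link in soup.find_all('a', href=True):
    # Resolve relative paths such as /cdn-cgi/... against the base URL.
    url = urljoin(base, link['href'])
    # Keep only http(s) links, dropping tel: and mailto:-style entries.
    if urlparse(url).scheme in ("http", "https"):
        print(url)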


Related Topics



Leave a reply



Submit