Web scraping with Python
Use urllib.request (urllib2 on Python 2) in combination with the brilliant BeautifulSoup library:
from urllib.request import urlopen
from bs4 import BeautifulSoup  # BeautifulSoup4
# on Python 2 / BeautifulSoup 3 it was:
# import urllib2
# from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urlopen('http://example.com').read(), 'html.parser')
for row in soup('table', {'class': 'spad'})[0].tbody('tr'):
    tds = row('td')
    print(tds[0].string, tds[1].string)
# will print date and sunrise
Web scraping with Python without loading the whole page
Reverse engineer the API calls.
Analyze the network tab for the incoming and outgoing requests and view the response for each request. Alternatively, you can copy the request as cURL and use Postman to analyze it. Postman has a unique feature that generates Python code for the requests and urllib libraries. Most sites return a JSON response, but sometimes you may get an HTML response.
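As a sketch of handling both cases, you can branch on the Content-Type header before parsing (the header and body strings below are made-up examples):

```python
import json

def parse_response(content_type: str, body: str):
    """Return parsed JSON when the API answers with JSON, the raw HTML text otherwise."""
    if "application/json" in content_type:
        return json.loads(body)
    return body  # fall back to the raw HTML

print(parse_response("application/json", '{"items": []}'))  # {'items': []}
print(parse_response("text/html", "<html></html>"))         # <html></html>
```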
Some sites do not allow scraping. Make sure to check robots.txt for the website you will be scraping; you can find it at www.sitename.com/robots.txt.
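Python's standard library can evaluate those rules for you; a minimal sketch using urllib.robotparser (the robots.txt content and the sitename.com URLs are made-up examples - in practice, point the parser at the real file with RobotFileParser.set_url and read):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
robots_txt = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)

# Check whether a given user agent may fetch a given URL.
print(rp.can_fetch("*", "https://www.sitename.com/articles"))   # True
print(rp.can_fetch("*", "https://www.sitename.com/private/x"))  # False
```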
For more info - https://www.youtube.com/watch?v=LPU08ZfP-II&list=PLL2hlSFBmWwwvFk4bBqaPRV4GP19CgZug
Scrape websites with Python
Instead of using scrapy you can use urllib. Instead of beautifulsoup you can use regex. But scrapy and beautifulsoup make your life easier. Scrapy is not an easy library to learn, so you can use requests or urllib instead.
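For example, a minimal regex-based sketch of pulling cell values out of a table row (the HTML snippet is a made-up example; in practice you would fetch the page with urllib first):

```python
import re

# In practice: html = urllib.request.urlopen(url).read().decode()
html = "<tr><td>2021-11-12</td><td>06:41</td></tr>"

# Non-greedy match of everything between <td> and </td>.
cells = re.findall(r"<td>(.*?)</td>", html)
print(cells)  # ['2021-11-12', '06:41']
```

Regex works for small, predictable snippets like this, but it breaks easily on real-world HTML, which is exactly why beautifulsoup makes your life easier.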
Web Scraping html using python
As already explained, the data is loaded by an API. You can use the same API to extract the details using requests. I have only tried it for page 1.
import requests
response = requests.get("https://www.kucoin.com/_api/cms/articles?page=1&pageSize=10&category=listing&lang=en_US")
jsoncode = response.json()
options = jsoncode['items']
for option in options:
    title = option['title']
    date = option['summary']  # the summary field holds the trading date
    print(f"{title} : {date}")
Cryowar (CWAR) Gets Listed on KuCoin! World Premiere! : Trading: 14:00 on November 12, 2021 (UTC)
Deeper Network (DPR) Gets Listed on KuCoin! : Trading: 06:00 on November 12, 2021 (UTC)
Vectorspace AI (VXV) Gets Listed on KuCoin! : Trading: 8:00 on November 12, 2021 (UTC)
...
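Since the endpoint takes page and pageSize query parameters, paging is just a loop over page numbers; a small sketch of building the per-page URLs (whether later pages return data in the same shape is untested, per the note above):

```python
BASE = "https://www.kucoin.com/_api/cms/articles"

def page_url(page: int, page_size: int = 10) -> str:
    """Build the listing-articles URL for a given page."""
    return f"{BASE}?page={page}&pageSize={page_size}&category=listing&lang=en_US"

# Then: for page in range(1, 5): response = requests.get(page_url(page)) ...
print(page_url(2))
```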
Web scraping with Python requests POST request
This page sends a cookie PHPSESSIONID, and in the HTML it sends a token like this:
<script>token = "NDQ4MTg3MjMw"
It uses JavaScript to read this value and add it to the headers as
num: NDQ4MTg3MjMw
The server needs both PHPSESSIONID and num to send data.
Every connection creates a new PHPSESSIONID and token - so you could hardcode some values in your code, but the session ID may be valid only for a few minutes, and it is better to get fresh values with a GET request before the POST request.
So you have to use requests.Session to work with cookies, and first send a GET to https://vahaninfos.com/vehicle-details-by-number-plate to get the cookie PHPSESSIONID and the HTML with <script>token = "...".
Next you have to extract this token from the HTML - e.g. using regex - and add it as the header num: ... in the POST request.
It seems the other headers are not important - not even X-Requested-With.
This page needs the data sent as a form, so you need data=payload instead of data=json.dumps(payload). requests then creates the Content-Type and Content-Length headers with correct values automatically.
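To see the difference on the wire, compare the form encoding that data=payload produces (requests encodes form data essentially like urllib.parse.urlencode) with the JSON encoding that data=json.dumps(payload) would send:

```python
import json
from urllib.parse import urlencode

payload = {"number": "UP32AT5472", "g-recaptcha-response": ""}

# data=payload -> form-encoded body, Content-Type: application/x-www-form-urlencoded
print(urlencode(payload))   # number=UP32AT5472&g-recaptcha-response=

# data=json.dumps(payload) -> JSON string body, which this server does not expect
print(json.dumps(payload))  # {"number": "UP32AT5472", "g-recaptcha-response": ""}
```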
import requests
import re
session = requests.Session()
# --- GET ---
headers = {
# "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0",
}
url = "https://vahaninfos.com/vehicle-details-by-number-plate"
res = session.get(url, headers=headers, verify=False)
number = re.findall('token = "([^"]*)"', res.text)[0]
# --- POST ---
headers = {
# "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0",
# "X-Requested-With": "XMLHttpRequest",
'num': number,
}
payload = {
"number": "UP32AT5472",
"g-recaptcha-response": "",
}
url = "https://vahaninfos.com/getdetails.php"
res = session.post(url, data=payload, headers=headers, verify=False)
print(res.text)
Result:
<tr><td>Registration Number</td><td>:</td><td>UP32AT5472</td></tr>
<tr><td>Registration Authority</td><td>:</td><td>LUCKNOW</td></tr>
<tr><td>Registration Date</td><td>:</td><td>2003-06-06</td></tr>
<tr><td>Chassis Number</td><td>:</td><td>487530</td></tr>
<tr><td>Engine Number</td><td>:</td><td>490062</td></tr>
<tr><td>Fuel Type</td><td>:</td><td>PETROL</td></tr>
<tr><td>Engine Capacity</td><td>:</td><td></td></tr>
<tr><td>Model/Model Name</td><td>:</td><td>TVS VICTOR</td></tr>
<tr><td>Color</td><td>:</td><td></td></tr>
<tr><td>Owner Name</td><td>:</td><td>HARI MOHAN PANDEY</td></tr>
<tr><td>Ownership Type</td><td>:</td><td></td></tr>
<tr><td>Financer</td><td>:</td><td>CENTRAL BANK OF INDIA</td></tr>
<tr><td>Vehicle Class</td><td>:</td><td>M-CYCLE/SCOOTER(2WN)</td></tr>
<tr><td>Fitness/Regn Upto</td><td>:</td><td></td></tr>
<tr><td>Insurance Company</td><td>:</td><td>NATIONAL INSURANCE CO LTD.</td></tr>
<tr><td>Insurance Policy No</td><td>:</td><td>4165465465465</td></tr>
<tr><td>Insurance expiry</td><td>:</td><td>2004-06-05</td></tr>
<tr><td>Vehicle Age</td><td>:</td><td></td></tr>
<tr><td>Vehicle Type</td><td>:</td><td></td></tr>
<tr><td>Vehicle Category</td><td>:</td><td></td></tr>
Now you can use beautifulsoup or lxml (or another module) to get the values from the HTML.
from bs4 import BeautifulSoup
soup = BeautifulSoup(res.text, 'html.parser')
for row in soup.find_all('tr'):
    cols = row.find_all('td')
    key = cols[0].text
    val = cols[-1].text
    print(f'{key:22} | {val}')
Result:
Registration Number | UP32AT5472
Registration Authority | LUCKNOW
Registration Date | 2003-06-06
Chassis Number | 487530
Engine Number | 490062
Fuel Type | PETROL
Engine Capacity |
Model/Model Name | TVS VICTOR
Color |
Owner Name | HARI MOHAN PANDEY
Ownership Type |
Financer | CENTRAL BANK OF INDIA
Vehicle Class | M-CYCLE/SCOOTER(2WN)
Fitness/Regn Upto |
Insurance Company | NATIONAL INSURANCE CO LTD.
Insurance Policy No | 4165465465465
Insurance expiry | 2004-06-05
Vehicle Age |
Vehicle Type |
Vehicle Category |
EDIT:
After running the code a few times, the POST started sending me only the value R - maybe it needs some other headers to hide the bot (e.g. User-Agent), or maybe it sometimes needs to send a correct ReCaptcha response.
At least in Chrome it stops sending R when I solve the ReCaptcha, but Firefox still gets R. Originally I was using the User-Agent from my Firefox, and the server may remember it.
EDIT:
If I use a User-Agent different from my Firefox one, then the code again gets correct values, while Firefox still gets only R.
headers = {
"User-Agent": "Mozilla/5.0",
}
So it seems the code may need to use a random User-Agent in every request to hide the bot.
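A minimal sketch of rotating the User-Agent per request (the strings below are just example desktop User-Agents; you could also use a package like fake-useragent):

```python
import random

# A few example desktop User-Agent strings; extend the list as needed.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15",
]

def random_headers() -> dict:
    """Return headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

headers = random_headers()
print(headers["User-Agent"] in USER_AGENTS)  # True
```

You would then pass random_headers() to every session.get / session.post call.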
Web scraping with Python on a dynamic JavaScript website
The website makes 3 API calls in order to get the data. The code below does the same and gets the data.
(In the browser, press F12 -> Network -> XHR to see the API calls.)
import requests

payload1 = {'language': 'ca', 'documentId': 680124}
r1 = requests.post('https://portaldogc.gencat.cat/eadop-rest/api/pjc/getListTraceabilityStandard', data=payload1)
if r1.status_code == 200:
    print(r1.json())
    print('------------------')

payload2 = {'documentId': 680124, 'orderBy': 'DESC', 'language': 'ca', 'traceability': '02'}
r2 = requests.post('https://portaldogc.gencat.cat/eadop-rest/api/pjc/getListValidityByDocument', data=payload2)
if r2.status_code == 200:
    print(r2.json())
    print('------------------')

payload3 = {'documentId': 680124, 'traceabilityStandard': '02', 'language': 'ca'}
r3 = requests.post('https://portaldogc.gencat.cat/eadop-rest/api/pjc/documentPJC', data=payload3)
if r3.status_code == 200:
    print(r3.json())