Scraping Ajax Pages Using Python

Scraping ajax pages using python

First of all, the Scrapy docs are available at https://scrapy.readthedocs.org/en/latest/.

As for handling AJAX while web scraping, the idea is rather simple (a minimal sketch follows the list below):

  • open browser developer tools, network tab
  • go to the target site
  • click submit button and see what XHR request is going to the server
  • simulate this XHR request in your spider
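
For illustration, here is a minimal Scrapy sketch of that last step. The endpoint, payload, and response fields are hypothetical placeholders; replace them with whatever the network tab actually shows for your target site:

import scrapy


class AjaxSpider(scrapy.Spider):
    # Replays the XHR request observed in the browser's network tab.
    name = "ajax_example"

    def start_requests(self):
        # Hypothetical endpoint and form data; copy the real ones from dev tools.
        yield scrapy.FormRequest(
            url="https://example.com/ajax/search",
            formdata={"query": "shoes", "page": "1"},
            headers={"X-Requested-With": "XMLHttpRequest"},
            callback=self.parse,
        )

    def parse(self, response):
        # Many AJAX endpoints return JSON; adjust to the real response structure.
        for item in response.json().get("results", []):
            yield item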

Also see:

  • Can scrapy be used to scrape dynamic content from websites that are using AJAX?
  • Pagination using scrapy

Hope that helps.

Scrape ajax table from website using post request

You can't send that form data as a dictionary/json. Send it as a string and it should work:

import pandas as pd
import requests

s = requests.Session()
s.get('https://apps.usp.org/app/USPNF/columnsDB.html')
cookies = s.cookies.get_dict()

cookieStr = ''
for k, v in cookies.items():
    cookieStr += f'{k}={v};'

url = "https://apps.usp.org/ajax/USPNF/columnsDB.php"
headers = {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Connection": "keep-alive",
    "Content-Length": "201",
    "Content-Type": "application/x-www-form-urlencoded",
    "Cookie": cookieStr,
    "Host": "apps.usp.org",
    "Origin": "https://apps.usp.org",
    "Referer": "https://apps.usp.org/app/USPNF/columnsDB.html",
    "sec-ch-ua": "Not A;Brand ;v=99, Chromium;v=99, Google Chrome;v=99",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "Windows",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.141 Safari/537.36",
    "X-Powered-By": "CPAINT v2.1.0 :: http://sf.net/projects/cpaint",
}

final_df = pd.DataFrame()
nextPage = True

page = 0
while nextPage == True:
    i = page * 10
    payload = f'cpaint_function=updatePQRIResults&cpaint_argument[]=Acclaim%20120%20C18&cpaint_argument[]=1&cpaint_argument[]=0&cpaint_argument[]=0&cpaint_argument[]=2.8&cpaint_argument[]={i}&cpaint_response_type=OBJECT'

    response = s.post(url, data=payload, headers=headers).text

    df = pd.read_xml(response).iloc[3:-1, 3:]

    if (df.iloc[0]['psr'] == 0) and (len(df) == 1):
        nextPage = False
        final_df = final_df.drop_duplicates().reset_index(drop=True)

        print('Complete')

    else:
        final_df = pd.concat([final_df, df], axis=0)

        print(f'Page: {page + 1}')
        page += 1

Output:

print(final_df)
psr psf psn ... psvb psvc28 psvc70
0 0.0 0.00 Acclaim 120 C18 ... -0.027 0.086 -0.002
1 1.0 0.24 TSKgel ODS-100Z ... -0.031 -0.064 -0.161
2 2.0 0.67 Inertsil ODS-3 ... -0.023 -0.474 -0.334
3 3.0 0.74 LaChrom C18 ... -0.006 -0.278 -0.120
4 4.0 0.80 Prodigy ODS(3) ... -0.012 -0.195 -0.134
.. ... ... ... ... ... ... ...
753 753.0 29.55 Cosmosil 5PYE ... 0.092 0.521 1.318
754 754.0 30.44 BioBasic Phenyl ... 0.217 0.014 0.390
755 755.0 34.56 Microsorb-MV 100 CN ... -0.029 0.148 0.785
756 756.0 41.62 Inertsil ODS-EP ... 0.050 -0.620 -0.070
757 757.0 41.84 Flare C18+ ... 0.966 -0.507 1.178

[758 rows x 12 columns]

Scraping AJAX e-commerce site using python

Welcome to StackOverflow! You can inspect where the ajax request is being sent to and replicate that.

In this case the request goes to this API URL. You can then use requests to perform a similar request. Note, however, that this API endpoint requires a valid User-Agent header. You can use a package like fake-useragent or just hardcode a string for the agent.

import requests

# fake useragent
from fake_useragent import UserAgent
user_agent = UserAgent().chrome

# or hardcode
user_agent = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36'

url = 'https://shopee.com.my/api/v2/search_items/?by=relevancy&keyword=h370m&limit=50&newest=0&order=desc&page_type=search'
resp = requests.get(url, headers={
    'User-Agent': user_agent
})
data = resp.json()
products = data.get('items')

Scrape ajax pages

Okay, try the following script to get all the fields you wish to grab, traversing the entire exhibitor list:

import scrapy
from scrapy.selector import Selector


class MapYourShowSpider(scrapy.Spider):
    name = "mapyourshow"

    content_url = 'https://aaos22.mapyourshow.com/8_0/ajax/remote-proxy.cfm'
    inner_base = 'https://aaos22.mapyourshow.com/8_0/exhibitor/exhibitor-details.cfm?exhid={}'

    headers = {
        'x-requested-with': 'XMLHttpRequest',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    }
    params = {
        'action': 'search',
        'searchtype': 'exhibitorgallery',
        'searchsize': '557',
        'start': '0',
    }

    def start_requests(self):
        yield scrapy.FormRequest(
            url=self.content_url,
            method='GET',
            headers=self.headers,
            formdata=self.params,
            callback=self.parse,
        )

    def parse(self, response):
        for item in response.json()['DATA']['results']['exhibitor']['hit']:
            inner_link = self.inner_base.format(item['fields']['exhid_l'])
            yield scrapy.Request(
                url=inner_link,
                headers=self.headers,
                callback=self.parse_content,
            )

    def parse_content(self, response):
        elem = response.json()['DATA']['BODYHTML']
        sel = Selector(text=elem)
        title = sel.css("h2::text").get()
        try:
            address = ' '.join([' '.join(i.split()) for i in sel.css("p.showcase-address::text").getall()])
        except AttributeError:
            address = ""
        website = sel.css("a[title*='website']::text").get()
        phone = sel.xpath("normalize-space(//*[starts-with(@class,'showcase-web-phone')]/li[./*[.='Phone:']]/span/following::text())").get()
        yield {"title": title, "address": address, "website": website, "phone": phone}

Scraping an AJAX web page using python and requests

This works for me. I had to dig around in the dev tools, but I found it:

import requests

geturl = r'https://www.barchart.com/futures/quotes/CLJ19/all-futures'
apiurl = r'https://www.barchart.com/proxies/core-api/v1/quotes/get'

getheaders = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}

getpay = {
    'page': 'all'
}

s = requests.Session()
r = s.get(geturl, params=getpay, headers=getheaders)

headers = {
    'accept': 'application/json',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'referer': 'https://www.barchart.com/futures/quotes/CLJ19/all-futures?page=all',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    'x-xsrf-token': s.cookies.get_dict()['XSRF-TOKEN']
}

payload = {
    'fields': 'symbol,contractSymbol,lastPrice,priceChange,openPrice,highPrice,lowPrice,previousPrice,volume,openInterest,tradeTime,symbolCode,symbolType,hasOptions',
    'list': 'futures.contractInRoot',
    'root': 'CL',
    'meta': 'field.shortName,field.type,field.description',
    'hasOptions': 'true',
    'raw': '1'
}

r = s.get(apiurl, params=payload, headers=headers)
j = r.json()
print(j)

>{'count': 108, 'total': 108, 'data': [{'symbol': 'CLY00', 'contractSymbol': 'CLY00 (Cash)', ........

AJAX web scraping using python Requests

Besides the IFC-Cache-Header HTTP header that was missing in the first place, there is also a JWT token that is passed via the Authorization header.

To retrieve this token, you first need to extract values from the root page:

GET https://www.assetmanagement.hsbc.de/de/institutional-investors/fund-centre

which features the following javascript object:

window.HSBC.dpas = {
"pageInformation": {
"country": "X", <========= HERE
"language": "X", <========= HERE
"tokenIssue": {
"url": "/api/v1/token/issue",
},
"dataUrl": {
"url": "/api/v1/nav/funds",
"id": "XXXXXXXXXXXXXXXXXXXXXXXXXXXX" <========= HERE
},
....
}
}

You can extract the window.HSBC.dpas javascript object value using a regex and then reformat the string so that it becomes valid JSON.

These values are then passed in HTTP headers such as X-COUNTRY, X-COMPONENT and X-LANGUAGE to the following call:

GET https://www.assetmanagement.hsbc.de/api/v1/token/issue

It returns the JWT token directly; add it to the next request as an Authorization: Bearer {token} header:

GET https://www.assetmanagement.hsbc.de/api/v1/nav/funds

Example:

import requests
import re
import json

api_url = "https://www.assetmanagement.hsbc.de/api/v1"
funds_url=f"{api_url}/nav/funds"
token_url = f"{api_url}/token/issue"

# call the /fund-centre url to get the documentID value in the javascript
url = "https://www.assetmanagement.hsbc.de/de/institutional-investors/fund-centre?f=Yes&n=-1&v=Documents"
r = requests.get(url, params={
    "f": "Yes",
    "n": "-1",
    "v": "Documents"
})
# this gets the javascript object
res = re.search(r"^.*window\.HSBC\.dpas\s*=\s*([^;]*);", r.text, re.DOTALL)
group = res.group(1)

# convert to valid JSON: remove trailing commas: https://stackoverflow.com/a/56595068 (added "e")
regex = r'''(?<=[}\]"'e]),(?!\s*[{["'])'''
result_json = re.sub(regex, "", group, 0)

result = json.loads(result_json)
print(result["pageInformation"]["dataUrl"])

# call /token/issue API to get a token
r = requests.post(token_url, headers={
    "X-Country": result["pageInformation"]["country"],
    "X-Component": result["pageInformation"]["dataUrl"]["id"],
    "X-Language": result["pageInformation"]["language"]
}, data={})
token = r.text
print(token)

# call /nav/funds API
payload = {
    "appliedFilters": [[{"active": True, "id": "Yes"}]],
    "paging": {"fundsPerPage": -1, "currentPage": 1},
    "view": "Documents",
    "searchTerm": [],
    "selectedValues": [],
    "pageInformation": result["pageInformation"]
}
headers = {
    "IFC-Cache-Header": "de,de,inst,documents,yes,1,n-1",
    "Authorization": f"Bearer {token}"
}
r = requests.post(funds_url, headers=headers, json=payload)
print(r.content)

Try this on repl.it

How do you scrape AJAX pages?

Overview:

All screen scraping first requires a manual review of the page you want to extract resources from. When dealing with AJAX, you usually need to analyze a bit more than just the HTML.

With AJAX, this just means that the value you want is not in the initial HTML document you requested; instead, javascript is executed which asks the server for the extra information you want.

You can therefore usually just analyze the javascript, see which request it makes, and call that URL directly from the start.


Example:

Take this as an example: assume the page you want to scrape has the following script:

<script type="text/javascript">
function ajaxFunction()
{
  var xmlHttp;
  try
  {
    // Firefox, Opera 8.0+, Safari
    xmlHttp = new XMLHttpRequest();
  }
  catch (e)
  {
    // Internet Explorer
    try
    {
      xmlHttp = new ActiveXObject("Msxml2.XMLHTTP");
    }
    catch (e)
    {
      try
      {
        xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
      }
      catch (e)
      {
        alert("Your browser does not support AJAX!");
        return false;
      }
    }
  }
  xmlHttp.onreadystatechange = function()
  {
    if (xmlHttp.readyState == 4)
    {
      document.myForm.time.value = xmlHttp.responseText;
    }
  }
  xmlHttp.open("GET", "time.asp", true);
  xmlHttp.send(null);
}
</script>

Then all you need to do is make an HTTP request to time.asp on the same server instead. Example from w3schools.
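
As a rough illustration of that step in Python, here is a minimal sketch; the host is a placeholder, so swap in the server the original page (and its time.asp endpoint) is actually served from:

import requests

# Minimal sketch: call the endpoint the javascript would have called.
# "example.com" is a placeholder; use the host the original page was served from.
resp = requests.get("https://example.com/time.asp")
print(resp.text)  # the same text the AJAX callback would have written into the form field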


Advanced scraping with C++:

For complex usage, and if you're using C++, you could also consider using the Firefox javascript engine SpiderMonkey to execute the javascript on a page.

Advanced scraping with Java:

For complex usage, and if you're using Java, you could also consider using Rhino, the Mozilla javascript engine for Java.

Advanced scraping with .NET:

For complex usage, and if you're using .NET, you could also consider using the Microsoft.Vsa assembly, recently replaced with ICodeCompiler/CodeDOM.

scraping AJAX content on webpage with requests python

If you compare the two POST payloads, you will find they are almost identical except for a few parameters (draw=page... start=xx). That means you can scrape the AJAX data simply by modifying draw and start.

Edit: the data was transformed to a dictionary, so we do not need to urlencode it; we also don't need the cookie (I tested).

import requests
import json

headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Origin": "https://cafe.bithumb.com",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
    "DNT": "1",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Referer": "https://cafe.bithumb.com/view/boards/43",
    "Accept-Encoding": "gzip, deflate, br"
}

string = """columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&columns[2][data]=2&columns[2][name]=&columns[2][searchable]=true&columns[2][orderable]=false&columns[2][search][value]=&columns[2][search][regex]=false&columns[3][data]=3&columns[3][name]=&columns[3][searchable]=true&columns[3][orderable]=false&columns[3][search][value]=&columns[3][search][regex]=false&columns[4][data]=4&columns[4][name]=&columns[4][searchable]=true&columns[4][orderable]=false&columns[4][search][value]=&columns[4][search][regex]=false&start=30&length=30&search[value]=&search[regex]=false"""

article_root = "https://cafe.bithumb.com/view/board-contents/{}"

for page in range(1, 4):
    with requests.Session() as s:
        s.headers.update(headers)

        data = {"draw": page}
        data.update({ele[:ele.find("=")]: ele[ele.find("=") + 1:] for ele in string.split("&")})
        data["start"] = 30 * (page - 1)

        r = s.post('https://cafe.bithumb.com/boards/43/contents', data=data, verify=False)  # set verify=False while you are using fiddler

        json_data = json.loads(r.text).get("data")  # transform the response string to a dict so we can extract data more easily
        for each in json_data:
            url = article_root.format(each[0])
            print(url)

