How to Scrape Pages Which Have Lazy Loading

How to scrape pages which have lazy loading

You have 2 options here:

  1. Switch from pure HTTP scraping to a tool that supports JavaScript evaluation, such as Capybara (with a proper driver selected). This can be slow, since you're running a headless browser under the hood, and you'll have to set timeouts or find another way to make sure the blocks of text you're interested in have loaded before you start scraping (see the sketch after this list).

  2. Use the Web Developer console to figure out how those blocks of text are loaded (which AJAX calls are made, with which parameters, etc.) and replicate those calls in your scraper. This is a more advanced approach, but it performs much better, since you avoid the extra work described in option 1.
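For illustration only, here is a minimal sketch of option 1 using Selenium in Python rather than Capybara (my substitution, not part of the original answer); the URL and CSS selector are placeholders. The point is to wait explicitly for the lazily loaded block rather than sleeping for a fixed time:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('--headless')           # run the browser without a visible window
driver = webdriver.Chrome(options=options)

driver.get('https://example.com/lazy-page')  # placeholder URL

# Wait until the lazily loaded block is actually present instead of sleeping blindly.
block = WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.lazy-block'))  # placeholder selector
)
print(block.text)

driver.quit()

The same idea applies in Capybara: its matchers (such as have_css) wait for the content to appear before the scraper proceeds.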

Have a nice day!

UPDATE:

Your code above doesn't work because the response is HTML wrapped in a JSON object, while you're trying to parse it as raw HTML. It looks like this:

{
  "error": 0,
  "msg": "request successful",
  "paidDocIds": "some ids here",
  "itemStartIndex": 20,
  "lastPageNum": 50,
  "markup": 'LOTS AND LOTS AND LOTS OF MARKUP'
}

What you need to do is unwrap the JSON and then parse the markup as HTML:

require 'json'
require 'open-uri'
require 'nokogiri'

json = JSON.parse(open(url).read) # make sure you check http errors here
html = json['markup'] # can this field be empty? check for the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like

I'd also advise against using open-uri, since your code may become vulnerable if you use dynamic URLs because of the way open-uri works (read the linked article for the details); use better, more feature-rich libraries such as HTTParty or RestClient instead.

UPDATE 2: Minimal working script for me:

require 'json'
require 'open-uri'
require 'nokogiri'

url = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=2'

json = JSON.parse(open(url).read) # make sure you check http errors here
html = json['markup'] # can this field be empty? check for the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like
puts doc.at_css('#newphoto10').attr('title')
# => Dr Raaj Batra Lal Kitab Expert in East Patel Nagar, Delhi

Scrape data from lazy loading page

Scrolling down means the data is being generated by JavaScript, so you have more than one option here.
The first one is to use Selenium.
The second one is to send the same AJAX request the website uses, as follows:

import requests

def get_source(page_num=1):
    url = 'https://www.ajio.com/api/category/830216001?fields=SITE&currentPage={}&pageSize=45&format=json&query=%3Arelevance%3Abrickpattern%3AWashed&sortBy=relevance&gridColumns=3&facets=brickpattern%3AWashed&advfilter=true'
    res = requests.get(url.format(page_num), headers={'User-Agent': 'Mozilla/5.0'})
    if res.status_code == 200:
        return res.json()

# data = get_source(page_num = 1)
# total_pages = data['pagination']['totalPages'] # total pages are 111
prodpage = []
for i in range(1, 112):
    print(f'Getting page {i}')
    data = get_source(page_num=i)['products']
    for item in data:
        prodpage.append('https://www.ajio.com{}'.format(item['url']))
    if i == 3: break

print(len(prodpage)) # output 135 for 3 pages

Web scraping images that get lazy loaded if not in the viewport

Using BeautifulSoup or Selenium for this is far more than what's required and, as I'm sure you've already discovered, pretty cumbersome for this particular use case.

The easier and cleaner thing to do is this: if you open your browser's network traffic logger and view only the XHR (XmlHttpRequest) requests, you'll see that every time you scroll down and new products start loading, your browser makes an HTTP POST request to this API: https://mnrwefss2q-dsn.algolia.net/1/indexes/*/queries

If you simply imitate that POST request to that API, using the same query string and POST payload, you can get all the product information you could ever want, including URLs to the product images - and it's all in JSON.

For whatever reason, the API doesn't care about request headers, but that's fine. It's just the query string and the POST payload that it cares about. You can also change the hitsPerPage key-value pair to change the number of products requested. By default it seems to load 40 new products each time, but you can change that number to whatever you want:

def main():

    import requests
    from urllib.parse import urlencode

    url = "https://mnrwefss2q-dsn.algolia.net/1/indexes/*/queries"

    params = {
        "x-algolia-agent": "Algolia for JavaScript (3.35.1); Browser; react (16.13.1); react-instantsearch (6.6.0); JS Helper (3.1.2)",
        "x-algolia-application-id": "MNRWEFSS2Q",
        "x-algolia-api-key": "a3a4de2e05d9e9b463911705fb6323ad"
    }

    post_json = {
        "requests": [
            {
                "indexName": "Listing_production",
                "params": urlencode({
                    "highlightPreTag": "<ais-highlight-0000000000>",
                    "highlightPostTag": "</ais-highlight-0000000000>",
                    "maxValuesPerFacet": "100",
                    "hitsPerPage": "40",
                    "filters": "",
                    "page": "4",
                    "query": "",
                    "facets": "[\"designers.name\",\"category_path\",\"category_size\",\"price_i\",\"condition\",\"location\",\"badges\",\"strata\"]",
                    "tagFilters": "",
                    "facetFilters": "[[\"category_path:footwear.hitop_sneakers\"],[\"designers.name:Jordan Brand\"]]",
                    "numericFilters": "[\"price_i>=0\",\"price_i<=99999\"]"
                })
            }
        ]
    }

    response = requests.post(url, params=params, json=post_json)
    response.raise_for_status()

    results = response.json()["results"]
    items = results[0]["hits"]

    for item in items:
        print(f"{item['title']} - price: ${item['price']}")
        print(f"Image URL: \"{item['cover_photo']['url']}\"\n")

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

Air Jordan 13 Retro Grey Toe 2014 - price: $150
Image URL: "https://cdn.fs.grailed.com/api/file/HZfvq06fSYOZWvB6OTxA"

Air Jordan 5 Retro Grape 2013 - price: $300
Image URL: "https://cdn.fs.grailed.com/api/file/nwQmfUzITOSVa2Qg5gCt"

Air Jordan 11 BG Legend Blue - price: $243
Image URL: "https://cdn.fs.grailed.com/api/file/AKYdESePSdK1XqDEqYMr"

Air Jordan 12 Retro Cool Grey 2012 - price: $200
Image URL: "https://cdn.fs.grailed.com/api/file/oAl5cxdCSPyCQaXUKPRm"

Air Jordan 11 Retro GS Space Jam 2009 - price: $162
Image URL: "https://cdn.fs.grailed.com/api/file/oRAMFOMTeu9fkWp5l640"

Jordan 1 Flight Mens High Tops Shoes - Size 10.5 White - price: $50
Image URL: "https://cdn.fs.grailed.com/api/file/hRZFggEyRImxzi1T623p"

Air Jordan 1 Retro High OG Royal Toe Sz 8.5 - price: $400
Image URL: "https://cdn.fs.grailed.com/api/file/MWwBxHyNRDCuDZ4Pc2LC"

Air Jordan 14 GS Black Toe 2014 Size 4Y (5.5 Womans) - price: $58
Image URL: "https://cdn.fs.grailed.com/api/file/FzV1GrMGSRqPjV0dPFcB"

Air Jordan 5 Retro Fire Red 2020 - price: $250
Image URL: "https://cdn.fs.grailed.com/api/file/vGda4X6qReBq42muTatG"

Air Jordan 6 Retro All Star Chameleon - price: $70
Image URL: "https://cdn.fs.grailed.com/api/file/wk1ySwZDQqWHtCqvi9I0"

Air Jordan 5 Retro GS Oreo - price: $64
Image URL: "https://cdn.fs.grailed.com/api/file/rWrcrhdiSBG4hvRZ53aS"

Air Jordan 11 Retro GS Bred 2012 - price: $76
Image URL: "https://cdn.fs.grailed.com/api/file/ceUppZDSo6YgvPM6OoMg"

Air Jordan 1 Mid (GS) 6Y (5.5 Uk) - price: $115
Image URL: "https://cdn.fs.grailed.com/api/file/cRFIYqE8TKOSAiKaV2uA"

Air Jordan 7 Retro GS Pure Money 3.5Y - price: $87
Image URL: "https://cdn.fs.grailed.com/api/file/uXZIVZMQQain0pUSyJIe"

Air Jordan 5 Retro Olympic 2011 - price: $120
Image URL: "https://cdn.fs.grailed.com/api/file/D7E4GvaJSiywv3vaz3A5"

J12 Grey/University Blue - price: $117
Image URL: "https://cdn.fs.grailed.com/api/file/0T9GNBSUTDSGQsqrg4IS"

1994 Jordan 1 Bred - price: $801
Image URL: "https://cdn.fs.grailed.com/api/file/rfQ68e8PRDOpKN6zwwyw"

Nike Air Jordan Retro 13 He Got Game HGG - price: $90
Image URL: "https://cdn.fs.grailed.com/api/file/k9N12MAQnKcKk3Vl0dek"

Nike Air Jordan Retro 13 XIII Grey Toe He Got Game Playoff - price: $90
Image URL: "https://cdn.fs.grailed.com/api/file/wo1hKnUQeKt0lioebLgG"

Nike Air Jordan 1 Retro Countdown Pack 2008 Vintage - price: $429
Image URL: "https://cdn.fs.grailed.com/api/file/ktAH2VmbTgaG6zdtzHrw"

Air Jordan 1 Retro High OG Bred Toe - price: $495
Image URL: "https://cdn.fs.grailed.com/api/file/a41NTCSXTgGm7KyqkTqB"

Air Jordan 10 Retro 2018 Orlando 2018 - price: $121
Image URL: "https://cdn.fs.grailed.com/api/file/dMAbTBYVSYSX6KNqVvQA"

Air Jordan 12 Retro Winterized Triple Black 2018 - price: $84
Image URL: "https://cdn.fs.grailed.com/api/file/Jg69Af0QrelcZxWM2OEW"

Air Jordan 6 Retro Diffused Blue 2018 - price: $145
Image URL: "https://cdn.fs.grailed.com/api/file/bVPC1SomTKO3yPLCm0rC"

AIR JORDAN 7 RETRO "2005 CARDINAL" - price: $57
Image URL: "https://cdn.fs.grailed.com/api/file/QzvDCMVRqykY1CnkrMwC"

Air Jordan 5 Retro Camo 2017 - price: $220
Image URL: "https://cdn.fs.grailed.com/api/file/eLDeagz4TF6Yu5aItPAE"

Air Jordan 1 Retro High Grand Purple 2009 - price: $261
Image URL: "https://cdn.fs.grailed.com/api/file/XhhIFIQ5SyjNQkEXpQjK"

Air Jordan 6 Retro 2015 Maroon 2015 - price: $150
Image URL: "https://cdn.fs.grailed.com/api/file/5UTk9ctnTnSsZEixeYwW"

ORIGINAL 1985 WEARABLE Chicago OG Air Jordan 1 Last Dance - price: $2268
Image URL: "https://cdn.fs.grailed.com/api/file/CE4eMehmQvOYtpx16R3X"

Air Jordan 5 Retro Top 3 - price: $247
Image URL: "https://cdn.fs.grailed.com/api/file/B1A8oopSR226r2rmOjfR"

AIR JORDAN 4 METALLIC RED - price: $350
Image URL: "https://cdn.fs.grailed.com/api/file/ZHJVHw7fRC2jO5wDj7JF"

Air Jordan 1 Retro Mid Size 7 GS White Gym Red - price: $67
Image URL: "https://cdn.fs.grailed.com/api/file/XPMpEjlZSmCNikSeHjg4"

Air Jordan 5 Retro Metallic White 2015 Size 13 - price: $57
Image URL: "https://cdn.fs.grailed.com/api/file/LHvMnyxxQaOCyYiTSvHw"

Air Jordan 8 Retro C&C Trophy 2016 - price: $115
Image URL: "https://cdn.fs.grailed.com/api/file/7sh2iU42T6kUKjJaAyE8"

JIRDAN 7 UNIVERSITY BLUE - price: $120
Image URL: "https://cdn.fs.grailed.com/api/file/zJzKkd1QyKvMasw1ZWRB"

AIR JORDAN 3 5LAB3 - price: $70
Image URL: "https://cdn.fs.grailed.com/api/file/IJvMAIeMTPycUk7n0idU"

Jordan 6 rings UNC - price: $189
Image URL: "https://cdn.fs.grailed.com/api/file/nE6A02dKQBa7ZTSJWGu3"

Nike Air Jordan 5 Retro Top 3 - price: $255
Image URL: "https://cdn.fs.grailed.com/api/file/UfcVOOO0QJSM6csHc1cK"

Nike Air Jordan 13 White Pink Soar Aurora Green (GS) - price: $155
Image URL: "https://cdn.fs.grailed.com/api/file/58rVbApeR8K8Ojd1Zskg"

Jordan Thunder 4 Retro 2012 - price: $350
Image URL: "https://cdn.fs.grailed.com/api/file/ZJ1EuskfThGImbij5QkS"


EDIT - Here is the updated code which looks at the other API:

def main():

    import requests
    from urllib.parse import urlencode

    url = "https://mnrwefss2q-1.algolianet.com/1/indexes/Listing_production/browse"

    params = {
        "x-algolia-agent": "Algolia for JavaScript (3.35.1); Browser",
        "x-algolia-application-id": "MNRWEFSS2Q",
        "x-algolia-api-key": "a1c6338ffe41249d0284a5a1105eafe4"
    }

    post_json = {
        "params": "query=&" + urlencode({
            "offset": "0",
            "length": "100",
            "facetFilters": "[[\"category_path:footwear.hitop_sneakers\"], [\"designers.name:Jordan Brand\"]]",
            "filters": ""
        })
    }

    response = requests.post(url, params=params, json=post_json)
    response.raise_for_status()

    items = response.json()["hits"]

    # ...

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())

How to scrape this website's items where the lazy load pauses with each scroll before loading the items?

No need to use Selenium. You can get the data through a POST request and just change the page parameter to get more; essentially, that's what happens when you scroll. You can then change the 'value' parameter to go through your list of search terms as well.

import requests
import pandas as pd

url = 'https://api.sayurbox.io/graphql'
headers = {
    'authorization': 'eyJhbGciOiJSUzI1NiIsImtpZCI6ImY4NDY2MjEyMTQxMjQ4NzUxOWJiZjhlYWQ4ZGZiYjM3ODYwMjk5ZDciLCJ0eXAiOiJKV1QifQ.eyJhbm9ueW1vdXMiOnRydWUsImF1ZCI6InNheXVyYm94LWF1ZGllbmNlIiwiYXV0aF90aW1lIjoxNjUwNTUxMDYxLCJleHAiOjE2NTMxNDMwNjEsImlhdCI6MTY1MDU1MTA2MSwiaXNzIjoiaHR0cHM6Ly93d3cuc2F5dXJib3guY29tIiwibWV0YWRhdGEiOnsiZGV2aWNlX2luZm8iOm51bGx9LCJuYW1lIjpudWxsLCJwaWN0dXJlIjpudWxsLCJwcm92aWRlcl9pZCI6ImFub255bW91cyIsInNpZCI6IjFjNDE1ODFiLWQzMjItNDFhZi1hOWE5LWE4YTQ4OTZkODMxZiIsInN1YiI6InFSWXF2OFV2bEFucVR3NlE1NGhfbHdTNFBvTk8iLCJ1c2VyX2lkIjoicVJZcXY4VXZsQW5xVHc2UTU0aF9sd1M0UG9OTyJ9.MSmOz0mAe3UjhH9KSRp-fCk65tkTUPlxiJrRHweDEY2vqBSnUP43TO8ug3P38x8igxC4qguCOlwCTCPfUEWFhr3X8ePY7u7I7D22tV1LOF7Tm6T8PuLzHbmlBTgPK9C_GJpXwLAKnD2A535r-9DttYGt4QytIeWua8NKyW_riURfWGnhZBBMjEPeVPJBqGn1jMtZoh_iUeRb-kWccJ8IhBDQr0T1Op6IDMJuw3x6uf1Ks_SVqEVA0ZGIM1GVwuyZ87JYT4kqITNgi6yNy69jVH6gDFqBkTwJ7ZNWj8NCQsaRfh03bZROZzY9MeCtL6if_8D9newYZagyZu5mKTJNzg',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36'}

rows = []
for page in range(1, 10):
    print(page)
    payload = {
        'operationName': "getCatalogVariant",
        'query': "query getCatalogVariant($deliveryDate: String!, $deliveryArea: String!, $deliveryCode: String, $limit: Int!, $page: Int!, $type: CatalogType, $value: String) {\n catalogVariantList(deliveryDate: $deliveryDate, deliveryArea: $deliveryArea, deliveryCode: $deliveryCode, limit: $limit, page: $page, type: $type, value: $value) {\n limit\n page\n size\n hasNextPage\n category {\n displayName\n }\n list {\n key\n availability\n categories\n farmers {\n image\n name\n }\n image {\n md\n sm\n lg\n }\n isDiscount\n discount\n labelDesc\n labelName\n maxQty\n name\n displayName\n nextAvailableDates\n packDesc\n packNote\n price\n priceFormatted\n actualPrice\n actualPriceFormatted\n shortDesc\n stockAvailable\n type\n emptyMessageHtml\n promoMessageHtml\n }\n }\n}\n",
        'variables': {
            'deliveryArea': "Jabodetabek",
            'deliveryCode': "JK01",
            'deliveryDate': "Friday, 22 April 2022",
            'limit': 12,
            'page': page,
            'type': "SEARCH",
            'value': "ayam"}}

    jsonData = requests.post(url, headers=headers, json=payload).json()
    items = jsonData['data']['catalogVariantList']['list']

    rows += items

df = pd.DataFrame(rows)

Output:

print(df)
key ... promoMessageHtml
0 Sreeya Sayap Ayam Frozen 500 gram ... None
1 SunOne Kulit Ayam 1 kg ... None
2 Bundling Ayam & Pisau 1 pack ... Promo!! maksimal 5
3 SunOne Hati Ayam 1 kg ... Hanya tersedia 1
4 Wellfed Daging Ayam Giling 250 gram ... None
.. ... ... ...
103 Frozchick Ayam Bumbu Kecap 400 gram ... Hanya tersedia 5
104 Sasa Larasa Bumbu Ungkep Ayam Kalasan 33 gram ... Promo!! maksimal 5
105 Bundling Indomie Kuah Ayam Bawang 69 gram 5 pcs ... Promo!! maksimal 7
106 Bundling MPASI Dada Ayam 1 pack ... Promo!! maksimal 10
107 Berkah Chicken Paha Bawah Probiotik Organik 55... ... Promo!! maksimal 10

[108 rows x 24 columns]

Selenium in Python: Run scraping code after all lazy-loading components are loaded

Selenium was not designed for web scraping (although it can be useful in complicated cases). In your case, open F12 -> Network and look at the XHR tab while you scroll down the page. You can see that the queries being added contain the year in their URLs. So the page generates subqueries when you scroll down and reach other years.

Look at the Response tab to find the divs and classes and build your BeautifulSoup find_all calls.
A simple little loop through the years with requests and BeautifulSoup is enough:

import requests as rq
from bs4 import BeautifulSoup as bs

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0"}

resultats = []

for year in range(1998, 2021 + 1):

    url = "https://www.ecb.europa.eu/press/pressconf/%s/html/index_include.en.html" % year
    resp = rq.get(url, headers=headers)
    soup = bs(resp.content, "html.parser")

    titles = map(lambda x: x.text, soup.find_all("div", {"class": "title"}))
    subtitles = map(lambda x: x.text, soup.find_all("div", {"class": "subtitle"}))
    dates = map(lambda x: x.text, soup.find_all("dt"))

    zipped = list(zip(dates, titles, subtitles))
    resultats.extend(zipped)

resultats contains:

...
('8 November 2012',
'Mario Draghi, Vítor Constâncio:\xa0Introductory statement to the press conference (with Q&A)',
'Mario Draghi, President of the ECB, Vítor Constâncio, Vice-President of the ECB, Frankfurt am Main, 8 November 2012'),
('4 October 2012',
'Mario Draghi, Vítor Constâncio:\xa0Introductory statement to the press conference (with Q&A)',
'Mario Draghi, President of the ECB, Vítor Constâncio, Vice-President of the ECB, Brdo pri Kranju, 4 October 2012'),
...

How to scrape lazy loading images using Python Scrapy

The problem is that the lazy loading is done by JavaScript, which Scrapy can't handle; CasperJS can handle this.

To make this work with Scrapy, you have to mix it with Selenium or scrapyjs.
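For a concrete idea of the Selenium route, here is a minimal, untested sketch (my own illustration, not from the original answer) of a Scrapy downloader middleware that renders each page with Selenium before handing it to the spider. The class name is made up; you would register it under DOWNLOADER_MIDDLEWARES in settings.py:

from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumRenderMiddleware:
    # Illustrative middleware: render requests in a headless browser so
    # lazy-loaded content is present in the HTML the spider parses.
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        self.driver = webdriver.Chrome(options=options)

    def process_request(self, request, spider):
        # Let the browser fetch the page and execute its JavaScript,
        # then return the rendered HTML to Scrapy as the response.
        self.driver.get(request.url)
        return HtmlResponse(request.url, body=self.driver.page_source,
                            encoding='utf-8', request=request)

scrapy-splash (the successor to scrapyjs) achieves the same thing with a Splash rendering service instead of a full browser.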

How to get all the data from a webpage that uses a lazy-loading method?

I guess you could use Selenium for this, but if speed is your concern (and since @Andersson already crafted the code for you in another question on Stack Overflow), you should instead replicate the API calls that the site uses and extract the data from the JSON, like the site does.

If you use the Chrome Inspector, you'll see that for each of the categories in your outer while-loop (the try-block in your original code), the site calls an API that returns the overall categories of the site. All this data can be retrieved like so:

import requests

categories_api = 'https://api.redmart.com/v1.5.8/catalog/search?extent=0&depth=1'
r = requests.get(categories_api).json()

For the next API calls you need to grab the uris concerning the bakery stuff. This can be done like so:

bakery_item = [e for e in r['categories'] if e['title'] == 'Bakery']
children = bakery_item[0]['children']
uris = [c['uri'] for c in children]

uris will now be a list of strings (['bakery-bread', 'breakfast-treats-212', 'sliced-bread-212', 'wraps-pita-indian-breads', 'rolls-buns-212', 'baked-goods-desserts', 'loaves-artisanal-breads-212', 'frozen-part-bake', 'long-life-bread-toast', 'speciality-212']) that you'll pass on to another API, found with the Chrome Inspector, which the site uses to load content.

This API has the following form (by default it returns a smaller pageSize, but I bumped it to 500 to be reasonably sure you get all the data in one request):

items_API = 'https://api.redmart.com/v1.5.8/catalog/search?pageSize=500&sort=1024&category={}'

for uri in uris:
    r = requests.get(items_API.format(uri)).json()
    products = r['products']
    for product in products:
        name = product['title']
        # testing for promo_price - if it's 0.0 go with the normal price
        price = product['pricing']['promo_price']
        if price == 0.0:
            price = product['pricing']['price']
        print("Name: {}. Price: {}".format(name, price))

Edit: If you still want to stick with Selenium, you could insert something like this to handle the lazy loading. Questions on scrolling have been answered several times before, so yours is actually a duplicate. In the future you should showcase what you tried (your own effort on the execute part) and show the traceback.

check_height = driver.execute_script("return document.body.scrollHeight;")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    height = driver.execute_script("return document.body.scrollHeight;")
    if height == check_height:
        break
    check_height = height
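
If the fixed time.sleep(5) feels wasteful, one variation (my sketch, not part of the answer above; it assumes the same driver object, and '.item' is a placeholder selector for the lazily loaded rows) is to wait until the number of loaded elements grows instead of sleeping for a fixed interval:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

while True:
    count = len(driver.find_elements(By.CSS_SELECTOR, '.item'))
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        # Wait up to 10 seconds for more items to appear after scrolling.
        WebDriverWait(driver, 10).until(
            lambda d: len(d.find_elements(By.CSS_SELECTOR, '.item')) > count
        )
    except TimeoutException:
        break  # no new items loaded within the timeout; assume we reached the end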

