Can Scrapy Be Used to Scrape Dynamic Content from Websites That Are Using Ajax

Can scrapy be used to scrape dynamic content from websites that are using AJAX?

WebKit-based browsers (like Google Chrome or Safari) have built-in developer tools. In Chrome you can open them via Menu->Tools->Developer Tools. The Network tab lets you see all the information about every request and response:

[Screenshot: the Network tab in Chrome's Developer Tools, filtered to XHR requests]

At the bottom of the picture you can see that I've filtered the requests down to XHR. These are requests made by JavaScript code.

Tip: the log is cleared every time you load a page. The black dot button at the bottom of the picture preserves the log.

After analyzing the requests and responses, you can simulate these requests from your web crawler and extract the data you need. In many cases this is easier than parsing HTML, because the data contains no presentation logic and is already formatted for consumption by JavaScript code.
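As a rough illustration of why an XHR response is often easier to consume than HTML, here is a minimal sketch that pulls values out of a JSON body. The payload shape (an `items` list with `title` fields) and the function name `extract_titles` are invented for this example; a real endpoint has its own structure, which you can inspect in the Network tab.

```python
import json

# Hypothetical response body, standing in for what an XHR endpoint might return.
SAMPLE_BODY = '{"items": [{"id": 1, "title": "First"}, {"id": 2, "title": "Second"}]}'


def extract_titles(body):
    """Return the 'title' of every entry under the (assumed) 'items' key."""
    data = json.loads(body)
    return [item["title"] for item in data.get("items", [])]


print(extract_titles(SAMPLE_BODY))
```

In a Scrapy callback the same parsing would typically be applied to `response.text`; no selectors are needed because the structure is already machine-readable.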

Firefox has a similar extension called Firebug. Some will argue that Firebug is even more powerful, but I like the simplicity of WebKit.

Scrapy for dynamic content

Open your browser's developer tools and look at the Network tab. If you hit the "next" button on that page enough, it'll send out a new request:

examples.json

After removing the JSONP parameter, the URL is pretty straightforward:

https://corpus.vocabulary.com/api/1.0/examples.json?query=unalienable&maxResults=24&startOffset=24&filter=0
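Removing a query parameter such as the JSONP one can be done by hand, but a small helper keeps it repeatable. The sketch below uses only the standard library; the parameter name `callback` and its value `jQuery123` are assumptions for illustration, since the original request's JSONP parameter is not shown here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_param(url, name):
    """Return url with every occurrence of the query parameter `name` removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != name
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# 'callback=jQuery123' is a made-up JSONP parameter, not taken from the real request.
jsonp_url = (
    "https://corpus.vocabulary.com/api/1.0/examples.json"
    "?callback=jQuery123&query=unalienable&maxResults=24&startOffset=24&filter=0"
)
print(drop_query_param(jsonp_url, "callback"))
```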

By making the minimal number of requests, your spider will be fast.
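Because each page appears to be the same URL with a different `startOffset`, one request per page may be all the spider needs. The sketch below builds those URLs; it assumes that offsets start at 0 and advance in steps of `maxResults`, which is a guess based on the URL above rather than documented API behavior. `page_urls` is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE_URL = "https://corpus.vocabulary.com/api/1.0/examples.json"


def page_urls(query, pages, page_size=24):
    """Build one URL per page by advancing startOffset in steps of page_size."""
    urls = []
    for page in range(pages):
        params = {
            "query": query,
            "maxResults": page_size,
            "startOffset": page * page_size,  # assumed pagination scheme
            "filter": 0,
        }
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


for url in page_urls("unalienable", 2):
    print(url)
```

Inside a spider, each of these URLs would be passed to `scrapy.Request` and the JSON body parsed in the callback.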

If you want to just emulate a full browser and execute the JavaScript, you can use something like Selenium or Scrapinghub's Splash (and its corresponding Scrapy plugin).

web scraping a webpage which has dynamic contents loaded via ajax

You can get the catalogId and other parameter values needed to make the POST request from the form with id="search":

<form id="search" name="search" action="http://shop.coles.com.au/online/SearchDisplay?pageView=image&catalogId=10576&beginIndex=0&langId=-1&storeId=10601" method="get" role="search">
<input type="hidden" name="storeId" value="10601" id="WC_CachedHeaderDisplay_FormInput_storeId_In_CatalogSearchForm_1">
<input type="hidden" name="catalogId" value="10576" id="WC_CachedHeaderDisplay_FormInput_catalogId_In_CatalogSearchForm_1">
<input type="hidden" name="langId" value="-1" id="WC_CachedHeaderDisplay_FormInput_langId_In_CatalogSearchForm_1">
<input type="hidden" name="beginIndex" value="0" id="WC_CachedHeaderDisplay_FormInput_beginIndex_In_CatalogSearchForm_1">
<input type="hidden" name="browseView" value="false" id="WC_CachedHeaderDisplay_FormInput_browseView_In_CatalogSearchForm_1">
<input type="hidden" name="searchSource" value="Q" id="WC_CachedHeaderDisplay_FormInput_searchSource_In_CatalogSearchForm_1">
...
</form>

Use the FormRequest to submit this form.
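To make the idea concrete, here is a standard-library sketch of collecting the hidden fields from such a form. It is only an illustration of what gets submitted, not how Scrapy does it internally; the class and function names are hypothetical, and the sample HTML is a shortened copy of the form above.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of hidden inputs inside the form with a given id."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.in_form = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == self.form_id:
            self.in_form = True
        elif tag == "input" and self.in_form and attrs.get("type") == "hidden":
            self.fields[attrs.get("name")] = attrs.get("value", "")

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


# Shortened copy of the form shown above, with the long id attributes omitted.
SAMPLE_HTML = """
<form id="search" name="search" method="get" role="search">
<input type="hidden" name="storeId" value="10601">
<input type="hidden" name="catalogId" value="10576">
<input type="hidden" name="langId" value="-1">
<input type="hidden" name="beginIndex" value="0">
</form>
"""


def extract_hidden_fields(html, form_id="search"):
    """Return a dict of the hidden input values found in the given form."""
    parser = HiddenInputCollector(form_id)
    parser.feed(html)
    return parser.fields


print(extract_hidden_fields(SAMPLE_HTML))
```

In Scrapy itself, `FormRequest.from_response(response, formid="search")` pre-populates these fields from the page, so you normally only pass the values you want to override through `formdata`.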


I'm wondering is it possible to get the response after the ajax call is finished?

Scrapy is not a browser: it does not make additional AJAX requests to load the page, and there is nothing built in to execute JavaScript. You could use a real browser and solve the problem at a higher level; look into the selenium package. There is also the related scrapy-splash project.

See also:

  • selenium with scrapy for dynamic page

How do I scrape dynamic search results page with scrapy?

The data is actually generated by an external URL: an API call made with the POST method that returns an HTML response.

import scrapy
from scrapy.crawler import CrawlerProcess


class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        # The search results come from this endpoint, which expects a POST.
        url = 'https://howlongtobeat.com/search_results?page=1'
        # Form-encoded body sent with the request.
        payload = "queryString=&t=games&sorthead=popular&sortd=0&plat=&length_type=main&length_min=&length_max=&v=&f=&g=&detail=&randomize=0"
        headers = {
            "content-type": "application/x-www-form-urlencoded",
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36"
        }

        yield scrapy.Request(url, method='POST', body=payload, headers=headers, callback=self.parse)

    def parse(self, response):
        # Each search result sits in its own details block in the returned HTML.
        cards = response.css('div[class="search_list_details"]')

        for card in cards:
            # The game name is stored in the title attribute of the link.
            game_name = card.css('a[class=text_white]::attr(title)').get()
            yield {
                "game_name": game_name
            }


if __name__ == "__main__":
    # Run the spider directly as a script, without a Scrapy project.
    process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()

Output:

{'game_name': 'Elden Ring'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Cyberpunk 2077'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Kirby and the Forgotten Land'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'LEGO Star Wars The Skywalker Saga'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Hollow Knight'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Tomb Raider'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Portal 2'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Hades'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'The Witcher 3 Wild Hunt'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Red Dead Redemption 2'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'BioShock'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Portal'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Horizon Forbidden West'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Trek to Yomi'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Grand Theft Auto V'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'God of War'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Marvels Guardians of the Galaxy'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'BioShock Infinite'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Pokmon Legends Arceus'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Horizon Zero Dawn Complete Edition'}
2022-05-12 13:37:12 [scrapy.core.engine] INFO: Closing spider (finished)
2022-05-12 13:37:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 490,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 2754,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 1.49537,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 5, 12, 7, 37, 12, 172047),
'httpcompression/response_bytes': 23986,
'httpcompression/response_count': 1,
'item_scraped_count': 20,
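The hand-written payload string in the spider above can also be built from a dict, which makes it easier to change the search term or sort order. This is a sketch under the assumption that the endpoint accepts the same fields with other values; `build_search_payload` is a hypothetical helper, and only the default output is known to match the payload used above.

```python
from urllib.parse import urlencode


def build_search_payload(query="", sort="popular"):
    """Return the form-encoded body; the defaults reproduce the spider's payload."""
    fields = {
        "queryString": query,
        "t": "games",
        "sorthead": sort,
        "sortd": 0,
        "plat": "",
        "length_type": "main",
        "length_min": "",
        "length_max": "",
        "v": "",
        "f": "",
        "g": "",
        "detail": "",
        "randomize": 0,
    }
    # urlencode keeps dict order and writes empty values as "key=".
    return urlencode(fields)


print(build_search_payload())
```

The returned string can be passed as `body=` to `scrapy.Request` in place of the literal payload.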

