Scrapy Crawl URLs in Order

start_urls defines the URLs that are used in the start_requests method. Your parse method is called with a response for each start URL once that page has been downloaded. But you cannot control loading times: the first start URL might be the last to reach parse.

One solution: override the start_requests method and attach a meta dict with a priority key to each generated request. In parse, extract this priority value and add it to the item. In the pipeline, do something based on this value. (I don't know why and where you need these URLs to be processed in this order.)
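
A minimal sketch of that idea, with made-up URLs, item fields, and a ReorderPipeline class that is purely illustrative; buffering items and sorting them in close_spider is just one way to act on the priority:

import scrapy

class OrderedSpider(scrapy.Spider):
    name = 'ordered'                      # hypothetical spider, for illustration only
    start_urls = ['http://example.com/a',
                  'http://example.com/b',
                  'http://example.com/c']

    def start_requests(self):
        # Tag each request with its position in start_urls
        for priority, url in enumerate(self.start_urls):
            yield scrapy.Request(url, meta={'priority': priority})

    def parse(self, response):
        # Copy the priority into the item so the pipeline can use it
        yield {'url': response.url,
               'title': response.css('title::text').get(),
               'priority': response.meta['priority']}


class ReorderPipeline:
    """Hypothetical pipeline: buffer the items and handle them sorted by priority."""

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def close_spider(self, spider):
        for item in sorted(self.items, key=lambda i: i['priority']):
            spider.logger.info('Processed in order: %s', item['url'])

The pipeline still has to be enabled through ITEM_PIPELINES in the project settings.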

Or make it roughly synchronous: store these start URLs somewhere, put only the first of them in start_urls, process the first response in parse and yield the item(s), then take the next URL from your storage and make a request for it with parse as the callback.
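
A rough sketch of that chained, one-at-a-time approach, with placeholder URLs and a trivial parse body:

import scrapy

class SequentialSpider(scrapy.Spider):
    name = 'sequential'                       # hypothetical example spider
    # All URLs live on the spider; only the first one is requested up front
    url_list = ['http://example.com/page1',
                'http://example.com/page2',
                'http://example.com/page3']

    def start_requests(self):
        yield scrapy.Request(self.url_list[0], meta={'index': 0})

    def parse(self, response):
        index = response.meta['index']
        # Process the current response first...
        yield {'url': response.url, 'title': response.css('title::text').get()}
        # ...then request the next URL from storage, with parse as the callback again
        if index + 1 < len(self.url_list):
            yield scrapy.Request(self.url_list[index + 1],
                                 callback=self.parse,
                                 meta={'index': index + 1})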

The order of Scrapy crawling URLs with a long start_urls list and URLs yielded from the spider

First of all, please see this thread - I think you'll find all the answers there.

the order of the urls used by downloader? Will the requests made by just_test1, just_test2 be used by downloader only after all the start_urls are used? (I have made some tests, it seems that the answer is No)

You are right, the answer is No. The behavior is completely asynchronous: when the spider starts, the start_requests method is called (source):

def start_requests(self):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)

def make_requests_from_url(self, url):
    return Request(url, dont_filter=True)

What decides the order? Why and how is this order? How can we control it?

By default, there is no pre-defined order - you cannot know when Requests from make_requests_from_url will arrive - it's asynchronous.

See this answer on how you may control the order.
Long story short, you can override start_requests and mark the yielded Requests with a priority key (like yield Request(url, meta={'priority': 0})). For example, the value of priority can be the line number where the URL was found.

Is this a good way to deal with so many urls which are already in a file? What else?

I think you should read your file and yield the URLs directly in the start_requests method: see this answer.

So, you should do something like this:

# requires: import codecs, and from scrapy import Request
def start_requests(self):
    with codecs.open(self.file_path, u"r", encoding=u"GB18030") as f:
        for index, line in enumerate(f):
            try:
                url = line.strip()
                # The line number travels along as the request's priority
                yield Request(url, meta={'priority': index})
            except:
                continue
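
As a side note, Request also accepts a priority argument that the scheduler uses when choosing which queued request to send next (requests with a higher priority value execute earlier). A variant of the snippet above could combine both, at the cost of reading the whole file up front to know how many lines there are; this is only a sketch:

import codecs
from scrapy import Request

def start_requests(self):
    with codecs.open(self.file_path, u"r", encoding=u"GB18030") as f:
        urls = [line.strip() for line in f if line.strip()]
    # Earlier lines get a higher scheduler priority, so they tend to be fetched first
    for index, url in enumerate(urls):
        yield Request(url, priority=len(urls) - index, meta={'priority': index})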

Hope that helps.

How to fix the order problem when using Scrapy?

If your aim is only to keep the correspondence between URL and title, you can add the URL to your scraped item:

def parse(self, response):
    for quote in response.css('h1.title'):
        yield {
            'Title': quote.css('h1.title::text').extract_first(),
            'url': response.url
        }

If, on the other hand, you want to process the URLs in order, there are various ways, all a bit more complex.
The most common idea is to write a start_requests method where you request only the first URL; then, in the parse method, you request the second URL, setting the same method (parse) as the callback; and so on.

See Sequential scraping from multiple start_urls leading to error in parsing and Scrapy Crawl URLs in Order

Scrapy crawl nested URLs

According to the comments you provided, the issue starts with you skipping a request in your chain.

Your start_urls will request this page: https://www.karton.eu/Faltkartons
That page will be parsed by the parse method and yield new requests, from https://www.karton.eu/Karton-weiss to https://www.karton.eu/Einwellige-Kartonagen

Those pages will be parsed in the parse_item method, but they are not the final pages you want. You need to parse the category cards in between and yield new requests, like this:

for url in response.xpath('//div[@class="cat-thumbnails"]/div/a/@href'):
    yield scrapy.Request(response.urljoin(url.get()), callback=self.new_parsing_method)

For example, parsing https://www.karton.eu/Zweiwellige-Kartons will find 9 new links, from

  • https://www.karton.eu/zweiwellig-ab-100-mm to...

  • https://www.karton.eu/zweiwellig-ab-1000-mm

Finally, you need a parsing method to scrape the items in those pages. Since there is more than one item per page, I suggest iterating over them in a for loop. (You need the proper XPath expressions to scrape the data.)

EDIT:

Re-editing, as I have now looked at the page structure and saw that my code was based on a wrong assumption: some pages don't have a subcategory page, others do.

Page structure:

ROOT: www.karton.eu/Faltkartons
|_ Einwellige Kartons
|  |_ Subcategory: Kartons ab 100 mm Länge
|  |  |_ Item List (www.karton.eu/einwellig-ab-100-mm)
|  |     |_ Item Detail (www.karton.eu/113x113x100-mm-einwellige-Kartons)
|  ...
|  |_ Subcategory: Kartons ab 1000 mm Länge
|     |_ ...
|_ Zweiwellige Kartons        # Same as above
|_ Lange Kartons              # Same as above
|_ quadratische Kartons       # There is no subcategory
|  |_ Item List (www.karton.eu/quadratische-Kartons)
|     |_ Item Detail (www.karton.eu/113x113x100-mm-einwellige-Kartons)
|_ Kartons Höhenvariabel      # There is no subcategory
|_ Kartons weiß               # There is no subcategory

The code below will scrape items from the pages with subcategories, as I think that's what you want. Either way, I left a print statement to show you which pages will be skipped because they have no subcategory page, in case you want to include them later.

import scrapy
from ..items import KartonageItem

class KartonSpider(scrapy.Spider):
    name = "kartons12"
    allowed_domains = ['karton.eu']
    start_urls = [
        'https://www.karton.eu/Faltkartons'
    ]
    custom_settings = {'FEED_EXPORT_FIELDS': ['SKU', 'Title', 'Link', 'Price', 'Delivery_Status', 'Weight', 'QTY', 'Volume']}

    def parse(self, response):
        # Category thumbnails on the root page
        url = response.xpath('//div[@class="cat-thumbnails"]')

        for a in url:
            link = a.xpath('a/@href')
            yield response.follow(url=link.get(), callback=self.parse_category_cartons)

    def parse_category_cartons(self, response):
        # Subcategory thumbnails, if the category has any
        url2 = response.xpath('//div[@class="cat-thumbnails"]')

        if not url2:
            print('Empty url2:', response.url)

        for a in url2:
            link = a.xpath('a/@href')
            yield response.follow(url=link.get(), callback=self.parse_target_page)

    def parse_target_page(self, response):
        card = response.xpath('//div[@class="text-center artikelbox"]')

        for a in card:
            items = KartonageItem()
            link = a.xpath('a/@href')
            items['SKU'] = a.xpath('.//div[@class="delivery-status"]/small/text()').get()
            items['Title'] = a.xpath('.//h5[@class="title"]/a/text()').get()
            items['Link'] = a.xpath('.//h5[@class="text-center artikelbox"]/a/@href').extract()
            items['Price'] = a.xpath('.//strong[@class="price-ger price text-nowrap"]/span/text()').get()
            items['Delivery_Status'] = a.xpath('.//div[@class="signal_image status-2"]/small/text()').get()
            yield response.follow(url=link.get(), callback=self.parse_item, meta={'items': items})

    def parse_item(self, response):
        table = response.xpath('//div[@class="product-info-inner"]')

        # items = KartonageItem()  # Not needed here, as the line below overwrites the variable anyway
        items = response.meta['items']
        items['Weight'] = response.xpath('.//span[@class="staffelpreise-small"]/text()').get()
        items['Volume'] = response.xpath('.//td[@class="icon_contenct"][7]/text()').get()
        yield items
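
Assuming a KartonageItem with the listed fields is declared in items.py, you can run the spider with the usual scrapy crawl kartons12 -o kartons.csv; FEED_EXPORT_FIELDS then fixes the order of the exported columns.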

Notes

Changed this:

    card = response.xpath('//div[@class="text-center articelbox"]')

to this: (K instead of C)

    card = response.xpath('//div[@class="text-center artikelbox"]')

Commented this out, as the items entry in meta is already a KartonageItem (you can remove the line):

def parse_item(self, response):
    table = response.xpath('//div[@class="product-info-inner"]')
    # items = KartonageItem()
    items = response.meta['items']

Changed this in the parse_item method:

    items['Weight'] = a.xpath('.//span[@class="staffelpreise-small"]/text()').get()
    items['Volume'] = a.xpath('.//td[@class="icon_contenct"][7]/text()').get()

To this:

    items['Weight'] = response.xpath('.//span[@class="staffelpreise-small"]/text()').get()
    items['Volume'] = response.xpath('.//td[@class="icon_contenct"][7]/text()').get()

As the variable a doesn't exist in that method.


