Scrapy Crawl URLs in Order
start_urls defines the URLs that are used in the start_requests method. Your parse method is called with a response for each start URL once its page is downloaded. But you cannot control loading times: the first start URL might be the last one to reach parse.
One solution: override the start_requests method and add a meta dict with a priority key to each generated request. In parse, extract this priority value and add it to the item. In the pipeline, do something based on this value. (I don't know why and where you need these URLs to be processed in this order.)
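As a sketch of what that pipeline step could look like (the class name and the export step are hypothetical; only the 'priority' key comes from the approach above), a minimal item pipeline that buffers items during the crawl and restores the original order when the spider closes:

```python
# Hypothetical pipeline: buffers every item and, once the crawl is done,
# sorts them by the 'priority' value that parse() copied into each item.
class OrderedExportPipeline:
    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)  # buffer instead of exporting immediately
        return item

    def close_spider(self, spider):
        # Crawl finished: restore the original URL order before exporting.
        self.items.sort(key=lambda i: i['priority'])
        # ... write self.items to a file/DB here ...
```

Note the trade-off: every item is held in memory until the crawl ends, so this only makes sense when the full result set fits in RAM.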
Or make it kind of synchronous: store these start URLs somewhere and put only the first of them in start_urls. In parse, process the first response and yield the item(s), then take the next URL from your storage and make a request for it with parse as the callback.
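Stripped of Scrapy itself, that chaining idea is just "process a response, then fetch the next stored URL". A minimal pure-Python model of the control flow (the fetch callable stands in for Scrapy's downloader; all names are illustrative):

```python
from collections import deque

def crawl_in_order(start_urls, fetch):
    """Model of the 'synchronous' approach: request one URL, process
    the response, then request the next URL from storage, and so on.
    `fetch` stands in for the downloader (hypothetical)."""
    pending = deque(start_urls)  # the "storage" holding the remaining URLs
    items = []
    while pending:
        response = fetch(pending.popleft())  # request the next URL
        items.append(response)               # parse: yield the item(s)
    return items
```

In a real spider the loop body becomes a parse callback that yields its items and then yields a new Request for pending.popleft() with itself as callback. The cost is that you lose all download concurrency: each request waits for the previous one.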
The order of Scrapy crawling URLs with a long start_urls list and URLs yielded from the spider
First of all, please see this thread - I think you'll find all the answers there.
What is the order of the URLs used by the downloader? Will the requests made by just_test1, just_test2 be used by the downloader only after all the start_urls are used? (I have made some tests, and it seems that the answer is No.)
You are right, the answer is No. The behavior is completely asynchronous: when the spider starts, the start_requests method is called (source):
def start_requests(self):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)

def make_requests_from_url(self, url):
    return Request(url, dont_filter=True)
What decides the order? Why is it in this order, and how can we control it?
By default, there is no pre-defined order - you cannot know when the Requests from make_requests_from_url will arrive - it's asynchronous.
See this answer on how you may control the order.
Long story short, you can override start_requests and mark the yielded Requests with a priority key in meta (like yield Request(url, meta={'priority': 0})). For example, the value of priority can be the line number where the URL was found.
Is this a good way to deal with so many URLs which are already in a file? What else?
I think you should read your file and yield the URLs directly in the start_requests method: see this answer. So, you should do something like this:
import codecs
from scrapy import Request

def start_requests(self):
    with codecs.open(self.file_path, "r", encoding="GB18030") as f:
        for index, line in enumerate(f):
            url = line.strip()
            if not url:  # skip blank lines instead of a bare except
                continue
            yield Request(url, meta={'priority': index})
Hope that helps.
How to fix the order problem when using scrapy?
If your aim is only to keep the correspondence between URL and title, you can add the URL to your scraped item:
def parse(self, response):
    for quote in response.css('h1.title'):
        yield {
            # select relative to the already-matched h1.title element
            'Title': quote.css('::text').extract_first(),
            'url': response.url,
        }
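If the order only matters after the crawl, recording the 'url' field also lets you restore it in post-processing. A small sketch (the function name is illustrative; it assumes every item's 'url' appears in your original URL list):

```python
def reorder(items, start_urls):
    # Map each URL to its position in the original start_urls list,
    # then sort the scraped items by that position.
    position = {url: i for i, url in enumerate(start_urls)}
    return sorted(items, key=lambda item: position[item['url']])
```

This way the crawl itself stays fully concurrent, and ordering becomes a cheap one-off sort at the end.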
If instead you want to process the URLs in order, there are various ways, all a bit more complex.
The most common idea is to write a start_requests method where you request only the first URL; then, in the parse method, you request the second URL, setting the same method (parse) as the callback; and so on.
See Sequential scraping from multiple start_urls leading to error in parsing and Scrapy Crawl URLs in Order
Scrapy crawl nested urls
According to the comments you provided, the issue starts with you skipping a request in your chain.
Your start_urls will request this page: https://www.karton.eu/Faltkartons
That page will be parsed by the parse method and will yield new requests, from https://www.karton.eu/Karton-weiss to https://www.karton.eu/Einwellige-Kartonagen
Those pages will be parsed in the parse_item method, but they are not the final pages you want. You need to parse the category cards and yield new requests, like this:
for url in response.xpath('//div[@class="cat-thumbnails"]/div/a/@href'):
    yield scrapy.Request(response.urljoin(url.get()), callback=self.new_parsing_method)
For example, parsing https://www.karton.eu/Zweiwellige-Kartons will find 9 new links, from https://www.karton.eu/zweiwellig-ab-100-mm to https://www.karton.eu/zweiwellig-ab-1000-mm
Finally you need a parsing method to scrape the items in those pages. Since there is more than one item per page, I suggest you run them in a for loop. (You need the proper XPath to scrape the data.)
EDIT:
Re-editing now that I have observed the page structure and seen that my code was based on a wrong assumption: some pages don't have a subcategory page, while others do.
Page structure:
ROOT: www.karton.eu/Faltkartons
|_ Einwellige Kartons
|_ Subcategory: Kartons ab 100 mm Länge
|_ Item List (www.karton.eu/einwellig-ab-100-mm)
|_ Item Detail (www.karton.eu/113x113x100-mm-einwellige-Kartons)
...
|_ Subcategory: Kartons ab 1000 mm Länge
|_ ...
|_ Zweiwellige Kartons #Same as above
|_ Lange Kartons #Same as above
|_ quadratische Kartons #There is no subcategory
|_ Item List (www.karton.eu/quadratische-Kartons)
|_ Item Detail (www.karton.eu/113x113x100-mm-einwellige-Kartons)
|_ Kartons Höhenvariabel #There is no subcategory
|_ Kartons weiß #There is no subcategory
The code below will scrape items from the pages that have subcategories, as I think that's what you want. Either way, I left print statements to show you the pages that will be skipped for having no subcategory page, in case you want to include them later.
import scrapy
from ..items import KartonageItem

class KartonSpider(scrapy.Spider):
    name = "kartons12"
    allowed_domains = ['karton.eu']
    start_urls = [
        'https://www.karton.eu/Faltkartons'
    ]
    custom_settings = {'FEED_EXPORT_FIELDS': ['SKU', 'Title', 'Link', 'Price', 'Delivery_Status', 'Weight', 'QTY', 'Volume']}

    def parse(self, response):
        url = response.xpath('//div[@class="cat-thumbnails"]')
        for a in url:
            link = a.xpath('a/@href')
            yield response.follow(url=link.get(), callback=self.parse_category_cartons)

    def parse_category_cartons(self, response):
        url2 = response.xpath('//div[@class="cat-thumbnails"]')
        if not url2:
            print('Empty url2:', response.url)
        for a in url2:
            link = a.xpath('a/@href')
            yield response.follow(url=link.get(), callback=self.parse_target_page)

    def parse_target_page(self, response):
        card = response.xpath('//div[@class="text-center artikelbox"]')
        for a in card:
            items = KartonageItem()
            link = a.xpath('a/@href')
            items['SKU'] = a.xpath('.//div[@class="delivery-status"]/small/text()').get()
            items['Title'] = a.xpath('.//h5[@class="title"]/a/text()').get()
            items['Link'] = a.xpath('.//h5[@class="text-center artikelbox"]/a/@href').extract()
            items['Price'] = a.xpath('.//strong[@class="price-ger price text-nowrap"]/span/text()').get()
            items['Delivery_Status'] = a.xpath('.//div[@class="signal_image status-2"]/small/text()').get()
            yield response.follow(url=link.get(), callback=self.parse_item, meta={'items': items})

    def parse_item(self, response):
        table = response.xpath('//div[@class="product-info-inner"]')
        # items = KartonageItem()  # Not needed here; the line below overwrites the variable.
        items = response.meta['items']
        items['Weight'] = response.xpath('.//span[@class="staffelpreise-small"]/text()').get()
        items['Volume'] = response.xpath('.//td[@class="icon_contenct"][7]/text()').get()
        yield items
Notes
Changed this:
card = response.xpath('//div[@class="text-center articelbox"]')
to this: (K instead of C)
card = response.xpath('//div[@class="text-center artikelbox"]')
Commented this out, as the item in meta is already a KartonageItem. (You can remove the line.)
def parse_item(self, response):
    table = response.xpath('//div[@class="product-info-inner"]')
    # items = KartonageItem()
    items = response.meta['items']
Changed this in the parse_item method:
items['Weight'] = a.xpath('.//span[@class="staffelpreise-small"]/text()').get()
items['Volume'] = a.xpath('.//td[@class="icon_contenct"][7]/text()').get()
To this:
items['Weight'] = response.xpath('.//span[@class="staffelpreise-small"]/text()').get()
items['Volume'] = response.xpath('.//td[@class="icon_contenct"][7]/text()').get()
As a doesn't exist in that method (there is no loop variable a in parse_item).