How to Pass a User Defined Argument in Scrapy Spider

How to pass a user defined argument in scrapy spider

Spider arguments are passed in the crawl command using the -a option. For example:

scrapy crawl myspider -a category=electronics -a domain=system

Spiders can access arguments as attributes:

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category='', **kwargs):
        self.start_urls = [f'http://www.example.com/{category}']  # py36
        super().__init__(**kwargs)  # python3

    def parse(self, response):
        self.log(self.domain)  # system

Taken from the Scrapy doc: http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments

Update 2013: Add second argument

Update 2015: Adjust wording

Update 2016: Use newer base class and add super, thanks @Birla

Update 2017: Use Python3 super

# previously
super(MySpider, self).__init__(**kwargs) # python2

Update 2018: As @eLRuLL points out, spiders can access arguments as attributes
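The attribute mechanism can be sketched without Scrapy at all: the base Spider's __init__ copies each -a keyword argument onto the instance. A minimal stand-in (MiniSpider is hypothetical, for illustration only):

```python
class MiniSpider:
    """Minimal stand-in mimicking how scrapy.Spider stores -a arguments."""

    def __init__(self, **kwargs):
        # scrapy.Spider.__init__ does essentially this with the -a key=value pairs
        self.__dict__.update(kwargs)

# `scrapy crawl myspider -a category=electronics -a domain=system` becomes:
spider = MiniSpider(category='electronics', domain='system')
print(spider.category)  # electronics
print(spider.domain)    # system

# getattr with a default guards against an argument that was never passed
print(getattr(spider, 'page', '1'))  # 1
```

Using getattr with a default is a cheap way to make an argument optional without writing a custom __init__.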

How to pass a user-defined argument to a scrapy Spider when running it from a script

scrapy crawl FundaMaxPagesSpider -a url='http://stackoverflow.com/'

Is equivalent to:

process.crawl(FundaMaxPagesSpider, url='http://stackoverflow.com/')

Now you just treat the arguments as described in the answer you mentioned:

def __init__(self, url='http://www.funda.nl/koop/amsterdam/'):
    self.start_urls = [url]
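The same keyword argument arrives whichever way you launch the spider. A pure-Python sketch of the pattern (FundaSketch is a hypothetical stand-in for the real spider class, so the mechanism can be shown without running Scrapy):

```python
class FundaSketch:
    """Hypothetical stand-in showing how the url keyword argument is consumed."""

    def __init__(self, url='http://www.funda.nl/koop/amsterdam/', **kwargs):
        # both `-a url=...` and `process.crawl(Spider, url=...)` land here
        self.start_urls = [url]

spider = FundaSketch(url='http://stackoverflow.com/')
print(spider.start_urls)  # ['http://stackoverflow.com/']

# with no argument, the default kicks in
print(FundaSketch().start_urls)  # ['http://www.funda.nl/koop/amsterdam/']
```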

How to pass two user-defined arguments to a scrapy spider

When providing multiple arguments, you need to prefix each one with -a.

The correct line for your case would be:

scrapy crawl funda1 -a place=rotterdam -a page=2
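Put together, a spider consuming both arguments might look like the sketch below. FundaSpiderSketch and the funda.nl URL pattern are assumptions for illustration, not the asker's actual code; note that -a values always arrive as strings:

```python
class FundaSpiderSketch:
    """Hypothetical sketch of a spider consuming two -a arguments."""

    def __init__(self, place='amsterdam', page='1', **kwargs):
        # -a values are strings, so cast page if you need an integer
        self.page = int(page)
        self.start_urls = ['http://www.funda.nl/koop/{}/p{}/'.format(place, self.page)]

# `scrapy crawl funda1 -a place=rotterdam -a page=2` would call:
spider = FundaSpiderSketch(place='rotterdam', page='2')
print(spider.start_urls)  # ['http://www.funda.nl/koop/rotterdam/p2/']
```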

passing arguments to scrapy

You should call super(companySpider, self).__init__(*args, **kwargs) at the beginning of your __init__:

def __init__(self, domains="", *args, **kwargs):
    super(companySpider, self).__init__(*args, **kwargs)
    self.domains = domains

In your case, where your first requests depend on a spider argument, what I usually do is override only the start_requests() method, without overriding __init__(). The parameter from the command line is already available as an attribute on the spider:

class companySpider(BaseSpider):
    name = "woz"
    deny_domains = [""]

    def start_requests(self):
        yield Request(self.domains)  # for example, if domains is a single URL

    def parse(self, response):
        ...

How to pass multiple arguments to Scrapy spider (getting error running 'scrapy crawl' with more than one spider is no longer supported)?

This isn't a Scrapy problem; it's how your shell interprets the input, splitting tokens on spaces. So you must not have any unquoted spaces between a key and its value. Try:

scrapy crawl dmoz -a address="40-18 48th st" -a borough="4"
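You can see the shell's splitting behaviour with Python's shlex module, which follows the same quoting rules:

```python
import shlex

# unquoted: the spaces inside the address break it into separate tokens,
# which `scrapy crawl` then misreads as extra spider names
print(shlex.split('scrapy crawl dmoz -a address=40-18 48th st'))

# quoted: the whole value stays together as a single -a token
print(shlex.split('scrapy crawl dmoz -a address="40-18 48th st" -a borough="4"'))
```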

Pass argument to scrapy spider within a python script

You need to modify your __init__() constructor to accept the date argument. Also, I would use datetime.strptime() to parse the date string:

from datetime import datetime

class MySpider(CrawlSpider):
    name = 'tw'
    allowed_domains = ['test.com']

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)

        date = kwargs.get('date')
        if not date:
            raise ValueError('No date given')

        dt = datetime.strptime(date, "%m-%d-%Y")
        self.start_urls = ['http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)]

Then, you would instantiate the spider this way:

spider = MySpider(date='01-01-2015')

Or, you can even avoid parsing the date at all, passing a datetime instance in the first place:

class MySpider(CrawlSpider):
    name = 'tw'
    allowed_domains = ['test.com']

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)

        dt = kwargs.get('dt')
        if not dt:
            raise ValueError('No date given')

        self.start_urls = ['http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)]

spider = MySpider(dt=datetime(year=2014, month=1, day=1))

And, just FYI, see this answer for a detailed example of how to run Scrapy from a script.


