How to pass a user defined argument in scrapy spider
Spider arguments are passed in the crawl command using the -a option. For example:
scrapy crawl myspider -a category=electronics -a domain=system
Spiders can access arguments as attributes:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category='', **kwargs):
        self.start_urls = [f'http://www.example.com/{category}']  # py36
        super().__init__(**kwargs)  # python3

    def parse(self, response):
        self.log(self.domain)  # system
Taken from the Scrapy doc: http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments
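To make the mechanism concrete, here is a minimal sketch of what happens to the -a pairs, using only the standard library. FakeSpider and arglist_to_kwargs are hypothetical stand-ins, not Scrapy's real classes or functions. The sketch assumes, as the answer describes, that the base spider stores every keyword argument it receives as an instance attribute.

```python
import shlex


def arglist_to_kwargs(argv):
    """Collect every `-a key=value` pair from an argv list into a dict."""
    kwargs = {}
    tokens = iter(argv)
    for token in tokens:
        if token == "-a":
            pair = next(tokens)
            key, value = pair.split("=", 1)  # split on the first '=' only
            kwargs[key] = value
    return kwargs


class FakeSpider:
    """Hypothetical stand-in for scrapy.Spider (an assumption, not the real class)."""

    def __init__(self, **kwargs):
        # Every remaining keyword argument becomes an attribute.
        self.__dict__.update(kwargs)


class MySpider(FakeSpider):
    def __init__(self, category="", **kwargs):
        self.start_urls = [f"http://www.example.com/{category}"]
        super().__init__(**kwargs)


argv = shlex.split("scrapy crawl myspider -a category=electronics -a domain=system")
spider = MySpider(**arglist_to_kwargs(argv))
print(spider.start_urls)
print(spider.domain)
```

In this sketch category is consumed by __init__ and never reaches the base class, so only domain ends up as an attribute.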
Update 2013: Add second argument
Update 2015: Adjust wording
Update 2016: Use newer base class and add super, thanks @Birla
Update 2017: Use Python3 super
# previously
super(MySpider, self).__init__(**kwargs) # python2
Update 2018: As @eLRuLL points out, spiders can access arguments as attributes
How to pass a user-defined argument to a scrapy Spider when running it from a script
scrapy crawl FundaMaxPagesSpider -a url='http://stackoverflow.com/'
Is equivalent to:
process.crawl(FundaMaxPagesSpider, url='http://stackoverflow.com/')
Now you just treat the arguments as described in the answer you mentioned:
def __init__(self, url='http://www.funda.nl/koop/amsterdam/', **kwargs):
    super().__init__(**kwargs)  # let the base Spider finish its own setup
    self.start_urls = [url]
How to pass two user-defined arguments to a scrapy spider
When providing multiple arguments, you need to prefix every argument with -a.
The correct line for your case would be:
scrapy crawl funda1 -a place=rotterdam -a page=2
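Keep in mind that values passed with -a reach the spider as strings, so a numeric argument such as page needs converting before use. The helper below is only a sketch: build_start_url and the /p<page>/ URL layout are assumptions made for illustration, not something taken from the question.

```python
def build_start_url(place, page="1"):
    """Build a start URL from two spider arguments given as strings."""
    page_number = int(page)  # raises ValueError for input such as "two"
    if page_number < 1:
        raise ValueError("page must be 1 or greater")
    # Hypothetical URL layout, used only for illustration.
    return f"http://www.funda.nl/koop/{place}/p{page_number}/"


print(build_start_url("rotterdam", "2"))
```

A spider's __init__ could call such a helper with the place and page values it receives and assign the result to start_urls.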
passing arguments to scrapy
You should call super(companySpider, self).__init__(*args, **kwargs) at the beginning of your __init__.
def __init__(self, domains="", *args, **kwargs):
    super(companySpider, self).__init__(*args, **kwargs)
    self.domains = domains
In your case, where your first requests depend on a spider argument, what I usually do is override only the start_requests() method, without overriding __init__(). The parameter name from the command line is already available as an attribute of the spider:
from scrapy import Request, Spider

class companySpider(Spider):  # called BaseSpider in old Scrapy releases
    name = "woz"
    deny_domains = [""]

    def start_requests(self):
        yield Request(self.domains)  # for example if domains is a single URL

    def parse(self, response):
        ...
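One caveat when relying on the attribute alone: if the argument is left out on the command line, the attribute does not exist and reading self.domains raises AttributeError. A common guard is getattr with a default. The sketch below uses types.SimpleNamespace as a stand-in for a spider instance; first_url and the fallback URL are hypothetical names chosen for illustration.

```python
from types import SimpleNamespace


def first_url(spider, fallback="http://www.example.com/"):
    """Return the URL passed as the `domains` argument, or a fallback."""
    return getattr(spider, "domains", None) or fallback


with_arg = SimpleNamespace(domains="http://www.example.org/")  # -a domains=... given
without_arg = SimpleNamespace()                                # argument omitted

print(first_url(with_arg))
print(first_url(without_arg))
```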
How to pass multiple arguments to Scrapy spider (getting error running 'scrapy crawl' with more than one spider is no longer supported)?
This is not a scrapy problem, I guess. It's how your shell interprets input, splitting tokens on spaces. So, you must not have any spaces between the key and its value. Try with:
scrapy crawl dmoz -a address="40-18 48th st" -a borough="4"
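The splitting can be reproduced with the standard library's shlex module, which tokenizes a string roughly the way a POSIX shell does. In the sketch below, the "spaced" command line is an assumed reconstruction of the failing call (a space after address=), not a quote from the question.

```python
import shlex

# Assumed failing form: a space between the key and its value.
spaced = 'scrapy crawl dmoz -a address= "40-18 48th st" -a borough="4"'
# Form recommended above: key, '=' and quoted value written together.
joined = 'scrapy crawl dmoz -a address="40-18 48th st" -a borough="4"'

spaced_tokens = shlex.split(spaced)
joined_tokens = shlex.split(joined)

print(spaced_tokens)  # the address value is a separate token
print(joined_tokens)  # 'address=40-18 48th st' stays one token
```

In the first form the address value is split off as its own token. It is presumably then read as an extra positional argument, which would explain the "more than one spider" error message.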
Pass argument to scrapy spider within a python script
You need to modify your __init__() constructor to accept the date argument. Also, I would use datetime.strptime() to parse the date string:
from datetime import datetime

from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'tw'
    allowed_domains = ['test.com']

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        date = kwargs.get('date')
        if not date:
            raise ValueError('No date given')
        dt = datetime.strptime(date, "%m-%d-%Y")
        self.start_urls = ['http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)]
Then, you would instantiate the spider this way:
spider = MySpider(date='01-01-2015')
Or, you can even avoid parsing the date at all by passing a datetime instance in the first place:
class MySpider(CrawlSpider):
    name = 'tw'
    allowed_domains = ['test.com']

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        dt = kwargs.get('dt')
        if not dt:
            raise ValueError('No date given')
        self.start_urls = ['http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)]

# Leading zeros such as month=01 are a syntax error in Python 3.
spider = MySpider(dt=datetime(year=2014, month=1, day=1))
And, just FYI, see this answer as a detailed example about how to run Scrapy from script.
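One detail worth checking in either variant: '{dt.year}-{dt.month}-{dt.day}' does not zero-pad, so January 1st, 2015 becomes 2015-1-1. Whether the target site expects padding is not stated in the question. If it does, strftime is one option. The helper below is a sketch, and start_url_for is a hypothetical name.

```python
from datetime import datetime


def start_url_for(date_string):
    """Turn an 'MM-DD-YYYY' argument into a start URL with a zero-padded date."""
    dt = datetime.strptime(date_string, "%m-%d-%Y")  # raises ValueError on bad input
    return "http://test.com/" + dt.strftime("%Y-%m-%d")


print(start_url_for("01-01-2015"))  # http://test.com/2015-01-01
```

Parsing is still needed when the spider is started from the command line, because the date then arrives as a string. A ready-made datetime instance can only be passed when the spider is run from a script.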