How to run a Scrapy spider programmatically, like a simple script?
You can run a spider directly in a Python script, without creating a project.
You have to use scrapy.crawler.CrawlerProcess or scrapy.crawler.CrawlerRunner, but I'm not sure if they offer all the functionality available in a project.
See more in the documentation: Common Practices
Alternatively, you can put your command in a bash script on Linux or in a .bat file on Windows.
BTW: on Linux you can add a shebang as the first line (#!/bin/bash) and set the "executable" attribute, i.e. chmod +x your_script, and it will run like a normal program.
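For example, a minimal wrapper script; the spider name and output file here are placeholders:

```shell
# create a wrapper script (spider name and output path are hypothetical)
cat > run_spider.sh <<'EOF'
#!/bin/bash
scrapy crawl myspider -o output.csv
EOF

# set the "executable" attribute
chmod +x run_spider.sh

# now it can be run like a normal program: ./run_spider.sh
```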
Working example:

#!/usr/bin/env python3

import scrapy


class MySpider(scrapy.Spider):

    name = 'myspider'

    # allowed_domains holds bare domain names, without the scheme
    allowed_domains = ['quotes.toscrape.com']

    start_urls = ['http://quotes.toscrape.com/']

    #def start_requests(self):
    #    for tag in self.tags:
    #        for page in range(self.pages):
    #            url = self.url_template.format(tag, page)
    #            yield scrapy.Request(url)

    def parse(self, response):
        print('url:', response.url)


# --- it runs without a project and saves the feed in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',
    'FEED_URI': 'output.csv',
})
c.crawl(MySpider)
c.start()
How to run and save scrapy state from a python script
As the question you reference points out, you can pass settings to the CrawlerProcess instance. So all you need to do is pass the JOBDIR setting:
import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    # Your spider definition
    ...


process = CrawlerProcess({
    'JOBDIR': 'crawls/somespider-1'  # <----- Here
})
process.crawl(MySpider)
process.start()
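Running the script again with the same JOBDIR resumes the crawl from the saved state. Internally, Scrapy serializes the request queue and the spider.state dict into that directory on pause and loads them back on resume; the round trip can be mimicked with a plain dict and a file (the state file name and JSON format below are illustrative, not Scrapy's actual on-disk format):

```python
import json
import os

# JOBDIR-style persistence in miniature: save state on "pause",
# load it back on the next run so counters survive restarts
JOBDIR = 'crawls/somespider-1'
state_file = os.path.join(JOBDIR, 'state.json')  # illustrative file name

os.makedirs(JOBDIR, exist_ok=True)

# on startup: restore previously saved state, if any
state = {}
if os.path.exists(state_file):
    with open(state_file) as f:
        state = json.load(f)

# during the crawl: keep bookkeeping in the state dict
state['pages_seen'] = state.get('pages_seen', 0) + 1

# on "pause": persist the state for the next run
with open(state_file, 'w') as f:
    json.dump(state, f)
```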
Running a scrapy program from another python script
You could just select the spider class with an if statement:
import sys

import scrapy
from scrapy.crawler import CrawlerProcess

from project.spiders import Spider1, Spider2


def main():
    process = CrawlerProcess({})
    if sys.argv[1] == '1':
        spider_cls = Spider1
    elif sys.argv[1] == '2':
        spider_cls = Spider2
    else:
        print('1st argument must be either 1 or 2')
        return
    process.crawl(spider_cls)
    process.start()  # the script will block here until the crawling is finished


if __name__ == '__main__':
    main()
Confused about running Scrapy from within a Python script
The terminal prints the result because the default log level is set to DEBUG.
When you run your spider from the script and call log.start(), the default log level is set to INFO.
Just replace:

log.start()

with

log.start(loglevel=log.DEBUG)
UPD:
To get the result as a string, you can log everything to a file and then read it back, e.g.:

log.start(logfile="results.log", loglevel=log.DEBUG, crawler=crawler, logstdout=False)
reactor.run()

with open("results.log", "r") as f:
    result = f.read()
print(result)
Hope that helps.
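Note that the scrapy.log module used above belongs to old Scrapy versions; modern releases removed it and route everything through Python's standard logging module (configured via the LOG_LEVEL and LOG_FILE settings). Capturing output as a string then works like any other logging setup. A stdlib-only sketch of the same idea (the logger name and message are illustrative):

```python
import io
import logging

# route log records into an in-memory buffer instead of a file
buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter('%(levelname)s: %(message)s'))

logger = logging.getLogger('demo_spider')  # illustrative logger name
logger.setLevel(logging.DEBUG)             # same idea as loglevel=log.DEBUG
logger.addHandler(handler)

logger.debug('Crawled (200) <GET http://example.com>')

result = buf.getvalue()
print(result)
```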
Python Scrapy: How do you run your spider from a separate file?
In the Scrapy doc Common Practices you can see Run Scrapy from a script:
import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    # ... Your spider definition ...


# ... run it ...
process = CrawlerProcess(settings={ ... })
process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished
If you add your own __init__:
class MySpider(scrapy.Spider):

    def __init__(self, urls, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = urls
then you can run it with urls as a parameter:
process.crawl(MySpider, urls=['http://books.toscrape.com/', 'http://quotes.toscrape.com/'])
Run scrapy spider from script
The reason you are getting this error message is that you start the crawling process with the command scrapy crawl mySpider, which creates a new instance of ClassSpider. It does so without passing url and nbrPage.
It could work if you replaced subprocess.check_output(['scrapy crawl mySpider']) with subprocess.check_output(['scrapy', 'crawl', 'mySpider', '-a', f'url={self.start_urls}', '-a', f'nbrPage={self.pages}']); note that each token of the command line, and each -a flag, must be its own list element. Also you should make sure that start_urls is a list.
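For reference, check_output takes the command line as a list of separate tokens rather than one space-joined string. A runnable sketch, with echo standing in for scrapy so it works without a Scrapy install (the spider name and arguments are placeholders):

```python
import subprocess

# each token of the command line is its own list element;
# 'echo' stands in for 'scrapy' here so the sketch runs anywhere
args = ['echo', 'crawl', 'mySpider', '-a', 'url=http://example.com', '-a', 'nbrPage=3']
out = subprocess.check_output(args)
print(out.decode().strip())
```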
However, you would then still create two separate instances of the same spider, so I would suggest implementing run as a function taking url and nbrPage as arguments.
There are also other methods of using Scrapy and Flask in the same script. For that purpose check this question.
Run scrapy program from within python script
As the error says, quotes5 is undefined, so you need to define it before passing it to the method. Or pass the spider name as a string:

run_spider("quotes5")
Edited:

import WS_Vardata.spiders.quotes_spiders as quote_spider_module

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings


def run_spider(spiderName):
    # get the class from within the module
    spiderClass = getattr(quote_spider_module, spiderName)
    # create the object and you're good to go
    spiderObj = spiderClass()
    crawler = CrawlerRunner(get_project_settings())  # from Scrapy docs
    crawler.crawl(spiderObj)


run_spider("QuotesSpider")
This script should run in the same directory as WS_Vardata.
So in your case:

- TEST
  | the_code.py
  | WS_Vardata
    | spiders
      | quotes_spider  <= containing the QuotesSpider class
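The getattr lookup above is plain Python and can be sketched without Scrapy: given a module object, getattr resolves a class from its name string (the module and class below are stand-ins for WS_Vardata.spiders.quotes_spiders and its spider):

```python
import types

# stand-in for the WS_Vardata.spiders.quotes_spiders module
quote_spider_module = types.ModuleType('quotes_spiders')


class QuotesSpider:
    # illustrative spider class living inside the module
    name = 'quotes5'


quote_spider_module.QuotesSpider = QuotesSpider

# same lookup as in run_spider(): fetch the class by its name string
spiderClass = getattr(quote_spider_module, 'QuotesSpider')
spiderObj = spiderClass()
print(spiderObj.name)
```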