Scrapy - Reactor not Restartable
You cannot restart the reactor, but you should be able to run it multiple times by forking a separate process for each crawl:
import scrapy
import scrapy.crawler as crawler
from scrapy.utils.log import configure_logging
from multiprocessing import Process, Queue
from twisted.internet import reactor

# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())

# the wrapper to make it run more times
def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result
Run it twice:
configure_logging()
print('first run:')
run_spider(QuotesSpider)
print('\nsecond run:')
run_spider(QuotesSpider)
Result:
first run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...
second run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...
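The same fresh-process trick works for any resource that cannot be restarted in-process. Here is a stdlib-only sketch of the wrapper pattern, with no Scrapy involved; crawl_once is a hypothetical stand-in for "start the reactor, crawl, stop the reactor":

```python
from multiprocessing import get_context

def _child(q, func, args):
    # Run func in the child; ship back either the result or the exception.
    try:
        q.put((True, func(*args)))
    except Exception as e:
        q.put((False, e))

def run_in_subprocess(func, *args):
    """Run func(*args) in a fresh process; return its result or re-raise."""
    ctx = get_context("fork")  # fork keeps this sketch simple; Linux/macOS only
    q = ctx.Queue()
    p = ctx.Process(target=_child, args=(q, func, args))
    p.start()
    ok, value = q.get()
    p.join()
    if ok:
        return value
    raise value

def crawl_once(tag):
    # Hypothetical stand-in for code that starts and stops a Twisted reactor.
    return "crawled " + tag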
Reactor not restartable while running multiple spiders
You need to follow the sequential execution example in the documentation:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    yield runner.crawl('urlClothes_spider')
    yield runner.crawl('clothes_spider')
    yield runner.crawl('finalClothes_spider')
    reactor.stop()

crawl()
reactor.run()
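The inlineCallbacks chain above waits for each crawl to finish before starting the next. The same sequencing idea can be shown with plain asyncio (hypothetical crawl coroutine, not the Scrapy API):

```python
import asyncio

log = []

async def crawl(name):
    # Stand-in for runner.crawl(name): just record the order of execution.
    await asyncio.sleep(0)
    log.append(name)

async def main():
    # Each await completes before the next crawl starts -- the same
    # sequencing that @defer.inlineCallbacks + yield gives in Twisted.
    await crawl('urlClothes_spider')
    await crawl('clothes_spider')
    await crawl('finalClothes_spider')

asyncio.run(main())
```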
Scrapy raises ReactorNotRestartable when CrawlerProcess is run twice
If you use CrawlerRunner instead of CrawlerProcess, in conjunction with pytest-twisted, you should be able to run your tests like this:
Install the Twisted integration for pytest: pip install pytest-twisted
from scrapy.crawler import CrawlerRunner

def _run_crawler(spider_cls, settings):
    """
    spider_cls: Scrapy Spider class
    settings: Scrapy settings
    returns: Twisted Deferred
    """
    runner = CrawlerRunner(settings)
    return runner.crawl(spider_cls)  # return Deferred

def test_scrapy_crawler():
    deferred = _run_crawler(MySpider, settings)

    @deferred.addCallback
    def _success(results):
        """
        After the crawler completes, this function will execute.
        Do your assertions in this function.
        """

    @deferred.addErrback
    def _error(failure):
        raise failure.value

    return deferred
To put it plainly, _run_crawler() schedules a crawl in the Twisted reactor and executes callbacks when the scrape completes. Those callbacks (_success() and _error()) are where you do your assertions. Finally, you must return the Deferred object from the test function so that the test waits until the crawl is complete. Returning the Deferred is essential and must be done for every test.
Here's an example of how to run multiple crawls and aggregate the results using gatherResults.
from twisted.internet import defer

def test_multiple_crawls():
    d1 = _run_crawler(Spider1, settings)
    d2 = _run_crawler(Spider2, settings)
    d_list = defer.gatherResults([d1, d2])

    @d_list.addCallback
    def _success(results):
        assert True

    @d_list.addErrback
    def _error(failure):
        assert False

    return d_list
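gatherResults is Twisted's fan-in primitive; outside of Twisted the same aggregation pattern looks like asyncio.gather. A minimal sketch with a hypothetical crawl coroutine (not the Scrapy API):

```python
import asyncio

async def crawl(name):
    # Stand-in for _run_crawler(): produce one result per "spider".
    await asyncio.sleep(0)
    return f"{name}: done"

async def run_all():
    # Wait for all concurrent crawls and collect their results in order,
    # like defer.gatherResults([d1, d2]).
    return await asyncio.gather(crawl("Spider1"), crawl("Spider2"))

results = asyncio.run(run_all())
```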
I hope this helps, if it doesn't please ask where you're struggling.
ReactorNotRestartable with scrapy when using Google Cloud Functions
Out of the box, the asynchronous nature of scrapy does not work well with Cloud Functions: we need a way to block on the crawl, so that the function does not return early and the instance is not killed before the crawl finishes.
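The blocking requirement can be seen in isolation with a plain asyncio sketch (hypothetical fake_crawl, no Scrapy involved): the handler blocks on asyncio.run until the work is done, which is the behavior we need an equivalent of for a Twisted-based crawl.

```python
import asyncio

async def fake_crawl(url):
    # Stand-in for an asynchronous crawl that eventually yields a result.
    await asyncio.sleep(0)
    return {"url": url}

def handler(data=None, context=None):
    # Block until the async work completes, so a serverless runtime
    # does not tear the instance down mid-crawl.
    return asyncio.run(fake_crawl("http://example.com/"))
```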
Instead, we can use scrapydo to run your existing spider in a blocking fashion:
requirements.txt:
scrapydo
main.py:
import scrapy
import scrapydo

scrapydo.setup()

class MyItem(scrapy.Item):
    url = scrapy.Field()

class MySpider(scrapy.Spider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    def parse(self, response):
        yield MyItem(url=response.url)

def run_single_crawl(data, context):
    results = scrapydo.run_spider(MySpider)
This also shows a simple example of how to yield one or more scrapy.Item from the spider and collect the results from the crawl, which would be challenging to do without scrapydo.
Also: make sure you have billing enabled for your project. Without billing enabled, Cloud Functions cannot make outbound requests, so the crawler will appear to succeed but return no results.