Scrapy - Reactor Not Restartable

You cannot restart the reactor, but you can run it as many times as you need by forking a separate process for each crawl:

import scrapy
import scrapy.crawler as crawler
from scrapy.utils.log import configure_logging
from multiprocessing import Process, Queue
from twisted.internet import reactor

# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())

# the wrapper to make it run more times
def run_spider(spider):
    def f(q):
        try:
            # the reactor starts and stops inside the child process,
            # so every call to run_spider() gets a fresh reactor
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    # re-raise any exception raised inside the child process
    if result is not None:
        raise result

Run it twice:

configure_logging()

print('first run:')
run_spider(QuotesSpider)

print('\nsecond run:')
run_spider(QuotesSpider)

Result:

first run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...

second run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...

Reactor not restartable while running multiple spiders

You need to follow the sequential execution example in the documentation:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    yield runner.crawl('urlClothes_spider')
    yield runner.crawl('clothes_spider')
    yield runner.crawl('finalClothes_spider')
    reactor.stop()

crawl()
reactor.run()
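
If you have more than a handful of spiders, the same pattern works with a loop inside the inlineCallbacks function. A minimal sketch, reusing the placeholder spider names from the example above (replace them with the spiders registered in your project):

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())

# placeholder names; replace with your own spiders
SPIDER_NAMES = ['urlClothes_spider', 'clothes_spider', 'finalClothes_spider']

@defer.inlineCallbacks
def crawl():
    # each yield waits for the previous crawl to finish before starting the next
    for name in SPIDER_NAMES:
        yield runner.crawl(name)
    reactor.stop()

crawl()
reactor.run()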

Scrapy raises ReactorNotRestartable when CrawlerProcess is run twice

If you use CrawlerRunner instead of CrawlerProcess in conjunction with pytest-twisted, you should be able to run your tests like this:

Install the Twisted integration for pytest: pip install pytest-twisted

from scrapy.crawler import CrawlerRunner

def _run_crawler(spider_cls, settings):
    """
    spider_cls: Scrapy Spider class
    settings: Scrapy settings
    returns: Twisted Deferred
    """
    runner = CrawlerRunner(settings)
    return runner.crawl(spider_cls)  # return Deferred

def test_scrapy_crawler():
    deferred = _run_crawler(MySpider, settings)

    @deferred.addCallback
    def _success(results):
        """
        After crawler completes, this function will execute.
        Do your assertions in this function.
        """

    @deferred.addErrback
    def _error(failure):
        raise failure.value

    return deferred

To put it plainly, _run_crawler() schedules a crawl in the Twisted reactor and executes the callbacks when the scrape completes. Those callbacks (_success() and _error()) are where you do your assertions. Lastly, you have to return the Deferred object from the test function so that the test waits until the crawl is complete. Returning the Deferred is essential and must be done in every test.
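
For concreteness, here is one way to get real data into _success(): connect to Scrapy's item_scraped signal, collect the items into a list, and assert on that list once the crawl finishes. This is only a sketch that reuses MySpider and settings from the example above; the assertion itself is a placeholder for whatever you actually want to check:

from scrapy import signals
from scrapy.crawler import CrawlerRunner

def test_items_scraped():
    items = []

    runner = CrawlerRunner(settings)
    # create_crawler() returns a Crawler object, which lets us hook its signals
    crawler = runner.create_crawler(MySpider)
    crawler.signals.connect(
        lambda item, response, spider: items.append(item),
        signal=signals.item_scraped,
    )
    deferred = runner.crawl(crawler)

    @deferred.addCallback
    def _success(_):
        # the crawl Deferred fires with None, so assert on the collected items
        assert len(items) > 0

    @deferred.addErrback
    def _error(failure):
        raise failure.value

    return deferred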

Here's an example of how to run multiple crawls and aggregate results using gatherResults.

from twisted.internet import defer

def test_multiple_crawls():
    d1 = _run_crawler(Spider1, settings)
    d2 = _run_crawler(Spider2, settings)

    d_list = defer.gatherResults([d1, d2])

    @d_list.addCallback
    def _success(results):
        assert True

    @d_list.addErrback
    def _error(failure):
        assert False

    return d_list

I hope this helps; if it doesn't, please ask where you're struggling.

ReactorNotRestartable with scrapy when using Google Cloud Functions

Scrapy's asynchronous nature does not play well with Cloud Functions out of the box: we need a way to block on the crawl, otherwise the function returns early and the instance is killed before the crawl finishes.

Instead, we can use scrapydo to run your existing spider in a blocking fashion:

requirements.txt:

scrapydo

main.py:

import scrapy
import scrapydo

scrapydo.setup()

class MyItem(scrapy.Item):
    url = scrapy.Field()

class MySpider(scrapy.Spider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    def parse(self, response):
        yield MyItem(url=response.url)

def run_single_crawl(data, context):
    results = scrapydo.run_spider(MySpider)

This also shows a simple example of how to yield one or more scrapy.Item from the spider and collect the results from the crawl, which would also be challenging to do if not using scrapydo.
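As a follow-up, you can do something with those results directly in the function. A minimal sketch extending the main.py above, assuming scrapydo.run_spider() returns the list of scraped items by default (worth verifying against the scrapydo docs for your version):

import json

def run_single_crawl(data, context):
    # blocks until the crawl finishes; by default scrapydo returns the
    # items the spider yielded (assumption noted in the text above)
    results = scrapydo.run_spider(MySpider)
    print('Scraped %d items' % len(results))
    # scrapy.Item supports dict() conversion, handy for logging or storage
    print(json.dumps([dict(item) for item in results]))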

Also: make sure you have billing enabled for your project. Without billing, Cloud Functions cannot make outbound requests to external hosts, so the crawl will appear to succeed but return no results.


