How to Run Scrapy from Within a Python Script

How do you run a Scrapy spider programmatically, like a simple script?

You can run a spider directly from a Python script without creating a project.

You have to use scrapy.crawler.CrawlerProcess or scrapy.crawler.CrawlerRunner,
though I'm not sure whether they offer all the functionality you get inside a project.

See more in the documentation: Common Practices
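For CrawlerRunner you have to manage the Twisted reactor yourself. Here is a minimal sketch in the spirit of the Common Practices page (it assumes a spider class MySpider like the one defined in the working example below):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()                   # CrawlerRunner does not set up logging for you
runner = CrawlerRunner()

d = runner.crawl(MySpider)            # MySpider: your spider class
d.addBoth(lambda _: reactor.stop())   # stop the reactor when the crawl finishes
reactor.run()                         # blocks here until the crawl is done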

Or you can put your command in a bash script on Linux, or in a .bat file on Windows.

BTW: on Linux you can add a shebang as the first line (#!/bin/bash) and set the "executable" attribute -

i.e. chmod +x your_script - and it will run like a normal program.
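For example, a minimal wrapper script of that kind (the spider name myspider and the output file are placeholders):

#!/bin/bash
# run the spider and export the scraped items to output.csv
scrapy crawl myspider -o output.csv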


Working example

#!/usr/bin/env python3

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    allowed_domains = ['quotes.toscrape.com']  # domains only, not full URLs

    start_urls = ['http://quotes.toscrape.com']

    #def start_requests(self):
    #    for tag in self.tags:
    #        for page in range(self.pages):
    #            url = self.url_template.format(tag, page)
    #            yield scrapy.Request(url)

    def parse(self, response):
        print('url:', response.url)
        yield {'url': response.url}  # yielded items end up in output.csv

# --- it runs without a project and saves the items in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',
    'FEED_URI': 'output.csv',
})
c.crawl(MySpider)
c.start()
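Note: on newer Scrapy releases (2.1+) the FEED_FORMAT/FEED_URI pair is deprecated in favour of the single FEEDS setting; the equivalent there would be:

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEEDS': {'output.csv': {'format': 'csv'}},
})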

How to run and save scrapy state from a python script

As your reference question points out, you can pass settings to the CrawlerProcess instance.

So all you need to do is pass the JOBDIR setting:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'JOBDIR': 'crawls/somespider-1'  # <----- Here
})

process.crawl(MySpider)
process.start()
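With that in place you can stop the crawl gracefully (press Ctrl-C once) and Scrapy persists the scheduled requests and the dupefilter state to crawls/somespider-1; running the same script again picks up where it left off.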

Running a scrapy program from another python script

You could just select the spider class with an if statement:

import sys

import scrapy
from scrapy.crawler import CrawlerProcess

from project.spiders import Spider1, Spider2

def main():
    process = CrawlerProcess({})

    if sys.argv[1] == '1':
        spider_cls = Spider1
    elif sys.argv[1] == '2':
        spider_cls = Spider2
    else:
        print('1st argument must be either 1 or 2')
        return

    process.crawl(spider_cls)
    process.start()  # the script will block here until the crawling is finished

if __name__ == '__main__':
    main()
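You would then invoke it as, e.g., python your_script.py 1 to run Spider1 (the script and module names here are whatever your project uses).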

Confused about running Scrapy from within a Python script

The terminal prints the result because the default log level there is DEBUG.

When you run your spider from a script and call log.start(), the default log level is set to INFO.

Just replace:

log.start()

with

log.start(loglevel=log.DEBUG)

UPD:

To get the result as a string, you can log everything to a file and then read it back, e.g.:

log.start(logfile="results.log", loglevel=log.DEBUG, crawler=crawler, logstdout=False)

reactor.run()

with open("results.log", "r") as f:
    result = f.read()
print(result)
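Note that the old scrapy.log module (including log.start()) has been removed from current Scrapy releases; there you would get the same effect with the LOG_LEVEL and LOG_FILE settings, e.g. CrawlerProcess({'LOG_LEVEL': 'DEBUG', 'LOG_FILE': 'results.log'}).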

Hope that helps.

Python Scrapy: How do you run your spider from a separate file?

In the Scrapy documentation, Common Practices shows how to Run Scrapy from a script:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # ... Your spider definition ...
    ...

# ... run it ...

process = CrawlerProcess(settings={ ... })
process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished

If you add your own __init__:

class MySpider(scrapy.Spider):

    def __init__(self, urls, *args, **kwargs):
        super().__init__(*args, **kwargs)

        self.start_urls = urls

then you could run it with urls as a parameter:

process.crawl(MySpider, urls=['http://books.toscrape.com/', 'http://quotes.toscrape.com/'])
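If you want to pass the same urls from the command line instead, note that -a delivers every argument as a string, so a list has to be encoded and split yourself; a sketch (the comma convention is my assumption, not part of the original answer):

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, urls='', *args, **kwargs):
        super().__init__(*args, **kwargs)
        # -a urls=http://a.com,http://b.com arrives as a single string
        self.start_urls = urls.split(',') if urls else []

which you would invoke as scrapy crawl myspider -a urls=http://books.toscrape.com/,http://quotes.toscrape.com/.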

Run scrapy spider from script

The reason you are getting this error message is that you start the crawling process with the command scrapy crawl mySpider, which creates a new instance of ClassSpider. It does so without passing url and nbrPage.

It could work if you replaced subprocess.check_output(['scrapy crawl mySpider']) with subprocess.check_output(f'scrapy crawl mySpider -a url={self.start_urls} -a nbrPage={self.pages}', shell=True) - each spider argument needs its own -a flag, and a single command string needs shell=True (or has to be split into a list of arguments). Also, you should make sure that start_urls is a list.

However, you would then still create two separate instances of the same spider, so I would suggest implementing run as a function that takes url and nbrPage as arguments.
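A minimal sketch of that suggestion - the spider body below is invented for illustration; only the name ClassSpider and the url/nbrPage arguments come from the question:

import scrapy
from scrapy.crawler import CrawlerProcess

class ClassSpider(scrapy.Spider):
    name = 'mySpider'

    def __init__(self, url=None, nbrPage=1, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # accept either a single URL or a list of URLs
        self.start_urls = [url] if isinstance(url, str) else list(url or [])
        self.pages = int(nbrPage)

    def parse(self, response):
        yield {'url': response.url}

def run(url, nbrPage):
    # arguments are forwarded to the spider's __init__,
    # exactly like -a options on the command line
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.crawl(ClassSpider, url=url, nbrPage=nbrPage)
    process.start()  # blocks until the crawl is finished

if __name__ == '__main__':
    run(['http://quotes.toscrape.com/'], 2)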

There are also other methods of using Scrapy and Flask in the same script. For that purpose, check this question.

Run scrapy program from within python script

As the error says, quote5 is undefined, so you need to define quote5 before passing it to the method - or pass the name as a string:

run_spider("quotes5")

Edited:

import WS_Vardata.spiders.quotes_spiders as quote_spider_module

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

def run_spider(spiderName):
    # get the class from within the module
    spiderClass = getattr(quote_spider_module, spiderName)
    crawler = CrawlerRunner(get_project_settings())  # from Scrapy docs
    # CrawlerRunner expects the spider class (not an instance)
    d = crawler.crawl(spiderClass)
    # CrawlerRunner does not manage the Twisted reactor for you
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # blocks until the crawl is finished

run_spider("QuotesSpider")

This script should run in the same directory as WS_Vardata.

So in your case:

- TEST
  | the_code.py
  | WS_Vardata
    | spiders
      | quotes_spider  <= containing the QuotesSpider class

