Run Scrapy from Flask
Here's a minimal example of how you can do it with ScrapyRT.
This is the project structure:
project/
├── scraping
│   ├── example
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       ├── __init__.py
│   │       └── quotes.py
│   └── scrapy.cfg
└── webapp
    └── example.py
The scraping directory contains the Scrapy project. This project contains one spider, quotes.py, which scrapes some quotes from quotes.toscrape.com:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'text': quote.xpath('normalize-space(./span[@class="text"])').extract_first()
            }
To start ScrapyRT and have it listen for scraping requests, go to the Scrapy project's scraping directory and issue the scrapyrt command:
$ cd ./project/scraping
$ scrapyrt
ScrapyRT will now listen on localhost:9080.
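As a quick sanity check that ScrapyRT is reachable, you can request its crawl.json endpoint directly. Here is a minimal sketch that builds the request URL with the standard library; it uses the same endpoint and parameters as the Flask app shown below:

```python
from urllib.parse import urlencode


def crawl_json_url(spider_name, host='localhost', port=9080):
    # ScrapyRT exposes a crawl.json endpoint on the port it listens on;
    # spider_name selects which spider in the project to run.
    params = urlencode({'spider_name': spider_name, 'start_requests': 'true'})
    return 'http://{}:{}/crawl.json?{}'.format(host, port, params)
```

Opening the resulting URL (with curl, a browser, or requests) should return a JSON document containing an items list.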
The webapp directory contains a simple Flask app that scrapes quotes on demand (using the spider above) and displays them to the user:
from __future__ import unicode_literals

import json

import requests
from flask import Flask

app = Flask(__name__)


@app.route('/')
def show_quotes():
    params = {
        'spider_name': 'quotes',
        'start_requests': True
    }
    response = requests.get('http://localhost:9080/crawl.json', params)
    data = json.loads(response.text)
    result = '\n'.join('<p><b>{}</b> - {}</p>'.format(item['author'], item['text'])
                       for item in data['items'])
    return result
To start the app:
$ cd ./project/webapp
$ FLASK_APP=example.py flask run
Now when you point your browser at localhost:5000, you'll see the list of quotes freshly scraped from quotes.toscrape.com.
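One caveat about the view above: scraped text is interpolated into the response HTML unescaped, so a quote containing markup would be rendered rather than displayed. A minimal hardening sketch using the standard library's html.escape (render_quotes is a hypothetical helper, not part of the app above):

```python
import html


def render_quotes(items):
    # Escape scraped fields before embedding them in HTML so that any
    # markup present in the source page is shown as text, not rendered.
    return '\n'.join(
        '<p><b>{}</b> - {}</p>'.format(html.escape(item['author']),
                                       html.escape(item['text']))
        for item in items
    )
```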
Building a RESTful Flask API for Scrapy
I don't think there is a good way to create a Flask-based API for Scrapy. Flask is not the right tool for that because it is not based on an event loop. To make things worse, the Twisted reactor (which Scrapy uses) can't be started/stopped more than once in a single thread.
Let's assume there were no problem with the Twisted reactor and you could start and stop it. It wouldn't make things much better, because your scrape_it function may block for an extended period of time, so you would need many threads/processes.
I think the way to go is to create an API using an async framework like Twisted or Tornado; it will be more efficient than a Flask-based (or Django-based) solution because the API will be able to serve requests while Scrapy is running a spider.
Scrapy is based on Twisted, so using twisted.web or https://github.com/twisted/klein can be more straightforward. But Tornado is also not hard, because you can make it use the Twisted event loop.
There is a project called ScrapyRT which does something very similar to what you want to implement - it is an HTTP API for Scrapy. ScrapyRT is based on Twisted.
As an example of Scrapy-Tornado integration, check Arachnado - here is an example of how to integrate Scrapy's CrawlerProcess with Tornado's Application.
If you really want a Flask-based API, then it could make sense to start crawls in separate processes and/or use a queue solution like Celery. This way you're losing most of Scrapy's efficiency; if you go this way, you can use requests + BeautifulSoup as well.
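The "separate processes" route can be sketched with the standard library alone. In a real app, run_spider would create a CrawlerProcess inside the child process (which sidesteps the reactor-restart problem, since each child gets a fresh reactor); the placeholder items below are an assumption standing in for real scraped output:

```python
import multiprocessing


def run_spider(result_queue):
    # In a real app this would create a CrawlerProcess, collect items
    # via a pipeline, and put them on the queue. Placeholder data here.
    result_queue.put([{'author': 'placeholder', 'text': 'placeholder'}])


def scrape_in_subprocess():
    # Each call runs in a fresh child process, so the Twisted reactor
    # is started at most once per process.
    result_queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=run_spider, args=(result_queue,))
    proc.start()
    items = result_queue.get()  # blocks until the child posts results
    proc.join()
    return items
```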
Run scrapy from Flask application
When you use sudo, the shell it starts will ask for a password on the tty - it specifically doesn't read standard input for this information. Since flask and other web applications typically run detached from a terminal, sudo has no way to ask for a password, so it looks for a program that can provide the password. You can find more information on this topic in this answer.
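If you really must run the command with elevated privileges from a detached process, one common alternative is to grant passwordless sudo for that one program only. A hedged sketch of a sudoers drop-in (the user name and scrapy path are assumptions; always edit such files with visudo):

```
# /etc/sudoers.d/scrapy  (hypothetical file; user and path are assumptions)
www-data ALL=(ALL) NOPASSWD: /usr/local/bin/scrapy
```

That said, it is usually better to avoid sudo entirely and run the scraper as the same unprivileged user as the web app.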
The reason you aren't finding scrapy is most likely differences in your $PATH between the interactive shells you used in testing and the process that's running flask. The easiest way to get around this is to give the full path to the scrapy program in your command.
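A sketch of that fix in Python, resolving the absolute path once instead of hard-coding it (the fallback path is an assumption for illustration):

```python
import shutil


def scrapy_command(spider_name):
    # Resolve scrapy's absolute path so the command works even when the
    # web app's $PATH differs from your interactive shell's.
    scrapy_path = shutil.which('scrapy') or '/usr/local/bin/scrapy'
    return [scrapy_path, 'crawl', spider_name]
```

Pass the resulting list to subprocess.run instead of a bare 'scrapy crawl ...' string.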
Running a scrapy spider in the background in a Flask app
It's not the best idea to have Flask start long-running threads like this.
I would recommend using a queue system like Celery or RabbitMQ. Your Flask application can put the tasks that you would like to do in the background on the queue and then return immediately.
Then you can have workers outside of your main app process those tasks and do all of your scraping.
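The queue pattern can be sketched with the standard library alone (a real deployment would use Celery workers and a broker such as RabbitMQ; the 'scraped' string below is a stand-in for actually launching a spider):

```python
import queue
import threading

tasks = queue.Queue()
results = []


def worker():
    # Runs outside the request cycle; pulls jobs and does the slow work.
    while True:
        job = tasks.get()
        if job is None:  # sentinel: shut the worker down
            break
        results.append('scraped ' + job)  # stand-in for running a spider
        tasks.task_done()


def enqueue_scrape(url):
    # What the Flask view does: enqueue the job and return immediately.
    tasks.put(url)
    return 'accepted'
```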
Why does scrapy crawler only work once in flask app?
Scrapy recommends using CrawlerRunner instead of CrawlerProcess.
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class MySpider(scrapy.Spider):
    # Spider definition
    ...


configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
d = runner.crawl(MySpider)


def finished(e):
    print("finished")


def spider_error(e):
    print("spider error :/")


d.addCallback(finished)
d.addErrback(spider_error)
reactor.run()
More information about the reactor is available here: ReactorBasic