How to Integrate Flask & Scrapy

Run Scrapy from Flask

Here's a minimal example of how you can do it with ScrapyRT.

This is the project structure:

project/
├── scraping
│   ├── example
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       ├── __init__.py
│   │       └── quotes.py
│   └── scrapy.cfg
└── webapp
    └── example.py

The scraping directory contains the Scrapy project. This project has one spider, quotes.py, which scrapes some quotes from quotes.toscrape.com:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'text': quote.xpath('normalize-space(./span[@class="text"])').extract_first()
            }

To start ScrapyRT and have it listen for scraping requests, go to the Scrapy project's directory scraping and issue the scrapyrt command:

$ cd ./project/scraping
$ scrapyrt

ScrapyRT will now listen on localhost:9080.
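
Before wiring it into Flask, you can sanity-check the endpoint from a Python shell. This is just a quick check using the same crawl.json parameters the Flask app below sends:

# Quick sanity check of the ScrapyRT endpoint; run this in a Python shell
# while scrapyrt is still running in the Scrapy project directory.
import requests

resp = requests.get('http://localhost:9080/crawl.json',
                    params={'spider_name': 'quotes', 'start_requests': True})
print(resp.status_code)
print(resp.json().get('items', [])[:2])  # first couple of scraped quotes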

The webapp directory contains a simple Flask app that scrapes quotes on demand (using the spider above) and displays them to the user:

from __future__ import unicode_literals

import json
import requests

from flask import Flask

app = Flask(__name__)


@app.route('/')
def show_quotes():
    params = {
        'spider_name': 'quotes',
        'start_requests': True
    }
    response = requests.get('http://localhost:9080/crawl.json', params)
    data = json.loads(response.text)
    result = '\n'.join('<p><b>{}</b> - {}</p>'.format(item['author'], item['text'])
                       for item in data['items'])
    return result

To start the app:

$ cd ./project/webapp
$ FLASK_APP=example.py flask run

Now when you point your browser at localhost:5000, you'll see the list of quotes freshly scraped from quotes.toscrape.com.

Building a RESTful Flask API for Scrapy

I don't think there is a good way to create a Flask-based API for Scrapy. Flask is not the right tool for this because it is not based on an event loop. To make things worse, the Twisted reactor (which Scrapy uses) can't be started/stopped more than once in a single thread.

Even if we assume there were no problem with the Twisted reactor and you could start and stop it at will, it wouldn't make things much better, because your scrape_it function may block for an extended period of time, so you would still need many threads/processes.

I think the way to go is to create the API using an async framework like Twisted or Tornado; it will be more efficient than a Flask-based (or Django-based) solution because the API will be able to keep serving requests while Scrapy is running a spider.

Scrapy is based on Twisted, so using twisted.web or klein (https://github.com/twisted/klein) can be more straightforward. But Tornado is also not hard because you can make it use the Twisted event loop.
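
For instance, here is a rough sketch of what a klein-based endpoint could look like. It reuses the QuotesSpider from the first section and collects items through Scrapy's item_scraped signal; the route path, port and import path are just examples, and this is a starting point rather than a finished implementation:

import json

from klein import Klein
from scrapy import signals
from scrapy.crawler import CrawlerRunner

from example.spiders.quotes import QuotesSpider  # adjust to your project layout

app = Klein()
runner = CrawlerRunner()


@app.route('/quotes')
def quotes(request):
    items = []
    crawler = runner.create_crawler(QuotesSpider)
    # collect each item as the spider yields it
    crawler.signals.connect(lambda item, **kwargs: items.append(dict(item)),
                            signal=signals.item_scraped)
    d = runner.crawl(crawler)
    # once the crawl finishes, hand the serialized items back to klein;
    # the server keeps serving other requests while the crawl runs
    d.addCallback(lambda _: json.dumps(items))
    return d


app.run('localhost', 8080)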

There is a project called ScrapyRT which does something very similar to what you want to implement - it is an HTTP API for Scrapy. ScrapyRT is based on Twisted.

As an example of Scrapy-Tornado integration, check out Arachnado - it shows how to integrate Scrapy's CrawlerProcess with Tornado's Application.

If you really want a Flask-based API, then it could make sense to start crawls in separate processes and/or use a queue solution like Celery, as sketched below. This way you lose most of Scrapy's efficiency; if you go this route, you could use requests + BeautifulSoup just as well.
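
A minimal sketch of the separate-process idea, assuming the QuotesSpider from the first section is importable and that you're on a Scrapy version that supports the FEEDS setting: each request forks a fresh process with its own CrawlerProcess and reactor, and the scraped items land in a file.

from multiprocessing import Process

from flask import Flask
from scrapy.crawler import CrawlerProcess

from example.spiders.quotes import QuotesSpider  # adjust to your project layout

app = Flask(__name__)


def run_spider():
    process = CrawlerProcess(settings={
        # dump scraped items to a file; swap this for a pipeline if you prefer
        'FEEDS': {'quotes.json': {'format': 'json'}},
    })
    process.crawl(QuotesSpider)
    process.start()  # blocks until the crawl finishes (in the child process)


@app.route('/crawl')
def crawl():
    # each request gets its own process, so the "reactor can only run once"
    # restriction never affects the Flask process itself
    Process(target=run_spider).start()
    return 'Crawl started', 202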

Run scrapy from Flask application

When you use sudo, the shell it starts will ask for a password on the tty - it specifically doesn't read standard input for this information. Since Flask and other web applications typically run detached from a terminal, sudo has no way to ask for a password, so it looks for a program that can provide the password. You can find more information on this topic in this answer.

The reason you aren't finding scrapy is most likely because of differences in your $PATH between the interactive shells you used in testing and the process that's running flask. The easiest way to get around this is to give the full path to the scrapy program in your command.
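
For example, if you launch the crawl from the Flask app with subprocess, hard-code the absolute path you get from running which scrapy in your own environment (the path and project directory below are just examples):

import subprocess

SCRAPY_BIN = '/usr/local/bin/scrapy'  # example path; use the output of `which scrapy`


def start_crawl(spider_name):
    # cwd must be the Scrapy project directory (where scrapy.cfg lives)
    return subprocess.Popen([SCRAPY_BIN, 'crawl', spider_name],
                            cwd='./project/scraping')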

Running a scrapy spider in the background in a Flask app

It's not the best idea to have Flask start long-running threads like this.

I would recommend using a queue system like Celery or RabbitMQ. Your Flask application can put the tasks it wants done in the background on the queue and then return immediately.

Then you can have workers outside of your main app that process those tasks and do all of your scraping.
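
A rough sketch of that setup with Celery; the broker URL, module names and spider name are examples, and the worker is started separately (e.g. with celery -A tasks worker):

import subprocess

from celery import Celery
from flask import Flask

celery_app = Celery('tasks', broker='redis://localhost:6379/0')
flask_app = Flask(__name__)


@celery_app.task
def run_spider(spider_name):
    # the worker runs the spider in a fresh process per crawl, which keeps
    # the Twisted reactor restriction out of the picture
    subprocess.run(['scrapy', 'crawl', spider_name], cwd='./project/scraping')


@flask_app.route('/scrape')
def scrape():
    run_spider.delay('quotes')  # enqueue and return right away
    return 'Scrape job queued', 202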

Why does scrapy crawler only work once in flask app?

Scrapy recommends using CrawlerRunner instead of CrawlerProcess when your application is already running a Twisted reactor.

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class MySpider(scrapy.Spider):
    # Spider definition goes here
    ...


configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
d = runner.crawl(MySpider)


def finished(e):
    print("finished")


def spider_error(e):
    print("spider error :/")


d.addCallback(finished)
d.addErrback(spider_error)
reactor.run()

More information about the reactor is available here: ReactorBasics.
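
As a standalone illustration of why CrawlerRunner helps here: a single runner can schedule the same spider several times within one reactor run, which is exactly what CrawlerProcess can't do once its reactor has stopped. A minimal sketch, built on the snippet above:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()


@defer.inlineCallbacks
def crawl_twice():
    # MySpider is the spider defined in the snippet above
    yield runner.crawl(MySpider)  # first crawl
    yield runner.crawl(MySpider)  # second crawl, same process, same reactor
    reactor.stop()


crawl_twice()
reactor.run()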


