Scrapy Very Basic Example
You may have better luck looking through the tutorial first, as opposed to the "Scrapy at a glance" webpage.
The tutorial implies that Scrapy is, in fact, a separate program.
Running the command scrapy startproject tutorial
will create a folder called tutorial with several files already set up for you.
For example, in my case the modules/packages items, pipelines, settings and spiders were added to the root package tutorial:
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
The TorrentItem class would be placed inside items.py, and the MininovaSpider class would go inside the spiders folder.
Once the project is set up, the command-line parameters for Scrapy appear to be fairly straightforward. They take the form:
scrapy crawl <website-name> -o <output-file> -t <output-type>
Alternatively, if you want to run scrapy without the overhead of creating a project directory, you can use the runspider command:
scrapy runspider my_spider.py
Scrapy Tutorial Example
It seems this spider in the tutorial is outdated. The website has changed a bit, so all of the XPaths now capture nothing. This is easily fixable:
def parse(self, response):
    sites = response.xpath('//div[@class="title-and-desc"]/a')
    for site in sites:
        item = dict()
        item['name'] = site.xpath("text()").extract_first()
        item['url'] = site.xpath("@href").extract_first()
        item['description'] = site.xpath("following-sibling::div/text()").extract_first('').strip()
        yield item
For future reference, you can always test whether a specific XPath works with the scrapy shell command.
For example, here is what I did to test this:
$ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
# test sites xpath
response.xpath('//ul[@class="directory-url"]/li')
[]
# ok it doesn't work, check out page in web browser
view(response)
# find correct xpath and test that:
response.xpath('//div[@class="title-and-desc"]/a')
# 21 result nodes printed
# it works!
Very Simple Scrapy+Splash project
The problem is that I didn't start the Scrapy project with the command scrapy startproject name but created the folders and files myself.
Scraping all text using Scrapy without knowing webpages' structure
What you are looking for here is Scrapy's CrawlSpider.
CrawlSpider lets you define crawling rules that are followed for every page. It's smart enough to avoid crawling images, documents and other files that are not web resources, and it pretty much does the whole thing for you.
Here's a good example of how your spider might look with CrawlSpider:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'crawlspider'
    start_urls = ['http://scrapy.org']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = dict()
        item['url'] = response.url
        item['title'] = response.meta['link_text']
        # extracting basic body
        item['body'] = '\n'.join(response.xpath('//text()').extract())
        # or better just save whole source
        item['source'] = response.body
        return item
This spider will crawl every webpage it can find on the website and log the title, URL and whole text body.
For the text body you might want to extract it in some smarter way (to exclude JavaScript and other unwanted text nodes), but that's an issue of its own to discuss.
Actually, for what you are describing you probably want to save the full HTML source rather than text only, since unstructured text is useless for any sort of analytics or indexing.
There's also a bunch of Scrapy settings that can be adjusted for this type of crawling. It's very nicely described in the Broad Crawls docs page.
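For instance, a few of the settings that page suggests tuning. The values below are illustrative examples for a settings.py, not recommendations:

```python
# settings.py -- illustrative broad-crawl tweaks; see the Broad Crawls docs
# for the full discussion. The values below are examples, not recommendations.
CONCURRENT_REQUESTS = 100          # raise global concurrency
REACTOR_THREADPOOL_MAXSIZE = 20    # more threads for DNS resolution
LOG_LEVEL = 'INFO'                 # reduce logging overhead
COOKIES_ENABLED = False            # broad crawls rarely need cookies
RETRY_ENABLED = False              # skip retries to keep throughput up
DOWNLOAD_TIMEOUT = 15              # drop slow responses sooner
AJAXCRAWL_ENABLED = True           # handle legacy AJAX-crawlable pages
```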
A web crawler in a self-contained python file
You can run Scrapy spiders as a single script without starting a project by using the runspider command. Is this what you wanted?
# myscript.py
from scrapy.item import Item, Field
from scrapy import Spider

class MyItem(Item):
    title = Field()
    link = Field()

class MySpider(Spider):
    start_urls = ['http://www.example.com']
    name = 'samplespider'

    def parse(self, response):
        item = MyItem()
        item['title'] = response.xpath('//h1/text()').extract()
        item['link'] = response.url
        yield item
Now you can run this with scrapy runspider myscript.py -o out.json
Scrapy: scrape successive urls
Add a start_requests method to your class and generate those requests as needed:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "scraper"

    def start_requests(self):
        n = ???  # set the limit here
        for i in range(1, n):
            yield scrapy.Request('http://www.example.com/{}'.format(i), self.parse)

    def parse(self, response):
        for quote in response.css('#Ficha'):
            yield {
                'item_1': quote.css('div.ficha_med > div > h1').extract(),
            }
Alternatively, you can put multiple URLs in the start_urls attribute:
class QuotesSpider(scrapy.Spider):
    name = "scraper"
    start_urls = ['http://www.example.com/{}'.format(i) for i in range(1, 100)]
    # choose your limit here                                             ^^^

    def parse(self, response):
        for quote in response.css('#Ficha'):
            yield {
                'item_1': quote.css('div.ficha_med > div > h1').extract(),
            }