Scrapy Very Basic Example

You may have better luck looking through the tutorial first, as opposed to the "Scrapy at a glance" webpage.

The tutorial implies that Scrapy is, in fact, a separate program.

Running the command scrapy startproject tutorial will create a folder called tutorial with several files already set up for you.

For example, in my case, the modules/packages items, pipelines, settings and spiders have been added to the root package tutorial.

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

The TorrentItem class would be placed inside items.py, and the MininovaSpider class would go inside the spiders folder.
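
For reference, a minimal sketch of what tutorial/items.py could contain (the field names below are taken from the "Scrapy at a glance" TorrentItem example and are only illustrative):

from scrapy.item import Item, Field

class TorrentItem(Item):
    url = Field()
    name = Field()
    description = Field()
    size = Field()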

Once the project is set up, the command-line parameters for Scrapy appear to be fairly straightforward. They take the form:

scrapy crawl <website-name> -o <output-file> -t <output-type>
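
For example, assuming the spider's name attribute is mininova (the spider from the "at a glance" example; substitute your own spider's name), the invocation might look like:

scrapy crawl mininova -o scraped_data.json -t json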

Alternatively, if you want to run scrapy without the overhead of creating a project directory, you can use the runspider command:

scrapy runspider my_spider.py

Scrapy Tutorial Example

It seems this spider in the tutorial is outdated. The website has changed a bit, so all of the xpaths now capture nothing. This is easily fixable:

def parse(self, response):
    sites = response.xpath('//div[@class="title-and-desc"]/a')
    for site in sites:
        item = dict()
        item['name'] = site.xpath("text()").extract_first()
        item['url'] = site.xpath("@href").extract_first()
        item['description'] = site.xpath("following-sibling::div/text()").extract_first('').strip()
        yield item

For future reference, you can always test whether a specific xpath works with the scrapy shell command.

For example, here's what I did to test this:

$ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
# test sites xpath
response.xpath('//ul[@class="directory-url"]/li')
[]
# ok it doesn't work, check out page in web browser
view(response)
# find correct xpath and test that:
response.xpath('//div[@class="title-and-desc"]/a')
# 21 result nodes printed
# it works!

Very Simple Scrapy+Splash project

The problem is that I didn't start the Scrapy project with the command scrapy startproject name; instead, I created the folders and the files myself.
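
One thing scrapy startproject would have generated is a scrapy.cfg file at the project root; without it, project-level commands such as scrapy crawl cannot locate your settings module. A minimal sketch, assuming your package is called name, is roughly:

[settings]
default = name.settings

The rest of the files (items.py, settings.py, the spiders package) can be created by hand, as long as the package layout matches what default points to.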

Scraping all text using Scrapy without knowing webpages' structure

What you are looking for here is Scrapy's CrawlSpider.

CrawlSpider lets you define crawling rules that are followed for every page. It's smart enough to avoid crawling images, documents and other files that are not web resources and it pretty much does the whole thing for you.

Here's a good example of how your spider might look with CrawlSpider:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'crawlspider'
    start_urls = ['http://scrapy.org']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = dict()
        item['url'] = response.url
        item['title'] = response.meta['link_text']
        # extracting basic body
        item['body'] = '\n'.join(response.xpath('//text()').extract())
        # or better just save whole source
        item['source'] = response.body
        return item

This spider will crawl every webpage it can find on the website and log the title, url and whole text body.

For the text body you might want to extract it in some smarter way (to exclude javascript and other unwanted text nodes), but that's an issue of its own to discuss; a rough sketch follows below.
Actually, for what you are describing, you probably want to save the full html source rather than text only, since unstructured text is useless for any sort of analytics or indexing.
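
As a rough sketch of what "smarter" could mean here (assuming you keep the CrawlSpider above; this is illustrative and still won't strip hidden elements or navigation boilerplate), you could drop text nodes that live inside script and style tags:

    def parse_item(self, response):
        item = dict()
        item['url'] = response.url
        item['title'] = response.meta.get('link_text', '')
        # keep only text nodes that are not inside <script> or <style>
        texts = response.xpath(
            '//body//text()[not(ancestor::script) and not(ancestor::style)]'
        ).extract()
        item['body'] = '\n'.join(t.strip() for t in texts if t.strip())
        return item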

There's also a bunch of Scrapy settings that can be adjusted for this type of crawling. They're very nicely described on the Broad Crawls docs page.
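
As a sketch of the kind of knobs that page covers (the values below are illustrative, not recommendations), a broad-crawl settings.py might include:

# settings.py - example broad-crawl tuning; adjust for your own crawl
CONCURRENT_REQUESTS = 100        # raise global concurrency from the default 16
REACTOR_THREADPOOL_MAXSIZE = 20  # more threads for DNS resolution
LOG_LEVEL = 'INFO'               # DEBUG logging gets expensive on large crawls
COOKIES_ENABLED = False          # cookies rarely matter for broad crawls
RETRY_ENABLED = False            # don't spend time retrying failed pages
DOWNLOAD_TIMEOUT = 15            # give up on slow sites sooner
AJAXCRAWL_ENABLED = True         # handle "AJAX crawlable" pages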

A web crawler in a self-contained python file

You can run Scrapy spiders as a single script, without starting a project, by using runspider.
Is this what you wanted?

# myscript.py
from scrapy.item import Item, Field
from scrapy import Spider


class MyItem(Item):
    title = Field()
    link = Field()


class MySpider(Spider):

    start_urls = ['http://www.example.com']
    name = 'samplespider'

    def parse(self, response):
        item = MyItem()
        item['title'] = response.xpath('//h1/text()').extract()
        item['link'] = response.url
        yield item

Now you can run this with scrapy runspider myscript.py -o out.json

Scrapy: scrape successive urls

Add a start_requests method to your class and generate those requests as you need them:

import scrapy


class QuotesSpider(scrapy.Spider):

    name = "scraper"

    def start_requests(self):
        n = ???  # set the limit here
        for i in range(1, n):
            yield scrapy.Request('http://www.example.com/{}'.format(i), self.parse)

    def parse(self, response):
        for quote in response.css('#Ficha'):
            yield {
                'item_1': quote.css('div.ficha_med > div > h1').extract(),
            }

Another option: you can put multiple urls in the start_urls attribute:

class QuotesSpider(scrapy.Spider):
    name = "scraper"
    start_urls = ['http://www.example.com/{}'.format(i) for i in range(1, 100)]
    # choose your limit via the upper bound of range() above

    def parse(self, response):
        for quote in response.css('#Ficha'):
            yield {
                'item_1': quote.css('div.ficha_med > div > h1').extract(),
            }

