Crawling with an Authenticated Session in Scrapy

Do not override the parse function in a CrawlSpider:

When you are using a CrawlSpider, you shouldn't override the parse function. There's a warning in the CrawlSpider documentation here: http://doc.scrapy.org/en/0.14/topics/spiders.html#scrapy.contrib.spiders.Rule

This is because with a CrawlSpider, parse (the default callback of any request) sends the response to be processed by the Rules.
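
For illustration, here's a minimal sketch of the safe pattern (the spider name and URLs are made up): let CrawlSpider keep its own parse and point your Rule at a differently named callback, such as parse_item.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ItemsSpider(CrawlSpider):
    name = 'items'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    rules = (
        # CrawlSpider's built-in parse() dispatches responses to these
        # rules, so the callback must NOT be named 'parse'.
        Rule(SgmlLinkExtractor(allow=r'/items/'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log("Scraping %s" % response.url)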


Logging in before crawling:

In order to have some kind of initialisation before a spider starts crawling, you can use an InitSpider and override the init_request function. This function will be called when the spider is initialising, and before it starts crawling.

In order for the spider to begin crawling, you need to call (and return) self.initialized().

You can read the code that's responsible for this here (it has helpful docstrings).


An example:

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

class MySpider(InitSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    login_page = 'http://www.example.com/login'
    start_urls = ['http://www.example.com/useful_page/',
                  'http://www.example.com/another_useful_page/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'),
             callback='parse_item', follow=True),
    )

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "Hi Herman" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            return self.initialized()
        else:
            self.log("Bad times :(")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        # Scrape data from page
        pass

Saving items:

Items your spider returns are passed along to the Item Pipeline, which is responsible for doing whatever you want done with the data. I recommend you read the documentation: http://doc.scrapy.org/en/0.14/topics/item-pipeline.html
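
As a rough sketch only (the class name, field name and output file are made up, and the pipeline still has to be enabled via the ITEM_PIPELINES setting), a minimal pipeline could look something like this:

class SaveToFilePipeline(object):
    """Write one line per item to a text file (hypothetical example)."""

    def __init__(self):
        self.file = open('items.txt', 'w')

    def process_item(self, item, spider):
        # 'title' is a made-up field name; use whatever your Item defines.
        self.file.write('%s\n' % item.get('title', ''))
        return item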

If you have any problems or questions regarding Items, don't hesitate to open a new question and I'll do my best to help.

Using Scrapy with authenticated (logged in) user session

Here, the FormRequest that is used to authenticate has the after_login function set as its callback (see the snippet below). This means that the after_login function will be called and passed the page that the login attempt received as a response.

It then checks whether you are successfully logged in by searching the page for a specific string that indicates failure, in this case "authentication failed". If that string is found, the spider ends.

Now, once the spider has got this far, it knows that it has successfully authenticated, and you can start spawning new requests and/or scraping data. So, in this case:

from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy import log

# ...

def after_login(self, response):
    # Check that the login succeeded before going on
    if "authentication failed" in response.body:
        self.log("Login failed", level=log.ERROR)
        return
    else:
        # We've successfully authenticated, let's have some fun!
        return Request(url="http://www.example.com/tastypage/",
                       callback=self.parse_tastypage)

def parse_tastypage(self, response):
    hxs = HtmlXPathSelector(response)
    yum = hxs.select('//img')

    # etc.

If you look here, there's an example of a spider that authenticates before scraping.

In this case, it handles things in the parse function (the default callback of any request).

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    if hxs.select("//form[@id='UsernameLoginForm_LoginForm']"):
        return self.login(response)
    else:
        return self.get_section_links(response)

So, whenever a request is made, the response is checked for the presence of the login form. If the form is there, we know that we need to log in, so we call the relevant function; if it's not present, we call the function that is responsible for scraping the data from the response.

I hope this is clear, feel free to ask if you have any other questions!


Edit:

Okay, so you want to do more than just spawn a single request and scrape it. You want to follow links.

To do that, all you need to do is scrape the relevant links from the page, and spawn requests using those URLs. For example:

def parse_page(self, response):
    """Scrape useful stuff from the page, and spawn new requests."""
    hxs = HtmlXPathSelector(response)
    images = hxs.select('//img')
    # .. do something with them
    links = hxs.select('//a/@href').extract()

    # Yield a new request for each link we found
    for link in links:
        yield Request(url=link, callback=self.parse_page)

As you can see, it spawns a new request for every URL found on the page, and each of those requests will call this same function with its response, so we have some recursive scraping going on.

What I've written above is just an example. If you want to "crawl" pages, you should look into CrawlSpider rather than doing things manually.

How to crawl a website that requires login using scrapy?

Short answer: yes, you can scrape data after logging in. Check FormRequest and its formdata argument in Scrapy, this answer on making a POST request using Scrapy, and the documentation.

Long answer: login pages are just forms. You can fill in the required fields and post that data. You can manually log in and check the Chrome developer tools [Ctrl + Shift + I] for the network call that is made when you press the submit/login button. You can then inspect that POST request and duplicate it in your scraper. Check the links above to read about how to post data, and about how requests and responses work in Scrapy.
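
As a rough sketch of that idea (the URL and the 'username'/'password' field names are placeholders; use whatever the real form posts), duplicating the login request could look something like this:

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

class LoginSpider(BaseSpider):
    name = 'login_example'
    start_urls = ['http://www.example.com/login']

    def parse(self, response):
        # from_response picks up hidden fields (e.g. CSRF tokens) from the
        # login form; we fill in the fields we saw in the browser's
        # network tab.
        return FormRequest.from_response(
            response,
            formdata={'username': 'myuser', 'password': 'mypassword'},
            callback=self.after_login)

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed")
            return
        # The session cookie is kept automatically, so requests yielded
        # from here on are made as the logged-in user.
        self.log("Login succeeded, crawl away.")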

Scrapy authenticated crawl

Alright, the login from the Scrapy documentation works. It was just a small configuration error with the cookie jars.
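
For reference, if the problem is keeping several sessions apart in one spider, Scrapy's cookies middleware supports a "cookiejar" key in request meta (in versions that offer multiple cookie sessions per spider). A rough sketch with placeholder URLs:

from scrapy.spider import BaseSpider
from scrapy.http import Request

class MultiSessionSpider(BaseSpider):
    name = 'multi_session'
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        # Start two independent sessions; each integer selects its own
        # cookie jar. In a real spider you would log each session in
        # first (e.g. with FormRequest.from_response).
        for jar_id in range(2):
            yield Request('http://www.example.com/private',
                          meta={'cookiejar': jar_id},
                          callback=self.parse_private,
                          dont_filter=True)

    def parse_private(self, response):
        # Further requests must carry the same cookiejar value to stay in
        # the same session.
        self.log("Fetched %s with cookiejar %s"
                 % (response.url, response.meta['cookiejar']))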


