Crawling with an authenticated session in Scrapy
Do not override the parse function in a CrawlSpider:
When you are using a CrawlSpider, you shouldn't override the parse function. There's a warning about this in the CrawlSpider documentation here: http://doc.scrapy.org/en/0.14/topics/spiders.html#scrapy.contrib.spiders.Rule
This is because with a CrawlSpider, parse (the default callback of any request) sends the response to be processed by the Rules.
Logging in before crawling:
In order to have some kind of initialisation before a spider starts crawling, you can use an InitSpider (which inherits from CrawlSpider) and override the init_request function. This function will be called when the spider is initialising, before it starts crawling.
In order for the spider to begin crawling, you need to call self.initialized.
You can read the code that's responsible for this here (it has helpful docstrings).
An example:
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

class MySpider(InitSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    login_page = 'http://www.example.com/login'
    start_urls = ['http://www.example.com/useful_page/',
                  'http://www.example.com/another_useful_page/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'),
             callback='parse_item', follow=True),
    )

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(
            response,
            formdata={'name': 'herman', 'password': 'password'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "Hi Herman" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            return self.initialized()
        else:
            self.log("Bad times :(")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        # Scrape data from the page here.
        pass
Saving items:
Items your spider returns are passed along to the pipeline, which is responsible for doing whatever you want done with the data. I recommend you read the documentation: http://doc.scrapy.org/en/0.14/topics/item-pipeline.html
If you have any problems/questions in regard to Items, don't hesitate to pop open a new question and I'll do my best to help.
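The answer above links the pipeline docs but doesn't show one; as a minimal sketch of the shape those docs describe (the class name, field name, and module path below are made up for illustration):

```python
# Minimal item-pipeline sketch. Scrapy calls process_item once per item
# the spider returns; what you do with the item is entirely up to you.
class CollectItemsPipeline(object):
    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        # Validate it, write it to a database, dump it to a file...
        # here we just keep it in memory.
        self.items.append(dict(item))
        return item  # pass the item along to any later pipelines

# Enabled in settings.py with something like:
# ITEM_PIPELINES = ['myproject.pipelines.CollectItemsPipeline']
```

Returning the item (rather than raising DropItem) lets the rest of the pipeline chain see it.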
Using Scrapy with authenticated (logged in) user session
In the code above, the FormRequest that is being used to authenticate has the after_login function set as its callback. This means that the after_login function will be called and passed the page that the login attempt got as a response.
It then checks that you are successfully logged in by searching the page for a specific string, in this case "authentication failed". If it finds that string, the spider ends.
Now, once the spider has got this far, it knows that it has successfully authenticated, and you can start spawning new requests and/or scraping data. So, in this case:
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy import log

# ...

def after_login(self, response):
    # Check that the login succeeded before going on.
    if "authentication failed" in response.body:
        self.log("Login failed", level=log.ERROR)
        return
    else:
        # We've successfully authenticated, let's have some fun!
        return Request(url="http://www.example.com/tastypage/",
                       callback=self.parse_tastypage)

def parse_tastypage(self, response):
    hxs = HtmlXPathSelector(response)
    yum = hxs.select('//img')
    # etc.
If you look here, there's an example of a spider that authenticates before scraping. In this case, it handles things in the parse function (the default callback of any request).
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    if hxs.select("//form[@id='UsernameLoginForm_LoginForm']"):
        return self.login(response)
    else:
        return self.get_section_links(response)
So, whenever a request is made, the response is checked for the presence of the login form. If it is there, we know we need to log in, so we call the relevant function; if it's not present, we call the function that is responsible for scraping the data from the response.
I hope this is clear, feel free to ask if you have any other questions!
Edit:
Okay, so you want to do more than just spawn a single request and scrape it. You want to follow links.
To do that, all you need to do is scrape the relevant links from the page, and spawn requests using those URLs. For example:
def parse_page(self, response):
    """Scrape useful stuff from the page, and spawn new requests."""
    hxs = HtmlXPathSelector(response)
    images = hxs.select('//img')
    # .. do something with them
    links = hxs.select('//a/@href').extract()
    # Yield a new request for each link we found
    for link in links:
        yield Request(url=link, callback=self.parse_page)
As you can see, it spawns a new request for every URL on the page, and each one of those requests will call this same function with their response, so we have some recursive scraping going on.
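One practical wrinkle the example glosses over: the hrefs you scrape are often relative, so you'll want to join them against the page URL (and usually skip URLs you've already seen) before spawning requests. A framework-agnostic sketch, with a function name of my own invention:

```python
from urllib.parse import urljoin

def absolutize(page_url, hrefs):
    """Turn hrefs scraped from a page into absolute, de-duplicated URLs."""
    seen = set()
    urls = []
    for href in hrefs:
        url = urljoin(page_url, href)  # resolve relative links
        if url not in seen:            # avoid re-queueing duplicates
            seen.add(url)
            urls.append(url)
    return urls
```

Scrapy's scheduler also filters duplicate requests by default, so the de-duplication here is belt-and-braces rather than strictly required.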
What I've written above is just an example. If you want to "crawl" pages, you should look into CrawlSpider rather than doing things manually.
How to crawl a website that requires login using scrapy?
Short answer: yes, you can scrape data after logging in. Check formdata in Scrapy, this answer about making a POST request using Scrapy, and the documentation.
Long answer: login pages are just forms. You can fill in the required fields and post that data. Log in manually and check the Chrome developer tools [Ctrl + Shift + I] for the network call made when you press the submit/login button. You can then inspect the POST request that was made and duplicate it in your scraper. Check the links above to read about how to post data, and how requests and responses work in Scrapy.
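As a framework-agnostic sketch of "duplicate the POST you saw in DevTools" (the URL and field names below are assumptions; copy the real ones from the Network tab):

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields, as copied from the inspected login POST.
formdata = {'username': 'herman', 'password': 'secret'}

# Build the same request the browser sent; urllib treats a request
# with a body as a POST.
req = Request('http://www.example.com/login',
              data=urlencode(formdata).encode('utf-8'))
```

In Scrapy itself you would express the same thing as FormRequest with a formdata argument, and FormRequest.from_response (shown in the first answer) can extract the form's hidden fields for you.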
Scrapy authenticated crawl
Alright, the login from the Scrapy documentation works. It was just a small configuration error with the cookie jars.