Scrapy - How to Manage Cookies/Sessions

Three years later, I think this is exactly what you were looking for:
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar

Just use something like this in your spider's start_requests method:

for i, url in enumerate(urls):
    yield scrapy.Request(url, meta={'cookiejar': i},
                         callback=self.parse_page)

And remember that for subsequent requests, you need to explicitly reattach the cookiejar each time:

def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_other_page)
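
Putting the two pieces together, a minimal self-contained spider might look like the sketch below. The spider name, start URLs, and the body of parse_other_page are placeholders for illustration; the cookiejar handling itself is exactly what the snippets above show, and it relies on Scrapy's built-in CookiesMiddleware (enabled by default via the COOKIES_ENABLED setting).

import scrapy

class CookieJarSpider(scrapy.Spider):
    # Name and URLs are hypothetical placeholders.
    name = 'cookiejar_example'
    urls = ['http://www.example.com/?user=1',
            'http://www.example.com/?user=2']

    def start_requests(self):
        # Give each session its own cookiejar, keyed by an integer.
        for i, url in enumerate(self.urls):
            yield scrapy.Request(url, meta={'cookiejar': i},
                                 callback=self.parse_page)

    def parse_page(self, response):
        # Reattach the same cookiejar key so this session's cookies
        # are sent (and updated) on the follow-up request.
        yield scrapy.Request("http://www.example.com/otherpage",
                             meta={'cookiejar': response.meta['cookiejar']},
                             callback=self.parse_other_page)

    def parse_other_page(self, response):
        # Cookies set earlier in this session are available here.
        self.logger.info("fetched %s with cookiejar %s",
                         response.url, response.meta['cookiejar'])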

Scrapy to manage session cookies with full WebKit JavaScript execution

Actually, the first approach does work, but with one modification: the cookie's path needs to be '/' (at least in my application), not None as in the code above. That is, the line should be

libsoup.soup_cookie_jar_add_cookie(cookiejar, libsoup.soup_cookie_new(cookiename, cookieval, up.hostname, '/', -1))
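
For context, here is a sketch of how that line fits into the surrounding setup, assuming ctypes bindings to libsoup and the old WebKitGTK 1.x API. The library file names, the url/cookiename/cookieval variables, and the Python 2 urlparse import are all assumptions that will vary with your system and code.

from ctypes import CDLL, c_char_p, c_int, c_void_p
from urlparse import urlparse  # Python 2, as in the WebKitGTK 1.x era

# Library file names are system-dependent; adjust to match your install.
libsoup = CDLL('libsoup-2.4.so.1')
libwebkit = CDLL('libwebkitgtk-1.0.so.0')

# Declare return/argument types so pointers survive on 64-bit systems.
libwebkit.webkit_get_default_session.restype = c_void_p
libsoup.soup_cookie_jar_new.restype = c_void_p
libsoup.soup_cookie_new.restype = c_void_p
libsoup.soup_cookie_new.argtypes = [c_char_p, c_char_p, c_char_p,
                                    c_char_p, c_int]
libsoup.soup_session_add_feature.argtypes = [c_void_p, c_void_p]
libsoup.soup_cookie_jar_add_cookie.argtypes = [c_void_p, c_void_p]

# Attach a cookie jar to WebKit's default SoupSession...
session = libwebkit.webkit_get_default_session()
cookiejar = libsoup.soup_cookie_jar_new()
libsoup.soup_session_add_feature(session, cookiejar)

# ...then add the cookie with path '/' rather than None.
up = urlparse(url)  # url, cookiename, cookieval come from the spider
libsoup.soup_cookie_jar_add_cookie(
    cookiejar,
    libsoup.soup_cookie_new(cookiename, cookieval, up.hostname, '/', -1))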

Unfortunately this only pushes the question back a bit. Now the cookies are saved properly, but the full page (including the frames) is still not being loaded and rendered with WebKit as I had expected, so the DOM is not complete as I see it within the browser. If I simply request the frame that I want, I get the error page instead of the content shown in a real browser. I'd love to see how to use WebKit to render the whole page, including frames, or how to achieve the second approach: completing the entire session in WebKit.


