Scrapy and Proxies

From the Scrapy FAQ:

Does Scrapy work with HTTP proxies?

Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.

The easiest way to use a proxy is to set the environment variable http_proxy. How this is done depends on your shell.


C:\>set http_proxy=http://proxy:port
csh% setenv http_proxy http://proxy:port
sh$ export http_proxy=http://proxy:port

If you want to use an HTTPS proxy to visit HTTPS sites, set the https_proxy environment variable instead:


C:\>set https_proxy=https://proxy:port
csh% setenv https_proxy https://proxy:port
sh$ export https_proxy=https://proxy:port

How to use rotating proxy in Scrapy?

If adding the proxy to request parameters does not work, try one of the following.
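For reference, "adding the proxy to request parameters" means setting the proxy meta key (and, for authenticated proxies, a Proxy-Authorization header) on each request. Here is a minimal sketch of the values involved, using only the standard library; the helper name proxy_request_meta is illustrative, not a Scrapy API:

```python
import base64

def proxy_request_meta(proxy_url, user=None, password=None):
    """Build the meta dict and headers that Scrapy's HttpProxyMiddleware
    reads from a request (illustrative helper, not part of Scrapy)."""
    meta = {"proxy": proxy_url}
    headers = {}
    if user is not None:
        # HTTP Basic auth: base64("user:password")
        creds = base64.b64encode(f"{user}:{password}".encode()).decode()
        headers["Proxy-Authorization"] = "Basic " + creds
    return meta, headers

# In a spider you would pass these on:
# yield scrapy.Request(url, meta=meta, headers=headers)
meta, headers = proxy_request_meta("http://127.0.0.1:6666", "proxy_user", "proxy_password")
```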

#1

You can add a proxy middleware and enable it in the project settings (the better, safer option).

Here is working code for the middleware -

from w3lib.http import basic_auth_header
from scrapy.utils.project import get_project_settings

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        settings = get_project_settings()
        # Route the request through the configured proxy
        request.meta['proxy'] = settings.get('PROXY_HOST') + ':' + settings.get('PROXY_PORT')
        # Authenticate against the proxy with HTTP Basic auth
        request.headers['Proxy-Authorization'] = basic_auth_header(
            settings.get('PROXY_USER'), settings.get('PROXY_PASSWORD'))
        spider.log('Proxy : %s' % request.meta['proxy'])

settings.py (activate the middleware in DOWNLOADER_MIDDLEWARES) -

import os
from dotenv import load_dotenv

load_dotenv()
....
....

# Proxy setup

PROXY_HOST = os.environ.get("PROXY_HOST")
PROXY_PORT = os.environ.get("PROXY_PORT")
PROXY_USER = os.environ.get("PROXY_USER")
PROXY_PASSWORD = os.environ.get("PROXY_PASSWORD")
.....
.....
.....

DOWNLOADER_MIDDLEWARES = {
# 'project.middlewares.projectDownloaderMiddleware': 543,
'project.proxy_middlewares.ProxyMiddleware': 350,
}

.env file -

PROXY_HOST=127.0.0.1
PROXY_PORT=6666
PROXY_USER=proxy_user
PROXY_PASSWORD=proxy_password

#2

Have a look at the scrapy-rotating-proxies middleware.
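As a sketch of how that middleware is typically wired into settings.py, based on the scrapy-rotating-proxies README (check the project's documentation for current option names and proxy hosts of your own):

```python
# settings.py - sketch; proxy addresses below are placeholders
ROTATING_PROXY_LIST = [
    'proxy1.example.com:8000',
    'proxy2.example.com:8031',
]

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
```

The middleware then picks a live proxy for each request and temporarily retires proxies that appear banned.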

Why would proxies fail in Scrapy but make successful requests under the python-requests library?

In your code sample built with requests, you implemented multiple sessions (one session per proxy).

However, with Scrapy's default settings, the application uses a single cookiejar for all proxies.

It will send the same cookie data for each proxy.

You need to use the cookiejar meta key in your requests.
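A minimal sketch of one cookiejar per proxy, written in pure Python so it runs standalone; it generates the keyword arguments you would pass to scrapy.Request (the proxy URLs and helper name are illustrative):

```python
PROXIES = ["http://proxy1:8000", "http://proxy2:8000"]

def proxied_request_kwargs(url, proxies):
    """Yield kwargs for scrapy.Request with a separate cookiejar per proxy,
    so each proxy keeps its own session cookies (illustrative helper)."""
    for jar_id, proxy in enumerate(proxies):
        yield {
            "url": url,
            # 'cookiejar' isolates cookies per session; 'proxy' routes the request
            "meta": {"proxy": proxy, "cookiejar": jar_id},
        }

# In a spider: for kw in proxied_request_kwargs(url, PROXIES): yield scrapy.Request(**kw)
kwargs = list(proxied_request_kwargs("https://example.com", PROXIES))
```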

If a web server receives requests from multiple IPs carrying a single session ID in the cookie headers, that looks suspicious: the server can identify the client as a bot and ban all of the IPs used. That is probably exactly what happened here.


