Scrapy and proxies
From the Scrapy FAQ,
Does Scrapy work with HTTP proxies?
Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.
The easiest way to use a proxy is to set the environment variable http_proxy. How this is done depends on your shell:
C:\>set http_proxy=http://proxy:port
csh% setenv http_proxy http://proxy:port
sh$ export http_proxy=http://proxy:port
If you want to proxy HTTPS traffic as well, set the https_proxy environment variable instead:
C:\>set https_proxy=https://proxy:port
csh% setenv https_proxy https://proxy:port
sh$ export https_proxy=https://proxy:port
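Besides environment variables, HttpProxyMiddleware also honors a proxy set per request through request.meta['proxy'], optionally with a Proxy-Authorization header for authenticated proxies. A minimal sketch of building those values (the helper name proxy_meta_and_headers is illustrative; w3lib.http.basic_auth_header produces the same header value):

```python
import base64

def proxy_meta_and_headers(host, port, user=None, password=None):
    """Build the meta and headers entries Scrapy's HttpProxyMiddleware
    reads for a per-request proxy."""
    meta = {"proxy": f"http://{host}:{port}"}
    headers = {}
    if user is not None:
        # Same value w3lib.http.basic_auth_header() would produce
        creds = base64.b64encode(f"{user}:{password}".encode()).decode()
        headers["Proxy-Authorization"] = f"Basic {creds}"
    return meta, headers

# In a spider you would then do something like:
#   yield scrapy.Request(url, meta=meta, headers=headers)
meta, headers = proxy_meta_and_headers("127.0.0.1", 6666, "proxy_user", "proxy_password")
```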
How to use rotating proxy in Scrapy?
If adding the proxy to the request parameters does not work, there are two options.
#1
You can add a proxy downloader middleware and enable it in the project settings (the better, safer option).
Here is working code for the middleware:
from w3lib.http import basic_auth_header
from scrapy.utils.project import get_project_settings

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        settings = get_project_settings()
        # Route the request through the configured proxy
        request.meta['proxy'] = settings.get('PROXY_HOST') + ':' + settings.get('PROXY_PORT')
        # Authenticate against the proxy
        request.headers['Proxy-Authorization'] = basic_auth_header(
            settings.get('PROXY_USER'), settings.get('PROXY_PASSWORD'))
        spider.log('Proxy : %s' % request.meta['proxy'])
Settings file (activate the middleware in DOWNLOADER_MIDDLEWARES):
import os
from dotenv import load_dotenv
load_dotenv()
....
....
# Proxy setup
PROXY_HOST = os.environ.get("PROXY_HOST")
PROXY_PORT = os.environ.get("PROXY_PORT")
PROXY_USER = os.environ.get("PROXY_USER")
PROXY_PASSWORD = os.environ.get("PROXY_PASSWORD")
.....
.....
.....
DOWNLOADER_MIDDLEWARES = {
    # 'project.middlewares.projectDownloaderMiddleware': 543,
    'project.proxy_middlewares.ProxyMiddleware': 350,
}
.env file:
PROXY_HOST=127.0.0.1
PROXY_PORT=6666
PROXY_USER=proxy_user
PROXY_PASSWORD=proxy_password
#2
Have a look at this middleware - scrapy-rotating-proxies
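If you go that route, the package is enabled from settings.py roughly as sketched below (a sketch only, so check the scrapy-rotating-proxies documentation for your version; the proxy URLs are placeholders):

```python
# settings.py — sketch of a scrapy-rotating-proxies setup (proxy URLs are placeholders)
ROTATING_PROXY_LIST = [
    'http://proxy1.example.com:8000',
    'http://proxy2.example.com:8031',
]

DOWNLOADER_MIDDLEWARES = {
    # Picks a live proxy for each request and rotates when one fails
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    # Detects bans so dead proxies are retired from the rotation
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
```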
Why would proxies fail in Scrapy, but make successful requests under the python-requests library?
In your code sample built with requests, you implemented multiple sessions (one session per proxy).
Under Scrapy's default settings, however, the application uses a single cookiejar for all proxies, so it sends the same cookie data through every proxy.
You need to use the cookiejar meta key in your requests.
If a web server receives requests from multiple IPs carrying the same session ID in the cookie headers, it looks suspicious: the server can identify the client as a bot and ban all of the IPs involved. That is probably exactly what happened here.
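A hedged sketch of that fix (the names PROXIES and build_request_meta are illustrative): give each proxy its own cookiejar id and pass both keys in the request meta, so cookies never cross IPs.

```python
# Illustrative: one cookiejar per proxy so session cookies never cross IPs.
PROXIES = [
    "http://127.0.0.1:6666",
    "http://127.0.0.1:6667",
]

def build_request_meta(request_index):
    """Meta dict for the Nth request: proxy chosen round-robin,
    cookiejar keyed by the proxy's index so cookies stay per-proxy."""
    jar_id = request_index % len(PROXIES)
    return {"proxy": PROXIES[jar_id], "cookiejar": jar_id}

# In a spider you would then do something like:
#   yield scrapy.Request(url, meta=build_request_meta(i), callback=self.parse)
```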