Scrapy is still the workhorse for production crawls in 2026 — and still the framework where proxy setup confuses people most, because there are four different places to plug a proxy in and three of them are wrong for most projects. This guide gives you the right one: a small custom middleware with per-request routing, sticky sessions, ban detection, and sane retry behavior.
If you're on plain requests/httpx/aiohttp instead, see How to Rotate Proxies in Python. This article is Scrapy-specific.
For a rotating residential gateway, the minimum viable setup is one line per request — no middleware needed:
def start_requests(self):
    for url in self.urls:
        yield scrapy.Request(
            url,
            meta={"proxy": "http://USERNAME:PASSWORD@us.jibaoproxy.com:913"},
        )
Scrapy's built-in HttpProxyMiddleware reads request.meta["proxy"] and handles authentication from the URL. The gateway rotates the exit IP for you. If that's all you need, stop here. The rest of this guide is for when you need control: sticky sessions, country routing, ban-aware rotation, and concurrency tuning.
Drop this in middlewares.py. It assigns sticky sessions per domain, rotates on bans, and tags every request so you can debug which session fetched what:
import random
import string
from urllib.parse import urlsplit

GATEWAY = "us.jibaoproxy.com:913"
USERNAME = "USERNAME"  # move to settings.py / env in real projects
PASSWORD = "PASSWORD"


def _new_session(n=8):
    """Return a random lowercase alphanumeric session id of length n."""
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=n))


class JibaoProxyMiddleware:
    """Sticky session per domain; rotate session on ban."""

    def __init__(self):
        self.sessions = {}  # domain -> session id

    def _proxy_url(self, session_id):
        user = f"{USERNAME}-session-{session_id}"
        return f"http://{user}:{PASSWORD}@{GATEWAY}"

    def process_request(self, request, spider):
        # netloc is the host[:port] part of the URL
        domain = urlsplit(request.url).netloc
        if domain not in self.sessions:
            self.sessions[domain] = _new_session()
        session = self.sessions[domain]
        request.meta["proxy"] = self._proxy_url(session)
        request.meta["proxy_session"] = session  # tag for debugging

    def rotate(self, domain):
        """Call when a session is burned."""
        self.sessions[domain] = _new_session()
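The comment on USERNAME suggests moving credentials out of the module. One possible way to do that is Scrapy's `from_crawler` classmethod hook, which receives the crawler and can read `crawler.settings`. The sketch below is an illustration, not the article's implementation: the class name `ProxyCredentials` and the setting names `PROXY_GATEWAY`, `PROXY_USERNAME`, and `PROXY_PASSWORD` are hypothetical, and the environment-variable fallback is an assumption.

```python
import os


class ProxyCredentials:
    """Gateway credentials read from Scrapy settings or the environment."""

    def __init__(self, gateway, username, password):
        self.gateway = gateway
        self.username = username
        self.password = password

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_GATEWAY / PROXY_USERNAME / PROXY_PASSWORD are hypothetical
        # setting names chosen for this sketch, not Scrapy built-ins.
        s = crawler.settings
        return cls(
            gateway=s.get("PROXY_GATEWAY", "us.jibaoproxy.com:913"),
            username=s.get("PROXY_USERNAME") or os.environ.get("PROXY_USERNAME", ""),
            password=s.get("PROXY_PASSWORD") or os.environ.get("PROXY_PASSWORD", ""),
        )

    def proxy_url(self, session_id):
        # Same username format as the middleware above
        user = f"{self.username}-session-{session_id}"
        return f"http://{user}:{self.password}@{self.gateway}"
```

A middleware could hold one of these objects instead of reading module-level constants, which keeps secrets out of version control.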
And a companion downloader middleware that detects bans and retries on a fresh session:
from urllib.parse import urlsplit

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

BAN_CODES = {403, 429}
BAN_MARKERS = (b"captcha", b"access denied", b"unusual traffic")


class BanAwareRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        # Respect the same opt-out flag the stock retry middleware honors
        if request.meta.get("dont_retry", False):
            return response
        # Scan only the first 2 KB of the body to keep the check cheap
        banned = (
            response.status in BAN_CODES
            or any(m in response.body[:2048].lower() for m in BAN_MARKERS)
        )
        if banned:
            domain = urlsplit(request.url).netloc
            # Find the proxy middleware among the active downloader middlewares
            for mw in spider.crawler.engine.downloader.middleware.middlewares:
                if hasattr(mw, "rotate"):
                    mw.rotate(domain)  # burn the session
            reason = response_status_message(response.status)
            # _retry returns None once retries are exhausted; pass the response on
            return self._retry(request, reason, spider) or response
        return super().process_response(request, response, spider)
Wire both up in settings.py:
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.JibaoProxyMiddleware": 350,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,  # replace stock retry
    "myproject.middlewares.BanAwareRetryMiddleware": 550,
}
RETRY_TIMES = 2
Priority matters: the proxy middleware must run before Scrapy's HttpProxyMiddleware (750), so anything under 750 works; 350 keeps it early and predictable.
| Crawl type | Mode | Implementation |
|---|---|---|
| Stateless page harvesting | Rotating | Bare username, gateway rotates per request |
| Login + crawl behind auth | Sticky per account | -session-{account_id}, never rotate mid-login |
| Pagination-heavy listings | Sticky per domain, rotate on ban | The middleware above |
| Geo-specific pricing | Rotating + country pin | USERNAME-country-de style parameters |
Deeper treatment of this trade-off: Sticky vs Rotating Proxy Sessions.
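The username patterns in the table can be collected into one small helper. This is a sketch under assumptions: the function name `proxy_username` is hypothetical, and whether the gateway accepts country and session parameters together, and in which order, depends on the provider and is not confirmed here.

```python
def proxy_username(base, session=None, country=None):
    """Build a gateway username from optional routing parameters."""
    parts = [base]
    if country:
        # Country pin, e.g. USERNAME-country-de
        parts.append(f"country-{country}")
    if session:
        # Sticky session, e.g. USERNAME-session-{account_id}
        parts.append(f"session-{session}")
    return "-".join(parts)


def proxy_url(base, password, gateway, **routing):
    """Assemble the full proxy URL for request.meta['proxy']."""
    return f"http://{proxy_username(base, **routing)}:{password}@{gateway}"
```

Calling `proxy_username("USERNAME")` with no routing arguments returns the bare username, which corresponds to the rotating mode in the first table row.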
Scrapy defaults are tuned for polite single-IP crawling. Behind a rotating pool you can push much harder — but per-domain limits still matter because the target sees aggregate behavior:
# settings.py - sane starting point behind a residential pool
CONCURRENT_REQUESTS = 64
CONCURRENT_REQUESTS_PER_DOMAIN = 8 # what the target experiences
DOWNLOAD_DELAY = 0.25 # jitter applied per slot
RANDOMIZE_DOWNLOAD_DELAY = True # 0.5x-1.5x the delay
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 6.0
DOWNLOAD_TIMEOUT = 30
ROBOTSTXT_OBEY = True
Bump CONCURRENT_REQUESTS_PER_DOMAIN only after watching your 403 rate at the current level for a few thousand requests. Going 8 → 32 because "the proxies rotate anyway" is how people burn through GB on retries.
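One way to watch the 403 rate is to read the counters Scrapy's downloader stats already collect. The sketch below assumes the standard `downloader/response_count` and `downloader/response_status_count/<code>` stat keys; the function name `ban_rate` and the choice to count 429 alongside 403 are illustrative.

```python
def ban_rate(stats, codes=(403, 429)):
    """Return the share of responses whose status is in `codes`."""
    total = stats.get("downloader/response_count", 0)
    if not total:
        return 0.0  # no responses yet, avoid dividing by zero
    banned = sum(
        stats.get(f"downloader/response_status_count/{code}", 0)
        for code in codes
    )
    return banned / total
```

Inside a spider or extension this could be called as `ban_rate(crawler.stats.get_stats())`, for example from a `closed` handler, and compared between runs before raising concurrency.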
Configure IMAGES_STORE pipelines to fetch through a datacenter proxy or direct connection if the CDN doesn't care: media is most of your GB and rarely protected.
HTTPCACHE_ENABLED = True while you iterate on parsers replays responses from disk instead of re-fetching through the pool. This one setting typically halves a project's bandwidth bill.
COMPRESSION_ENABLED = True (default): gzip'd HTML is 5–10x smaller on the wire.
407 Proxy Authentication Required
Credentials didn't reach the proxy. Put them in the URL (http://user:pass@host:port) in meta["proxy"]; Scrapy parses them and sets Proxy-Authorization for you. Setting the header manually and using URL credentials causes double-auth weirdness; pick one.
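To make the 407 explanation concrete, the following sketch shows roughly what deriving a Basic Proxy-Authorization value from URL credentials involves. It is a standalone illustration using only the standard library, not Scrapy's actual code; the function name `proxy_auth_header` is hypothetical and Scrapy's encoding details may differ.

```python
import base64
from urllib.parse import unquote, urlsplit


def proxy_auth_header(proxy_url):
    """Return the Basic Proxy-Authorization value for a proxy URL, or None."""
    parts = urlsplit(proxy_url)
    if parts.username is None:
        return None  # no credentials in the URL, so no header to send
    user = unquote(parts.username)
    password = unquote(parts.password or "")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return b"Basic " + token
```

If this returns None for the URL you pass in meta["proxy"], the credentials never made it into the URL, which is a common cause of a 407.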
TunnelError: Could not open CONNECT tunnel
Almost always a typo'd host/port, or an HTTPS target routed through an endpoint that doesn't allow CONNECT on that port. Verify with curl -x outside Scrapy first.
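If curl isn't available, the same check outside Scrapy can be sketched with the standard library. The function names and the default test URL below are assumptions made for illustration; any HTTPS page you are allowed to fetch works as the target.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return a urllib opener that sends http and https traffic via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def check_proxy(proxy_url, test_url="https://example.com", timeout=15):
    """Fetch test_url through the proxy and return the HTTP status code."""
    opener = build_proxy_opener(proxy_url)
    # Raises urllib.error.URLError (or a subclass) if the proxy or tunnel fails
    with opener.open(test_url, timeout=timeout) as resp:
        return resp.status
```

An exception from `check_proxy` points at the proxy endpoint or credentials rather than at your Scrapy configuration.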
Your sticky session outlived its welcome, or your per-domain rate is too hot. The ban-aware middleware above handles the first case; lower CONCURRENT_REQUESTS_PER_DOMAIN for the second. If it's a JA4-checking target, Scrapy's TLS stack itself may be the tell — see JA3/JA4 explained for why no proxy fixes that.
For a rotating gateway, meta["proxy"] with the gateway URL is the whole setup.