Scrapy Proxy Middleware: Complete Configuration Guide (2026)

Published June 4, 2026 · 11 min read

Scrapy is still the workhorse for production crawls in 2026 — and still the framework where proxy setup confuses people most, because there are four different places to plug a proxy in and three of them are wrong for most projects. This guide gives you the right one: a small custom middleware with per-request routing, sticky sessions, ban detection, and sane retry behavior.

If you're on plain requests/httpx/aiohttp instead, see How to Rotate Proxies in Python. This article is Scrapy-specific.

The 30-Second Version

For a rotating residential gateway, the minimum viable setup is one line per request, with no custom middleware needed:

def start_requests(self):
    for url in self.urls:
        yield scrapy.Request(
            url,
            meta={"proxy": "http://USERNAME:PASSWORD@us.jibaoproxy.com:913"},
        )

Scrapy's built-in HttpProxyMiddleware reads request.meta["proxy"] and handles authentication from the URL. The gateway rotates the exit IP for you. If that's all you need, stop here. The rest of this guide is for when you need control: sticky sessions, country routing, ban-aware rotation, and concurrency tuning.

A Production Proxy Middleware

Drop this in middlewares.py. It assigns sticky sessions per domain, rotates on bans, and tags every request so you can debug which session fetched what:

import random
import string

GATEWAY = "us.jibaoproxy.com:913"
USERNAME = "USERNAME"          # move to settings.py / env in real projects
PASSWORD = "PASSWORD"

def _new_session(n=8):
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=n))

class JibaoProxyMiddleware:
    """Sticky session per domain; rotate session on ban."""

    def __init__(self):
        self.sessions = {}          # domain -> session id

    def _proxy_url(self, session_id):
        user = f"{USERNAME}-session-{session_id}"
        return f"http://{user}:{PASSWORD}@{GATEWAY}"

    def process_request(self, request, spider):
        domain = request.url.split("/")[2]
        session = self.sessions.setdefault(domain, _new_session())
        request.meta["proxy"] = self._proxy_url(session)
        request.meta["proxy_session"] = session

    def rotate(self, domain):
        """Call when a session is burned."""
        self.sessions[domain] = _new_session()
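
The comment next to USERNAME suggests moving credentials into settings.py. One way to do that is Scrapy's from_crawler hook, which Scrapy calls when it builds the middleware. A minimal sketch, assuming three setting names (PROXY_GATEWAY, PROXY_USERNAME, PROXY_PASSWORD) that are invented for this example and are not Scrapy built-ins:

```python
import random
import string


def _new_session(n=8):
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=n))


class SettingsProxyMiddleware:
    """Sticky session per domain, with credentials read from settings."""

    def __init__(self, gateway, username, password):
        self.gateway = gateway
        self.username = username
        self.password = password
        self.sessions = {}  # domain -> session id

    @classmethod
    def from_crawler(cls, crawler):
        # Hypothetical setting names chosen for this sketch.
        settings = crawler.settings
        return cls(
            gateway=settings.get("PROXY_GATEWAY"),
            username=settings.get("PROXY_USERNAME"),
            password=settings.get("PROXY_PASSWORD"),
        )

    def _proxy_url(self, session_id):
        user = f"{self.username}-session-{session_id}"
        return f"http://{user}:{self.password}@{self.gateway}"

    def process_request(self, request, spider):
        domain = request.url.split("/")[2]
        session = self.sessions.setdefault(domain, _new_session())
        request.meta["proxy"] = self._proxy_url(session)
        request.meta["proxy_session"] = session

    def rotate(self, domain):
        """Call when a session is burned."""
        self.sessions[domain] = _new_session()
```

With this variant, settings.py can read the three values from environment variables, so credentials stay out of version control.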

And a companion downloader middleware that detects bans and retries on a fresh session:

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

BAN_CODES = {403, 429}
BAN_MARKERS = (b"captcha", b"access denied", b"unusual traffic")

class BanAwareRetryMiddleware(RetryMiddleware):

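    def process_response(self, request, response, spider):
        # Ban = known ban status, or a ban marker near the top of the body
        # (catches block pages served with 200).
        banned = (
            response.status in BAN_CODES
            or any(m in response.body[:2048].lower() for m in BAN_MARKERS)
        )
        if banned:
            domain = request.url.split("/")[2]
            # Walk every active downloader middleware and rotate the ones
            # that expose rotate() (here: JibaoProxyMiddleware).
            all_mw = spider.crawler.engine.downloader.middleware.middlewares
            for mw in all_mw:
                if hasattr(mw, "rotate"):
                    mw.rotate(domain)        # burn the session
            if response.status in BAN_CODES:
                reason = response_status_message(response.status)
            else:
                reason = "ban marker in body"   # e.g. a 200 captcha page
            return self._retry(request, reason, spider) or response
        return super().process_response(request, response, spider)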
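
One thing to watch with per-domain concurrency above 1: several in-flight requests can share the same burned session, and each banned response then calls rotate(domain), discarding sessions that were never used. A possible guard, sketched here with a hypothetical SessionTable helper (not part of Scrapy), is to rotate only when the session reported as burned is still the active one:

```python
import random
import string


def _new_session(n=8):
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=n))


class SessionTable:
    """Domain -> session id map whose rotate() ignores stale ban reports."""

    def __init__(self):
        self.sessions = {}

    def current(self, domain):
        # Create a session on first use, then keep returning it.
        return self.sessions.setdefault(domain, _new_session())

    def rotate(self, domain, burned):
        """Issue a new session only if `burned` is still the active one.

        Returns True when a new session was issued, False when another
        response already rotated past the burned session.
        """
        if self.sessions.get(domain) != burned:
            return False
        self.sessions[domain] = _new_session()
        return True
```

The ban-aware middleware would pass request.meta["proxy_session"] as the burned value, since the proxy middleware already tags every request with it.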

Wire both up in settings.py:

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.JibaoProxyMiddleware": 350,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,   # replace stock retry
    "myproject.middlewares.BanAwareRetryMiddleware": 550,
}
RETRY_TIMES = 2

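Priority matters: process_request hooks run in ascending priority order, and Scrapy's HttpProxyMiddleware (750) is what turns the credentials in meta["proxy"] into a Proxy-Authorization header. The proxy middleware must therefore run before it, so anything under 750 works; 350 keeps it early and predictable.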

Sticky vs Rotating: Which Mode for Which Spider

Crawl type	Mode	Implementation
Stateless page harvesting	Rotating	Bare username, gateway rotates per request
Login + crawl behind auth	Sticky per account	`-session-{account_id}`, never rotate mid-login
Pagination-heavy listings	Sticky per domain, rotate on ban	The middleware above
Geo-specific pricing	Rotating + country pin	`USERNAME-country-de` style parameters

Deeper treatment of this trade-off: Sticky vs Rotating Proxy Sessions.
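
The modes in the table differ only in the proxy username. A small sketch of a builder for them, assuming the `-session-` and `-country-` suffixes shown above; the order used when both are combined is a guess, so check your provider's documentation before relying on it:

```python
def proxy_username(base, session=None, country=None):
    """Build a gateway username for rotating, sticky, or geo-pinned mode.

    proxy_username is a hypothetical helper for this article. The suffix
    order (country first, then session) is an assumption.
    """
    parts = [base]
    if country:
        parts += ["country", country.lower()]
    if session:
        parts += ["session", str(session)]
    return "-".join(parts)


rotating = proxy_username("USERNAME")                      # bare username
sticky = proxy_username("USERNAME", session="acct42")      # one IP per account
geo = proxy_username("USERNAME", country="DE")             # rotating, pinned to DE
```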

Concurrency Settings That Don't Get You Banned

Scrapy defaults are tuned for polite single-IP crawling. Behind a rotating pool you can push much harder — but per-domain limits still matter because the target sees aggregate behavior:

# settings.py - sane starting point behind a residential pool
CONCURRENT_REQUESTS = 64
CONCURRENT_REQUESTS_PER_DOMAIN = 8     # what the target experiences
DOWNLOAD_DELAY = 0.25                  # base delay per download slot, in seconds
RANDOMIZE_DOWNLOAD_DELAY = True        # 0.5x-1.5x the delay
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 6.0
DOWNLOAD_TIMEOUT = 30
ROBOTSTXT_OBEY = True

Bump CONCURRENT_REQUESTS_PER_DOMAIN only after watching your 403 rate at the current level for a few thousand requests. Going 8 → 32 because "the proxies rotate anyway" is how people burn through GB on retries.
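
To put a number on "403 rate", you can read the counters that Scrapy's downloader stats already collect. A minimal sketch, assuming the stock stats keys downloader/response_count and downloader/response_status_count/<code>; ban_rate is a hypothetical helper name:

```python
def ban_rate(stats, ban_codes=(403, 429)):
    """Fraction of downloader responses whose status is in ban_codes."""
    total = stats.get("downloader/response_count", 0)
    if not total:
        return 0.0
    banned = sum(
        stats.get(f"downloader/response_status_count/{code}", 0)
        for code in ban_codes
    )
    return banned / total


# Example with made-up numbers: 60 ban responses out of 2000.
sample = {
    "downloader/response_count": 2000,
    "downloader/response_status_count/200": 1940,
    "downloader/response_status_count/403": 50,
    "downloader/response_status_count/429": 10,
}
rate = ban_rate(sample)
```

In a spider, the same function can be fed self.crawler.stats.get_stats() from the closed() callback, so each run logs its own rate before you touch the concurrency settings.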

Bandwidth Discipline (Residential GB Are Money)

Don't download media through residential IPs. IMAGES_STORE only controls where files are saved; what costs money is the route the media requests take. Send ImagesPipeline/FilesPipeline requests through a datacenter proxy, or a direct connection if the CDN doesn't care. Media is most of your GB and rarely protected.
Cache during development: set HTTPCACHE_ENABLED = True while you iterate on parsers, so Scrapy replays responses from disk instead of re-fetching them through the pool. This one setting can cut a project's bandwidth bill roughly in half.
Compress: keep COMPRESSION_ENABLED = True (default) — gzip'd HTML is 5–10x smaller on the wire.
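
A possible development-only cache configuration for the second point. The setting names are Scrapy's own; the values are assumptions to adjust per project:

```python
# settings.py (development only): a sketch, not a recommended production config
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"                # stored under the project's .scrapy directory
HTTPCACHE_EXPIRATION_SECS = 0              # 0 means cached responses never expire
HTTPCACHE_IGNORE_HTTP_CODES = [403, 429]   # don't cache ban responses
```

Ignoring 403 and 429 keeps ban pages out of the cache, so a burned session during one run doesn't get replayed as a "valid" response in the next.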

Common Failures and What They Actually Mean

`407 Proxy Authentication Required`

Credentials didn't reach the proxy, or the proxy rejected them. Put them in the URL (http://user:pass@host:port) in meta["proxy"]; Scrapy parses them and sets Proxy-Authorization for you. Setting the header manually in addition to URL credentials causes double-auth weirdness; pick one.
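
One plausible cause worth ruling out is a password containing characters such as @, : or /, which break URL parsing unless they are percent-encoded. A small sketch using only the standard library; build_proxy_url is a hypothetical helper, and it assumes the proxy middleware percent-decodes credentials before building the header, which is worth confirming for your Scrapy version:

```python
from urllib.parse import quote


def build_proxy_url(username, password, gateway, scheme="http"):
    """Return a proxy URL with percent-encoded credentials."""
    user = quote(username, safe="")   # safe="" also encodes "/" and ":"
    pwd = quote(password, safe="")
    return f"{scheme}://{user}:{pwd}@{gateway}"


url = build_proxy_url("USERNAME-session-ab12", "p@ss:w/rd", "us.jibaoproxy.com:913")
```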

`TunnelError: Could not open CONNECT tunnel`

Almost always a typo'd host/port, or an HTTPS target routed through an endpoint that doesn't allow CONNECT on that port. Verify with curl -x outside Scrapy first.
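
If curl isn't handy, the same check can be done from Python's standard library. A rough sketch; check_proxy and its defaults are invented for this example, and it only confirms that one request gets through the proxy:

```python
import http.client
import urllib.request


def check_proxy(proxy_url, test_url="https://example.com/", timeout=10):
    """Send one request through proxy_url and return (ok, detail)."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as resp:
            return True, f"HTTP {resp.status}"
    except (OSError, http.client.HTTPException) as exc:
        # Covers URLError/HTTPError (including 407), refused connections,
        # failed CONNECT tunnels, and timeouts.
        return False, f"{type(exc).__name__}: {exc}"
```

Calling check_proxy("http://USERNAME:PASSWORD@us.jibaoproxy.com:913") with real credentials should return (True, "HTTP 200") when host, port, and auth are all correct; a failure detail mentioning the tunnel or 407 points at the port or the credentials respectively.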

Spider works for 10 minutes, then everything is 403

Your sticky session outlived its welcome, or your per-domain rate is too hot. The ban-aware middleware above handles the first case; lower CONCURRENT_REQUESTS_PER_DOMAIN for the second. If it's a JA4-checking target, Scrapy's TLS stack itself may be the tell — see JA3/JA4 explained for why no proxy fixes that.

Summary

For simple rotation, meta["proxy"] with the gateway URL is the whole setup.
For real projects: sticky session per domain, rotate on ban, retry on a fresh session (up to RETRY_TIMES).
Per-domain concurrency is what targets see — tune it by 403 rate, not optimism.
Enable HTTP cache in development; keep media off residential bandwidth.
403s on a JA4-checking target are a TLS problem, not a proxy problem.
