Crawl4AI and Firecrawl have become the default way to feed web data into LLM pipelines — they crawl, render, and hand back clean Markdown your model can actually use. Then you point them at a real target and discover what every scraper learns eventually: the extraction layer was never the hard part. The hard part is that your crawler runs from one datacenter IP, and the web can see it.
This guide shows working proxy configuration for both tools — self-hosted Crawl4AI and both Firecrawl modes — plus the session strategy that stops RAG ingestion jobs from dying at page 50. It extends our AI agent proxy guide and browser-use guide to the crawling frameworks.
Crawl4AI (open-source, self-hosted) takes proxies at the BrowserConfig level — one proxy per crawler instance:
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

# One proxy per crawler instance, set at the browser level
browser_cfg = BrowserConfig(
    headless=True,
    proxy_config={
        "server": "us.jibaoproxy.com:913",
        "username": "USERNAME",
        "password": "PASSWORD",
    },
)

async def main():
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/docs",
            config=CrawlerRunConfig(),
        )
        print(result.markdown[:500])  # first 500 characters of the extracted Markdown

asyncio.run(main())
For deep crawls, rotate identity per crawler instance, not per page — pages within one site visit should share one exit IP (a human doesn't change cities between page 3 and page 4):
from crawl4ai import BrowserConfig

def crawler_for(site_id: str) -> BrowserConfig:
    # Sticky session per site: cookies + IP move together
    return BrowserConfig(
        headless=True,
        proxy_config={
            "server": "us.jibaoproxy.com:913",
            "username": f"USERNAME-session-{site_id}",
            "password": "PASSWORD",
        },
    )

# site A crawled through exit A, site B through exit B, in parallel
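One way to get a stable `site_id` is to derive it from the hostname, so every page of a site maps to the same sticky session. The sketch below is illustrative only: `site_id_for`, `proxy_config_for`, and `group_by_site` are hypothetical helper names, and the `USERNAME-session-<id>` format is copied from the example above, so check your provider's session syntax before relying on it.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit

PROXY_SERVER = "us.jibaoproxy.com:913"  # gateway used in the examples above


def site_id_for(url: str) -> str:
    """Short, stable ID per hostname: same site, same session."""
    host = (urlsplit(url).hostname or "").lower()
    return hashlib.sha256(host.encode("utf-8")).hexdigest()[:8]


def proxy_config_for(url: str, username: str = "USERNAME",
                     password: str = "PASSWORD") -> dict:
    """Build a proxy_config dict whose session suffix depends only on the site."""
    return {
        "server": PROXY_SERVER,
        # Assumed session syntax, taken from the snippet above
        "username": f"{username}-session-{site_id_for(url)}",
        "password": password,
    }


def group_by_site(urls: list[str]) -> dict[str, list[str]]:
    """Group URLs by site ID so each group can go to its own crawler instance."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        groups[site_id_for(url)].append(url)
    return dict(groups)
```

Each group returned by `group_by_site` can then be handed to one crawler built from the matching config, which keeps page 3 and page 4 of the same site on the same exit.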
Firecrawl cloud API: proxying is a request parameter, and Firecrawl routes through its own pools. You control the quality tier, not the IPs:
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key="fc-YOUR-KEY")
result = app.scrape_url(
    "https://example.com/pricing",
    params={"proxy": "stealth"},  # basic | stealth | auto
)
The catch: stealth-tier requests bill at a multiple of basic credits, and you can't pin countries precisely or hold sticky sessions across calls. Fine for occasional pages; expensive and imprecise at ingestion volume.
Self-hosted Firecrawl: you supply your own proxy via environment variables, with full control:
# .env for self-hosted Firecrawl
PROXY_SERVER=http://us.jibaoproxy.com:1000
PROXY_USERNAME=USERNAME
PROXY_PASSWORD=PASSWORD
Self-hosted + your own residential gateway is the cost-rational setup once you're ingesting at scale: you pay per-GB for bandwidth instead of per-page credits with a stealth multiplier.
Throttle per site: semaphore_count / delay options exist for this. 2–4 concurrent pages per site is plenty; spread parallelism across sites instead.

Skip unchanged pages: use ETag/Last-Modified where the framework allows.

| Setup | You pay for | Best when |
|---|---|---|
| Firecrawl cloud, stealth proxy | Per-page credits × stealth multiplier | Low volume, zero ops |
| Self-hosted Firecrawl + residential GB | Bandwidth only (~$2/GB) | Steady ingestion volume |
| Crawl4AI + residential GB | Bandwidth only, full control | Custom pipelines, deep crawls |
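Whichever setup you pick, the per-site concurrency advice above (a handful of pages per site, parallelism spread across sites) can be sketched framework-independently with one semaphore per hostname. This is a minimal illustration, not either tool's API: `crawl_all`, `fetch`, and `MAX_PER_SITE` are hypothetical names, and `fetch` stands in for whatever coroutine actually retrieves a page.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit

MAX_PER_SITE = 3  # assumption: within the 2-4 range suggested above


async def crawl_all(urls, fetch, max_per_site=MAX_PER_SITE):
    """Fetch all URLs concurrently, capping in-flight requests per hostname."""
    # One semaphore per hostname, created lazily on first use
    semaphores = defaultdict(lambda: asyncio.Semaphore(max_per_site))

    async def fetch_one(url):
        host = urlsplit(url).hostname or ""
        async with semaphores[host]:  # waits if this site is already at its cap
            return await fetch(url)

    # Results come back in the same order as the input URLs
    return await asyncio.gather(*(fetch_one(u) for u in urls))
```

Because the cap is per hostname, adding more sites raises total throughput without raising the load any single site sees.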
A typical text-heavy page costs 100–300 KB through a proxy — roughly 3,000–10,000 pages per GB. Block-and-retry loops are what blow the budget, which is another reason to fix detection before scaling volume.
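The arithmetic behind those figures is easy to check. The sketch below assumes decimal units (1 GB = 1,000,000 KB), the approximate $2/GB rate from the table, and a simple retry model in which `retry_rate` is the average number of extra fetches per page; all function and parameter names are illustrative.

```python
KB_PER_GB = 1_000_000  # decimal units, matching the rough figures above


def pages_per_gb(page_kb: float) -> float:
    """How many pages of a given size fit in 1 GB of proxy traffic."""
    return KB_PER_GB / page_kb


def cost_per_1000_pages(page_kb: float, price_per_gb: float = 2.0,
                        retry_rate: float = 0.0) -> float:
    """Bandwidth cost of 1,000 pages; retry_rate=1.0 means every page is fetched twice."""
    effective_kb = page_kb * (1 + retry_rate)
    return 1000 * effective_kb / KB_PER_GB * price_per_gb


for size_kb in (100, 300):
    print(size_kb, "KB:",
          round(pages_per_gb(size_kb)), "pages/GB,",
          round(cost_per_1000_pages(size_kb), 2), "USD per 1,000 pages")
```

Under this model, a crawl where every page is blocked once and retried (`retry_rate=1.0`) costs twice as much for the same output, which is why block-and-retry loops dominate the bill.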
Crawl4AI: proxy_config in BrowserConfig; sticky identity per site, rotate between sites.

Firecrawl: cloud uses the proxy: "stealth" param, costly at volume; self-hosted: your own gateway via env vars.