Replacing ScrapingBee and Bright Data: Self-Hosted Scraping in 2026
Scraping SaaS pricing scales linearly with traffic, so past a crossover volume self-hosted becomes dramatically cheaper. Here's the stack I deploy and the gotchas I've learned.
By Andrii Votiakov
Scraping-as-a-service vendors (Bright Data, ScrapingBee, ScraperAPI, Oxylabs) charge per-request. At any meaningful volume, that pricing model gets brutal — $1-3 per 1,000 requests on basic plans, more for residential proxies and JavaScript-rendered pages. A self-hosted stack flips this from a per-request cost to a fixed infrastructure bill. If you're evaluating multiple SaaS tools to replace at once, the build vs buy 2026 framework gives a scoring system to prioritise which ones to tackle first. For communication infrastructure specifically, replacing Twilio follows a similar pattern to scraping — high volume, linear pricing, clear crossover point.
Quick answer
Above ~$2,000/month in scraping SaaS spend, self-hosting saves 60-90% with a Playwright + headless browser pool + proxy rotation stack. Total monthly cost typically drops to $300-1,500 for the same throughput. Engineering effort: 2-4 weeks initial build, ~2 days/month ongoing maintenance.
The cost crossover
Rough comparison at common volumes:
| Volume / month | ScrapingBee / Bright Data | Self-hosted (compute + proxy + maintenance) |
|---|---|---|
| 100k requests | $40-150 | ~$300 (overhead exceeds savings) |
| 1M requests | $400-1,500 | $300-700 |
| 10M requests | $4,000-15,000 | $700-2,000 |
| 100M requests | $40,000-100,000+ | $2,000-7,000 |
Crossover usually around 1-3M requests/month for cost. For sites with heavy bot detection (LinkedIn, Amazon, Booking), the SaaS premium for residential proxies makes the crossover earlier — sometimes 500k requests.
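The crossover arithmetic is easy to sketch: SaaS cost is purely per-request, self-hosted is a fixed baseline plus a small marginal cost. The rates below are illustrative placeholders chosen to roughly match the table, not real quotes:

```python
def monthly_cost_saas(requests: int, rate_per_1k: float = 0.5) -> float:
    """Per-request SaaS pricing: a flat rate per 1,000 requests."""
    return requests / 1000 * rate_per_1k

def monthly_cost_self_hosted(requests: int,
                             fixed: float = 450.0,
                             per_1k: float = 0.05) -> float:
    """Fixed infra + maintenance baseline, plus marginal compute/proxy cost."""
    return fixed + requests / 1000 * per_1k

# With these illustrative rates the curves cross near 1M requests/month.
for volume in (100_000, 1_000_000, 10_000_000):
    saas = monthly_cost_saas(volume)
    self_hosted = monthly_cost_self_hosted(volume)
    print(f"{volume:>12,}: SaaS ${saas:,.0f} vs self-hosted ${self_hosted:,.0f}")
```

Plug in your actual vendor rate and infra quote; the shape of the curves is the point, not these numbers.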
The stack I deploy
Browser pool: Playwright
Playwright beats Puppeteer for this work — better cross-browser support, more stable, better network interception. Run in headless mode in containers. Each container handles ~5-15 concurrent pages.
Key tuning:
```python
from playwright.async_api import async_playwright

async with async_playwright() as p:
    browser = await p.chromium.launch(
        headless=True,
        args=[
            "--disable-dev-shm-usage",  # avoid /dev/shm exhaustion in containers
            "--disable-blink-features=AutomationControlled",
            "--no-sandbox",
        ],
    )
    context = await browser.new_context(
        user_agent="Mozilla/5.0 ...",  # rotate realistic UAs
        viewport={"width": 1920, "height": 1080},
        locale="en-GB",
    )
```
Block what you don't need (images, fonts, ads) to cut bandwidth and speed up page loads:

```python
await context.route("**/*.{png,jpg,jpeg,svg,gif,woff2,woff}", lambda r: r.abort())
```
Proxy rotation
Two layers:
- Datacenter proxies for permissive sites (most). $0.50-1/GB or fixed monthly. Providers: Webshare, IPRoyal, Smartproxy datacenter tier.
- Residential proxies for hard sites (LinkedIn, Amazon, Booking). $3-15/GB. Providers: Bright Data, Oxylabs, Smartproxy residential.
Rotation strategy:
- New IP per request for sensitive targets
- Sticky session per scraping flow for sites that track sessions
- Geographic targeting where required (some sites geofence content)
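The first two strategies can be sketched in pure Python. This is an illustrative helper, not a real library; the proxy endpoints are placeholders:

```python
import itertools
from typing import Optional

class ProxyRotator:
    """Round-robin over a proxy pool, with optional sticky sessions per flow."""

    def __init__(self, proxies: list[str]):
        self._cycle = itertools.cycle(proxies)
        self._sticky: dict[str, str] = {}  # flow_id -> pinned proxy

    def get(self, flow_id: Optional[str] = None) -> str:
        # Sticky: sites that track sessions keep the same exit IP per flow.
        if flow_id is not None:
            if flow_id not in self._sticky:
                self._sticky[flow_id] = next(self._cycle)
            return self._sticky[flow_id]
        # Default: fresh IP per request for sensitive targets.
        return next(self._cycle)

rotator = ProxyRotator([
    "http://user:pass@dc1.example:8000",  # placeholder endpoints
    "http://user:pass@dc2.example:8000",
])
```

The chosen endpoint can then be passed to Playwright per context via `browser.new_context(proxy={"server": rotator.get()})`.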
Browser fingerprint randomisation
Real browsers vary in canvas fingerprint, WebGL renderer, font list, time zone. Bot-detection vendors (Cloudflare, DataDome, PerimeterX) check all of these. Use a stealth plugin:
- `playwright-stealth` for Python
- `puppeteer-extra-plugin-stealth` for Node
Plus rotate user agents, viewport sizes, and time zones to match the proxy IP geography.
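One way to keep the fingerprint coherent with the exit IP is to derive locale, time zone, and viewport from the proxy's country. A sketch with my own illustrative placeholder data (`GEO_PROFILES`, `VIEWPORTS`); the returned dict matches Playwright's `new_context` keyword arguments:

```python
import random

# Illustrative pairing of proxy exit country to a coherent fingerprint.
GEO_PROFILES = {
    "GB": {"locale": "en-GB", "timezone_id": "Europe/London"},
    "US": {"locale": "en-US", "timezone_id": "America/New_York"},
    "DE": {"locale": "de-DE", "timezone_id": "Europe/Berlin"},
}

# A few common desktop resolutions; randomise per context.
VIEWPORTS = [(1920, 1080), (1536, 864), (1440, 900)]

def context_options(proxy_country: str) -> dict:
    """Build new_context kwargs whose locale/time zone match the exit IP."""
    profile = GEO_PROFILES[proxy_country]
    width, height = random.choice(VIEWPORTS)
    return {
        "locale": profile["locale"],
        "timezone_id": profile["timezone_id"],
        "viewport": {"width": width, "height": height},
    }
```

A GB residential IP claiming a `America/New_York` time zone is exactly the kind of mismatch detection vendors score against.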
Queue and orchestration
Don't run scrapers as one big monolith. Queue-driven:
- Job queue (Redis/SQS/Cloud Tasks) with retry policies
- Worker pool of N browser containers
- Dead-letter queue for jobs failing > 3 times
- Rate limiter per target domain (one of the most-skipped pieces; protects you AND the target site)
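The per-domain rate limiter is small enough to sketch as a token bucket. Class name and defaults are illustrative; `now` is injectable so the logic is testable without sleeping:

```python
import time
from collections import defaultdict
from typing import Optional

class DomainRateLimiter:
    """Token bucket per target domain: bursts up to `capacity` requests,
    then refills at `rate` tokens per second."""

    def __init__(self, rate: float = 1.0, capacity: int = 5):
        self.rate = rate
        self.capacity = capacity
        self._tokens: dict[str, float] = defaultdict(lambda: float(capacity))
        self._last: dict[str, float] = {}

    def allow(self, domain: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        last = self._last.get(domain, now)
        # Refill proportionally to elapsed time, capped at capacity.
        self._tokens[domain] = min(
            self.capacity, self._tokens[domain] + (now - last) * self.rate
        )
        self._last[domain] = now
        if self._tokens[domain] >= 1.0:
            self._tokens[domain] -= 1.0
            return True
        return False  # caller re-queues the job instead of hammering the site
```

Workers call `allow(domain)` before fetching; a `False` pushes the job back onto the queue with a delay.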
LLM extraction layer
The biggest win since 2023: extract structured data from messy HTML using an LLM instead of brittle CSS selectors. Cheap models (Claude Haiku, GPT-4o-mini, Gemini Flash) handle most cases at $0.0001-0.001 per page.
```python
extraction_prompt = """
Extract the following fields from this HTML as JSON:
- product_name: string
- price: number
- currency: string (3-letter code)
- availability: "in_stock" | "out_of_stock" | "unknown"
If a field is not present, return null.
"""
```
Combine with strict JSON schema validation. The result: scrapers that survive site redesigns instead of breaking weekly.
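A strict validator for the schema in the prompt above might look like this. It's hand-rolled rather than using a JSON Schema library, to keep the sketch dependency-free:

```python
import json

ALLOWED_AVAILABILITY = {"in_stock", "out_of_stock", "unknown"}

FIELD_TYPES = {
    "product_name": (str,),
    "price": (int, float),
    "currency": (str,),
    "availability": (str,),
}

def validate_extraction(raw: str) -> tuple[dict, list[str]]:
    """Parse the model's JSON reply and check it against the schema
    from the prompt. Returns (data, errors); reject any row with
    errors instead of letting it reach storage."""
    data = json.loads(raw)
    errors = []
    for field, types in FIELD_TYPES.items():
        value = data.get(field)
        # null is allowed for any field, per the prompt.
        if value is not None and not isinstance(value, types):
            errors.append(f"{field}: wrong type {type(value).__name__}")
    currency = data.get("currency")
    if isinstance(currency, str) and len(currency) != 3:
        errors.append("currency: must be a 3-letter code")
    availability = data.get("availability")
    if isinstance(availability, str) and availability not in ALLOWED_AVAILABILITY:
        errors.append("availability: not an allowed value")
    return data, errors
```

Rows that fail validation go back through the LLM once, then to the dead-letter queue.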
Gotchas I've learned
1. Memory leaks
Browser contexts leak memory. Restart workers every N jobs (typically 50-100). Don't try to fix the leak at the source — Chromium's leaks have come and gone for years; recycling the process is the reliable fix.
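The restart pattern is simple: cap jobs per process and let the supervisor respawn it. A sketch where `launch_browser` and `run_job` are hypothetical hooks for your own code:

```python
MAX_JOBS_PER_WORKER = 75  # inside the 50-100 range above

def worker_loop(jobs, launch_browser, run_job, max_jobs=MAX_JOBS_PER_WORKER):
    """Process up to max_jobs items, then return so the supervisor
    (systemd, Kubernetes, ECS) restarts the process and the OS reclaims
    memory leaked by long-lived browser contexts."""
    browser = launch_browser()
    done = 0
    try:
        while done < max_jobs:
            job = next(jobs, None)
            if job is None:  # queue drained
                break
            run_job(browser, job)
            done += 1
    finally:
        browser.close()  # always release the browser, even on a crash
    return done
```

The key design choice: the worker exits deliberately and the supervisor restarts it, rather than the worker trying to clean up leaked memory in-process.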
2. Captcha
Scraping SaaS often handles captcha for you. Self-hosted, you have three options:
- Avoid captcha-prone targets — change strategy (use the site's API, partner data, etc.)
- 2Captcha / CapSolver — $1-3 per 1,000 captchas; works for hCaptcha and reCAPTCHA v2/v3
- Higher-quality proxies and fingerprints — most captchas are triggered by suspicious traffic; better disguise prevents triggering
3. Cloudflare's Turnstile and Bot Score
Increasingly common. The right answer for high-volume scraping behind Turnstile is residential proxies + good fingerprinting + paced request rates. There's no free trick.
4. Legal and ToS
Self-hosting doesn't change the legal status. Public data is generally fair game in most jurisdictions, but ToS-bound sites still bind you. Get legal advice before scraping at scale, especially if you're a commercial entity.
5. Storage
Scraped data adds up fast. Plan for:
- Hot storage in Postgres / DynamoDB for recent data
- Compressed Parquet in S3 / GCS for archive
- Lifecycle rules to move old data to cold tiers
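The lifecycle rules in the last bullet can be expressed as a standard S3 lifecycle configuration. The prefix, day counts, and storage classes here are illustrative, not recommendations:

```json
{
  "Rules": [
    {
      "ID": "archive-scraped-pages",
      "Filter": { "Prefix": "scrapes/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
```

Apply it with `aws s3api put-bucket-lifecycle-configuration`; GCS has an equivalent lifecycle mechanism.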
What you give up
Honest list:
- No vendor support when the target site changes
- No managed captcha solving (must integrate your own)
- No magic "anti-bot bypass" sales pitch
- More engineering attention — you own the maintenance forever
For most companies above the crossover, the answer is still "build". But it's a real commitment.
Realistic numbers
Recent client (~$8,400/month on Bright Data + ScrapingBee):
- 4 worker nodes on Spot EC2 (m6i.large): $200/month
- Datacenter proxy bundle: $250/month
- Residential proxy bundle (selective): $400/month
- Captcha solving budget: $100/month
- Engineering (~10% of one engineer's time): $1,500/month equivalent
Total: $2,450/month, ~70% reduction. Initial build: 3 weeks, paid back in week 6.
If your scraping bill has crossed the $2-5k/month line and you'd like help building the replacement, book a call.