Replacing ScrapingBee and Bright Data: Self-Hosted Scraping in 2026

Scraping SaaS pricing scales linearly with traffic; self-hosted costs are mostly fixed. Above a crossover volume, self-hosting is dramatically cheaper. Here's the stack I deploy and the gotchas I've learned.

By Andrii Votiakov on 2026-04-26

Scraping-as-a-service vendors (Bright Data, ScrapingBee, ScraperAPI, Oxylabs) charge per-request. At any meaningful volume, that pricing model gets brutal — $1-3 per 1,000 requests on basic plans, more for residential proxies and JavaScript-rendered pages. A self-hosted stack flips this from a per-request cost to a fixed infrastructure bill. If you're evaluating multiple SaaS tools to replace at once, the build vs buy 2026 framework gives a scoring system to prioritise which ones to tackle first. For communication infrastructure specifically, replacing Twilio follows a similar pattern to scraping — high volume, linear pricing, clear crossover point.

Quick answer

Above ~$2,000/month in scraping SaaS spend, self-hosting saves 60-90% with a Playwright + headless browser pool + proxy rotation stack. Total monthly cost typically drops to $300-1,500 for the same throughput. Engineering effort: 2-4 weeks initial build, ~2 days/month ongoing maintenance.

The cost crossover

Rough comparison at common volumes:

Volume / month    ScrapingBee / Bright Data    Self-hosted (compute + proxy + maintenance)
100k requests     $40-150                      ~$300 (overhead exceeds savings)
1M requests       $400-1,500                   $300-700
10M requests      $4,000-15,000                $700-2,000
100M requests     $40,000-100,000+             $2,000-7,000

The cost crossover usually lands around 1-3M requests/month. For sites with heavy bot detection (LinkedIn, Amazon, Booking), the SaaS premium for residential proxies pulls the crossover earlier, sometimes to 500k requests.
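A back-of-envelope way to find your own crossover point, assuming SaaS pricing is roughly linear per request and the self-hosted bill is roughly fixed (the figures below are illustrative, matching the table above, not vendor quotes):

```python
def monthly_saas_cost(requests: int, price_per_1k: float) -> float:
    """SaaS cost scales linearly with request volume."""
    return requests / 1_000 * price_per_1k

def crossover_requests(fixed_self_hosted: float, price_per_1k: float) -> int:
    """Volume at which a fixed self-hosted bill matches linear SaaS pricing."""
    return int(fixed_self_hosted / price_per_1k * 1_000)

# Illustrative: $1.50 per 1k requests vs a ~$1,500/month self-hosted stack.
print(crossover_requests(1_500, 1.50))  # 1,000,000 requests/month
```

Past that volume, every additional request is nearly free self-hosted and full price on SaaS.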

The stack I deploy

Browser pool: Playwright

Playwright beats Puppeteer for this work — better cross-browser support, more stable, better network interception. Run in headless mode in containers. Each container handles ~5-15 concurrent pages.

Key tuning:

from playwright.async_api import async_playwright

async with async_playwright() as p:
    browser = await p.chromium.launch(
        headless=True,
        args=[
            "--disable-dev-shm-usage",  # avoid /dev/shm exhaustion in containers
            "--disable-blink-features=AutomationControlled",  # hide the automation flag
            "--no-sandbox",  # required in most container setups
        ],
    )
    context = await browser.new_context(
        user_agent="Mozilla/5.0 ...",  # rotate realistic UAs
        viewport={"width": 1920, "height": 1080},
        locale="en-GB",
    )

Block what you don't need (images, fonts, ads) to cut bandwidth and speed up:

await context.route("**/*.{png,jpg,jpeg,svg,gif,woff2,woff}", lambda r: r.abort())

Proxy rotation

Two layers:

  1. Datacenter proxies for permissive sites (most). $0.50-1/GB or fixed monthly. Providers: Webshare, IPRoyal, Smartproxy datacenter tier.
  2. Residential proxies for hard sites (LinkedIn, Amazon, Booking). $3-15/GB. Providers: Bright Data, Oxylabs, Smartproxy residential.

Rotation strategy:

  • New IP per request for sensitive targets
  • Sticky session per scraping flow for sites that track sessions
  • Geographic targeting where required (some sites geofence content)

Browser fingerprint randomisation

Real browsers vary in canvas fingerprint, WebGL renderer, font list, time zone. Bot-detection vendors (Cloudflare, DataDome, PerimeterX) check all of these. Use a stealth plugin:

  • playwright-stealth for Python
  • puppeteer-extra-plugin-stealth for Node

Plus rotate user agents, viewport sizes, and time zones to match the proxy IP geography.

Queue and orchestration

Don't run scrapers as one big monolith; make the system queue-driven:

  • Job queue (Redis/SQS/Cloud Tasks) with retry policies
  • Worker pool of N browser containers
  • Dead-letter queue for jobs failing > 3 times
  • Rate limiter per target domain (one of the most-skipped pieces; protects you AND the target site)
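The per-domain limiter doesn't need much machinery: a token bucket keyed by domain works. A minimal in-process sketch (for a multi-worker fleet you'd back the buckets with Redis instead; the injectable clock here is just to make the logic testable):

```python
import time
from collections import defaultdict

class DomainRateLimiter:
    """Token bucket per target domain: `rate` requests/second, burst `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.buckets = defaultdict(lambda: {"tokens": capacity, "at": clock()})

    def try_acquire(self, domain: str) -> bool:
        bucket = self.buckets[domain]
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        bucket["tokens"] = min(
            self.capacity, bucket["tokens"] + (now - bucket["at"]) * self.rate
        )
        bucket["at"] = now
        if bucket["tokens"] >= 1:
            bucket["tokens"] -= 1
            return True
        return False
```

Workers call `try_acquire` before fetching and requeue the job with a delay on `False`, so one hot domain can't starve the rest of the queue or hammer the target.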

LLM extraction layer

The biggest win since 2023: extract structured data from messy HTML using an LLM instead of brittle CSS selectors. Cheap models (Claude Haiku, GPT-4o-mini, Gemini Flash) handle most cases at $0.0001-0.001 per page.

extraction_prompt = """
Extract the following fields from this HTML as JSON:
- product_name: string
- price: number
- currency: string (3-letter code)
- availability: "in_stock" | "out_of_stock" | "unknown"

If a field is not present, return null.
"""

Combine with strict JSON schema validation. The result: scrapers that survive site redesigns instead of breaking weekly.
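"Strict validation" can be as simple as checking the model's JSON against the exact contract the prompt promised and rejecting anything else. A stdlib-only sketch matching the example prompt above (in production a JSON Schema or Pydantic model does the same job):

```python
import json

ALLOWED_AVAILABILITY = {"in_stock", "out_of_stock", "unknown", None}
EXPECTED_FIELDS = {"product_name", "price", "currency", "availability"}

def validate_extraction(raw: str) -> dict:
    """Parse the LLM's output and enforce the prompt's contract.

    Raises ValueError on any deviation, so bad extractions never reach storage.
    """
    data = json.loads(raw)
    if set(data) != EXPECTED_FIELDS:
        raise ValueError(f"unexpected fields: {sorted(data)}")
    if data["product_name"] is not None and not isinstance(data["product_name"], str):
        raise ValueError("product_name must be a string or null")
    if data["price"] is not None and not isinstance(data["price"], (int, float)):
        raise ValueError("price must be a number or null")
    if data["currency"] is not None and (
        not isinstance(data["currency"], str) or len(data["currency"]) != 3
    ):
        raise ValueError("currency must be a 3-letter code or null")
    if data["availability"] not in ALLOWED_AVAILABILITY:
        raise ValueError("availability out of range")
    return data
```

Failed validations go back through the queue's retry path like any other transient error.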

Gotchas I've learned

1. Memory leaks

Browser contexts leak memory. Restart workers every N jobs (typically 50-100) rather than chasing the leaks: Chromium has leaked before and will leak again, and recycling the process is the reliable fix.
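The restart policy is a few lines in the worker loop. A sketch with the browser launch abstracted behind a callable, so the recycle logic is plain Python (`launch_browser` and `handle` are stand-ins for your Playwright setup and scrape function):

```python
def run_worker(jobs, launch_browser, handle, recycle_after=75):
    """Process jobs, tearing the browser down every `recycle_after` jobs.

    `launch_browser` returns an object with .close(); `handle(browser, job)`
    does the actual scrape. Recycling bounds Chromium's memory growth.
    """
    browser, done = launch_browser(), 0
    try:
        for job in jobs:
            handle(browser, job)
            done += 1
            if done % recycle_after == 0:
                browser.close()             # drop the leaky process...
                browser = launch_browser()  # ...and start fresh
    finally:
        browser.close()
    return done
```

Pair this with a container memory limit as a backstop, so a worker that leaks faster than expected gets killed and rescheduled instead of taking the node down.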

2. Captcha

Scraping SaaS often handles captcha for you. Self-hosted, you have three options:

  1. Avoid captcha-prone targets — change strategy (use the site's API, partner data, etc.)
  2. 2Captcha / CapSolver — $1-3 per 1,000 captchas; works for hCaptcha and reCAPTCHA v2/v3
  3. Higher-quality proxies and fingerprints — most captchas are triggered by suspicious traffic; better disguise prevents triggering

3. Cloudflare's Turnstile and Bot Score

Increasingly common. The right answer for high-volume scraping behind Turnstile is residential proxies + good fingerprinting + paced request rates. There's no free trick.

4. Legal and ToS

Self-hosting doesn't change the legal status. Scraping public data is generally permissible in many jurisdictions, but a site's terms of service can still apply to you. Get legal advice before scraping at scale, especially if you're a commercial entity.

5. Storage

Scraped data adds up fast. Plan for:

  • Hot storage in Postgres / DynamoDB for recent data
  • Compressed Parquet in S3 / GCS for archive
  • Lifecycle rules to move old data to cold tiers
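Lifecycle policy reduces to a function from record age to tier; writing it down in code makes the policy reviewable and testable before you encode it in S3 or GCS lifecycle rules (the thresholds here are illustrative):

```python
from datetime import date

# Illustrative thresholds -- tune to your query patterns.
HOT_DAYS, WARM_DAYS = 30, 365

def storage_tier(scraped_on: date, today: date) -> str:
    """Map record age to a tier: hot DB -> warm Parquet -> cold archive."""
    age = (today - scraped_on).days
    if age <= HOT_DAYS:
        return "postgres"      # recent data, indexed for queries
    if age <= WARM_DAYS:
        return "s3_parquet"    # compressed columnar archive
    return "s3_glacier"        # rarely touched, cheapest tier
```

A nightly job that moves rows across the thresholds keeps the hot store small, which is what keeps Postgres query times (and the bill) flat as volume grows.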

What you give up

Honest list:

  • No vendor support when the target site changes
  • No managed captcha solving (must integrate your own)
  • No magic "anti-bot bypass" sales pitch
  • More engineering attention — you own the maintenance forever

For most companies above the crossover, the answer is still "build". But it's a real commitment.

Realistic numbers

A recent client (previously ~$8,400/month on Bright Data + ScrapingBee) now runs:

  • 4 worker nodes on Spot EC2 (m6i.large): $200/month
  • Datacenter proxy bundle: $250/month
  • Residential proxy bundle (selective): $400/month
  • Captcha solving budget: $100/month
  • Engineering: ~10% of one engineer's time, ~$1,500/month equivalent

Total: $2,450/month, ~70% reduction. Initial build: 3 weeks, paid back in week 6.


If your scraping bill has crossed the $2-5k/month line and you'd like help building the replacement, book a call.