Replacing ScrapingBee and Bright Data: Self-Hosted Scraping in 2026

Scraping SaaS pricing scales linearly with traffic; self-hosted costs are mostly fixed. Above a crossover volume, self-hosting is dramatically cheaper. Here's the stack I deploy and the gotchas I've learned.

By Andrii Votiakov on 2026-04-26

Scraping-as-a-service vendors (Bright Data, ScrapingBee, ScraperAPI, Oxylabs) charge per-request. At any meaningful volume, that pricing model gets brutal — $1-3 per 1,000 requests on basic plans, more for residential proxies and JavaScript-rendered pages. A self-hosted stack flips this from a per-request cost to a fixed infrastructure bill. If you're evaluating multiple SaaS tools to replace at once, the build vs buy 2026 framework gives a scoring system to prioritise which ones to tackle first. For communication infrastructure specifically, replacing Twilio follows a similar pattern to scraping — high volume, linear pricing, clear crossover point.

Quick answer

Above ~$2,000/month in scraping SaaS spend, self-hosting saves 60-90% with a Playwright + headless browser pool + proxy rotation stack. Total monthly cost typically drops to $300-1,500 for the same throughput. Engineering effort: 2-4 weeks initial build, ~2 days/month ongoing maintenance.

The cost crossover

Rough comparison at common volumes:

Volume / month    ScrapingBee / Bright Data    Self-hosted (compute + proxy + maintenance)
100k requests     $40-150                      ~$300 (overhead exceeds savings)
1M requests       $400-1,500                   $300-700
10M requests      $4,000-15,000                $700-2,000
100M requests     $40,000-100,000+             $2,000-7,000

The cost crossover usually lands around 1-3M requests/month. For sites with heavy bot detection (LinkedIn, Amazon, Booking), the SaaS premium for residential proxies pulls the crossover earlier, sometimes to 500k requests.
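A back-of-envelope way to find your own crossover point, assuming SaaS pricing is roughly linear per request and the self-hosted bill is roughly fixed (the figures below are illustrative, matching the table above, not vendor quotes):

```python
def monthly_saas_cost(requests: int, price_per_1k: float) -> float:
    """SaaS cost scales linearly with request volume."""
    return requests / 1_000 * price_per_1k

def crossover_requests(fixed_self_hosted: float, price_per_1k: float) -> int:
    """Volume at which a fixed self-hosted bill matches linear SaaS pricing."""
    return int(fixed_self_hosted / price_per_1k * 1_000)

# Illustrative: $1.50 per 1k requests vs a ~$1,500/month self-hosted stack.
print(crossover_requests(1_500, 1.50))  # 1,000,000 requests/month
```

Past that volume, every additional request is nearly free self-hosted and full price on SaaS.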

The stack I deploy

Browser pool: Playwright

Playwright beats Puppeteer for this work — better cross-browser support, more stable, better network interception. Run in headless mode in containers. Each container handles ~5-15 concurrent pages.

Key tuning:

from playwright.async_api import async_playwright

async with async_playwright() as p:
    browser = await p.chromium.launch(
        headless=True,
        args=[
            "--disable-dev-shm-usage",  # avoid /dev/shm exhaustion in containers
            "--disable-blink-features=AutomationControlled",  # hide the automation flag
            "--no-sandbox",  # required in most container setups
        ],
    )
    context = await browser.new_context(
        user_agent="Mozilla/5.0 ...",  # rotate realistic UAs
        viewport={"width": 1920, "height": 1080},
        locale="en-GB",
    )

Block what you don't need (images, fonts, ads) to cut bandwidth and speed up:

await context.route("**/*.{png,jpg,jpeg,svg,gif,woff2,woff}", lambda r: r.abort())

Proxy rotation

Two layers:

  1. Datacenter proxies for permissive sites (most). $0.50-1/GB or fixed monthly. Providers: Webshare, IPRoyal, Smartproxy datacenter tier.
  2. Residential proxies for hard sites (LinkedIn, Amazon, Booking). $3-15/GB. Providers: Bright Data, Oxylabs, Smartproxy residential.

Rotation strategy:

  • New IP per request for sensitive targets
  • Sticky session per scraping flow for sites that track sessions
  • Geographic targeting where required (some sites geofence content)

Browser fingerprint randomisation

Real browsers vary in canvas fingerprint, WebGL renderer, font list, time zone. Bot-detection vendors (Cloudflare, DataDome, PerimeterX) check all of these. Use a stealth plugin:

  • playwright-stealth for Python
  • puppeteer-extra-plugin-stealth for Node

Plus rotate user agents, viewport sizes, and time zones to match the proxy IP geography.

Queue and orchestration

Don't run scrapers as one big monolith; make the system queue-driven:

  • Job queue (Redis/SQS/Cloud Tasks) with retry policies
  • Worker pool of N browser containers
  • Dead-letter queue for jobs failing > 3 times
  • Rate limiter per target domain (one of the most-skipped pieces; protects you AND the target site)
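The per-domain limiter doesn't need much machinery: a token bucket keyed by domain works. A minimal in-process sketch (for a multi-worker fleet you'd back the buckets with Redis instead; the injectable clock here is just to make the logic testable):

```python
import time
from collections import defaultdict

class DomainRateLimiter:
    """Token bucket per target domain: `rate` requests/second, burst `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.buckets = defaultdict(lambda: {"tokens": capacity, "at": clock()})

    def try_acquire(self, domain: str) -> bool:
        bucket = self.buckets[domain]
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        bucket["tokens"] = min(
            self.capacity, bucket["tokens"] + (now - bucket["at"]) * self.rate
        )
        bucket["at"] = now
        if bucket["tokens"] >= 1:
            bucket["tokens"] -= 1
            return True
        return False
```

Workers call `try_acquire` before fetching and requeue the job with a delay on `False`, so one hot domain can't starve the rest of the queue or hammer the target.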

LLM extraction layer

The biggest win since 2023: extract structured data from messy HTML using an LLM instead of brittle CSS selectors. Cheap models (Claude Haiku, GPT-4o-mini, Gemini Flash) handle most cases at $0.0001-0.001 per page.

extraction_prompt = """
Extract the following fields from this HTML as JSON:
- product_name: string
- price: number
- currency: string (3-letter code)
- availability: "in_stock" | "out_of_stock" | "unknown"

If a field is not present, return null.
"""

Combine with strict JSON schema validation. The result: scrapers that survive site redesigns instead of breaking weekly.
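"Strict validation" can be as simple as checking the model's JSON against the exact contract the prompt promised and rejecting anything else. A stdlib-only sketch matching the example prompt above (in production a JSON Schema or Pydantic model does the same job):

```python
import json

ALLOWED_AVAILABILITY = {"in_stock", "out_of_stock", "unknown", None}
EXPECTED_FIELDS = {"product_name", "price", "currency", "availability"}

def validate_extraction(raw: str) -> dict:
    """Parse the LLM's output and enforce the prompt's contract.

    Raises ValueError on any deviation, so bad extractions never reach storage.
    """
    data = json.loads(raw)
    if set(data) != EXPECTED_FIELDS:
        raise ValueError(f"unexpected fields: {sorted(data)}")
    if data["product_name"] is not None and not isinstance(data["product_name"], str):
        raise ValueError("product_name must be a string or null")
    if data["price"] is not None and not isinstance(data["price"], (int, float)):
        raise ValueError("price must be a number or null")
    if data["currency"] is not None and (
        not isinstance(data["currency"], str) or len(data["currency"]) != 3
    ):
        raise ValueError("currency must be a 3-letter code or null")
    if data["availability"] not in ALLOWED_AVAILABILITY:
        raise ValueError("availability out of range")
    return data
```

Failed validations go back through the queue's retry path like any other transient error.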

Gotchas I've learned

1. Memory leaks

Browser contexts leak memory. Restart workers every N jobs (typically 50-100) rather than chasing the leaks: Chromium has leaked before and will leak again, and recycling the process is the reliable fix.
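The restart policy is a few lines in the worker loop. A sketch with the browser launch abstracted behind a callable, so the recycle logic is plain Python (`launch_browser` and `handle` are stand-ins for your Playwright setup and scrape function):

```python
def run_worker(jobs, launch_browser, handle, recycle_after=75):
    """Process jobs, tearing the browser down every `recycle_after` jobs.

    `launch_browser` returns an object with .close(); `handle(browser, job)`
    does the actual scrape. Recycling bounds Chromium's memory growth.
    """
    browser, done = launch_browser(), 0
    try:
        for job in jobs:
            handle(browser, job)
            done += 1
            if done % recycle_after == 0:
                browser.close()             # drop the leaky process...
                browser = launch_browser()  # ...and start fresh
    finally:
        browser.close()
    return done
```

Pair this with a container memory limit as a backstop, so a worker that leaks faster than expected gets killed and rescheduled instead of taking the node down.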

2. Captcha

Scraping SaaS often handles captcha for you. Self-hosted, you have three options:

  1. Avoid captcha-prone targets — change strategy (use the site's API, partner data, etc.)
  2. 2Captcha / CapSolver — $1-3 per 1,000 captchas; works for hCaptcha and reCAPTCHA v2/v3
  3. Higher-quality proxies and fingerprints — most captchas are triggered by suspicious traffic; better disguise prevents triggering

3. Cloudflare's Turnstile and Bot Score

Increasingly common. The right answer for high-volume scraping behind Turnstile is residential proxies + good fingerprinting + paced request rates. There's no free trick.

4. Legal and ToS

Self-hosting doesn't change the legal status. Scraping public data is generally permissible in many jurisdictions, but a site's terms of service can still apply to you. Get legal advice before scraping at scale, especially if you're a commercial entity.

5. Storage

Scraped data adds up fast. Plan for:

  • Hot storage in Postgres / DynamoDB for recent data
  • Compressed Parquet in S3 / GCS for archive
  • Lifecycle rules to move old data to cold tiers
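Lifecycle policy reduces to a function from record age to tier; writing it down in code makes the policy reviewable and testable before you encode it in S3 or GCS lifecycle rules (the thresholds here are illustrative):

```python
from datetime import date

# Illustrative thresholds -- tune to your query patterns.
HOT_DAYS, WARM_DAYS = 30, 365

def storage_tier(scraped_on: date, today: date) -> str:
    """Map record age to a tier: hot DB -> warm Parquet -> cold archive."""
    age = (today - scraped_on).days
    if age <= HOT_DAYS:
        return "postgres"      # recent data, indexed for queries
    if age <= WARM_DAYS:
        return "s3_parquet"    # compressed columnar archive
    return "s3_glacier"        # rarely touched, cheapest tier
```

A nightly job that moves rows across the thresholds keeps the hot store small, which is what keeps Postgres query times (and the bill) flat as volume grows.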

What you give up

Honest list:

  • No vendor support when the target site changes
  • No managed captcha solving (must integrate your own)
  • No magic "anti-bot bypass" sales pitch
  • More engineering attention — you own the maintenance forever

For most companies above the crossover, the answer is still "build". But it's a real commitment.

Realistic numbers

A recent client (previously ~$8,400/month on Bright Data + ScrapingBee) now runs:

  • 4 worker nodes on Spot EC2 (m6i.large): $200/month
  • Datacenter proxy bundle: $250/month
  • Residential proxy bundle (selective): $400/month
  • Captcha solving budget: $100/month
  • Engineering: ~10% of one engineer's time, ~$1,500/month equivalent

Total: $2,450/month, ~70% reduction. Initial build: 3 weeks, paid back in week 6.


If your scraping bill has crossed the $2-5k/month line and you'd like help building the replacement, book a call.