Self-Hosting Open-Source LLMs: When It Pays Back vs API

Running Llama 3.3 or Mistral on your own GPU instances looks cheap until you do the full maths. Here's the real crossover point versus Claude and GPT-4o APIs.

By Andrii Votiakov on 2026-04-06

The pitch for self-hosted LLMs is compelling: own the infrastructure, pay a flat rate for the hardware no matter how many tokens you push through it, and keep data inside your VPC. In practice, the economics only work above a certain request volume — and the break-even point is higher than most teams expect when they're excited about open-source models. I've helped several teams through this calculation, and the answer is almost never "just spin up an H100 and call it done."

Quick answer

At typical usage patterns, self-hosted open-source LLMs become cheaper than API providers somewhere between 50 and 100 million tokens per day, depending on the model and GPU. Below that threshold, API costs (especially with prompt caching and model routing) are almost always cheaper when you factor in GPU instance cost, engineering overhead, availability requirements, and model maintenance. Above that threshold — or if data privacy requires it — self-hosting is the right call.

The real cost of running your own GPU

Let's start with the hardware side, because this is where people under-count.

An NVIDIA H100 SXM node on AWS (p5.48xlarge, 8x H100) costs roughly $98/hour on-demand, or around $50-55/hour on a 1-year reserved instance. On GCP, an A3 instance (8x H100) is comparable. A100 nodes (p4d.24xlarge on AWS) are cheaper at ~$32/hour on-demand, ~$18/hour reserved.

A single H100 can serve roughly:

  • Llama 3.3 70B at 4-bit quantisation: ~2,500-3,500 tokens/second throughput
  • Mistral 7B at FP16: ~8,000-12,000 tokens/second
  • Qwen 72B at 4-bit: ~2,000-3,000 tokens/second

At 3,000 tokens/second sustained on one H100 (~$6/hour for a single H100 reserved slice), that's:

  • 3,000 tokens/sec × 3,600 seconds = 10.8 million tokens/hour
  • ~259 million tokens/day
  • Cost: ~$6/hour for the GPU slice

That sounds cheap compared to Claude Sonnet at $3/million input tokens. But that's the ceiling throughput. Real utilisation is rarely 100%.

The utilisation problem

This is the crux of the cost analysis. GPU instances run 24/7. If your request volume only uses 20% of available capacity, you're paying for 100% of the GPU and using 20% of it.

At 20% utilisation on an H100 reserved slice:

  • Effective cost per million tokens: $6/hour / (0.2 × 10.8M tokens/hour) = $2.78/million tokens

That's already close to Claude Sonnet prices — and we haven't added engineering cost yet.

At 10% utilisation (common for many workloads during off-peak hours):

  • Effective cost per million tokens: $6/hour / (0.1 × 10.8M tokens/hour) = $5.56/million tokens

That's more expensive than Claude Sonnet, and you're also managing the infrastructure yourself.

Utilisation target for self-hosting to make financial sense: above 60-70% sustained average. That means high, predictable, round-the-clock request volume.
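Here's that arithmetic as a short Python sketch, using the single-H100 figures from above (~$6/hour for the reserved slice, ~10.8M tokens/hour ceiling). The numbers are the illustrative ones from this section, not a quote for any particular workload:

    # Effective self-hosted cost per million tokens at a given average utilisation,
    # using the single-H100 slice figures above (illustrative, not a price quote).
    GPU_COST_PER_HOUR = 6.00               # reserved H100 slice, $/hour
    CEILING_TOKENS_PER_HOUR = 10_800_000   # 3,000 tokens/sec * 3,600 seconds

    def cost_per_million_tokens(utilisation: float) -> float:
        """Dollars per million tokens actually served at this utilisation."""
        tokens_served_per_hour = CEILING_TOKENS_PER_HOUR * utilisation
        return GPU_COST_PER_HOUR / (tokens_served_per_hour / 1_000_000)

    for u in (1.0, 0.7, 0.2, 0.1):
        print(f"{u:>5.0%} utilisation: ${cost_per_million_tokens(u):.2f}/million tokens")
    # 100%: $0.56, 70%: $0.79, 20%: $2.78, 10%: $5.56 -- the 20% and 10% figures
    # match the calculations above; the API comparison point is ~$3/million.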

vLLM: the right serving framework

If you do go self-hosted, vLLM is the framework to use. It handles continuous batching (which dramatically improves throughput), PagedAttention (efficient KV cache management), and multi-GPU tensor parallelism. Running a model without vLLM on a raw CUDA setup will give you a fraction of the throughput and make the economics worse.

A production vLLM setup for Llama 3.3 70B typically needs:

  • 2x H100 (80GB each) for FP16, or 1x H100 for 4-bit quantisation
  • ~16 CPU cores, 64GB RAM for the host
  • NVMe storage for model weights (70B FP16 = ~140GB)
  • Kubernetes or ECS for availability and rolling deploys

That's not a weekend project. It's a proper infrastructure workload requiring GPU expertise, autoscaling, health checks, and model version management.
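For a sense of what the serving side looks like, here's a minimal sketch using vLLM's offline Python API under the assumptions above (Llama 3.3 70B in FP16 across 2x H100). A production deployment would run vLLM's OpenAI-compatible server behind Kubernetes/ECS instead; the sampling settings here are just illustrative starting points:

    # Minimal vLLM sketch: Llama 3.3 70B, FP16, sharded across 2x H100 80GB.
    # Production would use `vllm serve` (the OpenAI-compatible server) rather
    # than this offline API; the settings below are illustrative.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.3-70B-Instruct",  # gated HF repo; accept the licence first
        tensor_parallel_size=2,        # split the 70B weights across 2 GPUs
        gpu_memory_utilization=0.90,   # leave headroom for the KV cache
        max_model_len=8192,            # cap context so KV cache use stays predictable
    )

    params = SamplingParams(temperature=0.2, max_tokens=512)
    outputs = llm.generate(["Summarise the following clause: ..."], params)
    print(outputs[0].outputs[0].text)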

When latency or privacy makes self-hosting worth it regardless of cost

Two cases where the cost analysis is secondary:

Data privacy and compliance. If you're processing patient data (HIPAA), financial data (FCA, FINRA requirements), or PII subject to GDPR with DPA restrictions on sub-processors, sending data to OpenAI or Anthropic may not be an option regardless of price. Self-hosted in your VPC means your data never leaves your network. This is a legitimate, non-negotiable reason.

Latency-sensitive inference. API providers have variable latency. Under load, GPT-4o can have P99 response times of 10-30 seconds for long generations. A self-hosted vLLM instance with dedicated capacity gives you predictable latency. If you're running real-time applications where P99 matters — voice assistants, real-time code completion, interactive tutoring — dedicated GPU capacity may be worth the premium.

Crossover analysis: Llama 3.3 70B vs Claude 3.7 Sonnet

Assumptions:

  • AWS p4d.24xlarge (8x A100, not H100 — a cheaper previous-generation GPU that's often sufficient for non-frontier models), 1-year reserved: ~$18/hour
  • vLLM, Llama 3.3 70B at 4-bit, ~16,000 tokens/second peak
  • 70% sustained utilisation target for break-even

At 70% utilisation:

  • Tokens/hour: 16,000 × 3,600 × 0.7 = ~40 million
  • GPU cost per million tokens: $18 / 40 = $0.45/million

vs Claude 3.7 Sonnet input tokens at $3/million. Self-hosted wins at this utilisation.

But to achieve 70% average utilisation, you need roughly 40 million tokens/hour of consistent demand, which is ~970 million tokens/day. That's a substantial production workload — not a typical small or mid-size application.

For GPT-4o-mini ($0.15/million input tokens), the crossover is even harder to reach. A smaller model running on a cheaper GPU at lower throughput needs to hit ~85% utilisation to beat $0.15/million.
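The same crossover calculation as a sketch, with the 8x A100 assumptions above baked in; swap in your own instance price, throughput, and API price (e.g. $0.15/million for GPT-4o-mini) to see where your workload lands:

    # Crossover sketch for the 8x A100 / Llama 3.3 70B assumptions above.
    INSTANCE_COST_PER_HOUR = 18.00   # p4d.24xlarge, 1-year reserved
    PEAK_TOKENS_PER_SEC = 16_000     # vLLM, 70B at 4-bit, whole instance

    def self_hosted_cost_per_million(utilisation: float) -> float:
        tokens_per_hour = PEAK_TOKENS_PER_SEC * 3_600 * utilisation
        return INSTANCE_COST_PER_HOUR / (tokens_per_hour / 1_000_000)

    def demand_needed_per_day(utilisation: float) -> float:
        """Tokens/day of real demand required to sustain this average utilisation."""
        return PEAK_TOKENS_PER_SEC * 3_600 * 24 * utilisation

    print(f"Cost at 70% utilisation:  ${self_hosted_cost_per_million(0.7):.2f}/million tokens")
    print(f"Demand needed to hit 70%: {demand_needed_per_day(0.7) / 1e6:,.0f}M tokens/day")
    # ~$0.45/million and ~970M tokens/day, as above; compare against the API
    # price you would actually pay ($3/million Sonnet input, $0.15/million 4o-mini).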

Model quality considerations

This matters for the cost comparison. Llama 3.3 70B is genuinely capable — competitive with GPT-4o-mini for most tasks and approaching Sonnet territory for code and reasoning. But "approaching" isn't "matching."

If the quality delta means you need a larger model (Llama 3.1 405B, which needs 4+ H100s) to match API model quality, the crossover point shifts significantly.

A practical approach: run your specific evaluation set on both options before committing. Quality differences often only matter for 20% of use cases. Route those to an API provider, self-host the rest.
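A minimal harness for that comparison might look like the sketch below; call_self_hosted and call_api are placeholders for whatever clients you actually use, and exact match stands in for task-appropriate scoring:

    # Eval-comparison sketch. call_self_hosted / call_api are placeholders for
    # your real clients (e.g. an OpenAI-compatible vLLM endpoint vs a hosted API);
    # exact match is a stand-in for whatever scoring fits your task.
    from typing import Callable

    def accuracy(call_model: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
        """Fraction of eval cases where the model output matches the expected answer."""
        hits = sum(1 for prompt, expected in cases
                   if call_model(prompt).strip() == expected.strip())
        return hits / len(cases)

    eval_cases = [
        ("Extract the invoice total from: ...", "1,250.00"),
        # ... your real evaluation set goes here
    ]

    # accuracy(call_self_hosted, eval_cases) vs accuracy(call_api, eval_cases)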

A hybrid approach

The pattern that often makes the most sense:

  • Self-host a fast, smaller model (Mistral 7B, Qwen 14B) for high-volume, latency-sensitive, simpler tasks
  • Use API (Claude Haiku, GPT-4o-mini) for medium-complexity tasks
  • Use API (Claude Sonnet, GPT-4o) for complex generation and reasoning

This gives you the cost efficiency of self-hosting on the high-volume simple path, without the capital and ops overhead of running big models.
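A sketch of what that routing can look like in code. The endpoints, model names, and the complexity heuristic are illustrative stand-ins; the useful detail is that vLLM exposes an OpenAI-compatible API, so the self-hosted tier can reuse the same client pointed at an internal base_url (a Claude tier would use the anthropic SDK instead):

    # Three-tier routing sketch. Endpoints, model names, and the heuristic are
    # illustrative; vLLM serves an OpenAI-compatible API, so the self-hosted
    # tier reuses the same client with a different base_url.
    from openai import OpenAI

    CLIENTS = {
        "simple":  OpenAI(base_url="http://vllm.internal:8000/v1", api_key="unused"),
        "medium":  OpenAI(),   # hosted API for medium-complexity tasks
        "complex": OpenAI(),   # hosted API for complex generation/reasoning
    }
    MODELS = {
        "simple":  "mistralai/Mistral-7B-Instruct-v0.3",  # whatever the vLLM server was launched with
        "medium":  "gpt-4o-mini",
        "complex": "gpt-4o",
    }

    def classify(task: str) -> str:
        # Placeholder heuristic -- in practice a task-type or confidence rule.
        return "simple" if len(task) < 2_000 else "complex"

    def complete(task: str) -> str:
        tier = classify(task)
        resp = CLIENTS[tier].chat.completions.create(
            model=MODELS[tier],
            messages=[{"role": "user", "content": task}],
        )
        return resp.choices[0].message.content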

See /blog/openai-anthropic-api-cost for how to minimise API costs on the tasks you don't self-host.

Realistic numbers

A client running a document intelligence platform (~800 million tokens/day, mix of extraction and classification):

  • Was paying $12,000/month on Claude Haiku API (extraction pipeline, 95% of volume)
  • Self-hosted Qwen 14B on 4x A100 reserved instances: $3,200/month in GPU cost
  • Engineering setup time: 3 weeks
  • Kept Claude Sonnet on API for the complex reasoning tasks (~5% of volume): $1,100/month

Total after: $4,300/month, saving $7,700/month. Payback on engineering time: ~5 weeks.

The extraction quality on Qwen 14B matched Haiku for their structured document types. They kept a 5% routing rule that sends ambiguous documents to Sonnet as a quality gate.


If you're trying to decide whether self-hosting makes financial sense for your volume, book a call. The crossover maths takes about 30 minutes to work through with real numbers.