Our AI features started like everyone’s do: call the OpenAI API, ship the feature, move on. Then the bills started adding up.
I’ve been building LLM-powered backend services and agentic workflows in production for a while now. Early on, every feature meant another API call to GPT-4. Classification, summarization, data extraction. Output quality was great. The invoices were not.
After adding cost tracking, it became obvious some tasks were burning money for no good reason. A Mistral 7B running locally handles classification just as well as GPT-4o. Moving the right workloads off the API cut monthly spend from about $380 to under $80. The features didn’t get worse.
The numbers
OpenAI pricing looks cheap when you’re prototyping. At production volume, it adds up:
| Task | Daily Volume | Model | Monthly Cost |
|---|---|---|---|
| Document classification | 2,000 calls | GPT-4o | ~$48 |
| Structured data extraction | 1,500 calls | GPT-4o | ~$54 |
| Internal summarization | 500 calls | GPT-4o | ~$30 |
| Agentic tool-calling | 200 calls | GPT-4o | ~$36 |
| Content moderation | 3,000 calls | GPT-4o-mini | ~$7 |
| Embeddings | 5,000 calls | text-embedding-3-small | ~$2 |
| Dev, testing, retries | — | Mixed | ~$50+ |
No single feature looks expensive. That’s the trap. Each team sees their $30-50/month and shrugs. Across 6-8 AI features, you’re at $400 and climbing.
What stays on the API, what goes local
Not everything should run locally. The split is pretty straightforward once you think about it.
Keep on the API: agentic tool-calling workflows and complex multi-step reasoning. The model needs to decide which tools to call, chain multiple steps, handle branching. A 7B model fumbling a tool-calling chain costs more in debugging time than the API bill. We tried it. Twice. Both times the small model picked the wrong tool on the third step and the whole chain fell apart.
Move to local: classification, extraction, summarization, moderation, embeddings. These are well-defined tasks where “good enough” is fine and volume is high. The quality bar for “put this document in one of five buckets” is very different from “reason about which API to call next.”
The setup
I use Ollama for development and vLLM for production. Ollama is simpler on a laptop, pull a model and you’re running in minutes. vLLM is what you want in production because the throughput is much better and it handles request batching. The key thing: vLLM exposes an OpenAI-compatible endpoint. Your existing code barely changes.
from openai import AsyncOpenAI
# Local vLLM instance
local_client = AsyncOpenAI(
base_url="http://vllm:8000/v1",
api_key="local-dev-key", # required but not validated
)
# Real OpenAI for tasks that need it
api_client = AsyncOpenAI()
Same chat.completions.create() call. No SDK changes.
The routing layer
A thin router picks the right model and handles fallback if local inference goes down:
from dataclasses import dataclass
from enum import Enum
from openai import AsyncOpenAI
class ModelTier(Enum):
LOCAL = "local"
API_FAST = "api_fast"
API_FRONTIER = "api_frontier"
class ModelRouter:
def __init__(self, local_client: AsyncOpenAI, api_client: AsyncOpenAI):
self._configs = {
ModelTier.LOCAL: {"client": local_client, "model": "mistralai/Mistral-7B-Instruct-v0.3"},
ModelTier.API_FAST: {"client": api_client, "model": "gpt-4o-mini"},
ModelTier.API_FRONTIER: {"client": api_client, "model": "gpt-4o"},
}
self._local_healthy = True
async def complete(self, messages, tier=ModelTier.LOCAL, fallback=ModelTier.API_FAST, **kwargs):
config = self._configs[tier]
if tier == ModelTier.LOCAL and not self._local_healthy:
config = self._configs[fallback]
try:
resp = await config["client"].chat.completions.create(
model=config["model"], messages=messages, **kwargs
)
if tier == ModelTier.LOCAL:
self._local_healthy = True
return resp
except Exception:
if tier == ModelTier.LOCAL:
self._local_healthy = False
fb = self._configs[fallback]
return await fb["client"].chat.completions.create(
model=fb["model"], messages=messages, **kwargs
)
raise
The fallback matters a lot. Local inference will go down at some point — model updates, GPU OOM, container restarts. Your features can’t go down with it. The router falls back to the API quietly, you fix local later.
Cost comparison
Three months of data:
Before (all API): ~$380/month, zero infrastructure cost.
After (hybrid): ~$70 OpenAI (agentic + complex only) + ~$80 GPU server + ~$5 overhead = ~$155/month. That’s 59% savings.
But the dollar savings aren’t actually the biggest win. Once local inference is running, new AI features are basically free. That classification pipeline for a new feature? Zero cost. A summarization endpoint a PM wants to test? Ship it in an afternoon, no budget discussion.
Teams stop rationing AI and start experimenting. That shift matters more than the money.
Things that burned me
Prompts tuned for GPT-4o don’t produce the same results on Mistral 7B. I spent a week re-tuning. Budget for this.
# GPT-4o: works fine
PROMPT_GPT = "Classify this text into: billing, technical, general. Respond with only the category."
# Mistral 7B: returns "The category is: technical"
# Need to be more explicit:
PROMPT_LOCAL = """Classify this text into exactly one category.
Rules:
- Respond with ONLY one word
- Must be one of: billing, technical, general
- No other text
Text: {text}
Category:"""
Quantization helps more than model size. A 7B at Q4_K_M runs 3x faster than full FP16 with maybe 2-3% quality loss on classification. Don’t run full-precision for routine tasks.
Cold starts are annoying. First request after a container restart takes 15-30 seconds while the model loads. I added a warmup probe that sends a dummy request on startup.
And you have to monitor quality. Log every response, sample for accuracy weekly. Model updates can silently drop classification from 94% to 87% and you won’t notice until users complain.
When not to bother
Under 100 calls/day, the infrastructure cost is more than the API savings. Tool-calling workflows need frontier models — a 7B will hallucinate tool parameters. Huge context windows (over 32K tokens) are impractical locally. Small teams might not want another service to maintain. One more thing to monitor, one more thing that can go down.
The hybrid approach isn’t about eliminating API costs. It’s about spending on the API only where it actually matters, and running everything else for nearly free.
The $300/month I recovered was nice. The bigger deal was that cost stopped being a blocker for shipping new AI features.