Our AI features started the way everyone’s do: call the OpenAI API, ship the feature, move on. Then the bills started stacking up.
I’ve spent the past year building agentic AI workflows and LLM-powered backend services in production. Early on, every feature meant another API call to GPT-4. Structured data extraction, document summarization, classification pipelines. The output quality was great. The invoices were not.
After adding proper cost tracking across our AI workloads, it became obvious that some tasks were burning money for no reason. A Mistral 7B running locally handles classification just as well as GPT-4o. Moving the right workloads off the API cut our monthly spend from ~$380 to under $80, and the features our users depend on didn’t degrade at all.
The Numbers Nobody Adds Up
OpenAI pricing looks cheap when you’re prototyping. At production volume, it compounds. Here are the actual numbers from the workloads I was running:
| Task | Daily Volume | Avg Tokens/Call | Model | Monthly Cost |
|---|---|---|---|---|
| Document classification | 2,000 calls | ~800 tokens | GPT-4o | ~$48 |
| Structured data extraction | 1,500 calls | ~1,200 tokens | GPT-4o | ~$54 |
| Internal summarization | 500 calls | ~2,000 tokens | GPT-4o | ~$30 |
| Agentic tool-calling workflows | 200 calls | ~3,000 tokens | GPT-4o | ~$36 |
| Content moderation | 3,000 calls | ~400 tokens | GPT-4o-mini | ~$7 |
| Embedding generation | 5,000 calls | ~500 tokens | text-embedding-3-small | ~$2 |
| Misc (dev, testing, retries) | — | — | Mixed | ~$50+ |
| Total | | | | ~$380 |
No individual feature looks expensive. That’s the trap. Each team sees their $30-50/month and shrugs. But across 6-8 AI features, you’re at $400 and climbing with every new capability.
The real question: which of these tasks actually need a frontier model?
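The per-row math in the table is just volume × tokens × price. A quick sanity check, using a hypothetical blended per-million-token price (real pricing splits input and output tokens, so plug in your provider's actual rates):

```python
def monthly_cost_usd(daily_calls: int, avg_tokens_per_call: int,
                     blended_price_per_m: float, days: int = 30) -> float:
    """Rough monthly spend for one workload, at a blended $/1M-token price."""
    tokens_per_month = daily_calls * avg_tokens_per_call * days
    return tokens_per_month / 1_000_000 * blended_price_per_m

# Document classification row: 2,000 calls/day at ~800 tokens each.
# At a $1/M blended price this comes out to $48/month, matching the table.
cost = monthly_cost_usd(2_000, 800, 1.0)  # 48.0
```

Run your own volumes through this before and after any migration; the compounding only becomes visible when you multiply it out.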
API vs. Local: How I Decide
Not everything should run locally. I use a simple framework:
"""
Decision framework for API vs. local model routing.
Not code you run — a mental model encoded as logic.
"""
def should_use_local_model(task: dict) -> bool:
"""
Route to local model when ALL of these are true:
1. Task is well-defined (classification, extraction, summarization)
2. Quality bar is "good enough" — not "best possible"
3. Latency tolerance > 500ms (local inference is slower)
4. No need for massive context windows (>32K tokens)
5. Volume is high enough to justify the infrastructure
"""
return all([
task["type"] in ("classification", "extraction", "summarization", "moderation"),
task["quality_requirement"] != "frontier",
task["latency_tolerance_ms"] > 500,
task["avg_context_tokens"] < 32_000,
task["daily_volume"] > 100,
])
def should_keep_on_api(task: dict) -> bool:
"""
Keep on OpenAI API when ANY of these are true:
1. Task requires frontier reasoning (complex agentic workflows)
2. You need tool-calling with structured outputs at high reliability
3. Context window > 32K tokens
4. Volume is low (< 100 calls/day) — not worth the infra overhead
5. Task evolves rapidly and you need model updates without redeployment
"""
return any([
task["type"] in ("agentic_workflow", "complex_reasoning"),
task["requires_tool_calling"] and task["reliability_requirement"] == "critical",
task["avg_context_tokens"] > 32_000,
task["daily_volume"] < 100,
task["changes_frequently"],
])
After running every AI workload through this logic, the split was clear:
- Keep on API: Agentic tool-calling workflows, complex multi-step reasoning
- Move to local: Classification, extraction, summarization, moderation, embeddings
The agentic workflows, where the model decides which tools to call and chains multiple steps together, are where GPT-4o earns its cost. A local 7B model fumbling a tool-calling chain costs more in debugging time than the API bill.
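To make the split concrete, here is the framework applied to two of the workloads from the cost table. The checks are restated inline so the snippet runs standalone; the field names match the pseudocode above, and the task dicts are illustrative:

```python
def should_use_local_model(task: dict) -> bool:
    """Same checks as the decision framework above, condensed."""
    return all([
        task["type"] in ("classification", "extraction", "summarization", "moderation"),
        task["quality_requirement"] != "frontier",
        task["latency_tolerance_ms"] > 500,
        task["avg_context_tokens"] < 32_000,
        task["daily_volume"] > 100,
    ])

doc_classification = {
    "type": "classification",
    "quality_requirement": "good_enough",
    "latency_tolerance_ms": 2_000,
    "avg_context_tokens": 800,
    "daily_volume": 2_000,
}
agent_workflow = {
    "type": "agentic_workflow",
    "quality_requirement": "frontier",
    "latency_tolerance_ms": 5_000,
    "avg_context_tokens": 3_000,
    "daily_volume": 200,
}

should_use_local_model(doc_classification)  # True: routed to the local model
should_use_local_model(agent_workflow)      # False: stays on the API
```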
Local Inference Setup
I use Ollama for development and vLLM for production:
Docker Compose for vLLM
```yaml
# docker-compose.llm.yml
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia  # or use a CPU-only image for smaller models
    ports:
      - "8001:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    command: >
      --model mistralai/Mistral-7B-Instruct-v0.3
      --max-model-len 8192
      --gpu-memory-utilization 0.85
      --dtype auto
      --api-key local-dev-key
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped

  ollama:
    # Dev/CPU fallback: useful for local development
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    restart: unless-stopped

volumes:
  ollama-data:
```
OpenAI-Compatible API (This Is the Key Part)
vLLM exposes an OpenAI-compatible endpoint. Your existing code barely changes:
```python
from openai import AsyncOpenAI

# Production: route to local vLLM instance
local_client = AsyncOpenAI(
    base_url="http://vllm:8000/v1",
    api_key="local-dev-key",  # must match the --api-key flag vLLM was started with
)

# Keep the real OpenAI client for tasks that need it
api_client = AsyncOpenAI()  # Uses the OPENAI_API_KEY env var
```
No SDK changes. Same chat.completions.create() call. (Prompts do need re-tuning, though; more on that below.)
The Routing Layer
I use a thin routing layer between application code and the model clients. It picks the right model based on the task and handles fallback if local inference goes down:
```python
from dataclasses import dataclass
from enum import Enum

from openai import AsyncOpenAI
from openai.types.chat import ChatCompletion


class ModelTier(Enum):
    LOCAL = "local"
    API_FAST = "api_fast"
    API_FRONTIER = "api_frontier"


@dataclass
class ModelConfig:
    client: AsyncOpenAI
    model: str
    max_tokens: int = 2048
    temperature: float = 0.1


class ModelRouter:
    """
    Routes AI tasks to the appropriate model based on task requirements.
    Falls back to the API if the local model is unavailable.
    """

    def __init__(self, local_client: AsyncOpenAI, api_client: AsyncOpenAI):
        self._models: dict[ModelTier, ModelConfig] = {
            ModelTier.LOCAL: ModelConfig(
                client=local_client,
                model="mistralai/Mistral-7B-Instruct-v0.3",
                max_tokens=2048,
            ),
            ModelTier.API_FAST: ModelConfig(
                client=api_client,
                model="gpt-4o-mini",
                max_tokens=4096,
            ),
            ModelTier.API_FRONTIER: ModelConfig(
                client=api_client,
                model="gpt-4o",
                max_tokens=4096,
            ),
        }
        self._local_healthy = True

    async def complete(
        self,
        messages: list[dict],
        tier: ModelTier = ModelTier.LOCAL,
        fallback: ModelTier = ModelTier.API_FAST,
        **kwargs,
    ) -> ChatCompletion:
        """
        Send a completion request to the specified tier.
        Automatically falls back if the primary tier is unavailable.
        """
        # If local is unhealthy, skip straight to the fallback tier.
        # (A periodic health probe should reset the flag; otherwise local
        # stays skipped forever.)
        effective_tier = tier
        if tier == ModelTier.LOCAL and not self._local_healthy:
            effective_tier = fallback
        config = self._models[effective_tier]

        try:
            response = await config.client.chat.completions.create(
                model=config.model,
                messages=messages,
                max_tokens=config.max_tokens,
                temperature=config.temperature,
                **kwargs,
            )
            # Mark local healthy only if local actually served the request
            if effective_tier == ModelTier.LOCAL:
                self._local_healthy = True
            return response
        except Exception:
            if effective_tier == ModelTier.LOCAL:
                self._local_healthy = False
                # Fall back to the API
                fallback_config = self._models[fallback]
                return await fallback_config.client.chat.completions.create(
                    model=fallback_config.model,
                    messages=messages,
                    max_tokens=fallback_config.max_tokens,
                    temperature=fallback_config.temperature,
                    **kwargs,
                )
            raise
```
In practice:
```python
router = ModelRouter(local_client=local_client, api_client=api_client)

# Classification: runs locally, falls back to the API if vLLM is down
result = await router.complete(
    messages=[
        {"role": "system", "content": "Classify the following text into one of: billing, technical, general."},
        {"role": "user", "content": customer_message},
    ],
    tier=ModelTier.LOCAL,
)

# Agentic workflow: always use the frontier model
result = await router.complete(
    messages=agent_messages,
    tier=ModelTier.API_FRONTIER,
    tools=tool_definitions,
)
```
The fallback is critical. Local inference will go down at some point: model updates, GPU OOM, container restarts. Your features can’t go down with it. The router falls back to the API quietly, logs the incident, and you fix the local setup later. No user impact.
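One gap worth closing: once the router marks local inference unhealthy, something has to flip the flag back when vLLM recovers. A minimal sketch of a recovery probe, assuming the host/port from the compose file above; a bare TCP connect is a cheap liveness signal (a real probe could hit vLLM's `/health` endpoint instead):

```python
import asyncio

LOCAL_HOST, LOCAL_PORT = "vllm", 8000
PROBE_INTERVAL_S = 30


def probe_due(last_probe_ts: float, now: float,
              interval_s: float = PROBE_INTERVAL_S) -> bool:
    """True when enough time has passed to re-check local health."""
    return (now - last_probe_ts) >= interval_s


async def tcp_health_check(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Cheap liveness check: can we open a TCP connection to the server?"""
    try:
        reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout=timeout_s
        )
        writer.close()
        await writer.wait_closed()
        return True
    except (OSError, asyncio.TimeoutError):
        return False
```

Run the check on a background task every `PROBE_INTERVAL_S` seconds and set the router's `_local_healthy` flag from the result, so traffic drifts back to local automatically after a restart.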
Actual Cost Comparison
Three months of data with the hybrid setup:
Before (All API)
| | Monthly |
|---|---|
| OpenAI API costs | ~$380 |
| Infrastructure | $0 (serverless) |
| Total | ~$380 |
After (Hybrid: Local + API)
| | Monthly |
|---|---|
| OpenAI API (agentic + complex only) | ~$70 |
| GPU server (RTX 4090, dedicated) | ~$80 |
| Electricity / overhead | ~$5 |
| Total | ~$155 |
Net savings: ~$225/month, or 59%.
But honestly, the dollar savings aren’t the biggest win. Once local inference is running, new AI features become essentially free at the margin. That classification pipeline for a new feature? Zero incremental cost. The summarization endpoint a PM wants to prototype? Ship it in an afternoon. No budget approvals.
The psychological shift is real. Teams stop rationing AI capabilities and start experimenting.
Things That Almost Burned Me
Prompt Compatibility
Prompts tuned for GPT-4o do not produce identical results on Mistral 7B. I burned a week re-tuning prompts. Budget for this.
```python
# GPT-4o prompt (works great)
CLASSIFY_PROMPT_GPT = """Classify this text into exactly one category:
billing, technical, general. Respond with only the category name."""

# The same prompt on Mistral 7B often returns: "The category is: technical"
# You need to be more explicit:
CLASSIFY_PROMPT_LOCAL = """Classify this text into exactly one category.

Rules:
- Respond with ONLY one word
- The word must be one of: billing, technical, general
- Do not include any other text, punctuation, or explanation

Text: {text}

Category:"""
```
Quantization > Model Size
A 7B model quantized to Q4_K_M runs 3x faster than the full FP16 version with maybe 2-3% quality loss on classification tasks. For extraction and summarization, the difference is negligible. Don’t run full-precision models for routine tasks.
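The VRAM side of this is simple arithmetic: weight memory is roughly parameter count times bits per weight. A rough sketch, ignoring KV cache and activation overhead (which add a few GB on top), and assuming Q4_K_M averages around 4.85 bits per weight:

```python
def approx_weight_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Rough VRAM for model weights alone (no KV cache or activations)."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

fp16 = approx_weight_gb(7, 16)    # ~14 GB: tight on a 16 GB card
q4 = approx_weight_gb(7, 4.85)    # ~4.2 GB: plenty of headroom
```

This is why a quantized 7B fits comfortably on a consumer GPU with room left over for batching, while the FP16 version barely squeezes in.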
Cold Start
First request after a container restart takes 15-30 seconds while the model loads into GPU memory. I added a warmup probe:
```python
from openai import AsyncOpenAI


async def warmup_local_model(client: AsyncOpenAI) -> bool:
    """Send a dummy request to load the model into GPU memory."""
    try:
        await client.chat.completions.create(
            model="mistralai/Mistral-7B-Instruct-v0.3",
            messages=[{"role": "user", "content": "Hello"}],
            max_tokens=5,
        )
        return True
    except Exception:
        return False
```
You Must Monitor Quality
You need to track local model quality over time. I log every response with the task type and periodically sample for accuracy:
```python
import json
import logging
from datetime import datetime, timezone

from openai.types.chat import ChatCompletion

logger = logging.getLogger("llm_monitor")


async def monitored_complete(
    router: ModelRouter,
    messages: list[dict],
    task_type: str,
    tier: ModelTier,
    **kwargs,
) -> ChatCompletion:
    """Wrapper that logs all completions for quality monitoring."""
    start = datetime.now(timezone.utc)
    response = await router.complete(messages=messages, tier=tier, **kwargs)
    duration_ms = (datetime.now(timezone.utc) - start).total_seconds() * 1000
    logger.info(
        json.dumps({
            "task_type": task_type,
            "tier": tier.value,
            "model": response.model,
            "tokens_in": response.usage.prompt_tokens if response.usage else 0,
            "tokens_out": response.usage.completion_tokens if response.usage else 0,
            "duration_ms": round(duration_ms, 1),
            "timestamp": start.isoformat(),
        })
    )
    return response
```
This data tells you when a model update silently dropped classification accuracy from 94% to 87%. Without it, you won’t notice until users complain.
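Turning the sampled logs into that accuracy number is a one-liner once a human has labeled the sample. A minimal sketch; the function name and pair format are my own, not from a library:

```python
def classification_accuracy(samples: list[tuple[str, str]]) -> float:
    """samples: (model_label, human_label) pairs from a periodic review."""
    if not samples:
        return 0.0
    correct = sum(1 for model_label, human_label in samples
                  if model_label == human_label)
    return correct / len(samples)

# Weekly review: two agreements, one disagreement
weekly_sample = [("billing", "billing"), ("technical", "technical"),
                 ("general", "billing")]
score = classification_accuracy(weekly_sample)  # 2/3
```

Wire the result into an alert (e.g. page if it drops below your validated baseline minus a few points) so a silent regression surfaces in days, not months.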
When NOT to Bother
I want to be specific about where local models aren’t worth it:
- Low volume. Under 100 calls/day, the infrastructure cost eclipses the API savings.
- Tool-calling workflows. Frontier models are significantly better at selecting and parameterizing tools correctly. A 7B model will hallucinate tool parameters and waste engineering time.
- Huge context windows. Anything over 32K tokens gets impractical on most local setups.
- Small teams. Local inference is another service to maintain. If you have 2-3 engineers, that operational tax might not be worth it.
- Compliance requirements. Some industries require SOC 2 certified inference. Self-hosting means you own that compliance burden.
The hybrid approach isn’t about eliminating API costs. It’s about spending intentionally.
Getting Started
If you want to try this:
1. Instrument first. Add cost tracking to every API call before you change anything. You can’t optimize what you can’t measure.
2. Identify candidates. Classification, extraction, summarization, and moderation are almost always good candidates for local models.
3. Start with one workload. Pick the highest-volume, lowest-complexity task. Get it running locally with the fallback pattern. Live with it for two weeks.
4. Evaluate quality. Compare local model outputs against API outputs on the same inputs. If accuracy is within your tolerance, expand.
5. Build the router. Once you’ve validated that local works, build the routing layer. Keep it simple: tier-based routing with automatic fallback.
6. Monitor continuously. Model quality degrades silently. Log everything, review weekly, set alerts on accuracy drops.
The whole migration took roughly three weeks of part-time effort. Infrastructure was the easy part. Prompt tuning and quality validation ate most of the time.
The AI cost landscape is shifting fast. A year ago, “just use the API” was always the right answer. Now open-weight models handle a real chunk of production workloads competently, and the serving tooling has gotten good.
But local LLMs aren’t a religion. Some tasks need frontier models. Some tasks are well-defined enough that a 7B model does the job at a fraction of the cost. It’s an engineering decision, not a philosophical one.
Measure your costs, figure out which workloads are over-served by the API, route intelligently, and always have a fallback. The exact savings depend on your mix, but for any team running multiple AI features, the hybrid approach is worth a look.
The $300/month I recovered was nice. The bigger deal was that cost stopped being a friction point for shipping new AI features. When inference is nearly free at the margin, people stop asking “can we afford AI here?” and start asking “why aren’t we using it?”