
Agentic AI in Production: What I Learned Building Tool-Calling Workflows

Production lessons from building agentic AI with OpenAI tool-calling and structured outputs. Architecture, failure modes, guardrails, cost controls, and everything the tutorials leave out.


Let me save you the pitch. An agentic AI system is a while loop. The LLM picks a function, calls it, reads the result, then decides whether to call another one or give you an answer. That’s it. The conference talks make it sound like something from science fiction. The reality is closer to a bash script with an expensive API call in the middle.

I’ve been building these at my current role — tool-calling chains hooked into internal APIs, structured outputs feeding into downstream services, LLM-powered workflows that people actually rely on daily. Getting the demo working took a week. Getting it production-ready? Months. And almost none of the hard problems were about the AI itself.


The while loop nobody talks about

Here’s the thing that surprised me when I first started: every tutorial shows you a single request-response. “Pass a tool definition, get a tool call back, done.” But real workflows loop. The model calls a tool, looks at what came back, decides it needs more data, calls another tool. Maybe it branches. Maybe it goes in circles for a while.

The mental model is dead simple:

User Request

LLM decides → Tool Call (e.g., query database, call API)

Tool returns result

LLM observes result → decides next action or responds

(repeat until done)

And “until done” is doing a lot of heavy lifting in that last step. Because “done” can mean the model gave a great answer, or that it burned through 40K tokens going nowhere and you capped it.

The skeleton looks like this:

import json
from collections.abc import Callable

import openai

client = openai.OpenAI()

def run_agent(
    system_prompt: str,
    user_message: str,
    tools: list[dict],
    tool_handlers: dict[str, Callable],
    max_iterations: int = 10,
) -> str:
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]

    for iteration in range(max_iterations):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto",
        )

        message = response.choices[0].message

        if not message.tool_calls:
            return message.content

        messages.append(message)
        for tool_call in message.tool_calls:
            fn_name = tool_call.function.name
            fn_args = json.loads(tool_call.function.arguments)

            handler = tool_handlers.get(fn_name)
            if handler is None:
                result = json.dumps({"error": f"Unknown tool: {fn_name}"})
            else:
                result = json.dumps(handler(**fn_args))

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result,
            })

    return "Agent reached maximum iterations without completing."

I had this running in staging within a day and felt genuinely clever about it. Then it went to production and I spent the next several weeks dealing with every way it can go wrong.
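To make the skeleton concrete, here is a minimal, hypothetical way to drive it. The `lookup_order` tool, its schema, and the stubbed handler are all invented for illustration, not tools from the actual system:

```python
# Hypothetical read-only tool definition in the OpenAI tools format.
LOOKUP_ORDER_TOOL = {
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Look up an order by its integer ID. Read-only.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "integer", "description": "Numeric order ID"},
            },
            "required": ["order_id"],
        },
    },
}

def handle_lookup_order(order_id: int) -> dict:
    # Stubbed handler; a real one would hit a database or internal API.
    return {"order_id": order_id, "status": "shipped"}

# With an API key configured, the loop would be driven like this:
# answer = run_agent(
#     system_prompt="You answer questions about orders using the tools provided.",
#     user_message="What's the status of order 4521?",
#     tools=[LOOKUP_ORDER_TOOL],
#     tool_handlers={"lookup_order": handle_lookup_order},
# )
```

The handler dict is the whole dispatch mechanism: the model picks a name, the loop looks it up, and anything the model invents falls through to the unknown-tool error.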


The model will lie to your tools

First thing that bit me: hallucinated arguments. The model will confidently pass a user ID that doesn’t exist, or a filter field your API doesn’t support, or JSON structured just differently enough from your schema to blow up downstream. It doesn’t “know” what your tools accept — it’s pattern-matching from the function description and hoping for the best.

After the third time I watched it pass {"user_id": "usr_abc123"} to a tool that expects an integer, I stopped trusting any of it and started validating everything with Pydantic:

from pydantic import BaseModel, ValidationError

class SearchParams(BaseModel):
    query: str
    max_results: int = 10
    filters: dict[str, str] | None = None

def handle_search(**kwargs) -> dict:
    try:
        params = SearchParams(**kwargs)
    except ValidationError as e:
        return {"error": f"Invalid parameters: {e.errors()}"}

    results = execute_search(params.query, params.max_results, params.filters)
    return {"results": results}

The nice thing is, when you return the validation error as the tool result, the model usually corrects itself on the next try. Usually. Sometimes it just tries the same garbage again, which brings me to the next problem.


Infinite loops and the $47 morning

Here’s a fun one. I came in one morning and a single workflow had run up $47 in API costs overnight because the model got stuck in a cycle — call Tool A, get a result it didn’t like, call Tool B hoping for something different, Tool B references Tool A’s output, loop back to Tool A. Rinse and repeat fourteen times until it hit the (too generous) iteration cap I’d set.

$47 doesn’t sound like a lot. Scale that across all users and it would’ve been a very expensive week.

Now I track the exact call signature — function name plus arguments — and kill repeated calls after three attempts:

from collections import Counter
from collections.abc import Callable

def run_agent_with_guards(
    system_prompt: str,
    user_message: str,
    tools: list[dict],
    tool_handlers: dict[str, Callable],
    max_iterations: int = 10,
    max_repeated_calls: int = 3,
) -> str:
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]
    call_history: list[str] = []

    for iteration in range(max_iterations):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto",
        )

        message = response.choices[0].message
        if not message.tool_calls:
            return message.content

        messages.append(message)
        for tool_call in message.tool_calls:
            fn_name = tool_call.function.name
            call_signature = f"{fn_name}:{tool_call.function.arguments}"

            call_history.append(call_signature)
            counts = Counter(call_history)
            if counts[call_signature] > max_repeated_calls:
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": json.dumps({
                        "error": "This tool has been called with identical arguments "
                        "multiple times. Try a different approach."
                    }),
                })
                continue

            handler = tool_handlers.get(fn_name)
            if handler is None:
                result = {"error": f"Unknown tool: {fn_name}"}
            else:
                result = handler(**json.loads(tool_call.function.arguments))

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result),
            })

    return "Agent exceeded iteration limit."

The key insight was that the model doesn’t have memory of what it already tried unless you tell it. It just sees the conversation history, and if that history is long enough, it forgets earlier failures. Detecting the loop from outside the model turned out to be way more reliable than asking the model to detect it itself.


Context windows fill up fast

Nobody warns you about this in the tutorials because their examples do one tool call and stop. In production, my workflows regularly hit 6-8 iterations. Each iteration adds the model’s response (with tool calls), plus the tool results. A single database query might return 50 rows as JSON. An API response could be a deeply nested object.

By iteration 5 or 6, the model starts “forgetting” the original task because the early messages have been pushed out of its effective attention. I watched it answer a completely different question than what the user asked because the context had swollen to 90% tool results.

I truncate everything now:

import json

MAX_TOOL_RESULT_CHARS = 2000

def truncate_tool_result(result: dict) -> str:
    serialized = json.dumps(result, default=str)
    if len(serialized) <= MAX_TOOL_RESULT_CHARS:
        return serialized

    if isinstance(result, dict) and "results" in result:
        items = result["results"]
        if isinstance(items, list) and len(items) > 5:
            result = {
                **result,
                "results": items[:5],
                "_truncated": True,
                "_total_count": len(items),
            }
            return json.dumps(result, default=str)

    return serialized[:MAX_TOOL_RESULT_CHARS] + "...[truncated]"

2000 characters feels aggressive. It is. But the model works better with less data than with more irrelevant data, which is the opposite of what my instincts told me when I started.


Write operations are where it gets scary

Read-only agentic workflows are honestly fine. Let the model search around, query different things, combine data — worst case it wastes some tokens. But the moment you give it tools that create, update, or delete records, every loop iteration becomes a potential disaster.

I had a situation where the model created a record, realized the data was wrong based on a subsequent tool call, tried to delete it, passed the wrong ID, and deleted something else instead. Nobody got paged because my validation caught it before the commit, but only barely. That was the week I redesigned the whole write path.

Now every write tool has a preview step. The model calls preview_create_record first, which returns exactly what would be created without actually doing it. Then a separate confirm_create_record tool does the actual mutation. It adds friction, but I sleep better:

TOOLS = [
    # Read — let it go wild
    {
        "type": "function",
        "function": {
            "name": "search_records",
            "description": "Search for records. Read-only, no side effects.",
            "parameters": { ... },
        },
    },
    # Write: split into a preview phase and a confirm phase
    {
        "type": "function",
        "function": {
            "name": "preview_create_record",
            "description": "Show exactly what record would be created, without "
            "creating it. Always call this before confirm_create_record.",
            "parameters": { ... },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "confirm_create_record",
            "description": "Create the record shown by preview_create_record. "
            "IMPORTANT: never call this without a preceding preview.",
            "parameters": { ... },
        },
    },
]

This two-phase pattern probably added 30% more code. Worth every line.
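For concreteness, here is a minimal sketch of what the preview/confirm pair might look like on the handler side. The in-memory stores and the `preview_id` token scheme are assumptions for illustration; a real implementation would stage writes in a database:

```python
import uuid

# In-memory stores standing in for real persistence -- an assumption for
# this sketch, not the actual storage layer.
_pending: dict[str, dict] = {}
_records: dict[str, dict] = {}

def preview_create_record(**fields) -> dict:
    """Phase 1: stage the write and show exactly what would be created."""
    preview_id = str(uuid.uuid4())
    _pending[preview_id] = fields
    return {"preview_id": preview_id, "would_create": fields, "committed": False}

def confirm_create_record(preview_id: str) -> dict:
    """Phase 2: commit only a write that was previously previewed."""
    fields = _pending.pop(preview_id, None)
    if fields is None:
        return {"error": f"No pending preview with id {preview_id}"}
    record_id = str(uuid.uuid4())
    _records[record_id] = fields
    return {"record_id": record_id, "committed": True}
```

The point of the token is that the confirm step can only replay something the preview step produced — the model cannot invent a valid `preview_id`, so a hallucinated confirm fails safely.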


Costs scale in ways you don’t expect

One thing I learned from years of building data pipelines at previous roles: you have to instrument costs before they surprise you. But LLM costs have a particularly nasty scaling profile.

A normal API call costs fractions of a cent. An agentic chain with 8 iterations, each with a growing message history? That can easily hit $0.30-0.50 per request. You don’t notice it until you’re looking at the invoice and wondering what happened.

I budget tokens per request now, hard:

class TokenBudget:
    def __init__(self, max_tokens: int = 50_000):
        self.max_tokens = max_tokens
        self.tokens_used = 0
        # Token counts come from the API's `usage` field, so no local
        # tokenizer is needed here.

    def track(self, response) -> None:
        usage = response.usage
        self.tokens_used += usage.total_tokens

    def check(self) -> bool:
        return self.tokens_used < self.max_tokens

    @property
    def remaining(self) -> int:
        return max(0, self.max_tokens - self.tokens_used)

Different workflow types get different budgets. A simple lookup gets 10K. A complex multi-step chain gets 50K. When the budget runs out, the agent returns whatever it has so far with an honest “I ran out of budget” message. The user can retry if they want, but at least the system won’t quietly rack up charges.
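As a sketch of how the budget plugs into the loop: call `check()` before each API request and bail out when it fails. The abbreviated `TokenBudget` below repeats just the fields needed so the example runs standalone, and a `SimpleNamespace` stub stands in for the OpenAI response object:

```python
from types import SimpleNamespace

class TokenBudget:
    # Abbreviated copy of the class above so this sketch runs on its own.
    def __init__(self, max_tokens: int = 50_000):
        self.max_tokens = max_tokens
        self.tokens_used = 0

    def track(self, response) -> None:
        self.tokens_used += response.usage.total_tokens

    def check(self) -> bool:
        return self.tokens_used < self.max_tokens

# Stub for an API response -- only the `usage` attribute matters here.
fake_response = SimpleNamespace(usage=SimpleNamespace(total_tokens=6_000))

budget = TokenBudget(max_tokens=10_000)
budget.track(fake_response)
assert budget.check()      # 6K used, still under budget -- keep looping
budget.track(fake_response)
assert not budget.check()  # 12K used -- the loop should return early here
```

In the real loop, the failed check is where the agent returns its partial answer with the “I ran out of budget” message instead of making another call.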


Structured outputs changed everything for me

Before OpenAI added structured outputs, I was parsing model responses with regex. Sometimes the model would return JSON. Sometimes it’d wrap it in markdown code fences. Sometimes it’d add a chatty preamble before the JSON. I had a regex that handled like seven variations and it still broke maybe once a week.

Structured outputs just… fixed this:

from pydantic import BaseModel

class AnalysisResult(BaseModel):
    summary: str
    confidence: float
    categories: list[str]
    recommended_actions: list[str]
    requires_human_review: bool

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=messages,
    response_format=AnalysisResult,
)

result: AnalysisResult = response.choices[0].message.parsed

You define a Pydantic model, the API guarantees the response matches it. No parsing. No “hope it’s valid JSON.” The schema becomes the contract between your AI layer and the rest of your backend, and if you think about it that way from the start — schema first, prompt second — the integration is so much cleaner.


What I actually look at in monitoring

After building validation and monitoring into the AI systems at work, I’m pretty opinionated about what matters. Here’s the tracking I settled on:

import time
import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)

@dataclass
class AgentMetrics:
    workflow_id: str
    start_time: float = field(default_factory=time.time)
    iterations: int = 0
    tool_calls: list[str] = field(default_factory=list)
    total_tokens: int = 0
    errors: list[str] = field(default_factory=list)
    completed: bool = False

    def log_tool_call(self, tool_name: str, duration_ms: float) -> None:
        self.tool_calls.append(tool_name)
        logger.info(
            "agent.tool_call",
            extra={
                "workflow_id": self.workflow_id,
                "tool": tool_name,
                "duration_ms": duration_ms,
                "iteration": self.iterations,
            },
        )

    def finalize(self, success: bool) -> None:
        self.completed = success
        duration = time.time() - self.start_time
        logger.info(
            "agent.completed",
            extra={
                "workflow_id": self.workflow_id,
                "success": success,
                "iterations": self.iterations,
                "total_tool_calls": len(self.tool_calls),
                "total_tokens": self.total_tokens,
                "duration_seconds": round(duration, 2),
                "errors": self.errors,
            },
        )

The metrics I ended up caring about most:

  • Iterations per request — anything over 6 is a smell. Over 10 means something’s wrong with the prompt or the tools.
  • Tokens per request — this is your money. Track it per workflow type.
  • Completion rate — what percentage of agent runs actually finish vs. hitting the iteration cap. Mine was 73% initially. Fixing prompt issues and adding better tool descriptions brought it to 94%.
  • Error rate by tool — one of my tools was failing 12% of the time due to a flaky upstream API. I only found this because I was tracking it per-tool.

I ran these metrics for about a month before touching anything. The patterns were obvious once I had the data — certain prompt phrasings caused an extra 2-3 iterations on average, and two tool combinations triggered loops consistently. Data first, optimization second.
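Computing the headline numbers from a pile of finished runs is straightforward. This aggregation sketch uses a stripped-down record type and the over-6-iterations threshold from the list above; the names are illustrative, not the exact production code:

```python
from dataclasses import dataclass

# Minimal stand-in for finished AgentMetrics records -- only the fields
# needed for the headline numbers.
@dataclass
class RunSummary:
    iterations: int
    total_tokens: int
    completed: bool

def summarize(runs: list[RunSummary]) -> dict:
    n = len(runs)
    return {
        # Completion rate: finished runs vs. runs that hit the cap.
        "completion_rate": sum(r.completed for r in runs) / n,
        "avg_iterations": sum(r.iterations for r in runs) / n,
        "avg_tokens": sum(r.total_tokens for r in runs) / n,
        # "Smell" flag from the thresholds above: anything over 6 iterations.
        "high_iteration_runs": sum(r.iterations > 6 for r in runs),
    }
```

Dumping numbers like these per workflow type, per week, is what made the bad prompt phrasings and loop-prone tool combinations visible.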


Prompting for agents is a different skill

I’ve written a lot of prompts. Single-turn prompts are hard enough — you’re fighting vagueness and edge cases. Agent prompts are harder because you’re not just controlling output format, you’re controlling behavior over time. The model is making a sequence of decisions, and each one can compound into a mess.

Here’s a prompt structure I’ve landed on that works:

You are a data analysis assistant with access to the following tools.

RULES:
1. Always search before creating. Check if the record exists first.
2. Never call a write tool without first previewing the changes.
3. If a tool returns an error, try a different approach — do not retry
   the same call with the same arguments.
4. If you cannot complete the task in 5 tool calls, summarize what you've
   found so far and explain what's blocking you.
5. Respond with structured data only — no conversational filler.

The critical bits are the stopping conditions and the error-handling instructions. Without rule 3, the model will retry failed calls forever. Without rule 4, it’ll churn until max iterations. I learned both of these the expensive way.

Vague prompts like “be helpful and thorough” are actively harmful here. The model wants to be thorough. It’ll call every tool it has access to just to be comprehensive. You need to tell it when to stop.


Most of my AI features aren’t agents

This is maybe the most useful thing I can share. After spending months building agentic workflows, I’ve become way more selective about when I actually use them. I’ve refactored two workflows back to single-turn prompts because the agent loop wasn’t doing anything — the model called one tool, got the result, and responded. Every single time. That’s not an agent. That’s an expensive function call.

Agents make sense when the model genuinely needs to explore — when the next step depends on what came back from the previous one, when the branching is unpredictable, when a human doing the same task would need to poke around before having an answer.

For everything else — and this is most things — just gather the context yourself, stuff it into a single prompt, and use structured outputs. It’s faster (no iteration latency), cheaper (one API call instead of eight), and way easier to debug when something goes wrong.
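A sketch of that single-call pattern, with hypothetical names throughout (`TicketTriage`, `build_prompt`, and the ticket domain are invented for illustration): gather the context with ordinary code, build one prompt, make one structured-output call.

```python
from pydantic import BaseModel

class TicketTriage(BaseModel):
    priority: str
    category: str
    summary: str

def build_prompt(ticket_text: str, recent_tickets: list[str]) -> str:
    # Gather the context yourself instead of letting an agent fetch it:
    # everything the model needs goes into one prompt, up front.
    history = "\n".join(f"- {t}" for t in recent_tickets)
    return (
        "Triage this support ticket.\n\n"
        f"Ticket:\n{ticket_text}\n\n"
        f"Recent related tickets:\n{history}"
    )

# One call, one structured answer, no loop (requires an API key to run):
# response = client.beta.chat.completions.parse(
#     model="gpt-4o",
#     messages=[{"role": "user", "content": build_prompt(text, recent)}],
#     response_format=TicketTriage,
# )
# triage: TicketTriage = response.choices[0].message.parsed
```

One request instead of eight, a fixed prompt you can snapshot-test, and a typed result at the end.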


If I had to boil all of this down: treat the LLM like any other external service with an unreliable API. Validate inputs. Set timeouts. Budget costs. Monitor everything. The model itself is honestly the easy part. The hard part is all the boring engineering around it — the same input validation and error handling and observability work that made me a decent backend engineer long before I touched any AI code.

The gap between a demo and production really is just that: regular engineering. Nothing glamorous. Nothing that’ll make a good conference talk. But it’s the difference between a system you show your boss and a system you can actually leave running overnight without checking your phone.


Salih "Adam" Yildirim

Full Stack Software Engineer with 6+ years of experience building scalable web and mobile applications. Passionate about clean code, modern architecture, and sharing knowledge.
