
Agentic AI in Production: What I Learned Building Tool-Calling Workflows

Production lessons from building agentic AI with OpenAI tool-calling and structured outputs. Architecture, failure modes, guardrails, cost controls, and everything the tutorials leave out.


Let me save you the pitch. An agentic AI system is a while loop. The LLM picks a function, calls it, reads the result, then decides whether to call another one or give you an answer. That’s it. The conference talks make it sound like something from science fiction. The reality is closer to a bash script with an expensive API call in the middle.

I’ve been building these at my current role — tool-calling chains hooked into internal APIs, structured outputs feeding into downstream services, LLM-powered workflows that people actually rely on daily. Getting the demo working took a week. Getting it production-ready? Months. And almost none of the hard problems were about the AI itself.


The while loop nobody talks about

Here’s the thing that surprised me when I first started: every tutorial shows you a single request-response. “Pass a tool definition, get a tool call back, done.” But real workflows loop. The model calls a tool, looks at what came back, decides it needs more data, calls another tool. Maybe it branches. Maybe it goes in circles for a while.

The mental model is dead simple:

User Request

LLM decides → Tool Call (e.g., query database, call API)

Tool returns result

LLM observes result → decides next action or responds

(repeat until done)

And “until done” is doing a lot of heavy lifting in that last step. Because “done” can mean the model gave a great answer, or that it burned through 40K tokens going nowhere and you capped it.

The skeleton looks like this:

import json
from collections.abc import Callable

import openai

client = openai.OpenAI()

def run_agent(
    system_prompt: str,
    user_message: str,
    tools: list[dict],
    tool_handlers: dict[str, Callable],
    max_iterations: int = 10,
) -> str:
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]

    for iteration in range(max_iterations):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto",
        )

        message = response.choices[0].message

        if not message.tool_calls:
            return message.content

        messages.append(message)
        for tool_call in message.tool_calls:
            fn_name = tool_call.function.name
            fn_args = json.loads(tool_call.function.arguments)

            handler = tool_handlers.get(fn_name)
            if handler is None:
                result = json.dumps({"error": f"Unknown tool: {fn_name}"})
            else:
                result = json.dumps(handler(**fn_args))

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result,
            })

    return "Agent reached maximum iterations without completing."

I had this running in staging within a day and felt genuinely clever about it. Then it went to production and I spent the next several weeks dealing with every way it can go wrong.
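To make the skeleton concrete, here is a minimal, hypothetical way to drive it. The `lookup_order` tool, its schema, and the stubbed handler are all invented for illustration, not tools from the actual system:

```python
# Hypothetical read-only tool definition in the OpenAI tools format.
LOOKUP_ORDER_TOOL = {
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Look up an order by its integer ID. Read-only.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "integer", "description": "Numeric order ID"},
            },
            "required": ["order_id"],
        },
    },
}

def handle_lookup_order(order_id: int) -> dict:
    # Stubbed handler; a real one would hit a database or internal API.
    return {"order_id": order_id, "status": "shipped"}

# With an API key configured, the loop would be driven like this:
# answer = run_agent(
#     system_prompt="You answer questions about orders using the tools provided.",
#     user_message="What's the status of order 4521?",
#     tools=[LOOKUP_ORDER_TOOL],
#     tool_handlers={"lookup_order": handle_lookup_order},
# )
```

The handler dict is the whole dispatch mechanism: the model picks a name, the loop looks it up, and anything the model invents falls through to the unknown-tool error.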


The model will lie to your tools

First thing that bit me: hallucinated arguments. The model will confidently pass a user ID that doesn’t exist, or a filter field your API doesn’t support, or JSON structured just differently enough from your schema to blow up downstream. It doesn’t “know” what your tools accept — it’s pattern-matching from the function description and hoping for the best.

After the third time I watched it pass {"user_id": "usr_abc123"} to a tool that expects an integer, I stopped trusting any of it and started validating everything with Pydantic:

from pydantic import BaseModel, ValidationError

class SearchParams(BaseModel):
    query: str
    max_results: int = 10
    filters: dict[str, str] | None = None

def handle_search(**kwargs) -> dict:
    try:
        params = SearchParams(**kwargs)
    except ValidationError as e:
        return {"error": f"Invalid parameters: {e.errors()}"}

    results = execute_search(params.query, params.max_results, params.filters)
    return {"results": results}

The nice thing is, when you return the validation error as the tool result, the model usually corrects itself on the next try. Usually. Sometimes it just tries the same garbage again, which brings me to the next problem.


Infinite loops and the $47 morning

Here’s a fun one. I came in one morning and a single workflow had run up $47 in API costs overnight because the model got stuck in a cycle — call Tool A, get a result it didn’t like, call Tool B hoping for something different, Tool B references Tool A’s output, loop back to Tool A. Rinse and repeat fourteen times until it hit the (too generous) iteration cap I’d set.

$47 doesn’t sound like a lot. Scale that across all users and it would’ve been a very expensive week.

Now I track the exact call signature — function name plus arguments — and kill repeated calls after three attempts:

from collections import Counter
from collections.abc import Callable

def run_agent_with_guards(
    system_prompt: str,
    user_message: str,
    tools: list[dict],
    tool_handlers: dict[str, Callable],
    max_iterations: int = 10,
    max_repeated_calls: int = 3,
) -> str:
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]
    call_history: list[str] = []

    for iteration in range(max_iterations):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto",
        )

        message = response.choices[0].message
        if not message.tool_calls:
            return message.content

        messages.append(message)
        for tool_call in message.tool_calls:
            fn_name = tool_call.function.name
            call_signature = f"{fn_name}:{tool_call.function.arguments}"

            call_history.append(call_signature)
            counts = Counter(call_history)
            if counts[call_signature] > max_repeated_calls:
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": json.dumps({
                        "error": "This tool has been called with identical arguments "
                        "multiple times. Try a different approach."
                    }),
                })
                continue

            handler = tool_handlers.get(fn_name)
            if handler is None:
                result = {"error": f"Unknown tool: {fn_name}"}
            else:
                result = handler(**json.loads(tool_call.function.arguments))

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result),
            })

    return "Agent exceeded iteration limit."

The key insight was that the model doesn’t have memory of what it already tried unless you tell it. It just sees the conversation history, and if that history is long enough, it forgets earlier failures. Detecting the loop from outside the model turned out to be way more reliable than asking the model to detect it itself.


Context windows fill up fast

Nobody warns you about this in the tutorials because their examples do one tool call and stop. In production, my workflows regularly hit 6-8 iterations. Each iteration adds the model’s response (with tool calls), plus the tool results. A single database query might return 50 rows as JSON. An API response could be a deeply nested object.

By iteration 5 or 6, the model starts “forgetting” the original task because the early messages have been pushed out of its effective attention. I watched it answer a completely different question than what the user asked because the context had swollen to 90% tool results.

I truncate everything now:

import json

MAX_TOOL_RESULT_CHARS = 2000

def truncate_tool_result(result: dict) -> str:
    serialized = json.dumps(result, default=str)
    if len(serialized) <= MAX_TOOL_RESULT_CHARS:
        return serialized

    if isinstance(result, dict) and "results" in result:
        items = result["results"]
        if isinstance(items, list) and len(items) > 5:
            result = {
                **result,
                "results": items[:5],
                "_truncated": True,
                "_total_count": len(items),
            }
            return json.dumps(result, default=str)

    return serialized[:MAX_TOOL_RESULT_CHARS] + "...[truncated]"

2000 characters feels aggressive. It is. But the model works better with less data than with more irrelevant data, which is the opposite of what my instincts told me when I started.


Write operations are where it gets scary

Read-only agentic workflows are honestly fine. Let the model search around, query different things, combine data — worst case it wastes some tokens. But the moment you give it tools that create, update, or delete records, every loop iteration becomes a potential disaster.

I had a situation where the model created a record, realized the data was wrong based on a subsequent tool call, tried to delete it, passed the wrong ID, and deleted something else instead. Nobody got paged because my validation caught it before the commit, but only barely. That was the week I redesigned the whole write path.

Now every write tool has a preview step. The model calls preview_create_record first, which returns exactly what would be created without actually doing it. Then a separate confirm_create_record tool does the actual mutation. It adds friction, but I sleep better:

TOOLS = [
    # Read — let it go wild
    {
        "type": "function",
        "function": {
            "name": "search_records",
            "description": "Search for records. Read-only, no side effects.",
            "parameters": { ... },
        },
    },
    # Write: split into a preview phase and a confirm phase
    {
        "type": "function",
        "function": {
            "name": "preview_create_record",
            "description": "Show exactly what record would be created, without "
            "creating it. Always call this before confirm_create_record.",
            "parameters": { ... },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "confirm_create_record",
            "description": "Create the record shown by preview_create_record. "
            "IMPORTANT: never call this without a preceding preview.",
            "parameters": { ... },
        },
    },
]

This two-phase pattern probably added 30% more code. Worth every line.
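For concreteness, here is a minimal sketch of what the preview/confirm pair might look like on the handler side. The in-memory stores and the `preview_id` token scheme are assumptions for illustration; a real implementation would stage writes in a database:

```python
import uuid

# In-memory stores standing in for real persistence -- an assumption for
# this sketch, not the actual storage layer.
_pending: dict[str, dict] = {}
_records: dict[str, dict] = {}

def preview_create_record(**fields) -> dict:
    """Phase 1: stage the write and show exactly what would be created."""
    preview_id = str(uuid.uuid4())
    _pending[preview_id] = fields
    return {"preview_id": preview_id, "would_create": fields, "committed": False}

def confirm_create_record(preview_id: str) -> dict:
    """Phase 2: commit only a write that was previously previewed."""
    fields = _pending.pop(preview_id, None)
    if fields is None:
        return {"error": f"No pending preview with id {preview_id}"}
    record_id = str(uuid.uuid4())
    _records[record_id] = fields
    return {"record_id": record_id, "committed": True}
```

The point of the token is that the confirm step can only replay something the preview step produced — the model cannot invent a valid `preview_id`, so a hallucinated confirm fails safely.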


Costs scale in ways you don’t expect

One thing I learned from years of building data pipelines at previous roles: you have to instrument costs before they surprise you. But LLM costs have a particularly nasty scaling profile.

A normal API call costs fractions of a cent. An agentic chain with 8 iterations, each with a growing message history? That can easily hit $0.30-0.50 per request. You don’t notice it until you’re looking at the invoice and wondering what happened.

I budget tokens per request now, hard:

class TokenBudget:
    def __init__(self, max_tokens: int = 50_000):
        self.max_tokens = max_tokens
        self.tokens_used = 0
        # Token counts come from the API's `usage` field, so no local
        # tokenizer is needed here.

    def track(self, response) -> None:
        usage = response.usage
        self.tokens_used += usage.total_tokens

    def check(self) -> bool:
        return self.tokens_used < self.max_tokens

    @property
    def remaining(self) -> int:
        return max(0, self.max_tokens - self.tokens_used)

Different workflow types get different budgets. A simple lookup gets 10K. A complex multi-step chain gets 50K. When the budget runs out, the agent returns whatever it has so far with an honest “I ran out of budget” message. The user can retry if they want, but at least the system won’t quietly rack up charges.
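As a sketch of how the budget plugs into the loop: call `check()` before each API request and bail out when it fails. The abbreviated `TokenBudget` below repeats just the fields needed so the example runs standalone, and a `SimpleNamespace` stub stands in for the OpenAI response object:

```python
from types import SimpleNamespace

class TokenBudget:
    # Abbreviated copy of the class above so this sketch runs on its own.
    def __init__(self, max_tokens: int = 50_000):
        self.max_tokens = max_tokens
        self.tokens_used = 0

    def track(self, response) -> None:
        self.tokens_used += response.usage.total_tokens

    def check(self) -> bool:
        return self.tokens_used < self.max_tokens

# Stub for an API response -- only the `usage` attribute matters here.
fake_response = SimpleNamespace(usage=SimpleNamespace(total_tokens=6_000))

budget = TokenBudget(max_tokens=10_000)
budget.track(fake_response)
assert budget.check()      # 6K used, still under budget -- keep looping
budget.track(fake_response)
assert not budget.check()  # 12K used -- the loop should return early here
```

In the real loop, the failed check is where the agent returns its partial answer with the “I ran out of budget” message instead of making another call.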


Structured outputs changed everything for me

Before OpenAI added structured outputs, I was parsing model responses with regex. Sometimes the model would return JSON. Sometimes it’d wrap it in markdown code fences. Sometimes it’d add a chatty preamble before the JSON. I had a regex that handled like seven variations and it still broke maybe once a week.

Structured outputs just… fixed this:

from pydantic import BaseModel

class AnalysisResult(BaseModel):
    summary: str
    confidence: float
    categories: list[str]
    recommended_actions: list[str]
    requires_human_review: bool

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=messages,
    response_format=AnalysisResult,
)

result: AnalysisResult = response.choices[0].message.parsed

You define a Pydantic model, the API guarantees the response matches it. No parsing. No “hope it’s valid JSON.” The schema becomes the contract between your AI layer and the rest of your backend, and if you think about it that way from the start — schema first, prompt second — the integration is so much cleaner.


What I actually look at in monitoring

After building validation and monitoring into the AI systems at work, I’m pretty opinionated about what matters. Here’s the tracking I settled on:

import time
import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)

@dataclass
class AgentMetrics:
    workflow_id: str
    start_time: float = field(default_factory=time.time)
    iterations: int = 0
    tool_calls: list[str] = field(default_factory=list)
    total_tokens: int = 0
    errors: list[str] = field(default_factory=list)
    completed: bool = False

    def log_tool_call(self, tool_name: str, duration_ms: float) -> None:
        self.tool_calls.append(tool_name)
        logger.info(
            "agent.tool_call",
            extra={
                "workflow_id": self.workflow_id,
                "tool": tool_name,
                "duration_ms": duration_ms,
                "iteration": self.iterations,
            },
        )

    def finalize(self, success: bool) -> None:
        self.completed = success
        duration = time.time() - self.start_time
        logger.info(
            "agent.completed",
            extra={
                "workflow_id": self.workflow_id,
                "success": success,
                "iterations": self.iterations,
                "total_tool_calls": len(self.tool_calls),
                "total_tokens": self.total_tokens,
                "duration_seconds": round(duration, 2),
                "errors": self.errors,
            },
        )

The metrics I ended up caring about most:

  • Iterations per request — anything over 6 is a smell. Over 10 means something’s wrong with the prompt or the tools.
  • Tokens per request — this is your money. Track it per workflow type.
  • Completion rate — what percentage of agent runs actually finish vs. hitting the iteration cap. Mine was 73% initially. Fixing prompt issues and adding better tool descriptions brought it to 94%.
  • Error rate by tool — one of my tools was failing 12% of the time due to a flaky upstream API. I only found this because I was tracking it per-tool.

I ran these metrics for about a month before touching anything. The patterns were obvious once I had the data — certain prompt phrasings caused an extra 2-3 iterations on average, and two tool combinations triggered loops consistently. Data first, optimization second.
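Computing the headline numbers from a pile of finished runs is straightforward. This aggregation sketch uses a stripped-down record type and the over-6-iterations threshold from the list above; the names are illustrative, not the exact production code:

```python
from dataclasses import dataclass

# Minimal stand-in for finished AgentMetrics records -- only the fields
# needed for the headline numbers.
@dataclass
class RunSummary:
    iterations: int
    total_tokens: int
    completed: bool

def summarize(runs: list[RunSummary]) -> dict:
    n = len(runs)
    return {
        # Completion rate: finished runs vs. runs that hit the cap.
        "completion_rate": sum(r.completed for r in runs) / n,
        "avg_iterations": sum(r.iterations for r in runs) / n,
        "avg_tokens": sum(r.total_tokens for r in runs) / n,
        # "Smell" flag from the thresholds above: anything over 6 iterations.
        "high_iteration_runs": sum(r.iterations > 6 for r in runs),
    }
```

Dumping numbers like these per workflow type, per week, is what made the bad prompt phrasings and loop-prone tool combinations visible.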


Prompting for agents is a different skill

I’ve written a lot of prompts. Single-turn prompts are hard enough — you’re fighting vagueness and edge cases. Agent prompts are harder because you’re not just controlling output format, you’re controlling behavior over time. The model is making a sequence of decisions, and each one can compound into a mess.

Here’s a prompt structure I’ve landed on that works:

You are a data analysis assistant with access to the following tools.

RULES:
1. Always search before creating. Check if the record exists first.
2. Never call a write tool without first previewing the changes.
3. If a tool returns an error, try a different approach — do not retry
   the same call with the same arguments.
4. If you cannot complete the task in 5 tool calls, summarize what you've
   found so far and explain what's blocking you.
5. Respond with structured data only — no conversational filler.

The critical bits are the stopping conditions and the error-handling instructions. Without rule 3, the model will retry failed calls forever. Without rule 4, it’ll churn until max iterations. I learned both of these the expensive way.

Vague prompts like “be helpful and thorough” are actively harmful here. The model wants to be thorough. It’ll call every tool it has access to just to be comprehensive. You need to tell it when to stop.


Most of my AI features aren’t agents

This is maybe the most useful thing I can share. After spending months building agentic workflows, I’ve become way more selective about when I actually use them. I’ve refactored two workflows back to single-turn prompts because the agent loop wasn’t doing anything — the model called one tool, got the result, and responded. Every single time. That’s not an agent. That’s an expensive function call.

Agents make sense when the model genuinely needs to explore — when the next step depends on what came back from the previous one, when the branching is unpredictable, when a human doing the same task would need to poke around before having an answer.

For everything else — and this is most things — just gather the context yourself, stuff it into a single prompt, and use structured outputs. It’s faster (no iteration latency), cheaper (one API call instead of eight), and way easier to debug when something goes wrong.
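A sketch of that single-call pattern, with hypothetical names throughout (`TicketTriage`, `build_prompt`, and the ticket domain are invented for illustration): gather the context with ordinary code, build one prompt, make one structured-output call.

```python
from pydantic import BaseModel

class TicketTriage(BaseModel):
    priority: str
    category: str
    summary: str

def build_prompt(ticket_text: str, recent_tickets: list[str]) -> str:
    # Gather the context yourself instead of letting an agent fetch it:
    # everything the model needs goes into one prompt, up front.
    history = "\n".join(f"- {t}" for t in recent_tickets)
    return (
        "Triage this support ticket.\n\n"
        f"Ticket:\n{ticket_text}\n\n"
        f"Recent related tickets:\n{history}"
    )

# One call, one structured answer, no loop (requires an API key to run):
# response = client.beta.chat.completions.parse(
#     model="gpt-4o",
#     messages=[{"role": "user", "content": build_prompt(text, recent)}],
#     response_format=TicketTriage,
# )
# triage: TicketTriage = response.choices[0].message.parsed
```

One request instead of eight, a fixed prompt you can snapshot-test, and a typed result at the end.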


If I had to boil all of this down: treat the LLM like any other external service with an unreliable API. Validate inputs. Set timeouts. Budget costs. Monitor everything. The model itself is honestly the easy part. The hard part is all the boring engineering around it — the same input validation and error handling and observability work that made me a decent backend engineer long before I touched any AI code.

The gap between a demo and production really is just that: regular engineering. Nothing glamorous. Nothing that’ll make a good conference talk. But it’s the difference between a system you show your boss and a system you can actually leave running overnight without checking your phone.


Salih "Adam" Yildirim

Full Stack Software Engineer with 6+ years of experience building scalable web and mobile applications. Passionate about clean code, modern architecture, and sharing knowledge.
