
An Azure AI Foundry agent with 23 tools: the JSON-schema discipline that killed hallucinated arguments


The first time the agent went into production it cancelled an order that did not exist. The customer had asked, in chat, "please cancel my last work order, the one from Tuesday." The agent looked confident, called cancel_order with order_id="ORD-12345", the downstream API returned a 404, the agent retried, the API returned a 404 again, and after the seventh identical retry the conversation timed out and the customer got a generic error. The actual order was WO-12345-A. The model had seen "12345" in a recent invoice summary and decided, with no evidence, that the prefix was ORD-. ORD-12345 had never existed. WO-12345-A was real and could have been cancelled with one call.

That was the hallucination that triggered the rewrite. The agent had 23 internal APIs wired up as tools and the same class of failure was hiding in every single one. Six months later, after a thorough pass over the JSON schemas, a jsonschema preflight in front of every dispatch, a strict timeout policy, and a 180-task eval set, the same agent runs at 97% tool-call accuracy with no runaway loops. This is the build log.

The 23 tools, and why the count matters

Before the schema work, I sat down and wrote out every tool the agent could call. Twenty-three felt like a lot until I listed them, at which point it felt like the minimum:

  1. lookup_order: by work order id, customer id, or date range
  2. cancel_order: terminal, requires confirmation flag
  3. refund_order: partial or full, requires reason code
  4. lookup_customer: by customer id, email, or phone
  5. validate_address: calls a third-party address normaliser
  6. update_address: only for non-shipped orders
  7. track_shipment: by tracking number or work order id
  8. start_return: kicks off RMA workflow
  9. lookup_invoice: by invoice id or work order id
  10. account_balance: open invoices, credits, refundable amount
  11. list_payment_methods: last-four only, no full PAN
  12. send_password_reset_link: emails a tokenised link
  13. create_ticket: opens a support ticket
  14. update_ticket: appends notes, changes status
  15. search_kb: knowledge base over published articles
  16. lookup_faq: short canned answers
  17. handoff_to_human: explicit escalation
  18. escalate: same as handoff, but with severity
  19. detect_language: returns an IETF language tag
  20. score_sentiment: -1.0 to 1.0
  21. summarise: for long ticket histories
  22. convert_currency: for refunds across regions
  23. check_holiday: to forecast SLA windows

The count matters because the failure modes scale with the number of tools. With three tools the model rarely picks the wrong one; with twenty-three, "right tool, wrong arguments" becomes the dominant error class. The schema discipline is what stops that.

The agent itself lives in Azure AI Foundry as a single agent with function tools attached, configured against gpt-4o via the Azure AI Foundry Agent Service. The runtime is Python with the azure-ai-projects SDK. The agent definition, stripped to the bones, looks like this:

import os

from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

project = AIProjectClient.from_connection_string(
    conn_str=os.environ["AI_FOUNDRY_PROJECT_CONN"],
    credential=DefaultAzureCredential(),
)

agent = project.agents.create_agent(
    model="gpt-4o",
    name="support-agent-v3",
    instructions=AGENT_SYSTEM_PROMPT,
    tools=FUNCTION_DEFINITIONS,
    temperature=0.2,
    response_format={"type": "text"},
    tool_choice="auto",
)

tool_choice: "auto" is the right setting for this kind of agent. Forcing a specific tool defeats the point; forcing "none" means you have a chatbot, not an agent. temperature: 0.2 is deliberate: lower temperatures give noticeably more stable tool arguments, with a small loss in conversational warmth that we make up in the wrapping prose the agent emits. The documentation for function calling on Azure OpenAI is the right starting point if you have not done this before.

Why a description and a type are not enough

The first version of the cancel_order tool looked like this:

{
    "name": "cancel_order",
    "description": "Cancel an order.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "The order id."},
            "confirm": {"type": "boolean"},
        },
        "required": ["order_id", "confirm"],
    },
}

This is the schema that produced order_id="ORD-12345". The schema told the model the argument was a string. The description said "the order id." The model duly invented a string that looked like an order id. There was nothing in the contract that said the string had to match the company's actual ID format, no enumeration of valid prefixes, no example, no way for the model to know it was guessing.

The fixed version, which is now the template every other tool follows, looks like this:

{
    "type": "function",
    "function": {
        "name": "cancel_order",
        "description": (
            "Cancel a customer's work order. Only call this once you have "
            "the exact work order id confirmed by the customer or returned "
            "from lookup_order. Never construct a work order id; if you do "
            "not have one, call lookup_order first."
        ),
        "parameters": {
            "type": "object",
            "additionalProperties": False,
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": (
                        "Work order id in the format WO-NNNNN-A where NNNNN "
                        "is 5 digits and A is a single uppercase letter. "
                        "Example: WO-12345-A. Obtain via lookup_order."
                    ),
                    "pattern": "^WO-[0-9]{5}-[A-Z]$",
                },
                "reason_code": {
                    "type": "string",
                    "description": "Reason for cancellation. Required for audit.",
                    "enum": [
                        "customer_request",
                        "duplicate_order",
                        "fraud_suspected",
                        "out_of_stock",
                        "address_invalid",
                    ],
                },
                "confirm": {
                    "type": "boolean",
                    "description": (
                        "Must be true. The agent must have asked the "
                        "customer for explicit confirmation before setting this."
                    ),
                },
            },
            "required": ["order_id", "reason_code", "confirm"],
        },
    },
}

Five things changed and each one fixed a different real failure.

The pattern regex is the headline fix. Once the schema declared ^WO-[0-9]{5}-[A-Z]$, the model became markedly more cautious about inventing IDs: when it lacked one it would call lookup_order first instead of fabricating a plausible-looking string. The model has seen a great deal of JSON Schema in training; a pattern constraint shifts its behaviour at generation time, not just at validation time.

The enum on reason_code does the same job as a regex but for a closed set. Before the enum, we saw free-text reasons like "customer changed their mind" which then needed normalising downstream. After, we get one of five values and downstream code stops second-guessing.

The additionalProperties: False is the underrated one. Without it, the model occasionally hallucinated extra parameters like notify_customer: true that did not exist in the tool. With it, the validator rejects those at the boundary and the model learns, within a single dispatch loop, not to add them.

The description on order_id does double duty. It tells the human reading the schema what the field is. It also tells the model how to get the value if it does not have one: "Obtain via lookup_order." That sentence is doing work. Models are dramatically less likely to invent values when the description tells them how to obtain them legitimately.

The description on confirm builds a behavioural contract into the schema: "The agent must have asked the customer for explicit confirmation." This is the kind of thing you would normally put in the system prompt, but locating it next to the parameter that depends on it makes the model's behaviour more reliable. Confirmation requests in production rose from roughly 71% of cancellations to 98% after this change.

The full schema library, across 23 tools, is around 1,800 lines of Python. It is the densest documentation in the codebase.
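
One way to keep 1,800 lines of schema honest is to route every definition through a small constructor that refuses anything off-template. A sketch of that idea; the helper and its checks are illustrative, not the exact code we ship:

def define_tool(name: str, description: str, properties: dict, required: list) -> dict:
    """Build a function-tool definition and enforce the house schema rules."""
    for field, spec in properties.items():
        if "description" not in spec:
            raise ValueError(f"{name}.{field}: every parameter needs a description")
        # IDs are where hallucination happens; insist on a pattern or enum for them.
        if field.endswith("_id") and "pattern" not in spec and "enum" not in spec:
            raise ValueError(f"{name}.{field}: id parameters need a pattern or enum")
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "additionalProperties": False,  # never negotiable
                "properties": properties,
                "required": required,
            },
        },
    }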

The dispatch wrapper: validate before you call

Schemas at definition time tell the model what to emit. They do not, by themselves, stop a bad call from reaching the downstream API. The agent runtime can still emit an order_id that almost matches the pattern but does not, or a value the model believes is an enum member but spelled wrongly. The right defence is a jsonschema validator that runs between the model's emitted tool call and the actual API request.

import json

from jsonschema import Draft202012Validator

# Build a lookup of name -> JSON Schema
TOOL_SCHEMAS = {
    tool["function"]["name"]: tool["function"]["parameters"]
    for tool in FUNCTION_DEFINITIONS
}

# Compile validators once, at startup, not per call
VALIDATORS = {
    name: Draft202012Validator(schema)
    for name, schema in TOOL_SCHEMAS.items()
}


class ToolCallValidationError(Exception):
    """Raised when the model's tool args fail the JSON schema."""


def dispatch_tool_call(name: str, raw_args: str) -> dict:
    if name not in VALIDATORS:
        return {
            "ok": False,
            "error": "unknown_tool",
            "detail": f"Tool '{name}' is not registered.",
            "next_action": "select a tool from the list",
        }

    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError as e:
        return {
            "ok": False,
            "error": "invalid_json",
            "detail": str(e),
            "next_action": "emit valid JSON for tool arguments",
        }

    errors = sorted(
        VALIDATORS[name].iter_errors(args),
        key=lambda e: e.path,
    )
    if errors:
        return {
            "ok": False,
            "error": "argument_validation_failed",
            "tool": name,
            "violations": [
                {
                    "path": "/".join(str(p) for p in e.path),
                    "message": e.message,
                    "schema_keyword": e.validator,
                }
                for e in errors
            ],
            "next_action": (
                "Re-read the tool description and the parameter schema, "
                "then re-emit the call with corrected arguments. If you "
                "do not have a valid value, call lookup_order or ask the "
                "customer instead of guessing."
            ),
        }

    return _call_internal_api(name, args)

The shape of the error response matters. The model reads this output as its next input. If the error is a stack trace, the model panics and retries with the same bad arguments. If the error is structured and includes a next_action hint, the model uses the hint. That next_action field is the most important part of the whole error contract: tools that return errors without one produce runaway loops; tools that include one almost never do.

The first time this caught a real failure, the model emitted:

ToolCallValidationError: 'order_id' does not match pattern '^WO-[0-9]{5}-[A-Z]$'

The arg was order_id="WO-12345", missing the trailing -A. Without preflight that would have hit the API and returned a 404 which the agent would have retried until the turn timed out. With preflight, the error returned to the model in the same turn, the model called lookup_order to find the actual id, and the cancellation completed in 1.4 seconds end to end. The whole loop was four iterations: failed cancel_order, lookup_order, successful cancel_order, summary back to the user.

A similar one we used to see weekly:

OpenAIToolCallError: Argument 'limit' must be an integer

The model would emit {"limit": "10"} because string-typed numbers are common in HTTP query strings. The schema declared "limit": {"type": "integer", "minimum": 1, "maximum": 100}, and the validator caught the string. The next_action hint said to emit 10, not "10", and the model fixed it on retry. Without the validator the request would have reached the API and either errored with a less friendly message or, worse, been silently coerced by some frameworks and returned the wrong page.
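
Running the preflight by hand shows exactly what the model gets back in that case. A minimal reproduction with jsonschema; only the limit constraint comes from the real schema, and the customer_id format is invented for the example:

from jsonschema import Draft202012Validator

schema = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "customer_id": {
            "type": "string",
            "description": "Customer id. Format here is illustrative only.",
            "pattern": "^CUST-[0-9]{6}$",
        },
        "limit": {
            "type": "integer",
            "minimum": 1,
            "maximum": 100,
            "description": "Max results to return. Emit 10, not \"10\".",
        },
    },
    "required": ["customer_id"],
}

validator = Draft202012Validator(schema)
bad_args = {"customer_id": "CUST-004211", "limit": "10"}

for err in validator.iter_errors(bad_args):
    print("/".join(str(p) for p in err.path), "->", err.message)
# prints: limit -> '10' is not of type 'integer'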

Per-tool timeouts and the max-iterations cap

The cancellation hallucination was one failure mode. The other one we had to fix was the runaway loop: the agent calling the same tool many times in a row, never converging on an answer, and burning through token budget until the turn timed out. The mechanism was simple. The model would call search_orders with a vague query, the search would return an empty list, the model would interpret "empty list" as "try a slightly different query," and would do that seven, eight, twelve times before something stopped it.

The fix was two pieces of policy and one piece of tool design.

The policy is a per-tool timeout and a per-turn max iterations:

import json
import time

TOOL_TIMEOUTS = {
    # Fast, deterministic
    "validate_address": 2.0,
    "check_holiday": 1.0,
    "lookup_faq": 1.5,
    "convert_currency": 1.0,
    "detect_language": 1.5,
    "score_sentiment": 2.0,

    # Database lookups
    "lookup_order": 4.0,
    "lookup_customer": 4.0,
    "lookup_invoice": 4.0,
    "account_balance": 4.0,
    "list_payment_methods": 3.0,
    "track_shipment": 5.0,

    # Search and KB
    "search_kb": 8.0,

    # State-mutating
    "cancel_order": 6.0,
    "refund_order": 8.0,
    "update_address": 4.0,
    "start_return": 6.0,
    "create_ticket": 5.0,
    "update_ticket": 4.0,
    "send_password_reset_link": 4.0,

    # Free text utilities
    "summarise": 10.0,

    # Escalation
    "handoff_to_human": 3.0,
    "escalate": 3.0,
}

TURN_DEADLINE_SECONDS = 45.0
MAX_ITERATIONS = 12


def run_agent_turn(thread_id: str, user_message: str) -> dict:
    start = time.monotonic()
    iterations = 0

    project.agents.create_message(thread_id=thread_id, role="user", content=user_message)
    run = project.agents.create_run(thread_id=thread_id, assistant_id=AGENT_ID)

    while True:
        if iterations >= MAX_ITERATIONS:
            return _force_handoff(thread_id, reason="max_iterations_exceeded")
        if (time.monotonic() - start) > TURN_DEADLINE_SECONDS:
            return _force_handoff(thread_id, reason="turn_deadline_exceeded")

        run = project.agents.get_run(thread_id=thread_id, run_id=run.id)

        if run.status == "completed":
            return {"ok": True, "iterations": iterations}

        if run.status == "requires_action":
            tool_calls = run.required_action.submit_tool_outputs.tool_calls
            outputs = []
            for tc in tool_calls:
                # Never dispatch on partial args. Wait for finalisation.
                if not tc.function.arguments or tc.function.arguments[-1] not in "}\"":
                    continue
                deadline = TOOL_TIMEOUTS.get(tc.function.name, 5.0)
                result = _dispatch_with_timeout(
                    tc.function.name,
                    tc.function.arguments,
                    timeout=deadline,
                )
                outputs.append({"tool_call_id": tc.id, "output": json.dumps(result)})

            project.agents.submit_tool_outputs_to_run(
                thread_id=thread_id, run_id=run.id, tool_outputs=outputs,
            )
            iterations += 1
            continue

        if run.status in {"failed", "cancelled", "expired"}:
            return _force_handoff(thread_id, reason=f"run_{run.status}")

        time.sleep(0.25)

The timeouts are per-tool because the tools have wildly different latency profiles. A 2-second cap on validate_address is generous; an 8-second cap on refund_order accounts for the downstream payment processor's occasional slowness. The numbers are picked off the p99 of the underlying API's latency, rounded up to the nearest second.
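
The _dispatch_with_timeout helper the loop calls is not shown above; in spirit it is a thin shim over dispatch_tool_call that turns a slow downstream call into the same structured error contract. A sketch, assuming a shared worker pool; the exact timeout payload is illustrative but follows the next_action rule:

from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_TOOL_POOL = ThreadPoolExecutor(max_workers=8)  # shared pool for tool dispatch


def _dispatch_with_timeout(name: str, raw_args: str, timeout: float) -> dict:
    future = _TOOL_POOL.submit(dispatch_tool_call, name, raw_args)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        future.cancel()  # best effort; a call already running cannot be interrupted
        return {
            "ok": False,
            "error": "tool_timeout",
            "tool": name,
            "timeout_seconds": timeout,
            "next_action": (
                "The downstream system did not respond in time. Do not retry "
                "the same call immediately; tell the customer there is a delay "
                "or call handoff_to_human."
            ),
        }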

MAX_ITERATIONS = 12 is a hard cap. Most successful turns use between 2 and 5 iterations. Anything past 8 is a sign of trouble. At 12 we escalate. We also have TURN_DEADLINE_SECONDS = 45 because some turns can iterate quickly but stall on a single slow tool; the deadline catches that case independently.
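
_force_handoff, the other helper the loop leans on, routes the conversation to a person and reports a disposition the eval can check later. A sketch under the same assumptions; the handoff arguments here are illustrative and would have to match the real handoff_to_human schema:

def _force_handoff(thread_id: str, reason: str) -> dict:
    # Escalate through the same tool layer the model uses, so the audit trail
    # looks the same whether the model or the runtime decided to hand off.
    dispatch_tool_call(
        "handoff_to_human",
        json.dumps({"thread_id": thread_id, "reason": reason}),
    )
    return {"ok": False, "disposition": "handoff", "reason": reason}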

The third piece, tool design, is the one that actually killed the runaway loops. Tools no longer return an empty list when there are no results. They return:

{
    "ok": True,
    "results": [],
    "match_count": 0,
    "next_action": (
        "no_results: ask the customer for additional identifying "
        "information (work order id, email, date range) OR call "
        "handoff_to_human if you have already asked once"
    ),
}

The next_action field is doing the same job it does in the validation errors: telling the model what its next move should be, instead of leaving it to imagine one. Before this change, an empty result was being interpreted as "try a different query," over and over. After, the model asks the customer one clarifying question and, if that does not resolve it, hands off. The runaway loop rate fell from about 3% of tasks to about 0.1%.
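
This does not need 23 hand-written copies of that dict; a single wrapper on every list-returning handler does the job. A sketch, with an invented function name:

def wrap_results(results: list, no_results_hint: str) -> dict:
    """Never hand the model a bare empty list; always tell it what to do next."""
    if results:
        return {"ok": True, "results": results, "match_count": len(results)}
    return {
        "ok": True,
        "results": [],
        "match_count": 0,
        "next_action": no_results_hint,
    }


# e.g. inside the lookup_order handler:
# return wrap_results(rows, "no_results: ask the customer for a work order id, "
#                           "email, or date range OR call handoff_to_human if "
#                           "you have already asked once")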

The eval that proves the agent uses the tools

A tool-using agent is only as good as the eval that measures its tool use. The eval I built is a 180-task gold set, hand-labelled, covering nine categories evenly:

  1. Single-tool lookup (e.g., "what is the status of WO-12345-A")
  2. Multi-step transaction (e.g., "cancel my last order and refund the difference to a different card")
  3. Ambiguous input (e.g., "cancel the one from Tuesday")
  4. Requires clarification (e.g., "I want a refund" with no order context)
  5. Edge cases on cancelled or already-refunded orders
  6. Confirmation-required actions
  7. Out-of-scope requests (handoff is the right answer)
  8. Multilingual inputs
  9. Adversarial inputs (prompt injection in the user message)

Each task has a gold answer that lists: the expected tool calls in order, the expected arguments where they are deterministic, and the expected final disposition (completed, handoff, or clarification_requested). The eval harness runs the agent against each task and produces five metrics:

import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class EvalMetrics:
    tool_call_accuracy: float          # right tool for the question
    argument_validity_rate: float      # args passed jsonschema
    avg_iterations: float              # per task
    success_rate: float                # final disposition matched gold
    handoff_correctness: float         # handed off when handoff was right


def run_eval(agent_id: str, gold_set_path: str) -> EvalMetrics:
    tasks = json.loads(Path(gold_set_path).read_text())

    results = []
    for task in tasks:
        thread = project.agents.create_thread()
        observed_calls = []
        iter_count = 0

        def _spy(name: str, args: dict) -> None:
            observed_calls.append((name, args))

        # Run the turn with the spy injected into dispatch
        with patch_dispatch(_spy):
            outcome = run_agent_turn(thread.id, task["user_message"])
        iter_count = outcome.get("iterations", 0)

        # Compare observed tool sequence with gold
        gold_tools = [c["tool"] for c in task["expected_calls"]]
        observed_tools = [c[0] for c in observed_calls]
        tool_match = _sequence_matches(observed_tools, gold_tools)

        # Validate every emitted arg set against its schema (post hoc)
        arg_valid = all(
            VALIDATORS[name].is_valid(args) for name, args in observed_calls
        )

        final_disp = outcome.get("disposition")
        success = final_disp == task["expected_disposition"]

        results.append({
            "task_id": task["id"],
            "tool_match": tool_match,
            "arg_valid": arg_valid,
            "iterations": iter_count,
            "success": success,
            "handoff_expected": task["expected_disposition"] == "handoff",
            "handoff_observed": final_disp == "handoff",
        })

    n = len(results)
    return EvalMetrics(
        tool_call_accuracy=sum(r["tool_match"] for r in results) / n,
        argument_validity_rate=sum(r["arg_valid"] for r in results) / n,
        avg_iterations=sum(r["iterations"] for r in results) / n,
        success_rate=sum(r["success"] for r in results) / n,
        handoff_correctness=sum(
            1 for r in results
            if r["handoff_expected"] == r["handoff_observed"]
        ) / n,
    )
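
For reference, one gold task in the shape this harness reads. The field names match the code above; the content is an invented example, and the args key on the second call stands in for the expected arguments where they are deterministic:

EXAMPLE_GOLD_TASK = {
    "id": "ambiguous-cancel-004",
    "user_message": "please cancel my last work order, the one from Tuesday",
    "expected_calls": [
        {"tool": "lookup_order"},  # find the id first, never guess it
        {
            "tool": "cancel_order",
            "args": {"reason_code": "customer_request", "confirm": True},
        },
    ],
    "expected_disposition": "completed",
}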

The numbers before and after the schema discipline:

Metric                              Before    After
Tool-call accuracy                   91.1%    97.2%
Argument validity rate               83.9%    99.6%
Avg iterations per task                4.8      3.1
Success rate (final disposition)     84.4%    95.0%
Handoff correctness                  78.3%    94.4%

The argument validity rate is the headline. The schemas plus the preflight took us from emitting bad arguments roughly one call in six to roughly one call in 250. The runaway loop rate, which is implicit in the average iteration count, dropped enough that the p99 iteration count went from 11 (right against the cap) to 6.

The eval runs on every commit to the agent code via the standard Foundry evaluation flow. A regression on any of the five metrics blocks the deploy. We have shipped 19 changes through this gate; three of them were rejected for tool-call accuracy regressions and rolled back.
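
The gate itself is a small comparison against the last accepted baseline. A sketch of how such a check could look; the baseline file, field list, and tolerances are assumptions, not the exact gate we run:

import json
from pathlib import Path

BASELINE_PATH = Path("eval/baseline_metrics.json")  # numbers from the last accepted release


def gate_deploy(current: EvalMetrics, tolerance: float = 0.005) -> None:
    regressions = []
    baseline = json.loads(BASELINE_PATH.read_text())
    for field in ("tool_call_accuracy", "argument_validity_rate",
                  "success_rate", "handoff_correctness"):
        if getattr(current, field) < baseline[field] - tolerance:
            regressions.append(field)
    if current.avg_iterations > baseline["avg_iterations"] + 0.5:
        regressions.append("avg_iterations")  # this one regresses upwards
    if regressions:
        raise SystemExit(f"Eval regression, blocking deploy: {regressions}")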

Streaming tool calls and the partial-args race

A specific gotcha that cost us a Tuesday afternoon early on: when the agent streams its response, the function-call arguments arrive in chunks. The model might emit {"order_id": "WO- in one delta, 12345-A", "reason_code": "customer_request" in the next, and the closing , "confirm": true} in a third. If you dispatch on the first chunk you get an unparseable string and a panicked-looking retry; if you dispatch on the second you get a JSON decode error. The fix is in the loop above and worth calling out:

# Never dispatch on partial args. Wait for finalisation.
if not tc.function.arguments or tc.function.arguments[-1] not in "}\"":
    continue

That heuristic is cheap and sufficient. The Foundry SDK actually surfaces a finalisation event you can listen on directly; we ended up doing both because in practice the heuristic catches edge cases the event misses, and the event catches edge cases the heuristic misses. Belt and braces. The structured outputs feature on Azure OpenAI is the longer-term cure here because it guarantees JSON that conforms to a schema by construction, and we have started migrating the highest-stakes tools to it.
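
The migration is mostly a schema-shape change rather than new runtime code: the function is marked strict, additionalProperties stays False, and every property has to appear in required. A sketch of what a strict cancel_order could look like; keyword support under strict mode (pattern in particular) varies by API version, so check the current Azure OpenAI structured outputs documentation before leaning on it:

STRICT_CANCEL_ORDER = {
    "type": "function",
    "function": {
        "name": "cancel_order",
        "strict": True,  # ask the service to constrain generation to this schema
        "description": (
            "Cancel a customer's work order. Obtain the id via lookup_order; "
            "never construct it."
        ),
        "parameters": {
            "type": "object",
            "additionalProperties": False,  # mandatory under strict mode
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "Work order id, format WO-NNNNN-A. Obtain via lookup_order.",
                },
                "reason_code": {
                    "type": "string",
                    "enum": [
                        "customer_request",
                        "duplicate_order",
                        "fraud_suspected",
                        "out_of_stock",
                        "address_invalid",
                    ],
                },
                "confirm": {"type": "boolean"},
            },
            "required": ["order_id", "reason_code", "confirm"],  # all properties, no exceptions
        },
    },
}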

Troubleshooting, in the order they actually happen

ToolCallValidationError: 'order_id' does not match pattern '^WO-[0-9]{5}-[A-Z]$': the model emitted an ID it invented. Check that the parameter description includes the "Obtain via lookup_order" sentence. If the description does not say where to get the value, the model will guess.

OpenAIToolCallError: Argument 'limit' must be an integer: JSON Schema's type: "integer" is strict. The model emitted "10". The validator caught it; the next_action should tell the model to drop the quotes.

Runaway loop on search_orders: almost always an empty-result case where the tool returned [] and the model retried with a different query. Switch the tool's empty-result return to a structured object with next_action: "no_results: ask for clarification or handoff."

max_iterations_exceeded with 12 successful tool calls: the model is doing useful work but the task is bigger than 12 calls. This is a sign the task should be split, not that the cap should be raised. We split one task into "lookup + summarise" + "act on the summary" and the iteration count halved.

Validator passes but the API still 404s: the schema is too loose. We had this on lookup_invoice where the regex accepted any 6-12 digit string but the system actually only issues IDs of length 9 or 11. Tighten the pattern.

AADSTS70021 in the eval harness: your eval is running under a different identity than your dev shell. The DefaultAzureCredential chain picked a stale token. Force AzureCliCredential or set AZURE_CLIENT_ID explicitly in the eval environment.

tool_choice: "auto" but the model never calls the tool: the system prompt is telling the model to answer in prose. Re-read the prompt; the words "explain" and "describe" near the top push the model toward narration instead of action. Replace with "use the tools."

Streaming partial args reaching dispatch: see the section above. Add the finalisation guard.

Where we ended up

The agent runs in production behind the same APIM front door we built for the wider Azure OpenAI estate, with the per-tenant token budget gating tokens at the edge. The agent's own cost line on the bill is a steady fraction of the overall OpenAI spend, not the lumpy spike pattern we used to see when runaway loops were eating tokens. Customer-facing latency on a typical "where is my order" question is about 1.8 seconds, p50, including the model turn and the lookup tool call. The handoff rate sits at 14% of conversations, which the support team are happy with because it is up from 9% before the schema work (the agent is now more honest about its uncertainty rather than fabricating a confident wrong answer).

The thing I underestimated going in was how much of the agent's behaviour is shaped at definition time, not at runtime. The schemas are not just for validation. They are part of the model's prompt, in effect; the model reads them, internalises them, and behaves differently because of what they say. A pattern field changes generation. A description that says "Obtain via lookup_order" changes generation. An additionalProperties: False declared at the right level changes generation. The runtime validator is the safety net, not the primary mechanism.

The other thing I underestimated was how much trust the support team needed before they would route real customers to the agent. The eval, with its 180 tasks and five metrics, was what bought that trust. Before the eval existed, my answer to "is the agent actually using the tools right" was "probably?" with a shrug. After the eval existed, my answer was a table with five numbers and a per-category breakdown, and the conversation moved from gut-feel to numbers in one meeting. That table is now the artefact we update on every release and walk through with the operations team monthly. The numbers move; the conversation stays grounded.

WO-12345-A was the order that started this. The system prompt now contains a single sentence I will probably never remove: "If you have not been given a work order id, you do not have one. Call lookup_order to find it. Do not invent it." That sentence, plus the regex, plus the validator, plus the eval, is what good looks like when an LLM is allowed to call your APIs.