
Tracing an Azure AI Foundry agent with OpenTelemetry into Application Insights: the silent failures it surfaced

For four days last October, our Azure AI Foundry support agent was politely substituting 'I couldn't find that invoice' into responses because lookup_invoice was returning null on 14% of calls and nobody had a tool-timeout metric. OpenTelemetry into Application Insights made the pattern jump out of the dashboard the next morning. Here is the whole wiring, the six-panel Workbook, the KQL behind every tile, the per-tenant cost rollup, and the 03:22 page on a Wednesday that pointed at the right downstream service.


For four days last October, our Azure AI Foundry agent was lying to customers and we did not know. The agent in question is the support copilot the finance org runs over their invoice corpus. A customer asks about invoice INV-2024-08-44211, the agent does its planning step, calls a tool called lookup_invoice, gets a result back, formats a friendly answer, and returns it. The shape of the failure was small. lookup_invoice was returning null for 14% of calls because the downstream invoice service had a connection-pool exhaustion bug under load and timed out, and the agent, given a null tool result, was politely substituting "I couldn't find that invoice in our records, can you double-check the number?" into its response. From the customer's side it looked like the agent had decided the invoice did not exist. From the invoice service's side, the request had timed out, but nobody alerted on tool timeouts because we did not have a metric for tool timeouts. From the agent's side, the conversation ended cleanly. The four-day stretch before we caught it cost the finance team about 320 escalations from confused customers, all of which the support team handled by hand because they could not reproduce the agent's wrong answer and had no trace to inspect.

What broke us out of it was OpenTelemetry. Once every agent step and every tool call was a span in Application Insights, the null-result pattern jumped out of the dashboard the next morning. The first KQL query I wrote against it (six lines, grouped by tool name and result-is-null) told me exactly where 14% of lookup_invoice calls were going. Trace ID 0x9fbb2a4c1d31b8e6 was the one I screenshotted into the incident ticket; clicking through, you could see the 7.4-second tool span, the timeout exception attribute, the empty payload, and the agent's polite cover-up rendered four spans later.

This is the whole pattern. Why we picked OpenTelemetry over the vendor-specific SDK paths, the wiring inside the Foundry agent code, the semantic conventions we standardised on, the dashboard we built in Application Insights Workbooks, the KQL that powers each panel, the alert rule that now pages on this exact failure mode, and the long list of gotchas we hit along the way (streaming responses, missing parent contexts, the cost-attribution baggage that took two rewrites to land).

Why OpenTelemetry, not the vendor SDK

Foundry has its own tracing hooks. The Python azure-ai-projects client and the agent runtime emit telemetry into Application Insights if you flip the project tracing toggle and provide a connection string. It works. The reason we did not stop there is portability. The platform team standardised on OpenTelemetry across about 60 services already, with traces shipping to Application Insights in production and a Tempo/Grafana stack in staging for the platform engineers who like that view. Asking the AI team to emit a parallel telemetry stream that only goes to Application Insights and only carries Foundry-shaped events would have re-built a silo we spent two years dismantling.

The OpenTelemetry distro for Azure Monitor solves this. The azure-monitor-opentelemetry package is Microsoft's official distro; it pre-configures the OpenTelemetry SDK to export to Application Insights using the same connection string as everything else, but the spans you create are standard OTLP spans. The same spans can be exported to a second backend by adding another span processor. We send to Application Insights for the production observability dashboards and to an OTLP collector for the platform team's Grafana, both from the same agent process. The distro is documented on Microsoft Learn under Azure Monitor OpenTelemetry.

The second reason was semantic conventions. The OpenTelemetry community has a GenAI working group that has published draft semantic conventions for GenAI workloads: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.tool.name, and a handful of others. Standardising on those means the same dashboards work across Azure OpenAI, Foundry agents, and any third-party model we route through. The conventions are tracked at the OpenTelemetry GenAI conventions page; for Azure Monitor's adoption of them, Application Insights for AI workloads is the relevant page.

The wiring

The bootstrap is shorter than you would expect. One call to configure_azure_monitor does the heavy lifting; everything after that is span management around the parts of the agent flow we care about.

# observability.py
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace, baggage
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import os

APPLICATIONINSIGHTS_CONNECTION_STRING = os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"]

configure_azure_monitor(
    connection_string=APPLICATIONINSIGHTS_CONNECTION_STRING,
    resource=Resource.create(
        {
            "service.name": "finance-agent",
            "service.version": os.environ.get("BUILD_ID", "dev"),
            "deployment.environment": os.environ.get("DEPLOY_ENV", "dev"),
        }
    ),
)

# Also fan out to the platform team's collector for Grafana.
provider = trace.get_tracer_provider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint=os.environ["OTLP_COLLECTOR_ENDPOINT"], insecure=True)
    )
)

tracer = trace.get_tracer("finance-agent")

The service.name resource attribute is what shows up as the cloud_RoleName column in Application Insights. Pick it carefully because every dashboard filter ends up keying off it. We use one role name per agent, not per replica; the replica count is implicit in the trace volume.

The agent loop itself wraps each meaningful stage in its own span. The shape we settled on is one outer agent.run span per conversation turn, one agent.step span per planning step inside the run, one gen_ai.request span per LLM call, and one tool.{tool_name} span per tool invocation. The parent-child chain matters: every gen_ai.request and tool.{tool_name} span is a child of the agent.step it was issued from, every agent.step is a child of the agent.run.
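
Rendered as a tree, a typical healthy turn looks like this (a hypothetical two-step invoice lookup):

agent.run                          one conversation turn (SERVER)
├── agent.step.1
│   ├── gen_ai.request             planning call
│   └── tool.lookup_invoice        tool invocation (CLIENT)
└── agent.step.2
    └── gen_ai.request             final answer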

from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
from opentelemetry.trace import SpanKind
import json
import time

project = AIProjectClient(
    endpoint=os.environ["AZURE_AI_PROJECT_ENDPOINT"],
    credential=DefaultAzureCredential(),
)

# The Foundry agent (assistant) id referenced by create_run below; the env var
# name here is illustrative, wire it to however you configure deployments.
AGENT_ID = os.environ["AGENT_ID"]

def run_agent_turn(conversation_id: str, tenant_id: str, user_message: str) -> str:
    ctx = baggage.set_baggage("tenant_id", tenant_id)
    with tracer.start_as_current_span(
        "agent.run",
        kind=SpanKind.SERVER,
        context=ctx,
        attributes={
            "gen_ai.system": "az.ai.foundry",
            "gen_ai.conversation.id": conversation_id,
            "tenant_id": tenant_id,
        },
    ) as run_span:
        step_no = 0
        thread = project.agents.create_thread()
        project.agents.create_message(thread_id=thread.id, role="user", content=user_message)
        run = project.agents.create_run(thread_id=thread.id, assistant_id=AGENT_ID)

        while run.status in ("queued", "in_progress", "requires_action"):
            step_no += 1
            with tracer.start_as_current_span(
                f"agent.step.{step_no}",
                attributes={
                    "gen_ai.operation.name": "agent.step",
                    "agent.step.index": step_no,
                },
            ) as step_span:
                run = project.agents.get_run(thread_id=thread.id, run_id=run.id)

                if run.status in ("queued", "in_progress"):
                    time.sleep(0.5)  # brief pause between polls so we don't hammer the run API

                if run.status == "requires_action":
                    for tool_call in run.required_action.submit_tool_outputs.tool_calls:
                        output = _invoke_tool(tool_call)
                        project.agents.submit_tool_outputs(
                            thread_id=thread.id,
                            run_id=run.id,
                            tool_outputs=[{"tool_call_id": tool_call.id, "output": output}],
                        )

                step_span.set_attribute("agent.step.status", run.status)

        run_span.set_attribute("agent.run.steps", step_no)
        run_span.set_attribute("agent.run.status", run.status)

        messages = project.agents.list_messages(thread_id=thread.id)
        return messages.data[0].content[0].text.value

baggage is the OpenTelemetry mechanism for context that travels with the whole span tree, rather than something you set as an attribute on every span explicitly. Setting tenant_id once at the agent entry point makes it available to every child span the agent emits, which is the foundation of the per-tenant cost attribution further down. One caveat: baggage rides along in the context but does not land on spans by itself; something has to copy each entry onto the span as an attribute before it shows up in customDimensions. Documented at OpenTelemetry baggage.
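
The copying is one more piece of plumbing in observability.py. A minimal sketch of a span processor that does it, hand-rolled so the allowlist is explicit (the OpenTelemetry contrib repo also ships a ready-made baggage span processor, if you would rather take the dependency):

# observability.py (continued): copy selected baggage entries onto every span
# at start time, so tenant_id is queryable as a customDimensions column.
from opentelemetry import baggage as otel_baggage
from opentelemetry.sdk.trace import SpanProcessor

class BaggageToAttributes(SpanProcessor):
    # Allowlist, because baggage can carry things you do not want exported.
    KEYS = ("tenant_id",)

    def on_start(self, span, parent_context=None):
        for key in self.KEYS:
            value = otel_baggage.get_baggage(key, context=parent_context)
            if value is not None:
                span.set_attribute(key, str(value))

provider.add_span_processor(BaggageToAttributes())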

The tool wrapper is the part that mattered most for the silent-failure story. Every tool call gets its own span with explicit attributes for the tool name and tool.result.is_null; the timing comes for free as the span's own duration. If the tool throws, the span records the exception. If the tool returns null or empty, the span records that as a boolean attribute, which is what made the lookup_invoice null pattern queryable.

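# TOOL_REGISTRY (defined elsewhere in the module) maps tool names to local Python callables.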
def _invoke_tool(tool_call) -> str:
    name = tool_call.function.name
    args = json.loads(tool_call.function.arguments)
    with tracer.start_as_current_span(
        f"tool.{name}",
        kind=SpanKind.CLIENT,
        attributes={
            "gen_ai.tool.name": name,
            "gen_ai.operation.name": "tool.call",
        },
    ) as span:
        try:
            result = TOOL_REGISTRY[name](**args)
            is_null = result is None or (isinstance(result, (list, dict, str)) and len(result) == 0)
            span.set_attribute("tool.result.is_null", is_null)
            span.set_attribute("tool.result.size_bytes", len(json.dumps(result)) if result else 0)
            return json.dumps(result) if result is not None else "null"
        except TimeoutError as e:
            span.set_attribute("tool.error.type", "timeout")
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, "tool timeout"))
            return "null"
        except Exception as e:
            span.set_attribute("tool.error.type", e.__class__.__name__)
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            return "null"

LLM calls get the same treatment, with the GenAI semantic-convention attributes filled in. Token usage in particular is the column the cost dashboard joins against.

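# `client` here is the chat-completions client (openai.AzureOpenAI or equivalent), created at startup.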
def _llm_call(messages, model: str):
    with tracer.start_as_current_span(
        "gen_ai.request",
        kind=SpanKind.CLIENT,
        attributes={
            "gen_ai.system": "az.ai.foundry",
            "gen_ai.operation.name": "chat.completions",
            "gen_ai.request.model": model,
        },
    ) as span:
        response = client.chat.completions.create(model=model, messages=messages)
        span.set_attribute("gen_ai.response.id", response.id)
        span.set_attribute("gen_ai.response.model", response.model)
        span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
        span.set_attribute("gen_ai.response.finish_reasons", [response.choices[0].finish_reason])
        return response

That is the whole instrumentation. About 80 lines of Python plus the bootstrap, sitting alongside the agent's existing run loop. Once it was in, every agent turn in every environment was producing a trace tree we could click through in the Application Insights end-to-end transaction view.

The dashboard, six panels

The Workbook that the AI team and the finance team both check every morning is six panels. Workbooks are the right tool for this; they accept parameterised KQL, render time-series and tables, and they save the whole layout as JSON so we keep it in git alongside the rest of the platform. The Workbooks reference covers the editor. The schema for the underlying data is Application Insights tables in Azure Monitor Logs.

The first panel is agent throughput: agent runs per minute, broken out by environment.

dependencies
| where cloud_RoleName == "finance-agent"
| where name == "agent.run"
| summarize runs = count() by bin(timestamp, 1m), env=tostring(customDimensions["deployment.environment"])
| render timechart

The second panel is latency, the canonical p50/p95/p99 over the run span.

dependencies
| where cloud_RoleName == "finance-agent" and name == "agent.run"
| summarize
    p50 = percentile(duration, 50),
    p95 = percentile(duration, 95),
    p99 = percentile(duration, 99)
  by bin(timestamp, 5m)
| render timechart

The third panel is the one we built around the silent-failure story: tool call success rate over time, with is_null and error broken out separately.

dependencies
| where cloud_RoleName == "finance-agent" and name startswith "tool."
| extend
    is_null = tobool(customDimensions["tool.result.is_null"]),
    error_type = tostring(customDimensions["tool.error.type"]),
    tool_name = tostring(customDimensions["gen_ai.tool.name"])
| summarize
    total = count(),
    null_count = countif(is_null == true),
    error_count = countif(isnotempty(error_type))
  by bin(timestamp, 5m), tool_name
| extend
    null_rate = 100.0 * null_count / total,
    error_rate = 100.0 * error_count / total
| project timestamp, tool_name, null_rate, error_rate
| render timechart

The fourth panel is the agent-step distribution: how many planning steps the agent takes per conversation turn, bucketed. This is where "agent stuck in a loop" shows up; a healthy turn is two or three steps, an unhealthy one runs to eight or more.

dependencies
| where cloud_RoleName == "finance-agent" and name == "agent.run"
| extend steps = toint(customDimensions["agent.run.steps"])
| summarize count() by steps_bucket = case(
    steps <= 3, "1-3",
    steps <= 5, "4-5",
    steps <= 7, "6-7",
    steps <= 10, "8-10",
    ">10")
| render columnchart

The fifth panel is per-tenant token consumption, which feeds the cost report. Every span carries the tenant_id from the baggage we set at the agent entry; the join against a small model_pricing table in the same Log Analytics workspace turns the token counts into dollars.

let model_pricing = datatable(model: string, input_per_million: real, output_per_million: real) [
    "gpt-4o", 2.50, 10.00,
    "gpt-4o-mini", 0.15, 0.60,
    "o1-mini", 3.00, 12.00
];
dependencies
| where cloud_RoleName == "finance-agent" and name == "gen_ai.request"
| extend
    tenant_id = tostring(customDimensions["tenant_id"]),
    model = tostring(customDimensions["gen_ai.request.model"]),
    input_tokens = toint(customDimensions["gen_ai.usage.input_tokens"]),
    output_tokens = toint(customDimensions["gen_ai.usage.output_tokens"])
| join kind=inner model_pricing on model
| extend cost_usd = (input_tokens * input_per_million / 1e6) + (output_tokens * output_per_million / 1e6)
| summarize total_cost = sum(cost_usd) by tenant_id, bin(timestamp, 1d)
| render columnchart
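
As a sanity check on the formula: a single gpt-4o call with 1,200 input tokens and 300 output tokens books as 1,200 × 2.50/1e6 + 300 × 10.00/1e6 = $0.006, and the summarize rolls those per-call figures up into tenant-day totals.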

The sixth panel is the cost burn-down: total spend so far this month against the budget. The variable MonthlyBudget is set at the top of the Workbook.

let start = startofmonth(now());
let budget = todouble({MonthlyBudget});
let pricing = datatable(model: string, input_per_million: real, output_per_million: real) [
    "gpt-4o", 2.50, 10.00,
    "gpt-4o-mini", 0.15, 0.60,
    "o1-mini", 3.00, 12.00
];
dependencies
| where timestamp >= start and cloud_RoleName == "finance-agent" and name == "gen_ai.request"
| extend
    model = tostring(customDimensions["gen_ai.request.model"]),
    input_tokens = toint(customDimensions["gen_ai.usage.input_tokens"]),
    output_tokens = toint(customDimensions["gen_ai.usage.output_tokens"])
| join kind=inner pricing on model
| extend cost_usd = (input_tokens * input_per_million / 1e6) + (output_tokens * output_per_million / 1e6)
| summarize spend = sum(cost_usd) by bin(timestamp, 1d)
| order by timestamp asc
| extend cumulative = row_cumsum(spend)
| extend budget_line = budget
| project timestamp, cumulative, budget_line
| render timechart

Each panel is one tile in the Workbook. The Workbook saves to JSON, the JSON lives in the repo under infra/observability/agent-workbook.json, deployment is a Bicep module that creates the Workbook resource from that JSON. New tile, new pull request, the dashboard rolls forward like any other code.

The alert that paged

The Workbook tells you something is wrong if you look at it. The alert tells you when you are not looking. The one we built around tool.result.is_null is a KQL-based log alert that runs every five minutes and fires if the null rate for any tool exceeds 5% over the trailing 15 minutes. The Azure Monitor alert rules documentation covers the surrounding machinery; log search alert rules are the type we use.

dependencies
| where timestamp > ago(15m)
| where cloud_RoleName == "finance-agent" and name startswith "tool."
| extend
    is_null = tobool(customDimensions["tool.result.is_null"]),
    tool_name = tostring(customDimensions["gen_ai.tool.name"])
| summarize total = count(), nulls = countif(is_null == true) by tool_name
| extend null_rate_pct = 100.0 * nulls / total
| where total >= 20  // avoid pages on a handful of calls
| where null_rate_pct > 5.0
| project tool_name, total, nulls, null_rate_pct

The alert's threshold is "any row returned." The action group routes to PagerDuty and a dedicated Teams channel. The first time it fired in production was 03:22 on a Wednesday in late October. The on-call engineer pulled up the dashboard, saw lookup_invoice at 18% null over the trailing window, jumped into the linked trace, and saw tool.error.type = "timeout" on the failing spans. Five minutes after the page, she had the invoice service team paged too; the real culprit was a connection pool that had drifted below the size it needed under European business-hours load. The agent was the messenger; the alert pointed at the right service.

The 15-minute window is deliberately short enough to catch a real-time degradation but long enough to absorb random noise (one tool flaking on a single call is not 18% over 15 minutes). The 5% threshold came from looking at six weeks of historic data: nulls under normal operation sit at about 1.2%, mostly legitimately-missing invoices a customer asked about. Anything above 4% has, in every case we have seen, been a real downstream issue.

Streaming, and the span that closed too early

The trickiest gotcha was streaming. The Foundry agent supports streaming the LLM response token by token, which is the right user experience for long answers. Our first cut of the LLM-call span wrapping looked like this:

def _llm_call_streaming_first_attempt(messages, model):
    with tracer.start_as_current_span("gen_ai.request") as span:
        stream = client.chat.completions.create(model=model, messages=messages, stream=True)
        return stream

Returning the stream from inside the with block closes the span at the end of the block, which is when the function returns, which is before the stream has emitted any tokens. The span duration in Application Insights was milliseconds. The token usage attributes were empty because stream had not iterated yet. Worse, the agent.step span was the parent context; if a downstream piece of code re-attached to the trace later, the parent context had already ended, and you got the cryptic "Span has no parent context" message in the logs from any tooling that tried to participate.

The fix is to keep the span open until the stream completes, and to attach the final token count on close. The easiest way is to materialise the whole iteration inside the span, but if you actually want streaming on the wire, you need a helper that yields chunks and finalises the attributes when the stream ends.

def _llm_call_streaming(messages, model):
    span = tracer.start_span("gen_ai.request", kind=SpanKind.CLIENT)
    span.set_attribute("gen_ai.system", "az.ai.foundry")
    span.set_attribute("gen_ai.operation.name", "chat.completions")
    span.set_attribute("gen_ai.request.model", model)
    span.set_attribute("gen_ai.request.streaming", True)
    def _generator():
        # Keep the span current across yields; it is ended manually in `finally`.
        with trace.use_span(span, end_on_exit=False):
            input_tokens = 0
            output_tokens = 0
            try:
                stream = client.chat.completions.create(
                    model=model, messages=messages, stream=True,
                    stream_options={"include_usage": True},
                )
                for chunk in stream:
                    if chunk.usage:
                        input_tokens = chunk.usage.prompt_tokens
                        output_tokens = chunk.usage.completion_tokens
                    yield chunk
                span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
                span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
            except Exception as e:
                span.record_exception(e)
                span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
                raise
            finally:
                span.end()

    return _generator()

Two non-obvious bits. stream_options={"include_usage": True} is what makes Azure OpenAI emit a final chunk with the total token usage; without it, the only way to attribute usage on a streamed call is to run your own tokeniser over the bytes. And end_on_exit=False on use_span is what keeps the span open across yields, so the streaming consumer sees a still-active parent context for any spans it creates itself. Closing the span happens in finally, by which point the usage attributes have been set.
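
Nothing changes shape for the caller; the wrapped generator is consumed exactly like the raw stream. A sketch of the consuming side, where emit_to_client is a hypothetical sink that forwards tokens to the user:

answer_parts = []
for chunk in _llm_call_streaming(messages, model="gpt-4o"):
    # The usage-only final chunk has no choices; guard before reading the delta.
    if chunk.choices and chunk.choices[0].delta.content:
        answer_parts.append(chunk.choices[0].delta.content)
        emit_to_client(chunk.choices[0].delta.content)  # hypothetical token sink
answer = "".join(answer_parts)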

This one fix recovered token attribution for about 38% of our LLM calls (the ones the agent streamed), which had been showing zero tokens in the cost dashboard because the first-attempt code closed the span before the usage was known. The first week the cost report ran against the fixed code, the per-tenant numbers jumped about 40% from the old, broken numbers; the finance team had been undercharging internal teams by roughly that much because the old dashboard was missing the streamed half.

Troubleshooting, the long list

AzureMonitorOpenTelemetryConfigurator: connection string not set is the canonical "you did not pass the connection string" failure on configure_azure_monitor. It happens, in our experience, mostly because the environment variable is set on the host but not propagated into the container. Check the container env, not the host env. The variable is APPLICATIONINSIGHTS_CONNECTION_STRING, and it must be the full connection-string form (InstrumentationKey=...;IngestionEndpoint=...), not just the bare instrumentation key.

Span has no parent context from the OpenTelemetry SDK is the marker that something has started a span outside of any active context. The two times we saw this were: an async task created from code where no span was active (asyncio.create_task snapshots the contextvars context at creation time, not at await time, so a task created at startup or handed off from a plain thread never sees the agent's context; capture it with contextvars.copy_context() and pass it to the task explicitly), and the streaming bug above. The fix is always to make sure the parent context is in scope when the child span is created.
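
A minimal sketch of the capture-and-pass pattern; audit_step is hypothetical, and the context= argument to asyncio.create_task needs Python 3.11+:

import asyncio
import contextvars

from opentelemetry import trace

tracer = trace.get_tracer("finance-agent")

async def audit_step():  # hypothetical background coroutine
    # Parents under agent.run only if agent.run's context reaches the task.
    with tracer.start_as_current_span("agent.audit_step"):
        await asyncio.sleep(0)

async def run_turn():
    with tracer.start_as_current_span("agent.run"):
        ctx = contextvars.copy_context()  # snapshot while agent.run is active
    # Created down here, the task would otherwise start with no active span;
    # running it inside the snapshot re-attaches it to the trace.
    await asyncio.create_task(audit_step(), context=ctx)

asyncio.run(run_turn())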

No exporter for OTLP found appears if you try to add a second exporter alongside the Azure Monitor one without installing the OTLP exporter package separately. pip install opentelemetry-exporter-otlp-proto-grpc is the missing piece. The Azure Monitor distro is convenient, but it does not pull in OTLP by default.

'tool.result.is_null': True showing up correctly in dev but missing in prod was, on inspection, a deployment-config issue: the environment we deployed to had set OTEL_TRACES_SAMPLER to parentbased_traceidratio with OTEL_TRACES_SAMPLER_ARG at 0.1, meaning 90% of traces were dropped before they hit Application Insights. The null-rate dashboard worked but the totals were one tenth of what we expected. For agent traces we keep the sampler at 1.0; the trace volume is low enough (a few hundred runs an hour) that full retention costs us a few cents a day. Sampling configuration covers the trade-offs.

The token-counts-are-zero-on-streaming gotcha is the one above. The fix is stream_options.include_usage plus a generator-shaped span lifetime.

The "trace shows up in the AI Foundry portal but not in Application Insights" case: the AI Foundry portal has its own tracing tab that reads from a Foundry-internal store, separate from Application Insights. If you only see traces there and not in Application Insights, the Foundry tracing toggle is on but configure_azure_monitor was never called in your code. The two paths are independent; both can be active simultaneously and emit to different places. Documented under Trace agents in Azure AI Foundry.

The "alert never fires even though the dashboard shows the spike" case took us four hours to track down. The alert query had where total >= 20 to avoid noise. In an off-peak window the total was 17, the alert filtered the row out, the dashboard which used a different aggregation showed the spike clearly. The fix was to drop the threshold to total >= 10 for off-peak hours and add a separate alert specifically for low-volume time-of-day windows with a higher null-rate threshold.

Cost attribution, the per-tenant rollup

The cost dashboard fed the panel above, but the finance team's monthly report is a per-tenant breakdown that goes into a shared sheet. That query is one KQL run against the same data, exported via the Log Analytics REST API on a schedule. The KQL reference covers the joins and the aggregations.

let pricing = datatable(model: string, input_per_million: real, output_per_million: real) [
    "gpt-4o", 2.50, 10.00,
    "gpt-4o-mini", 0.15, 0.60,
    "o1-mini", 3.00, 12.00
];
let last_month = startofmonth(now(), -1);
let this_month = startofmonth(now());
dependencies
| where timestamp >= last_month and timestamp < this_month
| where cloud_RoleName == "finance-agent" and name == "gen_ai.request"
| extend
    tenant_id = tostring(customDimensions["tenant_id"]),
    model = tostring(customDimensions["gen_ai.request.model"]),
    input_tokens = toint(customDimensions["gen_ai.usage.input_tokens"]),
    output_tokens = toint(customDimensions["gen_ai.usage.output_tokens"])
| join kind=inner pricing on model
| extend cost_usd = (input_tokens * input_per_million / 1e6) + (output_tokens * output_per_million / 1e6)
| summarize
    llm_calls = count(),
    input_total = sum(input_tokens),
    output_total = sum(output_tokens),
    cost_usd = sum(cost_usd)
  by tenant_id, model
| order by cost_usd desc

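The export itself is a few lines against the azure-monitor-query client. A sketch, assuming the query above is checked in as cost_by_tenant.kql and the workspace id comes from the environment:

# export_report.py -- run monthly by the scheduler of your choice.
import csv
import os
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())
workspace_id = os.environ["LOG_ANALYTICS_WORKSPACE_ID"]
query = open("cost_by_tenant.kql").read()  # the KQL above, from the repo

# The query filters to last month itself; the timespan just has to cover it.
result = client.query_workspace(workspace_id, query, timespan=timedelta(days=62))
table = result.tables[0]
with open("cost_by_tenant.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(table.columns)
    writer.writerows(table.rows)
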
The cost-by-tenant report identified, in its first run, that one tenant accounted for 41% of our agent spend despite being 9% of the user base. They were using the agent as a batch-processing API, looping it across an inventory of invoices overnight. Once that pattern was visible in the cost report, the product team had the conversation about either tiering the pricing or moving that workload to a different (cheaper) endpoint. Three weeks of decision-making, one Slack thread of negotiation, and the workload moved to a direct Azure OpenAI batch call. The agent's per-call spend dropped 36% the following month.

Where we are now

Every agent call has a trace, end to end, with token usage attributed to the tenant that initiated the call. The dashboard shows what is happening in the last hour, the last day, and the last month, and there is an alert for the failure mode that taught us we needed any of this. The mean-time-to-detect for tool degradations has gone from four days (the one time we noticed at all) to the fifteen minutes it takes the alert to fire. The mean-time-to-diagnose has gone from "give up and ship a patch" to "click the trace, read the span attributes, page the right team." We replay traces during postmortems by trace ID, walk through what the agent was thinking step by step, and the conversation about agent failures is now grounded in evidence instead of in screenshots from frustrated customers.

The part I underweighted at the start was the cultural one. Before tracing, the AI team's defence of agent behaviour was "the model decided to," which is unfalsifiable and frustrating. After tracing, the conversation is structural: "the agent did three steps, the second tool call returned null, here is the trace, here is the downstream service log that explains the null." That is a debuggable system, and a debuggable system is one the rest of the org can hold to a quality bar rather than treat as a mysterious black box that occasionally embarrasses us. The 320 escalations we ate in October would have been 320 traces with clear root cause, none of which would have generated a customer complaint in the first place, because we would have caught the underlying invoice-service issue at the fifteen-minute mark.

Four days of silent failure, six panels, one alert. The work is small; the leverage is large. The next step is the same instrumentation across the other three Foundry agents in the org, sharing the Workbook template, so the silent-failure pattern is detectable on day one for whichever team ships an agent next.