
Azure OpenAI behind APIM: per-tenant token budgets, streaming, and a circuit breaker that actually breaks

One tenant burned 2.1 million tokens in 19 minutes and the shared deployment went 429 for everyone. This is the APIM-fronted design that turned a noisy neighbour into one tenant's problem only.


The page in the on-call channel said "OpenAI 429s, all tenants, can't ship". The timestamp was 14:22. By 14:24 we knew it was one customer. By 14:31 we had the number: tenant aurora-uk had burned 2.1 million tokens in 19 minutes through a runaway agent stuck in a tool-use loop, each turn calling a search tool whose result triggered another search, no termination condition. The shared Azure OpenAI deployment for that region had a regional quota of 240k tokens per minute. Aurora was eating roughly 110k per minute on their own. The other seventeen tenants were getting HTTP 429 - rate limit exceeded - Retry-After: 12 and their UIs were spinning. Customer success spent the afternoon on apology calls.

This is the rebuild. Azure API Management in front of two Azure OpenAI deployments, a per-tenant token budget keyed on a JWT claim, a per-month quota in Cosmos DB, semantic caching for the prompts that repeat, and a circuit breaker on the backend that fails over to a different region when the primary stops behaving. Streaming responses pass through unchanged. The runaway-tenant problem cannot happen the same way again: aurora's per-minute cap is now 30k, and the moment they hit it the gateway returns a 429 to them and only to them.

Eight months in: 22 production tenants, peak 14M tokens per day across all of them, semantic cache hit rate around 18%, two real failover events from a primary-region outage in March that lasted six minutes each, zero tenant-induced outages on other tenants.

Why APIM specifically, and not a DIY proxy

The first prototype was, of course, a Python FastAPI proxy. It worked for a week of internal testing and then started revealing its limits. SSE streaming was the one that bit hardest. We had it half-working: the response would stream, but our token counter only got the final chunk's usage field, and only when stream_options.include_usage=true was set. Earlier prompts were billing tokens we never measured. Then we added chunk parsing and hit the second problem: Python asyncio plus uvicorn plus the OpenAI SDK plus our middleware made a five-deep call stack where dropped TCP connections turned into 30-second hangs instead of immediate ECONNRESET. Then we added JWT validation, per-tenant cache keys, retry on 429, multi-region failover. By month two we had built a worse version of API Management.

API Management has tested SSE handling, gateway-class throughput (we sustain about 80 requests per second across the cluster), built-in JWT validation, region failover via backend resources, a circuit breaker that already knows how to read Retry-After headers from Azure OpenAI, and an audit trail in Application Insights the security team accepts as evidence without a follow-up question. It also has a purpose-built azure-openai-token-limit policy that counts tokens correctly across streaming and non-streaming requests, including the prompt-token estimation that lets us reject overcap requests before they hit the model.

The licensing cost is real (Premium for multi-region) but the migration paid for it in three months. Customer success no longer fields outage tickets caused by other tenants' usage; that line item alone covered the bill.

The shape of the system

Client SDKs (Python, TypeScript, internal Java service) hit a single endpoint: https://gateway.example.com/openai/v1/chat/completions. The client carries a JWT issued by our identity service. Inside the JWT, the claim that matters is tenant_id, a stable opaque string like aurora-uk or helix-de.

Inside APIM, every request walks through the same inbound pipeline:

  1. validate-jwt against our issuer's JWKS
  2. extract tenant_id into a context variable
  3. semantic cache lookup keyed on tenant + embedded prompt
  4. per-minute token-limit check, keyed on tenant_id
  5. per-month quota check, queried from Cosmos DB
  6. route to the primary Azure OpenAI backend

If the request makes it through, APIM forwards to Azure OpenAI. On the way back:

  1. count actual tokens used (from the response, including streaming chunks)
  2. increment the Cosmos counter
  3. store the response in the semantic cache (if it qualifies)
  4. set response headers x-tenant-tokens-remaining, x-tenant-monthly-remaining, x-cache-hit

If the primary backend starts misbehaving (consecutive 5xx, or repeated 429s with Retry-After), the backend resource's circuit breaker trips and requests are routed to the fallback Azure OpenAI deployment in a different region. The tenant sees a slightly slower response but no error.

Everything lands in Application Insights with tenant_id as a custom dimension, which means the per-tenant dashboard is one KQL query away.

The per-minute token limit, keyed on JWT claim

This is the piece that fixes the original incident. The azure-openai-token-limit policy can be keyed on any expression, including a JWT claim. We pull tenant_id out of the validated JWT and use it as the counter key.

<inbound>
  <validate-jwt header-name="Authorization" failed-validation-httpcode="401"
                failed-validation-error-message="Unauthorized"
                output-token-variable-name="validated-jwt">
    <openid-config url="https://auth.example.com/.well-known/openid-configuration" />
    <required-claims>
      <claim name="tenant_id" match="any" />
    </required-claims>
  </validate-jwt>

  <set-variable name="tenant_id"
                value="@((string)((Jwt)context.Variables["validated-jwt"]).Claims["tenant_id"].FirstOrDefault())" />

  <choose>
    <when condition="@(string.IsNullOrEmpty((string)context.Variables["tenant_id"]))">
      <return-response>
        <set-status code="400" reason="Missing tenant_id claim" />
      </return-response>
    </when>
  </choose>

  <azure-openai-token-limit
      counter-key="@((string)context.Variables["tenant_id"])"
      tokens-per-minute="30000"
      estimate-prompt-tokens="true"
      remaining-tokens-variable-name="remainingTokens"
      remaining-tokens-header-name="x-tenant-tokens-remaining"
      tokens-consumed-header-name="x-tenant-tokens-consumed" />
</inbound>

A few details that matter and are easy to get wrong.

The counter-key must be a stable string for the tenant. We use the opaque tenant_id and not the JWT sub claim, because sub is per-user; one tenant can have hundreds of users, and we want their per-minute budget pooled. Pick the wrong claim and you either have one tenant whose loudest user starves the rest of that tenant (too granular), or all tenants sharing a single counter, which is exactly the arrangement the runaway-tenant incident proved disastrous.

Setting estimate-prompt-tokens="true" is what makes this policy useful for streaming. Without it, the policy can only count after the response comes back, which is too late. With it, APIM estimates the prompt tokens before forwarding, adds them to the counter, and rejects the request immediately with 429 if the tenant is over budget. The estimate is not exact; more on that below.

The remaining-tokens-header-name is the value the client sees. We expose it deliberately so tenants can see their own budget shrinking in real time. The header looks like x-tenant-tokens-remaining: 184221. Tenants that hit a hard budget can self-throttle without having to call us.

The 30k-per-minute number is the per-tenant default. Three of our larger tenants negotiated higher limits; we set theirs via a policy fragment that loads the figure from a tenant-config Cosmos collection. The same azure-openai-token-limit block, different tokens-per-minute. We will go through that in a moment.

A separate gotcha that took us a week to find: azure-openai-token-limit counters are per-region. If you deploy APIM to multiple regions (we run primary in West Europe and standby in North Europe) the counter for aurora-uk in West Europe is independent of the counter for aurora-uk in North Europe. This is documented but easy to skim past in the multi-region docs. For us this is acceptable because failover is rare and a tenant getting briefly double-budgeted during a regional incident is not a problem worth solving. If it were, we would need to externalise the counter to a shared Redis or Cosmos, at which point the policy becomes hand-rolled and we lose the built-in streaming integration. The tradeoff was not worth it for us.

Per-tenant overrides via a policy fragment

For tenants on bigger plans, we load config from Cosmos and rewrite the limit before the policy runs.

<send-request mode="new" response-variable-name="tenantConfig" timeout="2" ignore-error="true">
  <set-url>@($"https://tenants.cosmos.azure.com/colls/limits/docs/{(string)context.Variables["tenant_id"]}")</set-url>
  <set-method>GET</set-method>
  <authentication-managed-identity resource="https://cosmos.azure.com" />
</send-request>

<set-variable name="tpmLimit" value="@{
  var resp = (IResponse)context.Variables["tenantConfig"];
  if (resp == null || resp.StatusCode != 200) return 30000;
  return (int)(resp.Body.As<JObject>()["tokens_per_minute"] ?? 30000);
}" />

<azure-openai-token-limit
    counter-key="@((string)context.Variables["tenant_id"])"
    tokens-per-minute="@((int)context.Variables["tpmLimit"])"
    estimate-prompt-tokens="true"
    remaining-tokens-header-name="x-tenant-tokens-remaining" />

The ignore-error="true" plus the default of 30000 is deliberate: if Cosmos is unreachable we degrade to the default cap rather than failing the request, which is the right tradeoff for an inflight LLM call. The 2-second timeout is also deliberate; the Cosmos read is on the hot path, so a slow lookup translates directly into perceived latency for the client.

The monthly cap

azure-openai-token-limit is per-minute, not per-month. The monthly cap is a separate concern with a different access pattern: read once on inbound, written once on outbound, eventual consistency is fine. Cosmos DB, partitioned by tenant_id, one document per tenant per calendar month, id {tenant_id}-{YYYY-MM}.
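For orientation, this is roughly the shape of one usage document. Only the id pattern and the tokens_used field are what the policies below rely on; the rest is an illustrative sketch of our own schema, expressed as Python for readability:

# Illustrative shape of one monthly usage document.
# Only the id pattern and tokens_used are relied on by the policies below;
# everything else is an assumption about the schema.
from datetime import datetime, timezone

def month_doc(tenant_id: str) -> dict:
    month = datetime.now(timezone.utc).strftime("%Y-%m")
    return {
        "id": f"{tenant_id}-{month}",   # e.g. "aurora-uk-2025-03"
        "tenant_id": tenant_id,          # partition key
        "tokens_used": 0,                # incremented atomically on outbound
    }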

Inbound, we read the monthly counter and short-circuit with HTTP 402 if the cap is hit:

<set-variable name="month_key"
              value="@($"{(string)context.Variables["tenant_id"]}-{DateTime.UtcNow:yyyy-MM}")" />

<send-request mode="new" response-variable-name="monthDoc" timeout="2" ignore-error="true">
  <set-url>@($"https://meter.cosmos.azure.com/colls/usage/docs/{(string)context.Variables["month_key"]}")</set-url>
  <set-method>GET</set-method>
  <authentication-managed-identity resource="https://cosmos.azure.com" />
</send-request>

<set-variable name="monthlyUsed" value="@{
  var resp = (IResponse)context.Variables["monthDoc"];
  if (resp == null || resp.StatusCode != 200) return 0L;
  return (long)(resp.Body.As<JObject>()["tokens_used"] ?? 0);
}" />

<!-- monthlyCap holds the tenant's monthly allowance, loaded earlier in the pipeline (e.g. from the tenant-config lookup) -->
<choose>
  <when condition="@((long)context.Variables["monthlyUsed"] >= (long)context.Variables["monthlyCap"])">
    <return-response>
      <set-status code="402" reason="Monthly token cap exceeded" />
      <set-header name="x-tenant-monthly-remaining"><value>0</value></set-header>
    </return-response>
  </when>
</choose>

Outbound, we atomically increment the counter with the actual tokens billed by the model:

<send-request mode="new" response-variable-name="incrResp" timeout="2" ignore-error="true">
  <set-url>@($"https://meter.cosmos.azure.com/colls/usage/docs/{(string)context.Variables["month_key"]}")</set-url>
  <set-method>PATCH</set-method>
  <set-header name="Content-Type"><value>application/json-patch+json</value></set-header>
  <set-body>@{
    var consumed = int.Parse(context.Response.Headers.GetValueOrDefault("x-tenant-tokens-consumed", "0"));
    // Cosmos partial document update expects an "operations" wrapper around the patch ops
    return new JObject(
      new JProperty("operations", new JArray(new JObject(
        new JProperty("op", "incr"),
        new JProperty("path", "/tokens_used"),
        new JProperty("value", consumed))))).ToString();
  }</set-body>
  <authentication-managed-identity resource="https://cosmos.azure.com" />
</send-request>

Cosmos PATCH with incr is atomic, so two simultaneous requests from the same tenant cannot lose an increment. The 402 on monthly overshoot is deliberately not 429: 429 implies "try again soon" and a monthly cap is a billing event, not a rate-limit event. Our client SDKs distinguish them, and 402 surfaces in tenant dashboards as "monthly cap reached, contact us to extend". Two tenants have hit it; both became upgrade conversations.
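A sketch of the client-side distinction, with illustrative names and retry behaviour rather than our SDK's actual code:

import time

import httpx

GATEWAY = "https://gateway.example.com/openai/v1/chat/completions"

def call_gateway(payload: dict, jwt: str) -> httpx.Response:
    resp = httpx.post(GATEWAY, json=payload,
                      headers={"Authorization": f"Bearer {jwt}"}, timeout=60)
    if resp.status_code == 429:
        # Rate-limit event: transient, honour the advertised backoff and try again.
        time.sleep(int(resp.headers.get("Retry-After", "1")))
        return call_gateway(payload, jwt)
    if resp.status_code == 402:
        # Billing event: retrying will not help, surface it to the tenant dashboard.
        raise RuntimeError("Monthly token cap reached; contact us to extend")
    resp.raise_for_status()
    return resp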

Streaming, the part everyone gets wrong

The Azure OpenAI Chat Completions endpoint streams via server-sent events when stream=true. Each chunk is a data: { ... }\n\n line; the final chunk before data: [DONE] carries the usage object when stream_options.include_usage is true. APIM's azure-openai-token-limit policy understands this pattern natively, parses the SSE stream as it passes through, extracts token counts, and updates the counter. We did not have to write any of that.

What we did have to do was make sure the client got the stream chunk-by-chunk and not as a buffered blob, which is the easy mistake when you start adding policies. Do not wrap the response in any transformation that reads the body. The moment a policy reads context.Response.Body, APIM buffers, and the client sees the entire response arrive in one chunk after a multi-second wait. Our outbound policy explicitly leaves the body alone:

<outbound>
  <base />
  <!-- DO NOT read context.Response.Body here. Streaming will be ruined. -->
  <set-header name="x-tenant-tokens-remaining" exists-action="override">
    <value>@(context.Variables.GetValueOrDefault<string>("remainingTokens", "unknown"))</value>
  </set-header>
  <set-header name="x-cache-hit" exists-action="override">
    <value>@(context.Variables.GetValueOrDefault<string>("cacheHit", "false"))</value>
  </set-header>
</outbound>

The token increment to Cosmos happens after the stream completes; APIM serialises the outbound section to run after the final chunk has been forwarded, so the client perceives the full streamed response and we still get the bookkeeping. During a long stream the x-tenant-tokens-remaining header reflects the value at request-start, not request-end. For per-second precision a client would have to re-query a separate endpoint, which one of our larger tenants does. Most clients are fine with the inbound estimate.

Semantic caching, eighteen percent and counting

Repeated and near-duplicate prompts are common in our workload. Customer support copilots ask "summarise this ticket" a thousand times a day across very similar tickets; doc-search agents send the same canonical question phrased a dozen ways. Caching these by exact-match of the prompt string is useless; caching by embedding similarity is the right shape.

API Management has semantic caching for Azure OpenAI built in. The lookup policy embeds the incoming prompt, searches the cache for entries within a configured similarity threshold, and returns a cached response if a match exists.

<inbound>
  <azure-openai-semantic-cache-lookup
      score-threshold="0.92"
      embeddings-backend-id="embeddings-backend"
      embeddings-backend-auth="system-assigned"
      ignore-system-messages="true"
      max-message-count="10">
    <vary-by>@((string)context.Variables["tenant_id"])</vary-by>
  </azure-openai-semantic-cache-lookup>
</inbound>

<outbound>
  <azure-openai-semantic-cache-store duration="3600" />
</outbound>

Two things to call out.

vary-by keyed on tenant_id is non-negotiable in a multi-tenant system. Without it, tenant A's cached response can be returned to tenant B for a similar-enough prompt, which is a data-leak class problem. We test this with synthetic prompts on every deployment and the test fails the deploy if a cross-tenant hit occurs.
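The deploy-gate test is roughly this shape. The gateway URL and the x-cache-hit header are as described above; the env-var names, prompt, and fixture setup are illustrative:

import os

import httpx

GATEWAY = "https://gateway.example.com/openai/v1/chat/completions"
PROMPT = {"model": "gpt-4o",
          "messages": [{"role": "user", "content": "summarise ticket #4823"}]}

def ask(jwt: str) -> httpx.Response:
    return httpx.post(GATEWAY, json=PROMPT,
                      headers={"Authorization": f"Bearer {jwt}"}, timeout=60)

def test_no_cross_tenant_cache_hit():
    # Warm the cache as tenant A, then send a near-identical prompt as tenant B.
    ask(os.environ["TENANT_A_JWT"])
    second = ask(os.environ["TENANT_B_JWT"])
    # With vary-by tenant_id in place, tenant B must never be served a response
    # seeded by tenant A's traffic.
    assert second.headers.get("x-cache-hit", "false") == "false"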

The score threshold of 0.92 is the result of tuning. We started at 0.85 (the default in the docs) and got hit rate around 32%, but support flagged a handful of mismatches where the cached completion was technically related to the prompt but answered a slightly different question. We tightened to 0.92, hit rate dropped to 18%, mismatch reports went to zero in three months. The right threshold is workload-dependent. Anyone who tells you "use the default" is not running this in production.

The cache costs us about 600 USD per month in embedding tokens (we use text-embedding-3-small, cheap). It saves us roughly 12,000 USD per month in completion tokens. The ratio is not subtle.

The circuit breaker that actually breaks

The original incident had one Azure OpenAI deployment behind the gateway, in West Europe. When the regional outage hit in March, that deployment returned 5xx for six minutes. Without a fallback, the gateway just propagated the errors.

The fix is a backend circuit breaker on the primary backend resource, combined with a set-backend-service decision in the inbound policy that switches to the fallback when the breaker trips. The breaker definition lives in Bicep:

resource primaryBackend 'Microsoft.ApiManagement/service/backends@2023-09-01-preview' = {
  parent: apim
  name: 'aoai-westeurope-primary'
  properties: {
    url: 'https://aoai-westeurope.openai.azure.com/openai'
    protocol: 'http'
    circuitBreaker: {
      rules: [
        {
          name: 'trip-on-5xx-and-retry-after'
          failureCondition: {
            count: 5
            interval: 'PT1M'
            statusCodeRanges: [
              { min: 500, max: 599 }
            ]
          }
          tripDuration: 'PT1M'
          acceptRetryAfter: true
        }
      ]
    }
  }
}

resource fallbackBackend 'Microsoft.ApiManagement/service/backends@2023-09-01-preview' = {
  parent: apim
  name: 'aoai-northeurope-fallback'
  properties: {
    url: 'https://aoai-northeurope.openai.azure.com/openai'
    protocol: 'http'
    circuitBreaker: {
      rules: [
        {
          name: 'trip-on-5xx-and-retry-after'
          failureCondition: {
            count: 5
            interval: 'PT1M'
            statusCodeRanges: [
              { min: 500, max: 599 }
            ]
          }
          tripDuration: 'PT1M'
          acceptRetryAfter: true
        }
      ]
    }
  }
}

The acceptRetryAfter: true is the critical bit. When Azure OpenAI returns a 429 with a Retry-After: 1200 header (which it absolutely does under sustained quota pressure), the circuit breaker reads that value and waits exactly that long before retrying. Without that flag the breaker would trip on its configured tripDuration of one minute, retry immediately after, get another Retry-After, trip again, and you have a thrash loop. Microsoft documents this gotcha explicitly on the backends page.

We set the primary backend in the inbound section; the failover lives in the backend section, where a retry switches to the fallback only after the first attempt has come back 503:

<set-backend-service backend-id="aoai-westeurope-primary" />

<retry
    condition="@(context.Response.StatusCode == 503)"
    count="1"
    interval="0">
  <set-backend-service backend-id="aoai-northeurope-fallback" />
  <forward-request />
</retry>

When the primary's breaker is tripped, APIM returns 503 immediately on the first attempt. The retry block catches it, switches the backend, and re-forwards. The client sees a single response from the fallback region, with maybe 200ms of added latency. The two failover events in March were invisible to all but one tenant who was running a millisecond-sensitive benchmark and noticed the bump in their p99.

Five consecutive 5xx in a minute is a reasonable trip threshold for our traffic volume. For lower-traffic gateways, three may be a better number, because waiting for five errors to accumulate at low QPS means a long stretch of broken requests before the breaker actually trips. Tune to your traffic.

The Bicep that ties it all together

The two Azure OpenAI accounts, Premium APIM with a second location, and the role assignment that lets APIM call OpenAI without keys:

param location string = 'westeurope'
param fallbackLocation string = 'northeurope'

resource aoaiPrimary 'Microsoft.CognitiveServices/accounts@2024-04-01-preview' = {
  name: 'aoai-${location}'
  location: location
  kind: 'OpenAI'
  sku: { name: 'S0' }
  properties: {
    customSubDomainName: 'aoai-${location}'
    publicNetworkAccess: 'Disabled'
  }
}

resource gpt4 'Microsoft.CognitiveServices/accounts/deployments@2024-04-01-preview' = {
  parent: aoaiPrimary
  name: 'gpt-4o-2024-08-06'
  sku: { name: 'ProvisionedManaged', capacity: 100 }
  properties: {
    model: { format: 'OpenAI', name: 'gpt-4o', version: '2024-08-06' }
  }
}

resource aoaiFallback 'Microsoft.CognitiveServices/accounts@2024-04-01-preview' = {
  name: 'aoai-${fallbackLocation}'
  location: fallbackLocation
  kind: 'OpenAI'
  sku: { name: 'S0' }
  properties: { customSubDomainName: 'aoai-${fallbackLocation}' }
}

resource apim 'Microsoft.ApiManagement/service@2023-09-01-preview' = {
  name: 'apim-aoai-gateway'
  location: location
  sku: { name: 'Premium', capacity: 1 }
  identity: { type: 'SystemAssigned' }
  properties: {
    publisherEmail: 'platform@example.com'
    publisherName: 'Platform'
    additionalLocations: [
      { location: fallbackLocation, sku: { name: 'Premium', capacity: 1 } }
    ]
  }
}

resource aoaiRole 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  scope: aoaiPrimary
  name: guid(apim.id, aoaiPrimary.id, 'cognitive-services-user')
  properties: {
    principalId: apim.identity.principalId
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions',
      'a97b65f3-24c7-4388-baec-2e87135dc908') // Cognitive Services User
    principalType: 'ServicePrincipal'
  }
}

APIM's system-assigned managed identity holds Cognitive Services User on both Azure OpenAI accounts (the fallback's role assignment mirrors the one shown for the primary), so there are no API keys anywhere in this stack. The primary uses Provisioned Throughput (100 PTUs of gpt-4o), which gives us predictable latency and a fixed monthly cost. The fallback runs Standard pay-as-you-go, since we only use it during the few minutes a year the primary is down.

The estimate-actual mismatch

Halfway through month two we noticed our Cosmos-tracked monthly usage was running 5 to 7 percent lower than the billing line on the Azure portal. Investigation showed azure-openai-token-limit's prompt-token estimate is based on the request body, and our requests include tool definitions and tool results in the message array, which the estimator counts conservatively but does not match the model's actual tokeniser output exactly. For pure-text prompts the estimate is within half a percent. For tool-use-heavy prompts (our agent traffic) the estimate undercounts by 5 to 7 percent because tool-call JSON has nontrivial token cost.

Two ways to handle this. First option: turn off estimation (estimate-prompt-tokens="false") and count only on the response, accepting that a tenant near their cap could occasionally overshoot by one request's worth of tokens before the policy notices. We do this for tenants with relaxed caps where the eventual-consistency is fine.

Second option: apply a correction factor. The outbound increment stays untouched (the model-reported count is the billed number, so there is nothing to correct there); the margin goes on the inbound prompt estimate instead. Specifically, when the request contains tool definitions or tool results, we bump the estimate by 8% before letting the token-limit policy see it.

The implementation is a small set-variable block that inspects the request body for tools or tool_calls keys and sets a multiplier, then the rate-limit policy is configured with tokens-per-minute reduced by that factor at policy-bind time. It is the ugly part of this system. We are watching the policy roadmap for a cleaner solution; the genai gateway capabilities are evolving fast.
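Expressed outside policy XML for readability, the decision is roughly this. The function name and the exact body checks are illustrative, and the margin is shown as a shrunken budget rather than an inflated estimate, which is the same correction seen from the other side:

def effective_tpm(base_tpm: int, request_body: dict) -> int:
    # Tool-heavy requests are the ones the estimator undercounts by 5-7%.
    tool_heavy = bool(request_body.get("tools")) or any(
        m.get("tool_calls") or m.get("role") == "tool"
        for m in request_body.get("messages", [])
    )
    # Bumping the prompt estimate by 8% is equivalent to dividing the budget by 1.08.
    return int(base_tpm / 1.08) if tool_heavy else base_tpm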

A client view

From the SDK side, the change is almost invisible. The tenant points at our gateway URL instead of Azure OpenAI directly, attaches a Bearer JWT, and otherwise calls the API the same way. Their token-budget visibility is the new headers.

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://gateway.example.com/openai",
    azure_ad_token=os.environ["TENANT_JWT"],
    api_version="2024-06-01",
)

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "summarise ticket #4823"}],
    stream=True,
    stream_options={"include_usage": True},
)

# Grab the response headers; the OpenAI SDK exposes them on .response after the call
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

# After the stream completes:
headers = stream.response.headers
print(f"\n[tokens-remaining-this-minute: {headers.get('x-tenant-tokens-remaining')}]")
print(f"[tokens-remaining-this-month: {headers.get('x-tenant-monthly-remaining')}]")
print(f"[cache-hit: {headers.get('x-cache-hit')}]")

A run mid-burst might print [tokens-remaining-this-minute: 184221], which is the value tenants use to self-throttle. A 429 response carries the same header set to 0 plus the standard Retry-After. Tenants that want pre-emptive throttling do not have to wait for the 429; they can watch the header decline and pause before they hit the wall.
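A pre-emptive throttle on top of that header is only a few lines; the low-water mark and pause length here are illustrative, not what our SDKs ship:

import time

LOW_WATER_MARK = 2_000  # pause when fewer than ~2k tokens remain in the current minute

def maybe_pause(headers) -> None:
    # Watch the per-minute budget shrink and back off before the gateway has to 429 us.
    remaining = headers.get("x-tenant-tokens-remaining", "")
    if remaining.isdigit() and int(remaining) < LOW_WATER_MARK:
        time.sleep(5)  # the counter is per-minute, so a short pause is enough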

Troubleshooting

HTTP 429 - rate limit exceeded - Retry-After: 12 from a tenant who claims they are nowhere near their cap. Almost always: their JWT is missing the tenant_id claim or has it spelled differently (we have seen tenantId, tenant-id, tid). The policy treats null as an empty string and collapses all such requests into one counter, which fills up in seconds. The 400 response from the policy when tenant_id is null exists to make this obvious; tenants who ignore the 400 and retry end up here.

Unable to read stream from upstream: ECONNRESET in the APIM trace, with the client seeing a truncated response. Usually Azure OpenAI has hit its regional capacity limit and is dropping mid-stream connections. The circuit breaker should catch this on subsequent requests, but the in-flight one is lost. Our SDK retries on this specific error class. If you see a cluster of these in a five-minute window, check the Azure OpenAI deployment's regional metrics; you are probably near the regional cap and not the per-deployment cap.

Token count mismatch: APIM counted 4180, OpenAI billed 4612 flagged by a tenant who reconciles their bills against our usage exports. This is the estimator-vs-actual gap described above. The mitigation is the inbound estimate margin. We reconcile monthly and apply credits if the gap exceeds 3% for a given tenant.

BackendCircuitBreakerTripped in the APIM diagnostics, followed by 503s with no fallback engaging. The retry block is misconfigured; check that the condition matches the status code APIM actually returns when the breaker is tripped (503, not the original 5xx). Also confirm the fallback backend itself is not also broken; a tripped fallback gives you a hard outage.

The semantic cache returned a response for a different tenant in a synthetic test. The vary-by is missing or evaluating to null. Check that the JWT is being validated before the cache lookup runs; if validation fails silently the tenant_id variable may be empty and all requests share a cache namespace. The order of policies in the inbound section is critical.

The dashboard

The Application Insights dashboard the platform team checks every morning has four panels.

Per-tenant token spend over 30 days, stacked. Aurora-uk used to sit around 40% of total spend; after the per-tenant cap was put in, they are at 11%, which is closer to their share of revenue. The previous 40% included the runaway-loop incidents that we no longer have.

Circuit-breaker events over time. Two in March (the regional outage), one in May (a transient gpt-4o capacity blip that lasted 90 seconds), zero in the months since. Each event is annotated with the tripping tenant if traffic correlates, "regional" otherwise.

Cache hit rate by tenant. The interesting outlier is a customer-support copilot tenant whose hit rate is 47%, well above the 18% average, because their workload answers near-duplicate questions all day.

Estimate-vs-actual delta by tenant. Mostly flat at 1.5% (text-heavy tenants) with spikes at 7% (agent-heavy tenants). The spikes tell us which tenants benefit most from the inbound margin correction.

The KQL behind the first panel:

ApiManagementGatewayLogs
| where TimeGenerated > ago(7d)
| where OperationName == "OpenAI Chat Completions"
| extend tenant = tostring(parse_json(BackendResponseHeaders).["x-ms-tenant-id"])
| extend tokens = toint(parse_json(BackendResponseHeaders).["x-tenant-tokens-consumed"])
| summarize sum(tokens) by tenant, bin(TimeGenerated, 1m)
| render timechart

Filtered to aurora-uk and zoomed to 19-minute windows, this is what we ran in the post-incident review to confirm the original runaway loop would now be impossible. The plotted line caps out at the 30k-per-minute ceiling and sits there until the tenant's traffic shape returns to normal. The line for every other tenant is undisturbed.

A coda

The thing I underweighted when we started this project was how much the architectural decision affected the conversations with customers. We expected the win to be operational; we got that, the on-call channel is much quieter. The unexpected win was the sales motion. When prospective customers ask about multi-tenancy guarantees, "we have a per-tenant token cap enforced at the gateway, here are the headers you'll see in your responses, here is the dashboard we'd give you" is a concrete answer the previous architecture could not produce. Two of our last four enterprise contracts cited the per-tenant isolation explicitly in the procurement document. The work had a revenue side I did not see coming.

The piece I would do differently is the estimator-actual reconciliation. We started with the assumption the policy estimate would be close enough to the model's billed token count that nobody would notice. We were wrong by 5 to 7 percent on agent traffic, which sounds small until you realise it is a tenant's worth of revenue at the largest customers. If I were doing this again, the first dashboard panel I would build is the estimate-actual delta per tenant, and the reconciliation policy would have shipped in week one rather than month two. The capability is there; it just needed the operational discipline to look for the discrepancy.

The runaway-loop incident took us four hours to recover from and twelve weeks to fix properly. Eight months in, the system has paid that back many times over. The next runaway loop, whenever it comes from whichever tenant, will be one tenant's problem only. Which is the entire point of the work.