Prompt Flow to a Managed Online Endpoint: the Azure Pipeline, the 95-question eval gate, and the 17:14 rollback

The new prompt version merged at 17:02 on a Friday, started serving 10% of traffic at 17:11, and broke the groundedness rubric for multi-hop questions at 17:12. The canary watcher hit its third failed window at 17:14 and rolled back automatically. This is the whole shape of the Prompt Flow, the Azure DevOps pipeline, the 95-question eval gate, and the canary mechanism that caught it.


The prompt change merged at 17:02 on a Friday. The deploy pipeline ran clean, the build artefact was healthy, the smoke tests passed, the canary stage moved 10% of traffic onto the new deployment at 17:11. Within a minute, the canary's running groundedness rubric score for one class of questions (the multi-hop class, where the answer needs to stitch together two retrieval hits) dipped below the gate of 2.85. The pipeline's canary watcher hit its third consecutive failed window at 17:14, flipped traffic back to the previous deployment, and paged on-call. The whole thing took 12 minutes from merge to rollback. The engineer who shipped the change saw the page before he had finished his coffee.

The point of this write-up is not "we had a rollback." The point is that the rollback was unremarkable and uncontested because the eval gate was already in the pipeline, the canary already mirrored production rubric scoring, and the previous deployment id was already known to the pipeline. Four months earlier, the same kind of regression would have shipped to 100% of traffic and stayed there until the support team noticed Monday morning. The path from "Prompt Flow on my laptop" to "Prompt Flow behind a Managed Online Endpoint with a real eval gate and a working rollback" took roughly five months of platform work. This is the whole shape of it.

The flow, briefly

The product is a retrieval-augmented assistant for an internal support team. Users ask a free-text question, the assistant answers in two to four sentences, and the answer must cite the documents it pulled from. The assistant lives behind an internal web UI which calls our Managed Online Endpoint over a single scoring URL. The flow, on disk, is a Prompt Flow project with seven nodes.

# flow.dag.yaml
inputs:
  question:
    type: string
  conversation_history:
    type: list
    default: []

outputs:
  answer:
    type: string
    reference: ${formatter.output}
  citations:
    type: list
    reference: ${citation_linker.output.citations}
  safety_verdict:
    type: string
    reference: ${safety_check.output.verdict}

nodes:
  - name: intent_classifier
    type: llm
    source:
      type: code
      path: prompts/intent.jinja2
    inputs:
      deployment_name: gpt-4o-mini
      temperature: 0.0
      question: ${inputs.question}
      conversation_history: ${inputs.conversation_history}
    connection: azure-openai-prod
    api: chat

  - name: retrieval
    type: python
    source:
      type: code
      path: nodes/retrieval.py
    inputs:
      intent: ${intent_classifier.output}
      question: ${inputs.question}
      top_k: 8
      index_name: support-kb-prod-v3

  - name: retrieval_summariser
    type: llm
    source:
      type: code
      path: prompts/summarise.jinja2
    inputs:
      deployment_name: gpt-4o-mini
      temperature: 0.0
      hits: ${retrieval.output}
      question: ${inputs.question}
    connection: azure-openai-prod
    api: chat

  - name: answer_drafter
    type: llm
    source:
      type: code
      path: prompts/draft.jinja2
    inputs:
      deployment_name: gpt-4o
      temperature: 0.2
      summary: ${retrieval_summariser.output}
      question: ${inputs.question}
      intent: ${intent_classifier.output}
    connection: azure-openai-prod
    api: chat

  - name: citation_linker
    type: python
    source:
      type: code
      path: nodes/citation_linker.py
    inputs:
      draft: ${answer_drafter.output}
      hits: ${retrieval.output}

  - name: safety_check
    type: python
    source:
      type: code
      path: nodes/safety_check.py
    inputs:
      text: ${citation_linker.output.answer}
      content_safety_connection: content-safety-prod

  - name: formatter
    type: python
    source:
      type: code
      path: nodes/formatter.py
    inputs:
      answer: ${citation_linker.output.answer}
      citations: ${citation_linker.output.citations}
      safety_verdict: ${safety_check.output.verdict}

Seven nodes is on the high side for a Prompt Flow. The internal pressure was always "fold the summariser into the drafter, fold the citation linker into the drafter, ship five nodes instead of seven." We resisted that pressure because each node is a separately addressable test surface. The eval set scores groundedness against the summariser's output independently of the drafter, citation rate against the citation linker independently of the drafter, refusal rate against the formatter, and the safety verdict against the safety check. If we folded those into one giant prompt, a regression in citation rate could come from a prompt change in the drafter, a prompt change in the summariser, or a python bug in the citation linker, and we would have no way of telling them apart from the metrics alone. Seven nodes means seven independent signals.

How I drive the flow locally

The pf CLI is the day-to-day surface. The contract is: anything that runs in the pipeline must work first via pf on a developer laptop. No "well it works in CI" allowed.

# Install the SDK and CLI into the project venv
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install promptflow promptflow-tools

# Create the local connections the flow expects
pf connection create --file connections/azure-openai-prod.yml --set api_key=$AOAI_KEY
pf connection create --file connections/content-safety-prod.yml --set api_key=$CS_KEY

# Single-shot smoke test of the whole flow with one input
pf flow test --flow . --inputs question="how do I reset the device PIN remotely"

# Single-node test (useful when iterating on one prompt)
pf flow test --flow . --node citation_linker --inputs draft='{"text":"...","spans":[]}' hits='[...]'

# Run the flow against the full eval set
pf run create \
  --flow . \
  --data ./eval/eval-set.jsonl \
  --column-mapping question='${data.question}' \
    conversation_history='${data.conversation_history}' \
  --stream \
  --name local-run-$(date +%s)

# Inspect a finished run
pf run show --name local-run-1726340012
pf run show-details --name local-run-1726340012
pf run show-metrics --name local-run-1726340012

The connections deserve a sentence. We have two Azure OpenAI connections in front of the flow: azure-openai-prod (real prod traffic) and azure-openai-eval (a separate Azure OpenAI resource with higher TPM, exclusively for the eval runner). They are physically different connections so that the eval pipeline's burst of 95 questions does not eat into the production rate limit budget the live endpoint depends on. Same model, same prompt, different upstream meter.
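For orientation, a minimal sketch of what one of those connection files looks like, assuming the stock Prompt Flow Azure OpenAI connection schema; the resource URL, api_version, and key placeholder are illustrative, not our real values:

# connections/azure-openai-eval.yml (sketch; URL, api_version and key are placeholders)
$schema: https://azuremlschemas.azureedge.net/promptflow/latest/AzureOpenAIConnection.schema.json
name: azure-openai-eval
type: azure_open_ai
api_base: https://<eval-resource-name>.openai.azure.com/
api_type: azure
api_version: "2024-02-01"
api_key: "<set at create time via --set api_key=...>"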

The 95 questions

The eval set is hand-written. Not synthetic, not augmented, not LLM-generated. Twelve support engineers spent four afternoons in a meeting room generating questions, gold answers, and class labels. The final set is 95 questions long. The decomposition:

{
  "factual":    34,   // single-hop, factually correct retrieval expected
  "procedural": 22,   // step-by-step instructions, ordering matters
  "edge":       11,   // niche product configurations, retrieval is sparse
  "multi_hop":  17,   // answer requires stitching two or more retrieval hits
  "ambiguous":  11    // intentionally underspecified, refusal is the right answer
}

Each eval item is a row in eval-set.jsonl:

{"question":"how do I reset the device PIN remotely for a user whose laptop is in tamper-locked state","conversation_history":[],"gold_answer":"Open the MDM console, navigate to Devices > Tamper response, select the device, choose Issue recovery PIN. The user receives the PIN by SMS. The PIN expires in 30 minutes.","class":"procedural","expected_citations":["mdm-tamper-recovery","mdm-pin-issuance"]}
{"question":"is it possible to disable two-factor on a service account","conversation_history":[],"gold_answer":"The platform does not allow disabling two-factor on any human-bound or service-bound account. The supported pattern is to use a workload identity instead.","class":"factual","expected_citations":["security-policy-2fa","workload-identity-overview"]}
{"question":"will the new dashboard work","conversation_history":[],"gold_answer":"Refuse: the question is too ambiguous to answer without more context.","class":"ambiguous","expected_citations":[]}

The class label is the load-bearing field. The eval gate is not "average score above X." The gate is per-class. Multi-hop questions are allowed to score lower in absolute terms because they are harder; we set the bar separately. Ambiguous questions are scored on whether the formatter produces a refusal, not on the content of the answer. Each class has its own threshold, and the gate fails if any class breaks.

The eval rubric and the gpt-4o judge

The interesting part of the eval is the judge node. We use a Prompt Flow evaluation flow, a second flow whose job is to score the main flow's outputs. The judge is gpt-4o with temperature 0 and a deterministic JSON output schema.

# eval/eval-flow.dag.yaml
inputs:
  question:
    type: string
  gold_answer:
    type: string
  predicted_answer:
    type: string
  predicted_citations:
    type: list
  expected_citations:
    type: list
  class:
    type: string

outputs:
  groundedness:
    type: double
    reference: ${rubric_judge.output.groundedness}
  factual_alignment:
    type: double
    reference: ${rubric_judge.output.factual_alignment}
  citation_correctness:
    type: double
    reference: ${rubric_judge.output.citation_correctness}
  refusal_correct:
    type: bool
    reference: ${refusal_check.output.correct}

nodes:
  - name: rubric_judge
    type: llm
    source:
      type: code
      path: prompts/rubric_judge.jinja2
    inputs:
      deployment_name: gpt-4o
      temperature: 0.0
      response_format: json_object
      question: ${inputs.question}
      gold_answer: ${inputs.gold_answer}
      predicted_answer: ${inputs.predicted_answer}
      predicted_citations: ${inputs.predicted_citations}
      expected_citations: ${inputs.expected_citations}
    connection: azure-openai-eval
    api: chat

  - name: refusal_check
    type: python
    source:
      type: code
      path: nodes/refusal_check.py
    inputs:
      predicted_answer: ${inputs.predicted_answer}
      class: ${inputs.class}

The rubric_judge.jinja2 prompt is 180 lines long and we treat it as production code. It scores three independent axes on a 0 to 3 integer scale, with explicit anchor descriptions for what 0, 1, 2, and 3 mean on each axis. The 0/1/2/3 anchors are important; a "score 0 to 1 in floating point" rubric is too sloppy for a judge to be consistent across runs. Integer anchors with definitions stay consistent across thousands of judge calls.
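To make the anchor style concrete, here is a heavily compressed sketch of the shape of that prompt. It is not the production file; only the groundedness anchors are shown and the wording is illustrative. The input variables are the ones the eval flow passes in.

{# prompts/rubric_judge.jinja2 -- compressed sketch, not the real 180-line prompt #}
system:
You grade a support assistant's answer against a gold answer. Score each axis
with an integer from 0 to 3 using the anchors below. Return JSON only.

groundedness anchors:
  0 = contradicts the gold answer or invents facts not present in the retrieval hits
  1 = partially supported; at least one material claim is unsupported
  2 = supported, but omits a material part of the gold answer
  3 = fully supported and complete relative to the gold answer

(factual_alignment and citation_correctness have their own anchor blocks)

Return: {"groundedness": <int>, "factual_alignment": <int>, "citation_correctness": <number>}

user:
Question: {{ question }}
Gold answer: {{ gold_answer }}
Predicted answer: {{ predicted_answer }}
Predicted citations: {{ predicted_citations }}
Expected citations: {{ expected_citations }}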

The mean of those integer scores becomes the per-class metric. The gates we ship today:

# eval/gate.py
THRESHOLDS = {
    "factual":    {"groundedness": 2.90, "citation_correctness": 0.95},
    "procedural": {"groundedness": 2.85, "citation_correctness": 0.92},
    "edge":       {"groundedness": 2.60, "citation_correctness": 0.85},
    "multi_hop":  {"groundedness": 2.85, "citation_correctness": 0.90},
    "ambiguous":  {"refusal_rate": 0.85},
}

def evaluate_run(metrics_by_class: dict) -> tuple[bool, list[str]]:
    failures = []
    for cls, gate in THRESHOLDS.items():
        observed = metrics_by_class.get(cls, {})
        for metric, threshold in gate.items():
            value = observed.get(metric)
            if value is None:
                failures.append(f"{cls}: missing metric {metric}")
                continue
            if value < threshold:
                failures.append(
                    f"{cls}: {metric}={value:.3f} below threshold {threshold:.3f}"
                )
    return (len(failures) == 0, failures)

The script reads the per-class aggregates out of the pf run show-metrics JSON, applies the gate, and either exits 0 or exits 1 with the failure list piped to the pipeline log. The canary watcher applies the same per-class, per-metric shape to live traffic, which is why the 17:14 rollback (the one this whole article opened with) reduces to a single line of that form: multi_hop groundedness at 2.66 against a threshold of 2.85. Single class, single metric, well clear of the noise floor.
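The wiring around evaluate_run is tiny. A sketch of the entry point, assuming per-class.json is a JSON object keyed by class name, which is the shape aggregate_by_class.py hands it in the pipeline below:

# eval/gate.py (entry point sketch; evaluate_run and THRESHOLDS as defined above)
import json
import sys

if __name__ == "__main__":
    # per-class.json looks like {"factual": {"groundedness": 2.93, ...}, "multi_hop": {...}, ...}
    with open(sys.argv[1]) as fh:
        metrics_by_class = json.load(fh)

    passed, failures = evaluate_run(metrics_by_class)
    for failure in failures:
        print(f"GATE FAIL {failure}")
    sys.exit(0 if passed else 1)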

The Azure DevOps pipeline

The pipeline has four stages: build, eval, deploy, canary (the traffic shift). Each one fails fast and writes its artefacts where the next stage can find them.

# azure-pipelines.yml
trigger:
  branches:
    include: [main]
pr:
  branches:
    include: [main]

pool:
  vmImage: ubuntu-latest

variables:
  workspaceName: 'mlw-assistant-prod'
  resourceGroup: 'rg-assistant-prod-eus2'
  endpointName: 'support-assistant-prod'
  serviceConnection: 'sc-mlw-assistant-prod'
  flowDir: 'flows/support-assistant'

stages:
  - stage: Build
    displayName: 'Containerise flow'
    jobs:
      - job: BuildFlow
        steps:
          - checkout: self
          - task: UsePythonVersion@0
            inputs: { versionSpec: '3.11' }
          - script: |
              pip install promptflow promptflow-tools promptflow-azure
            displayName: Install pf
          - script: |
              cd $(flowDir)
              pf flow validate --source .
              pf flow build --source . --output ./build --format docker
            displayName: 'pf flow build (docker)'
          - publish: $(flowDir)/build
            artifact: flow-build

  - stage: Eval
    displayName: 'Run 95-question eval'
    dependsOn: Build
    jobs:
      - job: EvalRun
        timeoutInMinutes: 30
        steps:
          - download: current
            artifact: flow-build
          - task: AzureCLI@2
            displayName: 'pf run against eval set'
            inputs:
              azureSubscription: $(serviceConnection)
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                set -euo pipefail
                az extension add -n ml -y
                pip install promptflow promptflow-tools promptflow-azure

                RUN_NAME="eval-$(Build.BuildId)-$(Build.SourceVersion)"
                pf run create \
                  --flow $(flowDir) \
                  --data eval/eval-set.jsonl \
                  --column-mapping question='${data.question}' \
                    conversation_history='${data.conversation_history}' \
                  --stream \
                  --name "${RUN_NAME}" \
                  --runtime automatic \
                  --workspace-name $(workspaceName) \
                  --resource-group $(resourceGroup)

                EVAL_NAME="judge-${RUN_NAME}"
                pf run create \
                  --flow eval/eval-flow \
                  --data eval/eval-set.jsonl \
                  --column-mapping question='${data.question}' \
                    gold_answer='${data.gold_answer}' \
                    predicted_answer='${run.outputs.answer}' \
                    predicted_citations='${run.outputs.citations}' \
                    expected_citations='${data.expected_citations}' \
                    class='${data.class}' \
                  --run "${RUN_NAME}" \
                  --name "${EVAL_NAME}" \
                  --stream \
                  --workspace-name $(workspaceName) \
                  --resource-group $(resourceGroup)

                pf run show-metrics --name "${EVAL_NAME}" \
                  --workspace-name $(workspaceName) \
                  --resource-group $(resourceGroup) \
                  > metrics.json

                python eval/aggregate_by_class.py metrics.json > per-class.json
                python eval/gate.py per-class.json
          - publish: per-class.json
            artifact: eval-metrics
          - script: |
              python tools/emit_to_appinsights.py \
                --metrics per-class.json \
                --instrumentation-key $(APPINSIGHTS_KEY) \
                --build-id $(Build.BuildId)
            displayName: 'Emit metrics to App Insights'

  - stage: Deploy
    displayName: 'Create deployment (no traffic)'
    dependsOn: Eval
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - deployment: CreateDeployment
        environment: assistant-prod
        strategy:
          runOnce:
            deploy:
              steps:
                - download: current
                  artifact: flow-build
                - task: AzureCLI@2
                  displayName: 'az ml online-deployment create'
                  inputs:
                    azureSubscription: $(serviceConnection)
                    scriptType: bash
                    scriptLocation: inlineScript
                    inlineScript: |
                      set -euo pipefail
                      az extension add -n ml -y

                      PREV_DEP=$(az ml online-endpoint show \
                        --name $(endpointName) \
                        --workspace-name $(workspaceName) \
                        --resource-group $(resourceGroup) \
                        --query 'traffic' -o json \
                        | jq -r 'to_entries | map(select(.value > 0))[0].key')
                      echo "##vso[task.setvariable variable=prevDeployment;isOutput=true]${PREV_DEP}"

                      DEP_NAME="d-$(Build.BuildId)"
                      az ml online-deployment create \
                        --endpoint-name $(endpointName) \
                        --name "${DEP_NAME}" \
                        --workspace-name $(workspaceName) \
                        --resource-group $(resourceGroup) \
                        --file deploy/online-deployment.yml \
                        --set name="${DEP_NAME}" \
                        --set environment_variables.PROMPTFLOW_BUILD_ID=$(Build.BuildId)

                      echo "##vso[task.setvariable variable=newDeployment;isOutput=true]${DEP_NAME}"
                  name: deployStep

  - stage: Canary
    displayName: 'Canary 10% for 20 minutes'
    dependsOn: Deploy
    jobs:
      - job: ShiftCanary
        variables:
          prevDeployment: $[ stageDependencies.Deploy.CreateDeployment.outputs['CreateDeployment.deployStep.prevDeployment'] ]
          newDeployment:  $[ stageDependencies.Deploy.CreateDeployment.outputs['CreateDeployment.deployStep.newDeployment'] ]
        steps:
          - task: AzureCLI@2
            displayName: 'Shift 10% to new deployment'
            inputs:
              azureSubscription: $(serviceConnection)
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                az ml online-endpoint update \
                  --name $(endpointName) \
                  --workspace-name $(workspaceName) \
                  --resource-group $(resourceGroup) \
                  --traffic "$(prevDeployment)=90 $(newDeployment)=10"
          - task: AzureCLI@2
            displayName: 'Run canary rubric watcher for 20m'
            inputs:
              azureSubscription: $(serviceConnection)
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                python tools/canary_watcher.py \
                  --endpoint $(endpointName) \
                  --deployment $(newDeployment) \
                  --window-minutes 20 \
                  --rubric-threshold 2.85 \
                  --citation-threshold 0.92 \
                  --on-fail-traffic "$(prevDeployment)=100 $(newDeployment)=0" \
                  --workspace $(workspaceName) \
                  --resource-group $(resourceGroup)
          - task: AzureCLI@2
            displayName: 'Promote canary to 100%'
            inputs:
              azureSubscription: $(serviceConnection)
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                az ml online-endpoint update \
                  --name $(endpointName) \
                  --workspace-name $(workspaceName) \
                  --resource-group $(resourceGroup) \
                  --traffic "$(newDeployment)=100 $(prevDeployment)=0"

The flow of variables between stages is the load-bearing detail. prevDeployment and newDeployment are emitted from the Deploy stage as outputs, picked up in the Canary stage via stageDependencies. The canary watcher script knows both names and writes the rollback traffic split (prev=100 new=0) directly via az ml online-endpoint update if its rolling rubric score crosses the threshold for three consecutive windows. The 17:14 rollback was that exact command being invoked from canary_watcher.py after the third window failed at 17:14:03.

The endpoint and deployment, in Bicep

We do not create the endpoint from the pipeline. The endpoint and its first deployment are infrastructure, created via Bicep on the platform pipeline. The application pipeline only creates new deployments and shifts traffic. The reason matters: if the application pipeline could delete the endpoint, a bad merge could take production down at the URL level instead of the model level. Endpoint URL stability is its own concern.

// infra/online-endpoint.bicep
param workspaceName string
param endpointName string
param location string = resourceGroup().location

resource endpoint 'Microsoft.MachineLearningServices/workspaces/onlineEndpoints@2024-04-01' = {
  name: '${workspaceName}/${endpointName}'
  location: location
  identity: { type: 'SystemAssigned' }
  properties: {
    authMode: 'AADToken'
    publicNetworkAccess: 'Disabled'
    description: 'Support assistant scoring endpoint'
    traffic: {}
  }
}

resource initialDeployment 'Microsoft.MachineLearningServices/workspaces/onlineEndpoints/deployments@2024-04-01' = {
  name: '${workspaceName}/${endpointName}/d-bootstrap'
  location: location
  sku: { name: 'Standard_DS3_v2', capacity: 2 }
  properties: {
    endpointComputeType: 'Managed'
    instanceType: 'Standard_DS3_v2'
    scaleSettings: { scaleType: 'Default' }
    requestSettings: {
      requestTimeout: 'PT60S'
      maxQueueWait: 'PT10S'
      maxConcurrentRequestsPerInstance: 4
    }
    livenessProbe: {
      initialDelay: 'PT30S'
      period: 'PT10S'
      timeout: 'PT5S'
      failureThreshold: 30
      successThreshold: 1
    }
    readinessProbe: {
      initialDelay: 'PT30S'
      period: 'PT10S'
      timeout: 'PT5S'
      failureThreshold: 6
      successThreshold: 1
    }
  }
  dependsOn: [ endpoint ]
}

And the deployment file the pipeline uses on every release, which references a Prompt Flow build artefact rather than a raw model:

# deploy/online-deployment.yml
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
endpoint_name: support-assistant-prod
model:
  path: ../flows/support-assistant/build
environment:
  image: mcr.microsoft.com/azureml/promptflow/promptflow-runtime:latest
  inference_config:
    liveness_route: { port: 8080, path: /health }
    readiness_route: { port: 8080, path: /health }
    scoring_route: { port: 8080, path: /score }
instance_type: Standard_DS3_v2
instance_count: 2
request_settings:
  request_timeout_ms: 60000
  max_concurrent_requests_per_instance: 4
liveness_probe:
  initial_delay: 30
  period: 10
  timeout: 5
  failure_threshold: 30
  success_threshold: 1
environment_variables:
  PROMPTFLOW_WORKER_NUM: '8'
  PROMPTFLOW_WORKER_THREADS: '4'
  PRT_CONFIG_OVERRIDE: 'deployment.subscription_id=...'

The endpoint type is Managed Online Endpoint, not a Container Apps deploy, not an ACI deploy, not a custom AKS deploy. We considered Container Apps. We had two reasons not to. First, the scoring API contract and the MLflow logging plumbing come for free with managed online endpoints, and we already had MLflow tracing the eval runs end to end. Second, the blue/green traffic split is a first-class endpoint property, not something we have to build by stitching together front doors and label selectors. The traffic mechanism is one CLI call. Rolling it back is the same call with the percentages reversed. That property is what made the 17:14 rollback a one-line operation.
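The symmetry is worth seeing in isolation: both directions are the same command with the percentages swapped. The deployment names are the ones from the timeline below.

# shift 10% of traffic to the new deployment
az ml online-endpoint update \
  --name support-assistant-prod \
  --workspace-name mlw-assistant-prod \
  --resource-group rg-assistant-prod-eus2 \
  --traffic "d-1427=90 d-1428=10"

# roll back: same command, percentages reversed
az ml online-endpoint update \
  --name support-assistant-prod \
  --workspace-name mlw-assistant-prod \
  --resource-group rg-assistant-prod-eus2 \
  --traffic "d-1427=100 d-1428=0"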

The 17:14 timeline, with the metrics

17:02:11  PR #1428 merged to main
17:02:13  Build stage queued
17:04:50  pf flow build (docker) ok, artefact 412 MB
17:04:55  Eval stage queued
17:08:32  pf run create finished, 95 questions answered
17:09:14  judge-run finished, metrics emitted
17:09:15  gate.py exited 0, all per-class metrics above threshold
17:09:40  Deploy stage queued
17:11:28  d-1428 created, capacity 2 instances
17:11:32  traffic shifted: d-1427=90 d-1428=10
17:11:32  canary_watcher.py started, polling 60s windows
17:12:32  window 1: multi_hop groundedness 2.78 (n=4, threshold 2.85) FAIL
17:13:32  window 2: multi_hop groundedness 2.71 (n=7, threshold 2.85) FAIL
17:14:03  window 3: multi_hop groundedness 2.66 (n=11, threshold 2.85) FAIL
17:14:03  rollback issued: az ml online-endpoint update --traffic d-1427=100 d-1428=0
17:14:06  endpoint traffic now d-1427=100, on-call paged
17:14:07  pipeline marked Canary stage as Failed, did not promote

The eval gate at 17:09:15 was clean. That is by design, and it is also the most interesting failure mode of the whole system: the eval set is fixed and hand-written, and a prompt change can be fine for all 95 eval questions and broken for the 96th. The PR in question was a tweak to the answer_drafter's prompt that strengthened its instructions to "be concise." The eval set's multi-hop questions all happen to ask for explanations that benefit from a longer answer, and the new prompt produced answers so short they no longer carried the cross-document context. The eval scored them fine anyway, because the gold answers in the eval set were also fairly short. Live traffic carried a steady stream of multi-hop questions that were variants of the eval examples, and on those the abbreviated answers dropped groundedness (and citation correctness with it) from "good" to "bad enough to trip the canary watcher."

The immediate fix was not to expand the eval set on the spot; it was to keep the rollback and ship a follow-up PR that softened the prompt change, with two new multi-hop questions added to the eval set that mirrored the production examples that had failed. On the next pipeline run the gate stayed clean, the canary held, and traffic promoted at 09:34 the following Monday morning.

The canary watcher, in detail

The watcher is the thing that turned this from "we run an eval before deploy" into "we are protected during deploy." Its loop is small:

# tools/canary_watcher.py
import time, json, subprocess, sys
from datetime import datetime, timedelta

def fetch_canary_metrics(endpoint, deployment, lookback_minutes):
    out = subprocess.check_output([
        "az", "monitor", "app-insights", "query",
        "--app", "appi-assistant-prod",
        "--analytics-query", f"""
            customMetrics
            | where timestamp > ago({lookback_minutes}m)
            | where customDimensions.endpoint == '{endpoint}'
              and customDimensions.deployment == '{deployment}'
              and (name == 'rubric.groundedness'
                   or name == 'rubric.citation_correctness'
                   or name == 'rubric.class')
            | summarize
                groundedness = avg(todouble(iff(name=='rubric.groundedness', value, real(null)))),
                citation = avg(todouble(iff(name=='rubric.citation_correctness', value, real(null)))),
                sample_count = count() by tostring(customDimensions['class'])
        """,
    ])
    return json.loads(out)["tables"][0]
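
# pick_worst_per_class, rollback and page_oncall are helpers defined further down
# in this file (omitted here); rollback re-issues the --on-fail-traffic split via
# `az ml online-endpoint update`, exactly as the pipeline passes it in.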

def main(args):
    end = datetime.utcnow() + timedelta(minutes=args.window_minutes)
    consecutive_failures = 0
    while datetime.utcnow() < end:
        metrics = fetch_canary_metrics(args.endpoint, args.deployment, 2)
        worst = pick_worst_per_class(metrics, args)
        if worst.groundedness < args.rubric_threshold or worst.citation < args.citation_threshold:
            consecutive_failures += 1
            print(f"window fail: {worst} (consecutive {consecutive_failures})")
            if consecutive_failures >= 3:
                rollback(args)
                page_oncall(args)
                sys.exit(2)
        else:
            consecutive_failures = 0
        time.sleep(60)
    print("canary window completed cleanly")

There are two non-obvious choices here. First, the rubric scoring on live canary traffic uses the same judge flow as the offline eval, but sampled at a rate that keeps gpt-4o cost bounded. The sampler picks one in five canary requests, runs them through the judge asynchronously, and emits the resulting metrics back into App Insights tagged with the deployment id. The watcher reads those metrics, not raw scoring latency or HTTP error rates. Live rubric on live traffic is what makes the canary signal compatible with the eval gate signal: same scale, same anchors, same axis definitions.
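A sketch of what that sampler amounts to, with two loud assumptions: it uses the applicationinsights Python SDK to emit the customMetrics the watcher queries, and judge_one() is a stand-in for invoking the same judge flow. The real sampler also runs the judge asynchronously, which is elided here.

# tools/canary_sampler.py (sketch; judge_one() and the class label handling are illustrative)
import random

from applicationinsights import TelemetryClient

SAMPLE_RATE = 0.2                                  # one in five canary requests
tc = TelemetryClient("<appi-assistant-prod instrumentation key>")

def judge_one(question: str, answer: dict) -> dict:
    """Stand-in for calling the same rubric judge flow used by the offline eval."""
    return {"class": "multi_hop", "groundedness": 3.0, "citation_correctness": 1.0}

def maybe_judge(question: str, answer: dict, endpoint: str, deployment: str) -> None:
    if random.random() > SAMPLE_RATE:
        return
    scores = judge_one(question, answer)
    dims = {"endpoint": endpoint, "deployment": deployment, "class": scores["class"]}
    # metric names and dimensions match the KQL query in canary_watcher.py
    tc.track_metric("rubric.groundedness", scores["groundedness"], properties=dims)
    tc.track_metric("rubric.citation_correctness", scores["citation_correctness"], properties=dims)
    tc.flush()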

Second, the watcher requires three consecutive failed windows before triggering the rollback. A single window can fail because of a thin sample (we have seen multi_hop with n=2 in a 60-second window during quiet hours). Three windows in a row is the cheapest filter against false positives, and across nine months of running canaries we have one true positive (the 17:14 rollback) and zero false positives. The false-negative rate is harder to measure; we know of two PRs that shipped to 100% with regressions that the canary did not catch, both of which were in classes the canary samples too rarely to be confident on (edge and ambiguous). For those we lean on offline eval more heavily and the human review channel from the support team.

Troubleshooting

pf run create: ConnectionNotFound for connection 'azure-openai-prod' from the pipeline almost always means the pf CLI on the build agent does not see the Azure ML workspace's connections because --workspace-name and --resource-group were not passed, or because the service principal does not have AzureML Data Scientist on the workspace. Fix: add the role at the workspace scope to the pipeline SP, and pass workspace flags on every pf run call.

OperationFailed: Deployment 'd-1428' transitioned to state Unhealthy during the Deploy stage was the first symptom of a Docker image that exceeded the 8 GB Managed Online Endpoint image limit. We had baked unnecessary cache directories into the build. Fix: pf flow build --format docker followed by a multi-stage Dockerfile that copies only the flow and the runtime, not the entire ~/.cache/pip. Image size went from 9.4 GB to 1.8 GB and the deployment reached Succeeded in 110 seconds.

MLflow model logging failed: artifact too large during the Build stage means the flow's MLmodel directory has expanded past the workspace's storage limits. The single biggest contributor we saw was a vendored copy of a multilingual tokeniser inside the flow source tree. Fix: ship the tokeniser from the runtime image, not the flow artefact, and add a .amlignore to the flow root.

Forbidden: client does not have permission to perform action 'Microsoft.MachineLearningServices/workspaces/onlineEndpoints/score/action' from the front-end calling the endpoint with an AAD token means the front-end's identity does not have AzureML Data Scientist on the endpoint, only on the workspace. Endpoint-scoped role assignments are separate from workspace-scoped ones, which surprised us the first time. Fix: assign the role at the endpoint scope to the front-end's managed identity.
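The concrete fix is one role assignment whose --scope is the endpoint's own resource id rather than the workspace's. The managed identity object id below is a placeholder.

# assumes the front-end's managed identity object id is in $FRONTEND_MI_OBJECT_ID
ENDPOINT_ID=$(az ml online-endpoint show \
  --name support-assistant-prod \
  --workspace-name mlw-assistant-prod \
  --resource-group rg-assistant-prod-eus2 \
  --query id -o tsv)

az role assignment create \
  --assignee-object-id "$FRONTEND_MI_OBJECT_ID" \
  --assignee-principal-type ServicePrincipal \
  --role "AzureML Data Scientist" \
  --scope "$ENDPOINT_ID"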

429 ContentSafetyTooManyRequests from inside the safety_check node was the gotcha that took the longest to diagnose. The Content Safety resource we had connected was on the cheaper SKU with a 10 RPS limit, and the eval run hit it with bursts well above that during the parallel pf run. The eval would intermittently fail the gate not because the rubric was bad but because some 95-question runs had eight or nine flow failures inside them, which the rubric judge then scored as 0/0/0 because there was no answer to grade. Two fixes shipped together: per-node retry with exponential backoff inside the safety check (so transient 429s are absorbed at the node level), and a new pipeline metric called eval_ran_cleanly that asserts all 95 flow rows produced an answer before the gate is allowed to evaluate. If eval_ran_cleanly is false, the pipeline fails with eval-infra-error rather than eval-regression, and the on-call sees a different page than "the model got worse."
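The retry half of that fix is plain exponential backoff around the Content Safety call. A sketch of the shape, assuming the node surfaces the 429 as azure.core's HttpResponseError (which is how the azure-ai-contentsafety client raises it):

# nodes/safety_check.py (retry sketch; the Content Safety call itself is passed in as `call`)
import random
import time

from azure.core.exceptions import HttpResponseError

def call_with_backoff(call, max_attempts=5, base_delay=1.0):
    """Retry transient 429s with exponential backoff plus jitter; re-raise anything else."""
    for attempt in range(max_attempts):
        try:
            return call()
        except HttpResponseError as err:
            if err.status_code != 429 or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))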

InvalidColumnMapping: Column 'gold_answer' was not found during pf run create --run ... for the eval flow means the eval flow is being passed the wrong upstream run's outputs. The fix is to verify that the upstream pf run exists and exposes outputs.answer and outputs.citations, and that the column-mapping references ${run.outputs.answer} (not ${data.answer}).

UserError: The endpoint is in 'Updating' state. Retry after the operation completes. happens when two deploys race. The Canary stage's az ml online-endpoint update --traffic ... will return this if a previous deploy is still mid-update. We added a --wait flag wrapper and a short polling loop in the pipeline scripts; with that, the error has not recurred.
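The polling loop is a few lines of bash around az ml online-endpoint show. The 30-second interval and the ten-minute cap are our choices, not anything the CLI mandates:

# wait for the endpoint to leave the Updating state before touching traffic
for i in $(seq 1 20); do
  STATE=$(az ml online-endpoint show \
    --name support-assistant-prod \
    --workspace-name mlw-assistant-prod \
    --resource-group rg-assistant-prod-eus2 \
    --query provisioning_state -o tsv)
  [ "$STATE" != "Updating" ] && break
  echo "endpoint provisioning_state=${STATE}, waiting (${i}/20)..."
  sleep 30
done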

Where this ended up

Counted at the end of nine months on this pattern: 64 prompt PRs merged to main, 11 of those caught by the eval gate before they ever reached prod, 1 canary rollback (the 17:14 one), 0 incidents in which a prompt regression reached 100% of traffic. The median time from a prompt PR being opened to it serving 100% of prod is 4 weeks, which is slower than it would be without the gate, and faster than the previous regime where every prompt change required a meeting and a human approval and a calendared release window. The new bottleneck is the eval set itself: any time a class of question gains importance in production that the eval set under-represents, we have to expand the set and re-baseline the thresholds, which is a half-day of support-engineer time per expansion. We have done it three times.

There is a softer payoff that mattered to the team, and I want to name it. Before this pipeline existed, the engineer who wrote a prompt change carried the risk of that change personally. The change would ship, the support team would notice issues a day later, and the engineer would spend a week chasing down what specifically had regressed. After the pipeline, the eval gate either caught the regression (in which case the PR was re-worked on the engineer's own desk before merge) or it did not, in which case the canary caught it within minutes, the rollback happened automatically, and the engineer's name was attached to a clean pipeline run log and a clear failure reason. The work of "prompt engineering" became less personally fraught because the failure modes were defined and the gates were not arbitrary. That, more than the metrics, is what made the team willing to ship prompt changes weekly.

The 17:14 rollback, when I look back at it, is the system working in exactly the way we promised the team it would. The engineer who shipped the change was at his desk for nine more minutes that Friday, which he spent reading the failure report and writing the follow-up PR. He went home at his normal time. The next deploy went out on Monday morning. The two new multi-hop eval questions are still in the set.