Canary releases on AKS with Argo Rollouts and Azure Pipelines: auto-promoting on SLOs
The canary held at 10 percent for 47 minutes. The on-call engineer slept through the 03:11 rollback because the system did not need waking. This is how every piece of that machine was wired.
The canary held at 10% for 47 minutes. The build was checkout-api:2024.10.07-1742, deployed at 02:23 UTC on a Tuesday morning. The pipeline run number was 9341. The on-call engineer was asleep. The Rollout sat on step 3 of 14, traffic split 90/10 between the stable ReplicaSet and the canary, and Argo Rollouts kept re-running the analysis every 60 seconds. P99 latency on the new pods crept from 180ms in the first measurement window to 240ms by the third, and then sat there. Error rate stayed flat at 0.2%. Nothing looked broken in the obvious way, which is the worst kind of canary because the operator does not get a clean signal either way.
At 03:11 UTC, the latency analysis crossed its failureCondition for the fourth consecutive measurement and the AnalysisRun returned Failed. The Rollout aborted automatically. Argo Rollouts scaled the canary ReplicaSet back to zero, removed the canary weight from the ingress, and the pipeline run in Azure DevOps that had been blocked on kubectl argo rollouts status --watch returned exit code 1. A Teams notification went out. The on-call engineer's phone did not ring because we had decided, deliberately, that an automatic rollback at 03:11 does not need a human in the loop. The system handled itself.
The next morning the engineer ran kubectl argo rollouts get rollout checkout and saw the full timeline. The post-mortem took twenty minutes. A regex compile in the request validation path ran cold on the first request to each pod and stayed hot afterwards. P99 looked bad because each new canary pod served a handful of those cold requests before warming up. A two-line change to pre-compile the regex at startup fixed it. The next rollout, two days later, walked through all fourteen steps in 34 minutes with both analyses green throughout.
This is the architecture that made that 03:11 abort possible without anyone watching: Argo Rollouts on AKS, Azure Monitor managed Prometheus for the metrics, and Azure Pipelines as the trigger that ships the canary and waits on it. Eleven services run through this pattern at the time of writing. Across roughly 400 production rollouts in the last six months, eight have aborted on their own. Seven of the eight aborts caught a real regression. The one false positive was a noisy PromQL query during a low-traffic window, which I will get to further down.
Why Argo Rollouts, not Flagger
The team had been running Argo CD for two years before any of this. Application manifests sat in Git, Argo CD reconciled them onto the AKS clusters, and the platform team was fluent in the CRD model. The natural extension of that was Argo Rollouts: the same project family, the same kubectl argo rollouts CLI, the same dashboard URL. Flagger is a fine tool and several teams I respect use it heavily, but the operational cost of introducing a second progressive delivery controller alongside Argo CD was not one we were willing to pay. One control plane, one set of CRDs, one mental model.
The other reason is that Argo Rollouts decouples the traffic-shifting layer from the controller. We can drive weight shifts via the NGINX ingress controller's canary annotations, via Service Mesh Interface (SMI), or via Istio's VirtualService. We started on the NGINX annotations because that is what the cluster already had, and the migration path to Istio later is documented and additive.
The Argo Rollouts documentation lives at argoproj.io and the canary strategy reference is at argo-rollouts.readthedocs.io. The AKS setup follows the pattern documented on Microsoft Learn, with the Azure Monitor managed Prometheus add-on enabled for metrics.
The cluster shape
The AKS cluster runs three node pools: a system pool for the controllers (Argo CD, Argo Rollouts, ingress-nginx, kube-state-metrics), an apps pool for the workloads, and a spot pool for batch jobs that is irrelevant here. Managed Prometheus scrapes both the application pods and the Argo Rollouts controller itself, which gives visibility into the controller's reconciliation latency separately from the application's request latency.
Every pod that participates in a Rollout exposes a /metrics endpoint at port 9102 with two histograms: http_request_duration_seconds (with le buckets at 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10) and http_requests_total (with a status label for 2xx/4xx/5xx). The bucketing matters because the histogram quantile estimate is only as good as the bucket boundaries you give it. Too few buckets at the relevant percentile and your p99 estimate jitters by 50ms on noise alone.
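For concreteness, this is the shape of what such an endpoint exposes, in Prometheus exposition format. The values here are illustrative, not from the real service, and only three of the ten le buckets are shown:

# duration histogram (three of ten buckets shown) plus the status counter
http_request_duration_seconds_bucket{le="0.1"} 48712
http_request_duration_seconds_bucket{le="0.25"} 50033
http_request_duration_seconds_bucket{le="+Inf"} 50210
http_request_duration_seconds_sum 3891.4
http_request_duration_seconds_count 50210
http_requests_total{status="2xx"} 49987
http_requests_total{status="5xx"} 101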
The Rollout resource
The deployment is shaped as an Argo Rollouts Rollout CRD, not a vanilla Deployment. Same container, same env vars, same PodSpec. The differences are in the top-level kind and the strategy block.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
  namespace: checkout
spec:
  replicas: 10
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: acrcheckoutprod.azurecr.io/checkout-api:PLACEHOLDER
          ports:
            - name: http
              containerPort: 8080
            - name: metrics
              containerPort: 9102
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 1000m
              memory: 512Mi
  strategy:
    canary:
      canaryService: checkout-canary
      stableService: checkout-stable
      trafficRouting:
        nginx:
          stableIngress: checkout
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: error-rate-checkout
              - templateName: p99-latency-checkout
            args:
              - name: service-name
                value: checkout
        - setWeight: 20
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: error-rate-checkout
              - templateName: p99-latency-checkout
            args:
              - name: service-name
                value: checkout
        - setWeight: 40
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: error-rate-checkout
              - templateName: p99-latency-checkout
            args:
              - name: service-name
                value: checkout
        - setWeight: 60
        - pause: { duration: 5m }
        - setWeight: 80
        - pause: { duration: 3m }
        - setWeight: 100
The ladder is intentionally front-loaded. Most failures show up between 10% and 40% traffic, which is where the analysis runs are densest. By 60% the canary has effectively become the stable service from the metrics' point of view, and the remaining steps exist mostly so the rollout does not slam from 60% to 100% in one move. Median wall time across the eleven services on this pattern is 31 minutes.
canaryService and stableService are two Service objects that the controller manages. It updates each service's selector to point at the appropriate ReplicaSet's pod-template-hash, so traffic on checkout-canary hits only canary pods, and likewise for checkout-stable. This is the trick that lets us run PromQL against either set independently.
apiVersion: v1
kind: Service
metadata:
  name: checkout-stable
  namespace: checkout
spec:
  selector: { app: checkout }
  ports: [{ name: http, port: 80, targetPort: 8080 }]
---
apiVersion: v1
kind: Service
metadata:
  name: checkout-canary
  namespace: checkout
spec:
  selector: { app: checkout }
  ports: [{ name: http, port: 80, targetPort: 8080 }]
Both services share the same selector. Argo Rollouts injects the rollouts-pod-template-hash label into each Service's selector dynamically. You do not write that label yourself. If you ever see both services routing to all pods regardless of revision, the controller has lost its grip on the Service objects, usually because someone edited them by hand and broke the annotation that lets the controller know it owns them.
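That independence is what the analysis queries later in this piece rely on. Assuming the scrape config maps the Kubernetes Service name onto a service label (our setup does; yours may label differently), the same p99 query can be pointed at either revision:

# canary p99
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="checkout-canary"}[2m])) by (le)
)
# stable p99, for side-by-side comparison
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="checkout-stable"}[2m])) by (le)
)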
Traffic shifting through NGINX
The traffic split is done at the ingress, not in-cluster. You only write the stable Ingress; Argo Rollouts manages the canary one.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: checkout
  namespace: checkout
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: 2m
spec:
  ingressClassName: nginx
  rules:
    - host: checkout.platform.contoso.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: checkout-stable
                port: { number: 80 }
  tls:
    - hosts: [checkout.platform.contoso.com]
      secretName: checkout-tls
When a Rollout's canary step engages, the controller creates a sibling Ingress named checkout-checkout-canary with nginx.ingress.kubernetes.io/canary: "true" and nginx.ingress.kubernetes.io/canary-weight: "10". As the steps progress, the controller updates the weight. At the end of a successful rollout, the canary Ingress is deleted and the stable Ingress points at the new pod-template-hash. There is no leftover state to clean up.
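Reconstructed for illustration, the generated Ingress looks roughly like this mid-rollout; the name and annotations come from the controller, so treat the exact details as approximate:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: checkout-checkout-canary    # controller-generated, never hand-edited
  namespace: checkout
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"   # tracks the current step
spec:
  ingressClassName: nginx
  rules:
    - host: checkout.platform.contoso.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: checkout-canary
                port: { number: 80 }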
The advantage of doing the split at the ingress is that no application code or sidecars are involved. The disadvantage is that the routing decision happens at the cluster edge, so traffic from one microservice to another stays on the stable service unless that caller has its own Rollout shaping traffic separately. For edge-facing services that is what we want; for internal service-to-service traffic we use a different pattern built on SMI TrafficSplit resources, sketched below.
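For those internal services only the provider block of the strategy changes. A sketch of the SMI variant, assuming an SMI-compatible mesh is installed in the cluster (the TrafficSplit name here is illustrative):

strategy:
  canary:
    trafficRouting:
      smi:
        trafficSplitName: checkout-split  # optional; generated if omitted
        rootService: checkout             # the Service internal callers address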
The AnalysisTemplate resources
Two templates per Rollout. Each is a PromQL query plus a success condition, a failure condition, a sampling cadence, and a tolerance for failed measurements before abort. They live in the same namespace as the Rollout.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-checkout
  namespace: checkout
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 60s
      count: 5
      successCondition: result[0] < 0.01
      failureCondition: result[0] >= 0.02
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}-canary",
              status=~"5.."
            }[2m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service-name}}-canary"
            }[2m]))
result is a list because PromQL instant queries return a vector; the analysis reads the first element, result[0]. successCondition: result[0] < 0.01 means "less than 1% error rate." failureCondition: result[0] >= 0.02 means "this measurement is actively bad once the error rate hits 2%." failureLimit: 3 means the run is marked Failed once more than three measurements have failed, i.e. on the fourth bad one. interval: 60s is the sampling cadence, and count: 5 is the number of measurements in a run.
The split between successCondition and failureCondition is important. The success condition is what counts as "this measurement was healthy." The failure condition is what counts as "this measurement was actively unhealthy." A measurement that satisfies neither (between 1% and 2% error rate) is inconclusive; the analysis keeps trying. We tuned these thresholds per service based on three weeks of baseline traffic before turning analysis on.
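The baseline numbers came from queries along these lines against the stable service. A sketch using a PromQL subquery, where the 21d range matches the three-week window:

# worst 5-minute p99 observed over three weeks of stable traffic
max_over_time(
  (histogram_quantile(0.99,
    sum(rate(http_request_duration_seconds_bucket{service="checkout-stable"}[5m])) by (le)
  ))[21d:5m]
)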
The latency template is the one that fired at 03:11 in the opening scene.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: p99-latency-checkout
  namespace: checkout
spec:
  args:
    - name: service-name
  metrics:
    - name: p99-latency
      interval: 60s
      count: 5
      successCondition: result[0] < 0.220
      failureCondition: result[0] >= 0.250
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                service="{{args.service-name}}-canary"
              }[2m])) by (le)
            )
220ms is the SLO floor; 250ms is the explicit fail line. failureLimit: 3 means the controller tolerates three failed measurements and aborts on the fourth, which is exactly what the abort message later spells out as failed (4) > failureLimit (3). Four failed measurements is roughly four minutes at the 60-second interval, which gives the system time to absorb a single noisy minute without panicking. In the opening scene, the analysis ran 22 measurements over the 47-minute hold: 14 were healthy, 4 were inconclusive, and the final 4 were the failures that pushed it over.
The reason the analysis stopped the rollout from advancing to 20% during those 47 minutes is that step 3 in the strategy is the analysis step itself, and an analysis run only returns success once it has accumulated its count of passing measurements. Argo Rollouts holds the rollout at the current weight while the analysis is in flight. The 47 minutes are not a bug; they are the controller waiting patiently for the analysis to decide. If the analysis had returned success at any point in that window, the rollout would have moved to step 4 (setWeight: 20).
The pipeline that ships it
The pipeline lives in Azure DevOps. The build stage produces the container image and pushes it to ACR. The deploy stage substitutes the image tag into the Rollout YAML, applies it to the cluster, and waits for the Rollout to reach Healthy. The wait is what turns a fire-and-forget deploy into a gated one. Workload identity federation handles the AKS authentication; assume kubectl is wired up via AzureCLI@2 with a federated service connection. The deployment job documentation on Microsoft Learn covers the broader pattern.
trigger:
  branches:
    include: [main]
  paths:
    include: [services/checkout/**]

variables:
  serviceConnection: 'sc-aks-checkout-prod'
  acrName: 'acrcheckoutprod'
  imageRepo: 'checkout-api'
  aksResourceGroup: 'rg-aks-checkout-prod'
  aksClusterName: 'aks-checkout-prod'
  rolloutName: 'checkout'
  rolloutNamespace: 'checkout'

stages:
  - stage: Build
    jobs:
      - job: BuildAndPush
        pool:
          vmImage: ubuntu-latest
        steps:
          - task: AzureCLI@2
            name: build
            displayName: 'docker build and push'
            inputs:
              azureSubscription: $(serviceConnection)
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                set -euo pipefail
                # $(Build.BuildId) is expanded by Azure DevOps before bash runs;
                # $(date ...) is left alone and runs as bash command substitution
                TAG="$(date -u +%Y.%m.%d)-$(Build.BuildId)"
                echo "##vso[task.setvariable variable=imageTag;isOutput=true]$TAG"
                az acr login --name $(acrName)
                docker build -t $(acrName).azurecr.io/$(imageRepo):$TAG ./services/checkout
                docker push $(acrName).azurecr.io/$(imageRepo):$TAG

  - stage: DeployCanary
    dependsOn: Build
    variables:
      imageTag: $[ stageDependencies.Build.BuildAndPush.outputs['build.imageTag'] ]
    jobs:
      - deployment: ApplyRollout
        environment: aks-checkout-prod
        pool:
          vmImage: ubuntu-latest
        strategy:
          runOnce:
            deploy:
              steps:
                - checkout: self
                - task: AzureCLI@2
                  displayName: 'get aks credentials'
                  inputs:
                    azureSubscription: $(serviceConnection)
                    scriptType: bash
                    scriptLocation: inlineScript
                    inlineScript: |
                      az aks get-credentials \
                        --resource-group $(aksResourceGroup) \
                        --name $(aksClusterName) \
                        --overwrite-existing
                      kubectl version --client
                - task: AzureCLI@2
                  displayName: 'apply rollout manifest'
                  inputs:
                    azureSubscription: $(serviceConnection)
                    scriptType: bash
                    scriptLocation: inlineScript
                    inlineScript: |
                      set -euo pipefail
                      sed -i "s#PLACEHOLDER#$(imageTag)#g" \
                        ./services/checkout/k8s/rollout.yaml
                      kubectl apply \
                        -n $(rolloutNamespace) \
                        -f ./services/checkout/k8s/rollout.yaml
                - task: AzureCLI@2
                  displayName: 'wait for rollout to be healthy'
                  timeoutInMinutes: 45
                  inputs:
                    azureSubscription: $(serviceConnection)
                    scriptType: bash
                    scriptLocation: inlineScript
                    inlineScript: |
                      set -euo pipefail
                      curl -sSL -o /usr/local/bin/kubectl-argo-rollouts \
                        https://github.com/argoproj/argo-rollouts/releases/download/v1.7.2/kubectl-argo-rollouts-linux-amd64
                      chmod +x /usr/local/bin/kubectl-argo-rollouts
                      kubectl argo rollouts get rollout $(rolloutName) \
                        -n $(rolloutNamespace)
                      kubectl argo rollouts status $(rolloutName) \
                        -n $(rolloutNamespace) \
                        --watch \
                        --timeout 30m
The kubectl argo rollouts status --watch --timeout 30m command is the gate. It blocks until the Rollout reaches a terminal state. If the Rollout aborts because an analysis failed, the command exits non-zero and the pipeline run is marked failed. If the Rollout completes all fourteen steps, the command exits zero and the pipeline goes green.
A subtle but important property: when an analysis fails and the Rollout self-aborts, the cluster is already in a safe state by the time the pipeline notices. Argo Rollouts has scaled the canary back to zero and removed the canary Ingress. The pipeline failure is a notification; it is not what causes the rollback. If the pipeline runner itself died mid-deploy, the Rollout would still complete or abort on its own based on the analysis outcome.
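The Teams message from the opening scene hangs off that exit code. A sketch of the step we append to the deploy job, assuming an incoming-webhook URL stored as a secret pipeline variable named teamsWebhookUrl (the variable name and message shape are illustrative):

- task: Bash@3
  displayName: 'notify Teams on rollout abort'
  condition: failed()            # runs only when an earlier step exited non-zero
  env:
    WEBHOOK: $(teamsWebhookUrl)  # secrets must be mapped into the environment explicitly
  inputs:
    targetType: inline
    script: |
      curl -sS -X POST -H 'Content-Type: application/json' \
        -d '{"text": "Rollout $(rolloutName) aborted in run $(Build.BuildId); cluster already rolled back."}' \
        "$WEBHOOK"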
The 47-minute hold, replayed in CLI
The morning after the abort, the engineer ran this and got the full timeline.
kubectl argo rollouts get rollout checkout -n checkout

Name:            checkout
Namespace:       checkout
Status:          ✖ Degraded
Message:         RolloutAborted: metric "p99-latency" assessed Failed due to failed (4) > failureLimit (3)
Strategy:        Canary
  Step:          3/14
  SetWeight:     10
  ActualWeight:  0
Images:          acrcheckoutprod.azurecr.io/checkout-api:2024.10.07-1742 (canary)
                 acrcheckoutprod.azurecr.io/checkout-api:2024.10.05-1109 (stable)
Replicas:
  Desired:       10
  Current:       10
  Updated:       0
  Ready:         10
  Available:     10

NAME                           KIND         STATUS        AGE  INFO
⟳ checkout                     Rollout      ✖ Degraded    21d
├──# revision:7
│  ├──⧉ checkout-5b8d6c8c7c    ReplicaSet   • ScaledDown  47m  canary
│  │  └──⊟ (canary pods, terminated)
│  └──α checkout-7             AnalysisRun  ✖ Failed      47m
│     ├──📊 error-rate         Measurement  ✔ Successful  46m
│     └──📊 p99-latency        Measurement  ✖ Failed      4m
└──# revision:6
   └──⧉ checkout-9f4d8c5b6     ReplicaSet   ✔ Healthy     21d  stable
The signal that matters is metric "p99-latency" assessed Failed due to failed (4) > failureLimit (3). That is the abort cause spelled out: the latency analysis hit four failed measurements when the limit was three.
To dig deeper into the analysis run:
kubectl describe analysisrun checkout-7 -n checkout
The describe output lists each measurement with its timestamp, the raw PromQL result, and the assessment (Successful, Inconclusive, or Failed). The four Failed measurements were at 03:07:43, 03:08:43, 03:09:43, and 03:10:43, with values 0.251, 0.255, 0.253, and 0.258. The next minute the rollout aborted, at 03:11. The first 18 measurements over the preceding 47 minutes had values clustered between 0.205 and 0.241, which is why the rollout neither advanced nor aborted during that window: the analysis could not string together five clean successes, and never accumulated the four failures that breach failureLimit: 3, so it kept measuring.
The PromQL gotcha at low RPS
The one false positive we hit in six months happened on a Sunday at 04:30 UTC. The service was at roughly 8 requests per second. The latency analysis fired a failure and aborted the rollout. The next morning, looking at the data, the p99 estimate had been swinging between 60ms and 350ms minute to minute, not because real latency was changing, but because at 8 RPS over a 2-minute window the quantile estimator was working from about 960 samples spread across 10 buckets, and the bucket the 99th percentile landed in kept hopping between adjacent boundaries.
The fix had two parts. First, widen the rate window from 2m to 5m for low-traffic services; the longer window smooths the bucket counts and stabilises the quantile estimate. Second, gate the analysis on a minimum-traffic guard. We added a second metric that checks request volume and short-circuits when traffic is below a threshold:
- name: traffic-guard
  interval: 60s
  count: 5
  successCondition: result[0] >= 5
  failureCondition: result[0] < 1
  failureLimit: 5
  provider:
    prometheus:
      address: http://prometheus.monitoring.svc.cluster.local:9090
      query: |
        sum(rate(http_requests_total{
          service="{{args.service-name}}-canary"
        }[5m]))
The guard returns the canary's RPS. If it is at least 5 the latency analysis is meaningful. If it drops below 1, the guard fails, which sounds backwards but is right: the analysis logic treats "we have so little traffic we cannot judge this" as a failure to advance, not as a green light. We schedule canary rollouts during business hours for low-traffic services for the same reason; the safety of the canary is proportional to the signal you have to judge it by. After the fix, p99 estimates on that service stabilised to within ±15ms minute over minute and we have not had a false positive since.
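For reference, the widened latency query from the first half of the fix. Only the rate window changes; the template's conditions stay the same:

histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{
    service="{{args.service-name}}-canary"
  }[5m])) by (le)
)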
Database migrations: keep them out of the rollout window
Schema migrations do not belong inside the canary rollout window. During the 10%/20%/40% phases you have a mixed-version fleet talking to one schema; any forward-incompatible change will break one of the two versions.
The pattern we use:
- The first PR adds the column as nullable and updates the application to write to it but not read from it. Both old and new code are forward-compatible with the new schema; the old code ignores it, the new code populates it.
- A background backfill Job populates the column for existing rows.
- The second PR adds reads from the column behind a feature flag that defaults off. This deploys through the canary ladder.
- The flag is flipped on with the usual progressive rollout in LaunchDarkly. This is where the read change is exercised, not inside the canary.
- A later cleanup PR drops the flag and removes the dead branch.
The actual ALTER TABLE runs in step 1's release window, before the canary starts, via a kubectl apply of a Job wrapping flyway migrate that the pipeline runs as a pre-step ahead of the Rollout apply. If the migration job fails, the pipeline never reaches the Rollout step.
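A sketch of that migration Job, assuming the SQL files ship in a ConfigMap and the credentials in a Secret; the resource names and image tag here are hypothetical, and the env vars are standard Flyway configuration:

apiVersion: batch/v1
kind: Job
metadata:
  name: checkout-migrate-2024-10-07   # hypothetical; suffixed per release in practice
  namespace: checkout
spec:
  backoffLimit: 0                     # fail fast; the pipeline treats any failure as a stop
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: flyway
          image: flyway/flyway:10     # official Flyway image; tag is an assumption
          args: ["migrate"]
          env:
            - name: FLYWAY_URL
              valueFrom:
                secretKeyRef: { name: checkout-db, key: jdbc-url }
            - name: FLYWAY_USER
              valueFrom:
                secretKeyRef: { name: checkout-db, key: user }
            - name: FLYWAY_PASSWORD
              valueFrom:
                secretKeyRef: { name: checkout-db, key: password }
          volumeMounts:
            - name: sql
              mountPath: /flyway/sql  # where Flyway looks for versioned migrations
      volumes:
        - name: sql
          configMap:
            name: checkout-migrations # hypothetical ConfigMap holding the .sql files

The pipeline applies this and blocks on kubectl wait --for=condition=complete on the Job before it touches the Rollout.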
Troubleshooting log from six months of running this
analysis run 'checkout-1234' failed: measurements[0]: failed to retrieve metric from prometheus is the most common error when standing this up for a new service. The controller cannot reach Prometheus: the address in the AnalysisTemplate is wrong, the service has a different name in your monitoring namespace, or a NetworkPolicy between argo-rollouts and monitoring drops the traffic. The fix is to kubectl exec into the controller pod and run wget -O - http://prometheus.monitoring.svc.cluster.local:9090/-/healthy from there. If that returns 200, the address is right and the problem is in the query syntax.
rollout 'checkout' has 0 ready replicas after a fresh kubectl apply is almost always image-pull related. The controller has created the canary ReplicaSet, pods are starting, and they are failing to pull because the cluster's kubelet identity lacks AcrPull on the registry, or the image tag does not exist in ACR yet. kubectl describe pod shows the actual reason. The pipeline produces the tag before the apply, but if the build and deploy stages run on different agents and ACR geo-replication lags, the deploy can race ahead of the image becoming available.
error: timed out waiting for the condition from kubectl argo rollouts status --watch --timeout 30m is the pipeline saying the rollout did not finish in time. The Rollout itself has no timeout; the pipeline's timeout is the only thing that gives up.
Rollout is paused (CanaryPauseStep) with Status: Paused in the CLI means the Rollout has reached a manual pause step and is waiting for kubectl argo rollouts promote checkout to advance. We do not use manual pause steps in production; all our pauses are pause: { duration: 5m } so they self-advance. If you see this state without expecting it, someone has edited the Rollout to add an indefinite pause during an incident and forgotten to remove it.
AnalysisRun is in error state: error: Get "http://prometheus...": dial tcp: lookup prometheus.monitoring.svc.cluster.local: no such host is DNS. CoreDNS cannot resolve the monitoring service. The Rollout's behaviour during DNS failure is conservative: analysis errors are not treated as failures, because they are infrastructure faults, not measurements of the canary. After enough consecutive errors (governed by consecutiveErrorLimit, a separate knob from failureLimit) the analysis is marked Error, not Failed, which still aborts the rollout but distinguishes "the system could not measure" from "the canary was bad."
What rolls back automatically, what does not
Argo Rollouts will roll back the canary ReplicaSet. It will scale it to zero, remove the canary Ingress, and leave the stable ReplicaSet untouched. It will not roll back configuration that lives outside the Rollout: ConfigMaps, Secrets, CronJobs, Jobs, separate Deployments, RBAC objects. If the canary failed because of a change in a ConfigMap that the new code expected and the old code did not, the ConfigMap is still in its new state after the rollback, and your stable pods may now also be unhealthy.
The mitigation: package ConfigMap changes as part of the application repo, version them with the same image tag, and use Helm or Kustomize hashes so a config change rolls forward and back together with the application. Argo CD handles this implicitly because the entire service is a single Application, generated from an ApplicationSet. Our checkout service has 14 Kubernetes resources in its chart, the Rollout being one of them, and the chart is reconciled as a unit. The pipeline does not kubectl apply the Rollout directly; it bumps the image tag in the Helm values file, commits, and lets Argo CD reconcile. The pipeline then watches the Rollout's status from outside, which is what the kubectl argo rollouts status --watch step is doing.
I simplified the pipeline above to make the example readable. In reality, the deploy step commits to a deploy repo, Argo CD picks up the commit within 30 seconds, and the watch step blocks on the Rollout status. The Rollout's behaviour is identical either way; what changes is whether the cluster state is driven by kubectl apply from the pipeline or by Argo CD reconciliation from Git. The Git-driven flow is what we run in production. The pipeline still feels like the orchestrator from a human point of view, because that is where the runs are visible and where notifications come from.
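The deploy step in the Git-driven flow is a commit rather than a kubectl apply. A minimal sketch, assuming a separate deploy repo, yq v4 on the agent, and a job granted push rights via the build service's access token; the repo URL and paths are illustrative:

set -euo pipefail
git clone "https://$(System.AccessToken)@dev.azure.com/contoso/platform/_git/deploy" deploy
cd deploy
# bump only the image tag; Argo CD notices the commit and reconciles the chart
export IMAGE_TAG='$(imageTag)'
yq -i '.image.tag = strenv(IMAGE_TAG)' services/checkout/values.yaml
git -c user.name='pipeline' -c user.email='pipeline@contoso.com' \
  commit -am "checkout: ${IMAGE_TAG}"
git push origin HEAD:main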
Six months in
Eight aborts, seven real catches, one false positive that produced a guard we wish we had had on day one. Median rollout time of 31 minutes from kubectl apply to fully promoted. P99 latency regressions are caught at roughly the 10% to 20% canary mark, which is to say within the first 10 minutes, well before the change has had any meaningful customer impact.
The thing that made this stick was the unattended part. The first three months we deployed during business hours and an engineer watched the Rollout dashboard for every release. Eventually it became clear that the engineer was not adding signal; the analysis was deciding faster than a human could, and the human's main job was to trust the system. We moved to overnight deploys for non-revenue-critical services, and the abort at 03:11 was the system passing its own test. The on-call engineer slept through what would have been, in the old world, a 3 a.m. page and a fire drill. The morning post-mortem was a calm conversation about a regex compile rather than a blameful conversation about who broke production.
Eleven services on the pattern at the time of writing. The pattern is becoming the team's default for anything user-facing. The internal services that do not yet have a defensible SLO continue to use vanilla Deployment resources, because turning on Argo Rollouts without a meaningful analysis is theater. The day a service grows real traffic and a real SLO is the day it gets converted, and the conversion is roughly an hour of work: rewrite the Deployment as a Rollout, write two AnalysisTemplates, define the strategy ladder, point the pipeline at the new wait command. The hard work, the analysis design, is the bit that takes thought; the YAML is the easy part.
The seven services that are not on this pattern yet are the next quarter's work. I would like to be at eighteen services by the time we ship the Istio adoption and shift the traffic split from NGINX annotations to VirtualService weights. That migration is additive; the Rollout strategy stays the same, only the trafficRouting block changes, and the AnalysisTemplates do not need to change at all because they query the same Prometheus that scrapes the same pods. The investment in the analysis templates carries forward across every subsequent change to the traffic layer, which is the property of well-chosen abstractions, and the reason this work has felt like leverage rather than maintenance.