
Cutting an AKS deploy from 45 minutes to 8 minutes with Azure Pipelines

A 4:47pm Friday queue of four pull requests waiting on a 45-minute AKS pipeline kicked off eleven weeks of surgery on Cache@2, parallel stages, and a Bicep what-if that had no business living in the deploy. The full rebuild brought it down to 7:51.

15 min read · Azure Pipelines · AKS · Cache@2 · Bicep

It was a Friday in late August, 4:47pm, and the release manager pinged me in Teams with a single screenshot: a four-deep queue of pull requests stacked against main, every one of them waiting on the AKS deploy pipeline to finish. The pipeline currently in flight had been running for 38 minutes. Behind it, three more, each forecasted at roughly 45 minutes. The math said the engineer who had merged the first PR at 4:11pm would still be at his desk at quarter past seven. He had a flight at nine.

That weekend, I sat with the run logs and a coffee and started timing each stage by hand. Twelve minutes on build. Eight on Bicep what-if. Six on Helm install. Nine on smoke tests. Eight more on what I started calling "the boring middle," which is what happens when stages run sequentially even though half of them have no business waiting on each other. The total, 43 to 47 minutes depending on the agent, was the cost of a pipeline that had grown organically for two years and nobody had ever sat down and shaken out. Eleven weeks later, the same pipeline, the same cluster, the same Bicep templates, deploys in 7 minutes 51 seconds on a warm cache and 9 minutes 12 seconds on a cold one. This is the whole rebuild.

Where the 45 minutes actually went

The first thing I did was instrument. Azure DevOps gives you stage duration in the run summary, but it does not give you the long tail: which restore took the longest, which test project hogged the agent, which kubectl rollout status quietly waited 70 seconds for the readiness probe. The cheapest tool here is time wrapped around every meaningful command in the existing pipeline, with the output piped to tee so it lands in the run log:

- task: Bash@3
  displayName: 'time: dotnet restore'
  inputs:
    targetType: inline
    script: |
      { time dotnet restore Api.sln --locked-mode ; } 2> >(tee -a $(Build.ArtifactStagingDirectory)/timings.log >&2)

After two runs the timings.log told the real story. The 12-minute build broke down into 4 minutes 20 of dotnet restore, 3 minutes 10 of npm ci (the front-end was bundled into the same job), 2 minutes 40 of dotnet build, and a slow docker build because the base image layer was being pulled fresh every run. The 8 minutes of Bicep what-if was, embarrassingly, what-if running three times: once on the AKS module, once on the Key Vault module, once on the networking module, each in its own step, each spinning up its own ARM session. The 9-minute smoke test was a curl loop that polled the new pod for a 200 on /healthz with a 30 second sleep between attempts, and the pod usually came healthy on the second poll, which meant 7 of those 9 minutes were the loop waiting for itself.

Then there was the structural problem: the whole pipeline was one stage with one job. Build, scan, what-if, deploy-dev, deploy-test, deploy-prod, all in a straight line. Nothing parallel. Nothing skipped on PR. The agent pool was generous (Standard_D4ds_v5) but most of the time only one core was working.

So the optimisation had four threads: cache what is repeatedly fetched, fan out what can run in parallel, move what-if to where it actually belongs (the PR validation pipeline, not the deploy), and tune the AKS rolling deploy so the cluster stops being the bottleneck. None of these are clever individually. They compound.

Thread one: Cache@2 on restore artifacts

The Cache@2 task is the single biggest win in any Azure Pipelines refactor I have done. It is documented on Microsoft Learn, but the doc is a reference, not a worked example. The two things the doc undersells: the cache key is a fingerprint, not a name, and a cache miss is not a failure, it is a no-op followed by the cache being populated at job end.

Here is the NuGet cache as we ended up with it for the .NET API:

variables:
  NUGET_PACKAGES: $(Pipeline.Workspace)/.nuget/packages

stages:
  - stage: Restore
    displayName: 'Restore dependencies'
    jobs:
      - job: RestoreDotnet
        pool:
          vmImage: ubuntu-latest
        steps:
          - checkout: self
            fetchDepth: 1

          - task: Cache@2
            displayName: 'Cache NuGet packages'
            inputs:
              key: 'nuget | "$(Agent.OS)" | **/packages.lock.json,!**/bin/**,!**/obj/**'
              restoreKeys: |
                nuget | "$(Agent.OS)"
              path: $(NUGET_PACKAGES)
              cacheHitVar: NUGET_CACHE_HIT

          - task: Bash@3
            displayName: 'dotnet restore (locked mode)'
            condition: ne(variables.NUGET_CACHE_HIT, 'true')
            inputs:
              targetType: inline
              script: |
                dotnet restore Api.sln --locked-mode --packages $NUGET_PACKAGES

Three things to notice. The key includes $(Agent.OS) because a cache populated on ubuntu-latest is not portable to a Windows agent; I learned that the hard way the first time I added a Windows test agent to the matrix and it spent four minutes "restoring" from a cache that was bit-for-bit wrong for the platform. The key also fingerprints **/packages.lock.json, which only works if you actually commit lock files; we did not, until this work, and that one decision alone (enabling locked mode with committed lockfiles) was worth making on its own for the speed-up it brought.

The restoreKeys block is the fallback chain. If the exact-match key is not in the cache, Azure Pipelines tries the next looser key (nuget | "$(Agent.OS)"), which gives you a "warm but not exact" cache. On the first deploy after a lockfile change you still pay a partial restore (maybe 50 seconds instead of 4 minutes 20), then the new cache is populated under the new key for next time.
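Committing lock files in the first place is a one-off local step rather than a pipeline concern. A sketch of how to generate them, using the --use-lock-file switch on dotnet restore:

# One-off, run locally: write packages.lock.json next to every project in the
# solution, then commit them so --locked-mode and the cache key have a stable
# fingerprint to pin against.
dotnet restore Api.sln --use-lock-file
find . -name packages.lock.json -exec git add {} +
git commit -m "Commit NuGet lock files for locked-mode restore"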

The npm side is the same shape with one wart:

- task: Cache@2
  displayName: 'Cache node_modules'
  inputs:
    key: 'npm | "$(Agent.OS)" | web/package-lock.json'
    restoreKeys: |
      npm | "$(Agent.OS)"
    path: web/node_modules
    cacheHitVar: NPM_CACHE_HIT

- task: Bash@3
  displayName: 'npm ci'
  condition: ne(variables.NPM_CACHE_HIT, 'true')
  inputs:
    targetType: inline
    workingDirectory: web
    script: npm ci --no-audit --prefer-offline

The wart: when node_modules is restored from cache, npm does not run, which means lifecycle scripts like postinstall also do not run. We had a postinstall that built a generated client from the OpenAPI spec, and on the first deploy after this change the build broke because the generated client was a year out of date. The fix was to move that generation into its own pipeline step that runs unconditionally and writes into a separate path. Cache restores the heavy node_modules tree; the codegen step is fast and runs every time.
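A sketch of that unconditional codegen step; the npm script name and the output path here are illustrative, not our exact ones:

- task: Bash@3
  displayName: 'generate OpenAPI client (runs every time)'
  inputs:
    targetType: inline
    workingDirectory: web
    script: |
      # No cache-hit condition on purpose: codegen is cheap and must track the spec.
      # Writing outside node_modules means the cached tree never owns generated code.
      npm run generate:client -- --output ./src/generated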

Docker layer caching took a different approach. We were already building images on the agents, and the agent's local Docker daemon does not survive past the job. The trick is buildx with --cache-from and --cache-to pointed at the registry itself:

- task: AzureCLI@2
  displayName: 'docker buildx (with registry cache)'
  inputs:
    azureSubscription: $(serviceConnection)
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: |
      az acr login -n $(acrName)
      docker buildx create --use --name builder || true
      docker buildx build \
        --platform linux/amd64 \
        --cache-from type=registry,ref=$(acrName).azurecr.io/api:buildcache \
        --cache-to   type=registry,ref=$(acrName).azurecr.io/api:buildcache,mode=max \
        --tag $(acrName).azurecr.io/api:$(Build.BuildNumber) \
        --push \
        ./api

The registry now holds a buildcache tag that is shared across all pipeline runs and agent VMs. The first build populates it; every subsequent build pulls layers from there. The roughly two minutes we used to spend on the docker build dropped to about 40 seconds by the third run and stayed there. The AzureCLI@2 task (docs) handles the registry auth for us; do not introduce a separate docker login step, because the federated token from the Service Connection is the one you want acting on the registry.

Thread two: fan out from a single restore

Once restore was cached, I could split build, test, and security scan into parallel jobs that all depend on the restore stage but not on each other. Azure Pipelines makes this clean if you treat dependsOn as a directed graph, not a list:

stages:
  - stage: Restore
    displayName: 'Restore'
    jobs:
      - job: RestoreAll
        steps:
          - { template: templates/restore-nuget.yml }
          - { template: templates/restore-npm.yml }

  - stage: BuildAndVerify
    displayName: 'Build, test, scan'
    dependsOn: Restore
    jobs:
      - job: BuildApi
        steps:
          - { template: templates/build-dotnet.yml }
      - job: BuildWeb
        steps:
          - { template: templates/build-web.yml }
      - job: TestUnit
        dependsOn: BuildApi
        steps:
          - { template: templates/test-unit.yml }
      - job: ScanContainer
        dependsOn: BuildApi
        steps:
          - { template: templates/scan-trivy.yml }
      - job: LintBicep
        steps:
          - { template: templates/lint-bicep.yml }

  - stage: DeployDev
    dependsOn: BuildAndVerify
    jobs:
      - deployment: DeployDev
        environment: dev-aks
        strategy:
          runOnce:
            deploy:
              steps:
                - { template: templates/helm-upgrade.yml, parameters: { env: dev } }

  - stage: DeployTest
    dependsOn: BuildAndVerify
    jobs:
      - deployment: DeployTest
        environment: test-aks
        strategy:
          runOnce:
            deploy:
              steps:
                - { template: templates/helm-upgrade.yml, parameters: { env: test } }

Two things changed in shape. First, BuildAndVerify is now a single stage with five jobs, four of which can run concurrently because they only depend on the artifacts produced by RestoreAll. The agent pool has enough headroom (we have eight parallel jobs available on our Azure DevOps Services plan) to run all five at the same time. Total wall-clock for the stage dropped from 12 minutes sequential to about 4 minutes, because the longest single job (the .NET build, 3:40) is now the floor, not the sum of everything.

Second, DeployDev and DeployTest both depend on BuildAndVerify but not on each other. The old pipeline had DeployTest waiting on DeployDev to finish, which made sense back when the smoke test in dev was meant to gate test. It no longer made sense because dev and test were running on separate clusters, separate Bicep deployments, with separate failure surfaces. They could fan out. Production is still gated by the test smoke test passing (more on that below) but dev and test deploy in parallel, saving roughly another 6 minutes off the wall-clock.
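The production stage is not shown in the snippet above; it is the same deployment shape, hung off DeployTest so the test smoke test gates it. A sketch, with prod-aks as an illustrative environment name (the approval itself is configured on the environment, not in the YAML):

  - stage: DeployProd
    dependsOn: DeployTest
    jobs:
      - deployment: DeployProd
        environment: prod-aks
        strategy:
          runOnce:
            deploy:
              steps:
                - { template: templates/helm-upgrade.yml, parameters: { env: prod } }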

The deployment job type (docs) is what gives you environment binding and approval gates. The first time I used it I tried to keep the deploy steps in a regular job and got bitten by the fact that environment approvals only fire on deployment jobs. The shape is small; the wiring matters.

Thread three: getting Bicep what-if out of the deploy

The Bicep what-if doing 8 minutes of work on every deploy was the most defensible-looking line item to keep, and the easiest to remove once I looked at it properly. The premise of what-if is: tell me what would change if I ran this deployment. The premise of running the deploy is: I have decided to run this deployment. Once the deploy starts, the what-if output is not informing any decision; it is just delaying the deploy and being read by nobody.

The correct home for what-if is the PR validation pipeline, where the diff is the actual artifact a reviewer wants to read before approving the merge. We moved it there, and the deploy pipeline lost 8 minutes outright.

The PR pipeline now has one stage:

# .azure/pipelines/pr-validation.yml
trigger: none
pr:
  branches:
    include: [main]
  paths:
    include:
      - infra/**
      - api/**
      - web/**

pool:
  vmImage: ubuntu-latest

stages:
  - stage: Validate
    jobs:
      - job: BicepWhatIf
        condition: |
          or(
            startsWith(variables['System.PullRequest.SourceBranch'], 'refs/heads/feature/'),
            startsWith(variables['System.PullRequest.SourceBranch'], 'refs/heads/release/')
          )
        steps:
          - checkout: self
          - task: AzureCLI@2
            displayName: 'az deployment sub what-if (single shot)'
            inputs:
              azureSubscription: $(readOnlyServiceConnection)
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                az deployment sub what-if \
                  --location eastus2 \
                  --template-file ./infra/main.bicep \
                  --parameters ./infra/prod.bicepparam \
                  --result-format FullResourcePayloads \
                  -o json > whatif.json

                python3 ./scripts/whatif-to-md.py whatif.json > $(Build.ArtifactStagingDirectory)/whatif.md

          - task: PublishPipelineArtifact@1
            inputs:
              targetPath: $(Build.ArtifactStagingDirectory)/whatif.md
              artifactName: whatif-report

Three small but earned details. We run what-if at subscription scope against the single root Bicep file, not against each module separately, which is what was costing the old pipeline 8 minutes (three ARM sessions, three module fetches, three full plan evaluations). az deployment sub what-if (reference) plans the whole template tree in one shot, takes about 90 seconds for our infra, and emits one consolidated JSON that we render into Markdown via a small Python script. The Markdown gets posted as a PR comment by a downstream step. Reviewers read it before approving the merge; the deploy pipeline never has to render it again.

The service connection on the PR pipeline is the read-only one, scoped to Reader on the target subscription. That is enough to run what-if (what-if does not write) and means a PR pipeline that hits a Bicep regression cannot actually mutate anything. We had been running PR what-if under the writer service connection out of habit; that was a defensible-looking habit that did not survive a security review.

Bicep what-if is documented here and the subtlety to read carefully is the difference between the two result formats, ResourceIdOnly and FullResourcePayloads. We use FullResourcePayloads for the PR comment because reviewers want to see the actual property deltas; the runtime cost is roughly 20 seconds extra over ResourceIdOnly and worth it for the read.
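The downstream step that posts whatif.md back to the pull request is not shown above either. A sketch of one way to do it, using the pull request threads REST API with the job access token; the build service identity needs Contribute to pull requests on the repo for the POST to succeed:

- task: Bash@3
  displayName: 'post what-if report as PR comment'
  env:
    SYSTEM_ACCESSTOKEN: $(System.AccessToken)
  inputs:
    targetType: inline
    script: |
      # Wrap the rendered Markdown in a new PR thread via the Azure DevOps REST API.
      BODY=$(jq -n --rawfile md "$(Build.ArtifactStagingDirectory)/whatif.md" \
        '{comments: [{parentCommentId: 0, content: $md, commentType: 1}], status: 1}')
      curl -fsS -X POST \
        -H "Authorization: Bearer ${SYSTEM_ACCESSTOKEN}" \
        -H "Content-Type: application/json" \
        -d "$BODY" \
        "$(System.CollectionUri)$(System.TeamProject)/_apis/git/repositories/$(Build.Repository.ID)/pullRequests/$(System.PullRequest.PullRequestId)/threads?api-version=7.0"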

Thread four: tuning the AKS rolling deploy

The Helm install step used to take 6 minutes wall-clock. Most of that was waiting for kubectl rollout status to declare the deployment healthy. The rollout was using the Kubernetes default of maxSurge: 25% and maxUnavailable: 25%, which on a 4-replica deployment means surging by 1 pod and tolerating 1 unavailable. The new pod's readiness probe was an httpGet on /healthz every 5 seconds with initialDelaySeconds: 30, which guaranteed each pod waited 30 seconds before its first probe. Four pods, rolling one at a time, 30 seconds of forced wait each, plus the probe interval and the registry pull, and you have a 6 minute deploy mostly composed of pauses.

The fix was three small changes to the Helm chart values:

# helm/values-prod.yaml
replicaCount: 4

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 50%
    maxUnavailable: 0

readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 3
  periodSeconds: 2
  failureThreshold: 5

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 2
  failureThreshold: 20

maxSurge: 50%, maxUnavailable: 0 means we add 2 pods first, wait for them to be ready, then start retiring old pods. The number of healthy pods never drops below the original 4 during the rollout, but we are now rolling 2 at a time instead of 1, which roughly halves the wall-clock.

The probe change is the bigger one. initialDelaySeconds: 30 was a safety pad from when the .NET startup was slow; the app now warms in about 4 seconds. Moving the 30 second pad into a dedicated startupProbe (docs) means slow starts are still tolerated (the startup probe gives the pod up to 40 seconds before liveness even begins) but the fast normal case fires its first readiness probe at 3 seconds and the rollout proceeds. The Helm install on AKS now reports ready in about 70 seconds, down from 6 minutes.

The Helm command itself I also tightened. The old step was a bare helm upgrade --install --wait with the default 5-minute timeout, no rollback on failure, and no explicit gate on the rollout itself. The new step is explicit about all three:

- task: AzureCLI@2
  displayName: 'helm upgrade'
  inputs:
    azureSubscription: $(serviceConnection)
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: |
      az aks get-credentials -g $(rgName) -n $(aksName) --overwrite-existing
      helm upgrade api ./helm \
        --install \
        --namespace api \
        --create-namespace \
        --values ./helm/values-${{ parameters.env }}.yaml \
        --set image.tag=$(Build.BuildNumber) \
        --atomic \
        --timeout 4m \
        --wait
      kubectl rollout status deployment/api -n api --timeout=2m

--atomic rolls the release back if it does not converge within --timeout, which is what you want in CI; a half-applied Helm release that you then have to manually helm rollback was a 2am incident I had earlier in the year. kubectl rollout status as an explicit final gate catches the edge case where Helm thinks the release succeeded but a pod has crashed in its post-start hook.

Caching Helm chart dependencies is also worth a step. If your chart pulls subcharts (we pull a Redis subchart and an external-secrets subchart), the first run does a helm dependency update that hits the chart registry. We cache the charts/ directory:

- task: Cache@2
  displayName: 'Cache helm subcharts'
  inputs:
    key: 'helm | "$(Agent.OS)" | helm/Chart.lock'
    path: helm/charts
    cacheHitVar: HELM_CACHE_HIT

- task: Bash@3
  displayName: 'helm dependency update'
  condition: ne(variables.HELM_CACHE_HIT, 'true')
  inputs:
    targetType: inline
    script: helm dependency update ./helm

The chart lock file is the fingerprint, exactly the same shape as the NuGet and npm caches. The dependency update is only 25 seconds when it does run, but skipping it on a warm cache is free latency.

Smoke tests, the right way

The old smoke test was a curl loop that polled the new pod over the AKS ingress with a 30 second sleep between attempts. That implementation made the smoke test slow because the smoke test was its own bottleneck. The new shape uses an internal LoadBalancer with a known address and a much tighter loop:

- task: Bash@3
  displayName: 'smoke: /healthz'
  inputs:
    targetType: inline
    script: |
      INGRESS_IP=$(kubectl get svc api-internal -n api -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
      for i in $(seq 1 60); do
        if curl -fsS --max-time 2 "http://${INGRESS_IP}/healthz" > /dev/null; then
          echo "healthy after $((i*2))s"
          exit 0
        fi
        sleep 2
      done
      echo "smoke timeout"
      exit 1

- task: Bash@3
  displayName: 'smoke: /api/version'
  inputs:
    targetType: inline
    script: |
      INGRESS_IP=$(kubectl get svc api-internal -n api -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
      VERSION=$(curl -fsS "http://${INGRESS_IP}/api/version" | jq -r .commit)
      if [ "$VERSION" != "$(Build.SourceVersion)" ]; then
        echo "version mismatch: got $VERSION, expected $(Build.SourceVersion)"
        exit 1
      fi

Two seconds between attempts, two second timeout per attempt, up to two minutes total. The previous version's 30 second sleep was just lore that nobody had revisited. The version-check second step is the safety belt I always add: confirm the pod that is now serving traffic is the one this pipeline run built, by reading back a commit SHA exposed on a /api/version endpoint. Twice in the last year that check has caught a deploy where the rollout had silently routed me back to an older pod because the new pod's imagePullSecrets had expired and the new replica was stuck in ImagePullBackOff while the old replicas served. The smoke test catches that case in about 4 seconds.
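For the version check to work at all, the commit SHA has to be baked into the image at build time. One way to wire it, assuming the Dockerfile declares ARG GIT_COMMIT and exposes it as an environment variable the /api/version handler reads back (the names here are illustrative), is a build arg on the buildx invocation from the build job:

docker buildx build \
  --platform linux/amd64 \
  --build-arg GIT_COMMIT=$(Build.SourceVersion) \
  --cache-from type=registry,ref=$(acrName).azurecr.io/api:buildcache \
  --cache-to   type=registry,ref=$(acrName).azurecr.io/api:buildcache,mode=max \
  --tag $(acrName).azurecr.io/api:$(Build.BuildNumber) \
  --push \
  ./api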

The Windows agent cache gotcha

About three weeks into the rollout we added Windows agents to the matrix for a separate suite of platform-team tools. The pipeline that used the cached NuGet packages on Linux started failing on the Windows agents with:

error NU1100: Unable to resolve 'Microsoft.NET.Sdk.Functions (>= 4.4.1)' for 'net8.0'.

The cache key was nuget | "$(Agent.OS)" | **/packages.lock.json. The $(Agent.OS) variable is Linux on a Linux agent and Windows_NT on a Windows agent, which on its own should have been enough to separate the caches. The bug was further down: our packages.lock.json had been generated on Linux and committed, and on Windows the path separator inside the lockfile (forward slash vs backslash for nested package references) was causing dotnet restore --locked-mode to think the lock was inconsistent.

The fix was to commit two lockfiles, packages.lock.json (Linux, default) and packages.lock.windows.json (Windows), and have the restore step pick one based on $(Agent.OS). Ugly, but stable. The deeper fix would be to standardise on Linux agents for everything, which is where we are heading; the Windows tools are being ported to PowerShell 7 cross-platform variants and the matrix will collapse next quarter.
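In the interim, the OS-aware restore step looks roughly like this; NuGetLockFilePath is the MSBuild property that points restore at an alternate lock file name, resolved per project directory, so treat the exact wiring as a sketch:

- task: Bash@3
  displayName: 'dotnet restore (platform-specific lockfile)'
  inputs:
    targetType: inline
    script: |
      # Pick the committed lockfile that matches the agent OS.
      LOCKFILE="packages.lock.json"
      if [ "$AGENT_OS" = "Windows_NT" ]; then
        LOCKFILE="packages.lock.windows.json"
      fi
      dotnet restore Api.sln --locked-mode \
        -p:NuGetLockFilePath=$LOCKFILE \
        --packages $NUGET_PACKAGES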

Troubleshooting, the actual log lines

A few errors I actually hit during the cut-over, with the diagnosis that worked.

Cache@2 task failed: Tar failed with error: The process '/usr/bin/tar' failed with exit code 2. This was a cache path that included a symlink loop (an old node_modules had a .bin folder linking to itself via a buggy postinstall). Cache@2 packages the path via tar, and the symlink loop made tar exit non-zero. Deleting the broken node_modules once and letting it repopulate fixed it; the longer-term fix was repairing the postinstall that created the loop, because the Cache@2 path input takes a single folder with no wildcard or exclusion support, so you cannot carve a subtree like .bin out of the cached tree.

Helm install failed: another operation (install/upgrade/rollback) is in progress. This happened because a previous pipeline run had been cancelled mid-Helm-install, and the Helm release record in the cluster was stuck in a pending-upgrade state. The release lock did not time out automatically. The fix was a helm rollback api --history-max=5 from a kubectl-enabled agent, after which the next pipeline run installed cleanly. We added a pre-step to the deploy job that checks helm status api -n api and, if it reports pending-upgrade for more than four minutes, calls helm rollback first. It has fired twice in the last eight months, both times correctly.
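A sketch of that guard step; the jq paths into helm status -o json and the four-minute threshold are how I would write it today rather than a verbatim copy, so verify against your Helm version:

- task: AzureCLI@2
  displayName: 'guard: clear stuck helm release'
  inputs:
    azureSubscription: $(serviceConnection)
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: |
      az aks get-credentials -g $(rgName) -n $(aksName) --overwrite-existing
      # If the release has sat in pending-upgrade for more than ~4 minutes,
      # roll back to the last good revision before attempting the upgrade.
      STATUS=$(helm status api -n api -o json 2>/dev/null | jq -r '.info.status // empty')
      if [ "$STATUS" = "pending-upgrade" ]; then
        LAST=$(helm status api -n api -o json | jq -r '.info.last_deployed')
        AGE=$(( $(date +%s) - $(date -d "$LAST" +%s) ))
        if [ "$AGE" -gt 240 ]; then
          echo "release stuck in pending-upgrade for ${AGE}s, rolling back"
          helm rollback api -n api
        fi
      fi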

ERROR: AADSTS70021: No matching federated identity record found for presented assertion. This was the read-only service connection on the PR pipeline. The subject claim on the federated credential had been set against the old PR pipeline's path, and the new PR pipeline was running under a different name. Re-reading the issuer and subject from the Service Connection JSON and updating the federated credential on the Entra app fixed it. I have written about this exact failure mode in a separate article on workload identity federation; in this pipeline the symptom showed up as a one-line failure in AzureCLI@2 before any az command had a chance to run.

The pipeline is not valid. Job RestoreAll cannot reference job BuildApi because BuildApi has not yet run. Self-inflicted: I had the restore job trying to dependsOn a job in the next stage. Job-level dependsOn only works within a stage; the fix was to move the dependency to the stage level (dependsOn: BuildAndVerify on the next stage) instead of trying to express it job-to-job across stages.

The new trace, end to end

Here is the final timing breakdown on a warm-cache prod deploy, taken from a run on a Tuesday three weeks ago:

Stage: Restore                       0:42
  RestoreAll                         0:42
Stage: BuildAndVerify                3:54
  BuildApi          (parallel)       3:38
  BuildWeb          (parallel)       1:12
  TestUnit          (parallel)       2:08
  ScanContainer     (parallel)       1:42
  LintBicep         (parallel)       0:18
Stage: DeployDev                     1:48
  helm upgrade                       1:11
  smoke /healthz                     0:08
  smoke /api/version                 0:02
Stage: DeployTest    (parallel)      1:51
  helm upgrade                       1:14
  smoke /healthz                     0:09
  smoke /api/version                 0:02
Stage: DeployProd                    2:31  (approval-gated, blocks on test smoke)
  helm upgrade                       1:34
  smoke /healthz                     0:09
  smoke /api/version                 0:02
  post-deploy sanity                 0:28

Total wall clock: 7:51

DeployDev and DeployTest run in parallel; their wall clocks overlap, which is why the total adds up to less than the linear sum. DeployProd waits on the test smoke test passing because we want one real environment to have proved the build before we touch production.

The cold-cache equivalent (first run after a lockfile change in NuGet and npm and Helm Chart.lock simultaneously, which is a near-worst case) is 9 minutes 12 seconds. Most of the extra time is paid in BuildAndVerify; the deploy stages are unchanged because they do not depend on the caches.

The 4:47pm Friday queue does not happen anymore. The same four-PR scenario at 4:47pm now clears by 5:30, because each pipeline run is 8 minutes instead of 45 and the queue itself drains faster. The release manager has stopped pinging me on Fridays. The flight got caught.

Reflective coda

A pipeline is rarely slow for one reason. The 45 minutes did not come from a single bad step; it came from a dozen mediocre defaults that nobody had a budget to revisit. The build was slow because nobody had set up Cache@2, what-if was slow because it ran three times instead of once, the rollout was slow because the readiness probe pad was a year out of date, and the whole structure was sequential because that was how it had been when the pipeline was first written and nobody had a forcing function to redraw the dependency graph. The optimisation, looked at as a whole, is not technically novel. It is a pattern any senior engineer on this stack would land on if they sat with the run logs for an afternoon. The fact that it took us two years to do that afternoon is the more interesting observation.

The other observation, which surprised me, is how much of the wall-clock savings came from changes that are not really about pipeline speed at all. Moving what-if to PR validation made the deploy faster, but the real value was that reviewers started reading the diff before approving merges, which caught two infrastructure drift bugs in the first month that would otherwise have shipped. Tuning the readiness probe made the rollout faster, but the real value was that production rollouts now match dev rollouts in shape, which made on-call runbooks shorter. The deploy-time-to-eight-minutes was the headline; the second-order effects on developer experience and on-call sanity were the actual win. I would not have framed it that way at the start. I do now.

There is one opinion I will hold to. If your CI pipeline takes more than 12 minutes for the change that gets queued most often, the team's relationship to deployment is broken in ways you cannot see until you fix it. People stop deploying close to end of day. They batch merges. They get nervous about small refactors because the cost of validating them dwarfs the change itself. The technical work to bring the pipeline down is, in retrospect, the easy part. The harder part is convincing the org that a fast deploy pipeline is a primary engineering output, not a thing the platform team gets to when there is slack. The metric I would measure now is not pipeline duration; it is "how often does someone choose to merge after 4pm." Before the rebuild, that number was effectively zero across our team. Now it is whatever the calendar happens to make it. That, more than any of the YAML above, is the change worth defending.