Container Apps with VNet integration: locked egress, the 15:48 image-pull failure, and the Dapr+KEDA pattern that replaced our queue workers
At 15:48 on a Wednesday, eleven Container Apps in one VNet-integrated environment stopped pulling their images. The Firewall lockdown we shipped the day before had named AzureContainerRegistry but missed MicrosoftContainerRegistry, and the platform shim that gives every revision its Dapr and KEDA hooks comes from there. The full migration story: 11 AKS queue workers down to one Container Apps environment, Dapr deleting 300 lines of state and secrets glue per app, KEDA's azure-servicebus scaler on managed identity, and the egress rule set we ended up with after the page.
The 15:48 page came in as a single sentence from the platform monitoring bot: "containerapp-orders-worker: revision unhealthy, replicas 0/3, image pull failed." Twenty seconds later the same alert fired for ten other Container Apps in the same environment. We had finished migrating the last of the queue worker fleet off AKS three weeks earlier. Eleven apps, all in one Container Apps environment, all wired to Service Bus via KEDA, all reading state through Dapr. They had been running cleanly through 28 deploys and roughly 14 million Service Bus messages since the cutover. At 15:48 on a Wednesday, every one of them stopped pulling its image.
The error in the activity log was unambiguous: Failed to pull image "myregistry.azurecr.io/orders-worker:2026.05.07-rev41": GET https://mcr.microsoft.com/v2/azure-container-apps/k8se/manifests/... 403 Forbidden. Container Apps does not, on the surface, talk to MCR for app images; the app image came from our private ACR. What it does talk to MCR for is the platform shim it injects into every revision, the bit that gives you Dapr and the KEDA sidecar without you doing the Kubernetes plumbing yourself. That pull was failing because at 09:00 that morning, the platform team had pushed a Firewall rule change that tightened egress on the Container Apps subnet. The Service Tags list we wrote into the rules looked complete on paper. It missed the one tag the activity log was now begging for.
This is the whole story of that migration, from the AKS workload it replaced through the egress design that broke us, to the Dapr and KEDA pattern that sits there now, deploying clean.
The before state, 11 queue workers on AKS
The workload had been the same shape for two years. Eleven Go services, each consuming from one Service Bus topic subscription, each doing one thing per message (write to Cosmos, call a downstream API, fan out to a smaller topic), each running as a Kubernetes Job triggered by a separate KEDA ScaledJob on the cluster. The cluster itself was a Standard_D8ds_v5 node pool of four nodes, always on, costing us roughly £3,200 a month including reservations and observability sidecars.
The Kubernetes manifests for one worker came to about 340 lines of YAML before Helm rendering. Most of that was scaffolding: secret mounts from CSI for Service Bus connection strings (we had not finished the move to managed identity), an Envoy sidecar for outbound TLS to Cosmos, a small Go binary run as an init container that warmed the state cache. Multiply by eleven, factor in the drift between manifests because every team had their own copy, and the platform team was spending one engineer-day a week on YAML rather than on the actual data plane.
The motivation to move was not exotic. It was: less Kubernetes per worker, Dapr for the cross-cutting concerns (state, secrets, pub/sub), KEDA built in instead of bolted on, and per-second billing during the long idle stretches every Saturday morning when the upstream stopped pushing. The platform team's read of Container Apps on Microsoft Learn said all of those properties were true. The proof of concept I ran in March agreed.
The architecture we landed on
One Container Apps environment in workloadProfiles mode, infrastructure-subnet-injected into the existing platform VNet. Eleven Container Apps, each its own resource, each scaled by KEDA's azure-servicebus trigger, each with two Dapr components attached: a Cosmos state store and an Entra-backed secrets resolver. ACR private endpoint into the same VNet, no public ingress. All egress through Azure Firewall in the hub. The relevant Bicep, trimmed but real:
@description('Container Apps environment in workloadProfiles mode, VNet integrated')
param location string = resourceGroup().location
param environmentName string = 'cae-platform-prod-eus2'
param infrastructureSubnetId string
param logAnalyticsWorkspaceCustomerId string
@secure()
param logAnalyticsWorkspaceSharedKey string

resource caEnv 'Microsoft.App/managedEnvironments@2024-03-01' = {
  name: environmentName
  location: location
  properties: {
    appLogsConfiguration: {
      destination: 'log-analytics'
      logAnalyticsConfiguration: {
        customerId: logAnalyticsWorkspaceCustomerId
        sharedKey: logAnalyticsWorkspaceSharedKey
      }
    }
    vnetConfiguration: {
      internal: true
      infrastructureSubnetId: infrastructureSubnetId
    }
    workloadProfiles: [
      {
        name: 'Consumption'
        workloadProfileType: 'Consumption'
      }
      {
        name: 'D8'
        workloadProfileType: 'D8'
        minimumCount: 0
        maximumCount: 3
      }
    ]
    daprAIConnectionString: '' // we send Dapr telemetry to App Insights via env var instead
    zoneRedundant: true
  }
}

output environmentId string = caEnv.id
Two things in that block matter for what follows. vnetConfiguration.infrastructureSubnetId is what wires the environment into our hub-and-spoke network rather than the default Microsoft-managed VNet. workloadProfiles with a Consumption profile plus a D8 profile is the part that gave us per-second billing for the bursty workers and reserved-capacity D8 for the one worker that needed 8 vCPU bursts. The reasoning on the profile mix is detailed on the workload profiles page on Microsoft Learn.
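If you want to sanity-check the profile mix without re-running the Bicep, recent az CLI versions expose it directly. A minimal sketch against the environment above; check az containerapp env workload-profile --help if your CLI is older:

# List the profiles the environment currently exposes (name, type, min and max counts)
az containerapp env workload-profile list \
  --name cae-platform-prod-eus2 \
  --resource-group rg-platform-prod-eus2 \
  -o table

# Add or resize the dedicated D8 profile without redeploying the environment
az containerapp env workload-profile add \
  --name cae-platform-prod-eus2 \
  --resource-group rg-platform-prod-eus2 \
  --workload-profile-name D8 \
  --workload-profile-type D8 \
  --min-nodes 0 \
  --max-nodes 3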
Each Container App is its own resource. The Bicep is small once the environment exists. This is the orders worker:
param environmentId string
param image string // e.g. myregistry.azurecr.io/orders-worker:2026.05.07-rev41
param managedIdentityId string
param serviceBusFqdn string // platformbus.servicebus.windows.net
param serviceBusTopic string = 'orders'
param serviceBusSubscription string = 'orders-worker'

resource ordersWorker 'Microsoft.App/containerApps@2024-03-01' = {
  name: 'ca-orders-worker'
  location: resourceGroup().location
  identity: {
    type: 'UserAssigned'
    userAssignedIdentities: {
      '${managedIdentityId}': {}
    }
  }
  properties: {
    environmentId: environmentId
    workloadProfileName: 'Consumption'
    configuration: {
      activeRevisionsMode: 'Multiple'
      maxInactiveRevisions: 3
      registries: [
        {
          server: 'myregistry.azurecr.io'
          identity: managedIdentityId
        }
      ]
      dapr: {
        enabled: true
        appId: 'orders-worker'
        appProtocol: 'http'
        appPort: 8080
        logLevel: 'info'
      }
      ingress: null // pure background worker, no HTTP ingress
    }
    template: {
      revisionSuffix: 'rev41'
      containers: [
        {
          name: 'app'
          image: image
          resources: {
            cpu: json('1.0')
            memory: '2.0Gi'
          }
          env: [
            { name: 'SERVICE_BUS_FQDN', value: serviceBusFqdn }
            { name: 'SERVICE_BUS_TOPIC', value: serviceBusTopic }
            { name: 'SERVICE_BUS_SUBSCRIPTION', value: serviceBusSubscription }
            { name: 'DAPR_HTTP_PORT', value: '3500' }
          ]
        }
      ]
      scale: {
        minReplicas: 0
        maxReplicas: 30
        rules: [
          {
            name: 'servicebus-orders'
            custom: {
              type: 'azure-servicebus'
              identity: managedIdentityId
              metadata: {
                topicName: serviceBusTopic
                subscriptionName: serviceBusSubscription
                namespace: split(serviceBusFqdn, '.')[0]
                messageCount: '15'
              }
            }
          }
        ]
      }
    }
  }
}
The two pieces of that Bicep that took the longest to get right were the KEDA scaler block and the registries[].identity. The scaler uses the worker's own user-assigned managed identity to call Service Bus's management API for queue length, which removes the last connection string from the system; it does require that the same identity has Azure Service Bus Data Receiver and Azure Service Bus Data Owner (the data-owner is for the management read, the data-receiver is for the runtime consume) at the namespace scope. KEDA's azure-servicebus scaler doc on Microsoft Learn is clearer on the auth shape than the upstream KEDA doc, partly because Container Apps surfaces the identity wiring at the Bicep level rather than through Kubernetes annotations.
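The role assignments themselves are two az role assignment calls at namespace scope. A sketch of how we grant them; the identity name id-orders-worker is illustrative, the namespace comes from the serviceBusFqdn parameter above:

# Principal ID of the worker's user-assigned managed identity (name is illustrative)
PRINCIPAL_ID=$(az identity show \
  --name id-orders-worker \
  --resource-group rg-platform-prod-eus2 \
  --query principalId -o tsv)

# Resource ID of the Service Bus namespace, the scope for both roles
NAMESPACE_ID=$(az servicebus namespace show \
  --name platformbus \
  --resource-group rg-platform-prod-eus2 \
  --query id -o tsv)

# Runtime consume for the app container
az role assignment create \
  --assignee-object-id "$PRINCIPAL_ID" \
  --assignee-principal-type ServicePrincipal \
  --role "Azure Service Bus Data Receiver" \
  --scope "$NAMESPACE_ID"

# Management read (message counts) for the KEDA scaler
az role assignment create \
  --assignee-object-id "$PRINCIPAL_ID" \
  --assignee-principal-type ServicePrincipal \
  --role "Azure Service Bus Data Owner" \
  --scope "$NAMESPACE_ID"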
The two Dapr components are deployed against the environment, not the individual app, and selected per-app by scopes. The state store:
resource daprStateOrders 'Microsoft.App/managedEnvironments/daprComponents@2024-03-01' = {
  parent: caEnv
  name: 'state-orders'
  properties: {
    componentType: 'state.azure.cosmosdb'
    version: 'v1'
    metadata: [
      { name: 'url', value: 'https://cosmos-orders-eus2.documents.azure.com:443/' }
      { name: 'database', value: 'orders' }
      { name: 'collection', value: 'workerstate' }
      { name: 'azureClientId', value: workerManagedIdentityClientId }
    ]
    scopes: [ 'orders-worker' ]
  }
}
That azureClientId resolution path, where Dapr's Cosmos state component talks to Cosmos with the worker's managed identity rather than a key, was the line item that deleted the longest stretch of boilerplate from the old AKS manifests. We tracked it: about 300 lines of secret-mount and CSI driver YAML per app, gone. The Dapr Cosmos state-store pattern is set out under Dapr on Container Apps on Microsoft Learn.
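What the worker code calls instead of all that YAML is the Dapr sidecar's state API on localhost. A sketch of the two calls against the state-orders component above; the key and payload are made up for illustration:

# Save state: Dapr writes to Cosmos using the worker's managed identity
curl -s -X POST http://localhost:3500/v1.0/state/state-orders \
  -H 'Content-Type: application/json' \
  -d '[{ "key": "order-12345", "value": { "status": "picked", "attempts": 1 } }]'

# Read it back: a hit returns the JSON value, a miss returns 204 with an empty body
curl -s http://localhost:3500/v1.0/state/state-orders/order-12345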
The egress lockdown, and exactly what we got wrong
The platform team's rule is one line. Apps in spokes do not get public egress. Everything outbound goes through Azure Firewall in the hub, with rules at the network and application layer that name what each subnet may talk to. The Container Apps infrastructure subnet is treated like any other.
The team had drafted the Firewall rule set the day before the incident. The intent was right: allow everything Container Apps needs to operate, and nothing else. The reference page for Container Apps required egress on Microsoft Learn lists what the platform needs to phone home for: control plane traffic, image pulls, log shipping, telemetry, and the various Entra exchanges that keep managed identity alive.
We added the Service Tags it asks for. These are the network rules:
// Inside the Azure Firewall policy rule collection group
{
  name: 'allow-containerapps-controlplane'
  priority: 200
  ruleCollectionType: 'FirewallPolicyFilterRuleCollection'
  action: { type: 'Allow' }
  rules: [
    {
      ruleType: 'NetworkRule'
      name: 'azure-aad'
      sourceAddresses: [ containerAppsSubnetCidr ]
      destinationAddresses: [ 'AzureActiveDirectory' ]
      destinationPorts: [ '443' ]
      ipProtocols: [ 'TCP' ]
    }
    {
      ruleType: 'NetworkRule'
      name: 'azure-monitor'
      sourceAddresses: [ containerAppsSubnetCidr ]
      destinationAddresses: [ 'AzureMonitor' ]
      destinationPorts: [ '443' ]
      ipProtocols: [ 'TCP' ]
    }
    {
      ruleType: 'NetworkRule'
      name: 'azure-keyvault'
      sourceAddresses: [ containerAppsSubnetCidr ]
      destinationAddresses: [ 'AzureKeyVault' ]
      destinationPorts: [ '443' ]
      ipProtocols: [ 'TCP' ]
    }
    {
      ruleType: 'NetworkRule'
      name: 'azure-container-registry'
      sourceAddresses: [ containerAppsSubnetCidr ]
      destinationAddresses: [ 'AzureContainerRegistry' ]
      destinationPorts: [ '443' ]
      ipProtocols: [ 'TCP' ]
    }
  ]
}
The reading we did was: ACR is the registry our images come from, so AzureContainerRegistry covers image pulls. AAD covers Entra exchanges. Monitor covers log shipping. Key Vault covers Dapr secret components. Service Bus is reached through a private endpoint, so the runtime consume stays inside the VNet without a Firewall hop and does not need a rule. By every reading of the page above, this should have worked. It did, for about thirty hours. Then at 15:48 the image layers in the platform's edge cache rolled over and the platform tried to pull fresh.
The error in the activity log named the host actually being dialled: mcr.microsoft.com. That host is not under AzureContainerRegistry. The Service Tag AzureContainerRegistry covers customer-owned ACR registries; the platform's internal control-plane registry, the one that ships the per-pod sidecar shim that gives a Container App its Dapr and KEDA hooks, is covered by MicrosoftContainerRegistry. They are documented as two separate tags on the Service Tags page on Microsoft Learn. We had read that page once, six months earlier, in the context of an entirely different workload. The distinction had not registered.
The other thing the Service Tag approach missed: the actual data plane of MCR is served from *.data.mcr.microsoft.com, which is not enumerable as a static IP set the way control-plane MCR is. The Service Tag MicrosoftContainerRegistry lets you reach the index and the manifest endpoints, but the blob fetch goes to a CDN endpoint that needs an FQDN rule rather than an IP-list rule. So even with the right Service Tag added, image-blob pulls still fail until the application rule lands.
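You can see the gap for yourself from the Service Tag discovery API. A sketch; the point is that the prefixes returned for MicrosoftContainerRegistry cover the registry endpoints, and nothing in the output corresponds to the *.data.mcr.microsoft.com hosts:

# Prefixes behind the tag we were missing, for the region the environment runs in
az network list-service-tags --location eastus2 \
  --query "values[?name=='MicrosoftContainerRegistry'] | [0].properties.addressPrefixes" \
  -o tsv | head

# Compare with the tag we had originally allowed
az network list-service-tags --location eastus2 \
  --query "values[?name=='AzureContainerRegistry'] | [0].properties.addressPrefixes" \
  -o tsv | head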
The fix went out at 16:23, ten minutes after the on-call engineer paged me. Two changes. First, add the missing Service Tag to the network rules:
{
  ruleType: 'NetworkRule'
  name: 'microsoft-container-registry'
  sourceAddresses: [ containerAppsSubnetCidr ]
  destinationAddresses: [ 'MicrosoftContainerRegistry' ]
  destinationPorts: [ '443' ]
  ipProtocols: [ 'TCP' ]
}
Second, add the FQDN-shaped application rule for the blob endpoints:
{
  name: 'allow-containerapps-fqdn'
  priority: 210
  ruleCollectionType: 'FirewallPolicyFilterRuleCollection'
  action: { type: 'Allow' }
  rules: [
    {
      ruleType: 'ApplicationRule'
      name: 'mcr-blobs'
      sourceAddresses: [ containerAppsSubnetCidr ]
      protocols: [
        { protocolType: 'Https', port: 443 }
      ]
      targetFqdns: [
        'mcr.microsoft.com'
        '*.data.mcr.microsoft.com'
        '*.azurecr.io'
        '*.blob.core.windows.net'
      ]
    }
  ]
}
The *.azurecr.io line is for our own ACR pulls, which were going through Firewall in some agent paths despite the private endpoint, because the deploy agents on the hub VNet sit on the public side of that endpoint. The *.blob.core.windows.net line is for ACR's underlying blob store; we discovered it at the same time we discovered *.data.mcr.microsoft.com because the same diagnostic packet capture showed both.
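Since then we spot-check egress from inside a running replica rather than inferring it from Firewall logs. A rough sketch, assuming the worker image ships curl; what is available inside the container will vary with your base image:

# Open a one-off shell in a running replica of the orders worker
az containerapp exec \
  --name ca-orders-worker \
  --resource-group rg-platform-prod-eus2 \
  --command sh

# Inside the replica: a 200 from MCR and a 401 from the (unauthenticated) ACR endpoint
# both prove the TLS path is open; a timeout means the Firewall is eating the traffic
curl -sS -o /dev/null -w '%{http_code}\n' https://mcr.microsoft.com/v2/
curl -sS -o /dev/null -w '%{http_code}\n' https://myregistry.azurecr.io/v2/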
Firewall takes about ninety seconds to reload rules across the cluster. At 16:25 the first replica came back. By 16:31 all eleven apps had pulled and were running again. The Service Bus messages that had piled up during the outage drained in nineteen minutes because KEDA spun replicas straight to maxReplicas on every app. Net business impact: forty-three minutes of stale orders, no permanent data loss, two customer support emails that the success team handled the next morning.
The deploy pipeline, after the fix
The Azure Pipelines deploy now has one extra responsibility, which is to verify the Firewall rules are intact before pushing a new image. The verify step is small. The build, push, and revision-update flow is the meat:
trigger:
  branches:
    include: [main]
  paths:
    include:
      - workers/orders/**
      - infra/containerapps/orders-worker.bicep

variables:
  serviceConnection: 'sc-platform-prod-eus2'
  acrName: 'myregistry'
  resourceGroup: 'rg-platform-prod-eus2'
  containerAppName: 'ca-orders-worker'
  revisionSuffix: $(Build.BuildNumber)

stages:
  - stage: VerifyEgress
    displayName: 'Verify Firewall rule baseline'
    jobs:
      - job: VerifyRules
        steps:
          - task: AzureCLI@2
            inputs:
              azureSubscription: $(serviceConnection)
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                set -euo pipefail
                EXPECTED='AzureActiveDirectory AzureMonitor AzureKeyVault AzureContainerRegistry MicrosoftContainerRegistry'
                FOUND=$(az network firewall policy rule-collection-group collection list \
                  --policy-name afwp-platform-prod \
                  --resource-group rg-platform-prod-eus2 \
                  --rule-collection-group-name rcg-containerapps \
                  --query "[].rules[].destinationAddresses[]" -o tsv | sort -u | tr '\n' ' ')
                for tag in $EXPECTED; do
                  echo " $FOUND " | grep -q " $tag " || { echo "missing tag: $tag"; exit 1; }
                done
                echo "egress baseline ok"

  - stage: BuildAndPush
    dependsOn: VerifyEgress
    displayName: 'Build worker image'
    jobs:
      - job: Docker
        pool:
          vmImage: ubuntu-latest
        steps:
          - checkout: self
            fetchDepth: 1
          - task: AzureCLI@2
            displayName: 'docker build and push to ACR'
            inputs:
              azureSubscription: $(serviceConnection)
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                set -euo pipefail
                IMAGE_TAG="$(date -u +%Y.%m.%d)-rev$(Build.BuildId)"
                az acr login --name $(acrName)
                docker build \
                  -t $(acrName).azurecr.io/orders-worker:${IMAGE_TAG} \
                  -f workers/orders/Dockerfile \
                  workers/orders
                docker push $(acrName).azurecr.io/orders-worker:${IMAGE_TAG}
                echo "##vso[task.setvariable variable=imageTag;isOutput=true]${IMAGE_TAG}"
            name: build

  - stage: DeployRevision
    dependsOn: BuildAndPush
    displayName: 'New revision, 10% traffic, then 100%'
    variables:
      imageTag: $[ stageDependencies.BuildAndPush.Docker.outputs['build.imageTag'] ]
    jobs:
      - deployment: ApplyRevision
        environment: prod
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureCLI@2
                  displayName: 'az containerapp update'
                  inputs:
                    azureSubscription: $(serviceConnection)
                    scriptType: bash
                    scriptLocation: inlineScript
                    inlineScript: |
                      set -euo pipefail
                      az containerapp update \
                        --name $(containerAppName) \
                        --resource-group $(resourceGroup) \
                        --image $(acrName).azurecr.io/orders-worker:$(imageTag) \
                        --revision-suffix rev$(Build.BuildId)
                      NEW_REV=$(az containerapp revision list \
                        --name $(containerAppName) \
                        --resource-group $(resourceGroup) \
                        --query "[?properties.active && contains(name, 'rev$(Build.BuildId)')].name | [0]" \
                        -o tsv)
                      OLD_REV=$(az containerapp revision list \
                        --name $(containerAppName) \
                        --resource-group $(resourceGroup) \
                        --query "[?properties.active && !contains(name, 'rev$(Build.BuildId)')].name | [0]" \
                        -o tsv)
                      az containerapp ingress traffic set \
                        --name $(containerAppName) \
                        --resource-group $(resourceGroup) \
                        --revision-weight ${NEW_REV}=10 ${OLD_REV}=90
                      # Soak for three minutes, then shift fully if healthy
                      sleep 180
                      UNHEALTHY=$(az containerapp revision show \
                        --name $(containerAppName) \
                        --resource-group $(resourceGroup) \
                        --revision ${NEW_REV} \
                        --query "properties.healthState" -o tsv)
                      if [ "$UNHEALTHY" != "Healthy" ]; then
                        echo "new revision unhealthy after soak: $UNHEALTHY"
                        az containerapp ingress traffic set \
                          --name $(containerAppName) \
                          --resource-group $(resourceGroup) \
                          --revision-weight ${OLD_REV}=100
                        exit 1
                      fi
                      az containerapp ingress traffic set \
                        --name $(containerAppName) \
                        --resource-group $(resourceGroup) \
                        --revision-weight ${NEW_REV}=100
                      az containerapp revision deactivate \
                        --name $(containerAppName) \
                        --resource-group $(resourceGroup) \
                        --revision ${OLD_REV}
The background workers do not have ingress, so the ingress traffic set lines are inert for them; we use activeRevisionsMode: Multiple purely to keep a rollback target alive for three minutes after each deploy. For the few apps in the environment that do have HTTP ingress (an admin console and a webhook receiver), the same traffic block does real percentage-based shifting.
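For the workers the manual fallback, if a soak goes bad, is revision activation rather than traffic weights. A sketch; the revision names follow the <app>--<suffix> convention and are illustrative:

# Bring the previous revision back and stop the bad one.
# KEDA scales any active revision, so deactivate the new one promptly.
az containerapp revision activate \
  --name ca-orders-worker \
  --resource-group rg-platform-prod-eus2 \
  --revision ca-orders-worker--rev40

az containerapp revision deactivate \
  --name ca-orders-worker \
  --resource-group rg-platform-prod-eus2 \
  --revision ca-orders-worker--rev41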
The VerifyEgress stage is a small safeguard but it has caught two configuration drifts since we added it. Once because someone restructured the Firewall policy group names. Once because a teammate copied our MicrosoftContainerRegistry rule into a different policy and accidentally removed the original. Both times the deploy failed at the verify stage rather than at the image pull twelve minutes later.
Troubleshooting catalogue
These are the errors we hit at one point or another and what they actually mean.
Failed to pull image: GET https://mcr.microsoft.com/v2/azure-container-apps/k8se/manifests/... 403 Forbidden. The platform shim image lives on MCR, not on your ACR. You need the MicrosoftContainerRegistry Service Tag on the Firewall, and you need the mcr.microsoft.com and *.data.mcr.microsoft.com FQDNs in an application rule. Service Tag alone is necessary but not sufficient.
Failed to pull image: dial tcp i/o timeout (against your own ACR). You have the AzureContainerRegistry Service Tag but the pull is going through the public ACR endpoint because the agent or the data plane is on the wrong side of your private endpoint. Check that the Container Apps subnet sees the ACR private endpoint via the Private DNS Zone link, and that the same Private DNS Zone is attached to whichever VNet your deploy agent runs in.
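The quickest check on the DNS side is to look at the privatelink zone and its VNet links directly. A sketch; where the zone's resource group actually sits will differ per hub design:

# The A record for the registry should resolve to the private endpoint's IP
az network private-dns record-set a list \
  --zone-name privatelink.azurecr.io \
  --resource-group rg-platform-prod-eus2 \
  -o table

# Both the Container Apps VNet and the VNet the deploy agents run in need a link
az network private-dns link vnet list \
  --zone-name privatelink.azurecr.io \
  --resource-group rg-platform-prod-eus2 \
  --query "[].{name:name, vnet:virtualNetwork.id, autoRegistration:registrationEnabled}" \
  -o table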
KEDA scaler returned error: Get "https://platformbus.servicebus.windows.net/$Resources/Topics/orders/Subscriptions/orders-worker/messageCountDetails": dial tcp i/o timeout. The KEDA scaler calls Service Bus's management endpoint, which goes over the same FQDN as the data plane. Either your private endpoint is missing the management sub-resource, or your KEDA identity does not have Azure Service Bus Data Owner at namespace scope. The first is a Private Link configuration; the second is a role assignment.
KEDA scaler returned error: Code=AuthorizationFailed. The KEDA identity lacks the data-owner role. KEDA needs to read message counts, which is a management API call, which requires Data Owner. The data-receiver role on its own lets the app consume but does not let KEDA scale on queue length. Both roles, on the namespace, on the same managed identity.
RevisionUnhealthy: replicas not ready. The container started but failed readiness or the platform shim could not attach. Run az containerapp logs show --container app to get the app's stdout, and az containerapp logs show --container daprd to get the Dapr sidecar's. In our case this was almost always the Dapr Cosmos component failing to acquire a token because the worker's managed identity had not yet been propagated to Cosmos's RBAC. The fix is to issue the Cosmos role assignment before deploying the Container App, not after, because Container Apps spins up the sidecar before the first request lands.
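The full invocations, for anyone who has not typed them before; both containers stream through the same command, selected by --container:

# Application container stdout and stderr
az containerapp logs show \
  --name ca-orders-worker \
  --resource-group rg-platform-prod-eus2 \
  --container app \
  --tail 100 --follow

# The Dapr sidecar's own log stream, where component init failures show up
az containerapp logs show \
  --name ca-orders-worker \
  --resource-group rg-platform-prod-eus2 \
  --container daprd \
  --tail 100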
Dapr state store rejected request: code=Unauthorized. The Cosmos data plane RBAC is per-database, per-collection, and the role assignment uses the Cosmos identity-specific role definition id, not the generic Azure RBAC roles. We had a script that did this wrong for three days because the Azure RBAC Cosmos DB Built-in Data Reader role looks correct in the portal but does not actually authorise data-plane reads; you need the SQL-API-specific data-plane role. The Cosmos data-plane roles page on Microsoft Learn is the canonical reference.
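The assignment that works uses the Cosmos SQL data-plane command, not az role assignment. A sketch against the account and container from the state component above; 00000000-0000-0000-0000-000000000002 is the built-in Data Contributor definition (…0001 is Data Reader), and PRINCIPAL_ID is the worker identity's principal ID as in the Service Bus assignment earlier:

# Cosmos SQL data-plane role assignment, scoped to the one container the worker touches
az cosmosdb sql role assignment create \
  --account-name cosmos-orders-eus2 \
  --resource-group rg-platform-prod-eus2 \
  --role-definition-id 00000000-0000-0000-0000-000000000002 \
  --principal-id "$PRINCIPAL_ID" \
  --scope "/dbs/orders/colls/workerstate"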
Container App revision creation failed: workload profile 'Consumption' does not support more than 4 vCPUs per replica. We hit this on the reconciliation worker, which needed 8 vCPU during the nightly run. Move that one app to the D8 workload profile. The other ten stayed on Consumption. The Bicep change is one line in the app: workloadProfileName: 'D8'. The cost difference matters: D8 charges for reserved capacity whether the app is running or not, so we only move apps onto it when their Consumption-mode bill exceeds the D8 baseline.
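If you would rather not wait for the next Bicep deploy, newer az CLI versions can move an app between profiles directly; check az containerapp update --help if yours does not list --workload-profile-name. The app name here is illustrative:

# Move the reconciliation worker onto the dedicated profile and raise the per-replica size
az containerapp update \
  --name ca-reconciliation-worker \
  --resource-group rg-platform-prod-eus2 \
  --workload-profile-name D8 \
  --cpu 8 --memory 16Gi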
Cost, before and after
The AKS workers cost us about £3,200 a month at the cluster level, allocated across the eleven apps roughly by CPU share. That number was stable because the nodes were always on, regardless of whether the apps were doing anything. Friday afternoons were busy. Saturday mornings were almost idle. The bill did not care.
Container Apps in workload-profiles mode bills by replica-second when on Consumption, with a per-replica minimum if minReplicas > 0. We set minReplicas: 0 everywhere we could tolerate cold start, which was nine of the eleven apps. The reconciliation worker on D8 has a reserved component, which the cost calculator shows as roughly £180 a month idle plus burst on top. The other ten apps, all Consumption, all scale-to-zero, came to £760 for the first full month after the cutover. Total for the environment: £940 a month. Same throughput. About 71% off the AKS bill, which paid back the migration in under two months even counting the platform-engineer days the migration consumed.
The thing the spreadsheet does not capture is the on-call delta. The AKS workers had three categories of recurring page: nodes hitting memory pressure (about one every two weeks), KEDA's external ScaledJob controller losing its lease and stopping scaling (about one a month), and the smaller workers occasionally failing to acquire a Service Bus lease because the Envoy sidecar had a known TLS quirk under load. All three categories went to zero on Container Apps. The KEDA scaler is integrated, not a separate controller. There is no sidecar for outbound TLS because the platform handles it. There are no nodes to run out of memory. The page volume from this workload, post-migration, is the one we have just discussed plus exactly one other, a Cosmos throttle on the reconciliation worker that was not a Container Apps problem.
What the migration actually changed in the team
The platform team's weekly worker-YAML toil hit zero. That was the headline saving, and it is real, but it understates the change. With AKS, every new worker was a clone-and-modify of an existing 340-line Helm chart, which made every worker incrementally different from the last in small ways that compounded. With Container Apps, every new worker is one Bicep file, one Dockerfile, and one set of Dapr component scopes added to the environment. The cookie-cutter is smaller, so the drift is smaller. We have added four new workers since the migration; each took about thirty minutes from scaffolding to first deploy. Under the AKS model the same work took most of a day.
Dapr, specifically, deleted the part of the codebase that I had given up trying to keep clean: the state-cache code, the secret-resolver wrappers, the Cosmos retry loops. The Go services now talk to http://localhost:3500/v1.0/state/state-orders/{key} and let Dapr handle the retries, the auth, and the per-request telemetry. The lift on day one was tedious; the lift on day fifteen, when the third worker migrated, took two hours instead of two days because the pattern was clear. Per the Dapr building blocks page on Microsoft Learn, the same pattern carries to pub/sub, bindings, and the workflow building block, none of which we have yet adopted but which are sitting there waiting to delete more code.
Built-in KEDA is the small thing that mattered out of proportion. On AKS we ran KEDA via Helm with the version pinned to whatever had been stable when the cluster was provisioned. The first few times Microsoft updated the KEDA version inside Container Apps, we did not notice; the scaler block in our Bicep stayed identical, the behaviour stayed identical, and the upgrade work happened entirely on Microsoft's side. That is the property you pay a managed platform for, and it is worth more than the bill.
The egress incident is the part of this I think about every time we adopt a new managed platform. The platform team did the responsible thing: locked egress down, read the page, listed the Service Tags. The page is accurate. The platform requires what it says it requires. What we missed was the texture: that "image pull" is one phrase covering two registries under two different Service Tags, and that the blob layer sits on a CDN whose addresses you must name by FQDN because Service Tags do not enumerate them. Production caught the gap rather than the proof of concept, because the gap only manifests at the moment the platform's edge cache rolls over. The mitigation now is that every managed platform we onboard gets a section in the Firewall policy named for it, an explicit egress test stage in its deploy pipeline that confirms the named rules are intact, and a quarterly review against the source pages on Microsoft Learn. The review is twenty minutes per platform and it has already caught two more drifts in two different environments. The cost of doing the review is small. The cost of not doing it is the next 15:48 page, the one nobody saw coming, which is exactly what this kind of investment exists to remove.