
The night our AKS cluster ran out of pod IPs at 21:14: a kubenet postmortem

The page hit at 21:14 on a Thursday: seven pods stuck in ContainerCreating with 'failed to allocate for range 0: no IP addresses available', the HPA scaling out under a 38% marketing-driven traffic spike, and a kubenet per-node /24 that turned out to be a hard ceiling we had never had cause to test. The full diagnostic timeline and the six-week migration to Azure CNI overlay on 100.64.0.0/10.

16 min read · 412 views · AKS · Azure CNI · Postmortem · Networking

The page hit my phone at 21:14 on a Thursday in late February. PagerDuty subject line: "AKS prod: 7 pods in ContainerCreating > 5 minutes." The on-call dashboard, when I got to it ninety seconds later from the couch, showed something I had not seen in two and a half years of running this cluster: the Horizontal Pod Autoscaler on the checkout service was scaling out from 12 to 22 replicas because traffic was up 38% on the back of a marketing campaign that nobody had warned the platform team about, and roughly half of the new replicas were stuck pending. The error on every one of them was the same string, repeated in three different kubectl describe outputs:

failed to allocate for range 0: no IP addresses available in range set: 10.244.0.0/24

The cluster had run out of pod IPs. Not VNet IPs, which was the failure mode I had been quietly worrying about for a year. Pod IPs, on the kubenet per-node /24. The two are not the same thing and the difference is exactly the thing that made the next six weeks of my work necessary. This is the postmortem and the migration to Azure CNI overlay that followed.

21:14 to 21:44: the page, the diagnosis, the workaround

The first thing I did, before opening anything else, was confirm scope. Three terminal tabs.

$ kubectl get pods --field-selector=status.phase=Pending -A
NAMESPACE   NAME                                  READY   STATUS              RESTARTS   AGE
prod        checkout-api-7df9b8c6dd-2nfgh         0/1     ContainerCreating   0          6m12s
prod        checkout-api-7df9b8c6dd-4xkpl         0/1     ContainerCreating   0          5m48s
prod        checkout-api-7df9b8c6dd-8z2vr         0/1     ContainerCreating   0          5m41s
prod        checkout-api-7df9b8c6dd-d6wlq         0/1     ContainerCreating   0          5m22s
prod        checkout-api-7df9b8c6dd-rmt9p         0/1     ContainerCreating   0          4m58s
prod        checkout-api-7df9b8c6dd-tk4ws         0/1     ContainerCreating   0          4m31s
prod        checkout-api-7df9b8c6dd-zb7nh         0/1     ContainerCreating   0          4m02s

Seven pods, all the same Deployment, all in the same five-minute window. The HPA had pushed the replica count from 12 to 22 over the previous ten minutes. The checkout service is what bookings flow through, so the p99 latency dashboard already looked like the side of a cliff: 230ms baseline, climbing to 19s while the new pods sat pending and the existing pods absorbed traffic they were never sized for. The error in describe on each pending pod was the kubenet IPAM failure, not an image pull problem, not a scheduling problem, not a CNI plugin crash. The IP pool for one specific /24 was empty.

$ kubectl describe pod checkout-api-7df9b8c6dd-2nfgh -n prod
...
Events:
  Type     Reason                  Age   From               Message
  ----     ------                  ----  ----               -------
  Normal   Scheduled               6m    default-scheduler  Successfully assigned prod/checkout-api-7df9b8c6dd-2nfgh to aks-default-12345-vmss00000
  Warning  FailedCreatePodSandBox  6m    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "abc123...": plugin type="kubenet" failed (add): failed to allocate for range 0: no IP addresses available in range set: 10.244.0.0/24
  Warning  FailedCreatePodSandBox  5m    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "def456...": plugin type="kubenet" failed (add): failed to allocate for range 0: no IP addresses available in range set: 10.244.0.0/24
  Warning  FailedCreatePodSandBox  4m    kubelet            (combined from similar events): Failed to setup network: cni cmdAdd failed: timeout

At 21:18 I ran kubectl describe on the node the seven pods had been scheduled to. The relevant block jumped out:

$ kubectl describe node aks-default-12345-vmss00000
...
Capacity:
  cpu:                4
  ephemeral-storage:  129886128Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16389696Ki
  pods:               110
Allocatable:
  cpu:                3860m
  ephemeral-storage:  119703055367
  memory:             14377024Ki
  pods:               110
...
Non-terminated Pods:          (110 in total)

The node was advertising a pod capacity of 110, the AKS default, and was sitting at exactly 110 non-terminated pods. Three of the cluster's nine nodes were in the same state. The scheduler had packed the new replicas onto whichever nodes still showed pod-count headroom, and on those nodes the assigned podCIDR had no free addresses left to hand out. The pod CIDR per node under kubenet is a /24: 256 addresses, minus the network, broadcast, and bridge addresses, roughly 250 usable. That sounds like comfortable headroom over 110 pods, but the host-local IPAM database does not shrink as cleanly as the pod list does: completed Jobs that have not yet been garbage collected, sandboxes recreated without a clean CNI delete, and the plugin's own bookkeeping all leave allocations behind. We had landed on a configuration that, on paper, said 110 pods per node, and in reality bumped into the IP ceiling before the pod count ceiling on three nodes simultaneously.
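
A quick way to see both halves of that picture, assuming nothing beyond stock kubenet: each node's assigned slice is in .spec.podCIDR, and the per-node pod count falls out of a field selector.

$ kubectl get nodes -o custom-columns='NAME:.metadata.name,PODCIDR:.spec.podCIDR'
NAME                          PODCIDR
aks-default-12345-vmss00000   10.244.0.0/24
aks-default-12345-vmss00001   10.244.1.0/24
...

$ kubectl get pods -A -o wide --no-headers \
    --field-selector spec.nodeName=aks-default-12345-vmss00000 | wc -l
110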

At 21:22 I read back the original cluster creation line out of our infra repo. It was the kubenet line from 2022, by then frozen in a Bicep file but originally laid down via:

$ az aks create \
    --resource-group rg-prod-eus2 \
    --name aks-prod-eus2 \
    --node-count 3 \
    --node-vm-size Standard_D4ds_v5 \
    --network-plugin kubenet \
    --pod-cidr 10.244.0.0/16 \
    --service-cidr 10.0.0.0/16 \
    --dns-service-ip 10.0.0.10 \
    --vnet-subnet-id /subscriptions/.../subnets/snet-aks-nodes

Pod CIDR of 10.244.0.0/16 across the cluster, sliced into /24 ranges per node. A /16 holds 256 /24 slices, so running out of slices was never the realistic risk; the slice itself was. Each node's /24 is fixed at assignment and never grows, which meant every node in the cluster carried a hard per-node ceiling we had never had cause to test. The HPA event on the marketing-driven traffic spike was the test, and we failed it.

The workaround at 21:31 was crude but effective: scale the node pool. More nodes means more /24 slices means more pod IPs in aggregate, even though no single node's /24 ever grows. I drained two nodes that had nothing critical on them (image-pull mirror pods, a couple of CronJob completions), bumped the VMSS instance count from 9 to 13, and waited for the new nodes to register and the HPA's pending pods to find homes on the fresh /24 ranges. Page cleared at 21:44. Thirty minutes start to finish. The p99 latency curve came back inside the SLO at 21:47.
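
If you are reproducing the workaround, the supported way to bump the count is through the AKS CLI rather than the VMSS itself (AKS does not expect you to modify the scale set directly); a sketch, with our node pool name assumed to be default:

$ az aks nodepool scale \
    --resource-group rg-prod-eus2 \
    --cluster-name aks-prod-eus2 \
    --name default \
    --node-count 13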

I went back to bed knowing I had bought time, not solved the problem. The cluster's pod IP ceiling was an architectural property of kubenet, not a configuration knob, and the next traffic spike was a question of when, not whether.

The whiteboard the next morning: why the ceiling exists

Two things had to land on the team before we could plan a fix. The first is the kubenet IPAM model: each node gets a /24 from the cluster-wide pod CIDR, kubenet runs a host-local IPAM database inside that /24, and the per-node /24 is fixed for the lifetime of that node. There is no growing it. There is no shared pool across nodes. If a node is hot and another node is cool, the cool node cannot lend IPs to the hot one. The pod IP space is striped per-node, and that stripe is the cap. AKS documents this model on Microsoft Learn, and the same page now carries a deprecation notice that did not exist when we read it in 2022. Kubenet on AKS is being retired, and that retirement was the other half of the argument for moving sooner rather than later.
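
That host-local database is an ordinary directory on the node, one file per allocated IP, so its occupancy is directly inspectable. A minimal sketch via a node debug pod, assuming the standard kubenet state path of /var/lib/cni/networks/kubenet (kubectl debug mounts the node's filesystem at /host):

$ kubectl debug node/aks-default-12345-vmss00000 -it --image=busybox \
    -- sh -c 'ls /host/var/lib/cni/networks/kubenet | wc -l'

When that count approaches the size of the /24, the node is out of pod IPs no matter what the pod count says.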

The second thing is why we had chosen kubenet in the first place. In 2022 the choice was kubenet versus Azure CNI (the in-VNet flavour, where every pod takes a real VNet IP). The kubenet docs at the time pushed it for clusters where VNet IP scarcity was the bigger concern, and our VNet was a tight one carved out of a corporate hub-and-spoke; we had roughly 4,000 usable IPs in the AKS spoke, against an expected pod count of around 3,000 once everything migrated in. Azure CNI in-VNet would have spent those IPs faster than we could allocate them. Kubenet kept pod IPs out of the VNet entirely. Right answer for the constraint as we understood it then; wrong constraint to optimise for, as the per-node /24 ceiling eventually showed.

The thing that did not exist in 2022 was Azure CNI overlay, which became generally available in 2023. Overlay puts pods on a separate, much larger CIDR that does not consume VNet IPs, while still giving you the Azure CNI control plane: first-class VNet NICs for the nodes, NetworkPolicy via Calico or the Azure plugin, and none of the per-node route-table entries that kubenet needed to make pod traffic routable. On paper, it kept kubenet's headline benefit (no VNet IP consumption per pod) without the per-node IP stripe being the thing that fills up first. The only practical decision left was the pod CIDR to put it on.
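
The plugin and mode are visible in any cluster's network profile, which is how we later sanity-checked what each cluster was actually running:

$ az aks show \
    --resource-group rg-prod-eus2 \
    --name aks-prod-eus2 \
    --query '{plugin: networkProfile.networkPlugin, mode: networkProfile.networkPluginMode, podCidr: networkProfile.podCidr}'

On the kubenet cluster that returns the plugin and no mode; on an overlay cluster it returns azure and overlay.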

I spent Friday morning drawing the same diagram on the whiteboard three times for three different audiences, and by the afternoon the migration plan was approved.

The migration plan, six weeks, no feature freeze

The plan was blue-green at the cluster level. We did not have the risk appetite for an in-place network plugin swap on the cluster that carries checkout; AKS does support an in-place upgrade from kubenet to CNI overlay, but with constraints significant enough that we did not want them in the path of production traffic. Stand up a new cluster, validate everything in pre-prod, shift traffic via Front Door across hours, leave the old cluster running for a soak window, decommission once the rollback window closed.

Week one was pre-prod. We provisioned a parallel pre-prod AKS cluster with Azure CNI overlay enabled at create time:

$ az aks create \
    --resource-group rg-preprod-eus2 \
    --name aks-preprod-eus2-overlay \
    --node-count 3 \
    --node-vm-size Standard_D4ds_v5 \
    --network-plugin azure \
    --network-plugin-mode overlay \
    --pod-cidr 100.64.0.0/10 \
    --service-cidr 10.0.0.0/16 \
    --dns-service-ip 10.0.0.10 \
    --vnet-subnet-id /subscriptions/.../subnets/snet-aks-nodes \
    --network-policy calico

Two things about that command. The pod CIDR is 100.64.0.0/10, which is the CGNAT range reserved by RFC 6598 for shared address space. I had stared at the choice for half an hour before committing to it. The constraints: it must not overlap the VNet's 10.0.0.0/8 ranges, it must not overlap any on-prem RFC1918 ranges that get advertised over the ExpressRoute, it must be large enough to be future-proof. 100.64.0.0/10 gives roughly four million addresses, does not collide with anything we route on-prem, and is the example range that AKS overlay's own documentation reaches for. The alternative was a private 172.16/12 carve-out that we had not actually claimed anywhere, but the moment another team's network started reaching for that range we would be in conflict, so the CGNAT range was safer.
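
The overlap half of that checklist is cheap to script rather than eyeball. What we ran amounted to listing the VNet's own prefixes and every peered range (the VNet name here is assumed; the ExpressRoute-advertised on-prem ranges came from the network team as a flat file):

$ az network vnet show \
    --resource-group rg-prod-eus2 \
    --name vnet-spoke-aks-eus2 \
    --query 'addressSpace.addressPrefixes' -o tsv

$ az network vnet peering list \
    --resource-group rg-prod-eus2 \
    --vnet-name vnet-spoke-aks-eus2 \
    --query '[].remoteAddressSpace.addressPrefixes' -o tsv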

The other piece is --network-policy calico. Our existing kubenet cluster had Calico installed for NetworkPolicy enforcement, and a non-trivial pile of NetworkPolicy YAML had accumulated over two years. The single highest-stakes part of the validation was confirming those policies still applied unchanged under CNI overlay. The first one we tested was the deny-all baseline on the prod namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: prod
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-checkout-from-ingress
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: checkout-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
          podSelector:
            matchLabels:
              app.kubernetes.io/name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080

The deny-all and the allow-from-ingress are the load-bearing pair for the checkout service. Applied them on the overlay pre-prod cluster, ran a curl from a debug pod in the ingress-nginx namespace (allowed), and from a debug pod in default (blocked). Both behaved correctly on the first try. That was the first reassuring data point: Calico on AKS operates against pod identity (label selectors and namespace selectors) rather than against IP, so the IP-range change underneath did not affect policy evaluation. I had not expected it to, but the cost of being wrong here was a production-traffic outage on cutover day, so I tested everything that had a NetworkPolicy attached.
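
The two probes, roughly as run; the Service name and port are our real ones, the /healthz path stands in for whatever cheap endpoint your service exposes, and the probe pods are throwaway. The first carries the pod label the policy's podSelector matches and runs in the ingress-nginx namespace; the second carries no matching label and should hang until curl's timeout fires:

$ kubectl run probe-allowed -n ingress-nginx --rm -it --restart=Never \
    --image=curlimages/curl \
    --labels='app.kubernetes.io/name=ingress-nginx' \
    --command -- curl -sS -m 5 http://checkout-api.prod.svc.cluster.local:8080/healthz

$ kubectl run probe-denied -n default --rm -it --restart=Never \
    --image=curlimages/curl \
    --command -- curl -sS -m 5 http://checkout-api.prod.svc.cluster.local:8080/healthz

One thing worth saying out loud: the first probe only passes if the ingress-nginx namespace actually carries the name=ingress-nginx label the namespaceSelector expects, so check the namespace labels before trusting a green result.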

Weeks two and three were validating ingress and egress at the cluster boundary. Internally, the pods on overlay had IPs in the 100.64.0.0/10 range, visible via:

$ kubectl get pods -n prod -o wide
NAME                              READY   STATUS    RESTARTS   AGE     IP             NODE
checkout-api-7df9b8c6dd-2nfgh     1/1     Running   0          12m     100.64.1.47    aks-default-67890-vmss00000
checkout-api-7df9b8c6dd-4xkpl     1/1     Running   0          11m     100.64.1.51    aks-default-67890-vmss00000
checkout-api-7df9b8c6dd-8z2vr     1/1     Running   0          11m     100.64.2.13    aks-default-67890-vmss00001
checkout-api-7df9b8c6dd-d6wlq     1/1     Running   0          10m     100.64.2.18    aks-default-67890-vmss00001

versus the old kubenet cluster, where the same pods had been on 10.244.x.y. The internal IP change had no external visibility because every external endpoint we cared about saw the node's egress IP, not the pod IP. The exception was anything that whitelisted the node's public egress IP, which is where the migration's one genuinely scary moment lived.

The CronJob that nearly took the cluster off the network

Thirty minutes before the planned cutover, on the Saturday, I ran the pre-flight egress check. The shape of it was: from a pod in the new cluster, hit a known echo service that returns the source IP it sees, and compare against the firewall allow-list on a couple of private endpoints we connect to from CronJobs. The new cluster's egress IP was different from the old cluster's egress IP. Of course it was. The two clusters were different VMSSes with different node IPs and therefore different outbound SNAT addresses. The check I had not run, because nobody had flagged it on the design review, was which downstream services had whitelisted the old cluster's egress IP.
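
The check itself is a one-liner from inside the cluster; a public echo endpoint (ifconfig.me here) stands in for the internal one we actually hit:

$ kubectl run egress-check --rm -it --restart=Never \
    --image=curlimages/curl \
    --command -- curl -s https://ifconfig.me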

The CronJob that did the nightly billing reconciliation wrote into a SQL Managed Instance via a private endpoint, and the Managed Instance had a firewall rule that allowed the old AKS cluster's outbound public IP. The firewall would have rejected the new cluster's outbound IP. The CronJob runs at 02:00 UTC. The cutover was scheduled for 14:00 local. If we had completed the cutover and not noticed, the next morning's reconciliation would have failed silently and we would have learned about it on Monday from accounting.

I caught it because the pre-flight check returned an IP I did not recognise, looked it up against the firewall rule, and the rule did not include it. Mitigation, sketched in twenty minutes and built in two hours, was to put both clusters behind a NAT Gateway with a static public IP and whitelist that IP everywhere instead. NAT Gateway is, as it turns out, the right answer for AKS egress generally and is what Microsoft now recommends for any cluster that has stable downstream IP allow-list dependencies. The Bicep change was small:

// Standard SKU, static allocation: NAT Gateway requires both, and static is
// the entire point, one egress IP that never changes.
resource publicIp 'Microsoft.Network/publicIPAddresses@2024-01-01' = {
  name: 'pip-aks-egress-eus2'
  location: location
  sku: { name: 'Standard' }
  properties: {
    publicIPAllocationMethod: 'Static'
    publicIPAddressVersion: 'IPv4'
  }
}

// Every outbound flow from any subnet the gateway is attached to SNATs to the IP above.
resource natGateway 'Microsoft.Network/natGateways@2024-01-01' = {
  name: 'nat-aks-egress-eus2'
  location: location
  sku: { name: 'Standard' }
  properties: {
    idleTimeoutInMinutes: 10
    publicIpAddresses: [
      { id: publicIp.id }
    ]
  }
}

// Attaching the gateway is a subnet property. The subnet declaration has to carry
// its full existing configuration; a partial redeploy would strip anything omitted.
resource aksSubnet 'Microsoft.Network/virtualNetworks/subnets@2024-01-01' = {
  parent: vnet
  name: 'snet-aks-nodes'
  properties: {
    addressPrefix: '10.20.4.0/22'
    natGateway: {
      id: natGateway.id
    }
  }
}

Both AKS clusters were already in the same VNet's snet-aks-nodes subnet. Attaching the NAT Gateway to the subnet meant both clusters routed outbound through the same public IP, and that IP went onto every downstream firewall allow-list as a one-time, static change. The CronJob ran from the old cluster on the night before cutover and ran from the new cluster on the night after cutover; same source IP both times; no allow-list churn.
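
The post-change version of the pre-flight check is the same one-liner run against both cluster contexts, and it now has to print the same address twice, the NAT Gateway's static IP (context names assumed):

$ for ctx in aks-prod-eus2 aks-prod-eus2-overlay; do
    kubectl --context "$ctx" run egress-check --rm -it --restart=Never \
      --image=curlimages/curl --command -- curl -s https://ifconfig.me
  done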

The lesson, written into the runbook in red, is that any cluster migration touches outbound IP identity, and outbound IP identity is a load-bearing identity for everything in your environment that whitelists you. Audit it before, not during.

Cutover day: Front Door, four hours, manual approval

The cutover ran through an Azure Pipelines deployment job, gated on a manual approval, with explicit traffic-shifting stages. The shape of the pipeline, scrubbed:

trigger: none

parameters:
  - name: trafficStep
    type: string
    default: '10'
    values: ['10', '25', '50', '100']

variables:
  serviceConnection: 'sc-platform-prod-eus2'
  frontDoorProfile: 'fd-checkout-prod'
  oldBackend: 'aks-prod-eus2-kubenet'
  newBackend: 'aks-prod-eus2-overlay'

stages:
  - stage: Shift
    displayName: 'Shift ${{ parameters.trafficStep }}% of traffic to overlay cluster'
    jobs:
      - deployment: ShiftTraffic
        environment: prod-cluster-migration
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureCLI@2
                  displayName: 'Update Front Door backend weights'
                  inputs:
                    azureSubscription: $(serviceConnection)
                    scriptType: bash
                    scriptLocation: inlineScript
                    inlineScript: |
                      NEW_WEIGHT=${{ parameters.trafficStep }}
                      OLD_WEIGHT=$((100 - NEW_WEIGHT))

                      az afd origin update \
                        --resource-group rg-prod-eus2 \
                        --profile-name $(frontDoorProfile) \
                        --origin-group-name checkout \
                        --origin-name $(oldBackend) \
                        --weight $OLD_WEIGHT

                      az afd origin update \
                        --resource-group rg-prod-eus2 \
                        --profile-name $(frontDoorProfile) \
                        --origin-group-name checkout \
                        --origin-name $(newBackend) \
                        --weight $NEW_WEIGHT

                - task: Bash@3
                  displayName: 'Soak window'
                  inputs:
                    targetType: inline
                    script: |
                      echo "Soaking for 45 minutes at ${{ parameters.trafficStep }}%"
                      sleep 2700

                - task: AzureCLI@2
                  displayName: 'Check overlay cluster error rate'
                  inputs:
                    azureSubscription: $(serviceConnection)
                    scriptType: bash
                    scriptLocation: inlineScript
                    inlineScript: |
                      QUERY='AppRequests | where TimeGenerated > ago(45m) | where Properties.cluster == "overlay" | summarize errorRate = countif(Success == false) * 100.0 / count()'
                      RATE=$(az monitor log-analytics query \
                        --workspace $(LAW_ID) \
                        --analytics-query "$QUERY" \
                        --query '[0].errorRate' -o tsv)
                      echo "Overlay error rate: ${RATE}%"
AWK_CHECK=$(awk -v r="$RATE" 'BEGIN { print ((r > 0.5) ? "fail" : "ok") }')
                      if [ "$AWK_CHECK" = "fail" ]; then
                        echo "Error rate above 0.5% threshold; halting"
                        exit 1
                      fi

The pipeline was run four times that afternoon with trafficStep set to 10, 25, 50, 100 in turn, each gated on a manual approval and each followed by a 45-minute soak with an error-rate check pulled from Log Analytics. The 10% step caught nothing worth caring about. The 25% step caught a single 502 from one ingress-nginx replica that was still terminating long-lived connections from the old cluster; not a real defect, the connection drained on its own. The 50% step was clean. The 100% step ran at 17:48, completed at 18:33, and from that minute the new cluster was carrying all production checkout traffic and the old cluster had no traffic-bearing role except holding the rollback option open.
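
Each of those four runs was queued from the CLI with the step as a parameter; a sketch, assuming the pipeline is named cluster-migration-shift and the az devops defaults (organization, project) are already configured:

$ az pipelines run \
    --name cluster-migration-shift \
    --parameters trafficStep=25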

I left both clusters running for nine days. Nothing went wrong. On day ten I drained the old cluster, scaled the node pool to zero, and on day eleven I deleted the old cluster's resource group entirely.

What changed in the resource shapes

The same workloads are running. The Deployments are byte-for-byte identical YAML. The HPAs, the PDBs, the NetworkPolicies, the Service definitions, the Ingress YAML, all unchanged. The cluster underneath them has a different network model.

Pod IPs come from 100.64.0.0/10 instead of 10.244.0.0/16. Overlay still carves a /24 per node out of the pod CIDR, but it carves it from a pool of roughly four million addresses, and the slice is sized to the platform's own per-node maximum of 250 pods, so the binding constraint on a node is its capacity (CPU, memory) and its max-pods setting, not an IPAM stripe. The pod-per-node default is still 110. We can raise it on a node pool basis if we ever need to; we have not.
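
If we ever do need to raise it, the knob is per node pool at creation time; a sketch of adding a denser pool, with the pool name invented for illustration:

$ az aks nodepool add \
    --resource-group rg-prod-eus2 \
    --cluster-name aks-prod-eus2-overlay \
    --name dense \
    --node-count 3 \
    --max-pods 180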

VNet consumption is unchanged from the kubenet days, which is the headline property of overlay. The AKS spoke's node subnet is still the same /22 of node IPs, none of which are touched by pod scheduling. We could grow to several thousand pods without renumbering the VNet.

The p99 on checkout came back to its 230ms baseline on cutover day and has not moved since. Our service-to-service traces show roughly 0.2ms of additional internal hop latency from overlay's extra routing layer, which is below the noise floor for anything we care about externally.

The Calico NetworkPolicies are all still in force. I re-audited every one of them in week six and the labels-and-selectors model meant the YAML did not need to change. The one validation I would not skip again is running each policy's deny test from a pod with the wrong label, to confirm the deny is actually denying and not just defaulting to "no rules match so allow by accident." I caught zero defects in our policy set, but I caught one mis-labelled pod that had been quietly bypassing a policy for three months in the old cluster, because the cutover gave me the excuse to re-run every test.

Where the work ended up sitting

I had two things on my desk that morning of February the 28th: a pager that had cleared at 21:44 the night before, and a Bicep template that said network-plugin: kubenet and was now a six-week migration in disguise. The migration cost roughly 40 hours of my own time over six weeks, plus the NAT Gateway and the second cluster's running costs, which together came to about 1,200 dollars all-in for the migration window. The cluster will not run out of pod IPs again from per-node /24 exhaustion, because per-node /24 exhaustion is no longer a property of the cluster's network model.

The deeper lesson is the one I keep coming back to in postmortems generally. The configuration I had picked in 2022 was right for the constraints I could see at the time and wrong for the constraints I could not. Kubenet's per-node IP ceiling was visible in the docs from day one; I had read the docs and chosen anyway, because the alternative (in-VNet CNI) was visibly worse against the constraint I was looking at, which was VNet IP scarcity. The third option, which would have made both constraints disappear, did not exist yet. The decision was sound at the time. The decision being sound at the time did not protect us from being paged at 21:14 two and a half years later when the third option had quietly become the right answer and nobody had revisited the choice.

I now have a calendar entry, recurring every six months, that re-reads the AKS networking docs for any new GA feature that could obsolete a choice we made earlier. The entry takes about thirty minutes to action. It is the cheapest insurance policy on the platform team's roster, and the only reason it exists is that one Thursday in February when the HPA scaled and the pods would not come up.