Karpenter vs Cluster Autoscaler vs Node Auto-Provisioning on AKS: a benchmark, a cost comparison, and the bursty workload that broke one



At 09:00 UTC on a Monday in October, a scheduled batch job we had been running quietly for two years pushed 340 pods into Pending state on our production AKS cluster. The cluster was using vanilla Cluster Autoscaler on a VMSS-backed node pool. It took 5 minutes 48 seconds to add enough capacity for the burst. By the time the last pod scheduled, the upstream queue had backed up to roughly 14,000 messages, the SLO dashboards had gone amber, and the on-call engineer had already kicked off a manual node pool expansion that turned out to be unnecessary by the time it finished.

The same workload, replayed against a parallel test cluster running Node Auto-Provisioning, took 1 minute 23 seconds. The third test cluster, running Karpenter on AKS (still in preview at the time of writing), took 1 minute 18 seconds.

This is the full benchmark. Three identical cluster templates, three provisioners, three workload patterns, and the numbers that came out the other side. It is also the story of why we ended up moving production to NAP rather than Karpenter, even though Karpenter was technically the fastest, and why we kept Cluster Autoscaler on one specific legacy cluster instead of migrating it. The cost delta at our scale is about £4,920 a year per cluster. The bigger number is the latency I no longer think about on Monday mornings.

The three options on AKS in 2025

Before the numbers, the lay of the land. AKS in 2025 ships three meaningfully different ways to scale node capacity, and the differences matter.

Cluster Autoscaler is the original. It watches the cluster for unschedulable pods, decides which existing node pool can host them, and asks the underlying Virtual Machine Scale Set to add more instances. It is mature, GA for years, well understood, and what most existing AKS clusters are running. Its constraint is structural: it can only scale within the SKUs you defined in your node pools, and adding capacity means waiting for a fresh VMSS instance to be created, joined, and made ready. In our environment that consistently took 3 to 6 minutes.

Karpenter is a node lifecycle controller originally from AWS. Microsoft has been porting it to AKS as a preview feature. Instead of being locked to a VMSS, Karpenter looks at the shape of pending pods and picks a VM SKU from a NodePool spec on demand, creates the node directly through the compute API, and skips the VMSS layer entirely. It can mix Spot and on-demand in one pool, juggle multiple instance types, and cold scale-up is minutes faster because there is no scale-set provisioning lifecycle to wait for. The wrinkle is the preview label, which has real operational consequences I will come back to.

Node Auto-Provisioning (NAP) is Microsoft's managed take on the same model. Under the hood it is Karpenter, with Microsoft owning the controller, the upgrades, and the support relationship. NAP went GA earlier in 2025. You enable it on a cluster, define one or more AKSNodeClass and NodePool resources, and Microsoft's control plane does the rest. Same provisioning speed as Karpenter, lower operational burden, fewer flexibility knobs.

The honest summary: Cluster Autoscaler is the mature option you already understand, Karpenter is the flexible option with preview risk, NAP is the managed option that gets you most of Karpenter's speed without owning the controller.

The test setup

I built three clusters from a single Bicep template, parameterised on the provisioning mode. All three lived in the same region (uksouth), used the same VNet, the same node SKU family options (D-series v5), the same Kubernetes version (1.30.5), the same set of installed add-ons (Azure Monitor, Microsoft Defender for Containers, Workload Identity). The differences were exactly the provisioning configuration.

The Bicep for the Cluster Autoscaler variant looked like this:

param clusterName string
param location string = 'uksouth'
param kubernetesVersion string = '1.30.5'

resource aks 'Microsoft.ContainerService/managedClusters@2024-09-01' = {
  name: clusterName
  location: location
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    kubernetesVersion: kubernetesVersion
    dnsPrefix: clusterName
    networkProfile: {
      networkPlugin: 'azure'
      networkPluginMode: 'overlay'
      podCidr: '10.244.0.0/16'
      serviceCidr: '10.0.0.0/16'
      dnsServiceIP: '10.0.0.10'
    }
    agentPoolProfiles: [
      {
        name: 'system'
        mode: 'System'
        count: 3
        minCount: 3
        maxCount: 5
        enableAutoScaling: true
        vmSize: 'Standard_D4ds_v5'
        osDiskSizeGB: 128
        osType: 'Linux'
        type: 'VirtualMachineScaleSets'
      }
      {
        name: 'workload'
        mode: 'User'
        count: 3
        minCount: 3
        maxCount: 40
        enableAutoScaling: true
        vmSize: 'Standard_D8ds_v5'
        osDiskSizeGB: 256
        osType: 'Linux'
        type: 'VirtualMachineScaleSets'
      }
    ]
    autoScalerProfile: {
      'scale-down-delay-after-add': '10m'
      'scale-down-unneeded-time': '10m'
      'scale-down-utilization-threshold': '0.5'
      expander: 'least-waste'
      'max-node-provision-time': '15m'
    }
  }
}
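
Each variant went out the same way; the resource group matches the CLI examples later in this post, while the template file and cluster names here are illustrative rather than the exact ones from our repo:

az deployment group create \
  --resource-group rg-aks-bench \
  --template-file main.bicep \
  --parameters clusterName=aks-ca-bench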

For the NAP variant, the cluster definition is shorter because there are no agent pool size limits to pick. NAP requires networkPlugin: 'azure' with networkPluginMode: 'overlay', which is the same setting we already used. The bit that turns NAP on is the nodeProvisioningProfile:

resource aksNap 'Microsoft.ContainerService/managedClusters@2024-09-01' = {
  name: clusterName
  location: location
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    kubernetesVersion: kubernetesVersion
    dnsPrefix: clusterName
    networkProfile: {
      networkPlugin: 'azure'
      networkPluginMode: 'overlay'
      podCidr: '10.244.0.0/16'
      serviceCidr: '10.0.0.0/16'
      dnsServiceIP: '10.0.0.10'
    }
    nodeProvisioningProfile: {
      mode: 'Auto'
    }
    agentPoolProfiles: [
      {
        name: 'system'
        mode: 'System'
        count: 3
        vmSize: 'Standard_D4ds_v5'
        osDiskSizeGB: 128
        osType: 'Linux'
        type: 'VirtualMachineScaleSets'
      }
    ]
  }
}

Note the system node pool is still there. NAP only provisions user workloads; the system pool still hosts CoreDNS, the metrics server, and the Azure-managed components, and that pool is sized statically. Trying to set mode: 'Auto' on a cluster that already has user node pools fails at deployment time; NAP needs to own user provisioning from cluster creation.

You can enable NAP on an existing cluster too, via the CLI, as long as the network plugin is right:

az aks update \
  --resource-group rg-aks-bench \
  --name aks-nap-bench \
  --node-provisioning-mode Auto
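
Either way, it is worth confirming the mode actually flipped before relying on it; with a reasonably recent azure-cli (or the aks-preview extension, depending on vintage) the property is queryable directly:

az aks show \
  --resource-group rg-aks-bench \
  --name aks-nap-bench \
  --query nodeProvisioningProfile.mode -o tsv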

The Karpenter preview variant required the AKS Karpenter feature flag, a Helm install of the Karpenter chart, and the NodePool and AKSNodeClass custom resource definitions. The cluster is created with no autoscaling configured at all, then Karpenter takes over:

az aks create \
  --resource-group rg-aks-bench \
  --name aks-karpenter-bench \
  --kubernetes-version 1.30.5 \
  --node-count 3 \
  --node-vm-size Standard_D4ds_v5 \
  --network-plugin azure \
  --network-plugin-mode overlay \
  --enable-aad \
  --enable-workload-identity \
  --enable-oidc-issuer

# After the cluster is up, install the Karpenter chart (the feature flag from earlier is already registered)
helm install karpenter oci://mcr.microsoft.com/aks/karpenter/karpenter \
  --version 0.6.0 \
  --namespace karpenter \
  --create-namespace \
  --set settings.clusterName=$CLUSTER_NAME \
  --set settings.clusterEndpoint=$CLUSTER_ENDPOINT

The NodePool and AKSNodeClass resources are what tell Karpenter what to provision. This is where the flexibility shows up. A single NodePool can specify multiple SKU families, mix on-demand and Spot, and bound CPU and memory ranges so you do not accidentally provision a Standard_E96 for a single 200m CPU pod:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: workload
spec:
  template:
    spec:
      requirements:
        - key: karpenter.azure.com/sku-family
          operator: In
          values: ["D", "E"]
        - key: karpenter.azure.com/sku-cpu
          operator: In
          values: ["4", "8", "16", "32"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: default
      expireAfter: 720h
  limits:
    cpu: "1000"
    memory: 4000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
---
apiVersion: karpenter.azure.com/v1beta1
kind: AKSNodeClass
metadata:
  name: default
spec:
  imageFamily: Ubuntu2204
  osDiskSizeGB: 128

That consolidateAfter: 30s is the bit that gets you the cost win on the steady-state workload. Karpenter (and therefore NAP) constantly looks for opportunities to repack workloads onto fewer larger nodes, or onto cheaper Spot capacity, and recycles the displaced nodes. Cluster Autoscaler does the same conceptually with scale-down-unneeded-time, but it can only remove whole VMSS instances, not consolidate across SKUs.
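
If you want to watch the consolidation happen rather than take my word for it, the NodeClaim objects are the things to follow. Both Karpenter and NAP install the same CRDs, so the same commands apply to either; the nodeclaim name is whatever the controller generated:

kubectl get nodeclaims -o wide --watch
kubectl describe nodeclaim <nodeclaim-name>   # events at the bottom show the consolidation decisions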

The three workload patterns

I tested each cluster against three deliberate workload shapes, run in sequence, with cluster state reset between runs.

Pattern A: steady-state web traffic. A 24-hour run of generated load against a stateless API service. The deployment is sized at 30 pods of 500m CPU and 1Gi memory each. Traffic follows a Poisson distribution with a mean of 400 req/s, peaks at 720, troughs at 90. The test driver is vegeta running from a separate cluster; the command below shows the mean rate:

echo "GET https://api.bench.example.com/items" | \
  vegeta attack -rate 400/s -duration 24h | \
  vegeta report

What I am measuring here is node count, total compute cost, and pod-density efficiency. Bursting matters less; the cluster should find a comfortable steady state and stay there.

Pattern B: bursty batch. A scheduled job that abruptly demands 340 additional pods of 1 CPU and 2Gi memory each, with anti-affinity rules that forbid more than four of them on the same node. This is the Monday-morning case. The test driver is one kubectl line, fired at a specific timestamp:

kubectl scale deployment loadgen --replicas 340

Plus a watcher that records the first Pending timestamp and the last Running timestamp:

kubectl get pods -l app=loadgen --watch-only \
  --output-watch-events \
  -o jsonpath='{.type} {.object.metadata.name} {.object.status.phase} {.object.status.startTime}{"\n"}' \
  | tee burst-timeline.log

The metric is end-to-end scale-up time: from the first pod going Pending to the last pod going Running.

Pattern C: memory-heavy long-running jobs. A simulated analytical workload of 12 pods, each requesting 1 CPU and 28Gi memory, running for 6 hours each. The interesting question here is whether the provisioner picks the right SKU. A naive choice puts each pod on its own Standard_D8ds_v5 (32Gi memory) and wastes the other cores. A smart choice picks Standard_E8ds_v5 (64Gi memory) and packs two pods per node.
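
For reference, the pod shape driving that packing decision looked roughly like this, trimmed to the resources stanza (the image name is a stand-in; the rest of the job spec does not matter for the SKU question):

containers:
  - name: analytics
    image: example.azurecr.io/analytics:latest
    resources:
      requests:
        cpu: "1"
        memory: 28Gi    # one per Standard_D8ds_v5 (32Gi), two per Standard_E8ds_v5 (64Gi)
      limits:
        memory: 28Gi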

The Prometheus queries I used to measure things, recorded for honesty:

# Scale-up duration: time from first Pending to last Running for a given label
max by (label_app) (
  kube_pod_status_phase{phase="Running", label_app="loadgen"}
) - on(label_app) group_left()
min by (label_app) (
  kube_pod_status_phase{phase="Pending", label_app="loadgen"}
)

# Average node CPU utilization over the test window
avg by (node) (
  1 - rate(node_cpu_seconds_total{mode="idle"}[5m])
)

# Cost input: node count by SKU and capacity type (joined offline against pricing)
sum by (label_node_kubernetes_io_instance_type, label_karpenter_sh_capacity_type) (
  kube_node_labels{label_node_kubernetes_io_instance_type=~".+"}
)

The Prometheus queries above are simplifications of what I actually ran. The cost numbers I quote below come from cross-referencing those queries against the Azure pricing API for uksouth, on-demand and Spot. I am not reading them off the Cost Management dashboard because that dashboard's reservation accounting can hide the actual on-demand cost of a transient workload behind a baseline commitment.
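
The pricing lookup itself is nothing exotic: the Azure Retail Prices API is public and unauthenticated. A sketch of the per-SKU query I ran (you still need to filter the Windows and low-priority meters out of the results):

curl -sG "https://prices.azure.com/api/retail/prices" \
  --data-urlencode "\$filter=armRegionName eq 'uksouth' and armSkuName eq 'Standard_D8ds_v5' and priceType eq 'Consumption'" \
  | jq '.Items[] | {skuName, productName, unitPrice}'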

The results

Three days of data, condensed:

Workload                            Cluster Autoscaler    Karpenter (preview)    Node Auto-Provisioning
Pattern A (steady): avg nodes       11                    7                      7
Pattern A: avg cluster CPU util     38%                   64%                    62%
Pattern A: monthly cost             £4,210                £3,720                 £3,810
Pattern B (burst): scale-up time    5m 48s                1m 18s                 1m 23s
Pattern B: nodes added              14                    11                     11
Pattern B: pods stranded > 60s      312                   0                      0
Pattern C (memory): nodes used      6x D16ds_v5           6x E8ds_v5             6x E8ds_v5
Pattern C: monthly cost             £2,840                £2,160                 £2,160

The pattern-A column is what I expected to find, but it was still bracing to see it on a graph. Cluster Autoscaler held 11 nodes at average 38% CPU because it was structurally unable to consolidate down. The two D8ds_v5 user pool nodes that the workload barely touched stayed up because removing them would have violated the min-count of the node pool, and there was no cheaper SKU it was allowed to substitute. Karpenter and NAP repacked the same workload onto 7 nodes by picking a mix of D8 and D16 instances based on actual pod shapes.

Pattern B is the one I opened with. 5 minutes 48 seconds for Cluster Autoscaler, 1 minute 18 for Karpenter, 1 minute 23 for NAP. The mechanical reason is that Cluster Autoscaler asks the VMSS to add instances, and each instance goes through the full scale-set provisioning lifecycle: creating the VM, running the node bootstrap, joining the cluster network, and registering with the API server. Karpenter (and NAP, which is the same code path) calls the Azure compute API directly and provisions standalone VMs from the pre-baked AKS node image, and the join time is closer to 60 seconds than 4 minutes.

The pattern-B "pods stranded > 60s" row is the row that, on the day, made me sit up. 312 of the 340 pods on the Cluster Autoscaler cluster were stuck in Pending for longer than a minute. On the NAP cluster, that number was zero. Not in the sense that everything was instant, but in the sense that within 60 seconds, every pending pod had a node assignment.

The workload that broke Cluster Autoscaler

The bursty pattern was a stress test. The interesting failure was a different one, on the same Cluster Autoscaler cluster, with a workload that had anti-affinity rules I had not paid attention to.

The workload was a stateful service: 60 pods, each requesting 4 CPU and 16Gi memory, with a podAntiAffinity rule that said no two pods of the same service could land on nodes in the same topology.kubernetes.io/zone. We had three zones. The math says each zone gets 20 pods, on as many nodes as needed.

The Cluster Autoscaler user pool was a single VMSS configured for Standard_D16ds_v5 (16 CPU, 64Gi memory), so each node could host 4 of these pods. The deployment scaled up cleanly to 8 nodes total (32 pod capacity). At 33 pods, Cluster Autoscaler tried to add a ninth node. The VMSS in zone 1 added a node. The pod did not schedule.

0/12 nodes are available: 12 Insufficient memory

The error was misleading. The new node had plenty of memory. The actual problem, which I worked out only after about 40 minutes of staring at events, was that the new node was in zone 1, and zone 1 already had 4 pods of this service running, and the anti-affinity rule was per zone, so the new pod could not land on any zone 1 node regardless of its memory. Cluster Autoscaler did not know about the zone-affinity constraint; it just kept adding nodes to zone 1 (because that VMSS instance happened to be the cheapest expander option), and pods kept staying Pending.

The workaround on Cluster Autoscaler was to split the workload into three node pools, one per zone, each with its own min/max. That worked, but it tripled the number of node pools to manage, and it meant the zone balance was hardcoded into the cluster definition rather than the workload spec. When we added a fourth zone six months later, I had to remember to update the cluster, not just the deployment.
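
For the record, the per-zone workaround is ordinary node pool plumbing, one pool per zone; the resource names and autoscaler bounds here are illustrative:

az aks nodepool add \
  --resource-group rg-aks-prod \
  --cluster-name aks-prod \
  --name statefulz1 \
  --zones 1 \
  --node-vm-size Standard_D16ds_v5 \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 8
# repeat with --name statefulz2 --zones 2 and --name statefulz3 --zones 3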

On Karpenter (and NAP), the same workload just worked. The provisioner looked at the unschedulable pod, saw the anti-affinity rule, picked a SKU in a different zone, and provisioned the node there. The configuration is at the workload level, where it belongs.

Where each one fits

Cluster Autoscaler is still the right answer for a specific shape of cluster. If you have a long-lived cluster running a homogeneous workload with predictable peaks, where the SKU choice is fixed by some external constraint (a reserved instance commitment, a compliance requirement, a software licence that bills per VM core), and where 3-to-6-minute scale-up is acceptable, Cluster Autoscaler is mature, stable, and well-understood. I am not chasing teams off of it.

Karpenter on AKS (preview) is the right answer if you have an unusual workload mix that needs the fine-grained NodePool spec, you have the operational capacity to upgrade the Karpenter controller yourself, and you are comfortable being on a preview-labelled component. The case I have seen it most useful for is ML training clusters: lots of GPU SKU variation, Spot heavy, jobs that can survive interruption. The preview status matters more than it sounds. We had a NodeClaim get stuck in Unknown state during one test run:

NodeClaim/loadgen-7f2k4 Unknown: failed to create instance: SkuNotAvailable for Standard_D16ds_v5 in zone uksouth-2

That is a transient Azure capacity error, not a Karpenter bug, but the recovery path on a preview component is harder to reason about: the support relationship is community-driven, the bug-fix cadence is upstream-driven, and the upgrade path between minor versions is hand-rolled. For our production workload, those properties together were enough to push us towards NAP instead.

Node Auto-Provisioning is what I would default to today on any new AKS cluster. Same speed as Karpenter, same SKU flexibility, same packing efficiency, with Microsoft owning the controller upgrades and the support relationship. The cost gap to Karpenter at our scale (£90/month per cluster) is the price of having someone else patch the controller. I will pay that all day long.

The constraint to be honest about: NAP requires --network-plugin azure --network-plugin-mode overlay. If your cluster is on kubenet or on Azure CNI without overlay mode, you cannot enable NAP in place. The migration path is documented but is a full cluster rebuild: stand up a new cluster with the right networking, migrate workloads via GitOps. I went through that exercise for a different reason last year (running out of pod IPs on a kubenet cluster), and it is a real piece of work; budget two weeks of clock time per cluster for a clean migration.

Troubleshooting

A few real errors I hit during the bench, all worth knowing about:

Cluster Autoscaler not scaling up at all. On AKS the autoscaler runs inside the managed control plane, so there is no pod in kube-system to tail. The quickest in-cluster signal is the status ConfigMap it maintains, plus the cluster-autoscaler diagnostic log category if you stream control-plane logs to Log Analytics:

kubectl get configmap -n kube-system cluster-autoscaler-status -o yaml

Usually one of three things. The pod's resource request is bigger than the largest node in any node pool (Cluster Autoscaler will not "scale to a SKU you did not define"). The node pool is at its maxCount. Or the expander policy is choosing a pool that cannot host the pod and is not retrying with a different one.
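
If you do stream the control-plane logs, the same evidence is queryable after the fact. This sketch assumes diagnostic settings in the default AzureDiagnostics mode with the cluster-autoscaler category enabled; adjust the table and column names if you use resource-specific tables:

AzureDiagnostics
| where Category == "cluster-autoscaler"
| where log_s contains "unschedulable" or log_s contains "scale-up"
| project TimeGenerated, log_s
| order by TimeGenerated desc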

Karpenter NodeClaim stuck in Unknown. As above, the typical cause is Azure capacity:

status:
  conditions:
    - type: Launched
      status: "False"
      reason: SkuNotAvailable
      message: "failed to create instance: SkuNotAvailable for Standard_D16ds_v5 in zone uksouth-2"

The fix is to widen the NodePool to include more SKU options, so Karpenter has alternatives when one is unavailable. Spot capacity in particular is volatile across SKUs; never restrict a Spot-preferring NodePool to a single SKU.

NAP error: pod cannot be scheduled because tolerations do not match any provisioner. This one tripped me up the first time. NAP requires that pods have tolerations that match the taints NAP applies to its provisioned nodes. The default NAP nodes carry no taints, so most pods schedule fine, but if you taint a NodePool to isolate a workload class, the workload must explicitly tolerate it:

spec:
  template:
    spec:
      taints:
        - key: workload
          value: batch
          effect: NoSchedule
# And in the deployment's pod template spec:
spec:
  tolerations:
    - key: workload
      operator: Equal
      value: batch
      effect: NoSchedule

I lost about 30 minutes the first time on this one, staring at events that said 0/8 nodes are available: 8 node(s) had untolerated taint, before remembering I had applied the taint a week earlier and never updated the deployment.

Out-of-memory failures on consolidation. Both Karpenter and NAP will occasionally consolidate workloads onto a smaller node, and if the workload's actual memory use is higher than its requests (a common mistake), the consolidation triggers an OOMKill. The fix is workload hygiene: set realistic memory requests. The provisioner is right to consolidate; the workload spec is wrong.
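
A quick way to find those offenders before the provisioner finds them for you. This is a rough query (it sums across containers and ignores pods with no request set), but a ratio at or above 1 means the requests are lying:

# actual working-set memory vs requested memory, per pod
sum by (namespace, pod) (container_memory_working_set_bytes{container!=""})
  /
sum by (namespace, pod) (kube_pod_container_resource_requests{resource="memory"})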

The taint mismatch on Spot nodes. Spot VMs are documented on Microsoft Learn and integrate cleanly with Karpenter and NAP, but there is one configuration trap: AKS Spot nodes automatically carry the taint kubernetes.azure.com/scalesetpriority=spot:NoSchedule. Karpenter takes that taint into account when it picks Spot capacity, but nothing adds the matching toleration to your pods for you. If you are building a NodePool that mixes Spot and on-demand, make sure the toleration is in the workload template.
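
For completeness, the toleration in question is the standard AKS Spot one, and it belongs in the workload's pod template:

tolerations:
  - key: kubernetes.azure.com/scalesetpriority
    operator: Equal
    value: spot
    effect: NoSchedule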

Where we ended up

Production migrated to NAP for new clusters, which is now seven clusters across three regions. The migration was a fresh build per cluster, not an in-place flip: even though we were already on Azure CNI overlay, the older clusters had existing Cluster Autoscaler user node pools, which NAP cannot simply take over. We did the cutover via GitOps and a Front Door traffic shift, the same pattern we had used for previous cluster rebuilds.

We kept Cluster Autoscaler on one specific cluster: a legacy data platform cluster that runs an in-house licensed product, where the licence is keyed to specific VM SKUs (do not ask), and where the cost of a clean migration is roughly the same as four years of the £350/month efficiency gap. We will migrate it when the licence terms change, not before.

The Karpenter cluster, the one I built for the benchmark, stayed up for three more months as a holding tank for the ML team's experimental workloads, until NAP added the few feature toggles that the ML team specifically wanted (custom scripts on node init), at which point we tore it down and folded that workload into a NAP cluster too.

The numbers I keep coming back to are not the cost numbers, which are real but easy to defend after the fact. They are the latency numbers. 5 minutes 48 seconds versus 1 minute 23 seconds, on a workload that lives or dies by how fast it scales. That difference shows up downstream in queue depth, in alert noise, in the on-call engineer's confidence that they do not need to manually nudge the cluster on a Monday morning. The cost win is what I put in the business case. The latency win is what made the on-call rota easier to sleep through.

The honest reflective coda. Three years ago I would have argued that Cluster Autoscaler was the right answer for almost any AKS cluster, because the operational maturity was a real asset and the scale-up cost was rarely on the critical path. That argument was correct then. It is no longer correct now. NAP is not a research project; it is a managed, GA, supported component of AKS with the same blast radius as Cluster Autoscaler and a meaningfully better runtime behaviour. The thing that changed was not my opinion. The thing that changed was that Microsoft shipped, in production-ready form, the provisioning model that the open-source community had been building. I would rather adopt a managed implementation of someone else's good idea than keep running a less-good idea because it is the one I am used to. The Monday-morning batch job that started this whole exercise has not paged anyone in seven months. The 340 pods get their nodes in 83 seconds. The queue does not back up. I have stopped opening the dashboard at 09:00 to check.