
Lifting a .NET Framework 4.8 monolith into AKS without rewriting it



The diagram I drew on the whiteboard in January 2024 had seven boxes on it. Top left, an IIS 10 server on Windows Server 2016 hosting PaymentReconciliation.Web, an ASP.NET 4.8 application with 184 controllers and a Global.asax.cs that had not been touched since 2018. Top right, a Windows service called PaymentReconciliation.Worker.exe that polled MSMQ every two seconds and called into the same shared DLLs the web app used, including PaymentReconciliation.Engine.dll (14,200 lines, no tests, the author had left the company in 2019). Middle, a SQL Server 2017 Failover Cluster Instance on two physical hosts in our colo, with a 1.8TB database that had been growing 40GB a month for the last four years. Bottom left, an SMB file share holding 92,000 PDFs of reconciliation reports that the application wrote to and a separate batch job read from. Bottom right, MSMQ queues with Recoverable=True and journaling on, because we had once had an outage in 2019 where we lost six hours of messages and the postmortem ended with "turn journaling on and never turn it off." The seventh box was a Windows scheduled task on the IIS server that ran Cleanup.ps1 at 02:00 every morning. Nobody knew what Cleanup.ps1 did. It was 312 lines of PowerShell, half of it commented out, and the only person who had ever read it in full had left in 2021.

That was the system we needed to get off bare metal. The colo lease ended in November 2024. The business had run a costed proposal for rewriting PaymentReconciliation as a set of .NET 8 microservices, the estimate came back at £2.4 million and 18 months, and the finance director said no within an hour of receiving it. The actual ask was simpler: get it into AKS, keep it running, do not break the daily settlements file that the FCA expected at 06:30 every weekday, and do it by November.

We finished it in 11 months. The monolith is still alive, now running in a Windows Server Core container on an AKS Windows node pool, and seven new .NET 8 services have been carved off the side of it on a Linux node pool. The same CI/CD pipeline deploys both halves. The colo is decommissioned. The £2.4M rewrite never happened. This is the playbook, including the night in October when a single Windows pod sat in ContainerCreating for eleven minutes and I genuinely thought we were going to miss the cutover.

What the legacy stack actually cost to keep alive

Before I talk about the migration, I want to be honest about what we were paying for the status quo, because the case for moving was not "containers are cool." It was operational arithmetic.

The colo lease was £18,400 a month. The two SQL FCI hosts were on extended support for Windows Server 2016, priced into the high five figures annually. The IIS server's OS was due for an in-place upgrade that nobody wanted to attempt, because the last time we had touched its registry the application had stopped serving requests for nine hours. The Windows service had a known memory leak that we worked around by recycling it every 12 hours. Patching weekends were monthly, six hours of downtime each.

The tax on developer velocity was worse. Every feature shipped touched the same PaymentReconciliation.Engine.dll. A deploy meant a 45-minute build, a 25-minute MSI installer run, and an IIS app pool recycle that froze the application for 90 seconds. Developers had stopped shipping anything but mandatory regulatory changes.

The total cost of staying still, including the colo, the licensing, the on-call hours, and the features the team would not ship, was something like £640,000 a year. The migration was sold internally as paying for itself in 14 months. It actually paid for itself in nine.

Why a rewrite was off the table

The £2.4M number was real. We had quoted out a clean rewrite to two consultancies. The shape of the work was: rebuild PaymentReconciliation.Engine as six bounded contexts, migrate the data model from 41 stored procedures into a CQRS-ish read model, decommission MSMQ, and write a test suite from scratch because the existing one was four NUnit projects with 38 passing tests and 1,200 ignored ones.

The business calculation was straightforward. £2.4M up front, 18 months during which no new features ship, and an unquantified risk that the new system would have different bugs nobody had seen yet. The director's exact phrase was "we are not betting the business on a rewrite."

So we strangled it instead. The pattern is fifteen years old at this point and still works because the underlying constraint, "the legacy application is the source of truth and you cannot turn it off," has not changed. The idea is to wrap the monolith in a routing layer, divert new functionality to new services behind that layer, and slowly grow the new system around the old one until the old one is small enough to retire. We were not aiming to retire PaymentReconciliation. We were aiming to get it into a place where a future team could.

The shape of the target architecture

The architecture we landed on:

  1. The monolith stays as-is, packaged into a Windows Server Core container running IIS, deployed to a Windows node pool on AKS. No source changes beyond what the container build required.
  2. The Windows service becomes a separate Windows container, deployed as its own AKS Deployment, also on the Windows node pool.
  3. A new Linux node pool runs seven new .NET 8 services that handle features extracted from the monolith. Each service is a normal containerised ASP.NET Core app on a Linux base image.
  4. An NGINX ingress on the Linux pool fronts everything. Routes prefixed with /v2/* go to the new services; everything else falls through to the monolith's IIS container. This is the strangler-fig boundary.
  5. Authentication is unified via a JWT issued by Entra ID, validated by both sides through a shared middleware. The monolith picks up a NuGet-packaged validation library; the new services use the standard Microsoft.AspNetCore.Authentication.JwtBearer (a minimal registration sketch for the new-service side follows this list).
  6. State moves off the file system. The 92,000 PDFs migrate to Azure Files mounted via the CSI driver. MSMQ is replaced with Azure Service Bus. SQL Server moves to Azure SQL Managed Instance, kept private behind a private endpoint inside the AKS VNet.
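
For item 5, the new-service side is the well-trodden path. A minimal sketch of the registration, with the tenant and audience values as placeholders; the monolith's NuGet-packaged validation library is its own story and is not shown here.

using Microsoft.AspNetCore.Authentication.JwtBearer;

// Program.cs of one of the new .NET 8 services -- minimal JWT validation wiring.
// Tenant, audience, and service names are illustrative, not the real values.
var builder = WebApplication.CreateBuilder(args);

builder.Services
    .AddAuthentication(JwtBearerDefaults.AuthenticationScheme)
    .AddJwtBearer(options =>
    {
        options.Authority = "https://login.microsoftonline.com/<tenant-id>/v2.0";
        options.Audience = "api://payrec-api-v2";
    });
builder.Services.AddAuthorization();
builder.Services.AddControllers();

var app = builder.Build();

app.UseAuthentication();
app.UseAuthorization();
app.MapControllers();

app.Run();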

The thing that made all of this tractable was AKS Windows Server containers. They had matured enough by late 2023 that the rough edges I had hit on a previous evaluation in 2021 were mostly gone. The Windows node pool feature was GA, the container images had a stable tagging scheme, and the kubelet behaviour on Windows was no longer surprising. It was production-shaped.

The Windows container, end to end

The Dockerfile for the monolith was the first artifact I built and the first one I argued with for a week. The published Microsoft base image for ASP.NET 4.8 is mcr.microsoft.com/dotnet/framework/aspnet:4.8-windowsservercore-ltsc2022, documented on the .NET Framework Docker images page. It ships with IIS preconfigured and an entrypoint (ServiceMonitor.exe) that watches the W3SVC service and exits when IIS stops, so the container's lifetime tracks the web server and Kubernetes can tell when it has died.

# escape=`
FROM mcr.microsoft.com/dotnet/framework/aspnet:4.8-windowsservercore-ltsc2022

SHELL ["powershell", "-Command", "$ErrorActionPreference = 'Stop'; $ProgressPreference = 'SilentlyContinue';"]

# The application's pre-deploy hardening script that the legacy server ran.
# We baked it into the image so the container starts in the same state.
COPY ops/Set-IisDefaults.ps1 C:/ops/Set-IisDefaults.ps1
RUN C:/ops/Set-IisDefaults.ps1

# IIS features the legacy server had enabled beyond the base image's defaults.
# Web-Mgmt-Service supports the WebDeploy-style packages the on-prem tooling
# used; the build job still produces the same .zip it always produced and the
# container just imports it. Smaller change surface during cutover.
RUN Install-WindowsFeature Web-Mgmt-Service; `
    Install-WindowsFeature Web-Asp-Net45; `
    Install-WindowsFeature Web-Net-Ext45

# Required Windows features the monolith depended on. The 2018-era
# Global.asax.cs called System.DirectoryServices on startup.
RUN Install-WindowsFeature RSAT-AD-PowerShell

# Site setup. The monolith assumed a fixed app pool identity; in the container
# we keep the base image's DefaultAppPool and the built-in
# ApplicationPoolIdentity it already runs under.
RUN Remove-Website -Name 'Default Web Site'; `
    New-Website -Name 'PaymentReconciliation' `
                -PhysicalPath 'C:\inetpub\PaymentReconciliation' `
                -Port 80 `
                -ApplicationPool 'DefaultAppPool'

COPY publish/PaymentReconciliation.Web/ C:/inetpub/PaymentReconciliation/

# A reduced web.config that pulls connection strings from environment variables
# at runtime. This was the only application source change required.
COPY ops/web.runtime.config C:/inetpub/PaymentReconciliation/web.config

# A container-level healthcheck against the application's existing /heartbeat
# endpoint. Kubernetes ignores Docker HEALTHCHECK instructions; the Deployment's
# httpGet probes (shown later) hit the same endpoint. This one is for plain
# docker runs outside the cluster.
HEALTHCHECK --interval=30s --timeout=10s --start-period=120s --retries=3 `
  CMD powershell -command `
    try { `
      if ((Invoke-WebRequest -Uri http://localhost/heartbeat -UseBasicParsing).StatusCode -ne 200) { exit 1 } `
    } catch { exit 1 }

EXPOSE 80

The image weighed in at 6.2GB after the first successful build. That number was not a surprise; the base image alone is 4.7GB compressed. It was the operational implication that bit us. Pulling a 6.2GB image to a freshly added Windows node took eleven minutes in our region. I will come back to this when I talk about the cutover.

The Windows service was a similar shape, smaller in size (3.1GB) because it did not need IIS, but built on the same base image family:

# escape=`
FROM mcr.microsoft.com/dotnet/framework/runtime:4.8-windowsservercore-ltsc2022

WORKDIR C:/app
COPY publish/PaymentReconciliation.Worker/ .

# The Windows service used to install itself via InstallUtil. In the container
# we bypass the Service Control Manager entirely and run the binary directly
# in console mode (the --mode=console flag below).
ENV PAYREC_RUN_MODE=Container
ENTRYPOINT ["PaymentReconciliation.Worker.exe", "--mode=console"]

The --mode=console flag was a single-line code change in Program.Main that I had to write. The original code detected Environment.UserInteractive to decide between service mode and console mode, and inside a container the value of that property was false, so without the flag the binary tried to attach to the Windows Service Control Manager and immediately exited. Two hours of debugging produced one line of code. That ratio felt familiar by month three.
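
For orientation, the shape of Program.Main after that change; only the flag and the Environment.UserInteractive check come from the paragraph above, and the class names are stand-ins.

// Program.cs of PaymentReconciliation.Worker -- a sketch, not the real file.
using System;
using System.Linq;
using System.ServiceProcess;
using System.Threading;

internal static class Program
{
    private static void Main(string[] args)
    {
        // UserInteractive is false inside the container, so without the flag the
        // process fell through to ServiceBase.Run, failed to attach to the SCM,
        // and exited immediately.
        if (args.Contains("--mode=console") || Environment.UserInteractive)
        {
            var worker = new ReconciliationWorker();      // existing worker class, name assumed
            worker.Start();
            Thread.Sleep(Timeout.Infinite);               // keep the entrypoint process alive
        }
        else
        {
            ServiceBase.Run(new ReconciliationService()); // original Windows-service path
        }
    }
}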

The AKS cluster, with two node pools

The cluster itself is Bicep. The interesting properties for this migration are: a system pool on Linux (kept small, runs CoreDNS and the ingress controller), a user pool on Linux (the new services), and a user pool on Windows (the monolith and the worker). The Windows pool was the part that needed the most care, because the Windows node pool prerequisites include constraints around networking plugin choice, node count, and even the pool name, which is capped at six characters for Windows pools, hence winmon.

param location string = resourceGroup().location
param clusterName string = 'aks-payrec-prod-uks'
param adminUsername string = 'azureuser'
@secure()
param windowsAdminPassword string

resource aks 'Microsoft.ContainerService/managedClusters@2024-09-01' = {
  name: clusterName
  location: location
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    dnsPrefix: clusterName
    kubernetesVersion: '1.29.7'
    enableRBAC: true
    networkProfile: {
      networkPlugin: 'azure'
      networkPolicy: 'calico'
      loadBalancerSku: 'standard'
      serviceCidr: '10.40.0.0/16'
      dnsServiceIP: '10.40.0.10'
    }
    windowsProfile: {
      adminUsername: adminUsername
      adminPassword: windowsAdminPassword
      enableCSIProxy: true
    }
    agentPoolProfiles: [
      {
        name: 'system'
        mode: 'System'
        osType: 'Linux'
        osSKU: 'AzureLinux'
        vmSize: 'Standard_D4ds_v5'
        count: 3
        availabilityZones: ['1', '2', '3']
        maxPods: 60
      }
      {
        name: 'linsvcs'
        mode: 'User'
        osType: 'Linux'
        osSKU: 'AzureLinux'
        vmSize: 'Standard_D8ds_v5'
        count: 4
        enableAutoScaling: true
        minCount: 4
        maxCount: 12
        availabilityZones: ['1', '2', '3']
        nodeLabels: {
          workload: 'dotnet-linux'
        }
      }
      {
        name: 'winmon'
        mode: 'User'
        osType: 'Windows'
        osSKU: 'Windows2022'
        vmSize: 'Standard_D8s_v5'
        count: 3
        enableAutoScaling: true
        minCount: 3
        maxCount: 6
        availabilityZones: ['1', '2', '3']
        nodeLabels: {
          workload: 'dotnet-framework'
        }
        nodeTaints: [
          'workload=dotnet-framework:NoSchedule'
        ]
      }
    ]
  }
}

A few specifics that mattered. enableCSIProxy: true is what lets Windows nodes mount Azure Files volumes via the CSI driver; without it, the Azure Files PVCs sit in Pending forever and the kubelet logs on the Windows node will tell you exactly that. The windowsProfile.adminPassword was rotated out of the Bicep parameter file the day after first deploy via az aks update --windows-admin-password, because the Windows password rotation procedure is well documented but easy to forget.
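
For reference, the rotation itself is a single command; the new password value here is a placeholder shell variable:

az aks update \
  --resource-group rg-payrec-prod-uks \
  --name aks-payrec-prod-uks \
  --windows-admin-password "$NEW_WINDOWS_ADMIN_PASSWORD"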

I tainted the Windows pool so only workloads that explicitly tolerated workload=dotnet-framework would land there. The two cases I had seen before, where a Linux DaemonSet from a third-party Helm chart tried to land on a Windows node and produced an opaque scheduling error, were exactly the kind of incident the taint pre-empts. The new Linux services use a nodeSelector of workload: dotnet-linux and have no toleration for the Windows taint.

Adding the Windows pool from the CLI looks like this, for reference:

az aks nodepool add \
  --resource-group rg-payrec-prod-uks \
  --cluster-name aks-payrec-prod-uks \
  --name winmon \
  --os-type Windows \
  --os-sku Windows2022 \
  --node-count 3 \
  --node-vm-size Standard_D8s_v5 \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 6 \
  --node-taints workload=dotnet-framework:NoSchedule \
  --labels workload=dotnet-framework

The pipeline that builds both halves

We had one Azure Pipelines YAML file that built both the Windows image and the Linux images. The Windows job ran on a Windows hosted pool because building a Windows container requires a Windows kernel; the Linux jobs ran on the default Ubuntu pool. The same pipeline pushed both, the same release stage deployed both via Helm, and the cutover bookkeeping lived in one place.

trigger:
  branches:
    include: [main, release/*]

variables:
  acrName: 'acrpayrecprod'
  imageRepoMono: 'payrec/web'
  imageRepoWorker: 'payrec/worker'
  imageRepoApiV2: 'payrec/api-v2'
  serviceConnection: 'sc-payrec-prod-acr'

stages:
  - stage: Build
    jobs:
      - job: BuildMonolithImage
        pool:
          vmImage: 'windows-2022'
        steps:
          - task: NuGetCommand@2
            inputs:
              command: 'restore'
              restoreSolution: 'src/PaymentReconciliation.sln'
          - task: VSBuild@1
            inputs:
              solution: 'src/PaymentReconciliation.sln'
              msbuildArgs: '/p:DeployOnBuild=true /p:WebPublishMethod=FileSystem /p:PublishUrl=$(Build.ArtifactStagingDirectory)/publish'
              configuration: 'Release'
          - task: Docker@2
            displayName: 'Build & push Windows monolith image'
            inputs:
              containerRegistry: $(serviceConnection)
              repository: $(imageRepoMono)
              command: 'buildAndPush'
              Dockerfile: 'ops/docker/monolith.Dockerfile'
              buildContext: '$(Build.ArtifactStagingDirectory)'
              tags: |
                $(Build.BuildNumber)
                latest
          - task: Docker@2
            displayName: 'Build & push Windows worker image'
            inputs:
              containerRegistry: $(serviceConnection)
              repository: $(imageRepoWorker)
              command: 'buildAndPush'
              Dockerfile: 'ops/docker/worker.Dockerfile'
              buildContext: '$(Build.ArtifactStagingDirectory)'
              tags: |
                $(Build.BuildNumber)
                latest

      - job: BuildLinuxServices
        pool:
          vmImage: 'ubuntu-latest'
        strategy:
          matrix:
            api:
              service: 'PaymentReconciliation.ApiV2'
              repo: 'payrec/api-v2'
            settlements:
              service: 'PaymentReconciliation.SettlementsV2'
              repo: 'payrec/settlements-v2'
        steps:
          - task: UseDotNet@2
            inputs:
              version: '8.0.x'
          - script: |
              dotnet publish src/$(service)/$(service).csproj -c Release -o publish/$(service)
            displayName: 'dotnet publish $(service)'
          - task: Docker@2
            inputs:
              containerRegistry: $(serviceConnection)
              repository: $(repo)
              command: 'buildAndPush'
              Dockerfile: 'ops/docker/linux.Dockerfile'
              buildContext: '.'
              arguments: '--build-arg SERVICE=$(service)'
              tags: |
                $(Build.BuildNumber)
                latest

  - stage: Deploy
    dependsOn: Build
    jobs:
      - deployment: HelmRelease
        environment: prod
        pool:
          vmImage: 'ubuntu-latest'
        strategy:
          runOnce:
            deploy:
              steps:
                - task: HelmDeploy@0
                  displayName: 'helm upgrade payrec'
                  inputs:
                    connectionType: 'Azure Resource Manager'
                    azureSubscription: 'sc-payrec-prod-aks'
                    azureResourceGroup: 'rg-payrec-prod-uks'
                    kubernetesCluster: 'aks-payrec-prod-uks'
                    command: 'upgrade'
                    chartType: 'FilePath'
                    chartPath: 'charts/payrec'
                    releaseName: 'payrec'
                    namespace: 'payrec'
                    arguments: >
                      --install
                      --atomic
                      --timeout 20m
                      --set monolith.image.tag=$(Build.BuildNumber)
                      --set worker.image.tag=$(Build.BuildNumber)
                      --set apiV2.image.tag=$(Build.BuildNumber)
                      --set settlementsV2.image.tag=$(Build.BuildNumber)

The --timeout 20m on the Helm upgrade is calibrated for the Windows image pull. The monolith pod's first-time pull to a brand-new Windows node was the slowest single operation in the deploy, by an order of magnitude. The same release on the Linux pool finished in 90 seconds; the Windows half took, on a cold node, between four and eleven minutes. I'll come back to why and how we fixed it. Helm on AKS is the official walk-through if you have not set this part up before.

The Helm chart, in the bits that matter

The chart deploys five things: the monolith Deployment, the worker Deployment, a Service for the monolith, an Ingress for the whole product, and the new Linux service Deployments. The monolith and worker have explicit Windows node selectors and tolerations.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payrec-monolith
  namespace: payrec
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payrec-monolith
  template:
    metadata:
      labels:
        app: payrec-monolith
    spec:
      nodeSelector:
        kubernetes.io/os: windows
        workload: dotnet-framework
      tolerations:
        - key: workload
          operator: Equal
          value: dotnet-framework
          effect: NoSchedule
      containers:
        - name: web
          image: acrpayrecprod.azurecr.io/payrec/web:{{ .Values.monolith.image.tag }}
          ports:
            - containerPort: 80
          env:
            - name: PAYREC_SQL_CONNECTION
              valueFrom:
                secretKeyRef:
                  name: payrec-sql
                  key: connectionString
            - name: PAYREC_SERVICEBUS_CONNECTION
              valueFrom:
                secretKeyRef:
                  name: payrec-sb
                  key: connectionString
          volumeMounts:
            - name: reports
              mountPath: C:\reports
          readinessProbe:
            httpGet:
              path: /heartbeat
              port: 80
            initialDelaySeconds: 120
            periodSeconds: 15
          livenessProbe:
            httpGet:
              path: /heartbeat
              port: 80
            initialDelaySeconds: 180
            periodSeconds: 30
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
      volumes:
        - name: reports
          persistentVolumeClaim:
            claimName: payrec-reports
---
apiVersion: v1
kind: Service
metadata:
  name: payrec-monolith
  namespace: payrec
spec:
  selector:
    app: payrec-monolith
  ports:
    - port: 80
      targetPort: 80
  type: ClusterIP

A few details worth flagging. The initialDelaySeconds: 120 on the readiness probe was tuned to the application's actual cold start. The first hit against IIS after a fresh container start triggered the JIT compile of PaymentReconciliation.Engine.dll and took around 70 seconds on the chosen VM SKU. With a 60-second initial delay the readiness probe failed its first checks and the pod sat NotReady, and with the liveness probe set just as tight the kubelet restarted the container before it ever served a request. 120 seconds on readiness and 180 on liveness gave it the headroom plus a margin.

The volumeMounts path is C:\reports, not /reports, because the container is Windows. This is the kind of thing that obviously should not surprise anyone, and which I still got wrong on the first commit because muscle memory typed the Linux path.
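
The Ingress is the one object from that list not shown above, and it is where the strangler-fig boundary lives in literal YAML. A sketch, assuming the standard ingress-nginx controller; the hostname and the new-service Service name are placeholders, and the real chart templates all of these values.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payrec
  namespace: payrec
spec:
  ingressClassName: nginx
  rules:
    - host: payrec.example.co.uk        # placeholder hostname
      http:
        paths:
          # New .NET 8 services take everything under /v2/*.
          - path: /v2
            pathType: Prefix
            backend:
              service:
                name: payrec-api-v2     # illustrative; the real chart has one path per service
                port:
                  number: 80
          # Everything else falls through to the monolith's IIS container.
          - path: /
            pathType: Prefix
            backend:
              service:
                name: payrec-monolith
                port:
                  number: 80

With ingress-nginx the longest matching prefix wins, so anything under /v2/ never reaches the monolith Service, and everything else does.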

The state migration, two by two

There were three things that lived on the IIS server's local disk or in adjacent infra, and each one had its own migration story.

The 92,000 PDFs. These had been written by the application via System.IO.File.WriteAllBytes to D:\Reports\YYYY\MM\ on the IIS box. The application also read them back when the user clicked "download" in the web UI. Inside a container there is no durable local disk; anything the application wrote would land in the ephemeral container layer and vanish on pod restart, which was obviously not acceptable.

We moved them to Azure Files, mounted via the CSI driver. The application code did not change; the mount path inside the container was C:\reports, and a one-line tweak in web.config pointed the ReportsPath setting to that path. The migration of the existing 92,000 files was done via a one-off azcopy sync from the SMB share to the Azure Files share, which took about six hours during which the application kept writing to both places via a feature flag in the code.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: payrec-reports
  namespace: payrec
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: azurefile-csi-premium
  resources:
    requests:
      storage: 500Gi
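
The one-off copy of the existing files was a single azcopy sync run from a box that could see the old share; roughly the following, with the source path, storage account, and SAS token as placeholders:

azcopy sync "\\legacy-fs\Reports" "https://stpayrecprod.file.core.windows.net/reports?<sas-token>" --recursive=true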

The MSMQ queues. This was the migration I was most worried about, and it earned the worry. MSMQ on Windows Server had a property the new code path did not natively have: strict FIFO ordering inside a queue, with transactional dequeue. The monolith and the worker both relied on this. A reconciliation message had a sequence number, and the worker assumed it would see them in order so that it could fail-fast on a missing sequence.

Azure Service Bus has FIFO via sessions, where messages carrying the same SessionId are delivered in order to a single consumer. The migration was to add a SessionId to every published message equal to the partition key the worker had previously assumed (the merchant account number), enable sessions on the destination queue, and rewrite the worker's dequeue loop to use MessageSessionReceiver rather than the original MessageQueue.Receive.

This was 380 lines of changed C# in the worker. The monolith's publisher was simpler, around 90 lines, because publishers do not need to know about session semantics, only that they set SessionId correctly. The trickiest part was the ordering-during-cutover problem: when we flipped the publisher from MSMQ to Service Bus, there were 4,200 messages in the MSMQ that the worker had not yet processed, and we had to drain those first before the new worker (reading from Service Bus) was allowed to start. We did this by running the old worker on the old IIS box, paused publishers, drained the MSMQ to zero, started the new worker, then resumed publishers against Service Bus. The maintenance window for the cutover absorbed this.
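
A compressed sketch of both sides, written against the current Azure.Messaging.ServiceBus SDK (the MessageSessionReceiver mentioned above belongs to an older SDK generation, so treat the type names, queue name, merchant field, and handler here as illustrative; the real worker also handles sequence-gap detection, retries, and batching):

using System;
using Azure.Messaging.ServiceBus;

// Publisher side, in the monolith: SessionId carries the old MSMQ partition key.
var client = new ServiceBusClient(
    Environment.GetEnvironmentVariable("PAYREC_SERVICEBUS_CONNECTION"));
ServiceBusSender sender = client.CreateSender("reconciliation");       // queue name assumed

var message = new ServiceBusMessage(payloadBytes)
{
    SessionId = merchantAccountNumber                                   // per-merchant ordering
};
await sender.SendMessageAsync(message);

// Worker side: accept one session at a time; messages within a session arrive in order.
await using ServiceBusSessionReceiver receiver =
    await client.AcceptNextSessionAsync("reconciliation");

ServiceBusReceivedMessage received;
while ((received = await receiver.ReceiveMessageAsync(TimeSpan.FromSeconds(5))) != null)
{
    ProcessReconciliationMessage(received.Body.ToArray());              // existing handler, name assumed
    await receiver.CompleteMessageAsync(received);
}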

The SQL Server FCI. This stayed structurally the same, just moved. Azure SQL Managed Instance gave us a SQL Server-compatible target that supported the cross-database queries and SQL Server Agent jobs the monolith depended on. The migration was DMS-based, with a final cutover-night incremental sync. The connection string in the container's environment variable pointed to the MI's private endpoint inside the AKS VNet. No application code change beyond pointing at a new server name.

The cutover weekend, hour by hour

The migration was incremental for ten months and atomic for one weekend. By October, the new infrastructure had been running in parallel for six weeks, with synthetic traffic mirrored from production. The cutover itself was the moment we switched the public DNS record from the colo's IP to the AKS ingress IP.

The runbook for the weekend, slightly redacted:

Friday 18:00 .. Announce maintenance window. Customer comms went out three weeks prior.
Friday 19:00 .. Final delta sync of SQL via DMS. Stop publishers on the old IIS server.
Friday 19:30 .. Confirm MSMQ depth at zero on the old worker. Stop the old worker.
Friday 19:45 .. Confirm the DNS TTL is at 60 (dropped from 3600 48 hours earlier, so cached records had already expired).
Friday 20:00 .. Update DNS A record to point at AKS ingress IP.
Friday 20:05 .. Start synthetic transaction probes against the new endpoint.
Friday 20:30 .. First real customer transaction observed on the new stack.
Friday 21:00 .. Hold point. Three engineers awake, watching dashboards.
Saturday 08:00 .. Verify overnight settlements file regenerated correctly. Compare line-by-line against the parallel run from the old stack (which we had kept running).
Saturday 10:00 .. Continue monitoring. No rollback required.
Sunday 18:00 .. Decision point: keep going or roll back. We kept going.
Monday 09:00 .. Old IIS server kept warm but receiving zero traffic.
Tuesday 09:00 .. Decommission decision: park another 72h.
Friday (T+7) .. Old IIS box powered off.

The rollback plan was: flip DNS back to the colo IP, restart the old worker against the still-running MSMQ (whose journaling we had not turned off), and accept that any transactions that had gone through the new stack between cutover and rollback would need a manual reconciliation. We estimated 30 minutes to execute rollback if we triggered it, and we set a 24-hour bar for "things are going well enough not to roll back." We never hit it.

The single scariest moment came on Friday at 20:18. A Windows pod that the autoscaler had just spun up sat in ContainerCreating for eleven minutes. The events on the pod showed it was pulling the image, and the kubelet logs on the new node confirmed the same. The 6.2GB image, on a fresh Windows node that did not have the layers cached, took eleven minutes to pull and unpack. During those eleven minutes I was watching a graph of latency on the existing pods slowly climb because the load was now concentrated on the two existing pods rather than the three that were supposed to be running.

The fix, retrofitted the next week, was a Windows DaemonSet that pre-pulled the monolith image onto every Windows node as it joined the cluster. The pattern is well known; the implementation is a DaemonSet whose single container is the monolith image itself, running nothing but powershell.exe in an infinite sleep loop. When the node joins, the DaemonSet pulls the image, and from that point on the actual monolith pod's image pull is near-instant because the layers are already local.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: payrec-monolith-prewarm
  namespace: payrec
spec:
  selector:
    matchLabels:
      app: payrec-monolith-prewarm
  template:
    metadata:
      labels:
        app: payrec-monolith-prewarm
    spec:
      nodeSelector:
        kubernetes.io/os: windows
        workload: dotnet-framework
      tolerations:
        - key: workload
          operator: Equal
          value: dotnet-framework
          effect: NoSchedule
      containers:
        - name: prewarm
          image: acrpayrecprod.azurecr.io/payrec/web:latest
          command: ["powershell.exe"]
          args: ["-Command", "Write-Host 'image cached'; while ($true) { Start-Sleep -Seconds 3600 }"]
          resources:
            requests:
              cpu: "10m"
              memory: "100Mi"
            limits:
              cpu: "50m"
              memory: "200Mi"

The DaemonSet keeps a container running on each Windows node solely so that the image layers stay resident in the local store and do not get garbage-collected. The CPU and memory footprint is negligible. A brand-new node still pays the eleven-minute pull once, as soon as it joins, but because the prewarm container uses the identical image the layers are warm before the next rollout or rescheduling touches that node. From the perspective of the application pods, scaling and rolling across existing nodes is effectively instant from there on.

Troubleshooting log, the entries worth sharing

failed to create containerd container: hcsshim::CreateComputeSystem ... The system cannot find the file specified was the error I got the first time I built a Windows image with the wrong base tag. The build pipeline produced an image based on ltsc2019 and tried to schedule it onto an ltsc2022 node. Windows containers require host-and-container OS version compatibility within a major LTSC version unless you opt into Hyper-V isolation. The fix is to match the base image tag to the node OS, which means rebuilding when the node pool's OS version changes. The Windows Server Core base image retirement schedule is worth keeping a calendar reminder on because the consequence of missing it is rebuild work, not a security incident.

Application 'DefaultAppPool' could not be started in AppFabric happened once, in dev, the first time I ran the monolith image after a web.config change that referenced a managed module the base image did not have installed. The default IIS app pool in the base image is configured with a specific set of installed modules, and adding one in web.config without also installing it in the Dockerfile via Install-WindowsFeature produces this exact error message inside the container's event log. The fix was a RUN Install-WindowsFeature Web-Net-Ext45 line in the Dockerfile, which I had thought I already had, but had removed during an earlier optimisation pass to shrink the image. The image grew by 8MB; the application started.

MountVolume.SetUp failed for volume "reports" : rpc error: code = Internal desc = volume mount failed: Mount: exit status 32 from the kubelet on a Windows node, with the kubelet log line azure-file mount failed: Mount: smb mount failed, was a CSI proxy mis-configuration: the cluster's windowsProfile did not yet have enableCSIProxy: true. Once that was set, Azure Files mounts on Windows started working.

The transaction has been aborted on the worker's first run after the Service Bus migration was a sessions misconfiguration. The worker's session-receive path was correct; the root cause was upstream, where one code path in the publisher was not setting SessionId on the message. A queue without sessions enabled will happily accept such a message, but a session-enabled queue rejects it at publish time, so the worker's receive loop never saw anything to process. The publisher fix was a one-line addition to the message-building code.

Forbidden: User \"system:serviceaccount:payrec:default\" cannot list resource \"secrets\" was a Helm chart upgrade where a new ServiceAccount and Role had not been applied to the cluster yet because the chart's templates/rbac.yaml had a templating typo. The pipeline's helm upgrade --atomic rolled the deploy back, which was the correct behaviour, and the error was visible in the Helm history. Atomic mode plus a sane --timeout saved us from a half-deployed state more than once.

Where the work ended up, eighteen months in

The monolith is still running. It is in AKS now, three pods on a Windows node pool, a fourth and fifth pod spun up by the autoscaler during the morning peak. The worker runs as a separate Deployment on the same pool, two replicas. Seven new .NET 8 services, each with a discrete bounded context (SettlementsV2, ApiV2, Notifications, Reporting, MerchantPortal, Compliance, IngestionV2), run on the Linux pool. The strangler-fig boundary at the ingress routes new traffic to the new services and falls through to the monolith for everything else. The proportion of traffic going through the new services is, by row count of monthly transactions, around 22%. By revenue, more like 41%, because the highest-value flows were the ones we extracted first.

Operational metrics, comparing the six months before cutover with the six months after: monthly hosting cost is down 38% (the colo lease is gone, Azure spend is up, the net is favourable). Deploy frequency is up from roughly one a week to three a day. Production incidents per month are down from an average of 4.2 to an average of 1.1, and the residual incidents are mostly in the new services rather than the monolith. The Windows pod pull-time problem is gone because of the pre-warm DaemonSet. The patching weekend is gone entirely; the Windows node pool gets node image upgrades monthly with rolling drain and the application sees no downtime.

The team's relationship to the monolith has changed in a way I did not predict at the start. When PaymentReconciliation.Web was hosted on a single server, every team member was scared of it, because a bad deploy meant the whole product went down for 90 seconds and might not come back. Now that the monolith is a pod, and there are three of them, and we can roll one at a time, the fear is gone. Developers have started fixing small things in the monolith that were never in scope for a rewrite: a 2017 bug in the audit log writer, a deprecated NuGet reference, a controller action that always threw NullReferenceException on Tuesdays. The monolith is improving, slowly, because it is no longer the load-bearing wall of the building.

The £2.4M rewrite the business said no to in January 2024 would have been finished sometime in 2026 if it had ever started, and we would have spent eighteen months not shipping features in the meantime. What we have now is a system where the monolith is still alive, still doing the same job, and the new services around it can be replaced one at a time without ever again being forced into the all-or-nothing rewrite decision. The colo is empty. The Windows scheduled task that ran Cleanup.ps1 at 02:00 every morning was, it turned out, deleting orphaned temp files that the application had stopped creating in 2020. We did not port it. Nothing has missed it. That kind of finding, the dead code that everyone assumed was load-bearing and turned out not to be, was a regular feature of the last 11 months. The migration was, more than anything else, a forced reread of a system that had not been read in five years.