
GitOps with Flux v2 on five AKS clusters: drift, sealed secrets, and a Sunday-night save

A tired engineer ran kubectl edit on prod-eu at 23:40 on a Sunday. Flux reverted it twelve minutes later, before anyone noticed. The whole five-cluster design that made that save boring, from Azure Pipelines bootstrap to shared sealed-secrets keys.


At 23:40 on a Sunday in October, a tired engineer on the on-call rotation ran kubectl edit deployment payments-api -n payments against the prod-eu cluster. He was chasing a memory-pressure alert that had paged him out of bed, and he bumped the container memory limit from 512Mi to 1Gi directly on the live object. He did not open a PR. He did not tell anyone. He went back to sleep at 00:05.

We found out at Monday's standup at 09:30. The engineer mentioned it almost as an aside, and the room went quiet for a second because the prod-eu deployment file in git still said 512Mi. Then the platform lead pulled up the Teams channel that the Flux notification controller posts into. There was a message at 23:52 on Sunday:

[prod-eu] Kustomization 'apps' applied resources:
  Deployment/payments/payments-api
reconciliation succeeded after 14s (drift detected)

Flux had reverted the manual edit twelve minutes after it was made. The pod was running on 512Mi again well before Monday morning. The memory alert had been a transient thing that resolved on its own; the manual edit had been unnecessary and would have caused a node-eviction cascade an hour later when the limit got hit on a different replica. Nobody had to know. Flux already knew.

That story is the reason this whole pattern is in production across our five AKS clusters. The boring part, the part that took eleven weeks to build, is what made the Sunday-night story boring. This is the design.

Why GitOps over imperative deploys from a pipeline

The pipeline we had before Flux was a perfectly reasonable Azure DevOps pipeline that ran kubectl apply -k ./clusters/prod-eu after each merge. It worked. The problem is that the pipeline was the only enforcer. If somebody bypassed it, like the engineer on a Sunday night did, the live cluster state diverged silently and stayed diverged until the next deploy stomped it back, which might be hours or days later.

Three properties of imperative pipeline deploys made the case for switching. First, drift was undetectable between runs. We had no programmatic answer to the question "is the live state of cluster prod-eu equal to what is in git on the main branch?" Second, the audit story was shaped wrong. The pipeline log told us "the pipeline applied these manifests at 14:02," but said nothing about whether the cluster still matched at 18:00. Third, five clusters meant five separate apply jobs to coordinate. When one of them failed halfway, we ended up with three clusters on version N and two on version N-1 until someone manually re-ran.

GitOps inverts the model. The cluster pulls from git on a schedule. The Git commit is the source of truth, full stop. If you want to change prod, you open a PR. If somebody runs kubectl edit directly, the next reconciliation interval reverts it and a notification fires. The pipeline's job shrinks to one-time bootstrap and ongoing validation. The clusters do the rest themselves.

There is a good overview of the model on the Flux docs that covers the source / kustomize / helm controller separation. The thing the docs do not say loudly enough is that the operational change is bigger than the architectural one. You stop pushing to clusters. You commit to git.

Picking Flux v2 over Argo CD

The team evaluated both. Argo CD has a better UI, an active project, and more name recognition. We went with Flux v2 for three reasons specific to our setup.

The notification controller in Flux is a first-class CR-driven thing. Our platform team had already standardised on Microsoft Teams for ops alerts, and the Flux Provider and Alert CRs let us declare the Teams webhook in git alongside everything else. We did not need to leave the GitOps repo to set up alerting. With Argo, the notifications subsystem exists, but the Teams piece would have meant maintaining a separate ConfigMap (argocd-notifications-cm) in the Argo namespace.

The Helm controller in Flux uses native Kubernetes CRs (HelmRelease) to manage Helm charts. We had a stack of around 30 charts in production (ingress, cert-manager, external-dns, the in-house apps), and modelling each one as a CR with a values block in git was a cleaner mental model than Argo's Application wrapper around Helm.

Footprint. Flux runs as four lightweight controllers, each in its own deployment, totalling about 150Mi of memory per cluster. Argo runs a server, a repo-server, an application controller, and a Redis, plus dex, the ApplicationSet controller, and the notifications controller in a full install. The cost is not the resource consumption, it is the operational surface. Four pods to think about versus seven.

None of these are dealbreakers for Argo. If our team had already been running Argo elsewhere, we would have stayed on Argo. The decision was made on the margin.

Repo layout for five clusters

The repository structure took two iterations to get right. The first attempt had a single manifests/ directory and per-environment branches, which is a pattern you see in older Flux v1 docs. Branches as environments are a footgun: PR review against an environment branch hides the cross-environment drift, and merging dev into prod becomes a giant blob. We threw that out after week three.

The shape that worked, and that is still in production, is environment-as-directory on a single main branch:

flux-clusters/
├── clusters/
│   ├── prod-eu/
│   │   ├── flux-system/            # bootstrapped by `flux bootstrap`
│   │   │   ├── gotk-components.yaml
│   │   │   ├── gotk-sync.yaml
│   │   │   └── kustomization.yaml
│   │   ├── infrastructure.yaml     # Kustomization → ../../infrastructure
│   │   ├── apps.yaml               # Kustomization → app overlays under ../../apps
│   ├── prod-us/
│   ├── prod-ap/
│   ├── stage/
│   └── dev/
├── infrastructure/
│   ├── ingress-nginx/
│   │   ├── namespace.yaml
│   │   ├── release.yaml            # HelmRelease
│   │   └── kustomization.yaml
│   ├── cert-manager/
│   ├── external-dns/
│   └── sealed-secrets/
└── apps/
    ├── payments-api/
    │   ├── base/
    │   │   ├── deployment.yaml
    │   │   ├── service.yaml
    │   │   └── kustomization.yaml
    │   └── overlays/
    │       ├── prod-eu/
    │       │   ├── kustomization.yaml
    │       │   ├── replicas-patch.yaml
    │       │   └── ingress-patch.yaml
    │       ├── prod-us/
    │       ├── prod-ap/
    │       ├── stage/
    │       └── dev/
    └── catalog-api/

The reconciliation root for each cluster is clusters/<env>/. That directory contains exactly two Kustomization CRs at the cluster level: one that points at infrastructure/ (the shared platform bits, identical across environments) and one that points at an environment-specific overlay path under apps/. Everything else hangs off those two roots.

A cluster-level Kustomization CR for prod-eu looks like this:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  retryInterval: 2m
  timeout: 5m
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./apps/payments-api/overlays/prod-eu
  prune: true
  wait: true
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: payments-api
      namespace: payments

prune: true is the bit that makes drift correction two-way. If a resource that Flux previously applied is removed from git, it gets deleted from the cluster at the next reconciliation. wait: true blocks the reconciliation from reporting healthy until the named health checks pass. interval: 10m is the longest the cluster can go between reconciliations; the practical lag when something changes in git is closer to 90 seconds, because the source-controller polls on a tighter interval and the kustomize-controller wakes as soon as the source updates.
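
When you do not want to wait out the interval at all, the CLI will force a reconciliation on demand. This is the standard flux CLI, nothing custom; we mostly reach for it during incident review:

# pull the latest commit first, then apply immediately
flux reconcile kustomization apps --namespace flux-system --with-source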

The GitRepository CR that everything else references is small:

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m
  ref:
    branch: main
  url: https://dev.azure.com/contoso/_git/flux-clusters
  secretRef:
    name: azure-devops-pat

The azure-devops-pat secret holds a deploy-scoped PAT with read-only access to that one repo. It is a managed secret that rotates every 90 days; the rotation itself is the only piece of imperative work in this whole system, and we are slowly migrating it to a workload identity pattern once Azure DevOps catches up with federated git auth for Flux.
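
For reference, the secret itself is created (and re-created at each rotation) with the flux CLI. A sketch, where ADO_PAT is a placeholder environment variable holding the current PAT:

flux create secret git azure-devops-pat \
  --namespace=flux-system \
  --url=https://dev.azure.com/contoso/_git/flux-clusters \
  --username=git \
  --password="$ADO_PAT"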

The thing the Flux docs get right and that took me a while to internalise: interval on GitRepository is how often the source-controller polls git; interval on Kustomization is the longest the kustomize-controller will go between applies to the cluster, even if nothing changed. They are different timers. We tuned GitRepository.interval down to 1m because git polling is cheap; we left Kustomization.interval at 10m because reconciling against the API server is more expensive in aggregate.
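
Both timers are visible from the CLI, which is how we sanity-checked the tuning; again, standard flux commands:

flux get sources git -A         # REVISION shows the last commit the source-controller fetched
flux get kustomizations -A      # shows the revision each Kustomization last applied, plus Ready status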

The Azure DevOps Pipeline

There are two pipelines, not one. The first is bootstrap-cluster.yml, a manually-triggered pipeline that runs flux bootstrap against a freshly-provisioned AKS cluster. We run it exactly once per cluster, when the cluster is first stood up. The second is validate-pr.yml, a PR validation pipeline that runs on every pull request into the flux-clusters repo and does an offline kustomize build (via flux build) plus a kubeconform schema check.

Here is the bootstrap pipeline. It depends on the Bicep that provisions the AKS cluster having completed first, and on a workload-identity-federated service connection to Azure (separate write-up).

trigger: none
pr: none

parameters:
  - name: clusterName
    type: string
  - name: resourceGroup
    type: string
    default: rg-aks-platform

pool:
  vmImage: ubuntu-latest

variables:
  serviceConnection: 'sc-aks-platform'
  fluxVersion: '2.3.0'

stages:
  - stage: Bootstrap
    displayName: 'Flux bootstrap on ${{ parameters.clusterName }}'
    jobs:
      - deployment: Bootstrap
        environment: aks-cluster-init
        strategy:
          runOnce:
            deploy:
              steps:
                - checkout: self

                - task: Bash@3
                  displayName: 'Install flux CLI'
                  inputs:
                    targetType: inline
                    script: |
                      curl -s https://fluxcd.io/install.sh | sudo bash
                      flux --version

                - task: AzureCLI@2
                  displayName: 'Get AKS credentials'
                  inputs:
                    azureSubscription: $(serviceConnection)
                    scriptType: bash
                    scriptLocation: inlineScript
                    inlineScript: |
                      az aks get-credentials \
                        --resource-group ${{ parameters.resourceGroup }} \
                        --name ${{ parameters.clusterName }} \
                        --overwrite-existing

                - task: AzureCLI@2
                  displayName: 'flux bootstrap'
                  env:
                    AZURE_DEVOPS_EXT_PAT: $(adoPat)
                  inputs:
                    azureSubscription: $(serviceConnection)
                    scriptType: bash
                    scriptLocation: inlineScript
                    inlineScript: |
                      flux bootstrap git \
                        --url=https://dev.azure.com/contoso/_git/flux-clusters \
                        --branch=main \
                        --password=$AZURE_DEVOPS_EXT_PAT \
                        --token-auth=true \
                        --path=clusters/${{ parameters.clusterName }} \
                        --components-extra=image-reflector-controller,image-automation-controller \
                        --version=v$(fluxVersion)

The bootstrap command does three things in one call: it installs the Flux controllers in the flux-system namespace (the four core ones, plus the two image-automation extras we asked for with --components-extra), it writes the flux-system/ directory contents into the git repo on the path we specified, and it creates the initial GitRepository + Kustomization CRs in the cluster pointing at that path. After it finishes, the cluster is self-driving. Every subsequent change to anything in clusters/<env>/, infrastructure/, or apps/ is propagated by the running controllers, not by the pipeline.
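
After the pipeline finishes, a quick smoke test from a workstation confirms the controllers and the two roots are healthy. flux check and flux get are the standard tools for this:

flux check                      # verifies the controller deployments and CRD versions
flux get kustomizations -A      # flux-system, infrastructure and apps should all report Ready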

The PR validation pipeline is the one that runs on every pull request:

trigger: none
pr:
  branches:
    include: [main]
  paths:
    include:
      - clusters/**
      - infrastructure/**
      - apps/**

pool:
  vmImage: ubuntu-latest

variables:
  fluxVersion: '2.3.0'

jobs:
  - job: ValidateAll
    displayName: 'kustomize build + flux diff + kubeconform'
    strategy:
      matrix:
        prod-eu:
          clusterPath: clusters/prod-eu
        prod-us:
          clusterPath: clusters/prod-us
        prod-ap:
          clusterPath: clusters/prod-ap
        stage:
          clusterPath: clusters/stage
        dev:
          clusterPath: clusters/dev
    steps:
      - checkout: self
        fetchDepth: 0

      - task: Bash@3
        displayName: 'Install tooling'
        inputs:
          targetType: inline
          script: |
            curl -s https://fluxcd.io/install.sh | sudo bash
            curl -sLO https://github.com/yannh/kubeconform/releases/download/v0.6.6/kubeconform-linux-amd64.tar.gz
            tar -xzf kubeconform-linux-amd64.tar.gz
            sudo mv kubeconform /usr/local/bin/

      - task: Bash@3
        displayName: 'kustomize build $(clusterPath)'
        inputs:
          targetType: inline
          script: |
            set -euo pipefail
            # --dry-run keeps the build fully offline; this agent has no cluster credentials
            flux build kustomization apps \
              --path=$(clusterPath) \
              --kustomization-file=$(clusterPath)/apps.yaml \
              --dry-run \
              > /tmp/built.yaml
            wc -l /tmp/built.yaml

      - task: Bash@3
        displayName: 'kubeconform schema check'
        inputs:
          targetType: inline
          script: |
            kubeconform \
              -strict \
              -ignore-missing-schemas \
              -schema-location default \
              -schema-location 'https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/{{.Group}}/{{.ResourceKind}}_{{.ResourceAPIVersion}}.json' \
              /tmp/built.yaml

The matrix runs the same job across all five clusters in parallel. A PR that breaks prod-eu but works on the others fails the matrix cell for prod-eu and is visible in the PR check status. The whole validation completes in about 3 minutes on a cold agent.

The pipeline never runs kubectl apply. It never holds cluster credentials beyond the bootstrap pipeline's one-time use. Pull requests are gated on the validation matrix passing and on a human reviewer approving. Once merged, the cluster pulls the change itself.
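
The gate itself is an Azure Repos branch policy on main. We configured ours in the UI, but the az CLI equivalent (azure-devops extension) is roughly this; the two IDs are placeholders for your repository GUID and the validate-pr pipeline's definition ID:

az repos policy build create \
  --repository-id <repo-guid> \
  --branch main \
  --build-definition-id <validate-pr-definition-id> \
  --display-name 'GitOps PR validation' \
  --blocking true \
  --enabled true \
  --manual-queue-only false \
  --queue-on-source-update-only true \
  --valid-duration 720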

This separation is documented in the Azure Pipelines deployment jobs page under "approval gates," but the pattern that matters here is the absence of a deploy job. The pipeline is a validator. The cluster is the deployer.

Sealed Secrets, and why not SOPS

The team's first instinct was SOPS plus Azure Key Vault, because SOPS with a Key Vault key would have let us use workload identity to decrypt at the controller pod level. We tried it for a fortnight. It worked. We backed out because rotating the encryption key required re-encrypting every secret in the repo, which is 47 files at last count, and the operators wanted to be able to rotate quarterly without a multi-hour rebase exercise.

Sealed Secrets does the inverse trade-off. The controller in each cluster holds a private key. You encrypt a secret with the cluster's public key and commit the resulting SealedSecret to git. The controller decrypts at apply time and creates the live Secret resource. Rotating the key does not require re-encrypting old secrets, because the controller keeps old keys around and accepts ciphertext encrypted under any historical key.

The install lives in infrastructure/sealed-secrets/release.yaml as a HelmRelease:

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: sealed-secrets
  namespace: sealed-secrets
spec:
  interval: 30m
  chart:
    spec:
      chart: sealed-secrets
      version: '2.15.x'
      sourceRef:
        kind: HelmRepository
        name: sealed-secrets
        namespace: flux-system
  values:
    fullnameOverride: sealed-secrets-controller
    image:
      repository: docker.io/bitnami/sealed-secrets-controller
    keyrenewperiod: 720h
    rbac:
      pspEnabled: false
    metrics:
      serviceMonitor:
        enabled: true

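The sourceRef above points at a HelmRepository CR in flux-system that the repo tree does not show. It is one small manifest; the URL is the project's official chart repo:

apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: sealed-secrets
  namespace: flux-system
spec:
  interval: 1h
  url: https://bitnami-labs.github.io/sealed-secrets
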
A SealedSecret resource looks like a normal Kubernetes manifest with the ciphertext inline. To produce one, the developer pipes a plain Secret into kubeseal and commits the output:

kubectl create secret generic payments-db-password \
  --namespace payments \
  --from-literal=password='hunter2-not-really' \
  --dry-run=client -o yaml \
  | kubeseal \
      --controller-namespace sealed-secrets \
      --controller-name sealed-secrets-controller \
      --format yaml \
      --cert public-keys/prod-eu.pem \
      > apps/payments-api/overlays/prod-eu/payments-db-password.sealed.yaml

The committed file looks like this, abbreviated:

apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: payments-db-password
  namespace: payments
spec:
  encryptedData:
    password: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEq...
  template:
    metadata:
      name: payments-db-password
      namespace: payments
    type: Opaque

The ciphertext is safe to commit. Only the controller in the target cluster, holding the matching private key, can decrypt it.

The gotcha with five clusters and one SealedSecret

The first time we tried to add a SealedSecret that needed to exist across all five clusters, the platform team hit a wall. Each cluster's Sealed Secrets controller had generated its own private key at install. The ciphertext encrypted with prod-eu.pem could only be decrypted by the prod-eu controller. We had five different ciphertexts for what was conceptually one secret, and the controller pods kept restarting in two of the clusters because the operator had accidentally committed a SealedSecret encrypted with the wrong cluster's key; the controller failed to decrypt it, retried, hit the failure threshold, and crashlooped.

The fix had two parts. First, we generated a single RSA keypair offline, stored it in an Azure Key Vault secret called sealed-secrets-shared-key, and configured the Sealed Secrets controller in every cluster to use that key on boot via the Secrets Store CSI Driver. Now all five controllers use the same private key, which means a SealedSecret encrypted once works in all five clusters.

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: sealed-secrets-key
  namespace: sealed-secrets
spec:
  provider: azure
  parameters:
    usePodIdentity: 'false'
    useVMManagedIdentity: 'true'
    userAssignedIdentityID: <client-id-of-aks-kubelet-identity>
    keyvaultName: kv-flux-shared
    objects: |
      array:
        - |
          objectName: sealed-secrets-shared-tls-key
          objectType: secret
        - |
          objectName: sealed-secrets-shared-tls-crt
          objectType: secret
    tenantId: <tenant-id>
  secretObjects:
    - secretName: sealed-secrets-key
      type: kubernetes.io/tls
      data:
        - objectName: sealed-secrets-shared-tls-key
          key: tls.key
        - objectName: sealed-secrets-shared-tls-crt
          key: tls.crt

Second, the HelmRelease for the controller was updated to consume that secret as its key source instead of generating its own, and the sealedsecrets.bitnami.com/sealed-secrets-key: active label on the secret is how the controller finds it.
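
A sketch of those two pieces, hedged against the exact chart version in play: the secretObjects entry grows the label the controller looks for, and the chart's secretName value points at the synced secret. One wrinkle worth knowing: the CSI driver only materialises secretObjects once a pod actually mounts the SecretProviderClass volume.

# SecretProviderClass.secretObjects, extended with the label the
# controller uses to recognise an active key
secretObjects:
  - secretName: sealed-secrets-key
    type: kubernetes.io/tls
    labels:
      sealedsecrets.bitnami.com/sealed-secrets-key: active
    data:
      - objectName: sealed-secrets-shared-tls-key
        key: tls.key
      - objectName: sealed-secrets-shared-tls-crt
        key: tls.crt

# HelmRelease values, pointing the chart at that existing secret
# instead of letting the controller generate one
values:
  secretName: sealed-secrets-key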

The trade-off is real and worth naming. A shared private key across five clusters means a compromise of any one cluster's pod-spec gives an attacker the ability to decrypt SealedSecrets that were intended for the other four. We accepted that risk because all five clusters are in the same security boundary already (same Entra tenant, same operations team, same threat model). If they were not, we would have run per-cluster keys and accepted the operational cost of five ciphertexts per secret.

The Bitnami Sealed Secrets docs cover the bring-your-own-key path. The piece they do not cover, that we figured out by reading the controller's source, is that key rotation under a shared key still works: the controller continues to accept ciphertext encrypted under previously-active keys, so a quarterly rotation in Key Vault does not break old SealedSecrets.
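
The quarterly rotation is then just a new keypair pushed into Key Vault. Roughly this, assuming the vault and secret names from earlier:

# generate a fresh keypair offline
openssl req -x509 -nodes -newkey rsa:4096 -days 365 \
  -keyout tls.key -out tls.crt -subj '/CN=sealed-secrets'

# replace the shared key material; the CSI driver's rotation poll
# (2m in the Bicep below) propagates it to all five clusters
az keyvault secret set --vault-name kv-flux-shared \
  --name sealed-secrets-shared-tls-key --file tls.key
az keyvault secret set --vault-name kv-flux-shared \
  --name sealed-secrets-shared-tls-crt --file tls.crt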

The Bicep that bootstraps each AKS cluster

The cluster itself is provisioned by Bicep before Flux ever touches it. The Bicep is identical across the five clusters; only the parameter file varies. The relevant pieces for GitOps are workload identity, the kubelet identity that pulls from Key Vault for Sealed Secrets, and the Secret Store CSI driver add-on.

@description('Cluster name, e.g., aks-prod-eu')
param clusterName string

@description('Region for the AKS control plane')
param location string = 'westeurope'

@description('Object ID of the platform engineers AAD group, gets cluster-admin')
param platformEngineersObjectId string

resource aks 'Microsoft.ContainerService/managedClusters@2024-05-01' = {
  name: clusterName
  location: location
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    kubernetesVersion: '1.29.4'
    dnsPrefix: clusterName
    agentPoolProfiles: [
      {
        name: 'system'
        count: 3
        vmSize: 'Standard_D4ds_v5'
        mode: 'System'
        osDiskType: 'Ephemeral'
        availabilityZones: ['1', '2', '3']
      }
      {
        name: 'apps'
        count: 4
        vmSize: 'Standard_D8ds_v5'
        mode: 'User'
        osDiskType: 'Ephemeral'
        availabilityZones: ['1', '2', '3']
        enableAutoScaling: true
        minCount: 4
        maxCount: 20
      }
    ]
    addonProfiles: {
      azureKeyvaultSecretsProvider: {
        enabled: true
        config: {
          enableSecretRotation: 'true'
          rotationPollInterval: '2m'
        }
      }
    }
    aadProfile: {
      managed: true
      enableAzureRBAC: true
      adminGroupObjectIDs: [platformEngineersObjectId]
    }
    oidcIssuerProfile: {
      enabled: true
    }
    securityProfile: {
      workloadIdentity: {
        enabled: true
      }
    }
    networkProfile: {
      networkPlugin: 'azure'
      // Azure CNI powered by Cilium requires overlay mode (or a dedicated pod subnet)
      networkPluginMode: 'overlay'
      networkPolicy: 'cilium'
      networkDataplane: 'cilium'
    }
  }
}

output kubeletIdentityClientId string = aks.properties.identityProfile.kubeletidentity.clientId

The azureKeyvaultSecretsProvider addon and oidcIssuerProfile are the two AKS knobs that, together with the Sealed Secrets controller config above, make the shared-key pattern work. The AKS workload identity docs on Microsoft Learn cover the OIDC issuer side; the Key Vault CSI driver page covers the secret-mount side.
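
The one piece the module above does not show is the RBAC grant that lets the kubelet identity read the shared key out of kv-flux-shared. A hedged sketch, assuming the vault uses Azure RBAC and lives in the same resource group; Key Vault Secrets User is the built-in role:

@description('Existing Key Vault holding the shared sealed-secrets key')
resource kv 'Microsoft.KeyVault/vaults@2023-07-01' existing = {
  name: 'kv-flux-shared'
}

// Key Vault Secrets User (built-in role 4633458b-17de-408a-b874-0445c86b69e6),
// granted to the kubelet identity that the CSI driver runs under
resource kvSecretsUser 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(kv.id, aks.id, 'kv-secrets-user')
  scope: kv
  properties: {
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '4633458b-17de-408a-b874-0445c86b69e6')
    principalId: aks.properties.identityProfile.kubeletidentity.objectId
    principalType: 'ServicePrincipal'
  }
}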

Drift detection via the Notification Controller

The notification controller posts to Teams on every reconciliation. The Provider and Alert CRs live in infrastructure/flux-notifications/:

apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: teams-platform
  namespace: flux-system
spec:
  type: msteams
  channel: aks-platform-events
  secretRef:
    name: teams-webhook
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: all-cluster-events
  namespace: flux-system
spec:
  providerRef:
    name: teams-platform
  eventSeverity: info
  eventSources:
    - kind: Kustomization
      name: '*'
    - kind: HelmRelease
      name: '*'
    - kind: GitRepository
      name: '*'
  exclusionList:
    - 'reconciliation in progress'

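The teams-webhook secret the Provider references holds the incoming-webhook URL under an address key (the URL below is a placeholder). It goes into git the same way as every other secret here, sealed:

kubectl create secret generic teams-webhook \
  --namespace flux-system \
  --from-literal=address='https://contoso.webhook.office.com/...' \
  --dry-run=client -o yaml \
  | kubeseal --format yaml --cert public-keys/prod-eu.pem \
  > infrastructure/flux-notifications/teams-webhook.sealed.yaml
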
eventSeverity: info is the noisy setting, on purpose. Every successful reconciliation, every drift correction, every HelmRelease upgrade fires a message. On a quiet day, prod-eu produces around 47 events per hour, most of them no-op confirmations. The cluster prod-us, which has a much larger app footprint, sits closer to 90 events per hour.

The signal-to-noise tradeoff was discussed at length. Quieter alerts (warning and error only) would have been less spammy, but the team wanted the positive heartbeat: "if I do not see prod-ap posting in the last 30 minutes, something is wrong." The drift events are the ones operators actually act on, and they are easy to spot in the feed because they carry an applied resources: line listing what changed. A normal reconciliation says reconciliation succeeded after 1.2s. A drift correction says reconciliation succeeded after 14s applied resources: Deployment/payments/payments-api. The applied resources line is the tell.

The Sunday-night save that opened this article was visible in that channel within twelve minutes of the kubectl edit. Nobody was looking at 23:52 on a Sunday, but the message was there waiting on Monday morning, and the chain of evidence was complete: an edit happened, a reconciliation reverted it, the cluster came back to a known-good state from git.

Multi-cluster fan-out with shared infrastructure

The whole point of five clusters sharing the infrastructure/ directory is that platform changes propagate uniformly. Bumping the ingress-nginx chart from 4.10.0 to 4.10.1 is one line in infrastructure/ingress-nginx/release.yaml:

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
spec:
  interval: 15m
  chart:
    spec:
      chart: ingress-nginx
      version: '4.10.1'    # was 4.10.0
      sourceRef:
        kind: HelmRepository
        name: ingress-nginx
        namespace: flux-system
  install:
    crds: CreateReplace
  upgrade:
    crds: CreateReplace
    remediation:
      retries: 3
  values:
    controller:
      replicaCount: 3
      service:
        annotations:
          service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: /healthz
      metrics:
        enabled: true
        serviceMonitor:
          enabled: true
      resources:
        requests:
          cpu: 200m
          memory: 256Mi
        limits:
          memory: 512Mi

That single edit, merged to main, propagates to all five clusters within the source-controller's 60-second poll interval. The kustomize-controller in each cluster wakes up, sees the source has updated, runs through its reconciliation, and the helm-controller picks up the HelmRelease change and runs the chart upgrade. Within roughly 4 minutes, all five clusters are on 4.10.1. We have done this kind of fan-out 23 times in the last quarter and have not once needed to log in to a cluster manually.
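
Watching a fan-out land is a one-liner per cluster from a workstation with all five contexts configured. The context names here are our local kubeconfig convention, not anything Flux knows about:

for ctx in aks-prod-eu aks-prod-us aks-prod-ap aks-stage aks-dev; do
  echo "== $ctx"
  # the REVISION column shows the chart version the helm-controller last applied
  flux --context "$ctx" get helmreleases -n ingress-nginx
done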

For app-level differences, the overlay pattern carries the divergence. apps/payments-api/base/ has the deployment with a placeholder replica count. The overlays/prod-eu/ directory carries a patch that bumps replicas to 6 for the EU region's higher traffic. The overlays/dev/ directory carries a patch that drops it to 1 and replaces the image with the latest dev tag. The base file is shared. The patches are tiny:

# apps/payments-api/overlays/prod-eu/replicas-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 6

# apps/payments-api/overlays/prod-eu/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
  - payments-db-password.sealed.yaml
patches:
  - path: replicas-patch.yaml
  - path: ingress-patch.yaml
images:
  - name: payments-api
    newTag: '2.14.7'

The image tag pin is the version gate. Promoting a build from stage to prod-eu is a PR that changes newTag in the prod-eu overlay. The same build, the same image SHA, just moved up the environment ladder. Five clusters, five different image tags at any moment, all driven by the overlays.
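
The promotion PR is therefore a one-line diff against the overlay. With hypothetical version numbers it looks like this:

# apps/payments-api/overlays/prod-eu/kustomization.yaml
 images:
   - name: payments-api
-    newTag: '2.14.7'
+    newTag: '2.14.8'   # promoted from stage; same image SHA, new pin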

Troubleshooting

Kustomization 'prod-eu/apps' failed: build failed: accumulating resources: invalid Kustomization: trouble building patches: failed to load patch file: open replicas-patch.yaml: no such file or directory means the overlay's kustomization.yaml references a patch file that does not exist at the path it expects. The kustomize-controller runs the build in a temp directory, so relative paths are resolved relative to the kustomization file, not the repo root. Fix: confirm the file exists at the path written in the kustomization, including capitalisation; a rename done on case-insensitive macOS will look fine locally and then fail on the case-sensitive Linux build agents.

helm-controller: chart pull error: 403 Forbidden from a HelmRelease against a private Helm repo means the HelmRepository resource is missing its secretRef, or the secret it references has stale credentials. In our setup, this came up once when we rotated the Azure Container Registry pull token and forgot to update the secret in flux-system. The fix is to update the secret and either wait for the next reconciliation interval or run flux reconcile source helm <name> manually.

kustomize build failed: trouble configuring builtin PatchTransformer with config: failed to find unique target for patch is a kustomize error that fires when a strategic merge patch matches multiple resources because of a missing target selector. We saw this when an overlay's patch targeted kind: Deployment without naming which deployment, and the base had two deployments. Fix: add metadata.name to the patch's target.

Kustomization stalled: dependency 'flux-system/infrastructure' is not ready is what you see when a Kustomization declares dependsOn and the dependency has not finished its first reconciliation. The fix is usually to wait; the controller retries automatically. If it persists, run flux get kustomizations -A and look for the one that is False, then flux logs --kind=Kustomization --name=<the-stuck-one> for the actual error.

unable to verify signed metadata: signature verification failed from the source-controller on a GitRepository means the ref no longer matches what the controller previously fetched and verified, usually because somebody force-pushed over it. We do not sign commits in this repo, so in practice this only fires when a feature branch gets force-rebased after we have already pulled it for testing. The fix is to delete the cached git ref on the source-controller pod and let it re-fetch.

HelmRelease ingress-nginx fails post-upgrade hook: timed out waiting for the condition is the helm-controller equivalent of a Helm install hanging. It almost always means a pod the chart is waiting for cannot become ready. kubectl describe pod on the new ingress controller pod usually tells you what is wrong, which in our experience has been "image pull failed because the new tag does not exist," "PodDisruptionBudget too tight for the rolling update to make progress," or "a webhook from cert-manager is blocking the new deployment because it cannot reach the API server."

What the steady state looks like

Eight months in, the operational shape of the five-cluster GitOps setup is genuinely calm. Pull requests into the flux-clusters repo are the only path by which prod state changes. The PR validation pipeline catches schema errors and broken kustomize builds before the merge. The cluster pulls the change within 60 to 90 seconds of merge. Reconciliation completes within another minute or two. The notification controller posts the result to Teams. The audit log is the git commit log, and the live state matches git to within 10 minutes at any moment, which is the Kustomization reconciliation interval.

Drift is detected and corrected automatically. The Sunday-night kubectl edit story has happened twice since the original. Both times Flux reverted the change before the next morning. Both times the engineer who made the edit found out about it from the Teams channel, not from an incident. One of them apologised; the other did not realise the system would catch it. Both are fine outcomes.

Cluster bootstrap, the part that used to be a half-day pairing exercise, is now the bootstrap pipeline plus roughly four minutes of flux reconcile source git flux-system follow-up. We have stood up two new dev-tier clusters in the last quarter using this path, and the engineers doing it worked solo without paging anyone. The vending is solid enough that we stopped tracking the per-cluster bootstrap time in our internal metrics.

The thing I keep coming back to is the inversion of where work happens. Before Flux, the pipeline was the actor and the cluster was the passive thing being changed. The audit story was "who triggered this pipeline run, when, and what did it apply." After Flux, the cluster is the actor and the pipeline is a validator. The audit story is "what is in git, and what reconciliation events happened in the last 10 minutes." That is a structurally smaller surface to reason about, and it is the reason the platform team's on-call load on the AKS clusters specifically has dropped from roughly four pages a month to one. Most weeks, nothing pages. The Sunday-night save would have been a Monday-morning incident in the old world. In the new one, it was a footnote, and that is the whole point.