
OPA Gatekeeper on AKS: 14 constraints, an 11:47 deny, and what code review had been missing

An 11:47 deny on a Tuesday blocked a hostPath docker socket mount the human reviewers had missed. The catalogue of 14 constraints, the Rego under the hood, and the four-week rollout from dryrun to deny across five clusters.

18 min read · OPA Gatekeeper · AKS · Policy as code · Rego

At 11:47 on a Tuesday in February, a deploy got blocked by a policy nobody on the dev team had seen before, which is exactly the result we had been working toward for the previous four months. The PR was a routine debug change. An engineer was chasing a flaky container start in a staging namespace and, while doing it, had added a small pod spec to the manifest folder with hostPath: /var/run/docker.sock mounted into the container so the pod could shell into the Docker daemon on the node. He needed it for one debugging session. He intended to revert it before merge. He forgot, and the PR was approved by two reviewers anyway, because this is exactly what code review on this repo had been silently missing for two years: nobody was reading the volumes section closely enough to spot a docker socket mount.

OPA Gatekeeper, running inside our PR validation pipeline as gator test against an in-repo policy library, caught it. The PR build failed with [K8sDisallowHostPath] container 'debug' mounts hostPath '/var/run/docker.sock' which is not in the allowed list. The engineer pinged me confused, then read the failure, then said "oh god, yeah, I should not have done that." The volumes section was reverted and the PR re-pushed. Total elapsed time from "PR opened" to "policy violation surfaced": eleven seconds. Total elapsed time without the policy: probably the lifetime of that hostPath mount inside the cluster, because the existing review process had genuinely never caught one. We checked. Going back through two years of merged PRs, three other hostPath mounts had slipped in through routine review. None of them had gone to prod; all of them had landed in staging and been removed weeks later by accident. The review surface was clearly leaky.

That was the moment the policy library stopped being a side project and started being the way we work. This is the catalogue: 14 constraints, what each one catches, the Rego under the hood for the four heaviest, the Azure DevOps pipeline that runs them on every PR, and the rollout story that took us from dryrun to deny across five clusters over four weeks.

Why vanilla OPA Gatekeeper and not Azure Policy for Kubernetes

The first call we had to make was which Gatekeeper to use. There is the upstream open-policy-agent/gatekeeper project, and there is Azure Policy for Kubernetes, which wraps Gatekeeper and adds a layer that surfaces compliance state into the Azure Policy compliance dashboard. We tried Azure Policy first. It works, the dashboard is genuinely nice, and if you only need the built-in policy definitions (no privileged containers, no host network, allowed registries, the usual suspects) it is the right answer.

We did not pick it. The reason was iteration speed on custom Rego. The Azure Policy for Kubernetes flow for a custom policy is: write the ConstraintTemplate, upload it as a custom Azure Policy definition through the Azure portal or CLI, assign it to the cluster, wait for the agent to sync, then test. That round trip is roughly ten minutes on a good day. With upstream Gatekeeper installed via Helm directly into the cluster and a gator test runner on every PR, the loop is roughly fifteen seconds: edit the template file in the repo, run gator test locally, see the result.

What we gave up: the Azure Policy compliance dashboard view across our subscription. We accept that. Our compliance state lives in the audit-mode output of Gatekeeper itself, queried with kubectl get constraints and surfaced into a small Grafana panel we built. It is less polished than Azure's view. It is also faster to inspect because it lives in the same place as the rest of our cluster telemetry.

The other thing we considered, briefly, was Kyverno. It has a friendlier policy language (YAML, not Rego) and a richer mutation story. We were already four months into the Rego learning curve when we hit the comparison, so the switching cost would have erased the language advantage. If we were starting today, the call would be closer.
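
For a flavour of what we would have written instead, here is the label requirement as a Kyverno ClusterPolicy. A minimal sketch, assuming Kyverno-era syntax circa 1.10; we never ran this in anger:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-recommended-labels
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-name-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
                - StatefulSet
                - DaemonSet
      validate:
        message: "app.kubernetes.io/name is required"
        pattern:
          metadata:
            labels:
              app.kubernetes.io/name: "?*"

No Rego, no ConstraintTemplate boilerplate, one document per policy. That is the language advantage we decided not to chase.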

Installing Gatekeeper, and the namespace shape

Gatekeeper itself is a Helm install. Nothing exotic.

helm repo add gatekeeper https://open-policy-agent.github.io/gatekeeper/charts
helm repo update

helm install gatekeeper gatekeeper/gatekeeper \
  --version 3.16.0 \
  --namespace gatekeeper-system \
  --create-namespace \
  --set replicas=3 \
  --set audit.replicas=1 \
  --set auditInterval=300 \
  --set constraintViolationsLimit=100 \
  --set logLevel=INFO \
  --set image.crdRepository=openpolicyagent/gatekeeper-crds \
  --set validatingWebhookTimeoutSeconds=5 \
  --set mutatingWebhookTimeoutSeconds=3

Three settings matter for production. validatingWebhookTimeoutSeconds=5 because the default of 3 seconds occasionally bit us during cold starts on the webhook pods; bumping to 5 stopped the intermittent webhook timeout API errors during rolling cluster upgrades. auditInterval=300 runs the audit loop every five minutes, which is frequent enough that violations introduced by something other than kubectl apply (a templated Argo CD sync, for example) show up before someone notices. constraintViolationsLimit=100 caps how many violations are recorded per constraint in the audit, which matters because one misbehaving Helm chart can flood that field on day one.

We exclude kube-system, gatekeeper-system, and the AKS-managed kube-public from all constraints via a single Config resource:

apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
  namespace: gatekeeper-system
spec:
  match:
    - excludedNamespaces:
        - kube-system
        - kube-public
        - kube-node-lease
        - gatekeeper-system
        - calico-system
        - tigera-operator
      processes:
        - "*"
  sync:
    syncOnly:
      - group: ""
        version: "v1"
        kind: "Namespace"
      - group: ""
        version: "v1"
        kind: "Pod"
      - group: "networking.k8s.io"
        version: "v1"
        kind: "NetworkPolicy"
      - group: "policy"
        version: "v1"
        kind: "PodDisruptionBudget"

The syncOnly block is what lets policies reference resources they did not directly receive (referential constraints, used below for the NetworkPolicy and PDB requirements). Without it, the audit log fills with "data.inventory.cluster.networking.k8s.io.v1.NetworkPolicy is undefined" errors.

The 14 constraint families

The list, organised by what they protect.

Security (four). K8sDisallowPrivileged blocks any container with securityContext.privileged: true. K8sDisallowHostPath blocks hostPath volumes except for a tightly bounded allow list (/var/log for the logging daemonset, nothing else). K8sDisallowHostNetwork blocks pods with hostNetwork: true outside the ingress namespace. K8sRestrictCapabilities blocks adding any capability not on a small allowlist (we permit NET_BIND_SERVICE for one specific deployment that needs port 80 as non-root).

Reliability (three). K8sRequireResourceLimits requires every container to declare CPU and memory requests and limits, with min and max bands so somebody asking for 64Gi memory on a 16Gi node gets caught at PR time. K8sRequireReadinessProbe requires a readinessProbe on every container in a Deployment, StatefulSet, or DaemonSet (jobs and init containers are exempt). K8sRequireImagePullPolicy requires imagePullPolicy: Always or IfNotPresent, never the default Kubernetes inference, because the default depends on tag shape and silently flips between the two.

Hygiene (three). K8sRequiredLabels requires the three Kubernetes recommended labels on every workload: app.kubernetes.io/name, app.kubernetes.io/version, app.kubernetes.io/part-of. K8sAllowedRegistries constrains container images to two registries: myacr.azurecr.io (our Azure Container Registry) and mcr.microsoft.com. K8sDisallowLatestTag blocks any image reference that ends in :latest or has no tag at all.

Operational (four). K8sRequireNetworkPolicy is a referential constraint that fails a namespace if no NetworkPolicy exists in it. K8sRequirePDB requires a PodDisruptionBudget for any Deployment or StatefulSet with replicas > 1. K8sRequireServiceAccount blocks pods that use the default service account. K8sBlockAutomountServiceAccountToken requires automountServiceAccountToken: false unless an annotation explicitly opts in.
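
K8sRequireNetworkPolicy is the referential one, and it is the reason the syncOnly block exists. A sketch of its core, assuming the inventory layout Gatekeeper uses for synced namespaced resources:

package k8srequirenetworkpolicy

# Fires when a Namespace has no NetworkPolicy in the synced inventory.
violation[{"msg": msg}] {
  input.review.object.kind == "Namespace"
  ns := input.review.object.metadata.name
  not has_networkpolicy(ns)
  msg := sprintf("namespace %v does not have a NetworkPolicy", [ns])
}

has_networkpolicy(ns) {
  # data.inventory is populated by the Config resource's syncOnly list
  data.inventory.namespace[ns]["networking.k8s.io/v1"]["NetworkPolicy"][_]
}

In practice the audit sweep, not admission, is what reports this one, because a namespace usually exists before its first NetworkPolicy does.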

Fourteen. The categories overlap slightly (a privileged container without resource limits trips two constraints, not one), and we are fine with that. Multiple failures on one PR are information, not noise.

K8sRequiredLabels: the simplest, the most boring, the most useful

The first constraint we wrote, and still the one that catches the most violations.

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: object
                properties:
                  key:
                    type: string
                  allowedRegex:
                    type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_].key}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("workload %v/%v is missing required labels: %v",
            [input.review.object.kind, input.review.object.metadata.name, missing])
        }

        violation[{"msg": msg}] {
          value := input.review.object.metadata.labels[key]
          expected := input.parameters.labels[_]
          expected.key == key
          expected.allowedRegex != ""
          not regex.match(expected.allowedRegex, value)
          msg := sprintf("label %v has value %v which does not match regex %v",
            [key, value, expected.allowedRegex])
        }

The two violation blocks are independent. The first catches missing labels. The second catches labels present but malformed (a version of dev, for example, fails the semver regex). Gatekeeper combines them, so a single PR can fail on both.
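
A hypothetical fixture that trips both rules at once, the kind you might keep alongside the template:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
  labels:
    app.kubernetes.io/name: example
    app.kubernetes.io/version: dev   # present but fails the semver regex
    # app.kubernetes.io/part-of is absent, so the missing-labels rule fires too
spec: {}                             # elided; irrelevant to the label checks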

The constraint that consumes it:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: workloads-must-have-recommended-labels
spec:
  enforcementAction: deny
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet", "DaemonSet"]
      - apiGroups: ["batch"]
        kinds: ["CronJob", "Job"]
    excludedNamespaces:
      - kube-system
      - gatekeeper-system
  parameters:
    labels:
      - key: "app.kubernetes.io/name"
      - key: "app.kubernetes.io/version"
        allowedRegex: "^\\d+\\.\\d+\\.\\d+(-[a-z0-9.-]+)?$"
      - key: "app.kubernetes.io/part-of"

The audit, three weeks after we shipped this constraint in deny mode, caught two surprises. First, a Helm chart we did not own (a community-maintained Prometheus exporter) used app instead of app.kubernetes.io/name. We patched it with a one-line postRender. Second, a CronJob template in a long-forgotten repo had no labels at all. The CronJob had not been edited in two years and was still running nightly. The label requirement exposed it; we used the exposure as a prompt to audit the rest of the CronJob inventory and retired three jobs that nobody could explain.

K8sAllowedRegistries: the chart with the templated image

This is the one with the gotcha I want to spend real time on.

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sallowedregistries
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedRegistries
      validation:
        openAPIV3Schema:
          type: object
          properties:
            registries:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sallowedregistries

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          image := container.image
          not registry_allowed(image)
          msg := sprintf("container %v uses image %v which is not from an allowed registry; allowed: %v",
            [container.name, image, input.parameters.registries])
        }

        violation[{"msg": msg}] {
          container := input.review.object.spec.initContainers[_]
          image := container.image
          not registry_allowed(image)
          msg := sprintf("initContainer %v uses image %v which is not from an allowed registry",
            [container.name, image])
        }

        registry_allowed(image) {
          registry := input.parameters.registries[_]
          startswith(image, registry)
        }

And the constraint:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRegistries
metadata:
  name: only-acr-and-mcr
spec:
  enforcementAction: deny
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    registries:
      - "myacr.azurecr.io/"
      - "mcr.microsoft.com/"

The gotcha. Our flagship app is deployed via a HelmRelease, where the image is set with image: {{ .Values.image.repository }}:{{ .Values.image.tag }}. When we ran gator test against the raw chart manifest (without rendering), the image string in the test fixture was literally {{ .Values.image.repository }}:{{ .Values.image.tag }}, which obviously does not start with myacr.azurecr.io/. The test failed. The chart was fine. The test was wrong.

The fix took two attempts. The first was to maintain a separate set of pre-rendered manifests for testing, which we abandoned within a week because nobody remembered to keep them in sync. The second, which we still use, is to render the chart in the PR pipeline before testing:

# In the PR pipeline, before gator test
helm template ./chart \
  --values ./chart/values.yaml \
  --values ./chart/values.test.yaml \
  --release-name app \
  --namespace app \
  > /tmp/rendered.yaml

gator test --filename /tmp/rendered.yaml --filename ./policies/

The render produces the same YAML the cluster will receive at install time. The image string in /tmp/rendered.yaml is the resolved myacr.azurecr.io/app:1.4.7, which passes. The policy is unchanged, the test fixture is the real manifest, and the round trip is honest.

We codified this pattern across all charts in the org with a script that walks every chart directory in the manifests repo, renders with the chart's own test values, and feeds the aggregate output to gator. Total render-plus-test time across roughly 40 charts: about 90 seconds.

K8sDisallowHostPath: the 11:47 deny, plus the one exception

The Rego is short.

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sdisallowhostpath
spec:
  crd:
    spec:
      names:
        kind: K8sDisallowHostPath
      validation:
        openAPIV3Schema:
          type: object
          properties:
            allowedHostPaths:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sdisallowhostpath

        violation[{"msg": msg}] {
          volume := input.review.object.spec.volumes[_]
          volume.hostPath
          path := volume.hostPath.path
          not allowed(path)
          not exception_granted
          msg := sprintf("pod %v uses hostPath '%v' which is not in the allow list",
            [input.review.object.metadata.name, path])
        }

        allowed(path) {
          allowed_path := input.parameters.allowedHostPaths[_]
          startswith(path, allowed_path)
        }

        exception_granted {
          input.review.object.metadata.annotations["policy.platform.io/allow-hostpath"] == "true"
        }

The exception path is the interesting bit. We do not believe a policy library should be inflexible enough that the platform team becomes a bottleneck. We do believe an exception should be loud. The constraint allows an annotation, policy.platform.io/allow-hostpath: "true", that turns off the check for that single pod. Adding the annotation requires a PR. PRs touching annotations on workloads trigger a CODEOWNERS rule that requires platform team review. Net effect: an exception is granted in code, reviewed by a human, and visible forever in the manifest.
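
CODEOWNERS matching is path-based, so the rule covers the workload manifest tree rather than annotations specifically; a sketch, with a hypothetical team handle:

# Changes to workload manifests, including annotation edits,
# require a platform-team reviewer
manifests/**/*.yaml    @org/platform-team
charts/**/templates/   @org/platform-team

The constraint itself: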

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sDisallowHostPath
metadata:
  name: no-hostpath-except-logging
spec:
  enforcementAction: deny
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    allowedHostPaths:
      - "/var/log/"

We have one active exception across all clusters. It is on the fluent-bit daemonset, which needs /var/lib/docker/containers to scrape logs. The annotation is there, the PR that added it has two platform-team approvals, and the audit dashboard shows zero hostPath violations because the legitimate one is annotated.
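
The relevant slice of that daemonset, reconstructed as a sketch (names assumed):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
spec:
  template:
    metadata:
      annotations:
        # Loud, reviewed, permanent: the one sanctioned hostPath exception
        policy.platform.io/allow-hostpath: "true"
    spec:
      volumes:
        - name: container-logs
          hostPath:
            path: /var/lib/docker/containers

The annotation sits on the pod template, not the daemonset's own metadata, because the constraint matches Pods and only sees what the daemonset stamps onto each one.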

The 11:47 deny on the debug pod, of course, had no annotation. The engineer was not trying to dodge policy; he had not heard of the annotation pattern. The conversation we had after the deny was about the annotation, not the deny. He used it the next day for a different pod where the hostPath was actually justified.

K8sRequireResourceLimits: min/max bands

The simplest version of this constraint checks that limits exist. Our version goes further: it also fails if the request or limit is outside a sensible band, because we had been bitten by a deploy that requested 200 cores by accident.

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequireresourcelimits
spec:
  crd:
    spec:
      names:
        kind: K8sRequireResourceLimits
      validation:
        openAPIV3Schema:
          type: object
          properties:
            minCPU:
              type: string
            maxCPU:
              type: string
            minMemory:
              type: string
            maxMemory:
              type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequireresourcelimits

        import future.keywords.in

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          not container.resources.limits.cpu
          msg := sprintf("container %v does not specify cpu limit", [container.name])
        }

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          not container.resources.limits.memory
          msg := sprintf("container %v does not specify memory limit", [container.name])
        }

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          cpu_limit_millicores(container.resources.limits.cpu) > parse_cpu(input.parameters.maxCPU)
          msg := sprintf("container %v cpu limit %v exceeds max %v",
            [container.name, container.resources.limits.cpu, input.parameters.maxCPU])
        }

        cpu_limit_millicores(value) = result {
          endswith(value, "m")
          result := to_number(trim_suffix(value, "m"))
        }

        cpu_limit_millicores(value) = result {
          not endswith(value, "m")
          result := to_number(value) * 1000
        }

        parse_cpu(value) = result {
          endswith(value, "m")
          result := to_number(trim_suffix(value, "m"))
        }

        parse_cpu(value) = result {
          not endswith(value, "m")
          result := to_number(value) * 1000
        }

The CPU parsing is the bit that took longest to get right. Kubernetes accepts CPU in two forms (500m for 500 millicores, 2 for 2 cores), and the Rego has to normalise both before comparing. The function pattern above, two rules with the same name and different bodies, is how Rego does dispatch by predicate. The first body fires when the value ends in m, the second fires otherwise.

The constraint, with bands tuned to our actual node sizes:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequireResourceLimits
metadata:
  name: workload-resource-bands
spec:
  enforcementAction: deny
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    minCPU: "10m"
    maxCPU: "8000m"
    minMemory: "16Mi"
    maxMemory: "32Gi"

maxCPU: 8000m is 8 cores. Our Standard_D8ds_v5 worker nodes have 8 cores; nothing should be requesting more than a node's worth of CPU on a single container, so anything that does is either a typo or a sign that whoever wrote the manifest is thinking about it wrong. maxMemory: 32Gi matches the memory of our largest worker pool. We do not want a single pod to be schedulable only on the largest node type; that introduces a fragility we have learned to detect at PR time.

The min values catch the opposite problem: a container that asks for 1m CPU and 1Mi memory because the developer never set the values and copy-pasted a placeholder. The audit flagged six of those in the first week.
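
The memory rules are elided from the template above; they follow the same dispatch-by-suffix pattern, normalising to bytes before comparing. A sketch, assuming we only ever see the Mi and Gi suffixes (true in our manifests, not in Kubernetes generally):

violation[{"msg": msg}] {
  container := input.review.object.spec.containers[_]
  parse_memory(container.resources.limits.memory) > parse_memory(input.parameters.maxMemory)
  msg := sprintf("container %v memory limit %v exceeds max %v",
    [container.name, container.resources.limits.memory, input.parameters.maxMemory])
}

violation[{"msg": msg}] {
  container := input.review.object.spec.containers[_]
  # if the request is absent this rule simply does not fire
  parse_memory(container.resources.requests.memory) < parse_memory(input.parameters.minMemory)
  msg := sprintf("container %v memory request %v is below min %v",
    [container.name, container.resources.requests.memory, input.parameters.minMemory])
}

parse_memory(value) = result {
  endswith(value, "Mi")
  result := to_number(trim_suffix(value, "Mi")) * 1048576
}

parse_memory(value) = result {
  endswith(value, "Gi")
  result := to_number(trim_suffix(value, "Gi")) * 1073741824
}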

A small ConfigMap that the policies share

Several constraints reference the same constants (the allowed registries, the cluster name, the environment). Hard-coding those across 14 constraint files is the path to a stale value somewhere. We use a ConfigMap that Gatekeeper syncs into the policy evaluation context:

apiVersion: v1
kind: ConfigMap
metadata:
  name: platform-policy-constants
  namespace: gatekeeper-system
  labels:
    gatekeeper.sh/sync: "true"
data:
  allowed-registries: |
    myacr.azurecr.io/
    mcr.microsoft.com/
  cluster-environment: "prod"
  cluster-region: "westeurope"
  exception-annotation-prefix: "policy.platform.io/"

The gatekeeper.sh/sync label is our own convention for marking which ConfigMaps are meant to be shared with policies; what actually brings the ConfigMap into the policy's data.inventory is adding the ConfigMap kind to the Config resource's syncOnly list. A constraint can then reference data.inventory.namespace["gatekeeper-system"]["v1"]["ConfigMap"]["platform-policy-constants"].data["allowed-registries"]. We use this for the registry policy specifically; it lets us change the registry list in one place across all clusters by editing the ConfigMap, rather than editing every cluster's constraint.
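
Concretely, the extra syncOnly entry:

# Appended to the Config resource's syncOnly list
- group: ""
  version: "v1"
  kind: "ConfigMap"

And a Rego helper that consumes it; a sketch, assuming the newline-separated format in the ConfigMap above:

allowed_registries = registries {
  raw := data.inventory.namespace["gatekeeper-system"]["v1"]["ConfigMap"]["platform-policy-constants"].data["allowed-registries"]
  registries := split(trim_space(raw), "\n")
}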

The PR validation pipeline

The pipeline runs on every PR that touches a *.yaml, *.yml, or Chart.yaml file. The runner is gator test, the test CLI shipped with Gatekeeper.

trigger: none

pr:
  branches:
    include: [main]
  paths:
    include:
      - 'manifests/**'
      - 'charts/**'
      - 'policies/**'

pool:
  vmImage: ubuntu-latest

variables:
  gatorVersion: '3.16.0'
  helmVersion: '3.14.0'

stages:
  - stage: PolicyValidate
    displayName: 'Validate manifests against policy library'
    jobs:
      - job: GatorTest
        steps:
          - checkout: self
            fetchDepth: 1

          - task: Bash@3
            displayName: 'Install gator and helm'
            inputs:
              targetType: inline
              script: |
                set -euo pipefail
                curl -sL "https://github.com/open-policy-agent/gatekeeper/releases/download/v$(gatorVersion)/gator-v$(gatorVersion)-linux-amd64.tar.gz" \
                  | tar -xz -C /tmp
                sudo mv /tmp/gator /usr/local/bin/gator
                curl -sL "https://get.helm.sh/helm-v$(helmVersion)-linux-amd64.tar.gz" | tar -xz -C /tmp
                sudo mv /tmp/linux-amd64/helm /usr/local/bin/helm
                gator version
                helm version --short

          - task: Bash@3
            displayName: 'Render charts and run gator test'
            inputs:
              targetType: inline
              script: |
                set -euo pipefail
                RENDERED=/tmp/rendered.yaml
                : > $RENDERED

                for chart in charts/*/; do
                  if [ -f "$chart/Chart.yaml" ]; then
                    echo "rendering $chart"
                    helm template "$chart" \
                      --release-name "$(basename "$chart")" \
                      --namespace "$(basename "$chart")" \
                      --values "$chart/values.yaml" \
                      --values "$chart/values.test.yaml" \
                      >> $RENDERED
                    echo "---" >> $RENDERED
                  fi
                done

                # Also include raw manifests under manifests/
                find manifests -name '*.yaml' -o -name '*.yml' \
                  | xargs -I{} sh -c 'cat {}; echo "---"' \
                  >> $RENDERED

                gator test \
                  --filename $RENDERED \
                  --filename policies/ \
                  --output json \
                  | tee gator-results.json

                # Fail the build if any deny-action violations
                VIOLATIONS=$(jq '[.[] | select(.constraint.spec.enforcementAction != "dryrun")] | length' gator-results.json)
                echo "deny-action violations: $VIOLATIONS"
                if [ "$VIOLATIONS" -gt 0 ]; then
                  echo "##vso[task.logissue type=error]Policy violations found"
                  exit 1
                fi

          - task: PublishBuildArtifacts@1
            condition: always()
            inputs:
              pathToPublish: gator-results.json
              artifactName: 'policy-results'

The jq filter at the end is what makes the audit-mode rollout possible. Constraints with enforcementAction: dryrun produce violations in the output but do not fail the build; constraints with deny do. This lets a new constraint live in the policy library, run in CI, and surface violations as informational output for a few weeks before flipping the action.

The rollout: four weeks of dryrun, then deny

The single decision that made this rollout work was: every constraint shipped first as dryrun for four weeks before flipping to deny. enforcementAction: dryrun means the constraint runs against every admission request and every audit sweep, records the violations, and lets the request through. The violations show up in kubectl get constraints and in the audit pod's logs.

kubectl get constraints -o wide

The output, three days after we shipped the resource-limits constraint in dryrun:

NAME                              ENFORCEMENT-ACTION   TOTAL-VIOLATIONS
no-hostpath-except-logging        deny                 0
only-acr-and-mcr                  deny                 0
workloads-must-have-recommended-  deny                 2
workload-resource-bands           dryrun               47
no-privileged-containers          deny                 0
no-host-network                   deny                 1
no-latest-tag                     deny                 3
require-readiness-probe           dryrun               18
require-image-pull-policy         dryrun               6
require-network-policy            dryrun               4
require-pdb-for-replicas-gt-1     dryrun               9
no-default-serviceaccount         dryrun               14
restrict-capabilities             deny                 0
no-automount-sa-token             dryrun               31

47 existing pods were below or above the resource bands. 31 were automounting service-account tokens unnecessarily. 14 were running as the default service account. We went through each list, in each namespace, and either fixed the manifest or granted an exception via annotation. The fixes were small; the negotiation about the exceptions was the work.

Across five clusters, the audit flagged 47 + 18 + 6 + 4 + 9 + 14 + 31 = 129 violations on the constraints in dryrun mode. After four weeks of fixes, the violation counts had dropped to zero for three of the seven dryrun constraints. The remaining four had a small residual of legitimate exceptions, which we annotated and then flipped to deny. The cluster was clean when we flipped the switch, so the flip itself was non-disruptive.

The flip is a one-line edit per constraint:

spec:
  enforcementAction: deny  # was: dryrun

A kubectl apply of the updated constraint needs no Gatekeeper restart and takes effect on the next admission request. We staggered the flips one constraint per day to keep the surface area of any unexpected denial small.

Troubleshooting

admission webhook "validation.gatekeeper.sh" denied the request: [workload-resource-bands] container app does not specify cpu limit is the canonical deny message. It is verbose enough that the developer almost always self-services. The pattern is: the constraint name in brackets at the start, then the human-readable reason. We tuned every constraint's msg to include the offending container or pod name, because the original messages were technically correct but useless ("workload is missing required labels").

failed to compile rego: rego_parse_error from gator test means the Rego in a ConstraintTemplate is malformed. The error includes a line number against the embedded Rego, which is one-indexed from the start of the rego: block in the YAML. We learned to keep the Rego short per template; a 200-line Rego block is impossible to debug from a parse error, a 30-line block is fine.

gator test failed: no constraint template matches kind 'K8sRequiredLabels' means the constraint file was loaded but the template file was not. The --filename policies/ flag has to point at the directory containing both template.yaml and constraint.yaml. We keep them side by side in policies/k8srequiredlabels/.
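
The layout, with the other directories following the same pattern:

policies/
  k8srequiredlabels/
    template.yaml
    constraint.yaml
  k8sallowedregistries/
    template.yaml
    constraint.yaml
  k8sdisallowhostpath/
    template.yaml
    constraint.yaml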

Error from server (Forbidden): error when creating "...": admission webhook "validation.gatekeeper.sh" denied the request: namespace "app-prod" does not have a NetworkPolicy is the referential-constraint variant. The deny happens at admission like any other, but the violation is evaluated against the synced inventory and attributed to the namespace, not to the resource the developer just created, and the audit pod keeps reporting it until the namespace is fixed. The fix is to add a NetworkPolicy to the namespace; the audit pod re-checks every five minutes.

failed to call webhook: Post "https://gatekeeper-webhook-service...": context deadline exceeded shows up during AKS upgrades. The webhook timeout default (3 seconds) is sometimes not enough when the gatekeeper pods are mid-restart. The fix that worked for us is the validatingWebhookTimeoutSeconds: 5 setting in the Helm values, plus failurePolicy: Ignore on a small subset of less-critical constraints so a Gatekeeper outage does not block all admission.

gator results show no violations but the cluster is denying everything happened to us once. The cause was a stale --filename path in the pipeline; gator was loading an old version of the policies, but the cluster had a newer version with stricter rules. We fixed it by pinning the policy library to a commit SHA and reading the same SHA in both the PR pipeline and the cluster's Flux subscription.

Where this ended up

Fourteen constraints, in four categories, in deny mode across five clusters. The PR pipeline catches roughly three violations a week, mostly missing labels or missing resource limits on quick experiments. The audit dashboard has been at zero violations on deny constraints for the past six weeks. The dryrun flag has a single tenant on it right now, a K8sRequirePDB extension that requires a minAvailable value rather than a count; that one is still collecting data before we flip it.

The policy library lives in a repository the platform team owns, separate from the application repositories. Application repos consume the policies indirectly: their PR pipelines pull the latest tagged release of the policy library and run gator test against their rendered manifests with that library. When we change a policy, we tag a new version of the library; the application pipelines pick it up on their next PR run. This separation matters because the platform team can iterate on Rego without coordinating with every app team for every change, and the app teams have a stable, versioned dependency rather than a moving target.
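
What the consumer side looks like, as a sketch of the relevant pipeline fragment (repo name and tag are hypothetical):

resources:
  repositories:
    - repository: policies
      type: git
      name: Platform/policy-library
      ref: refs/tags/v1.12.0   # the pinned release

steps:
  - checkout: self
  - checkout: policies
  # render the app's charts as shown earlier, then point gator at the checkout:
  # gator test --filename /tmp/rendered.yaml --filename <policies checkout>/policies/

An automated bump of the ref on each library release keeps this consistent with the "pin to one SHA everywhere" fix from the troubleshooting section, and makes every policy upgrade a visible, reviewable event in the app repo.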

Two reflections on the work, neither one a list. The first is about the cost-benefit of policy as code, which I had been a quiet skeptic of for years. The conventional argument for it is "we shift left." That phrase is correct and unhelpful; it explains the geometry without explaining the value. The actual value, for us, was that two years of code reviews had genuinely missed a class of issues nobody on the team was equipped to spot reliably, because the reviewer's eye does not pattern-match on volume specifications the way it pattern-matches on logic in a function. The policies do not replace the reviewer; they handle a slice of review that the reviewer was never doing well. The 11:47 deny is the obvious example, but the boring labels and resource limits violations are the larger truth. The reviewer was never going to catch the engineer who left a CPU limit unset or pasted app: foo instead of the full label set. The policy catches both, every time, in eleven seconds.

The second reflection is on the choice to write the library in Rego rather than adopt Azure Policy for Kubernetes. The dashboard we gave up turned out to matter less than I had feared and the iteration speed mattered more than I had hoped. A four-month Rego learning curve was real, and it was the dominant cost of the project. If you are deciding the same thing today, the right framing is: how many custom policies do you actually plan to write. If the answer is four or fewer and they all already exist in the Azure Policy built-in library, take the dashboard. If the answer is more than four, or any of them are domain-specific to your platform (we had several, the most useful being a constraint that requires every workload to declare an owning team via a label), the iteration speed wins. We are sitting on 14 active policies and a backlog of about half a dozen more, and we revisit the Rego in roughly one in three sprints. The dashboard would have made us slower at exactly the times we most needed to move.

The debug pod from February got its hostPath in the end, properly annotated, reviewed by two platform engineers, scoped to one staging cluster and one namespace. The engineer who wrote it has since used the annotation pattern twice more for legitimate cases and has caught three policy violations in PRs he reviewed for other people. The same pattern, the same library, the same eleven-second feedback loop. The 11:47 deny was not a single moment so much as the moment the team's habits started to bend around the policy, which was the only outcome I had really been hoping for.