From 312 user-assigned identities to a fleet of 14: an audit-driven migration and a 14:08 near-miss
The auditor's PDF circled one bullet in red: 312 user-assigned managed identities, 47% inactive, 18% with no role bindings. Eight weeks later we were running on 14 fleet identities, every one in use, every one with explicit role bindings. Between those two states sat a Resource Graph query that turned 312 into a spreadsheet, a Bicep refactor that rewrote how workloads bind to identity, and a four-minute payments outage at 14:08 that taught us to verify the authorization model before flipping the principal. The full rebuild.
The auditor's note arrived in a PDF on a Wednesday in February. Page seven, third bullet: "Your subscription footprint contains 312 user-assigned managed identities, of which 47% have not authenticated in the last 90 days, and 18% have no role assignments at all. Recommend reducing the surface area and rationalising the identity model before next year's review." The PDF was 41 pages. That bullet was the only thing the security architect circled in red. By the time we finished the migration eight weeks later, we were running on 14 user-assigned managed identities, every one of them in use, every one of them with explicit role bindings, and the only excitement along the way was a four-minute payment-service outage at 14:08 on a Tuesday that I will get to in a minute.
This is the whole rebuild. The Resource Graph query that turned 312 into a spreadsheet. The classification that revealed only 14 real access patterns. The Bicep refactor that swapped per-workload UAMIs for fleet UAMIs on the AKS workload-identity binding. The migration playbook we ran for each of the 96 workloads. The reconciliation script that catches drift. And the gotcha about federated-credential limits that nearly broke two of the fleet identities in the second week.
What 312 identities actually look like
For the people coming to this cold: a user-assigned managed identity in Azure is a standalone resource whose lifecycle is independent of the workload that uses it. You create it, you grant it roles on other resources, you bind it to a workload (a VM, a Function App, an AKS pod via workload identity), and the workload authenticates as that identity without holding any credential. The cost is structural, not financial. Identities are free. The cost shows up when you have so many that nobody can answer "what does this one do, and why does it exist."
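To make "authenticates without holding any credential" concrete on AKS: the workload-identity webhook injects a projected ServiceAccount token plus a handful of environment variables, and the client library exchanges that token for an Entra access token. A minimal sketch of that exchange, with a placeholder tenant id and the live call left commented out because it needs a real tenant:

```shell
# The env var names (AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_AUTHORITY_HOST,
# AZURE_FEDERATED_TOKEN_FILE) are the ones the AKS workload-identity webhook
# injects into the pod. The tenant id here is a placeholder.
AZURE_AUTHORITY_HOST="https://login.microsoftonline.com/"
AZURE_TENANT_ID="00000000-0000-0000-0000-000000000000"
TOKEN_URL="${AZURE_AUTHORITY_HOST}${AZURE_TENANT_ID}/oauth2/v2.0/token"
echo "$TOKEN_URL"
# curl -s -X POST "$TOKEN_URL" \
#   -d "client_id=${AZURE_CLIENT_ID}" \
#   -d "grant_type=client_credentials" \
#   -d "scope=https://vault.azure.net/.default" \
#   -d "client_assertion_type=urn:ietf:params:oauth:client-assertion-type:jwt-bearer" \
#   -d "client_assertion=$(cat "$AZURE_FEDERATED_TOKEN_FILE")"
```

Nothing in the pod is a secret: the client assertion is the short-lived Kubernetes token, and the trust is the federated credential on the identity.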
We had been creating one UAMI per workload since 2022. The Bicep template that provisioned each microservice provisioned its own UAMI alongside, with a federated credential pointing at the workload's Kubernetes ServiceAccount, with role assignments tailored to whatever that workload happened to need on the day it was first deployed. Three years of that pattern, plus a few hundred PoCs that were never decommissioned, plus the occasional "let me just spin up a UAMI to test this in dev," and we were at 312.
The auditor's two statistics turned out to be conservative. After we instrumented properly, the real numbers were 49% inactive (no token issued in 90 days) and 21% with zero role bindings. About a tenth of the inventory existed solely because someone wrote mi-test-foo in a script three years ago and never cleaned up. Five identities were named after people who had left the company. Two had names that were just GUIDs, presumably because someone copy-pasted from the Azure CLI without reading the output.
Counting them: the Resource Graph query
The starting point of any work like this is honest inventory. The Azure portal can list UAMIs, but it cannot tell you which ones have role assignments, which ones have federated credentials, or when they last issued a token. Azure Resource Graph can answer the first two; the third requires a join against the Entra sign-in logs, which I will come to.
The query that produced our master spreadsheet:
Resources
| where type =~ "Microsoft.ManagedIdentity/userAssignedIdentities"
| project
miName = name,
miId = id,
miPrincipalId = tostring(properties.principalId),
miClientId = tostring(properties.clientId),
rg = resourceGroup,
sub = subscriptionId,
location = location,
tags = tags
| join kind=leftouter (
AuthorizationResources
| where type =~ "Microsoft.Authorization/roleAssignments"
| extend principalId = tostring(properties.principalId),
roleDefId = tostring(properties.roleDefinitionId),
scope = tostring(properties.scope)
| summarize
roleCount = count(),
roleScopes = make_set(scope, 64),
roleDefIds = make_set(roleDefId, 64)
by principalId
) on $left.miPrincipalId == $right.principalId
| extend
hasRoles = iff(isnull(roleCount) or roleCount == 0, false, true)
| project miName, rg, sub, miPrincipalId, miClientId,
hasRoles, roleCount, roleScopes, roleDefIds, tags
| order by hasRoles asc, miName asc
That left me with a CSV of 312 rows. Sixty-six of them had hasRoles = false. Another query against Microsoft.ManagedIdentity/userAssignedIdentities/federatedIdentityCredentials got me the workload bindings:
Resources
| where type =~ "Microsoft.ManagedIdentity/userAssignedIdentities/federatedIdentityCredentials"
| extend miName = tostring(split(id, "/")[8])
| project
miName,
fedCredName = name,
issuer = tostring(properties.issuer),
subject = tostring(properties.subject),
audiences = properties.audiences
The subject claim on a federated credential for an AKS workload identity has the shape system:serviceaccount:<namespace>:<sa-name>. That string told me, for every UAMI in the inventory, exactly which Kubernetes ServiceAccount in which cluster was trusting it. Fifty-three identities had federated credentials pointing at no-longer-existing ServiceAccounts (the workload had been deleted, the UAMI had not).
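Those subject strings are mechanical to pick apart. A plain-shell sketch of the split we used to map each credential back to a cluster workload (sample subject hard-coded):

```shell
# Subject format for an AKS workload-identity federated credential:
# system:serviceaccount:<namespace>:<serviceaccount-name>
subject="system:serviceaccount:payments:payments-api"
ns=$(echo "$subject" | cut -d: -f3)
sa=$(echo "$subject" | cut -d: -f4)
echo "namespace=$ns serviceaccount=$sa"
# → namespace=payments serviceaccount=payments-api
```

With the namespace and ServiceAccount in hand, one way to find the orphans is a kubectl get sa "$sa" -n "$ns" per credential: any lookup that fails is a credential trusting a workload that no longer exists.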
The last join was the sign-in data. UAMIs do not produce interactive sign-ins; their token grants land in the managed-identity sign-in logs, which reach Log Analytics as the AADManagedIdentitySignInLogs table once you enable the ManagedIdentitySignInLogs category in the Entra diagnostic settings. (Graph calls additionally show up in MicrosoftGraphActivityLogs, and resource-level auth in each service's own diagnostic logs.) The cheapest probe was to query that table by ServicePrincipalId (which for a UAMI is its principalId) for the last 90 days, narrowing to our inventory with the arg() cross-service operator:
AADManagedIdentitySignInLogs
| where TimeGenerated > ago(90d)
| where ServicePrincipalId in ((
    arg("").Resources
    | where type =~ "microsoft.managedidentity/userassignedidentities"
    | project principalId = tostring(properties.principalId)
))
| summarize lastSignIn = max(TimeGenerated) by ServicePrincipalId
Joining that back against the master CSV gave us the inactivity column. The auditor's "47%" became "49%, plus 21% never authenticated at all because they had no role bindings to use, so the actual fraction of identities doing real work was 30%, ninety-three of the original 312."
The 14 access patterns
Ninety-three identities is still a lot. We tagged each active UAMI with what it actually authenticated to: which Key Vaults it read from or wrote to, which storage accounts it touched, which Cosmos accounts it queried, which container registries it pulled from. We built a column per resource type and a set of distinct resources within each.
When we clustered the result by the exact set of (resourceType, action, scope-pattern) tuples, ninety-three identities collapsed into fourteen distinct profiles. Most of the duplication was that every team had created their own mi-app-foo-keyvault-read identity to read secrets from their team's vault, with Key Vault Secrets User at the vault scope, when in reality they all wanted the same thing: "the workload running in namespace X can read secrets from the vault named after namespace X."
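The clustering itself needs nothing fancier than grouping identities by a sorted fingerprint of their tuples. A jq sketch over sample data (the JSON shape here is illustrative, not our actual export format):

```shell
cat <<'EOF' > /tmp/active-mis.json
[
  {"mi": "mi-app-foo-kv", "roles": [{"role": "Key Vault Secrets User", "scope": "kv-payments-prod"}]},
  {"mi": "mi-app-bar-kv", "roles": [{"role": "Key Vault Secrets User", "scope": "kv-payments-prod"}]},
  {"mi": "mi-acr-pull",   "roles": [{"role": "AcrPull", "scope": "crplatformprod"}]}
]
EOF
# Group identities whose (role, scope) sets are identical: each resulting
# group is one candidate fleet profile.
jq 'group_by(.roles | sort) | map({profile: .[0].roles, members: [.[].mi]})' \
  /tmp/active-mis.json
```

Run against the three sample identities this produces two groups, which is the shape of the real result: ninety-three inputs, fourteen groups.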
The fourteen patterns we ended up naming:
| Fleet identity | Access pattern |
|---|---|
| mi-app-payments-read-kv | Key Vault Secrets User on kv-payments-prod, kv-payments-stage |
| mi-app-payments-write-cosmos | Cosmos DB Built-in Data Contributor on cosmos-payments-* |
| mi-app-orders-read-kv | Key Vault Secrets User on kv-orders-* |
| mi-app-orders-write-cosmos | Cosmos DB Built-in Data Contributor on cosmos-orders-* |
| mi-app-search-write-asg | Storage Blob Data Contributor on stsearchindex* |
| mi-app-shared-read-kv-shared | Key Vault Secrets User on kv-shared-platform |
| mi-platform-aks-image-pull | AcrPull on crplatformprod |
| mi-platform-aks-csi-secrets | Key Vault Secrets User on kv-shared-platform |
| mi-platform-flux-read-acr | AcrPull on crplatformprod, Storage Blob Data Reader on flux state |
| mi-platform-obs-write-law | Monitoring Metrics Publisher on Log Analytics workspaces |
| mi-data-ingest-read-storage | Storage Blob Data Reader on stdataingest* |
| mi-data-ingest-write-cosmos | Cosmos DB Built-in Data Contributor on cosmos-ingest-* |
| mi-jobs-read-servicebus | Azure Service Bus Data Receiver on sb-jobs-* |
| mi-jobs-write-servicebus | Azure Service Bus Data Sender on sb-jobs-* |
Each fleet identity is shared across the workloads that genuinely share the same access need. The hard rule we wrote into the design doc: a fleet identity gets exactly the union of role bindings that all of its workloads need, with no extras. If one workload needs a wider grant, it either gets a separate identity or its profile changes to match an existing one. No "while we are here, let me also give this identity read on the storage account."
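That rule is checkable mechanically before a workload joins a fleet. A sketch of the gate, assuming a hypothetical contract format where both the fleet profile and the candidate workload declare (role, scope) pairs as JSON:

```shell
fleet='[{"role":"Key Vault Secrets User","scope":"kv-payments-prod"},
        {"role":"Key Vault Secrets User","scope":"kv-payments-stage"}]'
wants='[{"role":"Key Vault Secrets User","scope":"kv-payments-prod"}]'
# jq array subtraction: everything the workload wants that the profile lacks.
extra=$(jq -n --argjson f "$fleet" --argjson w "$wants" '($w - $f) | length')
if [ "$extra" -eq 0 ]; then
  echo "ok: workload fits the profile"
else
  echo "reject: workload needs $extra grant(s) outside the profile"
fi
```

A non-empty difference is exactly the "it needs more, so it cannot join" conversation, surfaced as a failing check instead of a quiet role assignment later.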
Bicep, before and after
The old pattern, one UAMI per workload, looked like this:
@description('Microservice short name, e.g. payments-api')
param svcName string
@description('AKS OIDC issuer URL')
param oidcIssuer string
@description('Kubernetes namespace the workload runs in')
param k8sNamespace string
resource workloadIdentity 'Microsoft.ManagedIdentity/userAssignedIdentities@2023-01-31' = {
name: 'mi-app-${svcName}'
location: resourceGroup().location
}
resource fedCred 'Microsoft.ManagedIdentity/userAssignedIdentities/federatedIdentityCredentials@2023-01-31' = {
parent: workloadIdentity
name: 'fc-${svcName}'
properties: {
issuer: oidcIssuer
subject: 'system:serviceaccount:${k8sNamespace}:${svcName}'
audiences: ['api://AzureADTokenExchange']
}
}
resource kvSecretsUser 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
scope: resourceGroup()
name: guid(workloadIdentity.id, 'KeyVaultSecretsUser')
properties: {
principalId: workloadIdentity.properties.principalId
principalType: 'ServicePrincipal'
roleDefinitionId: subscriptionResourceId(
'Microsoft.Authorization/roleDefinitions',
'4633458b-17de-408a-b874-0445c86b69e6'
)
}
}
output workloadIdentityClientId string = workloadIdentity.properties.clientId
Every microservice in the org carried a copy of this template. The new pattern moves the UAMI and its role assignments into a single platform module that is deployed once. The microservice module only contributes a federated credential to the appropriate fleet identity for its access profile, and the AKS deployment binds the workload to that fleet identity by client id.
The platform module that vends the fleet identities:
// modules/identity-fleet.bicep
param location string
param oidcIssuer string
var fleet = [
{
name: 'mi-app-payments-read-kv'
profile: 'payments-read-kv'
}
{
name: 'mi-app-payments-write-cosmos'
profile: 'payments-write-cosmos'
}
// ... twelve more
]
resource fleetMis 'Microsoft.ManagedIdentity/userAssignedIdentities@2023-01-31' = [for f in fleet: {
name: f.name
location: location
tags: {
'fleet-profile': f.profile
'owned-by': 'platform'
}
}]
output fleetIds array = [for (f, i) in fleet: {
name: f.name
resourceId: fleetMis[i].id
clientId: fleetMis[i].properties.clientId
principalId: fleetMis[i].properties.principalId
}]
A separate module per profile attaches role bindings to the right fleet identity at the right scope. For example, modules/profile-payments-read-kv.bicep grants Key Vault Secrets User on the two payments vaults to mi-app-payments-read-kv and nothing else. That keeps the role-assignment surface area auditable per profile.
The microservice module shrinks to this:
@description('The fleet identity this workload should authenticate as')
param fleetIdentityName string
@description('Kubernetes namespace the workload runs in')
param k8sNamespace string
@description('Kubernetes ServiceAccount name')
param k8sServiceAccount string
@description('AKS OIDC issuer URL')
param oidcIssuer string
resource fleetMi 'Microsoft.ManagedIdentity/userAssignedIdentities@2023-01-31' existing = {
name: fleetIdentityName
scope: resourceGroup('rg-platform-identities-prod')
}
resource fedCred 'Microsoft.ManagedIdentity/userAssignedIdentities/federatedIdentityCredentials@2023-01-31' = {
parent: fleetMi
name: 'fc-${k8sNamespace}-${k8sServiceAccount}'
properties: {
issuer: oidcIssuer
subject: 'system:serviceaccount:${k8sNamespace}:${k8sServiceAccount}'
audiences: ['api://AzureADTokenExchange']
}
}
output workloadIdentityClientId string = fleetMi.properties.clientId
No more per-workload identity. No more per-workload role assignments. The microservice contributes one federated credential to the fleet identity, names which fleet to bind to, and consumes the client id in its Kubernetes manifest. The AKS workload-identity binding in the pod spec uses the fleet identity's client id via the standard service account annotation:
apiVersion: v1
kind: ServiceAccount
metadata:
name: payments-api
namespace: payments
annotations:
azure.workload.identity/client-id: "<fleet client id from bicep output>"
And the pod template carries the workload-identity label that triggers token injection:
apiVersion: apps/v1
kind: Deployment
metadata:
name: payments-api
namespace: payments
spec:
selector:
matchLabels: { app: payments-api }
template:
metadata:
labels:
app: payments-api
azure.workload.identity/use: "true"
...
The migration playbook, per workload
We needed to move ninety-six workloads from per-workload UAMIs to fleet UAMIs without taking any of them offline. The playbook had five steps, and we ran it as an Azure Pipelines pipeline (not GitHub Actions).
# azure-pipelines/identity-migration.yml
parameters:
- name: workload
type: string
- name: targetFleet
type: string
- name: namespace
type: string
- name: serviceAccount
type: string
stages:
- stage: AddFederatedCredentialToFleet
jobs:
- job: AddFedCred
steps:
- task: AzureCLI@2
inputs:
azureSubscription: 'sc-platform-identities'
scriptType: bash
scriptLocation: inlineScript
inlineScript: |
az identity federated-credential create \
--identity-name "${{ parameters.targetFleet }}" \
--resource-group rg-platform-identities-prod \
--name "fc-${{ parameters.namespace }}-${{ parameters.serviceAccount }}" \
--issuer "$(AKS_OIDC_ISSUER)" \
--subject "system:serviceaccount:${{ parameters.namespace }}:${{ parameters.serviceAccount }}" \
--audiences api://AzureADTokenExchange
- stage: SwitchServiceAccountAnnotation
dependsOn: AddFederatedCredentialToFleet
jobs:
- job: PatchSA
steps:
- task: AzureCLI@2
inputs:
azureSubscription: 'sc-platform-aks'
scriptType: bash
scriptLocation: inlineScript
inlineScript: |
FLEET_CLIENT_ID=$(az identity show \
--name "${{ parameters.targetFleet }}" \
--resource-group rg-platform-identities-prod \
--query clientId -o tsv)
kubectl annotate sa ${{ parameters.serviceAccount }} \
-n ${{ parameters.namespace }} \
azure.workload.identity/client-id="${FLEET_CLIENT_ID}" \
--overwrite
kubectl rollout restart deployment \
-n ${{ parameters.namespace }} \
-l app=${{ parameters.workload }}
kubectl rollout status deployment \
-n ${{ parameters.namespace }} \
-l app=${{ parameters.workload }} --timeout=5m
- stage: ValidateNewIdentityInUse
dependsOn: SwitchServiceAccountAnnotation
jobs:
- job: Probe
steps:
- task: AzureCLI@2
inputs:
azureSubscription: 'sc-platform-aks'
scriptType: bash
scriptLocation: inlineScript
inlineScript: |
POD=$(kubectl get pod -n ${{ parameters.namespace }} \
-l app=${{ parameters.workload }} \
-o jsonpath='{.items[0].metadata.name}')
kubectl exec -n ${{ parameters.namespace }} "$POD" -- \
printenv AZURE_CLIENT_ID
# Trigger a real call against a known scope, verify success
kubectl exec -n ${{ parameters.namespace }} "$POD" -- \
/app/probe --scope https://vault.azure.net/.default
- stage: RemoveOldFederatedCredential
dependsOn: ValidateNewIdentityInUse
jobs:
- deployment: Cleanup
environment: 'platform-identity-cleanup'
strategy:
runOnce:
deploy:
steps:
- task: AzureCLI@2
inputs:
azureSubscription: 'sc-platform-identities'
scriptType: bash
scriptLocation: inlineScript
inlineScript: |
OLD_MI="mi-app-${{ parameters.workload }}"
az identity federated-credential delete \
--identity-name "$OLD_MI" \
--resource-group rg-platform-identities-prod \
--name "fc-${{ parameters.workload }}" \
--yes
The fourth stage sits behind an environment gate. We added that gate after the payment service incident; without it, the pipeline raced ahead and tore down the old federated credential before we had a chance to spot a problem.
The actual sequencing for each workload was: add the federated credential to the fleet identity, then re-annotate the ServiceAccount with the new client id and roll the pods, then run a probe that does one real call (read a secret from the vault, write a document to Cosmos, whatever the workload's main authenticated operation is), then wait 24 hours of normal traffic, then remove the old federated credential, then 24 hours later delete the old UAMI.
The payment service near-miss at 14:08
Tuesday afternoon, week two, we were running the playbook on payments-api. By 14:00 the federated credential was added to mi-app-payments-read-kv. At 14:07 we re-annotated the ServiceAccount, the new pods rolled out clean, the probe call against the vault came back 200, the pipeline ticked over to "validated, holding 24h."
At 14:08 the on-call channel lit up. Payments was throwing 5xx on every request that needed a secret read. The error logs:
azure-keyvault: HTTP 403 from kv-payments-prod
{"error":{"code":"Forbidden","message":"The user, group or application 'appid=<fleet-client-id>;oid=<fleet-principal-id>;...' does not have secrets get permission on key vault 'kv-payments-prod;location=eastus2'. For help resolving this issue, please see https://go.microsoft.com/fwlink/?linkid=2125287"}}
The fleet identity had the Key Vault Secrets User role on kv-payments-prod at the vault scope. We had checked that explicitly in the design. The role was there. So why 403?
The vault was on the legacy access-policy model, not RBAC. Two of our older vaults had never been migrated to RBAC-based authorization. The old per-workload UAMI had an entry on the access-policy list (because the original deployment had added one). The new fleet UAMI had a role assignment, which the vault's authorization model was ignoring entirely.
We caught the failure at 14:08, identified the cause at 14:10, and added the fleet identity to the vault's access policy list manually at 14:12. Four minutes of partial outage on the payments tier. Roughly 380 requests failed in that window. The blast radius was small because we had moved a single service, not the whole estate, and because the alerting fired the moment the error rate crossed 1%.
The fix went into the migration checklist immediately:
#!/usr/bin/env bash
# Run this BEFORE switching any workload that talks to Key Vault
set -euo pipefail
VAULT=$1
MODE=$(az keyvault show -n "$VAULT" --query "properties.enableRbacAuthorization" -o tsv)
if [ "$MODE" != "true" ]; then
  echo "ERROR: vault $VAULT is on access policies, not RBAC. Migrate the vault first." >&2
  exit 1
fi
We also added a "canary workload first" rule. From week three onwards, every new access profile got migrated on a low-stakes workload before any business-critical one. The two profiles that touched access-policy vaults got their vaults converted to RBAC first, in a separate change, with its own approval. The lesson, which I had known abstractly but had not internalised at this scope: when you change the principal, you must verify the principal's effective access against the resource's actual authorization model, not the model you assumed it was using.
The RBAC reconciliation script
Once the migration was running, drift became the next problem. A team that needed a new permission would add a role assignment to their workload's old UAMI by hand, the migration pipeline would still be using the old role-binding list to design the fleet identity, and we would end up with a fleet identity short on permissions.
The reconciliation script runs nightly. For each fleet identity, it computes the union of role bindings required by all the workloads bound to it (by looking at the workloads' contracts in source control) and compares against what the identity actually has. Any diff is flagged.
#!/usr/bin/env bash
set -euo pipefail
FLEET_RG="rg-platform-identities-prod"
WORKLOAD_REPO="/var/lib/recon/workloads"
for fleet in $(az identity list -g "$FLEET_RG" --query '[].name' -o tsv); do
echo "=== reconciling $fleet ==="
PRINCIPAL=$(az identity show -n "$fleet" -g "$FLEET_RG" --query principalId -o tsv)
# Actual: what the identity currently has
az role assignment list --assignee "$PRINCIPAL" --all \
--query '[].{role:roleDefinitionName,scope:scope}' -o json \
| jq -S '.' > "/tmp/recon-${fleet}-actual.json"
# Expected: declared in workload contracts that point at this fleet
jq -s --arg fleet "$fleet" '
[ .[] | select(.fleet == $fleet) | .roles[] ]
| unique_by({role, scope})
' "$WORKLOAD_REPO"/*.json | jq -S '.' > "/tmp/recon-${fleet}-expected.json"
diff -u "/tmp/recon-${fleet}-expected.json" "/tmp/recon-${fleet}-actual.json" \
> "/tmp/recon-${fleet}.diff" || true
if [ -s "/tmp/recon-${fleet}.diff" ]; then
echo "DRIFT on $fleet, see /tmp/recon-${fleet}.diff"
cat "/tmp/recon-${fleet}.diff" | head -40
else
echo "ok: $fleet matches expected"
fi
done
The diff lands in the platform team's morning report. In the eight weeks we have been running it, we have caught seven drift events. Three were legitimate additions a team had made via the UI; we folded them into the workload contract and re-applied. Four were stale role assignments left from migrated workloads that had been removed; we cleaned them up.
The cleanup pipeline and the 24-hour grace
Deleting a UAMI is straightforward only if you have already removed every federated credential and every role assignment on it. If you have not, the delete fails with:
Code: BadRequest
Message: User-Assigned Managed Identity '/subscriptions/.../mi-app-payments-old' cannot be deleted because it is in use. The following resources reference it: ...
Sometimes the "in use" reference is a federated credential you forgot. Sometimes it is a VM you did not know was bound to it. Sometimes it is a Function App you cannot easily inspect. The cleanup pipeline does this in stages with a 24-hour grace between each:
- Delete all federated credentials on the old UAMI.
- Wait 24 hours. Re-run a probe that confirms no workload is still authenticating as it (search the sign-in logs by principalId).
- Remove all role assignments on the old UAMI.
- Wait 24 hours. Confirm no 403 errors have spiked on any of the previously-targeted resources.
- Delete the UAMI itself.
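The 403 check in step 4 is nothing heavier than a rate computation over the resource logs. A local sketch with awk over a sample export (real runs pull the status codes from Log Analytics; the file and threshold here are illustrative, the 1% matching the alert that caught the 14:08 incident):

```shell
cat <<'EOF' > /tmp/kv-status-sample.log
200 200 200 403 200 200 200 200 200 200
EOF
# Fraction of 403s in the window, as a percentage.
rate=$(tr ' ' '\n' < /tmp/kv-status-sample.log | awk '
  /^[0-9]+$/ { n++; if ($1 == 403) f++ }
  END { printf "%.2f", (n ? f / n * 100 : 0) }')
echo "403 rate: ${rate}%"
```

Anything above the alert threshold blocks the next removal step and holds the old identity in place.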
The grace period is the part I had to argue for. The platform team wanted to compress it. The argument that won the day was that the migration cost us one outage of four minutes because of an assumption we did not test. A 24-hour soak at each removal step is cheap insurance against the next assumption we have not thought of.
The federated-credential limit, week two
Every user-assigned managed identity has a hard limit of 20 federated credentials, documented on Microsoft Learn. With ninety-six workloads collapsing onto fourteen fleet identities, the average load was about seven workloads per identity, which is fine. The distribution was the problem. Two of our identities, mi-platform-aks-image-pull and mi-app-shared-read-kv-shared, were on the receiving end of the long tail. Every workload in the cluster needed AcrPull. Every workload in the shared platform namespace needed read on the shared vault.
By the second Friday of the migration, mi-platform-aks-image-pull had eighteen federated credentials and a queue of seven more workloads waiting to be migrated. The nineteenth credential added fine. The twentieth was the limit. The twenty-first failed:
Code: BadRequest
Message: The maximum number of federated identity credentials allowed for a user-assigned managed identity has been reached.
The mitigation was to split the identity. We created mi-platform-aks-image-pull-a and mi-platform-aks-image-pull-b, with identical role assignments, and distributed the federated credentials across the two so each had headroom. The choice of which workload binds to which is arbitrary; we hashed the namespace name modulo two. From that point on, any fleet identity that was projected to exceed fifteen federated credentials got pre-split when it was provisioned, giving a five-credential buffer for growth.
The design doc now says: cap workloads-per-fleet-identity at 15. Beyond that, split. The cap is not a performance limit; it is a guardrail against the federated-credential ceiling. Two fleet profiles ended up as -a/-b pairs. Nothing else changed in the access pattern, because the roles attached to mi-platform-aks-image-pull-a are exactly the roles attached to mi-platform-aks-image-pull-b; they are the same profile, sharded only for credential capacity.
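The shard pick itself is a one-liner. A sketch using cksum (the article's rule is "hash the namespace name modulo two"; the specific hash is arbitrary, it only has to be deterministic per namespace):

```shell
ns="payments"
# cksum on stdin prints "<crc> <bytes>"; take the CRC, mod 2 for the shard.
idx=$(( $(printf '%s' "$ns" | cksum | cut -d' ' -f1) % 2 ))
if [ "$idx" -eq 0 ]; then shard="a"; else shard="b"; fi
echo "mi-platform-aks-image-pull-${shard}"
```

Every workload in a namespace lands on the same shard, so a namespace never straddles the -a/-b pair.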
Troubleshooting, the full list
AADSTS70021: No matching federated identity record found is the canonical "the federated credential on the identity does not match the token the workload is presenting." Compare the credential's subject (system:serviceaccount:<ns>:<sa>) against what the pod's projected token claims. The two strings must match byte for byte. Capitalisation, trailing characters, the namespace name, everything. The token's subject is fixed by the Kubernetes side; the credential's subject is what you control. Fix the credential.
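A sketch of the byte-for-byte check, with both subjects hard-coded here; in practice the first comes from az identity federated-credential show and the second is rebuilt from the pod's live namespace and ServiceAccount name:

```shell
cred_subject="system:serviceaccount:payments:payments-api"
ns="payments"; sa="payments-api"
pod_subject="system:serviceaccount:${ns}:${sa}"
# printf %q (bash) makes invisible differences -- a trailing space, a
# wrong-case letter -- show up in the output instead of hiding.
printf 'credential: %q\npod:        %q\n' "$cred_subject" "$pod_subject"
[ "$cred_subject" = "$pod_subject" ] && echo "match" || echo "MISMATCH"
```

When the two strings differ, fix the credential's subject; the pod side is fixed by Kubernetes.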
403: identity does not have permission to KeyVault 'kv-payments-prod' for action 'Microsoft.KeyVault/vaults/secrets/getSecret/action' after a fresh migration almost always means one of two things. Either the vault is on access-policy authorization and your new identity has only an RBAC role (fix: add the identity to the access policy or convert the vault to RBAC), or the role you granted is not actually sufficient for the action (fix: Key Vault Secrets User is right for read; Key Vault Secrets Officer is needed for write).
User-Assigned Managed Identity cannot be deleted because it is in use means there is a reference somewhere you have not removed. The error message usually lists the referencing resources. If it does not, run:
az resource list \
--query "[?identity.userAssignedIdentities.\"$MI_RESOURCE_ID\"!=null].{name:name,type:type,id:id}" \
-o table
That walks every resource in the subscription looking for the identity in the identity.userAssignedIdentities map.
token request failed: identity not found from the AKS workload-identity webhook means the pod has the azure.workload.identity/use: "true" label and the ServiceAccount has an annotation, but the annotation does not reference an existing identity. Usually a typo in the client id, or the annotation pointing at a UAMI that was deleted prematurely. Re-annotate.
error: federated credential with name 'fc-...' already exists is the migration pipeline being run twice on the same workload. The second run should be a no-op; harden the pipeline to check first with az identity federated-credential show and skip the create when the credential is already there, rather than relying on output flags to mask the failure (--only-show-errors changes what is printed, not the exit code).
Where we ended up
Fourteen user-assigned managed identities. Every one of them in use by at least four workloads. Every one of them with explicit, declared role bindings that match a checked-in profile. Zero identities with no role assignments. Zero identities that have not authenticated in the last seven days. The auditor's note from February is closed.
The Resource Graph query that produced the original spreadsheet now runs nightly. The current row count is fourteen. We could frame it as wallpaper if we were the kind of team that did wallpapers.
The reflective part is about what changed in how we think, not just how the inventory looks. The old pattern, one identity per workload, felt like good hygiene because each workload was "isolated." It was not isolated. Ninety-three of the 312 identities were doing one of fourteen things, and the granularity of separation we were paying for was illusory: any workload that wanted to break out of its lane could ask its platform team for a new role assignment, and the team would add it, because no one was reading the cumulative permission surface across the fleet. The fleet model forces that reading. To add a workload to mi-app-payments-read-kv, you have to assert that the workload's access need is exactly Key Vault Secrets User on the payments vaults. If it needs more, it cannot join the fleet, and the gap surfaces in the design conversation rather than in a quiet az role assignment create six weeks later. Identity profiles became a thing we discuss before we provision, not a side-effect of provisioning.
The other thing that changed: the auditor's followup six months later took eight minutes. They asked for the Resource Graph query, ran it themselves, counted fourteen, asked how the role bindings were managed, read the workload-contract repo, and closed the finding. The whole exchange existed because the previous exchange had been a 41-page PDF with one red circle on page seven. The work between the two conversations was less about identity hygiene than about making the hygiene visible enough to answer for. Fourteen named profiles in a Bicep module is a thing a human can read and reason about. Three hundred and twelve UAMIs in a portal blade is not. The migration was not, fundamentally, a security project; it was a legibility project. Security was the side effect.