
Subscription vending in 9 minutes: an Azure DevOps Pipeline that lands a subscription end to end

A new product team waited fourteen working days for an Azure subscription. The ticket bounced between Cloud Centre of Excellence, FinOps, Security, and Networking. Eight months later, the same request flows through one PR template, one Azure DevOps pipeline, and a Bicep landing-zone stamp. Forty-seven subs have been vended. Time from request to first deploy went from fourteen days to nine minutes.


The internal ticket I pulled up was opened on a Monday in late June and closed on the second Thursday of July. Fourteen working days. The request was three lines long: a new product team needed an Azure subscription, a hub-peered VNet, a Log Analytics workspace pointed at the platform's monitoring sub, and a Service Connection their Azure DevOps project could deploy through. The ticket bounced between four queues. Cloud Centre of Excellence took it first to validate naming and management group placement. FinOps took it next to confirm the cost-centre code existed in our chargeback system. Security took it after that to assign the right Defender plans and policy initiative. Networking took it last to allocate a /22 from the IP plan and stand up the peering. Each hand-off had its own SLA, its own queue, its own Friday-afternoon person who was on leave the week we needed them. The team that filed the ticket built nothing for fourteen days. Their kick-off slipped a sprint.

That was the eleventh ticket in five months that looked exactly like that. I sat down with our principal engineer the Friday it closed and we agreed: the ticket was the problem. Not any one queue, not any one hand-off. The shape of "fill in a form, wait, escalate, wait" was the wrong shape. Eight months later, the same request flows through a single pull request on the platform repo. The PR template is the front door. Merge to main triggers an Azure Pipelines run. Nine minutes after merge, the requesting team gets a Teams message containing their new subscription id, their Service Connection name, their VNet address space, the link to their pre-scaffolded repo, and a single-page wiki on what to deploy next. Forty-seven subscriptions have been vended this way. Time from "I need a sub" to "you can deploy" went from fourteen working days to nine minutes. The platform team spends approximately zero hours a week on subscription provisioning.

This is the whole build. The PR template that became the form, the pipeline that orchestrates the work, the Bicep that stamps the landing zone, the federated credential dance that gives the team a secret-free Service Connection on their brand-new sub, and the specific gotcha (subscription alias creation is eventually consistent) that ate two evenings before we wrote the polling step that fixed it.

The before state, in numbers

The fourteen-day figure was the median across the eleven tickets I went back and timed. The distribution was wider than I expected. The fastest had been six days; the slowest, twenty-two. Six was when every queue had capacity on the same week and the team filing the ticket had answered every question on the first reply. Twenty-two was a holiday-adjacent month where the FinOps reviewer was on leave and nobody had been deputised. The variance itself was the operational pain. Product teams could not plan against "between six and twenty-two days," so they treated subscription requests as a Q-minus-one task and filed them weeks ahead of when they were actually needed, which meant by the time the sub landed nobody on the team remembered exactly what they had asked for.

The other thing the audit found: of the eleven subscriptions vended manually, eight had configuration drift from the documented landing zone. Three had the wrong Defender plans assigned. Two had Log Analytics pointed at the wrong workspace. One had been peered to the dev hub instead of the prod hub. None of these were anyone's mistake exactly; they were the natural cost of a manual handover where step seventeen lived in a runbook page that had been edited four times and never re-tested. If the work is manual, drift is inevitable. The only way to eliminate drift is to eliminate the manual.

The PR template that became the front door

We put the form in source control. The platform repo gained a directory called subscription-requests/ and a PR template at .github/PULL_REQUEST_TEMPLATE/new-subscription.md (the repo is on Azure Repos but the template format is the same). Filing a request means cloning the repo, copying a template file into the right folder, filling it in, and opening a PR. The PR itself becomes the audit record. Approvals on the PR are the audit trail. The merge is the trigger.
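From the requesting team's side the whole interaction is ordinary Git plus one CLI call. A minimal sketch of a request being filed (org, repo URL, and slugs here are illustrative, not the real ones):

# Clone the platform repo, branch, copy the template, fill it in, open the PR.
git clone https://dev.azure.com/contoso/platform/_git/platform
cd platform
git checkout -b request/payments-platform-prod

cp .github/PULL_REQUEST_TEMPLATE/new-subscription.md \
   subscription-requests/payments-platform-prod.md
# ...edit the copied file, filling every field...

git add subscription-requests/payments-platform-prod.md
git commit -m "Request: payments-platform prod subscription"
git push -u origin request/payments-platform-prod

# Open the PR from the web UI, or via the CLI (requires the azure-devops extension).
az repos pr create \
  --repository platform \
  --source-branch request/payments-platform-prod \
  --target-branch main \
  --title "New subscription: payments-platform prod"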

The template, the bit teams actually see:

# New subscription request

> Fill in every field. The pipeline validates every one; missing or wrong values fail the run.

## Business

- **Requesting team**: <slug, e.g. `payments-platform`>
- **Cost centre**: <six-digit code from FinOps register, e.g. `CC-104388`>
- **Product owner email**: <work email; gets the post-vend Teams notification>
- **Sensitivity classification**: `internal | confidential | restricted`

## Technical

- **Environment**: `dev | test | preprod | prod`
- **Networking tier**: `spoke-standard | spoke-isolated | spoke-public`
- **Region**: `uksouth | westeurope | eastus2` (one only)
- **Estimated monthly spend (USD)**: <integer; informs the alert thresholds>
- **Workload hint**: `aks | appservice | functions | data | analytics | other`

## Sign-off

- [ ] FinOps approver: cost centre is active
- [ ] Security approver: classification matches the workload's data category
- [ ] Platform approver: capacity exists in the requested region's hub

The three sign-off checkboxes are wired to branch policies: the PR cannot merge until a required reviewer from each of the three groups has approved. That is the same gate the old ticket queue gave us, except it happens on the PR in parallel, not sequentially across queues. We measured it: median time-to-three-approvals dropped from eleven days to forty minutes, because reviewers no longer had to context-switch between ServiceNow and their actual work.

The PR template is rendered into a real YAML document by a tiny Python script in the pre-build step. The script reads the markdown checkboxes and structured headings, parses them, and writes out subscription-requests/<slug>-<env>.yaml. That file is the input to the pipeline. The markdown is for humans; the YAML is for the pipeline. Both live in the same PR.

# subscription-requests/payments-platform-prod.yaml
schemaVersion: 2
team: payments-platform
costCentre: CC-104388
productOwner: jane.doe@contoso.com
sensitivity: confidential
environment: prod
networkingTier: spoke-standard
region: uksouth
estimatedMonthlyUsd: 12000
workloadHint: aks
approvers:
  finops: alex.k@contoso.com
  security: priya.s@contoso.com
  platform: obaro.o@contoso.com

That YAML is what the pipeline reads, validates, and acts on. If anyone wants to know why a subscription has the policy assignments it has, the answer is "look at the YAML in the platform repo." There is no other source of truth.

The pipeline, in four stages

The Azure Pipelines file lives at .azure/pipelines/vend-subscription.yml. It runs on merge to main of any commit that touches subscription-requests/. The four stages are Validate, Provision, Wire, and Notify. Each stage is gated by the one before it. The whole thing finishes in about nine minutes on a warm hub, ten on a cold one.

trigger:
  branches:
    include: [main]
  paths:
    include:
      - subscription-requests/**

pool:
  vmImage: ubuntu-latest

variables:
  - group: vending-platform
  - name: serviceConnection
    value: sc-platform-vending-root

stages:
  - stage: Validate
    displayName: 'Validate request'
    jobs:
      - job: ParseAndCheck
        steps:
          - checkout: self
            fetchDepth: 2
          - task: Bash@3
            displayName: 'Detect changed request file'
            inputs:
              targetType: inline
              script: |
                set -euo pipefail
                CHANGED=$(git diff --name-only HEAD~1 HEAD -- subscription-requests/ | grep '\.yaml$' | head -1)
                if [ -z "$CHANGED" ]; then echo "No request file changed"; exit 1; fi
                echo "##vso[task.setvariable variable=requestFile;isOutput=true]$CHANGED"
                cat "$CHANGED"
            name: detect
          - task: AzureCLI@2
            displayName: 'Validate cost centre, naming, region'
            inputs:
              azureSubscription: $(serviceConnection)
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                python3 scripts/validate-request.py \
                  --file "$(detect.requestFile)" \
                  --finops-registry "$FINOPS_REGISTRY_URL" \
                  --naming-convention scripts/naming.json \
                  --hub-capacity scripts/hub-capacity.json

  - stage: Provision
    displayName: 'Create subscription'
    dependsOn: Validate
    jobs:
      - job: VendSub
        variables:
          # Output variables from another stage have to be mapped in via
          # stageDependencies; $(detect.requestFile) only resolves inside Validate.
          requestFile: $[ stageDependencies.Validate.ParseAndCheck.outputs['detect.requestFile'] ]
        steps:
          - checkout: self
          - task: AzureCLI@2
            displayName: 'az account subscription alias create'
            inputs:
              azureSubscription: $(serviceConnection)
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                set -euo pipefail
                REQ=$(cat "$(requestFile)")
                TEAM=$(yq -r '.team' <<< "$REQ")
                ENV=$(yq -r '.environment' <<< "$REQ")
                ALIAS="sub-${TEAM}-${ENV}"
                BILLING_SCOPE="/providers/Microsoft.Billing/billingAccounts/${BILLING_ACCOUNT_ID}/enrollmentAccounts/${ENROLLMENT_ACCOUNT_ID}"

                az account subscription alias create \
                  --name "$ALIAS" \
                  --billing-scope "$BILLING_SCOPE" \
                  --display-name "${TEAM} ${ENV}" \
                  --workload Production

                SUB_ID=$(az account subscription alias show --name "$ALIAS" --query properties.subscriptionId -o tsv)
                echo "##vso[task.setvariable variable=newSubId;isOutput=true]$SUB_ID"
            name: vend
          - task: AzureCLI@2
            displayName: 'Wait for sub propagation'
            inputs:
              azureSubscription: $(serviceConnection)
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                set -euo pipefail
                SUB="$(vend.newSubId)"
                for i in $(seq 1 30); do
                  if az role assignment list --scope "/subscriptions/$SUB" >/dev/null 2>&1; then
                    echo "subscription $SUB reachable after ${i} polls"
                    exit 0
                  fi
                  echo "poll ${i}: not yet"; sleep 5
                done
                echo "subscription $SUB never became reachable"; exit 1

  - stage: Wire
    displayName: 'Apply landing zone'
    dependsOn:
      - Validate
      - Provision
    jobs:
      - job: LandingZone
        variables:
          requestFile: $[ stageDependencies.Validate.ParseAndCheck.outputs['detect.requestFile'] ]
          newSubId: $[ stageDependencies.Provision.VendSub.outputs['vend.newSubId'] ]
        steps:
          - checkout: self
          - task: AzureCLI@2
            displayName: 'az deployment sub create (landing zone)'
            inputs:
              azureSubscription: $(serviceConnection)
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                set -euo pipefail
                SUB="$(newSubId)"
                REQ="$(requestFile)"
                az account set --subscription "$SUB"
                # The request YAML is not an ARM parameters file, so lift the fields
                # the template expects into explicit key=value parameters.
                az deployment sub create \
                  --location uksouth \
                  --template-file infra/landing-zone/main.bicep \
                  --parameters \
                    newSubId="$SUB" \
                    team="$(yq -r '.team' "$REQ")" \
                    environment="$(yq -r '.environment' "$REQ")" \
                    networkingTier="$(yq -r '.networkingTier' "$REQ")" \
                    region="$(yq -r '.region' "$REQ")" \
                    sensitivity="$(yq -r '.sensitivity' "$REQ")" \
                    productOwner="$(yq -r '.productOwner' "$REQ")" \
                    costCentre="$(yq -r '.costCentre' "$REQ")"
          - task: AzureCLI@2
            displayName: 'Vend the Service Connection identity'
            inputs:
              azureSubscription: $(serviceConnection)
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                # vend-service-connection.sh takes positional args: sub id, team, env
                bash scripts/vend-service-connection.sh \
                  "$(newSubId)" \
                  "$(yq -r '.team' "$(requestFile)")" \
                  "$(yq -r '.environment' "$(requestFile)")"

  - stage: Notify
    displayName: 'Hand off to team'
    dependsOn:
      - Validate
      - Provision
      - Wire
    jobs:
      - job: PostSummary
        variables:
          requestFile: $[ stageDependencies.Validate.ParseAndCheck.outputs['detect.requestFile'] ]
          newSubId: $[ stageDependencies.Provision.VendSub.outputs['vend.newSubId'] ]
        steps:
          - task: AzureCLI@2
            displayName: 'Scaffold repo and post Teams summary'
            inputs:
              azureSubscription: $(serviceConnection)
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                bash scripts/scaffold-team-repo.sh "$(requestFile)" "$(newSubId)"
                bash scripts/post-teams-summary.sh "$(requestFile)" "$(newSubId)"

The shape, before the details: Validate refuses bad requests cheaply, before we have created anything in Azure. Provision uses the EA subscription alias API to create the subscription against our enrollment account. Wire deploys the landing-zone Bicep at subscription scope, then vends the federated-identity-backed Service Connection. Notify scaffolds the team's deploy repo and posts a markdown summary to the team's Teams channel via an incoming webhook connector.
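That last hop is the least clever part of the pipeline and the most appreciated. post-teams-summary.sh boils down to one POST against the channel's incoming webhook; a trimmed sketch, with the webhook variable name and message fields assumed rather than lifted from the real script:

#!/usr/bin/env bash
# Trimmed sketch of post-teams-summary.sh: build a markdown summary, POST it to
# the team's incoming webhook. TEAMS_WEBHOOK_URL would come from the variable group.
set -euo pipefail
REQ="$1"; SUB_ID="$2"

TEAM=$(yq -r '.team' "$REQ")
ENV=$(yq -r '.environment' "$REQ")

SUMMARY="**Your subscription is ready**
- Subscription id: \`${SUB_ID}\`
- Service Connection: \`sc-${TEAM}-${ENV}\`
- VNet address space, repo link, and first-deploy wiki page: see your new repo's README."

jq -n --arg text "$SUMMARY" '{text: $text}' \
  | curl -fsS -H "Content-Type: application/json" -d @- "$TEAMS_WEBHOOK_URL"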

The validation in Stage 1 is doing real work, not theatre. The cost-centre check hits an internal FinOps API and refuses any code that is not flagged active for the current fiscal year. The naming check enforces <team>-<env>-<region> casing and length limits so resource group names stay under the 90-character limit Azure imposes. The hub capacity check reads a JSON manifest in the repo (refreshed by a separate nightly job) and refuses requests for a region whose hub VNet has fewer than two free /22s left. None of these are slow; the whole Validate stage runs in 40 seconds. The point is that bad requests fail before we have spent any Azure money or polluted any state.
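The real checks live in scripts/validate-request.py. Expressed in shell for illustration (the FinOps endpoint path and the hub-capacity.json layout here are assumptions, not the actual script), they amount to roughly this:

#!/usr/bin/env bash
# Sketch of the three Validate-stage checks; the production version is Python.
set -euo pipefail
REQ="$1"

CC=$(yq -r '.costCentre' "$REQ")
TEAM=$(yq -r '.team' "$REQ")
ENV=$(yq -r '.environment' "$REQ")
REGION=$(yq -r '.region' "$REQ")

# 1. Cost centre must be flagged active for the current fiscal year.
STATUS=$(curl -fsS "${FINOPS_REGISTRY_URL}/cost-centres/${CC}" | jq -r '.status')
[ "$STATUS" = "active" ] || { echo "cost centre ${CC} is not active"; exit 1; }

# 2. Naming: lowercase slug, short enough that rg-<team>-<env>-workload-<region>
#    stays under Azure's 90-character resource group limit.
[[ "$TEAM" =~ ^[a-z0-9][a-z0-9-]{2,23}$ ]] || { echo "team slug '${TEAM}' fails naming rules"; exit 1; }
RG="rg-${TEAM}-${ENV}-workload-${REGION}"
[ "${#RG}" -le 90 ] || { echo "derived resource group name too long: ${RG}"; exit 1; }

# 3. Hub capacity: the requested region needs at least two free /22s.
FREE=$(jq -r --arg r "$REGION" '.[$r].free22s // 0' scripts/hub-capacity.json)
[ "$FREE" -ge 2 ] || { echo "hub in ${REGION} has ${FREE} free /22s; need at least 2"; exit 1; }

echo "request ${REQ} passed validation"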

The Bicep that stamps the landing zone

The landing-zone module is what Microsoft would call a subscription vending implementation. The interesting choices are which assignments to make global (every sub gets them, no exceptions) versus which to drive off the request YAML.

// infra/landing-zone/main.bicep
targetScope = 'subscription'

@description('The newly-vended subscription id, passed by the pipeline')
param newSubId string

@description('Parsed values from the request YAML')
param team string
param environment string
param networkingTier string
param region string
param sensitivity string
param productOwner string
param costCentre string

var mgName = environment == 'prod' ? 'mg-platform-prod' : 'mg-platform-nonprod'

// Move the sub into the right management group (this happens out-of-band via the
// management group association API; included here as a module call for clarity)
module mgAssociate 'modules/mg-associate.bicep' = {
  name: 'mg-associate'
  params: {
    subscriptionId: newSubId
    managementGroupId: mgName
  }
}

// Baseline resource groups
resource rgPlatform 'Microsoft.Resources/resourceGroups@2024-07-01' = {
  name: 'rg-${team}-${environment}-platform-${region}'
  location: region
  tags: {
    costCentre: costCentre
    sensitivity: sensitivity
    productOwner: productOwner
    managedBy: 'platform-vending'
  }
}

resource rgWorkload 'Microsoft.Resources/resourceGroups@2024-07-01' = {
  name: 'rg-${team}-${environment}-workload-${region}'
  location: region
  tags: {
    costCentre: costCentre
    sensitivity: sensitivity
    productOwner: productOwner
    managedBy: 'platform-vending'
  }
}

// Networking: the tier choice drives the spoke shape
module spoke 'modules/spoke-network.bicep' = {
  name: 'spoke-network'
  scope: rgPlatform
  params: {
    tier: networkingTier
    region: region
    hubVnetResourceId: hubVnetIdFor(region, environment) // helper resolving the hub VNet resource id for this region/environment (definition not shown)
    teamSlug: team
    envSlug: environment
  }
}

// Log Analytics, pointed at the platform monitoring workspace via diagnostic settings
module observability 'modules/observability.bicep' = {
  name: 'observability'
  scope: rgPlatform
  params: {
    teamSlug: team
    envSlug: environment
    region: region
    platformWorkspaceResourceId: platformWorkspaceId // central Log Analytics workspace in the monitoring sub (value supplied by platform config, not shown)
  }
}

// Defender plans, baseline policy assignments
module policy 'modules/policy-baseline.bicep' = {
  name: 'policy-baseline'
  params: {
    region: region
    environment: environment
    sensitivity: sensitivity
  }
}

The modules/policy-baseline.bicep is where the real defensive work happens. Every vended sub gets, without exception, four Azure Policy assignments:

// modules/policy-baseline.bicep
targetScope = 'subscription'

param region string
param environment string
param sensitivity string

var allowedLocations = environment == 'prod' ? [ 'uksouth' ] : [ 'uksouth', 'westeurope' ]

resource allowedLocationsAssignment 'Microsoft.Authorization/policyAssignments@2024-04-01' = {
  name: 'pa-allowed-locations'
  properties: {
    displayName: 'Allowed locations for resources'
    policyDefinitionId: '/providers/Microsoft.Authorization/policyDefinitions/e56962a6-4747-49cd-b67b-bf8b01975c4c'
    parameters: {
      listOfAllowedLocations: { value: allowedLocations }
    }
    enforcementMode: 'Default'
  }
}

// Only the costCentre assignment is shown here; the built-in definition takes a
// single tagName, so productOwner and sensitivity need their own assignments.
resource requiredTagsAssignment 'Microsoft.Authorization/policyAssignments@2024-04-01' = {
  name: 'pa-required-tags'
  properties: {
    displayName: 'Required tags: costCentre, productOwner, sensitivity'
    policyDefinitionId: '/providers/Microsoft.Authorization/policyDefinitions/871b6d14-10aa-478d-b590-94f262ecfa99'
    parameters: {
      tagName: { value: 'costCentre' }
    }
    enforcementMode: 'Default'
  }
}

resource defenderForServers 'Microsoft.Security/pricings@2024-01-01' = {
  name: 'VirtualMachines'
  properties: {
    pricingTier: 'Standard'
    subPlan: 'P2'
  }
}

resource defenderForStorage 'Microsoft.Security/pricings@2024-01-01' = {
  name: 'StorageAccounts'
  properties: {
    pricingTier: sensitivity == 'restricted' ? 'Standard' : 'Free'
  }
}

resource defenderForKeyVault 'Microsoft.Security/pricings@2024-01-01' = {
  name: 'KeyVaults'
  properties: { pricingTier: 'Standard' }
}

The sensitivity classification on the request drives whether Defender for Storage is on Standard or Free. The cost difference matters: for an analytics team with hundreds of accounts at the Standard price, this adds up. For a team handling restricted data, the protection is mandatory regardless of cost. The YAML drives this directly. Nobody on the platform team has to remember the policy.

The three networking tiers, and why

The networking tier choice on the request was the second-most-debated line. The first iteration of the vending pipeline had no tier at all; every sub got the same hub-peered, internet-via-Firewall spoke. Then a marketing team needed a public-facing site that needed direct egress to the public internet without going through the corporate Firewall, because the egress IPs were going to be advertised in their marketing copy and needed to be stable. Then a data engineering team needed a sub with no internet at all (egress for their pipelines went via Private Endpoints; the security team's threat model treated any internet egress as a finding). So we ended up with three tiers, all expressed in one Bicep module:

// modules/spoke-network.bicep
@allowed([ 'spoke-standard', 'spoke-isolated', 'spoke-public' ])
param tier string
param region string
param hubVnetResourceId string
param teamSlug string
param envSlug string

resource vnet 'Microsoft.Network/virtualNetworks@2024-05-01' = {
  name: 'vnet-${teamSlug}-${envSlug}-${region}'
  location: region
  properties: {
    addressSpace: {
      addressPrefixes: [ allocateSpokeRange(teamSlug, envSlug, region) ] // helper returning the /22 reserved for this spoke in the IP plan (not shown)
    }
    subnets: [
      {
        name: 'snet-workload'
        properties: {
          addressPrefix: allocateSubnetRange(teamSlug, envSlug, region, 'workload') // helper carving the workload subnet from the spoke range (not shown)
          routeTable: tier == 'spoke-public' ? null : { id: hubRouteTableId } // hubRouteTableId and baselineNsgId come from shared platform config (not shown)
          networkSecurityGroup: { id: baselineNsgId }
        }
      }
    ]
  }
}

resource peeringToHub 'Microsoft.Network/virtualNetworks/virtualNetworkPeerings@2024-05-01' = if (tier != 'spoke-public') {
  parent: vnet
  name: 'peer-to-hub'
  properties: {
    remoteVirtualNetwork: { id: hubVnetResourceId }
    allowVirtualNetworkAccess: true
    allowForwardedTraffic: true
    useRemoteGateways: tier == 'spoke-isolated' ? false : true
  }
}

Spoke-Standard is the default: hub-peered, egress via the corporate Firewall sitting in the hub, ingress via Front Door or Application Gateway in the hub. Spoke-Isolated has the peering but the workload subnet route table forces 0.0.0.0/0 to a black-hole next-hop; the only way out is Private Endpoints, which is what the data team wanted. Spoke-Public has no peering at all; the spoke is on its own, with its own outbound public IP, and the marketing team's site sits there. We have one Spoke-Public sub in production and one in test. Forty of the forty-seven are Spoke-Standard. Six are Spoke-Isolated. The split feels right; one tier would have been wrong, five tiers would have been more than we need.
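The black hole in spoke-isolated is nothing more exotic than a user-defined route whose next hop is None. It lives inside the spoke Bicep module in our build; the same thing expressed as az CLI, for illustration (resource names are made up):

# spoke-isolated: send all internet-bound traffic to a next hop of None, so the
# only way out of the spoke is a Private Endpoint.
az network route-table create \
  --resource-group rg-dataeng-prod-platform-uksouth \
  --name rt-blackhole-dataeng-prod \
  --location uksouth

az network route-table route create \
  --resource-group rg-dataeng-prod-platform-uksouth \
  --route-table-name rt-blackhole-dataeng-prod \
  --name deny-internet \
  --address-prefix 0.0.0.0/0 \
  --next-hop-type None

# Attach it to the workload subnet in place of the hub route table.
az network vnet subnet update \
  --resource-group rg-dataeng-prod-platform-uksouth \
  --vnet-name vnet-dataeng-prod-uksouth \
  --name snet-workload \
  --route-table rt-blackhole-dataeng-prod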

Vending the Service Connection, with no secrets

This is the part the requesting team cares about most, because it is the moment their pipeline can actually deploy. The script scripts/vend-service-connection.sh does four things, in order: create a new managed identity in the new subscription, give it Contributor on the workload resource group, create an Azure DevOps Service Connection bound to that identity using Workload Identity Federation, and write the federated credential back on the identity. The whole thing finishes in about 40 seconds.

#!/usr/bin/env bash
set -euo pipefail
NEW_SUB="$1"; TEAM="$2"; ENV="$3"
ORG="contoso"
PROJECT_NAME="$TEAM"

az account set --subscription "$NEW_SUB"

MI_NAME="mi-deploy-${TEAM}-${ENV}"
RG_NAME="rg-${TEAM}-${ENV}-workload-uksouth"
MI_ID=$(az identity create -g "$RG_NAME" -n "$MI_NAME" --query principalId -o tsv)
MI_CLIENT_ID=$(az identity show -g "$RG_NAME" -n "$MI_NAME" --query clientId -o tsv)

# Role assignment can hit the eventual-consistency window even after the
# Provision stage's propagation poll; retry, then fail loudly if it never lands.
ASSIGNED=0
for i in $(seq 1 6); do
  if az role assignment create \
       --assignee-object-id "$MI_ID" \
       --assignee-principal-type ServicePrincipal \
       --role Contributor \
       --scope "/subscriptions/${NEW_SUB}/resourceGroups/${RG_NAME}" >/dev/null 2>&1; then
    ASSIGNED=1
    break
  fi
  echo "role assignment attempt ${i} failed; retry in 5s"; sleep 5
done
[ "$ASSIGNED" -eq 1 ] || { echo "Contributor assignment for ${MI_NAME} never succeeded"; exit 1; }

PROJECT_ID=$(az devops project show --project "$PROJECT_NAME" --org "https://dev.azure.com/${ORG}" --query id -o tsv)

cat > /tmp/connection.json <<EOF
{
  "name": "sc-${TEAM}-${ENV}",
  "type": "azurerm",
  "url": "https://management.azure.com/",
  "authorization": {
    "scheme": "WorkloadIdentityFederation",
    "parameters": {
      "tenantid": "$(az account show --query tenantId -o tsv)",
      "serviceprincipalid": "${MI_CLIENT_ID}",
      "scope": "/subscriptions/${NEW_SUB}"
    }
  },
  "data": {
    "environment": "AzureCloud",
    "scopeLevel": "Subscription",
    "subscriptionId": "${NEW_SUB}",
    "subscriptionName": "${TEAM} ${ENV}",
    "creationMode": "Manual"
  },
  "isShared": false,
  "isReady": true,
  "serviceEndpointProjectReferences": [
    { "projectReference": { "id": "${PROJECT_ID}", "name": "${PROJECT_NAME}" }, "name": "sc-${TEAM}-${ENV}" }
  ]
}
EOF

# --resource is the Azure DevOps application id; without it az rest cannot work
# out which token audience to request for a dev.azure.com URL.
CREATED=$(az rest --method post \
  --uri "https://dev.azure.com/${ORG}/_apis/serviceendpoint/endpoints?api-version=7.1-preview.4" \
  --resource 499b84ac-1321-427f-aa17-267ca6975798 \
  --headers "Content-Type=application/json" \
  --body @/tmp/connection.json)

ISSUER=$(jq -r '.authorization.parameters.workloadIdentityFederationIssuer' <<< "$CREATED")
SUBJECT=$(jq -r '.authorization.parameters.workloadIdentityFederationSubject' <<< "$CREATED")

az identity federated-credential create \
  --identity-name "$MI_NAME" \
  --resource-group "$RG_NAME" \
  --name "ado-fed-cred" \
  --issuer "$ISSUER" \
  --subject "$SUBJECT" \
  --audiences "api://AzureADTokenExchange"

echo "Service Connection sc-${TEAM}-${ENV} created and federated."

There is no az ad sp create-for-rbac handing back a password, and no client secret anywhere in this flow. The team's pipeline will authenticate to their new sub via a token Azure DevOps mints at run time, exchanged through the federated credential for an Entra-issued access token. The full mechanics of this pattern are described in the Azure DevOps service connection docs and the Azure DevOps REST API for service endpoints. The relevant outcome: the new sub is deployable, and there is nothing to rotate.

The team's scaffolded repo lands in their Azure DevOps project with a minimal azure-pipelines.yml that references the new connection by name. They open a PR against main, see it run, watch the first deploy go green, and they are off. From their perspective the platform team did nothing; the merge of the request PR did everything.
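That starter azure-pipelines.yml is deliberately tiny; the point is a green run against the new Service Connection within minutes of the hand-off. A sketch of the kind of file scaffold-team-repo.sh writes (the exact contents beyond the connection name are assumed):

# Inside scaffold-team-repo.sh: drop a starter pipeline into the new repo.
# TEAM and ENV are the same values used to name the Service Connection.
cat > azure-pipelines.yml <<EOF
trigger:
  branches:
    include: [main]

pool:
  vmImage: ubuntu-latest

steps:
  - task: AzureCLI@2
    displayName: 'First deploy: prove the connection works'
    inputs:
      azureSubscription: sc-${TEAM}-${ENV}
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        az group show --name rg-${TEAM}-${ENV}-workload-uksouth -o table
EOF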

The gotcha: subscription alias creation is eventually consistent

The first time we ran the pipeline end-to-end against a real billing scope, the Validate and Provision stages passed and the Wire stage's first step (the landing-zone deployment) failed with SubscriptionNotFound on a role assignment 17 seconds after the sub was reported created. We re-ran the same pipeline against the same sub id 90 seconds later and it succeeded. The subscription had been created; the control plane had not yet propagated it to the regional endpoints. The az account subscription alias create call had returned a sub id, but that id was not yet usable for role assignment, deployment, or even tag operations.

This is documented behaviour. The EA subscription creation page mentions that propagation can take up to two minutes, but the language is gentle and the example code does not poll. In practice we saw a propagation window of 30 to 90 seconds; the Wire stage's first deployment would fail somewhere inside that window if we did not wait.

The fix is the polling step at the end of the Provision stage, between the alias creation and the Wire stage (it is in the YAML above, named "Wait for sub propagation"). It calls az role assignment list --scope /subscriptions/<newSubId> in a loop with a five-second sleep, up to 30 iterations. The first call that returns successfully (even with an empty list) signals that the new sub is real enough for the next deployment. Every vend we have run since adding this polling step has hit success between poll 4 and poll 12, which is 20 to 60 seconds of wait. That wait is the single biggest contributor to the nine-minute end-to-end time. Everything else is fast.

A second propagation gotcha sat behind this one. After the sub is reachable for role assignments, the management group association is also eventually consistent. We had the landing-zone Bicep doing its first Microsoft.Authorization/policyAssignments call within seconds of the sub being moved into the right management group, and occasionally the policy assignment would fire before the management-group inheritance had taken effect, producing a PolicyInheritanceConflict that was non-deterministic across runs. The fix was less elegant: a 20-second hard sleep between the management-group association step and the policy assignment step. There is no clean API to poll for "management group inheritance has fully propagated"; 20 seconds was the empirical floor. I am not proud of the sleep; it works.

Troubleshooting: the errors we actually saw

SubscriptionAliasCreationFailed: The provided billing scope is not authorized for this principal was the first one and the most confusing. The identity running the vending pipeline (the platform vending Service Connection's federated identity) needs the Enrollment Account Subscription Creator role on the enrollment account in the EA, not just owner at a management group. The fix was a one-time role assignment that the EA admin had to make from the Cost Management portal; we documented it in the platform runbook. Without it, every vending attempt failed with the same misleading "not authorized" message that I initially misread as a problem with the management group permissions.

BillingScopeNotAuthorized is the related sibling, and it has nothing to do with the principal's authorization. It means the billing scope id in the API call is malformed or refers to an enrollment account in a different tenant. The fix is to read the billing scope id off the Cost Management portal directly, not derive it. The format is /providers/Microsoft.Billing/billingAccounts/<billingAccountId>/enrollmentAccounts/<enrollmentAccountId> and the two ids are not the same id; if you copy the billing account id into the enrollment account slot, this error fires.

RoleAssignmentExists - principal 11111111-2222-3333-4444-555555555555 already has Reader on /providers/Microsoft.Management/managementGroups/mg-platform-prod was the error we hit when re-running a vending pipeline after a partial failure. The principal already had the role from the first run; the second run's role assignment was a duplicate. The fix in the script is to wrap each az role assignment create in an if that first checks az role assignment list for an existing match at the target scope. We also tightened the pipeline's failure modes so partial failures clean up properly; the relevant teardown lives in a cleanup-failed-vend.sh that the pipeline triggers on stage failure.
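The guard itself is a few lines. A sketch of the shape (the helper name is mine, not from the repo):

# Idempotent role assignment: skip the create when an identical assignment already
# exists at the target scope.
ensure_role_assignment() {
  local assignee_object_id="$1" role="$2" scope="$3"
  local existing
  existing=$(az role assignment list \
    --assignee "$assignee_object_id" \
    --role "$role" \
    --scope "$scope" \
    --query "[0].id" -o tsv)
  if [ -n "$existing" ]; then
    echo "role assignment already present: ${role} on ${scope}"
    return 0
  fi
  az role assignment create \
    --assignee-object-id "$assignee_object_id" \
    --assignee-principal-type ServicePrincipal \
    --role "$role" \
    --scope "$scope"
}

# e.g. ensure_role_assignment "$MI_ID" Contributor "/subscriptions/${NEW_SUB}/resourceGroups/${RG_NAME}"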

Operation returned an invalid status code 'NotFound' from az deployment sub create is, nine times out of ten, the eventual-consistency window I described above firing before the polling caught it. If you see it once, look at the timestamp gap between alias creation and the failed deploy; if it is under 90 seconds, the polling step needs more iterations or a longer sleep.

InsufficientHubAddressSpace was a self-inflicted one. We had pre-calculated the hub's available /22s in a JSON manifest, but the manifest was stale by two months because the nightly refresh job had been silently failing. The vending pipeline cheerfully picked a range, deployed the peering, and a real engineer noticed three days later that the new spoke was overlapping with an existing one. The fix was to add a live check in the Validate stage that queries the hub's actual peerings via az network vnet peering list, rather than trusting the cached manifest. The cached manifest is still there as a fast-path; the live check is the safety net.
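The live check is one query against the hub plus a comparison. A sketch, with the hub names assumed (a production version should compare CIDR overlap properly rather than exact prefixes):

# Refuse a candidate spoke range that any existing hub peering already uses.
set -euo pipefail
CANDIDATE_PREFIX="$1"   # e.g. 10.64.12.0/22
HUB_RG="rg-hub-prod-uksouth"
HUB_VNET="vnet-hub-prod-uksouth"

IN_USE=$(az network vnet peering list \
  --resource-group "$HUB_RG" \
  --vnet-name "$HUB_VNET" \
  --query "[].remoteAddressSpace.addressPrefixes[]" -o tsv)

# Exact-prefix match only; a fuller check would test for overlapping ranges too.
if grep -qxF "$CANDIDATE_PREFIX" <<< "$IN_USE"; then
  echo "prefix ${CANDIDATE_PREFIX} is already peered into the hub"; exit 1
fi
echo "prefix ${CANDIDATE_PREFIX} is free according to live hub state"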

The audit story

Every action the vending pipeline takes is recorded in three places by virtue of how the pipeline is shaped, not by any deliberate auditing work. The PR in the platform repo is the source of truth for what was requested and who approved it. The Azure DevOps pipeline run log is the source of truth for what the pipeline did, when, with which inputs. The Activity Log on the management group is the source of truth for what Azure executed, against which scope, under which identity.

When the security team comes asking why subscription b7d3... exists and who authorised it, the answer is one URL: the merged PR. The PR contains the YAML, the three approvals, the pipeline run id, and the resulting subscription id. The Activity Log at the management group corroborates that the pipeline's identity (and only the pipeline's identity, with the smallest scope sufficient to do the work) acted on the management group. The conversation that used to take forty minutes of pulling screenshots from Service Now now takes one minute.

The Defender for Cloud secure score for vended subs is, by construction, identical across all forty-seven. We pulled the report after eight months. Every sub had the same baseline policies, the same Defender plans (modulated by sensitivity, as designed), the same NSG defaults, the same diagnostic settings pointing at the same Log Analytics workspace. Zero drift. The original audit finding (eight of eleven manually-vended subs had drift) became, on the next audit cycle, zero of forty-seven. That is the security argument for the work, and it landed.

Where we ended up

Forty-seven subscriptions in eight months. The pipeline has been run sixty-three times across that period; sixteen of those runs failed and rolled back cleanly, almost all of them on a Validate-stage rejection (wrong cost-centre code, region the team did not have hub capacity in, sensitivity classification that did not match the workload hint), which is exactly what the validation is for. Of the runs that passed Validate, only two failed downstream and both were on the propagation window before we added the polling step.

The platform team's calendar time on subscription provisioning, measured by ticket-handling hours: previously about twelve hours a week across four engineers. Currently about ninety minutes a month, almost all of which is the rare PR that needs a real conversation about whether a team should really have a Spoke-Public tier. That is a labour reduction of more than ninety-five percent. The four engineers spent the recovered time building the audit-fix pipeline that turns Defender findings into PRs against the relevant team's repo, which is its own write-up.

The product teams' calendar time on getting a subscription, end to end: previously about fourteen working days, now nine minutes for the pipeline plus however long it takes to gather three approvals on the PR (currently a forty-minute median). That ratio is the part of this I think about most. The work we eliminated was not the platform team's work; it was the requesting team's wait. Every product team that uses a vended sub got their two weeks back. The org has roughly thirty active product teams; conservatively, that is sixty team-weeks reclaimed per year. That is the size of a small product line, recovered from a single platform investment.

The reflective part: the easy story to tell is "we wrote a pipeline, it saved time, here are the metrics." The harder, truer story is that the form was the bottleneck, not the work. Subscription provisioning is not actually difficult. The Bicep that stamps a landing zone is two hundred lines. The pipeline is four stages. The federated credential dance is forty seconds of script. None of these are hard problems. What was hard was making the request into a software artefact (a PR) rather than a human-routed ticket. Once it was a PR, the obvious thing to do was run a pipeline against it, and the obvious place to deposit the answer was a Teams message. The hardness was organisational, not technical. The pipeline is the artefact; the artefact people will read is this article. But the actual product is the PR template. Without it, the pipeline would be a faster ticket. With it, the pipeline is a self-service product.

If I had to do this work again on a fresh org, I would start with the PR template before I wrote a single line of Bicep. The validation rules in Stage 1 are the part that prevents the new world from drifting back toward the old one. The Bicep is replaceable; the YAML schema is the contract. The pipeline implements the contract; the team implements the pipeline. As long as the contract is enforced, the rest is plumbing, and plumbing is the easy part.