
Hub-and-spoke with private endpoints across three subscriptions: the Private DNS Zone wiring we got wrong

The ticket landed at 09:14 on a Thursday and it was the third one that month. An AKS pod in the workloads sub was getting a public IP back for a storage account that had public access disabled. The private endpoint existed. The DNS A record was right. Resolution still went public, because the Private DNS Zones in the platform sub were linked only to the hub VNet, not to the spokes. This is the rebuild: twenty-one private endpoints across three subs, every spoke linked to every zone, and a pipeline gate that fails any PR that adds a private endpoint without wiring its DNS.


The ticket landed at 09:14 on a Thursday. "Can we whitelist mystorage.blob.core.windows.net in the firewall? Our AKS pods in aks-workloads-prod keep getting connection refused from storage." It was the third one that month. Two weeks earlier it had been Key Vault. Three weeks before that it had been SQL. Every time, a developer had hit a public IP from a workload that had been built specifically so it would not see public IPs, and every time the fix had been a temporary firewall rule that nobody owned and nobody removed.

The third ticket was the one I refused to close the same way. I pulled up the AKS node, exec'd into a debug pod, and ran nslookup mystorage.privatelink.blob.core.windows.net. It came back with a public IP in the 52.x range. The storage account had publicNetworkAccess: disabled. The private endpoint existed. The DNS A record in the private zone was correct. None of it mattered because the spoke VNet the AKS pod was sitting in had never been linked to the private DNS zone in the first place.

This article is the rebuild. Three Azure subscriptions, twenty-one private endpoints, one shared set of Private DNS Zones in the platform sub, every spoke linked to every zone, and a pipeline gate that fails any pull request that adds a private endpoint without the matching DNS wiring. Four months in, zero developer tickets about storage connectivity, and the words "whitelist storage in the firewall" have not been typed in our Teams channel since.

The architecture that was supposed to work

Three subscriptions, the standard Cloud Adoption Framework shape. sub-platform-hub holds the hub VNet, Azure Firewall, ExpressRoute gateway, and shared services. sub-workloads-prod and sub-workloads-nonprod each hold a spoke VNet for their AKS clusters and the supporting PaaS resources. Spokes are peered to the hub with useRemoteGateways: true, and the hub is peered back with allowGatewayTransit: true. The Firewall in the hub is the default route for everything egressing the spokes.

Bicep for the hub VNet, the bit that matters:

// modules/hub-vnet.bicep
param location string = 'westeurope'
param hubAddressSpace string = '10.0.0.0/22'

resource hubVnet 'Microsoft.Network/virtualNetworks@2024-01-01' = {
  name: 'vnet-hub-platform-weu'
  location: location
  properties: {
    addressSpace: {
      addressPrefixes: [ hubAddressSpace ]
    }
    subnets: [
      {
        name: 'AzureFirewallSubnet'
        properties: { addressPrefix: '10.0.0.0/26' }
      }
      {
        name: 'GatewaySubnet'
        properties: { addressPrefix: '10.0.0.64/27' }
      }
      {
        name: 'snet-dns-inbound'
        properties: {
          addressPrefix: '10.0.1.0/28'
          delegations: [
            {
              name: 'Microsoft.Network.dnsResolvers'
              properties: { serviceName: 'Microsoft.Network/dnsResolvers' }
            }
          ]
        }
      }
      {
        name: 'snet-shared-services'
        properties: { addressPrefix: '10.0.2.0/24' }
      }
    ]
  }
}

output hubVnetId string = hubVnet.id
output hubVnetName string = hubVnet.name

The two spoke VNets are deployed from the same module with different parameters, and each one sits in its own subscription. Bicep crosses the subscription boundary at the module instantiation layer, not inside the module itself:

// main.bicep, deployed from sub-platform-hub
targetScope = 'subscription'

param prodSubId string
param nonProdSubId string

module hub 'modules/hub-vnet.bicep' = {
  scope: resourceGroup('rg-platform-network-weu')
  name: 'hub-vnet'
}

module spokeProd 'modules/spoke-vnet.bicep' = {
  scope: resourceGroup(prodSubId, 'rg-workloads-prod-network-weu')
  name: 'spoke-prod-vnet'
  params: {
    addressSpace: '10.10.0.0/16'
    hubVnetId: hub.outputs.hubVnetId
    spokeName: 'vnet-spoke-prod-weu'
  }
}

module spokeNonProd 'modules/spoke-vnet.bicep' = {
  scope: resourceGroup(nonProdSubId, 'rg-workloads-nonprod-network-weu')
  name: 'spoke-nonprod-vnet'
  params: {
    addressSpace: '10.20.0.0/16'
    hubVnetId: hub.outputs.hubVnetId
    spokeName: 'vnet-spoke-nonprod-weu'
  }
}

The peering, the part that everyone gets right and that turns out to be only half the story:

// modules/peering.bicep
param hubVnetName string
param spokeVnetId string
param spokeName string

resource hubToSpoke 'Microsoft.Network/virtualNetworks/virtualNetworkPeerings@2024-01-01' = {
  name: '${hubVnetName}/peer-to-${spokeName}'
  properties: {
    remoteVirtualNetwork: { id: spokeVnetId }
    allowVirtualNetworkAccess: true
    allowForwardedTraffic: true
    allowGatewayTransit: true
    useRemoteGateways: false
  }
}

The reverse peering on the spoke flips useRemoteGateways to true and allowGatewayTransit to false. The pattern is documented well by Microsoft for hub-and-spoke topology and virtual network peering, and that part of our design had been correct from day one.
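
For completeness, the spoke side as a sketch; our real module is not reproduced in this article, so the parameter names are illustrative, but the two flipped flags are the point:

// spoke-to-hub peering, deployed into the spoke's subscription (sketch)
param spokeVnetName string
param hubVnetId string

resource spokeToHub 'Microsoft.Network/virtualNetworks/virtualNetworkPeerings@2024-01-01' = {
  name: '${spokeVnetName}/peer-to-hub'
  properties: {
    remoteVirtualNetwork: { id: hubVnetId }
    allowVirtualNetworkAccess: true
    allowForwardedTraffic: true
    allowGatewayTransit: false // the spoke has no gateway to offer
    useRemoteGateways: true // send on-prem traffic via the hub's ExpressRoute gateway
  }
}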

What had not been correct was the assumption that peering carries DNS resolution with it. Peering does not carry DNS. Peering carries packets. DNS is a separate problem and we had solved it wrong.

How DNS was wired (and why pods were getting public IPs)

The original design landed the Private DNS Zones in the platform sub, in a resource group called rg-platform-dns. The zones were named correctly: privatelink.blob.core.windows.net, privatelink.vaultcore.azure.net, privatelink.database.windows.net, and so on. Each zone had a virtual network link, but the link pointed at the hub VNet only. The spoke VNets had a custom DNS server configured at the VNet level, pointing at an old domain controller in the shared services subnet of the hub.

That domain controller was forwarding everything it did not know to 8.8.8.8, which meant a query from a pod in the prod spoke for mystorage.blob.core.windows.net went:

  1. Pod asks the CoreDNS instance in AKS for mystorage.blob.core.windows.net.
  2. CoreDNS forwards to the VNet's configured DNS server (the DC).
  3. The DC has no zone for privatelink.blob.core.windows.net. It forwards upstream.
  4. Public DNS returns a CNAME chain ending at the public IP for the storage account.
  5. The pod opens a TCP connection to a public IP. The storage account has publicNetworkAccess: disabled. The connection is rejected. The developer files a ticket.

The trick that throws people off here is that the Private DNS Zone existed and had the right A records. You could go into the platform sub, open the zone, and see mystorage pointing at 10.10.5.4. None of that mattered because the query never reached the zone. It reached the DC instead, which is what the spoke VNet had been configured to use.

Microsoft's documentation on DNS resolution for private endpoints is explicit about this, and it took me a re-read to internalise: the Private DNS Zone is only consulted when the VNet doing the query has a virtualNetworkLinks entry pointing at the zone. Peering does not propagate that link. You can have ten VNets all peered to a hub, and if only the hub is linked to the zone, the other nine resolve publicly.

I confirmed it the slow way:

# from a debug pod in aks-workloads-prod
$ kubectl run -it --rm debug --image=nicolaka/netshoot -- bash

bash-5.1# nslookup mystorage.privatelink.blob.core.windows.net
Server:    10.0.2.4
Address:   10.0.2.4#53

Non-authoritative answer:
mystorage.privatelink.blob.core.windows.net  canonical name = blob.dub09prdstr11a.store.core.windows.net
Name:    blob.dub09prdstr11a.store.core.windows.net
Address: 52.239.221.4

The 10.0.2.4 server is the DC in the hub. The answer is a public IP. From the same pod, against Azure-provided DNS:

bash-5.1# nslookup mystorage.privatelink.blob.core.windows.net 168.63.129.16
Server:    168.63.129.16
Address:   168.63.129.16#53

** server can't find mystorage.privatelink.blob.core.windows.net: NXDOMAIN

168.63.129.16 is the Azure-provided DNS endpoint that every VNet has access to. It returned NXDOMAIN because the prod spoke was not linked to the zone. Azure-provided DNS is only zone-aware for VNets that are linked. Until the link existed, the spoke was on its own.

The fix has two parts. Link every zone to every spoke. Stop pointing the spokes at the DC and let Azure-provided DNS do the work, optionally fronted by a centralised DNS Private Resolver in the hub so the on-prem side has something to forward to. The Private DNS Zones themselves stay in the platform sub. The links cross subscriptions.

The corrected DNS wiring

I rewrote the DNS module to take an array of every spoke VNet id and link every zone to every one of them. The crucial property is registrationEnabled: false. Auto-registration is a feature for VMs joining a zone, not for private endpoints, and turning it on would cause every NIC in the linked VNet to get an A record in the zone, which is not what we want.

// modules/private-dns.bicep
targetScope = 'resourceGroup'

@description('All spoke VNet resource ids, across subscriptions.')
param spokeVnetIds array

@description('Hub VNet resource id, for completeness.')
param hubVnetId string

var allLinkedVnets = union([ hubVnetId ], spokeVnetIds)

var privateLinkZones = [
  'privatelink.blob.core.windows.net'
  'privatelink.dfs.core.windows.net'
  'privatelink.file.core.windows.net'
  'privatelink.queue.core.windows.net'
  'privatelink.table.core.windows.net'
  'privatelink.vaultcore.azure.net'
  'privatelink.database.windows.net'
  'privatelink.azurecr.io'
  'privatelink.documents.azure.com'
  'privatelink.servicebus.windows.net'
  'privatelink.monitor.azure.com'
  'privatelink.oms.opinsights.azure.com'
  'privatelink.ods.opinsights.azure.com'
  'privatelink.agentsvc.azure-automation.net'
]

resource zones 'Microsoft.Network/privateDnsZones@2024-06-01' = [for zoneName in privateLinkZones: {
  name: zoneName
  location: 'global'
}]

// Bicep does not allow a for-expression inside a function call, so the
// (zone, vnet) cross product is built with an outer loop and an inner
// map(), then flattened into one pair per link.
var zoneVnetPairsNested = [for (zoneName, i) in privateLinkZones: map(allLinkedVnets, vnetId => {
  zoneIndex: i
  vnetId: vnetId
})]
var zoneVnetPairs = flatten(zoneVnetPairsNested)

resource links 'Microsoft.Network/privateDnsZones/virtualNetworkLinks@2024-06-01' = [for pair in zoneVnetPairs: {
  parent: zones[pair.zoneIndex]
  name: 'link-${uniqueString(pair.vnetId)}'
  location: 'global'
  properties: {
    registrationEnabled: false
    virtualNetwork: {
      id: pair.vnetId
    }
  }
}]

The outer loop with the inner map, flattened into pairs, is unattractive but readable once you sit with it. It produces a one-dimensional array where every (zone, vnet) pair becomes one link resource. With three VNets and fourteen zones, that is forty-two link resources, each one a single API call at deploy time. Bicep handles the parallelism well; the whole module deploys in about ninety seconds.

Critically, the same module also covers the storage variants. Blob and DFS are different zones. Cosmos has documents.azure.com. ACR is azurecr.io. Service Bus and Event Hubs share servicebus.windows.net. Miss one and that one resource type resolves publicly while everything else looks fine, which is the kind of bug that takes a week to find because the obvious checks all pass. The full list of Private DNS Zone names is on Microsoft Learn and I keep it open in a browser tab any time I add a new service to the stack.

A worked example: storage account with a private endpoint

This is the pattern every PaaS resource now follows in our codebase. The resource itself, its private endpoint, and the DNS A record that pins the endpoint NIC into the right zone.

// modules/storage-with-pe.bicep
param storageName string
param location string = resourceGroup().location
param peSubnetId string
param privateDnsZoneIdBlob string

resource sa 'Microsoft.Storage/storageAccounts@2024-01-01' = {
  name: storageName
  location: location
  sku: { name: 'Standard_ZRS' }
  kind: 'StorageV2'
  properties: {
    publicNetworkAccess: 'Disabled'
    allowBlobPublicAccess: false
    minimumTlsVersion: 'TLS1_2'
    networkAcls: {
      defaultAction: 'Deny'
      bypass: 'AzureServices'
    }
  }
}

resource pe 'Microsoft.Network/privateEndpoints@2024-01-01' = {
  name: 'pe-${storageName}-blob'
  location: location
  properties: {
    subnet: { id: peSubnetId }
    privateLinkServiceConnections: [
      {
        name: 'plsc-blob'
        properties: {
          privateLinkServiceId: sa.id
          groupIds: [ 'blob' ]
        }
      }
    ]
  }
}

resource peDnsGroup 'Microsoft.Network/privateEndpoints/privateDnsZoneGroups@2024-01-01' = {
  parent: pe
  name: 'default'
  properties: {
    privateDnsZoneConfigs: [
      {
        name: 'config-blob'
        properties: {
          privateDnsZoneId: privateDnsZoneIdBlob
        }
      }
    ]
  }
}

The privateDnsZoneGroups sub-resource is what actually registers the A record. Without it, the private endpoint exists, the NIC has a private IP, but there is no DNS entry anywhere and resolution still goes public. This is the part that most homegrown modules forget. The portal does it for you when you click through, which is why this works in dev environments and then quietly breaks in production where everything is Bicep.

The privateDnsZoneIdBlob parameter is the full resource id of the zone in the platform sub, passed in from the parent module. Because it is a cross-subscription reference, the Bicep deployment needs to be running with a service principal that has, at minimum, Private DNS Zone Contributor on the zone's resource group and Network Contributor on the spoke's subnet resource group. We grant both via Workload Identity Federation, scoped tightly to those two RGs.
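
It is worth pinning down what "passed in from the parent module" looks like, because the cross-subscription resourceId is the piece that tends to get mistyped. A sketch of the call site in main.bicep; platformSubId, the data resource group name, and the vnetId output on the spoke module are assumptions for illustration, the rest matches the code above:

// main.bicep; platformSubId, the data RG name, and spokeProd.outputs.vnetId
// are illustrative
param platformSubId string

module storageProd 'modules/storage-with-pe.bicep' = {
  scope: resourceGroup(prodSubId, 'rg-workloads-prod-data-weu')
  name: 'storage-prod'
  params: {
    storageName: 'mystorage'
    peSubnetId: '${spokeProd.outputs.vnetId}/subnets/snet-pe'
    privateDnsZoneIdBlob: resourceId(platformSubId, 'rg-platform-dns', 'Microsoft.Network/privateDnsZones', 'privatelink.blob.core.windows.net')
  }
}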

What the resolution chain looks like once it works

After the fix, the same pod, the same query:

bash-5.1# nslookup mystorage.privatelink.blob.core.windows.net
Server:    168.63.129.16
Address:   168.63.129.16#53

Non-authoritative answer:
Name:    mystorage.privatelink.blob.core.windows.net
Address: 10.10.5.4

And the same query for the non-privatelink hostname, which is what application code actually uses:

bash-5.1# nslookup mystorage.blob.core.windows.net
Server:    168.63.129.16
Address:   168.63.129.16#53

mystorage.blob.core.windows.net  canonical name = mystorage.privatelink.blob.core.windows.net.
Name:    mystorage.privatelink.blob.core.windows.net
Address: 10.10.5.4

The CNAME chain is what makes this work. Azure's public DNS returns a CNAME from mystorage.blob.core.windows.net to mystorage.privatelink.blob.core.windows.net. The Private DNS Zone, linked to your VNet, resolves the privatelink hostname to the private IP. Application code never has to know whether it is talking to a private endpoint or not; the same hostname works in both shapes, which is the whole point of the design.

The 10.10.5.4 address is the NIC for the private endpoint, sitting in the snet-pe subnet of the prod spoke. From the pod, a TCP connect to port 443 on that IP works. The storage account sees the connection coming from the private endpoint and accepts it. publicNetworkAccess: disabled is doing its job because nothing is trying to reach the public endpoint any more.

The DNS server problem on the VNet

There was one more thing to fix. The spoke VNets had custom DNS servers configured at the VNet level, pointing at the old DC. Even after linking the zones, queries would still go to the DC first. The DC would not know about the privatelink zones. The DC would forward upstream. Public IP, again.

The fix is to either set the VNet DNS to Azure-provided (which is the default) or to point it at an Azure DNS Private Resolver running in the hub. We went with the resolver because the on-prem side needs to resolve some of these names too, and Private Resolver gives us a fixed inbound endpoint IP we can put in the on-prem DNS forwarder.

The Bicep for the resolver, in the hub:

resource resolver 'Microsoft.Network/dnsResolvers@2022-07-01' = {
  name: 'dnspr-hub-weu'
  location: location
  properties: {
    virtualNetwork: { id: hubVnetId }
  }
}

resource inboundEndpoint 'Microsoft.Network/dnsResolvers/inboundEndpoints@2022-07-01' = {
  parent: resolver
  name: 'inbound'
  location: location
  properties: {
    ipConfigurations: [
      {
        privateIpAllocationMethod: 'Dynamic'
        subnet: { id: '${hubVnetId}/subnets/snet-dns-inbound' }
      }
    ]
  }
}

After deployment, the inbound endpoint comes up with a static-looking IP (it is technically dynamic but does not change). On the spoke VNet, we set DNS servers to that IP plus 168.63.129.16 as a backup:

// dhcpOptions cannot be patched through an 'existing' reference; the spoke
// VNet has to be redeployed with its DNS servers set, which in our codebase
// is a dnsServers parameter on the spoke module:
// dhcpOptions: { dnsServers: [ '10.0.1.4', '168.63.129.16' ] }
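
The spoke module never appears in full in this article, so here is a minimal sketch of the shape it ended up with, dnsServers included. The subnet layout is illustrative (the PE subnet lines up with the 10.10.5.x addresses seen earlier), and peering is handled by the separate peering module:

// modules/spoke-vnet.bicep (minimal sketch; subnet layout illustrative)
param location string = resourceGroup().location
param spokeName string
param addressSpace string
param hubVnetId string // consumed by the peering deployment, not used here
param dnsServers array = [] // an empty list falls back to Azure-provided DNS

resource spokeVnet 'Microsoft.Network/virtualNetworks@2024-01-01' = {
  name: spokeName
  location: location
  properties: {
    addressSpace: { addressPrefixes: [ addressSpace ] }
    dhcpOptions: { dnsServers: dnsServers }
    subnets: [
      {
        name: 'snet-aks'
        properties: { addressPrefix: cidrSubnet(addressSpace, 20, 1) }
      }
      {
        name: 'snet-pe'
        properties: {
          addressPrefix: cidrSubnet(addressSpace, 24, 5)
          privateEndpointNetworkPolicies: 'Disabled'
        }
      }
    ]
  }
}

output vnetId string = spokeVnet.id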

Once the spoke VNet's DNS is pointed at the resolver inbound endpoint, and the resolver itself is on the hub which is linked to the zones, queries from anywhere in the spoke resolve correctly. The on-prem side then forwards privatelink zones to 10.0.1.4 over ExpressRoute, and on-prem machines see the same private IPs as cloud workloads. Symmetry is the whole game.

The pipeline gate that stops the next ticket

The most important thing I built after the fix was the policy gate. Bicep is forgiving. A developer can add a Microsoft.Network/privateEndpoints resource and forget the privateDnsZoneGroups sub-resource, and Azure will happily deploy it. The deployment will succeed. The resource will exist. Resolution will go public. The next ticket will arrive in a fortnight when somebody notices.

The gate runs on every PR in the platform repo. It compiles the Bicep to ARM JSON via az bicep build, loads the resulting template, walks the resources, and fails the build if any private endpoint is missing its DNS zone group, or if a zone's virtual network links do not cover all three VNets (both spokes and the hub).

The script, in Python because the team reads Python better than Bash and the JSON parsing is cleaner:

#!/usr/bin/env python3
# scripts/policy-gate-pe-dns.py
import json, sys, glob, subprocess

REQUIRED_SPOKE_VNETS = {
    # the hub belongs in this set too, despite the variable name: the
    # resolver lives there and every zone must be linked to all three VNets
    "vnet-spoke-prod-weu",
    "vnet-spoke-nonprod-weu",
    "vnet-hub-platform-weu",
}

def compile_bicep(path: str) -> dict:
    out = subprocess.run(
        ["az", "bicep", "build", "--file", path, "--stdout"],
        capture_output=True, text=True, check=True
    )
    return json.loads(out.stdout)

def find_resources(template: dict, type_name: str) -> list[dict]:
    return [
        r for r in template.get("resources", [])
        if r.get("type", "").lower() == type_name.lower()
    ]

def check_template(path: str) -> list[str]:
    errors = []
    tpl = compile_bicep(path)
    pes = find_resources(tpl, "Microsoft.Network/privateEndpoints")

    for pe in pes:
        pe_name = pe.get("name", "<unnamed>")
        # Child resources compile to top-level entries in the ARM JSON;
        # matching on the parent's name as a substring assumes our naming
        # conventions, which hold across the repo's modules.
        children = [
            r for r in tpl.get("resources", [])
            if r.get("type", "").lower().startswith(
                "microsoft.network/privateendpoints/privatednszonegroups"
            )
            and pe_name in r.get("name", "")
        ]
        if not children:
            errors.append(
                f"{path}: privateEndpoint '{pe_name}' has no "
                f"privateDnsZoneGroups child. Resolution will go public."
            )

    links = find_resources(
        tpl, "Microsoft.Network/privateDnsZones/virtualNetworkLinks"
    )
    zones_linked_to = {}
    for link in links:
        parent = link.get("name", "").split("/")[0]
        vnet_ref = (
            link.get("properties", {})
            .get("virtualNetwork", {})
            .get("id", "")
        )
        for v in REQUIRED_SPOKE_VNETS:
            if v in vnet_ref:
                zones_linked_to.setdefault(parent, set()).add(v)

    for zone, vnets in zones_linked_to.items():
        missing = REQUIRED_SPOKE_VNETS - vnets
        if missing:
            errors.append(
                f"{path}: zone '{zone}' is not linked to {sorted(missing)}. "
                f"Spoke workloads will resolve publicly."
            )

    return errors

def main():
    all_errors = []
    for path in glob.glob("infra/**/*.bicep", recursive=True):
        try:
            all_errors.extend(check_template(path))
        except subprocess.CalledProcessError as e:
            all_errors.append(f"{path}: bicep build failed: {e.stderr}")

    if all_errors:
        print("Private DNS gate failed:")
        for e in all_errors:
            print(f"  - {e}")
        sys.exit(1)

    print("Private DNS gate passed.")

if __name__ == "__main__":
    main()

The gate is wired into Azure Pipelines as a step that runs on every PR build. It does not run during merge to main, only at PR time, so the feedback gets to the developer while they are still holding the change in their head. The whole script runs in under three minutes against the full repo (forty-odd Bicep files).

The first run after I added it found four files that had been merged with missing DNS wiring. None of them were in active use, two were templates copy-pasted from internal samples, one was a leftover from a proof of concept. We fixed them in a single PR. The gate has fired in anger three times since, all PRs where a developer added a new resource type (Cognitive Services, then later AI Search, then a new Event Grid topic) and did not realise it needed its own privatelink zone. Each time, the gate told them exactly which zone was missing and where to add the link. Each time, the PR was fixed and merged within an hour.

A scary moment, and the new ordering rule

The scariest day in this whole project came about a week before the gate went in. I was migrating an older storage account from "public with firewall rules" to "private endpoint only". I added the PE, watched it come up healthy in the portal, set publicNetworkAccess: disabled on the storage account, and walked off to make coffee. The workload that depended on it broke about ninety seconds later. The error logs:

azure.core.exceptions.HttpResponseError: (403) Public network access is disabled.
This request is not authorized to perform this operation.
RequestId: 7c4e3a92-901e-002e-1f4d-8a3a7b000000
Status: 403

The PE existed, the DNS zone existed, but at that moment the spoke VNet had not yet been linked to the zone (I had not landed the new DNS module yet; this was during the transition). So the application was still resolving the storage hostname to the public IP, but the public IP now refused the connection. The workload was down for the eight minutes it took me to add the missing zone link and let DNS propagate.

The new rule, written into the runbook, is the order: private endpoint first, DNS link verified second (nslookup from a debug pod, against the actual application hostname), traffic cutover third, public access disabled last. Doing it in any other order leaves a window where the workload can fail.
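
The rule has a Bicep consequence too: the worked example earlier hard-codes publicNetworkAccess: 'Disabled', which is right for steady state but wrong mid-migration. For migrations we moved to a variant where the flag is a parameter, so the flip to Disabled is its own PR after the DNS check has passed. A sketch, trimmed to the relevant properties; a real migration would also carry the account's existing IP rules until cutover:

// migration-aware variant (sketch): deploy with public access still on,
// verify DNS from a debug pod, flip the parameter in a follow-up PR
param storageName string
param location string = resourceGroup().location

@allowed([ 'Enabled', 'Disabled' ])
param publicNetworkAccess string = 'Enabled'

resource sa 'Microsoft.Storage/storageAccounts@2024-01-01' = {
  name: storageName
  location: location
  sku: { name: 'Standard_ZRS' }
  kind: 'StorageV2'
  properties: {
    publicNetworkAccess: publicNetworkAccess
    minimumTlsVersion: 'TLS1_2'
    networkAcls: {
      defaultAction: 'Deny' // stays Deny throughout; private endpoint traffic does not pass through the ACLs
      bypass: 'AzureServices'
    }
  }
}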

Microsoft's private endpoint documentation covers this in the migration guidance, but it is buried halfway through a page that most people skim. I rewrote our internal runbook to put the ordering rule on the first line, in bold, so nobody could miss it the way I had.

Troubleshooting checklist

These are the actual errors and the actual causes, in the order I now check them when somebody reports that a private endpoint is not working:

Symptom: nslookup returns a public IP. Cause: the calling VNet is not linked to the Private DNS Zone, or the VNet has a custom DNS server that does not forward to Azure-provided DNS, or there is no privateDnsZoneGroups child on the private endpoint so no A record exists. Run az network private-dns link vnet list -g rg-platform-dns -z privatelink.blob.core.windows.net and confirm the spoke VNet id appears.

Symptom: nslookup returns NXDOMAIN for the privatelink hostname. Cause: the zone exists but has no A record for that resource, usually because the PE was deployed without a privateDnsZoneGroups. Check az network private-endpoint dns-zone-group list -g <pe-rg> --endpoint-name <pe-name>. If it is empty, that is the bug.

Symptom: nslookup returns the right private IP but the application gets name resolution timed out. Cause: the VNet DNS server is unreachable, or the VNet is using a custom DNS server that has no route back to Azure-provided DNS. Check az network vnet show -g <rg> -n <vnet> --query dhcpOptions.dnsServers. If it points at an IP that is not Azure-provided and not your resolver, that is the bug.

Symptom: 403 Public network access is disabled from the storage SDK. Cause: the application is reaching the public IP because DNS is resolving publicly, but the storage account's public access is disabled. The PE exists but DNS is not wired. Cause and fix as in the first symptom.

Symptom: it worked yesterday, now it does not. Cause: somebody changed the VNet DNS server setting, or somebody deleted a virtual network link, or the zone was moved to a different RG. Check the activity log: az monitor activity-log list --resource-id <zone-id> --start-time $(date -u -d '24 hours ago' +%FT%TZ). The change will be there.

The pattern in all of them is the same: confirm what DNS is actually doing from inside the affected VNet, not from your laptop. Your laptop has different DNS. Run nslookup from a pod in the spoke, or from a VM in the spoke, with no custom resolver overrides. The answer that pod gets is the answer the application gets.

Where we ended up

Twenty-one private endpoints across the three subs. Storage (blob, dfs, file, queue, table) on a shared account per environment. Key Vault for secrets. SQL for the legacy tier of one workload. Cosmos for the new tier. ACR shared across both environments. Service Bus, Event Grid, App Configuration, Application Insights, Log Analytics, Container Apps environment for one greenfield service, and the managed online endpoint of one AI workload. Every one of them resolves to a private IP from every workload in every spoke. None of them require firewall rules. None of them require IP allowlists on the resource itself. The networkAcls on every account are defaultAction: Deny.

The firewall in the hub is doing exactly what it should: egress filtering for the small set of public destinations the workloads still legitimately need (package mirrors, an external API or two, Azure's own management endpoints). The application paths to PaaS are private and the firewall never sees them, which is the design intent and which had been quietly violated by misconfigured DNS for the first year of the platform's life.

Four months since the cutover. Zero tickets about "can we whitelist X in the firewall". Zero outages caused by a missing private endpoint. Three PRs caught by the gate before they could become tickets. The on-call rotation for the platform team used to spend roughly forty minutes a week on storage-connectivity questions; that line item is no longer on the report.

The most important thing I learned from the whole exercise is that Azure networking has two layers that look like one layer if you do not pay attention. The packet plane (VNets, subnets, peerings, firewall rules, NSGs) is the layer everyone draws on the whiteboard. The control plane for hostnames (Private DNS Zones, VNet links, the dhcpOptions of each VNet) is the layer nobody draws and is the one that breaks the design when it is wrong. The first time you wire them together is the first time you understand that private endpoints without DNS are just NICs with no name, and the firewall whitelist tickets are the symptom of a DNS configuration nobody owns. Own the DNS. The tickets stop.