Multi-region active-active on AKS and Cosmos: Front Door, conflict resolution, and the two policies we got wrong
An invoice was created in both UK South and West Europe 240 milliseconds apart, with different totals, and our Last-Writer-Wins policy silently picked the wrong one. The customer noticed three days later. Here is the full active-active build, two failed conflict resolution policies, and the third one that has run for fourteen months without a customer-visible incident.
A customer's invoice was created twice. Not as two database rows in one region, but once in each of two regions: UK South wrote it at 14:22:08.114 UTC and West Europe wrote it at 14:22:08.354 UTC, a gap of 240 milliseconds. Both writes used the same invoiceId (the client generated a deterministic GUID from the order number), both went through our ASP.NET API on AKS, both landed in Cosmos DB. The two documents differed by exactly one field: totals.grandTotal. One said 1284.50, the other said 1284.49. The difference was a currency rounding bug in an upstream pricing service that two pods, one in each region, had hit a few hundred milliseconds apart with two slightly different inputs. Our Cosmos container had Last-Writer-Wins enabled on _ts. Cosmos kept the West Europe document because its _ts was later. The customer noticed three days later when they reconciled their statement against the invoice PDF. The PDF, generated by a separate worker reading a cached snapshot from UK South before replication settled, said 1284.50. The invoice in our system said 1284.49. We were one cent wrong, on one invoice, on one day, and the support thread that followed took six engineers eleven hours to close.
The active-active build that produced that incident is the same one that, in Q3 2025, survived a full UK South control-plane outage with a 4 minute 12 second end-to-end failover. The same architecture caused the bug and recovered from a region loss. The difference is that we got conflict resolution wrong twice before getting it right, and a lot of what I want to write here is about those two wrong rounds. The architecture itself is in the documentation. The mistakes are not.
Why active-active, and the March 2024 hour we do not talk about
The system, an ASP.NET API on AKS fronted by Azure Front Door, ran in a single region (UK South) for two years. It was fine. We had an RPO and RTO target on paper that nobody had ever tested. In March 2024, UK South had an outage that took our API offline for fifty-three minutes during the business day. We did not lose any data; we lost availability. The CTO's question afterwards, which I am paraphrasing slightly, was: "Why is our 99.95% commitment a region's commitment?"
The answer, the only honest one, was that we had built around a single region because the cost model and the operational simplicity favoured it. The CTO accepted that for about thirty seconds and then said the system needed to stay up if a region goes. The work to get there took six months and produced the architecture below.
What I want to be clear about: active-active is not a free upgrade from active-passive. You inherit a different class of bug. You stop worrying about "did the secondary catch up" and you start worrying about "did the two writes collide". The first class is easier to reason about. The second is the one I want to talk about.
The shape of the build, end to end
Two AKS clusters, one in UK South, one in West Europe, both running the same ASP.NET 8 API. A Cosmos DB account with multi-region writes enabled, both regions configured as write regions, the read consistency set to Session. Service Bus with geo-DR pairing, primary in UK South. Storage Accounts with RA-GRS for the static assets, separate per-region containers for cached blobs. Front Door (Standard tier) routing all public traffic with priority and weight rules and health probes hitting a deep-health endpoint that I will spend a section on.
The clusters themselves are intentionally identical, deployed from the same Bicep with a region parameter:
param location string
param environmentName string
param aksVersion string = '1.30.4'
param nodeCount int = 6
param logAnalyticsWorkspaceId string
var clusterName = 'aks-${environmentName}-${location}'
resource aks 'Microsoft.ContainerService/managedClusters@2024-05-01' = {
name: clusterName
location: location
identity: { type: 'SystemAssigned' }
sku: { name: 'Base', tier: 'Standard' }
properties: {
kubernetesVersion: aksVersion
dnsPrefix: clusterName
enableRBAC: true
aadProfile: {
managed: true
enableAzureRBAC: true
tenantID: tenant().tenantId
}
networkProfile: {
  networkPlugin: 'azure'
  networkPluginMode: 'overlay'
  // networkPolicy 'cilium' requires the Cilium dataplane (Azure CNI powered by Cilium),
  // which in turn runs in overlay or pod-subnet mode
  networkDataplane: 'cilium'
  networkPolicy: 'cilium'
  loadBalancerSku: 'standard'
  outboundType: 'userAssignedNATGateway'
}
agentPoolProfiles: [
{
name: 'system'
count: 3
vmSize: 'Standard_D4ds_v5'
mode: 'System'
availabilityZones: ['1', '2', '3']
osDiskSizeGB: 128
osDiskType: 'Ephemeral'
enableAutoScaling: true
minCount: 3
maxCount: 5
}
{
name: 'workload'
count: nodeCount
vmSize: 'Standard_D8ds_v5'
mode: 'User'
availabilityZones: ['1', '2', '3']
osDiskSizeGB: 256
osDiskType: 'Ephemeral'
enableAutoScaling: true
minCount: 6
maxCount: 20
}
]
addonProfiles: {
omsagent: {
enabled: true
config: { logAnalyticsWorkspaceResourceID: logAnalyticsWorkspaceId }
}
}
}
}
output clusterFqdn string = aks.properties.fqdn
output principalId string = aks.identity.principalId
The deployment to a region is the same Bicep with location: 'uksouth' on one stage and location: 'westeurope' on another. The two clusters share nothing at the data plane; what makes the system active-active is what sits behind them.
The Cosmos account, with the bit that costs you sleep
Multi-region writes is one property on the Cosmos account, but it is the property that changes the whole operational shape of the system. The Bicep is short, the consequences are not. The official guidance on multi-region writes is correct about what the feature does; what it does not say loudly enough is that turning it on means you are now in a distributed-database posture and your conflict resolution policy is part of your domain model.
param accountName string
param primaryRegion string = 'uksouth'
param secondaryRegion string = 'westeurope'
resource cosmos 'Microsoft.DocumentDB/databaseAccounts@2024-05-15' = {
name: accountName
location: primaryRegion
kind: 'GlobalDocumentDB'
properties: {
databaseAccountOfferType: 'Standard'
enableMultipleWriteLocations: true
enableAutomaticFailover: false
consistencyPolicy: {
  // maxIntervalInSeconds and maxStalenessPrefix only apply to BoundedStaleness,
  // so Session needs neither
  defaultConsistencyLevel: 'Session'
}
locations: [
{
locationName: primaryRegion
failoverPriority: 0
isZoneRedundant: true
}
{
locationName: secondaryRegion
failoverPriority: 1
isZoneRedundant: true
}
]
backupPolicy: {
type: 'Continuous'
continuousModeProperties: { tier: 'Continuous30Days' }
}
capabilities: []
}
}
resource db 'Microsoft.DocumentDB/databaseAccounts/sqlDatabases@2024-05-15' = {
parent: cosmos
name: 'billing'
properties: { resource: { id: 'billing' } }
}
resource invoicesContainer 'Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers@2024-05-15' = {
parent: db
name: 'invoices'
properties: {
resource: {
id: 'invoices'
partitionKey: { paths: ['/tenantId'], kind: 'Hash' }
conflictResolutionPolicy: {
mode: 'Custom'
conflictResolutionProcedure: '/dbs/billing/colls/invoices/sprocs/resolveInvoiceConflict'
}
}
}
}
That conflictResolutionPolicy block, shown here in its round-two form, is the entire subject of this article. We ran three different resolution strategies behind it before the system stopped hurting customers.
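One thing the Bicep cannot show is the client side. Multi-region writes only pays off if each region's pods write to their local region, and that is a CosmosClient option, not an account property. The wiring, roughly (the endpoint and the REGION values here are illustrative; ApplicationPreferredRegions is the real SDK knob):
var region = Environment.GetEnvironmentVariable("REGION") ?? "uksouth";
var cosmos = new CosmosClient(
    accountEndpoint: "https://billing.documents.azure.com:443/",
    tokenCredential: new DefaultAzureCredential(),
    clientOptions: new CosmosClientOptions
    {
        // Local region first; the SDK fails over to the other region
        // if the preferred one is unavailable.
        ApplicationPreferredRegions = region == "uksouth"
            ? new List<string> { Regions.UKSouth, Regions.WestEurope }
            : new List<string> { Regions.WestEurope, Regions.UKSouth },
        ConnectionMode = ConnectionMode.Direct,
        ConsistencyLevel = ConsistencyLevel.Session
    });
Without the preferred-regions list, both regions' pods write to the account's first region and you have built active-passive with extra steps.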
The 20-minute soak, and why we never deploy two regions at once
Before any of this is interesting, the deployment pipeline. It is a per-region gated rollout. Never both regions at the same time. UK South goes first. We watch. Twenty minutes later, if nothing has broken, West Europe goes. The rollback path is region-isolated: if West Europe deploys badly we can pin Front Door to UK South while we sort it out, and vice versa.
trigger:
branches: { include: [main] }
variables:
imageTag: $(Build.BuildId)
acrName: 'acrbillingprod'
stages:
- stage: Build
jobs:
- job: BuildAndPush
pool: { vmImage: ubuntu-latest }
steps:
- task: AzureCLI@2
inputs:
azureSubscription: 'sc-billing-build'
scriptType: bash
scriptLocation: inlineScript
inlineScript: |
az acr build \
--registry $(acrName) \
--image billing-api:$(imageTag) \
--file ./src/Billing.Api/Dockerfile \
./src
- stage: DeployUKS
dependsOn: Build
displayName: 'Deploy to UK South'
jobs:
- deployment: ApplyUKS
environment: 'billing-prod-uksouth'
pool: { vmImage: ubuntu-latest }
strategy:
runOnce:
deploy:
steps:
- task: KubernetesManifest@1
displayName: 'helm upgrade --install (uksouth)'
inputs:
action: 'deploy'
kubernetesServiceConnection: 'k8s-billing-uksouth'
namespace: 'billing'
manifests: '$(System.DefaultWorkingDirectory)/charts/billing/rendered-uksouth.yaml'
- task: AzureCLI@2
displayName: 'Smoke test deep-health (uksouth)'
inputs:
azureSubscription: 'sc-billing-uksouth'
scriptType: bash
scriptLocation: inlineScript
inlineScript: |
for i in {1..30}; do
code=$(curl -s -o /dev/null -w "%{http_code}" https://billing-uksouth.internal.contoso.com/health/deep)
if [ "$code" = "200" ]; then exit 0; fi
sleep 10
done
echo "deep-health never returned 200" && exit 1
- stage: SoakUKS
dependsOn: DeployUKS
displayName: 'Soak UK South for 20 minutes'
jobs:
- job: Wait
pool: server
steps:
- task: Delay@1
inputs: { delayForMinutes: '20' }
- stage: DeployWEU
dependsOn: SoakUKS
displayName: 'Deploy to West Europe'
jobs:
- deployment: ApplyWEU
environment: 'billing-prod-westeurope'
pool: { vmImage: ubuntu-latest }
strategy:
runOnce:
deploy:
steps:
- task: KubernetesManifest@1
displayName: 'helm upgrade --install (westeurope)'
inputs:
action: 'deploy'
kubernetesServiceConnection: 'k8s-billing-westeurope'
namespace: 'billing'
manifests: '$(System.DefaultWorkingDirectory)/charts/billing/rendered-westeurope.yaml'
The soak is not theatre. It is the window in which a bad deploy has been allowed to break exactly one region. Twice in 2025 we caught a regression in UK South during the soak and never deployed it to West Europe; the traffic was already on Front Door's secondary route and the customer impact was zero. The cost of the twenty minutes is two deploys in the dashboard instead of one. The cost of skipping it would be both regions broken simultaneously, and Front Door cannot route around that.
Front Door, and the deep-health endpoint
Front Door's job in this build is simple to describe and surprisingly slippery to implement. Two origins, priority 1 and priority 2, equal weight inside each priority. Traffic prefers priority 1 (UK South). If priority 1's health probe fails, traffic moves to priority 2 (West Europe). Origins are documented on Microsoft Learn and the configuration itself is uneventful.
resource profile 'Microsoft.Cdn/profiles@2024-02-01' = {
name: 'afd-billing-prod'
location: 'global'
sku: { name: 'Standard_AzureFrontDoor' }
}
resource originGroup 'Microsoft.Cdn/profiles/originGroups@2024-02-01' = {
parent: profile
name: 'og-billing-api'
properties: {
loadBalancingSettings: {
sampleSize: 4
successfulSamplesRequired: 3
additionalLatencyInMilliseconds: 50
}
healthProbeSettings: {
probePath: '/health/deep'
probeRequestType: 'GET'
probeProtocol: 'Https'
probeIntervalInSeconds: 30
}
sessionAffinityState: 'Disabled'
}
}
resource originUKS 'Microsoft.Cdn/profiles/originGroups/origins@2024-02-01' = {
parent: originGroup
name: 'origin-uksouth'
properties: {
hostName: 'billing-uksouth.internal.contoso.com'
httpsPort: 443
originHostHeader: 'billing-uksouth.internal.contoso.com'
priority: 1
weight: 1000
enabledState: 'Enabled'
}
}
resource originWEU 'Microsoft.Cdn/profiles/originGroups/origins@2024-02-01' = {
parent: originGroup
name: 'origin-westeurope'
properties: {
hostName: 'billing-westeurope.internal.contoso.com'
httpsPort: 443
originHostHeader: 'billing-westeurope.internal.contoso.com'
priority: 2
weight: 1000
enabledState: 'Enabled'
}
}
The slippery part was the probe path. The first version of /healthz returned 200 if the pod was up. That was useless. In June 2024 we had a thirty-minute window where UK South's Cosmos write region was effectively down (a control-plane regional issue) but the pods kept returning 200 because the pods were fine. The API was 500ing on every write, Front Door thought UK South was healthy, traffic stayed there, and we ate twenty-eight minutes of error spike before someone manually drained the origin.
The fix was a deep-health endpoint that exercises the dependencies on the path of a real write request:
[ApiController]
[Route("health")]
public sealed class DeepHealthController(
CosmosClient cosmos,
ServiceBusClient sb,
IHttpClientFactory http,
ILogger<DeepHealthController> log) : ControllerBase
{
[HttpGet("deep")]
public async Task<IActionResult> Deep(CancellationToken ct)
{
var sw = Stopwatch.StartNew();
var results = new Dictionary<string, object>();
try
{
var container = cosmos.GetContainer("billing", "healthprobe");
var probeDoc = new
{
    id = Guid.NewGuid().ToString(),
    tenantId = "_probe",
    stamp = DateTimeOffset.UtcNow,
    source = Environment.GetEnvironmentVariable("REGION") ?? "unknown",
    ttl = 60 // per-item TTL so probe docs expire; needs defaultTtl enabled on the container
};
var write = await container.CreateItemAsync(
    probeDoc, new PartitionKey("_probe"), cancellationToken: ct);
results["cosmos_write_ms"] = write.Diagnostics.GetClientElapsedTime().TotalMilliseconds;
results["cosmos_regions"] = string.Join(",",
    write.Diagnostics.GetContactedRegions().Select(r => r.regionName));
await using var sender = sb.CreateSender("health-probe");
var msg = new ServiceBusMessage(BinaryData.FromString("ping")) { TimeToLive = TimeSpan.FromMinutes(1) };
var sbStart = Stopwatch.GetTimestamp();
await sender.SendMessageAsync(msg, ct);
results["sb_send_ms"] = Stopwatch.GetElapsedTime(sbStart).TotalMilliseconds;
var client = http.CreateClient("pricing");
var depStart = Stopwatch.GetTimestamp();
using var resp = await client.GetAsync("/health", ct);
resp.EnsureSuccessStatusCode();
results["pricing_dep_ms"] = Stopwatch.GetElapsedTime(depStart).TotalMilliseconds;
results["total_ms"] = sw.ElapsedMilliseconds;
return Ok(results);
}
catch (Exception ex)
{
log.LogError(ex, "deep-health failed after {Ms} ms", sw.ElapsedMilliseconds);
results["error"] = ex.Message;
results["total_ms"] = sw.ElapsedMilliseconds;
return StatusCode(503, results);
}
}
}
That endpoint writes a real document to Cosmos, sends a real Service Bus message (both carry a one-minute TTL so probe artefacts do not pile up), and calls one downstream dependency. If any of those fail, Front Door's probe gets a 503 within the timeout and the origin is marked unhealthy on the next sample. The probe interval is 30 seconds with successfulSamplesRequired: 3 out of sampleSize: 4, so a full origin drop takes about two minutes to register, which is the bound on our failover latency from the Front Door side.
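One supporting detail: the named "pricing" client needs a timeout well inside the probe window, or a dead dependency turns into a hung probe instead of a fast 503. The registration, roughly (the PRICING_BASE_URL variable and the in-cluster default are illustrative, not from our pipeline):
builder.Services.AddHttpClient("pricing", client =>
{
    client.BaseAddress = new Uri(
        Environment.GetEnvironmentVariable("PRICING_BASE_URL")
        ?? "http://pricing.billing.svc.cluster.local");
    // Fail fast: every dependency check has to fit inside Front Door's
    // probe timeout, so no single call gets more than five seconds.
    client.Timeout = TimeSpan.FromSeconds(5);
});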
Round 1: LWW on _ts, and the 240ms invoice
The first conflict resolution policy was the default suggestion: Last-Writer-Wins on _ts. It is one line of Bicep and it works for nine out of ten document types. It does not work for invoices, and the invoice case is the one we shipped to production.
The original CreateInvoiceAsync was an idempotent-create pattern that assumed a single region:
public async Task<Invoice> CreateInvoiceAsync(InvoiceRequest req, CancellationToken ct)
{
var container = _cosmos.GetContainer("billing", "invoices");
var invoice = new Invoice
{
Id = req.InvoiceId,
TenantId = req.TenantId,
OrderId = req.OrderId,
Totals = req.Totals,
CreatedAt = DateTimeOffset.UtcNow,
SourceRegion = Environment.GetEnvironmentVariable("REGION") ?? "unknown"
};
try
{
var resp = await container.CreateItemAsync(
invoice, new PartitionKey(req.TenantId), cancellationToken: ct);
return resp.Resource;
}
catch (CosmosException ex) when (ex.StatusCode == HttpStatusCode.Conflict)
{
var read = await container.ReadItemAsync<Invoice>(
req.InvoiceId, new PartitionKey(req.TenantId), cancellationToken: ct);
return read.Resource;
}
}
In a single region this is correct. In two write regions, both pods called CreateItemAsync at roughly the same instant, and both got a 201 in their own region before replication had a chance to surface the other write. From the application's point of view, both writes succeeded. Inside Cosmos, the second write to land caused a conflict, and the LWW policy on _ts resolved it to whichever document had the larger server-side timestamp.
The problem with _ts is that it has nothing to do with business correctness. In our case, West Europe's _ts was 240ms later. The correct invoice was UK South's (the pricing service had returned the right total to that pod). LWW silently chose the wrong one.
The error you see in the logs is not dramatic. The application never observed the conflict; both regions returned a 201 to their respective callers. The conflict surfaced four hours later when the audit job ran and started raising warnings:
Microsoft.Azure.Cosmos.CosmosException: Response status code does not indicate success: Conflict (409); Substatus: 1002
ActivityId: 4a1c-...
RequestCharge: 7.46
RetryAfter: 00:00:00
DiagnosticsContext: { "Summary": { "GatewayCallDurationInMs": "12.4" }, ... }
That trace is from the audit job replaying the create, taking the expected 409 against the resolved document, and then noticing that the Totals did not match the sibling record in the upstream pricing log. By the time we saw it, the wrong document had been LWW-resolved and the right document was effectively gone: LWW resolves conflicts transparently, so the losing write never even reaches the conflict feed, and the API only ever reads from the resolved view.
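A note on the deterministic GUID, since it is what turned a double submission into a same-id, cross-region conflict rather than two separate invoices. The client's code is not mine to show, but the trick is the standard name-based one. A sketch, loosely in the spirit of RFC 4122 name-based UUIDs (the "invoice:" namespace prefix is my own for illustration):
using System;
using System.Security.Cryptography;
using System.Text;

static class DeterministicGuid
{
    // Same order number in, same GUID out, on any pod in any region.
    public static Guid FromOrderNumber(string orderNumber)
    {
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes("invoice:" + orderNumber));
        byte[] bytes = new byte[16];
        Array.Copy(hash, bytes, 16);
        // Stamp version and variant bits so the result is a well-formed GUID.
        bytes[7] = (byte)((bytes[7] & 0x0F) | 0x40);
        bytes[8] = (byte)((bytes[8] & 0x3F) | 0x80);
        return new Guid(bytes);
    }
}
Determinism is the property the idempotent-create pattern depends on; it is also exactly the property that handed LWW two candidates for the same id.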
Round 2: Custom conflict resolution in a stored procedure, and how it broke reservations
The first attempt at a fix was to move to a Custom conflict resolution policy with a stored procedure. The intuition was: write business logic into the resolver, prefer the document whose sourceRegion matched the document's tenantHomeRegion, fall back to idempotencyKey ordering. Cosmos exposes a stored procedure hook for exactly this.
function resolveInvoiceConflict(incomingItem, existingItem, isTombstone, conflictingItems) {
    var collection = getContext().getCollection();

    // A delete always wins; nothing to commit.
    if (isTombstone || !incomingItem) return;

    // Pick a winner: prefer the document written in the tenant's home region,
    // fall back to idempotencyKey ordering.
    var winner = existingItem || incomingItem;
    if (incomingItem.tenantHomeRegion === incomingItem.sourceRegion) {
        winner = incomingItem;
    } else if (existingItem && existingItem.tenantHomeRegion === existingItem.sourceRegion) {
        winner = existingItem;
    } else if (existingItem && incomingItem.idempotencyKey && existingItem.idempotencyKey &&
               incomingItem.idempotencyKey < existingItem.idempotencyKey) {
        winner = incomingItem;
    }
    var siblings = conflictingItems ? conflictingItems.slice() : [];
    for (var i = 0; i < siblings.length; i++) {
        if (siblings[i].tenantHomeRegion === siblings[i].sourceRegion && siblings[i]._ts > winner._ts) {
            winner = siblings[i];
        }
    }

    // The resolver has to commit its own decision; setting a response body
    // persists nothing. Delete the losing siblings, then write the winner.
    tryDelete(siblings.filter(function (d) { return d._rid !== winner._rid; }));

    function tryDelete(docs) {
        if (docs.length > 0) {
            collection.deleteDocument(docs[0]._self, {}, function (err) {
                if (err) throw err;
                docs.shift();
                tryDelete(docs);
            });
        } else if (existingItem && winner._rid !== existingItem._rid) {
            collection.replaceDocument(existingItem._self, winner, function (err) {
                if (err) throw err;
            });
        } else if (!existingItem) {
            collection.createDocument(collection.getSelfLink(), winner, function (err) {
                if (err) throw err;
            });
        }
    }
}
This worked for invoices. We tested it with a deliberately raced pair of writes; the resolver picked the one whose sourceRegion matched the tenant's home region, and the invoice case stopped occurring. We deployed it on a Tuesday in October 2024.
On Friday we got a different ticket. The reservation flow, a separate domain that wrote to a different container but used the same conflict resolution policy by accident (I copy-pasted the Bicep block), started producing reservations that referenced a deleted slot. The pattern: a reservation document had a slotId field, the slot was held in a separate slots container, and the stored procedure resolving a reservation conflict could not see the slot. The stored procedure runs inside Cosmos with no cross-container reads. Whatever it decides, it decides on the basis of the two or three documents in front of it. The resolver had no way to know that the slot the "winning" reservation referenced had been deleted by the other region's transaction.
The error from the worker that consumed the reservation queue was:
Microsoft.Azure.Cosmos.CosmosException: Response status code does not indicate success: NotFound (404); Substatus: 1003
ActivityId: 8b7e-...
RequestCharge: 2.91
Path: /dbs/billing/colls/slots/docs/slot-2024-10-25T14:00:00Z-room-3
The reservation pointed at a slot that no longer existed. The stored procedure had picked the wrong sibling because it could not see what the sibling referenced. We rolled the reservation container back to LWW on _ts within ninety minutes and accepted the much rarer "two reservations for the same slot" race in exchange for the much commoner "no dangling references" property.
Round 3: read the conflict feed from the app, resolve in domain terms
The shape that ended up working, and that we still run, is to keep the Cosmos conflict resolution policy permissive (plain LWW on _ts, the same setting as round one, now chosen deliberately) and to push real conflict resolution into a background worker that reads a conflict log the application maintains and applies business rules with access to the rest of the data model.
The container policy becomes:
conflictResolutionPolicy: {
mode: 'LastWriterWins'
conflictResolutionPath: '/_ts'
}
So Cosmos always makes a deterministic choice and never blocks the write. The corrected CreateInvoiceAsync writes a domain envelope and emits a Service Bus message for the reconciler:
public async Task<Invoice> CreateInvoiceAsync(InvoiceRequest req, CancellationToken ct)
{
var container = _cosmos.GetContainer("billing", "invoices");
var region = Environment.GetEnvironmentVariable("REGION") ?? "unknown";
var invoice = new Invoice
{
Id = req.InvoiceId,
TenantId = req.TenantId,
OrderId = req.OrderId,
Totals = req.Totals,
IdempotencyKey = req.IdempotencyKey,
SourceRegion = region,
TenantHomeRegion = req.TenantHomeRegion,
WriteCausationId = req.CausationId,
CreatedAt = DateTimeOffset.UtcNow
};
try
{
var resp = await container.CreateItemAsync(
invoice, new PartitionKey(req.TenantId),
new ItemRequestOptions { EnableContentResponseOnWrite = true },
cancellationToken: ct);
await _outbox.PublishAsync(new InvoiceWrittenEvent
{
InvoiceId = invoice.Id,
TenantId = invoice.TenantId,
SourceRegion = region,
WrittenAt = invoice.CreatedAt
}, ct);
return resp.Resource;
}
catch (CosmosException ex) when (ex.StatusCode == HttpStatusCode.Conflict)
{
var read = await container.ReadItemAsync<Invoice>(
req.InvoiceId, new PartitionKey(req.TenantId), cancellationToken: ct);
await _outbox.PublishAsync(new InvoiceConflictObservedEvent
{
InvoiceId = invoice.Id,
TenantId = invoice.TenantId,
ObservingRegion = region,
ObservedAt = DateTimeOffset.UtcNow
}, ct);
return read.Resource;
}
}
The reconciler is a separate hosted service that drains the conflict log (a dedicated invoices.conflicts container populated from the InvoiceConflictObservedEvent messages above), fetches both the resolved document and the conflicting sibling, looks at the related order in the orders container and the pricing log entries that produced each total, applies a business rule (prefer the document whose total matches the upstream pricing log, fall back to home-region preference, fall back to the lower total to be customer-favourable), and either confirms the resolved document or writes a corrected one with a correctionOf field pointing at the previous version.
public class ConflictReconciler(
CosmosClient cosmos,
IPricingLog pricing,
ILogger<ConflictReconciler> log) : BackgroundService
{
protected override async Task ExecuteAsync(CancellationToken ct)
{
var conflicts = cosmos.GetDatabase("billing").GetContainer("invoices.conflicts");
var invoices = cosmos.GetDatabase("billing").GetContainer("invoices");
while (!ct.IsCancellationRequested)
{
    // Recreate the iterator on every pass: a FeedIterator is exhausted once
    // HasMoreResults goes false, so hoisting it out of the loop would make
    // every pass after the first a no-op.
    var iterator = conflicts.GetItemQueryIterator<ConflictEntry>(
        "SELECT * FROM c WHERE c.resolved = false");
    while (iterator.HasMoreResults)
{
var batch = await iterator.ReadNextAsync(ct);
foreach (var entry in batch)
{
try
{
var resolved = await invoices.ReadItemAsync<Invoice>(
entry.DocumentId, new PartitionKey(entry.TenantId), cancellationToken: ct);
var sibling = entry.Sibling;
var pricingTruth = await pricing.GetTotalAsync(entry.OrderId, ct);
var correct = resolved.Resource.Totals.GrandTotal == pricingTruth
? resolved.Resource
: sibling.Totals.GrandTotal == pricingTruth
? sibling
: (resolved.Resource.Totals.GrandTotal < sibling.Totals.GrandTotal
? resolved.Resource : sibling);
if (correct.Id != resolved.Resource.Id || correct.Totals.GrandTotal != resolved.Resource.Totals.GrandTotal)
{
var correction = correct with
{
CorrectionOf = resolved.Resource.Id,
CorrectedAt = DateTimeOffset.UtcNow
};
await invoices.UpsertItemAsync(
correction, new PartitionKey(correction.TenantId), cancellationToken: ct);
log.LogWarning("Reconciled invoice {Id}: was {Was}, now {Now}",
correct.Id, resolved.Resource.Totals.GrandTotal, correction.Totals.GrandTotal);
}
entry.Resolved = true;
await conflicts.UpsertItemAsync(entry, new PartitionKey(entry.TenantId), cancellationToken: ct);
}
catch (Exception ex)
{
log.LogError(ex, "Reconciler failed for conflict {EntryId}", entry.Id);
}
}
}
await Task.Delay(TimeSpan.FromSeconds(15), ct);
}
}
}
That worker runs in both regions but only one instance writes corrections at any time, guarded by a Cosmos lease container the same way the change feed processor uses leases (a sketch of the acquire step follows below). The corrections themselves are visible to the API and to the PDF generator; the PDF generator's cache is keyed on the document's _etag so a correction invalidates the cached PDF and a fresh one is generated on next read.
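The lease itself is an ETag-and-TTL dance on a single document. A sketch of the acquire step (LeaseDoc, the reconciler-lease id, and the /pk partition key are names of mine for this sketch, not from the production code):
public async Task<bool> TryAcquireLeaseAsync(
    Container leases, string owner, CancellationToken ct)
{
    var id = "reconciler-lease";
    var pk = new PartitionKey(id);
    try
    {
        // If a live lease exists, only its current owner may renew it.
        var current = await leases.ReadItemAsync<LeaseDoc>(id, pk, cancellationToken: ct);
        if (current.Resource.Owner != owner) return false;

        await leases.ReplaceItemAsync(
            current.Resource with { RenewedAt = DateTimeOffset.UtcNow }, id, pk,
            new ItemRequestOptions { IfMatchEtag = current.ETag }, ct);
        return true;
    }
    catch (CosmosException ex) when (ex.StatusCode == HttpStatusCode.NotFound)
    {
        // No lease, or it TTL-expired because the holder died: race to create one.
        var lease = new LeaseDoc(id, id, owner, DateTimeOffset.UtcNow, ttl: 60);
        try
        {
            await leases.CreateItemAsync(lease, pk, cancellationToken: ct);
            return true;
        }
        catch (CosmosException e) when (e.StatusCode == HttpStatusCode.Conflict)
        {
            return false; // the other region's instance won the race
        }
    }
    catch (CosmosException ex) when (ex.StatusCode == HttpStatusCode.PreconditionFailed)
    {
        return false; // lease changed hands between our read and our replace
    }
}

// id doubles as the partition key; ttl (seconds) lets Cosmos expire a dead holder's lease.
public sealed record LeaseDoc(string id, string pk, string Owner, DateTimeOffset RenewedAt, int ttl);
The worker calls this on a timer shorter than the TTL; losing the lease just means that pass does not write corrections.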
This is the shape we have run on for fourteen months. We have observed sixty-two conflicts in production over that period; the reconciler has corrected nineteen of them (the rest were noise, the resolved version was already the correct one). Zero customer-visible incidents in the same window.
The split-brain query in Application Insights
Spotting split-brain after the fact is what saved us from the next class of incident. The query is run every five minutes via an Azure Monitor scheduled alert and pages on any non-zero result:
let window = 10m;
dependencies
| where timestamp > ago(window)
| where type == "Azure DocumentDB"
| where target has "billing.documents.azure.com"
| where name == "Create invoice"
| extend region = tostring(customDimensions.region)
| extend invoiceId = tostring(customDimensions.invoiceId)
| summarize regions = make_set(region), writes = count() by invoiceId, bin(timestamp, 1s)
| where array_length(regions) >= 2
| project timestamp, invoiceId, regions, writes
| order by timestamp desc
If that query returns anything, we have just observed a real cross-region race on a single invoiceId within a one-second bucket. The query has fired forty-one times in the last twelve months. Each firing is matched against the reconciler's log to confirm the conflict was caught and resolved.
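For the query to have anything to group on, the dependency telemetry needs region and invoiceId in customDimensions, which Application Insights does not emit on its own. A telemetry initializer along these lines does it (a sketch; the invoiceId Activity tag is a convention the write path has to set):
using System;
using System.Diagnostics;
using Microsoft.ApplicationInsights.Channel;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.ApplicationInsights.Extensibility;

public sealed class RegionTelemetryInitializer : ITelemetryInitializer
{
    private static readonly string Region =
        Environment.GetEnvironmentVariable("REGION") ?? "unknown";

    public void Initialize(ITelemetry telemetry)
    {
        if (telemetry is not ISupportProperties props) return;

        // Every item gets the pod's region; dependencies also get the invoiceId
        // tag if the write path stamped one on the current Activity.
        props.Properties["region"] = Region;

        if (telemetry is DependencyTelemetry &&
            Activity.Current?.GetTagItem("invoiceId") is string invoiceId)
        {
            props.Properties["invoiceId"] = invoiceId;
        }
    }
}

// In Program.cs:
// builder.Services.AddSingleton<ITelemetryInitializer, RegionTelemetryInitializer>();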
Service Bus sessions and the bit we gave up on
Service Bus geo-DR pairing handles namespace failover, and only the namespace: the alias and metadata move, in-flight messages do not, and neither does session state. A message that was delivered to a session-aware receiver in UK South, with the session lock held by a pod that is now offline, is not magically resumed in West Europe. The session lock is region-local. On failover, the lock is lost.
The error in the worker logs after a forced failover test:
Azure.Messaging.ServiceBus.ServiceBusException: The session lock was lost. Reason: SessionLockLost.
ErrorSource: Receive, Status: 0, ServiceBusErrorCode: SessionLockLost
at Azure.Messaging.ServiceBus.ServiceBusSessionReceiver.RenewSessionLockAsync(...)
We accepted this. The mitigation has two pieces: receivers are idempotent (every message handler is keyed on a messageId and checked against a Cosmos record before any side effect), and per-session high-water marks are written to Cosmos so a resumed session on the other region can fast-forward past already-processed messages; both are sketched below. The cost is roughly a 4 RU Cosmos write per message. The benefit is that a region failover does not corrupt session state; it just slows the affected sessions by however long the high-water mark check takes to catch up, which in practice is under thirty seconds.
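Both pieces are small. A sketch of the claim-then-act guard and the high-water mark write (the container names and document shapes are illustrative, not the production schema):
// Before any side effect, claim the messageId. A 409 means another pod,
// possibly in the other region, already processed this message.
public async Task<bool> TryClaimAsync(
    Container processed, ServiceBusReceivedMessage msg, CancellationToken ct)
{
    var claim = new
    {
        id = msg.MessageId,
        sessionId = msg.SessionId, // also the partition key in this sketch
        sequence = msg.SequenceNumber,
        claimedAt = DateTimeOffset.UtcNow,
        ttl = 7 * 24 * 3600 // keep claims for a week, then let Cosmos expire them
    };
    try
    {
        await processed.CreateItemAsync(claim, new PartitionKey(msg.SessionId), cancellationToken: ct);
        return true;  // first claim wins; safe to run side effects
    }
    catch (CosmosException ex) when (ex.StatusCode == HttpStatusCode.Conflict)
    {
        return false; // duplicate delivery, or a resumed session replaying
    }
}

// After the handler commits, advance the per-session high-water mark so a
// receiver resumed in the other region can fast-forward past this message.
public Task AdvanceHighWaterMarkAsync(
    Container marks, ServiceBusReceivedMessage msg, CancellationToken ct) =>
    marks.UpsertItemAsync(new
    {
        id = $"hwm-{msg.SessionId}",
        sessionId = msg.SessionId,
        sequence = msg.SequenceNumber,
        updatedAt = DateTimeOffset.UtcNow
    }, new PartitionKey(msg.SessionId), cancellationToken: ct);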
The Q3 2025 failover, four minutes and twelve seconds
In Q3 2025, UK South had a control-plane outage that lasted roughly twenty-six minutes. Customer traffic on our Front Door endpoint saw a 503 spike for ninety-eight seconds while the probe failed three samples and Front Door drained the UK South origin. Then traffic shifted to West Europe. Background workers in UK South stopped processing; the equivalent workers in West Europe continued. Service Bus failed over within its geo-DR pairing about ninety seconds later. The total time from the first failed probe to a clean, single-region steady state was 4 minutes 12 seconds. We did not lose any data. The reconciler caught two conflicts during the window from in-flight writes that had been mid-replication. Customer impact: a brief 503 spike on the dashboard and three support tickets, all closed within the day.
The system did the thing the CTO asked for in March 2024. It stayed up when a region went.
A reflective coda, slightly longer than usual
What I think I now understand, after two failed conflict policies and one that worked, is that the conflict resolution choice is not a database setting. It is a domain choice. LWW is a database setting; it is the right answer for a session document or a cache entry or a pageview log, anything where the question "which value is correct" has a clean temporal answer. For an invoice, or a reservation, or any document where correctness is a function of business state the database cannot see, LWW is a coin toss and the database has no idea it is flipping the coin.
The stored procedure attempt was worse than LWW for a particular reason that I missed at the time: it gave us the false confidence of "we wrote logic for this case" while still being blind to the rest of the model. A stored procedure that runs in Cosmos can see what is in front of it. It cannot see what is in the orders container, or what the pricing service said an hour ago, or what the downstream PDF generator already cached. Anything that needs that visibility belongs in the application, not in a sproc, and a conflict log the application can read, whether Cosmos's conflict feed or, as in our case, a container fed by conflict-observed events, is the right primitive for putting it there.
The pipeline shape (twenty-minute soak, region-isolated rollback, deep-health probes) caught two regressions before they reached the second region. The deep-health endpoint cost about eighteen lines of C# and one new Cosmos container for probe documents. The audit conversation with the security team about "can you prove the system survives a region loss" now closes in one screenshot, because the Q3 failover is a real event with real timestamps and a real customer-impact report.
If I were starting again on a new active-active system I would do one thing differently. I would build the conflict reconciler first, before turning on multi-region writes, and run it in a no-op mode against synthetic conflicts in a staging environment for at least four weeks. The reconciler is the hardest part of the system to write correctly because it has to encode business rules that are not always stable, and it is the part of the system that you most want to have running confidently before you need it. We got there in the end. We did not get there in a straight line.
The 240ms invoice is still on the wall of my home office. I printed the Cosmos diagnostics for both writes side by side and stuck them up there in November 2024. It is a useful reminder that distributed systems do not get more forgiving with experience; they just get more interesting failure modes, and the work is to keep finding the ones that hurt customers and to keep moving the hurting parts of the system into places where you can see them coming.