A semantic cache in Azure Redis Enterprise: 38% of OpenAI calls served from cache, and the near-miss that taught us to key by tenant
The March invoice landed on a Tuesday and the CFO forwarded it to me with three words in the subject line: "explain this please." Azure OpenAI: £18,300 for the month. Up from £4,200 in November. Up from £11,600 in February. The product was the same product, the customer count was up about 20% in that window, and the per-token price had not moved. The bill had, though, and the trajectory said April would be north of £22K if nothing changed.
The product is a customer support copilot. A logged-in customer asks a question in natural language, we retrieve from their tenant-scoped knowledge base, we call GPT-4o with the retrieved context, we stream the answer back. The pattern is unremarkable. The volume is what shifted: we went from roughly 18,000 chat turns a day in November to about 71,000 by early March. The unit economics held, but the absolute number stopped looking like a feature cost and started looking like a line item the CFO knew by name.
By May the same product was serving more traffic, with the same SLAs, for £11,400. The thing that closed the gap was a semantic cache in Azure Cache for Redis Enterprise, sitting in front of Azure OpenAI, taking 38% of the calls before they ever reached the model. This is the build, the tuning that took us from a hit rate that looked great but was unsafe to a hit rate that is lower on paper but actually correct, and the near-miss where the cache almost returned one customer's outstanding balance to another customer who asked a similar-shaped question. Caught in QA, not in prod. The fix changed the key schema.
The premise, and the back-of-envelope that justified the work
Customer support questions cluster. We pulled a week of prompts from Application Insights, ran them through a quick clustering pass, and looked at the shape. The top 50 question intents accounted for about 62% of all traffic. "What is my outstanding balance" and its rewordings ("how much do I owe", "show my balance", "what's left to pay this month") were collectively 8% on their own. Same answer, give or take phrasing.
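A sketch of that clustering pass, assuming scikit-learn and two hypothetical helpers (load_week_of_prompts and embed_batch stand in for the Application Insights export and the embedding call; neither is our production code):

from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

prompts = load_week_of_prompts()          # hypothetical: one string per chat turn
vectors = np.array(embed_batch(prompts))  # hypothetical: (n_prompts, 1536) via text-embedding-3-small

km = KMeans(n_clusters=200, n_init=10, random_state=0).fit(vectors)
sizes = Counter(km.labels_)

# Share of traffic covered by the top 50 intent clusters
top50 = sum(count for _, count in sizes.most_common(50))
print(f"top-50 clusters cover {100.0 * top50 / len(prompts):.0f}% of traffic")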
The implication: if we could detect that two prompts are semantically equivalent and serve the cached completion when we have one, we skip the LLM call entirely. The cache cost is a fraction of the LLM cost (embedding generation is roughly £0.00002 per call against £0.012 per chat turn at our token shape), and the latency is roughly a tenth.
The back-of-envelope said: if hit rate lands at 35%, savings are about £6,400 a month at current volume, paying back the engineering cost in about three weeks. If hit rate lands at 50%, savings are about £9K a month. Anything above 25% was worth the work. That was the threshold I went to the CFO with, and that was the budget I asked for.
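The same arithmetic as code, for anyone re-running it against their own bill (this ignores the cache's own running costs, which the cost section at the end accounts for):

march_bill = 18_300  # £/month, Azure OpenAI, pre-cache

for hit_rate in (0.25, 0.35, 0.50):
    savings = march_bill * hit_rate  # cached calls skip the LLM spend almost entirely
    print(f"hit rate {hit_rate:.0%}: roughly £{savings:,.0f}/month saved")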
Why Azure Cache for Redis Enterprise and not the lower SKUs
The cache key for a semantic cache is a vector. Not a string, not a hash, a 1,536-dimensional float vector from text-embedding-3-small. To find a cached entry, you do not look up by key, you do a nearest-neighbour search across the vector space and ask "is anything in here within cosine distance X of this incoming prompt's vector." Plain Redis cannot do that. The RediSearch module can, and the RediSearch module ships on the Enterprise tier of Azure Cache for Redis, not on Basic, Standard, or Premium.
Two things on the Enterprise tier are non-negotiable for this workload. First, RediSearch with vector index support. Second, active-active or zone-redundant replication so the cache survives a node failure without a cold start. The cache is not the primary system of record (the LLM is, sort of), but if the cache disappears at 9am on a Monday the OpenAI bill goes back to £600/day until it warms up again. That window matters.
Sizing: we landed on an E10 instance, two shards, zone-redundant. About 12.6GB of usable memory. At our average vector entry size (1,536 floats at 4 bytes each, plus the completion text and metadata, around 8KB total per entry) that gives us room for about 1.5M cached completions before eviction kicks in. We are nowhere close. Steady-state cache size after eight months is around 320K entries, dominated by the long tail of unique questions that get cached once and rarely hit again.
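The capacity figure falls out of a two-line calculation (the 8KB average is our observed number, not a spec):

vector_bytes = 1536 * 4  # one FLOAT32 embedding: 6,144 bytes
entry_bytes = 8 * 1024   # vector + completion text + metadata, observed average
usable_bytes = 12.6e9    # E10, two shards, zone-redundant

capacity = usable_bytes / entry_bytes
print(f"~{capacity / 1e6:.1f}M entries before eviction")  # ~1.5M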
The Bicep that provisions it:
@description('Name of the Redis Enterprise cluster')
param clusterName string = 'redis-copilot-cache-prod'

@description('Region (must be one that supports Enterprise tier with active geo-replication)')
param location string = 'uksouth'

@description('Subnet for private endpoint')
param privateEndpointSubnetId string

@description('Customer-managed key URI for encryption at rest')
param cmkKeyUri string

@description('Managed identity that has access to the CMK key vault')
param cmkIdentityId string

resource cluster 'Microsoft.Cache/redisEnterprise@2024-09-01-preview' = {
  name: clusterName
  location: location
  sku: {
    name: 'Enterprise_E10'
    capacity: 2
  }
  identity: {
    type: 'UserAssigned'
    userAssignedIdentities: {
      '${cmkIdentityId}': {}
    }
  }
  properties: {
    minimumTlsVersion: '1.2'
    encryption: {
      customerManagedKeyEncryption: {
        keyEncryptionKeyUrl: cmkKeyUri
        keyEncryptionKeyIdentity: {
          identityType: 'userAssignedIdentity'
          userAssignedIdentityResourceId: cmkIdentityId
        }
      }
    }
  }
}

resource database 'Microsoft.Cache/redisEnterprise/databases@2024-09-01-preview' = {
  parent: cluster
  name: 'default'
  properties: {
    clientProtocol: 'Encrypted'
    port: 10000
    clusteringPolicy: 'EnterpriseCluster'
    evictionPolicy: 'AllKeysLRU'
    modules: [
      {
        name: 'RediSearch'
      }
    ]
    persistence: {
      aofEnabled: false
      rdbEnabled: true
      rdbFrequency: '6h'
    }
    geoReplication: {
      groupNickname: 'copilot-cache'
      linkedDatabases: []
    }
  }
}

resource privateEndpoint 'Microsoft.Network/privateEndpoints@2024-01-01' = {
  name: '${clusterName}-pe'
  location: location
  properties: {
    subnet: {
      id: privateEndpointSubnetId
    }
    privateLinkServiceConnections: [
      {
        name: '${clusterName}-pls'
        properties: {
          privateLinkServiceId: cluster.id
          groupIds: ['redisEnterprise']
        }
      }
    ]
  }
}

output redisHost string = cluster.properties.hostName
output redisPort int = database.properties.port
Three details. evictionPolicy: 'AllKeysLRU' is what stops the cache filling up forever; least-recently-used entries get pushed out when memory pressure hits. modules: [{ name: 'RediSearch' }] is the line without which the whole exercise does not work; RediSearch is opt-in per database. rdbEnabled: true with a 6-hour cadence gives us snapshot recovery if the cluster goes sideways. AOF is off because the cache is not durable storage and the write-amplification cost is not worth it.
The Python middleware: embed, search, decide
The serving path is a thin wrapper around the model call. For a given incoming user prompt, we compute an embedding, run a vector similarity search against Redis with a tenant filter, and either return the cached completion or fall through to OpenAI and write back.
Embedding model choice: text-embedding-3-small for the cache key, Azure OpenAI's smallest embedding model. The retrieval side of the copilot uses text-embedding-3-large for higher recall against the knowledge base, but the cache only needs to be good enough at clustering semantically-equivalent prompts, and 3-small is roughly four times cheaper per call. That cost difference matters because the embedding is computed on every incoming question, hit or miss.
The index gets created once at deploy time:
from redis import Redis
from redis.commands.search.field import TagField, TextField, VectorField, NumericField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

r = Redis(
    host=REDIS_HOST,
    port=10000,
    ssl=True,
    password=REDIS_ACCESS_KEY,
    decode_responses=False,
)

INDEX_NAME = "idx:cache"
PREFIX = "cache:"

schema = (
    TagField("tenant_id"),
    TagField("prompt_version"),
    TextField("prompt_text"),
    VectorField(
        "prompt_embedding",
        "FLAT",
        {
            "TYPE": "FLOAT32",
            "DIM": 1536,
            "DISTANCE_METRIC": "COSINE",
            "INITIAL_CAP": 100000,
            "BLOCK_SIZE": 1000,
        },
    ),
    NumericField("created_at"),
)

definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)
try:
    r.ft(INDEX_NAME).create_index(fields=schema, definition=definition)
except Exception as e:
    if "Index already exists" not in str(e):
        raise
The index is FLAT, not HNSW, for our scale. HNSW is faster at retrieval for very large vector sets, but FLAT is exact, simple, and at 320K vectors the latency is still well under 15ms p95 on the E10 SKU. We will revisit when the index crosses about 5M entries.
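If we do cross that line, the change is confined to the VectorField parameters: redis-py exposes HNSW's knobs in the same options dict. A sketch with illustrative, untuned values:

from redis.commands.search.field import VectorField

# Same field, HNSW algorithm: approximate nearest-neighbour, sub-linear at large index sizes
VectorField(
    "prompt_embedding",
    "HNSW",
    {
        "TYPE": "FLOAT32",
        "DIM": 1536,
        "DISTANCE_METRIC": "COSINE",
        "M": 16,                 # graph out-degree: higher = better recall, more memory
        "EF_CONSTRUCTION": 200,  # build-time candidate list size
        "EF_RUNTIME": 10,        # query-time candidate list size
    },
)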
The HSET key schema is the load-bearing decision. Every cache entry is stored at a key of shape cache:{tenant_hash}:{prompt_version}:{sha256(prompt_text)}. The tenant prefix in the key is not just a query filter, it is part of the key itself. That distinction was the difference between "looks fine in dev" and "almost shipped a data-leak." More on that shortly.
import hashlib, time
import numpy as np
from openai import AzureOpenAI
from redis.commands.search.query import Query

aoai = AzureOpenAI(
    api_version="2024-10-21",
    azure_endpoint=AOAI_ENDPOINT,
    api_key=AOAI_KEY,
)

SIMILARITY_THRESHOLD = 0.94
PROMPT_VERSION = "v7"  # bump on system-prompt or retrieval changes

def _tenant_hash(tenant_id: str) -> str:
    return hashlib.sha256(f"tenant::{tenant_id}".encode()).hexdigest()[:16]

def _embed(text: str) -> np.ndarray:
    resp = aoai.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return np.array(resp.data[0].embedding, dtype=np.float32)

def _cache_key(tenant_id: str, prompt_text: str) -> str:
    th = _tenant_hash(tenant_id)
    pt = hashlib.sha256(prompt_text.encode()).hexdigest()
    return f"cache:{th}:{PROMPT_VERSION}:{pt}"

def semantic_lookup(tenant_id: str, prompt_text: str):
    vec = _embed(prompt_text)
    th = _tenant_hash(tenant_id)
    q = (
        Query(f"(@tenant_id:{{{th}}} @prompt_version:{{{PROMPT_VERSION}}})"
              "=>[KNN 5 @prompt_embedding $vec AS score]")
        .return_fields("completion", "score", "prompt_text", "model")
        .sort_by("score")
        .dialect(2)
        .paging(0, 5)
    )
    res = r.ft(INDEX_NAME).search(q, query_params={"vec": vec.tobytes()})
    for doc in res.docs:
        # RediSearch returns cosine distance, not similarity. Smaller is closer.
        cosine_distance = float(doc.score)
        cosine_similarity = 1.0 - cosine_distance
        if cosine_similarity >= SIMILARITY_THRESHOLD:
            return {
                "hit": True,
                "completion": doc.completion.decode(),
                "similarity": cosine_similarity,
                "cached_prompt": doc.prompt_text.decode(),
            }
    return {"hit": False, "embedding": vec}

def semantic_write(tenant_id: str, prompt_text: str, embedding: np.ndarray,
                   completion: str, model: str):
    key = _cache_key(tenant_id, prompt_text)
    th = _tenant_hash(tenant_id)
    r.hset(key, mapping={
        "tenant_id": th,
        "prompt_version": PROMPT_VERSION,
        "prompt_text": prompt_text,
        "prompt_embedding": embedding.tobytes(),
        "completion": completion,
        "model": model,
        "created_at": int(time.time()),
    })
    # TTL: 30 days by default; balance-shaped intents get 24 hours (intent routing not shown)
    r.expire(key, 60 * 60 * 24 * 30)
The KNN search syntax (=>[KNN 5 @prompt_embedding $vec AS score]) is RediSearch's vector query dialect. The @tenant_id:{th} filter restricts the search space to that tenant before nearest-neighbour ranking. The dialect(2) is mandatory; without it RediSearch falls back to an older parser that does not understand vector hybrid queries.
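For debugging it helps to see the raw command the Python builds. Roughly this, with placeholder values (the binary $vec blob is painful to paste into redis-cli by hand, which is why we debug from a small script instead):

FT.SEARCH idx:cache
  "(@tenant_id:{a1b2c3d4e5f60718} @prompt_version:{v7})=>[KNN 5 @prompt_embedding $vec AS score]"
  PARAMS 2 vec "<1536 float32s as a binary blob>"
  SORTBY score
  DIALECT 2
  RETURN 4 completion score prompt_text model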
The wrapper around this in the chat pipeline:
def chat(tenant_id: str, prompt_text: str):
    start = time.time()
    lookup = semantic_lookup(tenant_id, prompt_text)
    if lookup["hit"]:
        emit_telemetry("cache.semantic_lookup", {
            "result": "hit",
            "similarity_score": lookup["similarity"],
            "tenant_id": _tenant_hash(tenant_id),
            "latency_ms": int((time.time() - start) * 1000),
            "prompt_version": PROMPT_VERSION,
        })
        return lookup["completion"]

    completion = call_openai_with_retrieval(tenant_id, prompt_text)
    semantic_write(
        tenant_id=tenant_id,
        prompt_text=prompt_text,
        embedding=lookup["embedding"],
        completion=completion,
        model="gpt-4o-2024-08-06",
    )
    emit_telemetry("cache.semantic_lookup", {
        "result": "miss",
        "tenant_id": _tenant_hash(tenant_id),
        "latency_ms": int((time.time() - start) * 1000),
        "prompt_version": PROMPT_VERSION,
    })
    return completion
emit_telemetry writes an Application Insights custom event. We use a custom event rather than a metric because the cardinality of tenant_id (low hundreds in our case) is fine on events but expensive on metrics, and we want to slice hit rate by tenant for billing-recovery conversations.
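emit_telemetry itself is a thin wrapper. A minimal sketch, assuming the legacy applicationinsights SDK and an APPINSIGHTS_INSTRUMENTATION_KEY constant (both assumptions; the production version batches and flushes asynchronously):

from applicationinsights import TelemetryClient

_tc = TelemetryClient(APPINSIGHTS_INSTRUMENTATION_KEY)  # assumed config constant

def emit_telemetry(event_name: str, properties: dict) -> None:
    # Lands in App Insights as a customEvent; properties arrive as customDimensions in KQL
    _tc.track_event(event_name, {k: str(v) for k, v in properties.items()})
    _tc.flush()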
The similarity threshold, and what 0.85 cost us before we found it
The first version shipped with SIMILARITY_THRESHOLD = 0.85. In dev that looked superb. Hit rate 47% on a replayed week of traffic, p50 latency on hits 220ms versus roughly 1.8s uncached. The CFO note I sketched in my head said savings of about £8,800 a month. I almost shipped.
The reason I did not is that one of the test engineers (Priya, who has saved this product more than once) wrote a regression harness that took 200 known prompts with known correct answers and ran them through the full path with a clean cache, then a warm cache, then a warm cache with rephrased versions. The pass criterion was not just "did we get an answer" but "is the answer semantically equivalent to the ground-truth answer for that prompt." She used a separate judge model to score equivalence. At 0.85 threshold, 6% of cache hits were returning completions that were related to the question but not actually the right answer. The classic shape was a customer asking "when is my next payment due" and the cache returning a completion built for "when was my last payment" because the embeddings of those questions are uncomfortably close in cosine space.
A 6% wrong-answer rate is not acceptable. It is genuinely worse than no cache because the customer gets confidently incorrect information at lower latency, which is the worst combination of properties a system can have.
We pushed the threshold up. 0.90: hit rate 41%, false-positive rate 1.8%. Still too high. 0.92: 39%, 0.9%. Closer. 0.94: 38%, 0.4%. That was the band where the regression suite went green and stayed green across three different prompt sets. We landed there.
The thing I underweighted: cosine similarity at 0.94 sounds high. In a 1,536-dimensional vector space with reasonably trained embeddings, 0.94 is the difference between "what is my balance" and "show me my balance," and is meaningfully not the same as "what was my balance last month." 0.85 will conflate the two; 0.94 will not. The exact threshold is workload-specific. Run a regression harness against it before you trust the number.
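The sweep itself is mechanical once the judge exists. The loop, sketched with stand-ins (semantic_lookup_at and judge_equivalent are hypothetical names for a threshold-parameterised lookup and Priya's judge-model check):

def sweep_thresholds(cases, thresholds=(0.85, 0.90, 0.92, 0.94, 0.96)):
    # cases: (tenant_id, rephrased_prompt, ground_truth_answer) tuples replayed
    # against a warm cache; both helpers below are hypothetical stand-ins.
    for t in thresholds:
        hits = wrong = 0
        for tenant_id, prompt, truth in cases:
            result = semantic_lookup_at(tenant_id, prompt, threshold=t)
            if result["hit"]:
                hits += 1
                if not judge_equivalent(result["completion"], truth):
                    wrong += 1
        print(f"threshold {t:.2f}: hit rate {hits / len(cases):.1%}, "
              f"false positives {wrong / max(hits, 1):.1%} of hits")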
The near-miss
We ran an internal QA pass before the public rollout. The harness was 800 prompts, 12 simulated tenants, prompts split across realistic intents.
Test 314 was: tenant T-04 asks "what is my outstanding balance." Expected behaviour: cache miss (the test starts clean), call OpenAI with T-04's data, return T-04's balance. Test 315 was: tenant T-09 asks "show my outstanding balance." Expected behaviour: cache miss (different tenant, the entry from 314 should not be visible), call OpenAI with T-09's data, return T-09's balance.
Test 315 failed. The completion returned to T-09 contained T-04's outstanding balance. Different number, different currency, different customer.
The cause: in that build the cache key was cache:{prompt_hash} and the tenant_id was only a filter on the vector search, applied through the @tenant_id constraint in the RediSearch query. The intent was that the filter would restrict candidates before the KNN pass, and tenant T-09's search would never see T-04's entry. The actual behaviour: a bug in how the filter was being escaped meant that for tenant identifiers containing certain characters (a dash, in this case), the tag filter parsed incorrectly and matched all tags. The KNN pass then ran across all entries, picked T-04's as nearest neighbour, and returned its completion.
This is the worst possible class of bug. The system was returning the right shape of answer to the wrong customer, with no error, no log line, no exception. Without the regression test it would have shipped. The fix had to be defence-in-depth, not just bug-fix-the-escape.
What changed:
1. The tenant_id became part of the key prefix, not just a filter. The key is now cache:{tenant_hash}:{prompt_version}:{prompt_hash}. The prefix=[PREFIX] on the index definition combined with per-tenant key prefixes means the index is, in effect, partitioned per tenant: the @tenant_id filter and the key prefix give two independent isolation guarantees.
2. The filter on @tenant_id was kept, but a second check was added in the Python layer after the RediSearch result returns. We re-parse the key of the returned doc and assert that the tenant prefix matches the incoming tenant. If it does not, we treat the result as a miss and log an alert.
3. The regression harness was extended to specifically cover cross-tenant collision tests: 12 tenants, with at least one prompt per tenant that is a near-rewording of a prompt in another tenant. The test asserts that the cached completion never crosses tenant boundaries, and the build fails if any cross-tenant return is detected.
The updated lookup:
def semantic_lookup(tenant_id: str, prompt_text: str):
    vec = _embed(prompt_text)
    th = _tenant_hash(tenant_id)
    q = (
        Query(f"(@tenant_id:{{{th}}} @prompt_version:{{{PROMPT_VERSION}}})"
              "=>[KNN 5 @prompt_embedding $vec AS score]")
        .return_fields("completion", "score", "prompt_text", "tenant_id")
        .sort_by("score")
        .dialect(2)
        .paging(0, 5)
    )
    res = r.ft(INDEX_NAME).search(q, query_params={"vec": vec.tobytes()})
    expected_prefix = f"cache:{th}:"
    for doc in res.docs:
        # Defence in depth: the key prefix must match, and the tenant_id field must match.
        # doc.id is the Redis key; RediSearch returns it with every result.
        if not doc.id.startswith(expected_prefix):
            log_tenant_collision_alert(doc.id, th)
            continue
        if doc.tenant_id.decode() != th:
            log_tenant_collision_alert(doc.id, th)
            continue
        cosine_similarity = 1.0 - float(doc.score)
        if cosine_similarity >= SIMILARITY_THRESHOLD:
            return {"hit": True, "completion": doc.completion.decode(),
                    "similarity": cosine_similarity}
    return {"hit": False, "embedding": vec}
log_tenant_collision_alert writes a SEV-2 to PagerDuty. It has fired zero times in production. The harness has reproduced it twice when we deliberately broke the key prefix to test the assertion path; both times the alert fired and the request fell through to OpenAI rather than serving the wrong completion.
The lesson, more general: when isolation matters, the isolation boundary should be enforced at the lowest layer you can put it at. A query filter is a fine optimisation. It is a terrible security boundary on its own. The key prefix made the boundary structural; the filter is now belt-and-braces.
The pipeline that deploys it, with a regression gate
Every change to the caching layer goes through an Azure DevOps pipeline that deploys the Bicep, applies the Python middleware, and then runs the 200-prompt regression suite against the deployed environment before promoting to prod. The hit-rate assertion is the gate.
trigger:
  branches:
    include: [main]
  paths:
    include:
      - infra/cache/**
      - app/middleware/semantic_cache/**

variables:
  serviceConnection: 'sc-copilot-prod-uksouth'
  rgName: 'rg-copilot-prod-uksouth'

stages:
  - stage: DeployCache
    jobs:
      - job: DeployBicep
        steps:
          - checkout: self
          - task: AzureCLI@2
            displayName: 'Bicep deploy: Redis Enterprise + RediSearch'
            inputs:
              azureSubscription: $(serviceConnection)
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                az deployment group create \
                  --resource-group $(rgName) \
                  --template-file ./infra/cache/main.bicep \
                  --parameters @./infra/cache/prod.bicepparam

  - stage: RegressionTest
    dependsOn: DeployCache
    jobs:
      - job: RunSuite
        steps:
          - checkout: self
          - task: UsePythonVersion@0
            inputs:
              versionSpec: '3.11'
          - script: |
              pip install -r ./tests/regression/requirements.txt
              python ./tests/regression/run_cache_suite.py \
                --prompts ./tests/regression/known_prompts.jsonl \
                --tenants 12 \
                --expected-hit-rate-min 0.32 \
                --expected-hit-rate-max 0.44 \
                --max-cross-tenant-collisions 0 \
                --max-false-positive-rate 0.01
            displayName: 'Run 200-prompt regression suite'

  - stage: PromoteToProd
    dependsOn: RegressionTest
    condition: succeeded()
    jobs:
      - deployment: Promote
        environment: copilot-prod
        strategy:
          runOnce:
            deploy:
              steps:
                - script: echo "Cache version promoted"
The expected-hit-rate-min: 0.32 and expected-hit-rate-max: 0.44 band is wide enough to accommodate normal variation in the prompt set, narrow enough to catch a deploy that has accidentally invalidated the cache. Below 0.32 means something has broken; above 0.44 on this fixed suite means we have probably lowered the threshold somewhere and need to look. The job fails on either side, hard.
The max-cross-tenant-collisions: 0 is the gate that exists because of the near-miss. Any cross-tenant return is a build failure.
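The gate logic inside run_cache_suite.py reduces to a handful of assertions at the end of the run. A sketch of the shape (stats and args here are stand-ins for the suite's counters and parsed CLI flags):

import sys

def gate(stats, args):
    # Fail the build, hard, on either side of the hit-rate band and on any collision.
    failures = []
    if stats.hit_rate < args.expected_hit_rate_min:
        failures.append(f"hit rate {stats.hit_rate:.1%} below floor: cache likely broken")
    if stats.hit_rate > args.expected_hit_rate_max:
        failures.append(f"hit rate {stats.hit_rate:.1%} above ceiling: threshold lowered?")
    if stats.cross_tenant_collisions > args.max_cross_tenant_collisions:
        failures.append(f"{stats.cross_tenant_collisions} cross-tenant returns")
    if stats.false_positive_rate > args.max_false_positive_rate:
        failures.append(f"false-positive rate {stats.false_positive_rate:.1%} over budget")
    if failures:
        print("\n".join(failures), file=sys.stderr)
        sys.exit(1)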
KQL: cache hit rate by hour, and the August invalidation
Application Insights logs the cache.semantic_lookup custom event for every request. Two queries we keep on a workbook:
customEvents
| where timestamp > ago(24h)
| where name == "cache.semantic_lookup"
| extend result = tostring(customDimensions.result)
| summarize hits = countif(result == "hit"),
total = count()
by bin(timestamp, 1h)
| extend hit_rate = round(100.0 * hits / total, 1)
| project timestamp, hit_rate, hits, total
| order by timestamp asc
The shape is reassuringly boring most days. Hovering around 36% to 40%, dipping slightly in the small hours when traffic is more long-tail, peaking around 41% in mid-afternoon when the support intents cluster more tightly.
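The second query slices the same event by prompt_version, which is the view that makes an invalidation visible. Sketched from the event schema above rather than copied verbatim:

customEvents
| where timestamp > ago(7d)
| where name == "cache.semantic_lookup"
| extend result = tostring(customDimensions.result),
         prompt_version = tostring(customDimensions.prompt_version)
| summarize hits = countif(result == "hit"), total = count()
  by prompt_version, bin(timestamp, 1d)
| extend hit_rate = round(100.0 * hits / total, 1)
| order by timestamp asc, prompt_version asc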
The exception was August. On 2025-08-12 we shipped a new system prompt for the copilot. Different framing, slightly different output format, the kind of change that the product team called "tone tweak." Hit rate fell off a cliff the next morning. 38% on the 11th, 9% on the 13th, climbing back through the week to about 22% by the 17th.
The cause was subtle and obvious in hindsight. The cache stored completions written by the old system prompt. The new system prompt produced answers in a slightly different shape (different intro phrasing, different sign-off). The semantic match on the incoming question still worked; the customer was still asking "what is my balance." But the cached completion in some cases would have looked off if served against the new prompt's expected style. We had set the threshold at 0.94 specifically to be strict, so most of those completions still served (the answer text is the same even if the framing differs), but the new prompt also slightly changed how the model interpreted certain intents, and the regression harness flagged a 1.2% jump in semantic-mismatch on the post-deploy suite.
Fix: include PROMPT_VERSION in the key and in the search filter, and bump it on every system-prompt or retrieval-pipeline change. Old cache entries do not match the new prompt-version filter; they sit unused and get evicted by LRU within a few days. The cache rebuilds itself naturally as new traffic comes in.
PROMPT_VERSION = "v7" # bump when system prompt or retrieval pipeline changes
The bump is a one-line change committed alongside the prompt change. Forgetting to bump is the failure mode now. We added a CI check that greps the prompt files and the PROMPT_VERSION constant in the same commit, and fails if one changed without the other.
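The check is small enough to sketch whole (paths are illustrative; adjust to the repo layout):

# CI guard: a prompt-file change must ship with a PROMPT_VERSION change in the same commit.
import subprocess
import sys

def diff(*paths: str) -> str:
    return subprocess.run(
        ["git", "diff", "HEAD~1", "HEAD", "--", *paths],
        capture_output=True, text=True, check=True,
    ).stdout

prompt_diff = diff("app/prompts/")
middleware_diff = diff("app/middleware/semantic_cache/")

if prompt_diff and "PROMPT_VERSION" not in middleware_diff:
    print("Prompt files changed without a PROMPT_VERSION bump", file=sys.stderr)
    sys.exit(1)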
Troubleshooting
RediSearch error: Index not found: idx:cache is the error you see when the index has not yet been created in a fresh database or has been dropped by an FT.DROPINDEX. The Bicep deploys the database, not the index; the index is created by an init container on the application's first start. If the init container did not run (we missed this on a clean DR rehearsal), FT.SEARCH against the missing index returns this error immediately and the middleware logs it as a cache.index_missing event. The fix is to re-run the init container; the long-term fix we landed on was to make the application's startup health check assert index existence before reporting ready.
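The readiness assertion is a single call, because FT.INFO raises when the index is absent. Roughly (reusing the r and INDEX_NAME from earlier):

from redis.exceptions import ResponseError

def cache_index_ready() -> bool:
    # FT.INFO raises if the index does not exist
    try:
        r.ft(INDEX_NAME).info()
        return True
    except ResponseError:
        return False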
MOVED 5798 10.0.1.4:6380 is the cluster's response when a key lives on a different shard than the one your client queried. On Enterprise tier with clustering, every shard owns a slot range, and a client without cluster awareness will get this back on roughly half its requests. The Python redis library handles this if you connect with RedisCluster(...) instead of plain Redis(...). We had this on day one when someone copy-pasted the connection code from a non-clustered staging environment; the symptom was every other request returning a redirect. The fix is the client class, not the server.
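The one-line version of the fix, for the record (same connection parameters, different client class):

from redis.cluster import RedisCluster

# Cluster-aware client: follows MOVED redirects and routes keys to the owning shard
r = RedisCluster(
    host=REDIS_HOST,
    port=10000,
    ssl=True,
    password=REDIS_ACCESS_KEY,
    decode_responses=False,
)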
WRONGTYPE Operation against a key holding the wrong kind of value is what RediSearch returns when an FT.SEARCH hits a key that exists at the indexed prefix but is not a HASH (for example, someone has put a string at cache:tenant123:v7:abc for debugging reasons). The cache layer should never write non-HASH values at the indexed prefix. We hit this once when a colleague used redis-cli to set a debug key for testing and forgot to clean it up; the index then refused to serve any search that traversed the corrupted entry. DEL on the offending key cleared it; RediSearch tracks the keyspace and drops deleted keys from the index on its own, so no explicit reindex was needed.
(error) Cannot allocate memory is what you see when memory pressure spikes faster than LRU eviction can keep up. The trigger in our case was a backfill job that wrote 80K entries in five minutes; eviction kept up at steady state, but the burst momentarily pushed memory above the threshold and writes started failing. We added rate limiting to the backfill (1K writes per second) and the issue did not recur.
MISSING parameter for query from RediSearch when calling KNN search means the dialect(2) modifier is absent. Without dialect 2, the older parser does not understand $vec as a parameter and treats it as a literal token. The error is unhelpful but the fix is one line.
The cost math, eight months in
Pre-cache, March 2025: £18,300 of Azure OpenAI consumption, 71K daily chat turns. Per-turn cost roughly £0.0083 averaged across input and output tokens.
Post-cache, steady state by May 2025: £11,400 of consumption at slightly higher traffic (74K daily turns by then). Hit rate 38%. The cache itself, Redis Enterprise E10 zone-redundant, lands at about £820/month on its own. Embedding costs for the cache lookup path add about £180/month at 74K turns/day on text-embedding-3-small. Net change: roughly £6,900 a month off the OpenAI invoice, call it £5,900 once the cache's own running costs are netted off, and the saving compounds with traffic growth because every additional cached intent rides at near-zero marginal cost.
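The reconciliation, written out, since the £6,900 and the £5,900 are both defensible depending on what you count:

march_bill = 18_300   # pre-cache OpenAI spend, £/month
may_bill = 11_400     # post-cache OpenAI spend, slightly higher traffic
redis_cost = 820      # E10 zone-redundant, £/month
embedding_cost = 180  # cache-lookup embeddings, £/month

invoice_delta = march_bill - may_bill                     # £6,900 off the OpenAI bill
net_saving = invoice_delta - redis_cost - embedding_cost  # £5,900 after cache costs
print(f"invoice delta £{invoice_delta:,}, net £{net_saving:,}")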
The latency story is the part the product team cared about, not the CFO. p50 latency on cache hits sits at 240ms (most of it is the embedding call, which is itself ~120ms). p50 on uncached calls is roughly 1,800ms. p95 on hits is 380ms; on misses it is 3,400ms. The 38% of traffic that hits the cache returns roughly seven times faster, which is visible in the product as snappier responses on the questions customers ask most. That has shown up in CSAT in a way I would not have predicted: the questions a customer asks most often are also the ones they care most about getting quickly. The cache effectively prioritised the high-frequency intents without us having to design that explicitly.
What changed about how we treat the cache
In month one, the cache was a cost optimisation that we would have happily turned off if it caused trouble. By month four, it was a load-bearing component of the chat path. By month eight, the cache has its own oncall rotation, its own runbook, its own dashboard, and a defined RTO of 30 minutes (cold cache is acceptable; cache being entirely unreachable is not, because the LLM bill at full uncached load would more than double overnight).
The mental shift was: this is not an optimisation layer, it is part of the product. Optimisations can fail open. Product layers cannot. The dashboards now show cache hit rate alongside chat turn volume and CSAT, because those three numbers are linked. A drop in hit rate is a budget signal, a latency signal, and a possible safety signal (because the regression suite is the thing that calibrated the threshold).
The piece I find myself coming back to, the one I would tell anyone building this for the first time, is the near-miss. The bug that returned T-04's balance to T-09 was not in the model. The model was correct, with the data it was given. The bug was in our cache layer, and specifically in the assumption that a query filter was enough to enforce a tenant boundary. The fix, putting the tenant into the key prefix, into the filter, and into a post-query assertion, made the boundary structural in three places. We did not need three. We have three because the cost of getting it wrong once is the kind of cost that ends products, not the kind of cost that ends quarters. The CFO would have understood, eventually, an extra £6,900 a month. They would not have understood "we returned the wrong customer's balance because we wanted to save money." Those two outcomes were a single regression test apart. The regression test is the cheapest insurance policy I have ever bought.
The other lesson, smaller but real: the prompt-version key took a week to write and saved us from a recurring class of incident that I had not even fully articulated before August's deploy made it impossible to miss. The system prompt is part of the cache key now because the answer is a function of the prompt, the question, and the data, and changing any of those should invalidate the entry. Obvious in retrospect. Not obvious in the planning meeting where we were arguing about hit-rate targets.
We are still on the same E10 instance, still under 25% memory utilisation, still serving 38% of traffic from cache. The CFO has not asked about the OpenAI invoice since June. That, more than anything, is the metric I track.