
Confluence RAG on Azure AI Search: chunking, semantic ranker, and the eval harness that dragged hallucination from 14% to 2.3%

A platform engineer asked the internal copilot how to rotate a PagerDuty integration key and got a confident answer pointing at a 2021 runbook for a service that no longer existed. Hallucination was at 14% in the first eval pass against 11,400 Confluence pages. Twelve weeks later it was 2.3%. This is the full rebuild: hybrid search, the chunking strategy that put the breadcrumb inside the vector, the semantic ranker that earned its 620ms, and the nightly eval harness that proved every move.


A platform engineer asked our internal copilot, in plain text, "how do I rotate the PagerDuty integration key for the payments service." The model came back with a four-paragraph answer that was confident, well-structured, and pointed at a runbook called rotate-pagerduty-key.md which had been written in 2021 for a service that no longer existed. The current runbook lived two spaces over in Confluence, with a slightly different title, and described an entirely different rotation flow against PagerDuty's v3 API. The engineer followed the wrong runbook, generated a key in a sandbox account that did not federate back to production, paged the on-call at 3:14am for a service that was not actually down, and filed a Jira ticket the next morning asking us to "either fix the AI or turn it off."

That was week one. The hallucination rate measured by our first eval pass was 14%. By week twelve it was 2.3%. The system now answers around 40 to 60 questions a day, p95 latency is 1.8 seconds, and the same engineer who filed that ticket is one of the heaviest users. This is the whole rebuild, from the naive first cut to the chunking strategy that fixed grounding to the eval harness that proved it.

The corpus and the first naive cut

The source is a Confluence export pulled via the Space Export API: 11,400 pages, 380MB of HTML, spread across 47 spaces. Operations runbooks, architecture decision records, on-call rotations, post-mortems back to 2019, half-finished design docs, and a corner of the corpus we still call "the wiki graveyard" where pages link to other pages that link to pages that 404 against services decommissioned three years ago. The graveyard matters because pages in it look authoritative when retrieved out of context. A 2021 runbook on PagerDuty rotation reads identically to a 2025 runbook on PagerDuty rotation until you check the breadcrumb.

The first cut was the version every team builds in week one and then regrets: dump page text from the Confluence export, strip the HTML to plain text, embed the entire page as a single vector against text-embedding-3-large, push it to an Azure AI Search index with a single content field and a single contentVector field, and at query time pull the top 3 pages by cosine similarity, stuff them into the prompt, and ask the model. The whole thing took an afternoon. The eval told us how bad it was.

# v0 indexer: do not ship this
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

client = AzureOpenAI(api_version="2024-10-21", azure_endpoint=os.environ["AOAI_ENDPOINT"])

# index name and env vars are whatever your deployment uses
search_client = SearchClient(
    endpoint=os.environ["SEARCH_ENDPOINT"],
    index_name="confluence-pages-v0",
    credential=AzureKeyCredential(os.environ["SEARCH_KEY"]),
)

def embed(text: str) -> list[float]:
    # naive truncation at 8,000 characters: long pages lose their tails
    resp = client.embeddings.create(model="text-embedding-3-large", input=text[:8000])
    return resp.data[0].embedding

# embed full page, push one doc per page
for page in pages:  # parsed pages from the Confluence export
    doc = {
        "id": page.id,
        "content": page.text,
        "contentVector": embed(page.text),
        "title": page.title,
    }
    search_client.upload_documents(documents=[doc])

The first eval pass against 240 ground-truth questions: hit rate 41%, hallucination 14%, MRR 0.29. Hit rate is the fraction of questions where at least one of the retrieved top-k documents actually contained the answer; MRR is the mean reciprocal rank of the first correct document. Hallucination is the fraction of answers where the judge model marked the response as ungrounded against the retrieved chunks. 41% hit rate means most of the time the right page was not even in the model's context window. The 14% hallucination was the model confidently filling that vacuum.

The diagnosis was obvious once we looked at the embeddings. Confluence pages are big. The median page is 2,200 tokens; the long tail goes to 18,000. A single embedding vector for an 18,000 token page is a heavy average. The signal for "this paragraph in this page talks about PagerDuty integration key rotation" gets smoothed flat by the 17,000 surrounding tokens about something else. The vector ends up close to "general operations runbook" and not close to "PagerDuty rotation specifically." That is why a question about rotating the PagerDuty key matched a 2021 runbook just as well as the current one: at page granularity both are general operations runbooks, and the only thing the embedding sees is the average semantic weight.
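You can see the averaging effect without any real embeddings. A toy sketch with random unit vectors standing in for paragraph embeddings; numpy only, invented numbers, nothing from the actual corpus:

import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

# one paragraph correlated with the query, thirty about other things
query = unit(rng.normal(size=3072))
relevant = unit(0.8 * query + 0.2 * unit(rng.normal(size=3072)))
noise = [unit(rng.normal(size=3072)) for _ in range(30)]

# the v0 "whole page" embedding is effectively the mean of its paragraphs
page_vector = unit(np.mean([relevant, *noise], axis=0))

print(f"{query @ relevant:.2f}")     # close to 1: the chunk-level signal
print(f"{query @ page_vector:.2f}")  # far lower: smoothed flat by the other thirty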

Index schema, second iteration

The fix was a real Azure AI Search schema designed for chunk-level retrieval with both lexical and vector fields, a vector profile pointing at HNSW, and a semantic configuration we could call out at query time. The full schema, in the JSON form we deploy via the management API (Azure AI Search vector search, semantic ranker):

{
  "name": "confluence-chunks-v3",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true, "filterable": true },
    { "name": "page_id", "type": "Edm.String", "filterable": true, "facetable": true },
    { "name": "chunk_index", "type": "Edm.Int32", "filterable": true },
    { "name": "space", "type": "Edm.String", "filterable": true, "facetable": true },
    { "name": "breadcrumb", "type": "Edm.String", "searchable": true },
    { "name": "page_title", "type": "Edm.String", "searchable": true, "filterable": true },
    { "name": "section_heading", "type": "Edm.String", "searchable": true },
    { "name": "content", "type": "Edm.String", "searchable": true, "analyzer": "en.microsoft" },
    { "name": "url", "type": "Edm.String", "filterable": true },
    { "name": "updated_at", "type": "Edm.DateTimeOffset", "filterable": true, "sortable": true },
    { "name": "token_count", "type": "Edm.Int32", "filterable": true },
    {
      "name": "contentVector",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "dimensions": 3072,
      "vectorSearchProfile": "hnsw-default"
    }
  ],
  "vectorSearch": {
    "algorithms": [
      {
        "name": "hnsw-cosine",
        "kind": "hnsw",
        "hnswParameters": { "m": 4, "efConstruction": 400, "efSearch": 500, "metric": "cosine" }
      }
    ],
    "profiles": [
      { "name": "hnsw-default", "algorithm": "hnsw-cosine" }
    ]
  },
  "semantic": {
    "configurations": [
      {
        "name": "conf-semantic",
        "prioritizedFields": {
          "titleField": { "fieldName": "page_title" },
          "prioritizedContentFields": [{ "fieldName": "content" }],
          "prioritizedKeywordsFields": [
            { "fieldName": "breadcrumb" },
            { "fieldName": "section_heading" }
          ]
        }
      }
    ]
  },
  "scoringProfiles": [
    {
      "name": "recency-boost",
      "functions": [
        {
          "type": "freshness",
          "fieldName": "updated_at",
          "boost": 1.6,
          "interpolation": "logarithmic",
          "freshness": { "boostingDuration": "P365D" }
        }
      ]
    }
  ]
}

The vector dimension is 3072 because text-embedding-3-large produces a 3072-dimension vector. The HNSW parameters are tuned for the corpus size; efConstruction: 400 and efSearch: 500 add some indexing cost but brought recall@10 from roughly 0.91 to 0.97 on our ground-truth set. The en.microsoft analyzer was chosen over en.lucene because the Microsoft analyzer handles operations-jargon plurals slightly better in practice ("runbooks" vs "runbook" tokenize consistently). The semantic configuration names three field categories with different weights at rerank time, and the scoring profile applies a logarithmic freshness boost across a one-year window so the 2021 graveyard runbooks lose ground to the 2025 ones at the lexical-scoring step before they ever reach the reranker.
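For completeness, the deploy step is a single PUT of that JSON at the indexes endpoint. A minimal sketch with requests; the endpoint, key handling, and file name here are placeholders for whatever your management tooling actually uses:

import json, os
import requests

endpoint = "https://<your-service>.search.windows.net"  # placeholder
headers = {"Content-Type": "application/json", "api-key": os.environ["SEARCH_ADMIN_KEY"]}

with open("confluence-chunks-v3.json") as f:  # the schema above
    index = json.load(f)

# PUT is full-replacement, not a merge: always send the complete definition,
# semantic block included (see Troubleshooting below)
resp = requests.put(
    f"{endpoint}/indexes/{index['name']}?api-version=2024-07-01",
    headers=headers,
    json=index,
)
resp.raise_for_status()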

The chunking pipeline that fixed grounding

The single biggest move from 41% hit rate to 73% was the chunking strategy. Iteration one was the obvious one: a 512-token recursive splitter with no overlap. That moved hit rate to 54% and hallucination to 11%. Better, but the wrong-runbook problem persisted, because a 512-token chunk of "rotate the integration key by going to the PagerDuty admin console and selecting your service" reads identically whether the parent page is from 2021 or 2025. The chunk has no context.

Iteration two was the version that actually shipped. Two passes over each page. The first pass is a header-aware recursive split using langchain_text_splitters.RecursiveCharacterTextSplitter with the Confluence HTML preserved long enough to identify H1, H2, H3 boundaries; we split on those boundaries first and only fall back to character splits inside a section that exceeds 700 tokens. The second pass attaches the page's breadcrumb (space name, parent page chain, page title) as a metadata prefix that is itself embedded into the same vector as the chunk body. Every chunk carries its own provenance into the vector space.

import re
import tiktoken
from bs4 import BeautifulSoup
from langchain_text_splitters import RecursiveCharacterTextSplitter

tok = tiktoken.encoding_for_model("text-embedding-3-large")
MAX_TOKENS = 700
OVERLAP = 80

splitter = RecursiveCharacterTextSplitter(
    chunk_size=MAX_TOKENS,
    chunk_overlap=OVERLAP,
    length_function=lambda s: len(tok.encode(s)),
    separators=["\n## ", "\n### ", "\n#### ", "\n\n", "\n", ". ", " "],
)

def html_to_sectioned_markdown(html: str) -> list[tuple[str, str]]:
    """Return list of (section_heading_path, markdown_body).

    section_heading_path is the H1>H2>H3 chain leading to this section.
    """
    soup = BeautifulSoup(html, "lxml")
    # strip Confluence macros that produce noise
    for macro in soup.select("ac\\:structured-macro, ac\\:placeholder"):
        macro.decompose()

    sections, stack = [], []
    current_md = []

    def flush():
        if current_md:
            path = " > ".join(h for _, h in stack)
            sections.append((path, "\n".join(current_md).strip()))
            current_md.clear()

    for el in soup.body.descendants if soup.body else []:
        if el.name in ("h1", "h2", "h3"):
            level = int(el.name[1])
            flush()
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, el.get_text(strip=True)))
        elif el.name == "p":
            current_md.append(el.get_text(" ", strip=True))
        elif el.name == "pre":
            current_md.append("```\n" + el.get_text() + "\n```")
        elif el.name == "li":
            current_md.append(f"- {el.get_text(' ', strip=True)}")
    flush()
    return sections

def chunk_page(page):
    breadcrumb = " > ".join([page.space_name, *page.ancestor_titles, page.title])
    sections = html_to_sectioned_markdown(page.html)
    out = []
    for section_path, body in sections:
        if not body:
            continue
        pieces = splitter.split_text(body) if len(tok.encode(body)) > MAX_TOKENS else [body]
        for i, piece in enumerate(pieces):
            heading = f"{breadcrumb} > {section_path}" if section_path else breadcrumb
            prefix = f"[Source: {heading}]\n"
            out.append({
                "id": f"{page.id}::s{len(out):04d}",
                "page_id": page.id,
                "chunk_index": len(out),
                "space": page.space_key,
                "breadcrumb": breadcrumb,
                "page_title": page.title,
                "section_heading": section_path,
                "content": prefix + piece,
                "url": page.url,
                "updated_at": page.updated_at.isoformat(),
                "token_count": len(tok.encode(piece)),
            })
    return out

Two details inside this function carried disproportionate weight. The Source: prefix that is glued onto every chunk body before embedding means the breadcrumb is inside the vector itself. A chunk about PagerDuty rotation under the 2025 payments space embeds differently from the identical body under the 2021 graveyard space, because the prefix differs. The same paragraph in two places no longer collapses to the same point in vector space. The other detail is the separator ordering passed to RecursiveCharacterTextSplitter. The default separators put \n\n first, which means the splitter falls back to paragraph breaks before it considers H2/H3 boundaries. Inverting the order and putting Markdown header tokens first means we always split on headings when we can.
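To see what the prefix does to a concrete chunk, here is chunk_page run against a hypothetical page object; the attribute names are the ones the function expects, the page itself is invented:

from datetime import datetime, timezone
from types import SimpleNamespace

page = SimpleNamespace(
    id="98342",
    space_key="OPS25",
    space_name="Operations 2025",
    ancestor_titles=["Runbooks", "PagerDuty"],
    title="Rotate integration keys",
    html="<h2>Rotation</h2><p>Generate a new integration key in the admin console.</p>",
    url="https://wiki.example.com/x/98342",
    updated_at=datetime(2025, 3, 2, tzinfo=timezone.utc),
)

print(chunk_page(page)[0]["content"])
# [Source: Operations 2025 > Runbooks > PagerDuty > Rotate integration keys > Rotation]
# Generate a new integration key in the admin console.

The same body under a 2021 graveyard breadcrumb gets a different prefix, and therefore a different vector.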

The embedding step runs in batches against text-embedding-3-large (Azure OpenAI embeddings). The batch size is 16 because that is where the latency-vs-throughput curve flattens for this model on our deployment:

from openai import AzureOpenAI
import os, time, itertools

aoai = AzureOpenAI(
    api_version="2024-10-21",
    azure_endpoint=os.environ["AOAI_ENDPOINT"],
    # no azure_deployment pin: this client is reused for chat and the judge
    # later, so each call passes its own deployment name via model=
)

def batched(it, n):
    it = iter(it)
    while batch := list(itertools.islice(it, n)):
        yield batch

def embed_chunks(chunks, batch_size=16):
    for batch in batched(chunks, batch_size):
        texts = [c["content"] for c in batch]
        for attempt in range(5):
            try:
                resp = aoai.embeddings.create(model="text-embedding-3-large", input=texts)
                for c, d in zip(batch, resp.data):
                    c["contentVector"] = d.embedding
                break
            except Exception as e:
                if "RateLimitExceeded" in str(e) and attempt < 4:
                    time.sleep(2 ** attempt)
                    continue
                raise
        yield from batch

Total embedding cost across 11,400 pages, after chunking down to roughly 168,000 chunks: $46 on text-embedding-3-large at the September 2025 pricing. The index occupies 1.2GB. Reindex throughput from a single worker is about 1,400 chunks per minute. The full corpus reindexes from scratch in under three hours; incrementals run in minutes off a Confluence webhook.

After iteration two, the numbers moved: hit rate 73%, hallucination 6%, MRR 0.61. The wrong-runbook class of failure dropped from 22% of bad answers to 4%, because the breadcrumb-in-the-vector trick made the 2021 PagerDuty chunk and the 2025 PagerDuty chunk distinguishable points in embedding space.

Hybrid retrieval with the semantic ranker

The next move was hybrid retrieval. Pure vector search misses on questions where the user uses an exact term we have indexed, but the term happens to be rare enough that the model never quite learned its semantic neighborhood. Internal service names are the obvious case: a question about pmts-ledger-svc should hit the page that mentions pmts-ledger-svc by name, even if that page is otherwise about Kubernetes deployment patterns and the vector wants to drift toward the abstract topic.

Hybrid retrieval in Azure AI Search combines BM25 lexical scoring with the vector similarity score using Reciprocal Rank Fusion. RRF takes the rank of a document in each subsystem and combines them. A document ranked third by BM25 and fourth by vector ends up well-placed in the fused list even if neither system put it first. Then the semantic ranker reorders the top 50 using a deeper transformer model that does cross-attention over query and document. The semantic ranker call is the one expensive piece of the retrieval pipeline; it adds roughly 600ms to p95 latency. On our eval it is worth every millisecond.
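The fusion step is simple enough to sketch. The constant k is 60 in the original RRF paper; this is an illustration of the math, not the service's internals:

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    # each subsystem contributes 1/(k + rank) for every doc it returns
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

bm25_ranks = ["runbook-2021", "deploy-guide", "runbook-2025"]
vector_ranks = ["runbook-2025", "oncall-rota", "runbook-2021"]
print(rrf_fuse([bm25_ranks, vector_ranks]))
# the two runbooks appear in both lists and fuse ahead of the documents
# that only one subsystem liked

The retrieval call against the index: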

from azure.search.documents.models import VectorizedQuery

def hybrid_search(query: str, top_k: int = 8, space_filter: str | None = None):
    query_vector = embed_query(query)

    vector_query = VectorizedQuery(
        vector=query_vector,
        k_nearest_neighbors=50,
        fields="contentVector",
        exhaustive=False,
    )

    filter_clause = f"space eq '{space_filter}'" if space_filter else None

    results = search_client.search(
        search_text=query,
        vector_queries=[vector_query],
        select=["id", "page_id", "content", "page_title", "breadcrumb", "url", "updated_at"],
        filter=filter_clause,
        query_type="semantic",
        semantic_configuration_name="conf-semantic",
        query_caption="extractive",
        query_answer="extractive|count-3",
        top=top_k,
        scoring_profile="recency-boost",
    )

    out = []
    for r in results:
        out.append({
            "id": r["id"],
            "page_id": r["page_id"],
            "content": r["content"],
            "title": r["page_title"],
            "breadcrumb": r["breadcrumb"],
            "url": r["url"],
            "search_score": r["@search.score"],
            "reranker_score": r["@search.reranker_score"],
            "caption": r["@search.captions"][0].text if r.get("@search.captions") else None,
        })
    return out

Two parameters in this call are load-bearing. k_nearest_neighbors=50 on the vector query is the size of the candidate pool the semantic ranker gets to chew on; the docs default this to a lower number and we found 50 was the inflection point where rerank quality stopped improving. query_type="semantic" with semantic_configuration_name="conf-semantic" is what activates the L2 reranker. The reranker_score returned per result is a 0 to 4 floating-point number that we use both for downstream grounding gates and for telemetry: every retrieval logs the distribution of reranker scores it got back, and when the top result's score is under 1.5 we treat it as a low-confidence retrieval and route the question differently.
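The routing on that floor is a few lines on top of hybrid_search; the logger and the fallback shape here are placeholders rather than our production path:

import logging

log = logging.getLogger("copilot.retrieval")
CONFIDENCE_FLOOR = 1.5  # reranker_score is on a 0-4 scale

def retrieve_with_confidence(query: str) -> dict:
    results = hybrid_search(query, top_k=8)
    log.info("reranker scores for %r: %s",
             query, [round(r["reranker_score"], 2) for r in results])
    if not results or results[0]["reranker_score"] < CONFIDENCE_FLOOR:
        # low-confidence retrieval: don't generate; surface the raw hits instead
        return {"mode": "low_confidence", "results": results}
    return {"mode": "grounded", "results": results}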

The eval after hybrid retrieval came in: hit rate 84%, hallucination 4%, MRR 0.71. The Confluence service-name questions stopped missing, the recency-boost scoring profile moved 2024 and 2025 docs above their 2019 namesakes consistently, and the reranker's role was visible in the score distribution: questions where the vector retrieval got the right doc into the top 50 but ranked it tenth were getting reranked into the top 3 about 80% of the time.

The grounding prompt that did the last percent

The remaining hallucination, after retrieval was sorted, was the model fabricating plausible-sounding answers even when the retrieved chunks did not contain the answer. The fix is unglamorous but worked. The prompt does three things: it concatenates the retrieved chunks with their chunk IDs as visible anchors, it requires the model to cite chunk IDs inline in its answer, and it instructs the model to refuse with a specific phrase if it cannot ground every claim.

SYSTEM = """You answer questions about Acme's internal operations using ONLY the
context passages provided. Each passage is preceded by a tag like [chunk_id: abc::s0007].

Rules:
1. Cite chunk_ids inline using the format [^abc::s0007] after each claim that
   relies on a passage.
2. If the context does not contain enough information to answer the question,
   reply with exactly: "I don't know based on what I found." and nothing else.
3. Do not use prior knowledge. If the context says nothing about it, you say nothing about it.
4. If the question is about a procedure, prefer passages from the most recently
   updated page, which you can see in the breadcrumb.
"""

def build_user_message(query: str, retrieved: list[dict]) -> str:
    parts = []
    for r in retrieved:
        if r["reranker_score"] < 1.5:
            continue  # below confidence floor
        parts.append(
            f"[chunk_id: {r['id']}]\n"
            f"breadcrumb: {r['breadcrumb']}\n"
            f"updated_at: {r['updated_at']}\n"
            f"---\n{r['content']}\n"
        )
    if not parts:
        return f"NO PASSAGES RETRIEVED.\n\nQuestion: {query}"
    return "Context passages:\n\n" + "\n\n".join(parts) + f"\n\nQuestion: {query}"

def answer(query: str):
    retrieved = hybrid_search(query, top_k=8)
    resp = aoai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": build_user_message(query, retrieved)},
        ],
        temperature=0.1,
        max_tokens=900,
    )
    return resp.choices[0].message.content, retrieved

Two things matter inside this code. The reranker_score < 1.5 filter is the grounding gate; it drops chunks that the semantic ranker did not actually like. Without it, the model gets a context window stuffed with low-relevance text and still tries to use it. With it, the model often gets fewer chunks but better ones, and the refusal path kicks in cleanly when nothing relevant came back. The [^chunk_id] citation convention turns the answer into something we can mechanically verify: a postprocessor extracts every cited chunk_id, looks up the chunk text, and checks that the cited claim is in the chunk. If a citation does not check out, we flag the answer in telemetry and the eval harness counts it as hallucinated even if a human reader would have called it correct.
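The mechanical half of that postprocessor is short. A sketch assuming the [^chunk_id] convention from the system prompt; the claim-by-claim grounding check is the judge model's job and is not reproduced here:

import re

CITE_RE = re.compile(r"\[\^([^\]\s]+)\]")
REFUSAL = "I don't know based on what I found."

def verify_citations(answer: str, retrieved: list[dict]) -> list[str]:
    """Return citation problems; an empty list means the citations check out."""
    known = {r["id"] for r in retrieved}
    cited = set(CITE_RE.findall(answer))
    problems = [f"cites unknown chunk {c}" for c in sorted(cited - known)]
    if not cited and answer.strip() != REFUSAL:
        problems.append("substantive answer with no citations")
    return problems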

The grounding instruction alone, before any other change, moved hallucination from 4% to 2.3%. The model genuinely uses the refusal phrase. It says "I don't know based on what I found" about three to four times a day, which is exactly what we want; the alternative is making something up.

The eval harness

None of the iteration above would have been measurable without a real eval harness. 240 ground-truth questions, written by 14 engineers across the operations and platform orgs, every question with a reference answer and the canonical chunk ID that should ground that answer. The harness runs nightly against the staging index and on demand against any branch of the pipeline that an engineer is working on. The judge model is gpt-4o with a 0-3 grounding rubric: 0 is fabricated, 1 is partially grounded, 2 is mostly grounded with a minor stretch, 3 is fully grounded with no claims outside the context.

import json
from pathlib import Path

JUDGE_SYSTEM = """You grade the groundedness of an answer against retrieved context.

Score 0: The answer contains claims that are not supported by the context.
Score 1: The answer contains a mix of grounded and ungrounded claims.
Score 2: The answer is mostly grounded with at most one minor unsupported claim.
Score 3: Every claim in the answer is directly supported by the context.

Return JSON: {"score": <int>, "reasoning": "<short>", "unsupported_claims": ["..."]}"""

def judge_answer(question, answer, retrieved):
    context = "\n\n".join(f"[{r['id']}] {r['content']}" for r in retrieved)
    user = f"Question: {question}\n\nAnswer:\n{answer}\n\nContext:\n{context}"
    resp = aoai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": JUDGE_SYSTEM},
                  {"role": "user", "content": user}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

def run_eval(questions_path: Path, run_id: str):
    rows = []
    hits, mrr_sum, halluc = 0, 0.0, 0
    with open(questions_path) as f:
        questions = json.load(f)

    for q in questions:
        ans, retrieved = answer(q["question"])
        retrieved_ids = [r["id"] for r in retrieved]
        # hit: gold chunk anywhere in retrieved
        gold = q["gold_chunk_id"]
        hit_rank = next((i for i, rid in enumerate(retrieved_ids, 1) if rid == gold), None)
        if hit_rank:
            hits += 1
            mrr_sum += 1.0 / hit_rank
        # judge grounding
        verdict = judge_answer(q["question"], ans, retrieved)
        if verdict["score"] <= 1:
            halluc += 1
        rows.append({
            "qid": q["id"],
            "hit_rank": hit_rank or 0,
            "judge_score": verdict["score"],
            "answer": ans,
            "reasoning": verdict["reasoning"],
        })

    n = len(questions)
    summary = {
        "run_id": run_id,
        "hit_rate": hits / n,
        "mrr": mrr_sum / n,
        "hallucination": halluc / n,
        "n": n,
    }
    Path(f"runs/{run_id}.json").write_text(json.dumps({"summary": summary, "rows": rows}))
    return summary

The summary is posted to a Slack channel every morning at 7am. When a number moves more than 1.5 percentage points in either direction, the harness opens a Jira ticket against the platform team automatically and attaches the diff between the current run's rows and the prior run's rows. We caught a regression in week eight this way: a Confluence reindex job had silently failed for 14 hours, the corpus was stale, and the hit rate dropped from 84% to 79% overnight on a class of questions about a recently renamed service. The harness flagged it before any user did.
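The gate itself is a comparison of consecutive run summaries. The 1.5-point threshold is the one above; the Slack post and the Jira call are plumbing, so they stay as comments:

ALERT_THRESHOLD = 0.015  # 1.5 percentage points

def compare_runs(current: dict, prior: dict) -> list[str]:
    alerts = []
    for metric in ("hit_rate", "mrr", "hallucination"):
        delta = current[metric] - prior[metric]
        if abs(delta) > ALERT_THRESHOLD:
            alerts.append(
                f"{metric}: {prior[metric]:.3f} -> {current[metric]:.3f} ({delta:+.3f})"
            )
    return alerts

# nightly job: post the summary to Slack; if compare_runs() is non-empty,
# open the Jira ticket and attach the row-level diff of the two runs/*.json files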

The 240 questions are versioned in git alongside the rest of the pipeline. The set is augmented quarterly: every time a user reports a bad answer, we add the question and the right answer to the corpus, and we add the question to the eval. By week twelve the eval set was 268 questions. By month five it was 340. The eval grows with the surface area of the system.

Troubleshooting

(InvalidRequestParameter) The vector query parameter 'kind' is required for vector queries. The Azure AI Search SDK changed its vector query shape between preview API versions. The fix is to use VectorizedQuery from the current SDK (which sets kind: "vector" on serialization) rather than hand-rolling the dict against an older preview. If you copied a snippet off Stack Overflow that builds the query as a literal dict, it will hit this error on the current 2024-07-01 API version. The fastest path through it is the typed model.

azure.core.exceptions.HttpResponseError: (RateLimitExceeded) Rate limit is exceeded. Try again in 8 seconds. on the embeddings call during the bulk reindex. The Azure OpenAI deployment was on the default 120K tokens/minute quota and the bulk reindex was trying to embed ~5,000 chunks per minute, which is roughly 1.6M tokens per minute. The fix is either to raise the quota (we did, to 600K tokens/minute on a dedicated deployment) or to throttle the indexer with a token-rate semaphore. We did both. The retry-with-exponential-backoff in embed_chunks above is the floor; the quota raise is what makes a full reindex finish in three hours rather than thirty.
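A minimal version of that token-rate semaphore, as a single-threaded token bucket; 600K is the raised quota from above:

import time

class TokenRateLimiter:
    """Block until a request's tokens fit under a tokens-per-minute budget."""

    def __init__(self, tokens_per_minute: int):
        self.capacity = tokens_per_minute
        self.available = float(tokens_per_minute)
        self.last = time.monotonic()

    def acquire(self, tokens: int) -> None:
        while True:
            now = time.monotonic()
            # refill at capacity/60 tokens per second, capped at capacity
            self.available = min(
                self.capacity,
                self.available + (now - self.last) * self.capacity / 60,
            )
            self.last = now
            if self.available >= tokens:
                self.available -= tokens
                return
            time.sleep((tokens - self.available) * 60 / self.capacity)

limiter = TokenRateLimiter(600_000)
# in embed_chunks, before each batch:
#     limiter.acquire(sum(c["token_count"] for c in batch))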

The chunk-overlap bug. For about four days in week seven we had a class of queries where two retrieved chunks were textually identical, taking up a context window slot for nothing. Tracking it down: the header-aware splitter was emitting a section twice because the H2 element in some Confluence pages was wrapped inside both a <div class="contentLayout"> and a separate <ac:structured-macro> placeholder, and the BeautifulSoup descendants iterator was visiting the heading once through each parent. The fix was a seen_node_ids set in html_to_sectioned_markdown and a decompose() on the macro placeholders before iteration. Lesson, again, that Confluence HTML is not really HTML; it is an XML dialect with a generation of structured macros that look like content but are not.

The corrupted code block. The bulk reindex job kept failing on a single page out of 11,400. The exception was an lxml parser error on a malformed <ac:plain-text-body> inside a code macro in a 2019 retro about a Kafka outage. Someone had pasted a stack trace that included < and > characters and Confluence had eaten them in a way that produced unbalanced tags six years later when we tried to parse them. The fix was per-page error isolation: wrap the chunking call in a try/except that logs the page id, writes the raw HTML to a quarantine directory, and continues. The retro is still in the quarantine; nobody has needed to read it in six years.
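The isolation wrapper is a few lines; the quarantine path is a placeholder:

import logging
from pathlib import Path

QUARANTINE = Path("quarantine")
QUARANTINE.mkdir(exist_ok=True)

def chunk_page_safe(page) -> list[dict]:
    try:
        return chunk_page(page)
    except Exception:
        # log, park the raw HTML for later inspection, keep the reindex moving
        logging.exception("chunking failed for page %s; quarantined", page.id)
        (QUARANTINE / f"{page.id}.html").write_text(page.html)
        return []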

Request body content is not a valid JSON document from the indexer push. The issue was a chunk whose content contained an unescaped byte (a NUL) that had leaked in from a Confluence export of a binary attachment description. Azure AI Search rejects JSON containing literal NULs. The fix in the indexer is a content = content.replace("\x00", "") immediately before serialization.

The semantic configuration 'conf-semantic' is not defined on the index. This happened after we updated the index schema and forgot to send the semantic block on the PUT. The index update API is full-replacement on the index definition, not a merge. If the body you send does not include the semantic block, it gets removed. The fix is to always send the full index definition through your management tooling and to lint it for the required sub-blocks.

Cost, latency, and where we ended up

The cost breakdown at steady state: embedding the full corpus once cost $46, incremental embeddings (about 200 page updates per week, roughly 3,000 chunks per week) run to under a dollar. Index storage is 1.2GB on the S1 tier of Azure AI Search, which is $250 a month, and we keep three indexes (prod, staging, dev) so call it $750. The LLM call dominates per-query cost: about $0.012 per question with the current prompt and average 6 retrieved chunks. At 50 questions a day average, that is $18 a month in LLM costs and the all-in monthly bill is under $800.

Latency, p50 / p95 / p99: 1.2s / 1.8s / 2.6s. The breakdown of p95: 60ms for the embedding of the query, 180ms for the hybrid retrieval call, 620ms for the semantic ranker, 940ms for the LLM completion. We measured what dropping the semantic ranker would cost in quality and what it would save in latency, and the eval said hallucination would rise from 2.3% to 3.7% and hit rate would fall from 84% to 79%. The 620ms is worth keeping.

Where we ended up. Hallucination at 2.3%, hit rate at 84%, MRR at 0.71. The system answers around 40 to 60 questions a day, spread across roughly 80 distinct users across operations, platform, payments, and the SRE on-call. The wrong-runbook class of failure that started the project no longer happens; the most recent occurrence in our incident log is six months old, and it was a question about a runbook that had not yet been written, where the model correctly said "I don't know based on what I found" and the engineer wrote the missing runbook that afternoon.

Reflective coda

The thing I underweighted at the start was how much of the work would be on the corpus, not on the model. The vector store and the LLM are the parts people talk about because they are new, and they were the parts I built first. The parts that actually moved the numbers were boring: the breadcrumb prefix that made identical-text chunks distinguishable in embedding space, the recency-boost scoring profile that let 2025 documents outrank their 2021 namesakes, the per-page error isolation that let one corrupted retro stop blocking the rebuild of 11,400 others. The chunking strategy carried more weight than the embedding model. The grounding prompt carried more weight than the reranker. None of that is what the demos focus on, but it is where the percentage points came from.

The eval harness was the multiplier on everything. Without it I would have shipped iteration one, declared victory because the hand-picked demo queries worked, and the team would have stopped using the copilot within a month when the long tail of real questions came back wrong. With it, we knew within an hour of each change whether the change was good or bad on a representative sample, and we could revert experiments cheaply when they regressed something we cared about. The harness paid for itself the first time it caught a regression nobody noticed, and it has paid for itself again every time since. If you build one of these systems and you do not build the eval, you are flying blind. The model will help you fool yourself; the harness is what stops you.

The hardest organizational lesson came at week six, when the eval said the system was at 6% hallucination, which is genuinely usable, and a vocal subset of the team wanted to ship to general availability. We did not. The 2.3% target was set against the threshold where the security team would sign off on the system answering questions in customer-facing escalation flows, and 6% would not have cleared that bar. Holding the line on the eval target rather than the calendar target was the right call, and it added four weeks. The audit conversation when we did ship took fifteen minutes instead of the day-long review we would have gotten at 6%. Four weeks of engineering bought us roughly seven weeks of audit time we did not spend. The compounding return on doing the eval work is the kind of thing that does not show up on a Gantt chart but shows up everywhere else.

The engineer who filed the original "fix the AI or turn it off" ticket asks the system about three things a week now. Last month he asked it how to rotate the PagerDuty integration key for the payments service. It returned the current runbook, with citations, with the breadcrumb showing it was from the 2025 operations space, and the answer began "According to the runbook at /spaces/ops-2025/pagerduty-rotation, you generate a new integration key in the PagerDuty admin console under Services > Integrations [^a4b1c::s0012]." He followed it. The rotation took eleven minutes. Nobody got paged at 3am.