
Invoice extraction at 14,200 documents: Document Intelligence, gpt-4o for the missed fields, and an audit trail finance trusts

Azure AI Document Intelligence's prebuilt-invoice model extracted 91% of the fields the finance team needed across a 14,200-document evaluation set. The missing 9% included cost-centre code, per-line currency, and PO line items, the fields that between them tagged invoices worth £4.2M a year. Here is the hybrid pipeline that closed the gap, the structured-output schema that made every field auditable, and the 14,200-document regression gate that runs on every PR.


The first time I looked at the field-level accuracy report, I had to read it twice. Azure AI Document Intelligence's prebuilt-invoice model was extracting 91% of the fields the finance team needed across our 14,200-document evaluation set. That sounds like a victory until you look at which 9% were missing. The cost-centre code, the per-line currency on mixed-currency invoices, the PO line items on the multi-page jobs. Those three fields between them tagged invoices worth roughly £4.2M a year. The 91% was the easy 91%; the 9% was where the money lived.

This is the build log for the hybrid pipeline that closed the gap. Document Intelligence's prebuilt-invoice does the bulk extraction, the per-field confidence scores act as a routing gate, and an Azure OpenAI gpt-4o call with structured outputs fills in only the fields DI missed or got with low confidence. The whole thing is wrapped in a 14,200-document eval set that runs on every PR, an audit table that gives finance per-field provenance back to the source PDF, and a Pydantic schema that means nothing comes out of the pipeline as a stringified float pretending to be a decimal. Field-level accuracy on the held-out set, last run before writing this, was 98.7%.

Why the 9% was the part that mattered

The original ask from the finance director, almost a year ago now, was "we are spending too much human time keying invoices into the ERP, can the AI team do something." We pulled a representative sample of 14,200 invoices from the last twelve months of accounts payable. Eleven jurisdictions (UK, Ireland, Germany, France, Netherlands, Spain, Italy, Poland, Japan, Singapore, US), four currencies (GBP, EUR, USD, JPY), three layout families (single-page A4 with sender header, multi-page with appended PO, A4 invoice with stapled remittance advice). Three hundred and ninety of them were marked "manual review" in the existing process because nobody had ever figured out a rule for them.

I pointed Document Intelligence prebuilt-invoice at the whole set as a baseline. It produced clean extractions for vendor, invoice number, invoice date, due date, subtotal, total, tax, and most line items. The aggregate field-level accuracy across all extracted fields was 91.3%. The fields where it underperformed:

  • Cost-centre code: an internal code finance stamps on the invoice after receipt. Sometimes hand-written, sometimes a barcode sticker, sometimes typed into a remittance field. DI did not know what this field meant. Accuracy: 0%, because DI did not extract it at all.
  • Per-line currency: when an invoice's header was GBP but line items were tagged USD with a manual annotation, DI took the header. Accuracy on mixed-currency invoices: 12%.
  • PO line item match: when an invoice had an attached PO on page 3, DI extracted the invoice's line items fine, but did not associate them with the PO's line item identifiers. Accuracy: 4%, because most associations existed only in the visual layout (an arrow drawn between the invoice line and the PO line).

These were the fields that triggered manual review under the old process. They were also the fields that, when wrong, ended up causing duplicate payments, mis-categorised costs in the finance system, and the quarterly audit findings that the finance director was tired of explaining.

The hybrid pipeline, end to end

The shape is: DI extracts everything it can, every field comes back with a per-field confidence score, fields above a threshold pass straight through, fields below the threshold get sent to a second-stage LLM call with the rest of the DI output as context. The LLM only fills in the gaps; it does not re-extract fields DI already nailed. Everything that comes out of the pipeline carries provenance (which fields came from DI, which came from the LLM, what the confidence was, and the document hash) so finance can audit any number back to a specific page in a specific PDF.

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.identity import DefaultAzureCredential

di_client = DocumentIntelligenceClient(
    endpoint="https://di-invoices-uksouth.cognitiveservices.azure.com/",
    credential=DefaultAzureCredential(),
)

def extract_with_di(pdf_bytes: bytes) -> dict:
    poller = di_client.begin_analyze_document(
        model_id="prebuilt-invoice",
        body=pdf_bytes,
        content_type="application/pdf",
    )
    result = poller.result()
    if not result.documents:
        return {"fields": {}, "raw": result.as_dict()}
    doc = result.documents[0]
    fields = {}
    for name, field in doc.fields.items():
        fields[name] = {
            # DocumentField is a tagged union: take whichever typed value is populated.
            "value": field.value_string or field.value_currency or field.value_date or field.value_number,
            "confidence": field.confidence,
            # Page polygon the value came from; kept for the audit trail.
            "bounding_regions": [r.as_dict() for r in (field.bounding_regions or [])],
        }
    return {"fields": fields, "raw": result.as_dict()}

Three things about this call. The prebuilt-invoice model ID is the one documented on the Document Intelligence prebuilt-invoice page; we tried prebuilt-document and prebuilt-layout early on and got worse extractions, because they do not know what an invoice is. The per-field confidence comes back in field.confidence as a float between 0 and 1; DI does not advertise this prominently, but it is the most important thing the API returns for our use case. The bounding_regions is the polygon on the page where the field came from; we store it for the audit trail, so clicking through to the original PDF and seeing exactly where a number came from is a one-second operation.

The confidence gate

DI returns a confidence per field. We picked 0.85 as the routing threshold after looking at the calibration curve: above 0.85, DI was right 99.4% of the time on our eval set; below 0.85, it dropped to 73%. The 0.85 line is per-field, not per-document, which matters because an invoice can have an excellent vendor extraction and a terrible cost-centre extraction at the same time, and the gate handles each field independently.

HIGH_CONFIDENCE = 0.85
ALWAYS_LLM = {"cost_centre", "po_line_items", "per_line_currency"}

def needs_llm(field_name: str, extracted: dict) -> bool:
    if field_name in ALWAYS_LLM:
        return True
    if field_name not in extracted["fields"]:
        return True
    return extracted["fields"][field_name]["confidence"] < HIGH_CONFIDENCE

The ALWAYS_LLM set is the three fields DI does not extract at all. Everything else is routed by confidence. Across the eval set, this routes about 11% of fields to the LLM stage; the other 89% are taken directly from DI with no further processing.

Stage 2: the focused LLM call

The LLM call is deliberately small. The prompt contains the DI output (high-confidence fields included as confirmed context, low-confidence fields explicitly marked as "DI returned this with low confidence, please verify or correct"), the page text DI extracted, and the schema the model is expected to fill. The schema is enforced via Azure OpenAI structured outputs, which means the model cannot return malformed JSON; the response is guaranteed to match the schema or the call errors.

from decimal import Decimal
from enum import Enum
from pydantic import BaseModel, Field
from typing import Optional

class Currency(str, Enum):
    GBP = "GBP"
    EUR = "EUR"
    USD = "USD"
    JPY = "JPY"

class CostCentreCode(BaseModel):
    code: str = Field(pattern=r"^[A-Z]{2}-\d{4}-[A-Z]{1,3}$")
    source_page: int
    source_polygon: Optional[list[float]] = None

class LineItem(BaseModel):
    description: str
    quantity: Decimal
    unit_price: Decimal
    line_total: Decimal
    line_currency: Currency
    po_line_id: Optional[str] = None

class InvoiceCorrections(BaseModel):
    cost_centre: Optional[CostCentreCode] = None
    line_items: list[LineItem]
    notes: Optional[str] = Field(
        None,
        description="If any DI field looks wrong, explain. Otherwise leave null.",
    )

A few things about this schema. The Decimal types are doing real work; if you let the LLM return floats here, you get 0.30000000000000004 for 0.3 periodically and finance reconciliation breaks. The pattern on the cost-centre code is the regex finance gave us for the internal coding format; the model is forbidden from returning anything that does not match. The Currency enum prevents the model from inventing currencies (we saw "GB-P" and "Eur" in early prototype runs). The source_page and source_polygon on CostCentreCode are the audit fields; we ask the model to point at where it found the value.

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

aoai = AzureOpenAI(
    azure_endpoint="https://aoai-invoices-uksouth.openai.azure.com/",
    api_version="2024-10-21",
    # The token provider caches and refreshes AAD tokens rather than
    # constructing a fresh credential on every request.
    azure_ad_token_provider=get_bearer_token_provider(
        DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
    ),
)

def correct_with_llm(di_output: dict, page_text: str) -> InvoiceCorrections:
    completion = aoai.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {
                "role": "system",
                "content": (
                    "You correct invoice extractions. The DI output is mostly right. "
                    "Fill the cost_centre, line_items, and any field DI flagged low confidence. "
                    "Do not re-extract fields DI already returned with confidence >= 0.85. "
                    "If you cannot find a field, return null. Do not hallucinate."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"DI fields:\n{di_output['fields']}\n\n"
                    f"Page text:\n{page_text}\n\n"
                    f"Return only the corrections."
                ),
            },
        ],
        response_format=InvoiceCorrections,
        temperature=0,
    )
    return completion.choices[0].message.parsed

The temperature=0 is non-negotiable. The response_format=InvoiceCorrections is what binds the LLM to the schema; the API will not return a payload that fails Pydantic validation. The parse method (the .beta.chat.completions.parse form) returns the parsed Pydantic instance directly, and with the Azure client the model argument is the deployment name, which we named after the model version. If the model genuinely cannot find a field, the schema permits None, so refusing to hallucinate is the expected path.
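
Wiring the two stages together is deliberately mundane. A sketch of the per-invoice flow, assuming a hypothetical EXPECTED_FIELDS list naming everything the ERP contract requires; the production orchestrator also handles retries and batching, which are not shown:

# Hypothetical list of every field the ERP contract requires.
EXPECTED_FIELDS = ["vendor", "invoice_number", "total_amount", "tax_amount",
                   "currency", "line_items", "cost_centre", "po_line_items"]

def process_invoice(pdf_bytes: bytes, page_text: str):
    di_output = extract_with_di(pdf_bytes)
    # Route only the gaps: fields DI missed or returned below the 0.85 gate.
    gaps = [f for f in EXPECTED_FIELDS if needs_llm(f, di_output)]
    corrections = correct_with_llm(di_output, page_text) if gaps else None
    return di_output, corrections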

The token usage on this call is small because we are not sending the whole PDF, just the DI structured output plus the relevant page text. On the eval set, the average prompt is 2,400 tokens and the average completion is 280 tokens. Cost per call against gpt-4o is roughly $0.008 input plus $0.004 output, which works out to about $0.004 per invoice when amortised across the 11% of fields that route to the LLM.

The custom-model detour

Before the hybrid approach, we tried training a custom extraction model. The thinking was: prebuilt-invoice does not know about cost-centre codes; if we hand-label 800 invoices with the cost-centre field, the model will learn that field and we will not need a second stage. Two of us spent a fortnight labelling 800 invoices in Document Intelligence Studio, training a neural model, and running it on the held-out portion of the eval set.

It got worse. Field accuracy on the cost-centre field rose to 71%, which sounds great until you look at where the other 29% went: the model produced confident-looking but wrong codes for invoices whose layout it had not seen during training. The 800 invoices were drawn predominantly from UK and Ireland; on the Japanese and Singapore invoices, the model invented codes that looked like cost centres but were actually just numerical strings near the right region of the page. Worse, training had degraded accuracy on the standard fields (vendor, total, tax) compared to prebuilt-invoice, because the custom model was learning our 800-invoice distribution rather than the general invoice distribution prebuilt-invoice was trained on.

We abandoned the custom model and kept the prebuilt + hybrid pattern. The training data was not diverse enough, and getting a diverse enough set would have meant labelling several thousand invoices spanning every jurisdiction, which was both more expensive than the LLM call and slower to iterate on. The hybrid approach lets us improve cost-centre extraction by editing a prompt, not retraining a model. We kept the labelled data as part of the eval set, which is where it earned its keep.

The audit table

Every extraction writes a row per field into an Azure Data Explorer (Kusto) table called InvoiceExtractions. The schema is field-level, not document-level, which is the change finance asked for after the first prototype. Per-field provenance means an auditor can ask "where did this £24,317.50 come from on this invoice" and get back: source PDF hash, page number, polygon coordinates, which extractor produced the value (DI or LLM), what the confidence was, and what version of the pipeline ran.

InvoiceExtractions
| where ingestion_time() > ago(7d)
| where field_name in ("total_amount", "cost_centre", "po_line_items")
| extend match_with_ground_truth = iff(extracted_value == ground_truth_value, 1, 0)
| summarize 
    field_count = count(),
    accuracy = avg(match_with_ground_truth),
    avg_confidence = avg(confidence),
    di_share = avg(iff(extractor == "di", 1.0, 0.0)),
    llm_share = avg(iff(extractor == "llm", 1.0, 0.0))
    by field_name
| order by accuracy asc

The query lets the finance team watch per-field accuracy over time. If a field's accuracy starts to drift (say a vendor changed their invoice layout and DI's confidence on total_amount for that vendor starts dropping below 0.85 more often), the dashboard catches it before anyone in finance notices the increased manual review queue.
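
For reference, the table behind that query looks roughly like this; the columns mirror the AuditRow fields shown below, plus the ground-truth column the eval harness populates. A sketch, not the production DDL:

.create table InvoiceExtractions (
    document_hash: string,
    field_name: string,
    extracted_value: string,
    ground_truth_value: string,   // populated only for eval-set documents
    extractor: string,            // "di" or "llm"
    confidence: real,
    source_page: int,
    source_polygon: dynamic,
    pipeline_version: string,
    extracted_at: datetime
)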

The audit row is written in the same transaction as the field is returned to the ERP. There is no separate "and now write the audit" step that could be missed; the extraction function returns a (value, audit_row) tuple and the caller cannot get the value without also receiving the row to log.

from dataclasses import dataclass
from datetime import datetime, timezone
import hashlib

@dataclass
class AuditRow:
    document_hash: str
    field_name: str
    extracted_value: str
    extractor: str  # "di" or "llm"
    confidence: float
    source_page: int
    source_polygon: list[float] | None
    pipeline_version: str
    extracted_at: datetime

def extract_field(name: str, di_output: dict, llm_output: InvoiceCorrections, pdf_bytes: bytes) -> tuple[object, AuditRow]:
    doc_hash = hashlib.sha256(pdf_bytes).hexdigest()
    if needs_llm(name, di_output):
        value = getattr(llm_output, name, None)
        return value, AuditRow(
            document_hash=doc_hash,
            field_name=name,
            extracted_value=str(value),
            extractor="llm",
            # The LLM path has no calibrated score; record found vs not-found.
            confidence=1.0 if value is not None else 0.0,
            # getattr with a default already covers values lacking these attributes.
            source_page=getattr(value, "source_page", 0),
            source_polygon=getattr(value, "source_polygon", None),
            pipeline_version="2026.04.r3",
            extracted_at=datetime.now(timezone.utc),
        )
    f = di_output["fields"][name]
    return f["value"], AuditRow(
        document_hash=doc_hash,
        field_name=name,
        extracted_value=str(f["value"]),
        extractor="di",
        confidence=f["confidence"],
        source_page=f["bounding_regions"][0]["page_number"] if f["bounding_regions"] else 0,
        source_polygon=f["bounding_regions"][0]["polygon"] if f["bounding_regions"] else None,
        pipeline_version="2026.04.r3",
        extracted_at=datetime.now(timezone.utc),
    )
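
The calling pattern, sketched with the hypothetical EXPECTED_FIELDS list from earlier and an assumed write_audit_rows helper for the Kusto ingestion (not shown here):

def extract_all(di_output: dict, llm_output: InvoiceCorrections, pdf_bytes: bytes) -> dict:
    values, audit_rows = {}, []
    for name in EXPECTED_FIELDS:
        value, row = extract_field(name, di_output, llm_output, pdf_bytes)
        values[name] = value
        audit_rows.append(row)
    # One batch write per invoice; no value is returned without its row.
    write_audit_rows(audit_rows)  # hypothetical ingestion helper
    return values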

The eval gate

The 14,200-document eval set is the regression gate on every pull request. Azure Pipelines pulls the set from blob storage, runs each document through the pipeline at the PR's commit, compares each extracted field against the ground truth, and reports per-field accuracy. The whole run takes about 90 minutes (parallelised across 12 agents) and costs roughly $200 in DI + AOAI charges.

trigger:
  branches:
    include: [main]
pr:
  branches:
    include: [main]

stages:
  - stage: Eval
    displayName: '14,200-doc eval gate'
    jobs:
      - job: RunEval
        strategy:
          parallel: 12
        timeoutInMinutes: 120
        steps:
          - checkout: self
          - task: AzureCLI@2
            displayName: 'Download eval shard'
            inputs:
              azureSubscription: sc-eval-prod
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                SHARD_ID=$(System.JobPositionInPhase)
                az storage blob download-batch \
                  --source eval-set \
                  --pattern "shard-${SHARD_ID}/*" \
                  --destination ./eval \
                  --account-name stinvoiceseval
          - task: AzureCLI@2
            displayName: 'Run pipeline on shard'
            inputs:
              azureSubscription: sc-eval-prod
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                python -m pipeline.eval_runner \
                  --input ./eval \
                  --ground-truth ./eval-truth.jsonl \
                  --output ./results-$(System.JobPositionInPhase).json \
                  --thresholds ./thresholds.yml
          - publish: ./results-$(System.JobPositionInPhase).json
            artifact: eval-results-$(System.JobPositionInPhase)

  - stage: Gate
    dependsOn: Eval
    jobs:
      - job: AggregateAndGate
        steps:
          - download: current  # no artifact name: downloads all eval-results-* artifacts
          - task: PythonScript@0
            inputs:
              scriptSource: filePath
              scriptPath: ./pipeline/aggregate_eval.py
              arguments: '--results-dir $(Pipeline.Workspace) --thresholds ./thresholds.yml'

The thresholds file is the contract. If a field drops below its threshold, the build fails and the PR cannot merge.

# thresholds.yml
vendor: 0.99
invoice_number: 0.99
total_amount: 0.995
tax_amount: 0.99
line_items: 0.96
cost_centre: 0.94
currency: 0.99
per_line_currency: 0.95
po_line_items: 0.90
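
The gate script itself is small. A sketch of what aggregate_eval.py's core logic looks like, assuming shard-result JSON of per-field correct/total counts; the real script also prints a per-field report:

import glob
import json
import sys

import yaml

def gate(results_dir: str, thresholds_path: str) -> None:
    with open(thresholds_path) as fh:
        thresholds = yaml.safe_load(fh)
    # Merge raw counts across shards before dividing; averaging per-shard
    # accuracies would weight small shards incorrectly.
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for path in glob.glob(f"{results_dir}/**/results-*.json", recursive=True):
        with open(path) as fh:
            for field, counts in json.load(fh).items():
                correct[field] = correct.get(field, 0) + counts["correct"]
                total[field] = total.get(field, 0) + counts["total"]
    failures = []
    for field, minimum in thresholds.items():
        accuracy = correct.get(field, 0) / max(total.get(field, 0), 1)
        if accuracy < minimum:
            failures.append(f"{field}: {accuracy:.4f} < {minimum}")
    if failures:
        sys.exit("Eval gate failed:\n" + "\n".join(failures))  # non-zero exit fails the PR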

The thresholds were set by negotiating with the finance team. The number nobody wanted to drop below for total_amount was 99.5%, because anything worse than that means more than one in 200 invoices going to the ERP has the wrong number on it; at our 142,000-invoice annual volume that is roughly 700 incorrect entries a year, which is the level at which the AP team has to start adding back the manual review step. We have been above 99.5% on total_amount for 27 consecutive eval runs.

A specific gotcha: the Japanese-yen invoice that was actually US dollars

About six months in, an invoice came through the pipeline labelled in JPY because the vendor's header was in Japanese and DI inferred the currency from the header. The line items, however, had been manually annotated USD by the supplier (a Tokyo-based subsidiary of a US company that bills in USD). The pipeline picked up the JPY header, decided the line items were also JPY, and would have had the ERP convert ¥24,317.50 into about £138 instead of treating it as the $24,317.50 it actually was, worth about £19,200.

The fix was a sanity-check rule, added after that incident. It compares the sum of line-item totals (in the line-item currency, converted to the header currency at the invoice date's FX rate) against the invoice total. If they disagree by more than 5%, the invoice is flagged for human review.

from decimal import Decimal

def sanity_check_currency(invoice: dict, fx_rates: dict[str, Decimal]) -> list[str]:
    issues = []
    header_currency = invoice["currency"]
    # Sum the line items in the header currency, converting where they differ.
    line_sum_in_header_ccy = Decimal("0")
    for item in invoice["line_items"]:
        if item.line_currency == header_currency:
            line_sum_in_header_ccy += item.line_total
        else:
            rate = fx_rates[f"{item.line_currency.value}_{header_currency.value}"]
            line_sum_in_header_ccy += item.line_total * rate
    total = Decimal(invoice["total_amount"])
    if total == 0:
        return ["Total is zero"]
    # More than 5% drift between line-item sum and header total flags the
    # invoice for human review.
    drift = abs(line_sum_in_header_ccy - total) / total
    if drift > Decimal("0.05"):
        issues.append(
            f"Line-item sum ({line_sum_in_header_ccy}) disagrees with header total ({total}) by {drift:.1%}"
        )
    return issues
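
A quick usage sketch; the invoice dict and FX rate here are illustrative values, not production data:

# Hypothetical invoice: GBP header, one USD-denominated line item.
invoice = {
    "currency": Currency.GBP,
    "total_amount": "19200.00",
    "line_items": [
        LineItem(
            description="Consulting services",
            quantity=Decimal("1"),
            unit_price=Decimal("24317.50"),
            line_total=Decimal("24317.50"),
            line_currency=Currency.USD,
        )
    ],
}
# FX keys follow the "{from}_{to}" convention the check expects.
issues = sanity_check_currency(invoice, {"USD_GBP": Decimal("0.79")})
# 24,317.50 USD at 0.79 = 19,210.83 GBP, within 5% of the header: no issues.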

The rule has fired 47 times in the last six months. Forty-three were genuine currency mismatches caught early; four were false positives caused by rounding on multi-line invoices with five or more line items. False-positive cost: a human spends 90 seconds on each one. False-negative cost, which is what we are protecting against, is a five-figure FX conversion error in the ERP, so we keep the 5% threshold even though it is a little loose.

Troubleshooting

InvalidContent: The image is not a supported format from DI almost always means the upstream system handed us a PDF with a corrupted header. The fix is pdftk input.pdf output normalised.pdf (rewrites the PDF structure) before sending. Roughly one in 2,000 invoices needed this on our set; we now run the normalisation unconditionally.
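
The normalisation wrapper is a few lines; sketched here with subprocess and simplified temp-file handling:

import subprocess
import tempfile
from pathlib import Path

def normalise_pdf(pdf_bytes: bytes) -> bytes:
    # pdftk rewrites the PDF structure, so DI never sees a corrupted header.
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = Path(tmp) / "in.pdf", Path(tmp) / "out.pdf"
        src.write_bytes(pdf_bytes)
        subprocess.run(["pdftk", str(src), "output", str(dst)], check=True)
        return dst.read_bytes()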

InvalidRequest: PDF file size exceeds the maximum allowed size of 500 MB from DI is rare in invoices (they are usually under 5 MB) but happened when an accounts payable clerk had scanned at 1200 dpi and bundled a year of invoices into one PDF. The split step before the pipeline handles this; per-document files are part of the contract.

TokenLimitExceeded from AOAI was the early-prototype failure mode when we tried to send the whole PDF text to the LLM. A 12-page invoice with appended PO and remittance is roughly 18,000 tokens of page text; with gpt-4o's 128k context this is no longer a hard ceiling, but the cost is bad and the model gets distracted. The hybrid approach sends only the DI structured output plus the page text for the page where the missing field is expected, which keeps the prompt under 3,000 tokens.
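
The page-selection step is roughly this; FIELD_PAGE_HINTS is a hypothetical map of where each gap field usually lives, and the sketch assumes contiguous 1-based page numbering on the AnalyzeResult:

# Hypothetical hints: cost-centre stamps land on page 1, POs on the last page.
FIELD_PAGE_HINTS = {"cost_centre": [1], "per_line_currency": [1], "po_line_items": [-1]}

def page_text_for_fields(result, gap_fields: list[str]) -> str:
    pages = result.pages  # AnalyzeResult.pages from the DI call
    wanted = set()
    for field in gap_fields:
        for hint in FIELD_PAGE_HINTS.get(field, [1]):
            wanted.add(len(pages) if hint == -1 else hint)  # -1 means last page
    return "\n\n".join(
        " ".join(line.content for line in (page.lines or []))
        for page in pages
        if page.page_number in wanted
    )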

ConfidenceTooLow: cost_centre field confidence 0.34 is not a real error string from the API; it is the message our wrapper raises when even the LLM returns a low-confidence value. When DI cannot find the field and the LLM also cannot find it with confidence, the invoice routes to human review rather than getting a guessed value. About 1.3% of invoices hit this path. The alternative (returning the LLM's best guess) was tried for a week and produced enough wrong cost-centres in the ERP that we changed the policy.
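
The guard itself is trivial; a sketch, with the confidence floor and review routing as assumptions (the 0.34-style score comes from a verification step not shown in these excerpts):

class ConfidenceTooLow(Exception):
    """Raised by our wrapper, not by any Azure API."""

def finalize_field(name: str, value, confidence: float, floor: float = 0.5):
    # Route to human review rather than returning a guessed value.
    if value is None or confidence < floor:
        raise ConfidenceTooLow(f"{name} field confidence {confidence:.2f}")
    return value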

AuthenticationError: Bearer token validation failed from AOAI happens when the managed identity has Cognitive Services User on the AOAI resource but not the Cognitive Services OpenAI User role specifically. The two roles sound similar; only the second one grants the data-plane access the SDK needs. Documented on the Azure OpenAI managed identity page; easy to miss when you have done a hundred other Cognitive Services configurations.
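
The fix is a single role assignment; standard az CLI, with the IDs as placeholders:

# Grant the managed identity data-plane access on the AOAI resource.
az role assignment create \
  --assignee "<managed-identity-principal-id>" \
  --role "Cognitive Services OpenAI User" \
  --scope "<azure-openai-resource-id>"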

Cost

The per-invoice cost decomposes cleanly. DI prebuilt-invoice pricing is $0.01 per page; the average invoice in our set is 1.4 pages, so $0.014 per invoice. The LLM call is $0.004 per invoice, weighted by the 11% routing rate to gpt-4o. The audit-table write is essentially free at our volume (around $0.0001 per invoice in Kusto ingestion). Total cost per invoice: $0.0182. Annual volume: 142,000 invoices. Annual cost: about $2,580.

The 14,200-doc eval run costs about $200 and runs in 90 minutes parallelised across 12 agents. We run it on every PR (about 18 PRs a month against the pipeline repo), so monthly eval cost is roughly $3,600. Annualised: $43,200, which is more than the production pipeline cost. We have not optimised the eval cost because the eval is what makes us confident in the production cost; trading $43k a year for the ability to ship pipeline changes without breaking finance feels like a fine trade.

The old process cost was about £180k a year in AP clerk time on manual review. Net saving, year one, after all the eval and pipeline costs: about £140k.

Where we ended up

Field-level accuracy on the held-out eval, last run before this article: 98.7% aggregate, with every field above its individual threshold. The 14,200-document eval is the regression gate on every PR; we cannot merge a change that drops a field below its threshold. Every extraction is auditable; finance can pull any number from the ERP, find its document_hash in the InvoiceExtractions table, and see which extractor produced it, at what confidence, from which page polygon of the source PDF.

The hybrid pattern (DI for the 89% it does well, LLM with structured outputs for the 11% DI misses or doubts) is the part I would generalise to other document workflows. The expensive part of these problems is rarely the bulk of fields; it is the ones at the edge of the distribution. A model trained on the general distribution will always be weakest on those edges; a model with a schema and access to the structured output of the first model can fix the edges without re-doing the bulk. The structured-output layer is what makes the second model's output trustable, because the schema is the contract.

I underestimated, going in, how much the audit table would matter. Finance does not care that the pipeline is 98.7% accurate; they care that they can answer, in a discoverable way, "where did this number come from" for any individual entry. The audit table is the artefact that turned the AI pipeline from "magic that produces numbers" into "a chain of evidence that produces numbers." Twelve months in, finance signs off the AP control quarterly with three minutes of work; the auditor opens the dashboard, clicks two rows at random, follows the provenance back to the source PDF, and the conversation is over. That is what the 14,200 documents and the per-field provenance bought us, and it is the part I would do first if I were starting again.

The custom-model detour was the most useful failed attempt of the project. Spending two weeks training a worse model taught me that the diversity of training data was the constraint, not the model architecture. Once that was clear, the hybrid pattern was obvious; we have not been tempted back to a custom model since, and the prompt-as-config model means every accuracy improvement is a pull request, not a retraining cycle. For a domain (invoices) where the distribution shifts every time a vendor changes their template, that iteration speed is the property that compounds. The 9% gap was always going to close; the question was whether closing it would create more work each year than it saved, and the answer, eleven months in, is that it did not.