
Fine-tuning gpt-4o-mini on Azure: 4,180 examples, an Azure Pipeline, and the rollback we baked in from day one

The first run cost $312, took seven hours, and made the model worse than the base we were trying to improve. Three iterations later, after a calibration round with senior agents, a synthetic augmentation pass for the rare cases, and a fairly humbling lesson about system-prompt discipline, the fine-tuned gpt-4o-mini beat base by 18 points on the domain composite, dropped per-call cost to 0.15x, and saved $4,200 a month. This is the full build log: the 4,180 examples, the six-stage Azure Pipeline that ships the model, the gpt-4o judge rubric, and the auto-rollback that has already fired twice in production.


The first fine-tuning run cost $312, took seven hours of wall-clock time, and produced a model that was measurably worse than the base gpt-4o-mini we were trying to improve on. The composite eval score dropped four points. Tone got more wooden. Refusals became more frequent and more hostile. On the Monday after the run completed, our lead support engineer looked at the scorecard, sighed, and asked whether we should just ship the base model with a longer system prompt and stop pretending. The answer turned out to be no, but it took three more iterations, a calibration round with our senior agents, a synthetic augmentation pass, and a fairly humbling lesson about system-prompt discipline before the fine-tuned model beat the base by 18 points on the domain composite.

This is the build log for that work: the 4,180 labelled examples, the Azure Pipeline that ships the model end to end, the eval rubric we judge against, and the rollback we baked in from day one because nothing about this process gave us enough confidence to skip it.

What we were trying to do

We run a domain-specific support copilot for a B2B product that has eleven distinct product areas (billing, identity, integrations, on-prem connector, reporting, three different API surfaces, and a handful of admin tools). The base gpt-4o-mini was good enough on generic questions but kept making three classes of mistake. It would refuse harmless internal questions like "how do I reset my own SSO test tenant" because the wording sounded administrative. It would over-explain billing disputes when the right answer was a four-line response pointing the user at the dispute form. And it occasionally hallucinated configuration steps for the on-prem connector that did not exist.

The first instinct was prompt engineering. We pushed the system prompt to about 2,100 tokens, embedded six example exchanges, and put in a "refusal style guide" block. Quality improved but cost went up linearly with prompt size, and any time we tried to add a new product area we hit the cliff where adding more instructions made the model regress on older areas.

Fine-tuning was the next move. Azure OpenAI supports fine-tuning gpt-4o-mini in the chat-completion format, and the math made sense: if we could bake the tone, the refusal style, and the product knowledge into the weights, the inference-time system prompt could shrink to a handful of tokens and the per-call cost would drop. The model availability matrix on Azure OpenAI said UK South had gpt-4o-mini fine-tuning, which mattered for reasons I will come back to.

The first run, and why it was worse than nothing

We scraped 4,800 examples from the support team's archive of resolved tickets. Each example became a three-message exchange: a stock system prompt, the user's question, and the agent's resolution. We exported it as JSONL, ran one validation script, and uploaded it. The first run cost $312 and took roughly seven hours.

The scorecard came back saying our shiny new model was worse than the base on three of the four rubric dimensions. We poured a coffee and started reading the eval failures. Three patterns jumped out.

First, the support team had not been consistent about how they phrased refusals. About 20% of the archive said "I'm not able to help with that, please contact your account manager." Another 20% said "this isn't something I can resolve from here, but X can." A surprising number said "no." The model dutifully learned all three styles and picked one essentially at random per request. Tone scores collapsed.

Second, the support archive was wildly imbalanced. Billing disputes were only 3% of the tickets, but they were almost all the high-difficulty cases on our eval. The model had essentially no signal on what a good billing-dispute response looked like and was generalising from the more common areas, badly.

Third, we had been sloppy with the system prompt. Inference would use the long 2,100-token prompt; training used a placeholder 80-token one. The model learned to do well with a short prompt and then, at eval time, got handed a long prompt and behaved oddly. This was the dumbest of the three mistakes and the easiest to fix.

Iteration 1: the calibration round and the style spec

Before retraining, we got our three most senior agents in a room for half a day and made them write a "style spec." Thirty examples, one per common refusal scenario, with the exact wording we wanted the model to use. We argued about commas. The lead support engineer pushed back hard on the word "unfortunately" and won; the spec bans it.

Then we ran all 4,800 archive examples through a rewrite pass. Where the resolution matched the style spec already, we kept it. Where it did not, a junior agent rewrote it to match. The rewrite pass took two weeks of part-time work across three people. We dropped 620 examples that could not be salvaged (tickets where the resolution was an internal escalation with no customer-facing response). That left us with 4,180 examples in the training set.

The training data format is plain JSONL, one example per line, each line a JSON object with a messages array exactly as the chat-completion API expects:

{"messages": [{"role": "system", "content": "You are the support copilot for Contoso Workspace. Be direct, do not use the word 'unfortunately', and never invent configuration steps. If a question is outside scope, decline using the canonical refusal phrase."}, {"role": "user", "content": "How do I rotate the API key for the on-prem connector without restarting the service?"}, {"role": "assistant", "content": "You cannot rotate the connector API key without a restart. The connector reads the key once at startup. Recommended approach: stage the new key in the secondary slot, schedule a restart in your maintenance window, and the connector will pick up the new key on next boot."}]}
{"messages": [{"role": "system", "content": "You are the support copilot for Contoso Workspace. Be direct, do not use the word 'unfortunately', and never invent configuration steps. If a question is outside scope, decline using the canonical refusal phrase."}, {"role": "user", "content": "Customer is disputing the September invoice. They say the seat count was 240, not 260."}, {"role": "assistant", "content": "Send them the seat audit export for September. The export shows daily seat usage and identifies which days exceeded 240. If the audit confirms 240, file a billing adjustment ticket with the export attached and the adjustment will process in 5-7 business days. Do not promise an adjustment before the audit is reviewed."}]}

Every example carries the same system message. That is the bit we got wrong on round one. The data preparation guide on Microsoft Learn calls this out explicitly: the system prompt at training time and the system prompt at inference time should match, because the model learns conditional on whatever system prompt it was trained under.

Iteration 2: synthetic augmentation for the rare classes

Even with the style spec applied, billing disputes were 3% of the corpus. The eval kept showing the model produced fluent-sounding billing answers that were wrong on procedure. We thought about throwing more real tickets at the problem but the archive was tapped out; we had already used everything resolvable.

The fix was synthetic augmentation. We took the 125 real billing-dispute examples, fed each one to gpt-4o with a prompt that said "produce three variations of this exchange that change the customer's wording and the specific dollar amounts but keep the resolution procedure identical," and got 375 candidate examples back. Then every candidate was reviewed by a senior agent who could accept, edit, or reject it. We kept 280. The same approach for the on-prem connector class produced another 190 examples. Total dataset grew from 4,180 to 4,650 for the second run.
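
For concreteness, here is roughly what the augmentation call looked like. The prompt text, the "gpt-4o" deployment name, and the JSON output contract below are illustrative rather than lifted from our tooling; the part that is not negotiable is that every candidate goes to a senior agent for accept, edit, or reject before it touches the training set.

import json
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://aoai-prod-uksouth.openai.azure.com",
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",
)

AUGMENT_PROMPT = (
    "Produce three variations of this support exchange that change the customer's wording "
    "and the specific dollar amounts but keep the resolution procedure identical. "
    "Return a JSON array of three objects, each with 'user' and 'assistant' fields.\n\n{example}"
)

def augment(example: dict) -> list[dict]:
    """Generate candidate variations of one real ticket; a human reviews every candidate."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # the judge/augmentation deployment, not the fine-tune target
        messages=[{"role": "user", "content": AUGMENT_PROMPT.format(example=json.dumps(example))}],
    )
    variants = json.loads(resp.choices[0].message.content)
    system = example["messages"][0]  # reuse the training system prompt byte for byte
    return [
        {"messages": [system,
                      {"role": "user", "content": v["user"]},
                      {"role": "assistant", "content": v["assistant"]}]}
        for v in variants
    ]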

The second run cost $440, more than the first because the dataset was larger and we trained for one extra epoch (n_epochs: 3), after an earlier eval suggested the model was undertrained. The score improved sharply on billing disputes and held on the common classes. Composite was now 11 points over base, making it the first run to beat base on every dimension.

Iteration 3: pin the system prompt

We thought we were done at this point. The eval said the model was good. Then we deployed it behind a 5% traffic slice and the live eval-on-sample (50 sampled responses a day, judged by gpt-4o) dropped seven points relative to what the offline eval had predicted.

The bug was that production was still sending the 2,100-token system prompt. The training data had the 80-token version. The inference layer was prepending the long one out of caution. The model behaved like a confused intern getting two different briefs.

We rewrote the inference layer to send exactly the system prompt that was in training, byte for byte, and the live eval recovered. For the third training run we also made the system prompt slightly more explicit, added a single sentence about the refusal phrase, and rebuilt the entire dataset with the new system message in messages[0] of every example. That run cost $480. Composite came in at 18 points over base.
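
The guard we added afterwards is small enough to show. This is a sketch with illustrative file paths: the inference service reads the canonical system prompt from the same place the dataset builder does, and a startup check compares it against messages[0] of the training file so any drift fails loudly instead of silently degrading quality.

import json
from pathlib import Path

# Illustrative paths; the real service keeps the canonical prompt next to the dataset builder.
CANONICAL_PROMPT_PATH = Path("prompts/copilot_system.txt")
TRAINING_FILE = Path("data/training/copilot-v3.jsonl")

def load_pinned_system_prompt() -> str:
    """Fail at startup if the inference prompt has drifted from the one the model was trained under."""
    canonical = CANONICAL_PROMPT_PATH.read_text()
    first_example = json.loads(TRAINING_FILE.read_text().splitlines()[0])
    trained = first_example["messages"][0]["content"]
    if canonical != trained:
        raise RuntimeError("inference system prompt differs from the training system prompt")
    return canonical  # this exact string goes out on every call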

The pipeline

The whole flow runs as an Azure Pipeline that triggers on any change under data/training/ on main. Six stages: validate, upload, train, eval, deploy, traffic-shift.

trigger:
  branches:
    include: [main]
  paths:
    include:
      - data/training/**

pool:
  vmImage: ubuntu-latest

variables:
  - group: aoai-finetune
  - name: serviceConnection
    value: 'sc-aoai-uksouth-prod'
  - name: resourceGroup
    value: 'rg-aoai-prod-uksouth'
  - name: aoaiResource
    value: 'aoai-prod-uksouth'
  - name: baseModel
    value: 'gpt-4o-mini-2024-07-18'
  - name: trainingFile
    value: 'data/training/copilot-v3.jsonl'

stages:
  - stage: Validate
    displayName: 'Validate training data'
    jobs:
      - job: validate
        steps:
          - checkout: self
          - task: UsePythonVersion@0
            inputs:
              versionSpec: '3.12'
          - script: |
              pip install -r tools/requirements.txt
              python tools/validate_jsonl.py "$(trainingFile)"
              python tools/token_count.py "$(trainingFile)" --limit 16000
              python tools/pii_scan.py "$(trainingFile)" --fail-on-detect
            displayName: 'Schema, token, and PII checks'

  - stage: Upload
    displayName: 'Upload to Azure OpenAI'
    dependsOn: Validate
    jobs:
      - job: upload
        steps:
          - task: AzureCLI@2
            inputs:
              azureSubscription: $(serviceConnection)
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                set -euo pipefail
                ACCESS_TOKEN=$(az account get-access-token \
                  --resource https://cognitiveservices.azure.com \
                  --query accessToken -o tsv)
                ENDPOINT="https://$(aoaiResource).openai.azure.com"

                RESP=$(curl -sS -X POST \
                  "$ENDPOINT/openai/files?api-version=2024-10-21" \
                  -H "Authorization: Bearer $ACCESS_TOKEN" \
                  -F "purpose=fine-tune" \
                  -F "file=@$(trainingFile)")

                FILE_ID=$(echo "$RESP" | jq -r .id)
                echo "##vso[task.setvariable variable=fileId;isOutput=true]$FILE_ID"
            name: uploadStep

  - stage: Train
    displayName: 'Kick off fine-tune job'
    dependsOn: Upload
    variables:
      fileId: $[ stageDependencies.Upload.upload.outputs['uploadStep.fileId'] ]
    jobs:
      - job: train
        timeoutInMinutes: 720
        steps:
          - task: AzureCLI@2
            inputs:
              azureSubscription: $(serviceConnection)
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                set -euo pipefail
                ACCESS_TOKEN=$(az account get-access-token \
                  --resource https://cognitiveservices.azure.com \
                  --query accessToken -o tsv)
                ENDPOINT="https://$(aoaiResource).openai.azure.com"

                JOB=$(curl -sS -X POST \
                  "$ENDPOINT/openai/fine_tuning/jobs?api-version=2024-10-21" \
                  -H "Authorization: Bearer $ACCESS_TOKEN" \
                  -H "Content-Type: application/json" \
                  -d "{
                    \"model\": \"$(baseModel)\",
                    \"training_file\": \"$(fileId)\",
                    \"hyperparameters\": {\"n_epochs\": 3},
                    \"suffix\": \"copilot-v3\"
                  }")

                JOB_ID=$(echo "$JOB" | jq -r .id)
                echo "Job: $JOB_ID"

                while true; do
                  STATUS=$(curl -sS \
                    "$ENDPOINT/openai/fine_tuning/jobs/$JOB_ID?api-version=2024-10-21" \
                    -H "Authorization: Bearer $ACCESS_TOKEN" \
                    | jq -r .status)
                  echo "$(date -u +%H:%M:%S) status=$STATUS"
                  case "$STATUS" in
                    succeeded) break ;;
                    failed|cancelled) echo "fine-tune $STATUS" >&2; exit 1 ;;
                  esac
                  sleep 120
                done

                MODEL_ID=$(curl -sS \
                  "$ENDPOINT/openai/fine_tuning/jobs/$JOB_ID?api-version=2024-10-21" \
                  -H "Authorization: Bearer $ACCESS_TOKEN" \
                  | jq -r .fine_tuned_model)

                echo "##vso[task.setvariable variable=modelId;isOutput=true]$MODEL_ID"
            name: trainStep

  - stage: Eval
    displayName: 'Offline eval gate'
    dependsOn: Train
    variables:
      modelId: $[ stageDependencies.Train.train.outputs['trainStep.modelId'] ]
    jobs:
      - job: eval
        steps:
          - script: |
              pip install -r tools/requirements.txt
              python tools/run_eval.py \
                --candidate "$(modelId)" \
                --baseline "$(baseModel)" \
                --questions data/eval/holdout-220.jsonl \
                --judge gpt-4o \
                --out scorecard.json
              python tools/eval_gate.py scorecard.json \
                --min-composite-delta 6
            displayName: 'Run judge-model eval, fail under +6 composite'

  - stage: Deploy
    displayName: 'Side-by-side deployment'
    dependsOn:
      - Eval
      - Train
    variables:
      modelId: $[ stageDependencies.Train.train.outputs['trainStep.modelId'] ]
    jobs:
      - job: deploy
        steps:
          - task: AzureCLI@2
            inputs:
              azureSubscription: $(serviceConnection)
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                az cognitiveservices account deployment create \
                  --resource-group "$(resourceGroup)" \
                  --name "$(aoaiResource)" \
                  --deployment-name "copilot-v3" \
                  --model-name "$(modelId)" \
                  --model-version "1" \
                  --model-format "OpenAI" \
                  --sku-name "Standard" \
                  --sku-capacity 50

                az ml model create \
                  --resource-group "$(resourceGroup)" \
                  --workspace-name "mlw-aoai-prod" \
                  --name "copilot-finetune" \
                  --version "$(Build.BuildId)" \
                  --path "azureml://datastores/workspaceblobstore/paths/finetune/$(modelId)" \
                  --description "Fine-tuned gpt-4o-mini, copilot-v3"

  - stage: TrafficShift
    displayName: 'Gradual rollout via Front Door'
    dependsOn: Deploy
    jobs:
      - deployment: rollout
        environment: 'aoai-prod'
        strategy:
          runOnce:
            deploy:
              steps:
                - template: traffic-shift.yml
                  parameters:
                    deployment: 'copilot-v3'
                    previous: 'copilot-v2'
                    steps: [5, 25, 100]
                    sampleSize: 50
                    gateMinScore: 0.92

The deployment step uses the Azure OpenAI deployments API to create a new deployment under the existing resource. We keep the previous deployment alive on the same resource, named copilot-v2, so traffic-shift is a Front Door routing change and rollback is the same change in reverse.

The eval

The eval set is 220 held-out questions, generated alongside the training corpus but never seen during training. Each is judged by gpt-4o on a four-dimension rubric: groundedness (does the answer match the documented procedure), tone (does it match the style spec), refusal-appropriateness (does it decline things it should and not decline things it should not), and conciseness (is it the right length for the question). Each dimension is scored 0 to 1; the composite is the mean.

import json
from statistics import mean

# call_chat(model, messages) is the project's thin wrapper around the chat-completions
# API; it returns the assistant message content as a string.

JUDGE_PROMPT = """
You are evaluating a support agent's response. Score each dimension from 0.0 to 1.0.

Question: {question}
Reference answer (from the playbook): {reference}
Candidate answer: {candidate}

Dimensions:
- groundedness: Does the candidate match the documented procedure in the reference?
  1.0 = procedurally identical, 0.5 = same outcome different steps, 0.0 = wrong procedure.
- tone: Does the candidate use the canonical style? No "unfortunately".
  Direct, four to twelve sentences typical, no filler.
- refusal_appropriateness: If the reference is a refusal, did the candidate refuse?
  If the reference is a real answer, did the candidate answer? 1.0 = matched, 0.0 = mismatched.
- conciseness: Is the candidate the right length for the question?
  1.0 = right length, 0.5 = 50% off, 0.0 = more than 3x off.

Output JSON only:
{{"groundedness": 0.0, "tone": 0.0, "refusal_appropriateness": 0.0, "conciseness": 0.0, "rationale": "..."}}
"""

def evaluate(candidate_model: str, holdout: list[dict], judge: str) -> dict:
    rows = []
    for ex in holdout:
        candidate_answer = call_chat(candidate_model, ex["messages"][:-1])
        judge_input = JUDGE_PROMPT.format(
            question=ex["messages"][1]["content"],
            reference=ex["messages"][2]["content"],
            candidate=candidate_answer,
        )
        scores = json.loads(call_chat(judge, [{"role": "user", "content": judge_input}]))
        scores["composite"] = sum(scores[k] for k in (
            "groundedness", "tone", "refusal_appropriateness", "conciseness"
        )) / 4
        rows.append({"id": ex["id"], **scores})
    return {
        "per_question": rows,
        "mean": {k: mean(r[k] for r in rows) for k in
                 ("groundedness", "tone", "refusal_appropriateness", "conciseness", "composite")},
    }

def scorecard(candidate_results: dict, baseline_results: dict) -> str:
    out = ["dim                       candidate   baseline   delta"]
    for dim in ("groundedness", "tone", "refusal_appropriateness", "conciseness", "composite"):
        c = candidate_results["mean"][dim]
        b = baseline_results["mean"][dim]
        out.append(f"{dim:25} {c:>9.3f}   {b:>8.3f}  {c-b:+.3f}")
    return "\n".join(out)

The gate in eval_gate.py is one line of logic: the candidate's composite must beat the baseline by at least 0.06 (six points on the 0-100 scale we present internally). Below that, the pipeline fails before deployment.
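
A sketch of that gate, assuming run_eval.py writes both result dicts into scorecard.json as {"candidate": ..., "baseline": ...} with the mean structure shown above; the threshold comes in as points on the 0-100 scale.

import argparse
import json
import sys

parser = argparse.ArgumentParser()
parser.add_argument("scorecard")
parser.add_argument("--min-composite-delta", type=float, required=True)  # in points, 0-100
args = parser.parse_args()

with open(args.scorecard) as fh:
    card = json.load(fh)

# Composites are stored on a 0-1 scale; scale the delta up to points before gating.
delta = (card["candidate"]["mean"]["composite"] - card["baseline"]["mean"]["composite"]) * 100
print(f"composite delta: {delta:+.1f} points (gate: +{args.min_composite_delta:.0f})")
sys.exit(0 if delta >= args.min_composite_delta else 1)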

The rollback we baked in

Every deployment lives next to the previous one. Traffic shift goes 5%, 25%, 100% with a gate between each step. The gate is a live eval-on-sample: 50 random production responses a day are mirrored to the gpt-4o judge with the same rubric, and the rolling 24-hour score is compared against the previous deployment's rolling score. If the new one drops below 95% of the previous one's score, the Front Door rule reverts to 0% on the new deployment and 100% on the old one, no human in the loop.
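
The shape of that gate check, sketched with the score query and the Front Door weight change left as injected callables because both sit behind internal helpers (the names here are illustrative):

from typing import Callable

def traffic_gate(
    new_deployment: str,
    previous: str,
    rolling_score: Callable[[str], float],    # rolling 24-hour judge score for a deployment
    set_weight: Callable[[str, int], None],   # Front Door routing weight change, in percent
    floor: float = 0.95,
) -> bool:
    """Return True to keep shifting traffic; revert to the previous deployment otherwise."""
    if rolling_score(new_deployment) < floor * rolling_score(previous):
        set_weight(new_deployment, 0)    # auto-revert, no human in the loop
        set_weight(previous, 100)
        return False
    return True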

The reverse-shift is exactly the same Azure Pipeline template (traffic-shift.yml) called with steps: [0] and the parameters swapped. We have triggered the auto-rollback twice in production. The first time was a genuine quality regression caused by the system-prompt mismatch I described. The second time was a false positive caused by a bad day of sampled questions; the score recovered the next morning but the rollback had already happened. Both times the rollback finished in under three minutes from gate-trip to traffic-back-on-previous, which is what we wanted.

The previous deployment is kept around for 30 days after a successful rollout, then garbage-collected. Storage for an idle deployment is negligible. The math on keeping it is simple: 30 days of an idle deployment costs less than one bad day with nothing to fall back on.

Troubleshooting

Three errors we hit, all of which cost time the first time they appeared.

InvalidFineTuneRequest: The training file contains lines that do not conform to the chat completion format. This is the error you get if any line is missing the messages array, has a role other than system/user/assistant, or has an assistant message before a user message. Our validate_jsonl.py is a 40-line script that opens each line as JSON, asserts the shape, and prints the line number on failure. We did not have it on the first run. Now we do, and the validate stage takes 12 seconds.
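
A condensed sketch of what that script checks, not the production file itself:

import json
import sys

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate(path: str) -> int:
    errors = 0
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            try:
                messages = json.loads(line)["messages"]
                assert isinstance(messages, list) and messages, "empty messages array"
                seen_user = False
                for m in messages:
                    assert m["role"] in ALLOWED_ROLES, f"bad role {m['role']!r}"
                    assert isinstance(m["content"], str), "content must be a string"
                    seen_user = seen_user or m["role"] == "user"
                    if m["role"] == "assistant":
                        assert seen_user, "assistant message before any user message"
            except (AssertionError, KeyError, json.JSONDecodeError) as exc:
                print(f"line {lineno}: {exc}", file=sys.stderr)
                errors += 1
    return errors

if __name__ == "__main__":
    sys.exit(1 if validate(sys.argv[1]) else 0)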

BadRequest: Token count per message exceeds maximum. gpt-4o-mini fine-tuning has a per-message token cap (16,384 at time of writing) and a per-example cap on the total token count across all messages. A handful of our archive tickets had ten-paragraph customer descriptions that blew through it. The token-count check in the validate stage uses tiktoken with the o200k_base encoder and rejects the file if any example exceeds the limit. We had to truncate or split 31 examples in the original corpus.
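
The token check is equally small. A sketch, assuming the limit is passed in from the pipeline because the exact caps the service enforces have changed over time:

import argparse
import json
import sys
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # the encoder used by gpt-4o-mini

parser = argparse.ArgumentParser()
parser.add_argument("path")
parser.add_argument("--limit", type=int, required=True)  # per-example limit, set in the pipeline
args = parser.parse_args()

over = 0
with open(args.path, encoding="utf-8") as fh:
    for lineno, line in enumerate(fh, start=1):
        messages = json.loads(line)["messages"]
        total = sum(len(enc.encode(m["content"])) for m in messages)
        if total > args.limit:
            print(f"line {lineno}: {total} tokens exceeds limit {args.limit}", file=sys.stderr)
            over += 1
sys.exit(1 if over else 0)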

RateLimitExceeded: You have exceeded the rate limit for fine-tuning jobs. Azure OpenAI enforces a concurrent fine-tune job limit per subscription. We hit this once when a developer kicked off a manual job from a branch while the pipeline was already running. The fix was to add a pipeline-level lock: a "lease" stage at the start that writes a lock file to a blob and refuses to proceed if the file already exists. The lease releases at the end of the pipeline or after a 12-hour timeout, whichever comes first.
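
A sketch of the lease, assuming azure-storage-blob and a "pipeline-locks" container; upload_blob with overwrite=False fails if the blob already exists, which is the "someone else holds the lease" signal, and the 12-hour expiry is checked from the blob's own creation time rather than a storage-level lease.

from datetime import datetime, timedelta, timezone
from azure.core.exceptions import ResourceExistsError
from azure.storage.blob import BlobClient

LOCK_URL = "https://stfinetune.blob.core.windows.net/pipeline-locks/finetune.lock"  # illustrative

def acquire_lease(credential) -> bool:
    blob = BlobClient.from_blob_url(LOCK_URL, credential=credential)
    try:
        blob.upload_blob(datetime.now(timezone.utc).isoformat(), overwrite=False)
        return True
    except ResourceExistsError:
        age = datetime.now(timezone.utc) - blob.get_blob_properties().creation_time
        if age > timedelta(hours=12):  # stale lease from a dead run; steal it
            blob.upload_blob(datetime.now(timezone.utc).isoformat(), overwrite=True)
            return True
        return False

def release_lease(credential) -> None:
    BlobClient.from_blob_url(LOCK_URL, credential=credential).delete_blob()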

Two other things worth knowing. The fine-tuning job will sit in pending state for a while before it starts running. Don't panic. The poll script in the pipeline accepts any of pending, running, validating_files as "still going" and only exits on succeeded, failed, or cancelled. And the fine_tuned_model field on the job response is only populated after succeeded; before that it is null, so don't try to read it during the poll.

Cost, and the regional gotcha

Across iterations, the training runs cost $312, $440, and $480. Total: $1,232 over three weeks. The third run is the one currently in production.

Inference cost is the surprise. Because the fine-tuned model has internalised the tone and the playbook, the inference-time system prompt shrank from 2,100 tokens to 96. Output tokens shrank too, because the model learned to be more concise. Combined, the per-call cost is roughly 0.15x what we paid for the same workload on the base model with the long prompt. At our volume (around 90,000 calls/month for the copilot, plus offline judging for the eval loop), that translates to about $4,200/month saved. The training runs paid back inside the first three weeks.

The regional gotcha. We wanted the copilot deployed in two regions, UK South and West Europe, to serve UK and EU traffic with low latency. Fine-tuned Azure OpenAI models can only be deployed in the region the training file was uploaded to. There is no cross-region replication for fine-tuned weights; you would have to train twice, once per region, and reconcile the two. We decided the latency hit of EU traffic hairpinning to UK South (around 25ms extra round-trip) was acceptable for the copilot, and we kept the base model deployed in West Europe as a fallback for the case where the UK South deployment becomes unavailable. The fallback has triggered exactly twice, both during planned Azure maintenance, and the support team noticed the tone shift each time. The lesson is that you cannot really have a hot-standby for a fine-tuned model unless you train two and accept that they will drift.

Worth knowing too that the model retirement and lifecycle policy applies to base models, and fine-tuned models built on a base model are bound to that base model's lifecycle. When gpt-4o-mini-2024-07-18 is retired, the fine-tuned model goes with it. We keep a calendar reminder six months ahead of the announced retirement date that says: "retrain on the new base."

Where we ended up

Composite eval: base model 0.71, fine-tuned 0.89. Eighteen-point lead, holding for four months in production.

Per-call cost: 0.15x of the previous production cost. About $4,200/month saved at current volume.

Support team satisfaction: we ran an internal survey two months after rollout. The team rated the copilot 4.2/5 on "this answers like one of us would," up from 2.8/5 before fine-tuning. The lead support engineer, the one who had asked whether we should give up after the first run, asked instead whether we could fine-tune a second model for the chat-with-customer use case.

Looking back, the technical work was the easy part. The hard part was admitting that the support team's archive was not the dataset we thought it was. The archive was a collection of resolutions, written under deadline, by people who disagreed in good faith about tone. Treating it as ground truth on the first run is what cost us $312 and a wooden model. The calibration round with senior agents (the half-day where they wrote thirty examples by consensus) was the highest-leverage thing we did in the entire project. Without it, the eval scores would have plateaued and we would probably have concluded fine-tuning does not work for this use case.

Two related observations. First, the eval mattered more than the training. We could iterate on the dataset, on hyperparameters, on the system prompt, but only because we had a stable judge model with a stable rubric telling us whether we were getting better or worse. If the eval is noisy, every other piece of work is noisy too, and you cannot tell signal from drift. We spent more wall-clock time on the eval rubric than on the data prep, and that ratio felt wrong in the moment but turned out to be exactly right.

Second, the rollback was psychologically what made us comfortable shipping. A traffic-shift with hard gates and an auto-revert removes the question "what if this is bad in ways we did not catch offline?" from the deployment conversation. The team could ship a new fine-tune on a Wednesday afternoon and trust that if it was actually bad, the system would notice within the day and roll it back without paging anyone. That trust is the thing that lets you iterate quickly. Without it, every deployment is a meeting, and you stop trying things, and the model stops getting better.

The fourth iteration is on the bench right now. The dataset is bigger (5,600 examples after another augmentation pass on the rare classes), the system prompt is slightly different (we want to test removing the refusal-style line, which the model seems to have fully internalised), and we are going to try n_epochs: 2 to see whether we can ship a marginally cheaper run with similar quality. If the eval says it is six points better than the current production model, we ship. If it is not, we throw the run away and move on. That is, in the end, the only honest way to do this work.