
An internal voice assistant on GPT-4o-realtime: sub-800ms turn-taking and the barge-in that took twice as long to build

We chased latency for two weeks and hit 780ms turn-taking on a Thursday demo at 14:30. The workshop lead leaned in and said 'let me interrupt it mid-sentence,' and the whole thing fell apart. Barge-in was the entire UX. Here is the build: GPT-4o-realtime over WebSocket on Azure OpenAI, Azure AI Speech as fallback STT, server-side VAD tuned past the breathing-triggers-turn-end failure, and the cancel-flush-restart loop that took twice as long to ship as the happy path.


It was 14:30 on a Thursday and the demo room had four people in it. We had been chasing latency for two weeks. End-to-end turn-taking had just dipped under 800ms on the dashboard for the first time, 780ms median across the last fifty turns, and the workshop lead was nodding the way people nod when they think a thing might actually ship. I asked the assistant to read me the torque spec for an M8 bolt. It started reading. Then the workshop lead leaned in and said, very calmly, "let me interrupt it mid-sentence."

He spoke over the assistant's first six words. The assistant kept reading. He spoke louder. The assistant kept reading. When he stopped, the assistant kept reading for another four seconds. Then it finished the torque spec, paused, and tried to start answering his interrupting question, which by that point had been overwritten in his head by the next four things he wanted to do. He looked at me. I looked at the laptop. The nodding had stopped.

That was the moment we realised we had built the wrong feature. We had built a low-latency voice assistant. What the workshop actually needed was a voice assistant that could be interrupted, because hands-busy operators in the workshop area do not wait for an assistant to finish a sentence. They cut it off, redirect it, and keep moving. Barge-in was the entire UX. Turn-taking latency was the table stakes. The next four weeks were spent on the interrupt path, which is the part nobody writes about and which took roughly twice as long to build as the happy path.

This is the whole build. Two engineers, six weeks, one internal voice assistant for the workshop floor sitting on GPT-4o-realtime on Azure OpenAI, with Azure AI Speech as a fallback transcription path. Median turn-taking at 720ms in the workshop, barge-in feels natural now, the team uses it 30 to 40 times a day. Most of the code below is the boring part. The interesting part is the discipline of treating every millisecond as a budget line and treating interruption as a first-class state.

The use case, and why the workshop is unusual

The audience is engineers in the workshop area. They have torque wrenches in their hands, gloves on, and a tablet on a magnetic mount. They want three things from a voice assistant: read me the spec for X, walk me through the runbook for Y, and open a ticket for Z when Z is what just broke. These are not chatty conversations. They are short, often interrupted, often spoken over machine noise, and the cost of a slow assistant is not "user mildly annoyed," it is "engineer puts the wrench down and types the question into the tablet instead," at which point the voice assistant has lost the user permanently for that day.

The implication for the design: every interaction is two to four turns. Long monologues from the assistant are a failure mode, not a feature. The user must be able to cut the assistant off and redirect. The assistant must not require a wake word per turn because the workshop is not silent and the false-trigger rate on wake words went above what we could tolerate. We solved that with push-to-talk on the tablet for session start, then open-mic with server-side VAD for the rest of the session.

Architecture

Four moving parts.

  1. Client. A small browser app on the tablet. Captures microphone audio with getUserMedia, runs it through an AudioWorklet that produces 16-bit PCM at 24kHz mono, and ships chunks over a WebSocket to the middleware. Plays back PCM16 audio it receives from the middleware through an AudioContext. Also handles push-to-talk gesture, mute, and the "this turn is over" UI hint.

  2. Middleware. A Node service that brokers one realtime session per active client. The middleware mints an Entra access token, opens the WebSocket to Azure OpenAI's realtime endpoint, forwards client audio in, forwards model audio out, and intercepts the control events (response.created, response.audio.delta, input_audio_buffer.speech_started, and so on). The middleware also owns the tool implementations: lookup_spec, fetch_runbook, create_ticket. It runs the tool calls against our internal services and feeds the results back into the realtime session.

  3. Azure OpenAI realtime endpoint. A GPT-4o-realtime deployment in East US 2. WebSocket protocol, documented as the GPT-4o Realtime API for speech and audio. Input format pcm16. Output format pcm16 at 24kHz. Voice: alloy. Turn detection mode: server_vad.

  4. Azure AI Speech as fallback STT. When the realtime session has a hiccup (a dropped WebSocket, a model deployment hot-swap, a 503), the middleware falls back to streaming the same PCM16 into Azure AI Speech real-time speech-to-text and then sending the transcribed text to the regular chat completions endpoint. This is degraded mode. Latency goes from 720ms to about 1.6s. Barge-in stops working in degraded mode. Users notice. The fallback exists so that during a hiccup the assistant still functions, even if poorly, instead of going silent.

The reason to put the middleware in the middle, instead of letting the browser talk to Azure OpenAI directly, is twofold. First, the WebSocket connection to Azure OpenAI authenticates with an Entra access token, and we do not want long-lived tokens in the browser. The middleware acts as the token broker. Second, tool calls (create_ticket, etc.) must run inside our network with managed identity against internal APIs, which the browser cannot do.

The latency budget

Sub-800ms end-to-end was the target. End-to-end means "from the audio chunk where the user finishes speaking, to the audio chunk where the assistant starts speaking." This is the number that determines whether a conversation feels alive. Below 800ms, it feels like talking to a person who is paying attention. Above 1.2s, it feels like talking to a kiosk.

We decomposed the budget as follows.

Segment                          Budget   Where it goes
VAD silence detection            280ms    silence_duration_ms on server VAD
Network in (client to model)     80ms     Tablet to Azure region, mostly TLS and TCP RTT
Model first audio token          200ms    Model time to first output chunk
TTS first audio chunk decoded    150ms    Model emits, middleware forwards, client decodes
Client audio queue start         90ms     AudioContext scheduling plus the first queued buffer
Total                            800ms

The honest version: of these, only the network number and the client queue number are under our direct control. The 280ms VAD number is a tunable; the 200ms model number is whatever the model gives us; the 150ms TTS number is whatever the realtime path gives us. You can shave a few tens of milliseconds in each direction and you can spend a week trying to shave another twenty. We learned the hard way that the budget is not a thing you optimise in isolation. You optimise the user experience and the budget falls out.

The session-config message

The realtime API is a WebSocket. When the connection opens, the client sends a session.update event that configures everything: modalities, voice, input/output audio format, VAD parameters, system prompt, tools. The middleware does this once per session.

// middleware/openai-realtime.js
import WebSocket from "ws";

export async function openRealtimeSession({ accessToken, deployment, region }) {
  const url =
    `wss://${region}.api.cognitive.microsoft.com/openai/realtime` +
    `?api-version=2024-12-17&deployment=${deployment}`;

  const ws = new WebSocket(url, {
    headers: {
      Authorization: `Bearer ${accessToken}`,
    },
  });

  await new Promise((resolve, reject) => {
    ws.once("open", resolve);
    ws.once("error", reject);
  });

  ws.send(
    JSON.stringify({
      type: "session.update",
      session: {
        modalities: ["audio", "text"],
        voice: "alloy",
        input_audio_format: "pcm16",
        output_audio_format: "pcm16",
        input_audio_transcription: {
          model: "whisper-1",
        },
        turn_detection: {
          type: "server_vad",
          threshold: 0.6,
          prefix_padding_ms: 200,
          silence_duration_ms: 320,
        },
        // WORKSHOP_SYSTEM_PROMPT is the workshop system prompt, defined elsewhere in the middleware.
        instructions: WORKSHOP_SYSTEM_PROMPT,
        tools: [
          {
            type: "function",
            name: "lookup_spec",
            description:
              "Look up an engineering spec by part number or component name.",
            parameters: {
              type: "object",
              properties: {
                identifier: { type: "string" },
                kind: { type: "string", enum: ["part", "component", "assembly"] },
              },
              required: ["identifier"],
            },
          },
          {
            type: "function",
            name: "fetch_runbook",
            description:
              "Fetch a runbook by name. Returns the runbook text. The assistant should read it back step by step and pause between steps.",
            parameters: {
              type: "object",
              properties: {
                name: { type: "string" },
              },
              required: ["name"],
            },
          },
          {
            type: "function",
            name: "create_ticket",
            description:
              "Open a ticket in the workshop ticketing system. Confirm the summary out loud before calling this.",
            parameters: {
              type: "object",
              properties: {
                summary: { type: "string" },
                severity: { type: "string", enum: ["P1", "P2", "P3"] },
              },
              required: ["summary", "severity"],
            },
          },
        ],
        tool_choice: "auto",
        temperature: 0.6,
      },
    })
  );

  return ws;
}

Three of these parameters had real consequences and are worth dwelling on.

turn_detection.type: "server_vad". This is the difference between "the server decides when the user has finished a turn" and "the client sends explicit input_audio_buffer.commit events." Server VAD is what makes the experience feel hands-free. Without it the client has to detect silence locally and tell the server, which adds a round trip we cannot afford.
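For contrast, the manual path looks roughly like this: with turn detection disabled, the client detects silence itself, commits the audio it has appended, and explicitly asks for a response. A minimal sketch (the event names are from the realtime API; the local silence detection is the part we did not want to build):

// client/manual-turns.js (sketch of the path we did not take)
// With turn_detection disabled, the client decides when the turn is over,
// commits the audio it has appended, and explicitly asks for a response.
export function endTurn(ws) {
  ws.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
  ws.send(JSON.stringify({ type: "response.create" }));
}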

silence_duration_ms: 320. This is the dominant term in the latency budget. It is the amount of silence the server waits before declaring the user's turn over. We started at 280ms (which is what the docs use as a sane default) and lived with it for nine days before tuning up to 320ms. The reason is below in the VAD tuning section.

prefix_padding_ms: 200. The model receives audio starting 200ms before the VAD-detected speech start, which means it has context for the leading consonant on words that begin with s, f, th. Without padding, "spec for X" came back transcribed as "pec for X" in about one in twenty turns.

The audio loop on the client

The client side is small but exacting. The microphone path uses an AudioWorkletProcessor to downsample to 24kHz PCM16 and post 20ms frames to the main thread, which ships them over the WebSocket. The playback path takes incoming response.audio.delta events, base64-decodes them, and queues them into an AudioContext for gapless playback.

// client/mic-worklet.js
class MicWorklet extends AudioWorkletProcessor {
  constructor() {
    super();
    this.targetRate = 24000;
    this.frameSize = (20 / 1000) * this.targetRate; // 480 samples
    this.buffer = new Int16Array(0);
  }

  process(inputs) {
    const input = inputs[0][0];
    if (!input) return true;

    // Downsample from sampleRate (typically 48000) to 24000 with linear interp.
    const ratio = sampleRate / this.targetRate;
    const outLen = Math.floor(input.length / ratio);
    const out = new Int16Array(outLen);
    for (let i = 0; i < outLen; i++) {
      const idx = i * ratio;
      const i0 = Math.floor(idx);
      const i1 = Math.min(i0 + 1, input.length - 1);
      const frac = idx - i0;
      const sample = input[i0] * (1 - frac) + input[i1] * frac;
      out[i] = Math.max(-32768, Math.min(32767, sample * 32767));
    }

    // Concatenate into a rolling buffer and emit 20ms frames.
    const merged = new Int16Array(this.buffer.length + out.length);
    merged.set(this.buffer, 0);
    merged.set(out, this.buffer.length);
    this.buffer = merged;

    while (this.buffer.length >= this.frameSize) {
      const frame = this.buffer.slice(0, this.frameSize);
      this.buffer = this.buffer.slice(this.frameSize);
      this.port.postMessage(frame, [frame.buffer]);
    }

    return true;
  }
}

registerProcessor("mic-worklet", MicWorklet);

The frames arrive in the main thread, get base64-encoded, and get shipped to the middleware as input_audio_buffer.append events:

// client/realtime-client.js
worklet.port.onmessage = (event) => {
  const frame = event.data;
  // arrayBufferToBase64 base64-encodes the raw PCM16 bytes (a minimal version is sketched below).
  const base64 = arrayBufferToBase64(frame.buffer);
  ws.send(
    JSON.stringify({
      type: "input_audio_buffer.append",
      audio: base64,
    })
  );
};
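arrayBufferToBase64 is not shown above; a minimal sketch, chunked so large frames do not hit the argument limit on String.fromCharCode, looks like this:

// client/base64.js
// Minimal base64 encoder for the PCM16 frames. Chunked so a large buffer does
// not exceed the argument limit on String.fromCharCode.
export function arrayBufferToBase64(buffer) {
  const bytes = new Uint8Array(buffer);
  const chunkSize = 0x8000;
  let binary = "";
  for (let i = 0; i < bytes.length; i += chunkSize) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + chunkSize));
  }
  return btoa(binary);
}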

The middleware forwards these append events straight through to Azure OpenAI's realtime endpoint. We do not buffer audio on the middleware in normal operation. Buffering on the middleware adds latency and creates resynchronisation problems when interrupts happen.

The output side and the playback queue

Audio out comes back as response.audio.delta events, each with a base64-encoded chunk of PCM16. The client decodes them into AudioBuffers and schedules them on a single AudioContext so they play gaplessly:

// client/playback.js
const ctx = new AudioContext({ sampleRate: 24000 });
let playHead = 0;
const queuedSources = [];

function playChunk(base64) {
  const pcm = base64ToInt16(base64);
  const float = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) float[i] = pcm[i] / 32768;

  const buf = ctx.createBuffer(1, float.length, 24000);
  buf.copyToChannel(float, 0);

  const src = ctx.createBufferSource();
  src.buffer = buf;
  src.connect(ctx.destination);

  const now = ctx.currentTime;
  const startAt = Math.max(playHead, now);
  src.start(startAt);
  playHead = startAt + buf.duration;

  queuedSources.push(src);
  src.onended = () => {
    const idx = queuedSources.indexOf(src);
    if (idx !== -1) queuedSources.splice(idx, 1);
  };
}

export function flushPlayback() {
  for (const src of queuedSources) {
    try {
      src.stop();
    } catch {}
  }
  queuedSources.length = 0;
  playHead = ctx.currentTime;
}

The flushPlayback() function is the one that matters for barge-in. When the middleware tells the client to cancel, the client stops every queued buffer and resets the play head. Without flushPlayback, the assistant keeps talking for a second or two after the user has clearly started speaking, because the audio that the model has already streamed is sitting in the client's queue waiting to play.

VAD tuning, or how breathing nearly killed the demo

The first version of the VAD config had threshold: 0.5 and silence_duration_ms: 280. On paper, those are reasonable defaults. In the workshop, with a microphone on a tablet six inches from an engineer's face, two things happened.

First, the engineer's breathing triggered turn-end. A normal exhale at the right distance from a mic registers above 0.5 on the server VAD probability score. The user would ask half a question, take a breath, and the server would declare the turn complete and start generating a response to "what is the torque on" without the noun. Hilarious in isolation, terrible for usability. The false-trigger rate on a fifteen-minute hands-free session was about 8% of turns.

Second, normal "um, let me think" hesitations were getting truncated. Engineers in the workshop pause mid-thought. They are doing physical work and they say "the M8 bolt on the, uh, the upper, the upper bracket." The pauses between words are 300 to 500ms. With silence_duration_ms: 280, the server cut them off after the first "uh."

The tune that solved both: threshold: 0.6, silence_duration_ms: 320. Threshold up by 0.1 makes breath events less likely to trigger; silence duration up by 40ms gives engineers room to pause mid-thought. The cost is 40ms added to the budget. The benefit is the false-trigger rate dropping from 8% to 0.7% over the next week of real workshop sessions.

We logged every VAD event with its probability score for two days to find these numbers. The script that did the analysis is short and worth keeping in your toolkit:

// middleware/vad-analysis.js
// ws here is the open realtime WebSocket returned by openRealtimeSession().
const events = [];

ws.on("message", (raw) => {
  const e = JSON.parse(raw);
  if (
    e.type === "input_audio_buffer.speech_started" ||
    e.type === "input_audio_buffer.speech_stopped"
  ) {
    events.push({
      type: e.type,
      t: Date.now(),
      audio_start_ms: e.audio_start_ms,
      audio_end_ms: e.audio_end_ms,
    });
  }
});

// Every minute, print speech segments shorter than 500ms (likely false triggers).
setInterval(() => {
  const segments = [];
  for (let i = 0; i < events.length - 1; i++) {
    if (
      events[i].type === "input_audio_buffer.speech_started" &&
      events[i + 1].type === "input_audio_buffer.speech_stopped"
    ) {
      segments.push({
        duration_ms: events[i + 1].audio_end_ms - events[i].audio_start_ms,
      });
    }
  }
  const short = segments.filter((s) => s.duration_ms < 500);
  console.log(`segments=${segments.length} short=${short.length}`);
}, 60_000);

Two days of this in the workshop with threshold: 0.5 showed 47 short segments out of 612 total (7.7%). With threshold: 0.6 and silence_duration_ms: 320 it dropped to 4 out of 581 (0.7%).

Barge-in, the part that took twice as long

The state machine looks innocent until you implement it. The assistant is in one of three states: idle, listening, speaking. Barge-in is the transition from speaking back to listening triggered by the user speaking over the assistant.

The realtime API tells us when the user starts speaking via input_audio_buffer.speech_started. If we are currently in speaking (i.e., the model is streaming audio out and the client is playing it), three things have to happen, in order, within roughly 200ms or the user hears tail audio:

  1. Send response.cancel to the model so it stops generating.
  2. Tell the client to flush its playback queue and reset the play head.
  3. Switch state to listening and start accepting new audio for the user's next turn.

If any of those three steps is late, the experience falls apart. If response.cancel is late, the model keeps generating tokens that you will then receive and have to discard, which wastes both bandwidth and quota. If the playback flush is late, the user hears 400 to 700ms of the assistant's voice tailing off after they have clearly taken over, which is the single most off-putting failure mode in voice UI. If the state switch is late, the next chunks of the user's audio get treated as if they belong to the cancelled response.

Here is the middleware piece that drives all three. The middleware sits in the middle of the WebSocket and proxies events both ways while also intercepting the speech-start signal.

// middleware/session.js
import { openRealtimeSession } from "./openai-realtime.js";

export function attachClient({ clientWs, accessToken, deployment, region }) {
  let modelWs = null;
  let state = "idle";
  let currentResponseId = null;

  (async () => {
    modelWs = await openRealtimeSession({ accessToken, deployment, region });

    modelWs.on("message", (raw) => {
      const evt = JSON.parse(raw);

      switch (evt.type) {
        case "response.created":
          currentResponseId = evt.response.id;
          state = "speaking";
          break;

        case "response.audio.delta":
          // Forward audio chunk to the client immediately. No buffering.
          clientWs.send(
            JSON.stringify({
              type: "audio.delta",
              audio: evt.delta,
              response_id: evt.response_id,
            })
          );
          break;

        case "response.done":
          if (state === "speaking") state = "listening";
          currentResponseId = null;
          break;

        case "input_audio_buffer.speech_started":
          if (state === "speaking" && currentResponseId) {
            // Barge-in. Cancel, flush, switch.
            modelWs.send(
              JSON.stringify({
                type: "response.cancel",
                response_id: currentResponseId,
              })
            );
            clientWs.send(JSON.stringify({ type: "audio.flush" }));
            state = "listening";
            currentResponseId = null;
          }
          break;

        case "response.function_call_arguments.done":
          handleToolCall(evt, modelWs);
          break;
      }
    });
  })();

  clientWs.on("message", (raw) => {
    const evt = JSON.parse(raw);
    if (evt.type === "audio.chunk") {
      modelWs?.send(
        JSON.stringify({
          type: "input_audio_buffer.append",
          audio: evt.audio,
        })
      );
    }
  });

  clientWs.on("close", () => {
    modelWs?.close();
  });
}

The asymmetry to notice: cancellation is driven by events pushed from the model side (server VAD says speech started, the middleware reacts), while audio in is pushed from the client side. The middleware has to keep both halves in sync. In practice we found one timing race: if the model's response.audio.delta arrives at the middleware after input_audio_buffer.speech_started but before the response.cancel round trips, the middleware will still forward those audio chunks to the client, which will play them. We fixed it by tracking cancelledResponseIds on the middleware and dropping any response.audio.delta whose response_id is in that set:

const cancelledResponseIds = new Set();

case "input_audio_buffer.speech_started":
  if (state === "speaking" && currentResponseId) {
    cancelledResponseIds.add(currentResponseId);
    modelWs.send(JSON.stringify({
      type: "response.cancel",
      response_id: currentResponseId,
    }));
    clientWs.send(JSON.stringify({ type: "audio.flush" }));
    state = "listening";
    currentResponseId = null;
  }
  break;

case "response.audio.delta":
  if (cancelledResponseIds.has(evt.response_id)) {
    // Drop. This belongs to a cancelled response.
    break;
  }
  clientWs.send(JSON.stringify({
    type: "audio.delta",
    audio: evt.delta,
    response_id: evt.response_id,
  }));
  break;

After this fix, the median time from "user starts speaking over assistant" to "assistant audio stops" measured at the client was 180ms. Below 200ms feels instantaneous; above 300ms feels rude.

Per-segment latency tracing

You cannot tune what you cannot see. We added a tracing layer that timestamps every event and produces a per-turn breakdown of where time went. The shape is small enough to keep in a single file and useful enough that we left it on in production at 5% sampling.

// middleware/trace.js
export function startTurnTrace() {
  const t = {
    audio_in_first: null,
    speech_started: null,
    speech_stopped: null,
    response_created: null,
    audio_out_first: null,
    audio_out_last: null,
    cancelled: false,
  };

  return {
    mark(name) {
      if (t[name] == null) t[name] = performance.now();
    },
    cancel() {
      t.cancelled = true;
      t.cancelled_at = performance.now();
    },
    finish() {
      if (!t.audio_in_first || !t.audio_out_first) return null;
      return {
        vad_silence_ms: t.response_created - t.speech_stopped,
        model_first_chunk_ms: t.audio_out_first - t.response_created,
        end_to_end_ms: t.audio_out_first - t.speech_stopped,
        speak_duration_ms: t.audio_out_last - t.audio_out_first,
        cancelled: t.cancelled,
      };
    },
  };
}

A live trace from one normal turn ("read me the torque spec for an M8 stainless bolt"):

vad_silence_ms: 318
model_first_chunk_ms: 184
end_to_end_ms: 731
speak_duration_ms: 4210
cancelled: false

A live trace from one barge-in turn:

vad_silence_ms: 312
model_first_chunk_ms: 197
end_to_end_ms: 745
speak_duration_ms: 1810
cancelled: true
cancel_to_silence_ms: 174

The numbers we watch every day are end_to_end_ms (target <800ms median, p95 <1100ms) and cancel_to_silence_ms on cancelled turns (target <200ms median). We dashboard the p50, p95, and p99 of each. Two weeks of production data: end-to-end p50 is 720ms, p95 1040ms, p99 1380ms. Cancel-to-silence p50 is 180ms, p95 240ms.
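The roll-up behind those percentiles is plain nearest-rank math over the sampled traces. A sketch (field names match the finish() output above; the 5% sampling happens before traces reach this):

// middleware/percentiles.js
// Nearest-rank percentiles over the finished turn traces.
export function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  if (sorted.length === 0) return null;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

export function rollUp(traces) {
  const endToEnd = traces.map((t) => t.end_to_end_ms);
  return {
    p50: percentile(endToEnd, 50),
    p95: percentile(endToEnd, 95),
    p99: percentile(endToEnd, 99),
  };
}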

The Azure AI Speech fallback path

The realtime endpoint is the happy path. We had three brownouts in the first month, ranging from 30 seconds to about four minutes, where the WebSocket disconnected and reconnects either failed or returned 503. During those windows the assistant was unusable.

The fix was to add a degraded mode that runs on Azure AI Speech SDK for STT plus regular chat completions for the response. It is slower (no streaming TTS in degraded mode, no barge-in) but it functions. The middleware decides which path to use on session open and on every disconnect.

// middleware/fallback-stt.js
import sdk from "microsoft-cognitiveservices-speech-sdk";

export function startFallbackRecognizer({ key, region, onTranscript }) {
  const speechConfig = sdk.SpeechConfig.fromSubscription(key, region);
  speechConfig.speechRecognitionLanguage = "en-GB";

  const pushStream = sdk.AudioInputStream.createPushStream(
    sdk.AudioStreamFormat.getWaveFormatPCM(24000, 16, 1)
  );
  const audioConfig = sdk.AudioConfig.fromStreamInput(pushStream);
  const recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);

  recognizer.recognized = (_s, e) => {
    if (e.result.reason === sdk.ResultReason.RecognizedSpeech) {
      onTranscript(e.result.text);
    }
  };

  recognizer.startContinuousRecognitionAsync();

  return {
    push(pcm) {
      pushStream.write(pcm);
    },
    stop() {
      recognizer.stopContinuousRecognitionAsync();
      pushStream.close();
    },
  };
}

When fallback is active, the middleware swaps which path receives the audio frames coming from the client. Transcripts are sent to the regular GPT-4o chat completion endpoint with the same system prompt, and the text response is sent back to the client to render. We do not synthesise voice in degraded mode because adding TTS would push end-to-end latency past 2 seconds and at that point users abandon the session anyway. The client shows a small "degraded" banner.
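The degraded-mode text turn itself is a plain chat completions call against the same resource. A sketch, with the endpoint and deployment parameterised and the api-version an assumption to match whatever the resource supports:

// middleware/fallback-chat.js
// Degraded-mode text turn: transcript in, text out, same system prompt.
// The api-version here is an assumption; use whatever the resource supports.
import { getOpenAIToken } from "./auth.js";

export async function degradedReply({ endpoint, deployment, transcript, systemPrompt }) {
  const token = await getOpenAIToken();
  const res = await fetch(
    `${endpoint}/openai/deployments/${deployment}/chat/completions?api-version=2024-02-01`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        messages: [
          { role: "system", content: systemPrompt },
          { role: "user", content: transcript },
        ],
        temperature: 0.6,
      }),
    }
  );
  if (!res.ok) throw new Error(`chat completions failed: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}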

Authentication and tokens

The middleware authenticates to Azure OpenAI with an Entra access token, not an API key. Per session, the middleware acquires a token using a user-assigned managed identity in the same Azure subscription as the OpenAI resource. Tokens are cached and refreshed five minutes before expiry.

// middleware/auth.js
import { DefaultAzureCredential } from "@azure/identity";

const credential = new DefaultAzureCredential();
let cached = null;

export async function getOpenAIToken() {
  if (cached && cached.expiresOn > Date.now() + 5 * 60 * 1000) {
    return cached.token;
  }
  const tokenResponse = await credential.getToken(
    "https://cognitiveservices.azure.com/.default"
  );
  cached = {
    token: tokenResponse.token,
    expiresOn: tokenResponse.expiresOnTimestamp,
  };
  return cached.token;
}

The managed identity has a role assignment of Cognitive Services OpenAI User on the OpenAI resource. No keys are in middleware config. The token never leaves the middleware. The browser cannot reach the OpenAI endpoint directly because the OpenAI resource has its public network access disabled and is reachable only through a private endpoint inside the VNet the middleware lives in.

The silent disconnect problem

A WebSocket that closes cleanly fires a close event. A WebSocket on a flaky network can stop transmitting without firing anything at all, particularly if a stateful firewall or NAT mid-path expires the flow. We saw this in week three: a session would appear to be alive on both ends, but no audio was flowing in either direction, and neither side noticed for tens of seconds.

The mitigation is a heartbeat. The middleware sends a ping every 10 seconds. If a pong does not come back within 2 seconds, the middleware tears down the connection and reconnects, replaying the session config and the last few seconds of buffered audio so the user does not have to repeat themselves.

function heartbeat(ws, onDead) {
  let pongTimer = null;
  ws.on("pong", () => clearTimeout(pongTimer));
  const interval = setInterval(() => {
    ws.ping();
    // If no pong arrives within 2 seconds, tear the connection down and reconnect.
    pongTimer = setTimeout(() => {
      clearInterval(interval);
      try {
        ws.terminate();
      } catch {}
      onDead();
    }, 2_000);
  }, 10_000);
  ws.on("close", () => {
    clearInterval(interval);
    clearTimeout(pongTimer);
  });
}

On the client side we keep the last 4 seconds of microphone audio in a rolling buffer. On reconnect, we replay it. From the user's perspective, the assistant pauses for a beat (typically 600 to 900ms during a reconnect) and then catches up. This is the difference between "the assistant felt slow for a second" and "the assistant died and I had to start over."
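A sketch of that rolling buffer on the client, assuming the 20ms frames from the mic path and the arrayBufferToBase64 helper sketched earlier:

// client/replay-buffer.js
// Keep the last 4 seconds of 20ms mic frames; replay them after a reconnect.
import { arrayBufferToBase64 } from "./base64.js";

const MAX_FRAMES = 4000 / 20; // 200 frames = 4 seconds
const ring = [];

export function remember(frame) {
  ring.push(frame);
  if (ring.length > MAX_FRAMES) ring.shift();
}

export function replay(ws) {
  for (const frame of ring) {
    ws.send(
      JSON.stringify({
        type: "input_audio_buffer.append",
        audio: arrayBufferToBase64(frame.buffer),
      })
    );
  }
}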

Troubleshooting

InvalidRequestError: input_audio_format must be 'pcm16' or 'g711_ulaw' or 'g711_alaw'. We hit this on day one. The example we copied from used audio/pcm. The realtime API rejects MIME-style format strings. The values it accepts are the literal strings pcm16, g711_ulaw, or g711_alaw. We use pcm16 because the rest of the pipeline is 16-bit linear PCM at 24kHz.

InvalidRequestError: session.turn_detection.silence_duration_ms must be between 100 and 5000. Trying to set silence_duration_ms: 50 to chase latency. Don't. Below 100ms is rejected, and below about 250ms in practice you get false turn-end on syllable boundaries anyway.

The case where the WebSocket dropped silently and the SDK did not fire an error event. Documented above. The fix is the heartbeat. The symptom that pointed us at it was a single user session that sat at "listening" for 90 seconds without any inbound audio events while the user was visibly speaking into the tablet. The middleware logs showed no event traffic at all for that window. Nothing was emitted on the WebSocket because nothing was being received and nothing was timing out. Heartbeats give you a deadline.

unrecognised event 'response.audio_transcript.delta'. The realtime API has emitted new event types over the months as the surface has expanded. Our middleware was logging warnings every time it received one. The fix is to switch the event handling from a switch statement that errors on unknown events to one that ignores them. New event types do not break old clients; old clients just see less. The list of events for the current API version is in the realtime API reference.
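One way to write that tolerance, sketched as a handler map with an explicit ignore path (the handler bodies stand in for the cases in session.js):

// middleware/dispatch.js (sketch)
const handlers = {
  "response.created": (evt) => {/* as in session.js */},
  "response.audio.delta": (evt) => {/* as in session.js */},
  "input_audio_buffer.speech_started": (evt) => {/* as in session.js */},
};

export function dispatchModelEvent(evt) {
  const handler = handlers[evt.type];
  // New event types arrive as the API surface grows; silently ignore them.
  if (!handler) return;
  handler(evt);
}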

VAD too sensitive. Already covered above. The signal that you have this problem is the false-trigger rate; the script in the VAD section is the diagnostic.

Tool call arrives with malformed JSON in arguments. This is a model behaviour, not an API bug. About 1 in 200 tool calls in our first week had truncated JSON arguments. The fix is a forgiving JSON parser on the middleware side that attempts a repair pass (close trailing braces, strip trailing commas) before giving up. We also added a system prompt instruction that emphasises producing complete JSON.
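A sketch of a repair pass along those lines. It strips trailing commas and closes unbalanced braces, and deliberately does not try to handle braces inside string values, so it is a fallback after JSON.parse fails, not a parser:

// middleware/json-repair.js (hypothetical helper, not a library)
export function parseToolArguments(raw) {
  try {
    return JSON.parse(raw);
  } catch {}

  // Strip trailing commas, then close any unbalanced braces or brackets.
  let repaired = raw.replace(/,\s*([}\]])/g, "$1").replace(/,\s*$/, "");
  const open = [];
  for (const ch of repaired) {
    if (ch === "{" || ch === "[") open.push(ch);
    else if (ch === "}" || ch === "]") open.pop();
  }
  while (open.length) {
    repaired += open.pop() === "{" ? "}" : "]";
  }

  try {
    return JSON.parse(repaired);
  } catch {
    return null; // Give up and let the caller re-prompt the model.
  }
}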

Two voice sessions interleaving. We thought we had a shared-state bug. We had a load balancer bug. Two clients were hitting the same middleware instance and the in-memory session map was being keyed by something stable per IP rather than per browser tab. Fixed with a per-session UUID generated client-side at connection open.
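The client-side piece is tiny; MIDDLEWARE_URL is a placeholder for wherever the middleware's WebSocket lives, and the middleware keys its in-memory session map by this id instead of the socket's remote address:

// client/session-id.js (sketch)
// One UUID per browser tab, minted at connection open and passed as a query param.
const sessionId = crypto.randomUUID();
const ws = new WebSocket(`${MIDDLEWARE_URL}?session=${encodeURIComponent(sessionId)}`);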

Realtime API charges by input and output audio minutes. We watched the bill closely. Prompt caching for the realtime API does apply, but only to the system prompt and any cached context above a length threshold. Our system prompt was below the threshold for the first two weeks. We bulked it out with a couple of paragraphs of stable context (the workshop's safety standards, the calling conventions for the three tools) and the cache hit rate went above 80% on subsequent turns in the same session, which cut input cost per session by roughly a third.

Where we ended up

Twenty-eight workshop engineers use the assistant 30 to 40 times a day in aggregate. The most common interaction shapes:

  • "What is the torque on the M8 bolt for the upper bracket on the L4?" → lookup_spec → assistant reads back two lines.
  • "Walk me through the runbook for hydraulic line flush." → fetch_runbook → assistant reads step 1, pauses, waits for "next" or "skip" or "back," reads step 2, etc.
  • "Open a P2 ticket: pressure sensor on rig 7 is intermittent." → assistant repeats the summary out loud, asks "should I file it?", waits for confirmation, calls create_ticket, reads the ticket id back.

Median end-to-end turn-taking is 720ms in the workshop, measured continuously. P95 is 1040ms. Barge-in cancel-to-silence is 180ms median. Daily session count is 30 to 40. The fallback to Azure AI Speech has triggered four times in the last 60 days, totalling about seven minutes of degraded mode.

Two engineers, six weeks. Roughly the first ten days got the happy path to 780ms. The remaining four weeks were barge-in, the cancelled-response audio dropping, the silent disconnect heartbeat, the VAD tune, the reconnect with buffered audio replay, the tool-call JSON repair, the fallback path, and the eighty small adjustments to the system prompt that mean the assistant reads runbook steps one at a time instead of in one breath.

A coda on what we actually built

If I sit back from this and ask "what did we learn?", the answer is not technical. The technical work is real and the numbers are real, and the latency budget was the right discipline. But the lesson from the 14:30 demo was about what counted as the product.

We thought the product was a voice assistant. The product was actually a voice assistant the user could interrupt without flinching. The 800ms number was necessary, not sufficient. Without barge-in, 800ms feels like a fast kiosk. With barge-in, 800ms feels like a conversation. The kiosk-versus-conversation gap is not a slider you traverse by shaving milliseconds; it is a step function across the interrupt boundary. Below that boundary you are building dictation. Above it you are building a tool that fits in a hand-busy workflow.

The implication is a method, not a metric. When the first end-to-end demo goes from "works" to "kind of nice," do not call the work done. Hand the demo to a user who will treat it the way they treat their phone. Watch them interrupt it. Watch them redirect it. Watch them shortcut past the polite turn-taking your latency budget bought you. If they cannot do those things, the actual feature is hidden behind your happy-path success and you have weeks of work left.

The other thing I would say to a team starting the same build today: instrument the cancellation path first, not last. Spend a day on the trace tooling before you spend a day on the latency tooling. The cancellation path is where the user experience lives. If your dashboard cannot show you cancel-to-silence latency by p50 and p95, you cannot tune the thing the user actually cares about. We added that tracing in week four; if I were doing it again, it would be in week one.

The workshop has not stopped using it. The tablet is on the magnetic mount. The microphone is six inches from someone's face. The assistant is listening. When the engineer wants to redirect it, they just talk over it, and it stops, and it listens. That is the part that took twice as long to build, and that is the part that mattered.