Building Realtime Voice Agents in 2026
2026-05-07 · voice-agents, sip, telephony, ai-agents, llms, nat
Notes from building realtime voice agents in 2026 — the shape of a Next.js + LangChain + OpenAI Realtime + Twilio stack, why most of the engineering lives in the integration layer, and why NAT is still a thing two decades on.
Over the past few weeks I've been building realtime voice agents designed to handle inbound phone traffic. The voice-and-reasoning part is, by 2026 standards, mostly a solved problem: the OpenAI Realtime API, Google's Live audio, and ElevenLabs' conversational stack all do the talking part well enough to ship. Picking up the phone and listening to one of these agents run a real conversation is, on first encounter, striking.
What's less striking — and where most of the actual work lives — is the integration layer underneath.
A realtime voice agent isn't a model with a microphone. It's a small distributed system: a phone number, a SIP stack accepting the call, a media pipeline handling the audio, a runtime running the conversation, an integration layer holding the pieces together. The model is one box in that diagram. The other boxes are the engineering.
The agent I have been building answers inbound phone calls, holds a real conversation with the caller, and decides — mid-conversation — when to call a tool, escalate to a human, or wrap up the call. The pattern is intentionally generic: the same architecture, with different prompts and a different tool surface, runs support, scheduling, lead qualification, screening, or any other phone-shaped interaction. Phones are still where users meet you for high-stakes or high-volume conversations, and the volume is what makes the engineering investment worthwhile.
The shape of the stack
The stack throughout this article is concrete: Next.js (Node) on the edge, Twilio Voice for the SIP trunk, OpenAI Realtime (gpt-realtime-1.5) for the speech model, and LangChain for tool orchestration on the agent side.
End to end, the call path looks like this: a phone number routes through Twilio Voice, which terminates the SIP trunk and hands me a media stream over WebSocket. My Next.js handler bridges that stream to OpenAI Realtime running gpt-realtime-1.5. The model's audio comes back through the same WebSocket and out through Twilio to the caller. LangChain wraps any tool calls the agent invokes mid-conversation. Most of the boxes have nothing to do with the LLM.
A minimal implementation, in pieces
Here are three short snippets that together sketch how a call gets answered. The code is illustrative, not production: error handling, retries, and observability are all left out.
The Twilio webhook is the entry point. When an inbound call arrives, Twilio fires a POST at this route and we return TwiML that tells Twilio to open a media WebSocket back to our handler:
// app/api/voice/inbound/route.ts
export async function POST(req: Request) {
  const twiml = `
<Response>
  <Connect>
    <Stream url="wss://${process.env.HOST}/api/voice/stream" />
  </Connect>
</Response>`;
  return new Response(twiml, {
    headers: { 'Content-Type': 'text/xml' },
  });
}
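If the agent needs to know who is calling, the webhook is the natural place to pick that up. As a rough sketch, and not the production handler, the TwiML can carry the caller's number as a `<Parameter>` on the stream; Twilio hands such parameters back in the `start` message of the media stream. The helper names `escapeXml`, `buildStreamTwiml`, and `callerFromFormBody`, and the parameter name `callerNumber`, are made up for illustration.

```typescript
// Hypothetical helpers: these names are illustrative, not part of any SDK.

// Escape the five XML special characters so values are safe inside attributes.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Build TwiML that opens a media stream and forwards the caller's number
// as a custom stream parameter.
function buildStreamTwiml(host: string, callerNumber: string): string {
  return [
    '<Response>',
    '  <Connect>',
    `    <Stream url="wss://${escapeXml(host)}/api/voice/stream">`,
    `      <Parameter name="callerNumber" value="${escapeXml(callerNumber)}" />`,
    '    </Stream>',
    '  </Connect>',
    '</Response>',
  ].join('\n');
}

// Twilio posts webhook data form-encoded; the `From` field holds the caller's number.
function callerFromFormBody(body: string): string {
  return new URLSearchParams(body).get('From') ?? '';
}
```

Inside the route handler, `(await req.formData()).get('From')` would do the same job as `callerFromFormBody`; the string version is used here only so the sketch runs on its own.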
The WebSocket handler bridges Twilio's media stream to OpenAI Realtime. Audio frames flow in both directions; the model picks up the call:
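// app/api/voice/stream: WebSocket handler
// Illustrative shorthand: `realtime.connect`, `sendAudio`, and the `audio` event
// stand in for the session plumbing; the exact calls depend on the SDK version in use.
import OpenAI from 'openai';
const openai = new OpenAI();
async function handleCall(twilioWs: WebSocket) {
  const model = await openai.realtime.connect({
    model: 'gpt-realtime-1.5',
  });
  model.send({
    type: 'session.update',
    session: {
      instructions:
        'You are answering an inbound call. Be warm, brief, and helpful.',
      voice: 'alloy',
    },
  });
  // Twilio sends JSON messages whose `media` events carry base64 audio;
  // a real bridge unwraps them before forwarding, and wraps the model's audio on the way back.
  twilioWs.on('message', (frame) => model.sendAudio(frame));
  model.on('audio', (frame) => twilioWs.send(frame));
}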
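The bridge above glosses over message framing. A sketch of the translation step, under two assumptions: the Realtime session is configured for G.711 μ-law audio (what Twilio media streams carry, so the base64 payload can pass through untouched), and the model's audio arrives in events named `response.output_audio.delta`, or `response.audio.delta` in the earlier beta interface. The function names `toRealtimeEvent` and `toTwilioMessage` are hypothetical, and the interfaces list only the fields the sketch reads.

```typescript
// Only the fields this sketch reads; both protocols carry more.
interface TwilioMessage {
  event: string;                 // 'connected' | 'start' | 'media' | 'stop' | 'mark'
  start?: { streamSid: string };
  media?: { payload: string };   // base64-encoded audio
}

interface RealtimeServerEvent {
  type: string;
  delta?: string;                // base64-encoded audio on audio delta events
}

// Twilio -> model: unwrap a `media` message into an append event.
// Returns null for messages that carry no audio.
function toRealtimeEvent(
  raw: string,
): { type: 'input_audio_buffer.append'; audio: string } | null {
  const msg = JSON.parse(raw) as TwilioMessage;
  if (msg.event !== 'media' || !msg.media) return null;
  return { type: 'input_audio_buffer.append', audio: msg.media.payload };
}

// Model -> Twilio: wrap an audio delta in the envelope Twilio expects.
// Assumption: the delta event name is one of the two listed below.
function toTwilioMessage(raw: string, streamSid: string): string | null {
  const evt = JSON.parse(raw) as RealtimeServerEvent;
  const isAudio =
    evt.type === 'response.output_audio.delta' ||
    evt.type === 'response.audio.delta';
  if (!isAudio || !evt.delta) return null;
  return JSON.stringify({ event: 'media', streamSid, media: { payload: evt.delta } });
}
```

The `streamSid` comes from Twilio's `start` message, so a handler has to hold on to it for the lifetime of the call before it can send any audio back.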
Tools are the deterministic side of the agent. The model decides when to call them; the system decides what they do:
import { tool } from '@langchain/core/tools';
import { z } from 'zod';
// `db` is the application's data-access layer, defined elsewhere.
export const lookupRecord = tool(
  async ({ phoneNumber }) => {
    const record = await db.records.findByPhone(phoneNumber);
    return JSON.stringify(record);
  },
  {
    name: 'lookup_record',
    description: 'Look up a record by phone number.',
    schema: z.object({ phoneNumber: z.string() }),
  },
);
The Twilio webhook is one HTTP endpoint. The bridge is one WebSocket handler. The tools are deterministic functions. Most of the integration work — error paths, session state, observability, hand-offs — lives outside of these snippets, but the shape is there.
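To make the "model decides when, system decides what" boundary a little more visible, here is a dependency-free sketch of the system side. In the stack described here, LangChain and zod play this role; the hand-rolled registry below is purely illustrative, and `ToolSpec`, `parseLookupArgs`, `dispatch`, and the stubbed lookup are all hypothetical names. The point is that arguments coming from the model are treated as untrusted input and every outcome, including failure, comes back as structured data.

```typescript
// Hypothetical registry: names and shapes are illustrative, not a LangChain API.
type ToolResult = { ok: true; data: unknown } | { ok: false; error: string };

interface ToolSpec<A> {
  parse: (raw: unknown) => A | null;   // validate untrusted model output
  run: (args: A) => Promise<unknown>;  // deterministic system code
}

// E.164: a plus sign followed by up to 15 digits, no leading zero.
const E164 = /^\+[1-9]\d{1,14}$/;

function parseLookupArgs(raw: unknown): { phoneNumber: string } | null {
  if (typeof raw !== 'object' || raw === null) return null;
  const phoneNumber = (raw as Record<string, unknown>).phoneNumber;
  return typeof phoneNumber === 'string' && E164.test(phoneNumber)
    ? { phoneNumber }
    : null;
}

const tools = new Map<string, ToolSpec<any>>([
  [
    'lookup_record',
    {
      parse: parseLookupArgs,
      // Stand-in for the real database call.
      run: async (args: { phoneNumber: string }) => ({
        phoneNumber: args.phoneNumber,
        found: false,
      }),
    },
  ],
]);

// Run a tool call requested by the model: tool name plus JSON-encoded arguments.
async function dispatch(name: string, argsJson: string): Promise<ToolResult> {
  const spec = tools.get(name);
  if (!spec) return { ok: false, error: `unknown tool: ${name}` };
  let raw: unknown;
  try {
    raw = JSON.parse(argsJson);
  } catch {
    return { ok: false, error: 'arguments are not valid JSON' };
  }
  const args = spec.parse(raw);
  if (args === null) return { ok: false, error: 'invalid arguments' };
  try {
    return { ok: true, data: await spec.run(args) };
  } catch (err) {
    return { ok: false, error: String(err) };
  }
}
```

Returning an error object rather than throwing lets the model apologise or ask the caller to repeat a number, while the decision about what counts as valid stays in code.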
SIP and trunking
Every voice agent that wants to be reachable from a real phone speaks SIP somewhere in its stack. The world's carriers, PBXes, and call-routing infrastructure all speak it; there is no shortcut around it. What changes is whether you handle SIP yourself or whether a trunk provider handles it for you.
Twilio Voice gives you a SIP trunk wrapped in an HTTP and WebSocket abstraction. You hand them a webhook URL; they hand you media streams. The trade-off is the usual one with managed infrastructure: you give up some control over the lower layers in exchange for not having to run them. For a generic call-handling agent, that trade is almost always right. For a specialised agent at scale, one that needs direct control over codec selection, custom session border controllers, or carrier-specific routing, going closer to the trunk pays off.
NAT and firewalls — still on the critical path
SIP negotiates where media should flow inside its own payload: the SDP body of an INVITE carries the IP address and port where the endpoint expects to receive RTP. A NAT box that only rewrites the IP and UDP headers leaves those payload addresses untouched, and the media ends up pointed at an unreachable internal address. Worse, RTP runs on dynamically allocated UDP ports negotiated per call — the firewall has no way to know which ports to open until it has parsed the SDP itself.
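A small illustration of the problem, in userspace TypeScript rather than anything resembling the kernel code. The SDP body is a trimmed, made-up example of what a host behind NAT might send, and `rtpEndpoint` and `isPrivateIPv4` are hypothetical helper names. The sketch only reads the session-level `c=` line and the first `m=audio` line, and only handles IPv4.

```typescript
// A trimmed, made-up SDP body as it might appear in an INVITE from behind NAT.
const exampleSdp = [
  'v=0',
  'o=alice 2890844526 2890844526 IN IP4 192.168.1.20',
  's=-',
  'c=IN IP4 192.168.1.20',
  't=0 0',
  'm=audio 49170 RTP/AVP 0',
  'a=rtpmap:0 PCMU/8000',
].join('\r\n');

interface MediaEndpoint {
  address: string;
  port: number;
}

// Read the connection address and the first audio port from an SDP body.
function rtpEndpoint(sdp: string): MediaEndpoint | null {
  const conn = /^c=IN IP4 (\S+)/m.exec(sdp);
  const media = /^m=audio (\d+) /m.exec(sdp);
  if (!conn || !media) return null;
  return { address: conn[1], port: Number(media[1]) };
}

// RFC 1918 private ranges: not routable from the public internet.
function isPrivateIPv4(address: string): boolean {
  const [a, b] = address.split('.').map(Number);
  return a === 10 || (a === 172 && b >= 16 && b <= 31) || (a === 192 && b === 168);
}
```

Here the endpoint advertises `192.168.1.20:49170` for RTP. Rewriting the packet's IP and UDP headers does nothing to that text, so the far side would try to send media to a private address it cannot reach.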
So the connection-tracking layer has to read into the application protocol: parse the SIP message, rewrite the addresses inside the SDP to match the NAT mapping, and pinhole the right ports for the right call — in real time, for every call.
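The rewrite step can be sketched in a few lines, again in userspace TypeScript and not as a rendering of the kernel helpers. The `NatMapping` shape and the function names are assumptions made for the example, and only the `c=` and `m=audio` lines are handled.

```typescript
// Hypothetical NAT mapping for one call: internal endpoint -> public endpoint.
interface NatMapping {
  internalAddress: string;
  internalPort: number;
  publicAddress: string;
  publicPort: number;
}

// Rewrite the connection address and audio port in an SDP body to match the mapping.
function rewriteSdp(sdp: string, nat: NatMapping): string {
  return sdp
    .split('\r\n')
    .map((line) => {
      // Exact match avoids touching a different address that merely shares a prefix.
      if (line === `c=IN IP4 ${nat.internalAddress}`) {
        return `c=IN IP4 ${nat.publicAddress}`;
      }
      const media = /^m=audio (\d+) (.*)$/.exec(line);
      if (media && Number(media[1]) === nat.internalPort) {
        return `m=audio ${nat.publicPort} ${media[2]}`;
      }
      return line;
    })
    .join('\r\n');
}

// Ports to open for this call: RTP, plus RTCP on the next port up by convention.
function pinholes(nat: NatMapping): number[] {
  return [nat.publicPort, nat.publicPort + 1];
}
```

A real helper has more to do than this: the rewritten body can change length, so the SIP Content-Length header has to be fixed up, and SIP headers such as Via and Contact carry addresses of their own.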
That is exactly what nf_conntrack_sip.c and nf_nat_sip.c do in the Linux kernel. I wrote them in 2005; the netfilter community has maintained them since. They are still the canonical SIP helpers in the kernel, and the underlying problem hasn't moved in twenty-one years.
Trunk providers like Twilio handle most of this transparently behind their abstraction. It surfaces — quickly — when you start running custom media handling, self-hosted session border controllers, or any kind of edge component that sits between the trunk and your runtime.
Where the AI layer ends and the system layer begins
The model is excellent at the conversation. The model is not, and should not be, the place where you handle session state, retries, hand-off logic, or anything that needs to be deterministic. The LangChain tool in the snippet above is a small example of where the line falls: the model decides when to look a record up; the system decides how.
Drawing that line cleanly is much of the integration design. Tools that should be deterministic (database lookups, external API calls, anything with a definite right answer) sit on the system side. Reasoning, paraphrasing, follow-up questions, and recognising when to escalate sit on the model side. When in doubt, push the work toward the system. A nondeterministic database lookup is a bug. A deterministic conversation is a robot.
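One way to keep session state and hand-off logic on the system side is a small state machine that the model can nudge but not override. The sketch below is an assumption about how that might look, not a description of the agent discussed here; the state and event names are invented for the example.

```typescript
// Hypothetical call lifecycle; state and event names are illustrative.
type CallState = 'active' | 'tool_running' | 'escalating' | 'ended';

type CallEvent =
  | { type: 'tool_requested' }
  | { type: 'tool_finished' }
  | { type: 'escalation_requested' }
  | { type: 'hangup' };

// Pure transition function: the same state and event always give the same result.
function nextState(state: CallState, event: CallEvent): CallState {
  if (state === 'ended') return 'ended';        // terminal state
  if (event.type === 'hangup') return 'ended';  // a hangup is always honoured
  switch (state) {
    case 'active':
      if (event.type === 'tool_requested') return 'tool_running';
      if (event.type === 'escalation_requested') return 'escalating';
      return state;
    case 'tool_running':
      // Requests arriving while a tool runs are ignored rather than queued.
      return event.type === 'tool_finished' ? 'active' : state;
    default:
      // 'escalating': once a hand-off starts, model events no longer change it.
      return state;
  }
}
```

The model still chooses when to ask for a tool or an escalation, but whether that request is honoured at a given moment is decided by code that can be unit-tested.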
The old stack underneath
A realtime voice agent in 2026 is mostly the telephony stack of the last twenty years with a new model on the conversation side. The phone number, the SIP trunk, the media pipeline, the NAT traversal — none of that is new. The model is. And the model is not where the engineering goes.
The engineering is everywhere else.