I Rewrote Pipecat in Rust. Here's What I Learned Building a Voice Agent Framework from Scratch.


There's a moment in every voice AI conversation where the user says something, and then... waits. Maybe 800 milliseconds. Maybe a full second. Maybe more. In that silence, trust evaporates. The user starts wondering if the bot heard them. They repeat themselves. The bot now has two overlapping utterances to deal with. Things spiral.

I spent months obsessing over that silence. And eventually, I decided the best way to kill it was to stop fighting Python's runtime and just rewrite the whole pipeline in Rust.

The result is rustvani (वाणी — vānī, meaning voice/speech) — a from-scratch Rust port of the Pipecat voice agent framework. It's open source, it's in production, and a single instance uses about 30MB of memory.

Why Pipecat Deserved a Rust Port

Let me be clear: Pipecat is excellent software. The pipeline architecture — frames flowing through a chain of processors, each doing one thing well — is one of the cleanest abstractions I've seen for real-time media. VAD to STT to LLM to TTS, all composable, all pluggable. I studied the Pipecat codebase extensively before writing a single line of Rust.

But when you're running voice agents in production — real users, real phone calls, real government deployments — you start hitting walls that aren't about architecture. They're about the runtime underneath.

Python's async story is good but not great for this workload. The GIL means your VAD inference and your audio I/O end up contending for the same interpreter lock. Memory usage per instance adds up fast when you're running dozens of concurrent sessions. And cold start times matter when you're doing scale-to-zero on Fly Machines.

Rust doesn't have these problems. Tokio gives you real concurrency. There's no GIL. Memory is predictable and tiny. And a Rust binary cold-starts in milliseconds, not seconds.

The Architecture

If you know Pipecat, you'll feel right at home. The core abstraction is the same: frames flow through a chain of processors.

But the Rust implementation diverges in some interesting ways.

Frames are enums, not classes. Python Pipecat uses class inheritance — TranscriptionFrame extends DataFrame. In rustvani, frames are a three-level enum: System, Control, and Data. The Rust compiler enforces exhaustive pattern matching, so you literally cannot forget to handle a frame type. If you add a new variant, every processor that touches frames will fail to compile until you handle it.
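
To make that concrete, here's a minimal sketch of what a three-level frame enum and an exhaustive match can look like. The variant names are my own illustration, not rustvani's actual API:

```rust
// Illustrative frame hierarchy; names are assumptions, not rustvani's API.
enum Frame {
    System(SystemFrame),
    Control(ControlFrame),
    Data(DataFrame),
}

enum SystemFrame {
    StartInterruption,
    UserStartedSpeaking,
    UserStoppedSpeaking,
}

enum ControlFrame {
    EndOfStream,
}

enum DataFrame {
    Audio(Vec<u8>),
    Transcription(String),
    LlmText(String),
}

// Exhaustive matching: add a new DataFrame variant and this function stops
// compiling until the new case is handled.
fn describe(frame: &Frame) -> &'static str {
    match frame {
        Frame::System(SystemFrame::StartInterruption) => "interruption",
        Frame::System(_) => "other system frame",
        Frame::Control(ControlFrame::EndOfStream) => "end of stream",
        Frame::Data(DataFrame::Audio(_)) => "audio chunk",
        Frame::Data(DataFrame::Transcription(_)) => "transcription",
        Frame::Data(DataFrame::LlmText(_)) => "llm text",
    }
}
```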

Two-queue priority routing. Every processor has two async queues — one for system frames (interruptions, VAD signals, lifecycle events) and one for data frames (audio, transcriptions, text). System frames bypass the data queue entirely. This means an interruption signal is never stuck behind a backlog of audio chunks waiting to be processed. In practice, this is what makes interruptions feel instant.
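
Here is a rough sketch of the two-queue idea using tokio channels. The stub types and handler names are mine, not rustvani's real internals:

```rust
use tokio::sync::mpsc;

// Stub frame types standing in for the real ones.
enum SystemFrame { StartInterruption }
enum DataFrame { Audio(Vec<u8>) }

async fn handle_system(_frame: SystemFrame) { /* cancel TTS, flush buffers, ... */ }
async fn handle_data(_frame: DataFrame) { /* run STT / LLM / TTS work */ }

async fn run_processor(
    mut system_rx: mpsc::UnboundedReceiver<SystemFrame>,
    mut data_rx: mpsc::Receiver<DataFrame>,
) {
    loop {
        tokio::select! {
            // `biased` polls branches in order, so a pending system frame
            // (e.g. an interruption) is handled before any backlog of audio
            // chunks sitting in the data queue.
            biased;
            Some(frame) = system_rx.recv() => handle_system(frame).await,
            Some(frame) = data_rx.recv() => handle_data(frame).await,
            else => break, // both channels closed
        }
    }
}
```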

Pipelines are processors. A Pipeline chains processors together, but it is itself a FrameProcessor. So you can nest pipelines inside pipelines. This sounds academic until you need it, and then it's exactly the right abstraction.
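
One way the "a pipeline is a processor" idea can be expressed, with illustrative trait and method names rather than the real API:

```rust
use async_trait::async_trait;

// Placeholder frame type for the sketch.
enum Frame {
    Audio(Vec<u8>),
    Text(String),
}

#[async_trait]
trait FrameProcessor: Send {
    // A processor consumes one frame and may emit zero or more frames.
    async fn process(&mut self, frame: Frame) -> Vec<Frame>;
}

struct Pipeline {
    stages: Vec<Box<dyn FrameProcessor>>,
}

#[async_trait]
impl FrameProcessor for Pipeline {
    // Because Pipeline implements FrameProcessor, a Pipeline can be pushed
    // into another Pipeline's `stages`, which is what makes nesting work.
    async fn process(&mut self, frame: Frame) -> Vec<Frame> {
        let mut frames = vec![frame];
        for stage in self.stages.iter_mut() {
            let mut next = Vec::new();
            for f in frames {
                next.extend(stage.process(f).await);
            }
            frames = next;
        }
        frames
    }
}
```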

No vtable on the hot path. The LLM adapter uses generic bounds instead of trait objects, so the compiler generates specialized code per provider. No dynamic dispatch overhead during inference streaming.
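
For contrast, a toy example of the difference between a boxed trait object and a generic adapter; the trait and type names are hypothetical:

```rust
trait LlmProvider {
    fn next_token(&mut self) -> Option<String>;
}

// Dynamic dispatch: each token pulled during streaming goes through a
// vtable lookup.
struct DynAdapter {
    provider: Box<dyn LlmProvider>,
}

impl DynAdapter {
    fn pull(&mut self) -> Option<String> {
        self.provider.next_token()
    }
}

// Static dispatch: the compiler monomorphizes one adapter per provider
// type, so the call is direct (and often inlined) on the hot path.
struct Adapter<P: LlmProvider> {
    provider: P,
}

impl<P: LlmProvider> Adapter<P> {
    fn pull(&mut self) -> Option<String> {
        self.provider.next_token()
    }
}
```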

Dhara: Conversations That Actually Flow

Most voice agent frameworks treat conversation as a single long prompt. You stuff everything into one system prompt and hope the LLM stays on track. This works for demos. It does not work when your bot needs to greet the user, collect information, look things up, and then transfer to a different mode — all in one call.

rustvani ships with Dhara (ധാര — dhārā, meaning flow/stream), a node-based conversation flow engine. Each node defines its own system prompt, its own set of tools, and its own context strategy. Tool handlers return either "stay in this node" or "transition to node X." The LLM doesn't need to know about the state machine — it just sees the tools and prompt for its current node.

This sounds simple, but it changes everything about how you design voice agents. Instead of one massive prompt that tries to cover every conversational path, you decompose the conversation into discrete stages. The greeting node has a greeting prompt and greeting tools. The data collection node has different tools. The handoff node has yet another set. Each node is small, focused, and testable.
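
A minimal sketch of what such a node-based flow might look like, with hypothetical names standing in for the real Dhara API:

```rust
use std::collections::HashMap;

// What a tool handler tells the engine to do next.
enum NodeAction {
    Stay,
    Transition(String), // name of the next node
}

struct Tool {
    name: String,
    // Handler receives the tool call arguments (JSON text), returns a
    // result string for the LLM plus the next action.
    handler: fn(&str) -> (String, NodeAction),
}

struct Node {
    system_prompt: String,
    tools: Vec<Tool>,
}

struct Flow {
    nodes: HashMap<String, Node>,
    current: String,
}

impl Flow {
    // The LLM only ever sees the prompt and tools of the current node.
    fn current_node(&self) -> &Node {
        &self.nodes[&self.current]
    }

    fn on_tool_result(&mut self, action: NodeAction) {
        if let NodeAction::Transition(next) = action {
            self.current = next;
        }
    }
}
```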

RAVI: A Protocol Layer

Pipecat's ecosystem uses RTVI (Real-Time Voice and Video Inference) as its client-server protocol. rustvani has its own equivalent called RAVI (रवि — ravi, meaning the sun) — a Rust-native protocol layer that handles the client handshake, speaking state signals, transcription forwarding, LLM token streaming, and function call events over WebSocket.

The transport layer currently runs on axum + tokio-tungstenite. Binary frames carry PCM audio, text frames carry RAVI protocol messages. Audio chunking is configurable in 10ms multiples for smooth playback. There's automatic bot speaking state management — the transport knows when the bot starts and stops speaking and broadcasts those signals through the pipeline.
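
As an illustration of the text-frame side, here's the kind of tagged JSON message a protocol like RAVI could carry. The variant and field names are guesses for the sake of the sketch, not the actual RAVI schema:

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
enum RaviMessage {
    BotStartedSpeaking,
    BotStoppedSpeaking,
    Transcription { text: String, is_final: bool },
    LlmToken { token: String },
    FunctionCall { name: String, arguments: String },
}

fn main() -> serde_json::Result<()> {
    // Protocol messages travel as WebSocket text frames (JSON);
    // raw PCM audio travels separately as binary frames.
    let msg = RaviMessage::Transcription {
        text: "book a table for two".into(),
        is_final: true,
    };
    println!("{}", serde_json::to_string(&msg)?);
    Ok(())
}
```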

30MB

That's the memory footprint of a running rustvani instance. Not 300MB. Not "it depends." Thirty megabytes, including the Silero VAD model loaded in memory.

This number matters because it determines your deployment economics. On Fly Machines with scale-to-zero, each instance spins up only when a user connects, handles the call, and shuts down. At 30MB per instance, you can run a lot of concurrent sessions on modest hardware. And cold starts are fast enough that the user never notices.
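
As a rough back-of-the-envelope example (my arithmetic, not a published benchmark): a 2GB machine with 512MB set aside for the OS leaves 1,536MB, which at 30MB per instance is room for roughly 50 concurrent sessions before CPU even becomes the constraint.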

For comparison, a typical Python voice agent process — even a well-optimized one — tends to land somewhere between 200MB and 500MB depending on what you've loaded.

What's Plugged In

rustvani currently integrates with:

  • VAD: Silero (ONNX, via the ort runtime — runs on CPU, no GPU needed)
  • STT: Sarvam AI (streaming WebSocket)
  • LLM: OpenAI (SSE streaming, with full function/tool calling support)
  • TTS: Sarvam AI (WebSocket streaming)
  • Database tooling: Built-in Neon Postgres tool with schema introspection, parameterized queries, and pgvector similarity search — the LLM never writes raw SQL
  • Transport: WebSocket via axum

The function calling system has two handler types: simple handlers that return a string directly to the LLM context, and data handlers that return a summary for the LLM plus raw structured data as a downstream frame for UI or logging. This distinction matters when your tool fetches a menu or a database result — the LLM gets a digestible summary, but the frontend gets the full dataset to render.
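
Sketched as a return type, with illustrative names rather than rustvani's actual function-calling API:

```rust
use serde_json::Value;

enum ToolResult {
    // Simple handler: the string goes straight back into the LLM context.
    Simple(String),
    // Data handler: a digestible summary for the LLM, plus the full
    // structured payload emitted as a downstream frame for the UI or logs.
    Data { summary: String, payload: Value },
}

fn fetch_menu() -> ToolResult {
    let rows = serde_json::json!([
        { "item": "Masala Dosa", "price": 120 },
        { "item": "Filter Coffee", "price": 40 }
    ]);
    ToolResult::Data {
        summary: "2 menu items found: Masala Dosa (120), Filter Coffee (40)".into(),
        payload: rows,
    }
}
```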

Who This Is For

If you're building voice agents in Pipecat and hitting scaling walls — memory, latency, cold starts, concurrent sessions — rustvani is a direct migration path. The architecture is the same. The mental model is the same. The language is different.

If you're a Rust developer curious about voice AI, this is a batteries-included framework that doesn't ask you to figure out VAD state machines and LLM streaming from scratch.

If you're deploying voice agents in production — especially in environments where resource constraints are real — the 30MB footprint and millisecond cold starts might solve problems you're currently throwing hardware at.

What's Next

rustvani is under active development. The roadmap includes transport-agnostic architecture (a Rust trait mirroring Pipecat's abstract Transport interface, so the pipeline core works against Box<dyn Transport> without caring whether it's WebSocket, WebRTC, or something else), OpenTelemetry tracing integration for production observability, and support for local TTS models like Kokoro for fully offline deployments.
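
One possible shape for that trait, sketched here with hypothetical method names rather than the planned API:

```rust
use async_trait::async_trait;

#[async_trait]
pub trait Transport: Send {
    async fn send_audio(&mut self, pcm: Vec<u8>) -> anyhow::Result<()>;
    async fn send_text(&mut self, msg: String) -> anyhow::Result<()>;
    async fn recv_audio(&mut self) -> anyhow::Result<Option<Vec<u8>>>;
}

// The pipeline core would hold a Box<dyn Transport> and stay indifferent
// to whether the other end is WebSocket, WebRTC, or something else.
pub struct PipelineCore {
    pub transport: Box<dyn Transport>,
}
```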

Contributions are welcome. If you've worked with Pipecat and want to help build the Rust equivalent, the codebase will feel familiar.

https://github.com/Allenmylath/rustvani

rustvani is open source. If you're building something with it, I'd love to hear about it.
