LiteLLM is the glue a lot of us reach for when an app has to talk to more than one
model provider. One interface, dozens of backends. It is great. But once you run it
under real load, the hot path stops being the model call and starts being the
plumbing around it: connection pooling, rate limiting, token counting on big
inputs. That plumbing is pure Python, and under concurrency it shows.
So I built fast-litellm: a drop-in Rust acceleration layer that swaps the hot
paths out for PyO3 extensions and falls back to Python everywhere else. This is the
honest write-up, including the cases where Rust lost.
Lead with the numbers, even the bad ones
These compare production-grade Python (thread-safe implementations) against the
Rust versions:
| Component | Result |
|---|---|
| Connection pool | 3.2x faster (sharded DashMap, no global lock) |
| Rate limiting | 1.6x faster (atomic ops) |
| Large-text token counting | 1.5 to 1.7x faster |
| High-cardinality rate limits (1000+ keys) | 42x less memory |
| Small-text token counting | 0.5x, Python wins (FFI overhead dominates) |
| Routing with complex Python objects | 0.4x, Python wins |
Those last two rows are the important part. Crossing the Python-to-Rust boundary is
not free. For a 12-token chat message, the FFI overhead is bigger than the work you
saved, so Rust loses. Anyone who tells you their native extension is faster at
everything is not measuring the small cases.
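One way to reason about that trade-off is a simple linear cost model. This is a sketch under stated assumptions, not a measurement from the benchmarks: it assumes a fixed per-call boundary cost plus a per-item cost on each side, and every name and number in it is hypothetical.

```python
def break_even_size(call_overhead: float, py_per_item: float, native_per_item: float) -> float:
    """Input size above which the native path beats pure Python.

    Assumed model (illustrative only):
        python_time(n) = py_per_item * n
        native_time(n) = call_overhead + native_per_item * n
    """
    if native_per_item >= py_per_item:
        # The native path never catches up if its per-item cost is not lower.
        return float("inf")
    return call_overhead / (py_per_item - native_per_item)


def speedup(n: int, call_overhead: float, py_per_item: float, native_per_item: float) -> float:
    """Python time divided by native time; values below 1.0 mean Python wins."""
    return (py_per_item * n) / (call_overhead + native_per_item * n)
```

Under this model the speedup is always below 1.0 for inputs smaller than the break-even size, which is the same shape as the small-text row in the table. The real constants have to come from profiling your own payloads.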
Where it wins, it wins because of data structures, not because "Rust is fast." The
connection pool uses DashMap, a concurrent hash map that splits its data into
independently locked shards, so concurrent workers stop serializing on a single
global lock. The high-cardinality rate limiter holds 1000+ unique keys in a
fraction of the Python footprint. 42x less memory is a memory-layout story, not a
language story.
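To make the data-structure point concrete, here is a minimal Python sketch of the sharding idea: one lock per shard instead of one lock for the whole map. `StripedCounter` is a hypothetical name for illustration and is not part of fast-litellm. Under CPython's GIL this shows the structure rather than the speedup; the gain comes when threads can actually run in parallel, as they can in Rust.

```python
import threading


class StripedCounter:
    """Per-key counters guarded by one lock per shard, not one global lock."""

    def __init__(self, shards: int = 16) -> None:
        self._locks = [threading.Lock() for _ in range(shards)]
        self._maps = [{} for _ in range(shards)]

    def _shard(self, key: str) -> int:
        # Keys that hash to different shards never contend for the same lock.
        return hash(key) % len(self._locks)

    def incr(self, key: str, by: int = 1) -> int:
        i = self._shard(key)
        with self._locks[i]:
            value = self._maps[i].get(key, 0) + by
            self._maps[i][key] = value
            return value

    def get(self, key: str) -> int:
        i = self._shard(key)
        with self._locks[i]:
            return self._maps[i].get(key, 0)
```

Two workers touching different keys usually land on different shards and proceed without waiting on each other; with a single global lock, every update queues behind every other update.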
The architecture: accelerate the hot path, never break the app
The design constraint I cared about most: nobody rewrites their app to try this,
and nobody ships a native extension that can take prod down. So the layer has two
halves.
LiteLLM (Python)
└─ fast_litellm (Python integration layer)
   ├─ monkeypatches the hot paths
   ├─ feature flags + gradual rollout
   ├─ performance monitoring
   └─ automatic fallback to Python
└─ Rust components (PyO3)
   ├─ connection_pool
   ├─ rate_limiter
   ├─ tokens
   └─ core (routing)
The Python side does the patching, watches the metrics, and owns the safety net.
The Rust side does the work. If an accelerated component throws or looks wrong, the
integration layer falls back to the original Python implementation instead of
propagating the failure. That single decision is what makes a native accelerator
safe to actually turn on in production.
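The fallback behaviour can be sketched as a small wrapper. This is an illustration of the pattern, not fast-litellm's actual code: `with_fallback` and the two token-counting functions are hypothetical, and it only covers the "throws" case, not the "looks wrong" case.

```python
import functools
import logging

logger = logging.getLogger("fallback_sketch")


def with_fallback(accelerated, original):
    """Return a callable that tries `accelerated` and falls back to `original`."""

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        try:
            return accelerated(*args, **kwargs)
        except Exception:
            # Log the failure, then answer from the pure-Python path instead
            # of letting the exception reach the caller.
            logger.warning("accelerated path failed; using Python fallback", exc_info=True)
            return original(*args, **kwargs)

    return wrapper


def count_tokens_py(text: str) -> int:
    # Hypothetical stand-in for the original Python implementation.
    return len(text.split())


def count_tokens_native(text: str) -> int:
    # Hypothetical stand-in for a native call that fails.
    raise RuntimeError("simulated native failure")


count_tokens = with_fallback(count_tokens_native, count_tokens_py)
```

One caveat with this pattern: the fallback re-runs the call, so it is only safe as written for operations that are free of side effects or idempotent.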
Drop-in, or it does not get used
import fast_litellm  # accelerates LiteLLM automatically
import litellm

response = litellm.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}],
)
One import before litellm. It patches the hot paths on load, and every
accelerated component has that automatic fallback. Installation is the boring part,
which is the point:
uv add fast-litellm # or: pip install fast-litellm
Because the risky step is turning native code on in prod, rollout is gated. Feature
flags let you send a percentage of traffic through the Rust path first, watch the
monitoring, and widen it only when the numbers hold. If you run the LiteLLM proxy
under gunicorn, point gunicorn at a two-line wrapper module and start it with
--preload, so the acceleration is applied before the workers fork:
# app.py
import fast_litellm # apply before litellm loads
from litellm.proxy.proxy_server import app
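The percentage-based rollout mentioned above can be sketched with deterministic hash bucketing. This is a generic pattern, assuming requests carry some stable key such as a user or API-key id; `in_rollout` is a hypothetical helper and not fast-litellm's feature-flag API.

```python
import hashlib


def in_rollout(key: str, percent: float) -> bool:
    """Deterministically decide whether `key` takes the accelerated path.

    The key is hashed into one of 10,000 buckets, so the same key always gets
    the same answer, and raising `percent` only ever adds keys to the rollout.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100
```

Hashing instead of calling a random number generator keeps a given caller on one path across requests, which makes a regression easier to attribute to the accelerated path.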
Where it does not fit
Be honest about who should not bother with this.
- If your traffic is short prompts and simple routing, the FFI overhead can make you slightly slower, not faster. The table above is not marketing, it is a warning label. Measure your own payload sizes first.
- If you are not concurrency-bound, the connection-pool win shrinks. The 3.2x comes from removing lock contention. No contention, no prize.
- It is another native dependency in your build. For a single-process, low-QPS script, the operational cost is probably not worth the millisecond.
The sweet spot is the opposite of all that: many workers, many unique rate-limit
keys, long inputs, connection-pool pressure. That is where the data-structure wins
compound.
What I would take from this
- Profile before you port. The win was in three specific hot paths, not "the code."
- Measure the small inputs too. FFI overhead is real and it will embarrass you.
- Make it a drop-in or it dies on the vine. Zero-config plus automatic fallback is what turns "interesting Rust project" into something a team will actually run.
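For the "measure the small inputs too" point, a minimal timing harness is enough to see where the crossover sits for your own payloads. This is a sketch using only the standard library; `compare_by_size` and its parameters are hypothetical names, and a real benchmark would also want warm-up runs and repeated trials.

```python
import time


def time_per_call(fn, payload, repeats: int = 1000) -> float:
    """Mean wall-clock seconds per call of fn(payload)."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(payload)
    return (time.perf_counter() - start) / repeats


def compare_by_size(baseline, candidate, sizes, make_payload, repeats: int = 1000):
    """Return (size, baseline_s, candidate_s, speedup) rows, one per input size.

    A speedup below 1.0 means the baseline won at that size.
    """
    rows = []
    for n in sizes:
        payload = make_payload(n)
        base = time_per_call(baseline, payload, repeats)
        cand = time_per_call(candidate, payload, repeats)
        rows.append((n, base, cand, base / cand if cand else float("inf")))
    return rows
```

Pass the pure-Python function as `baseline`, the accelerated one as `candidate`, and sizes that span your real traffic, including the short messages.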
Code, the full benchmark breakdown, and the PyO3 architecture are here:
https://github.com/neul-labs/fast-litellm
If you run LiteLLM at any real volume, I would love to know which path is your
bottleneck, and whether the small-input penalty bites you. Kick the tyres, issues
welcome.