I Built a Production-Grade AI Gateway in Rust — Here's What I Learned


TL;DR — I built a production-ready, distributed API gateway in Rust that routes traffic to OpenAI, Anthropic (Claude), and Ollama with auth, Redis rate limiting, PostgreSQL usage tracking, Prometheus metrics, and AWS Terraform deployment. Gateway overhead is ~1.2ms P99. Here's why and how.

👉 GitHub: MihirMohapatra/rust-ai-gateway


The Problem I Kept Running Into

Six months ago, my team was using OpenAI directly in every service. Then we wanted to test Claude 3.5 Sonnet for some tasks. Then our compliance team asked: "Can you show me every AI request we've ever made?" Then someone ran a batch job and we got a surprise invoice.

Sound familiar?

The real problem wasn't any single API — it was that we had no control plane for our AI traffic. No unified auth, no rate limiting per team, no cost visibility, no ability to swap providers without touching application code.

I looked at existing solutions. Most are Python-based (LiteLLM), built on NGINX and Lua (Kong), or SaaS (Portkey). Nothing was Rust. Nothing had the full stack I wanted in a single deployable binary (auth, rate limiting, observability, multi-provider routing, and cloud IaC) without a subscription.

So I built one.


What It Does (10-Second Version)

Your App  →  AI Gateway  →  OpenAI (GPT-4, o1, o3)
                         →  Anthropic (Claude 3.5 Sonnet, Opus, Haiku)
                         →  Ollama (Llama3, Mistral, local models)

It's an OpenAI-compatible drop-in replacement. Change one line in your existing code — the base_url — and you instantly get:

  • Per-key API authentication (Argon2id hashed)
  • Redis sliding window rate limiting
  • Full request/token audit trail in PostgreSQL
  • Prometheus metrics + pre-built Grafana dashboards
  • Request correlation across logs via X-Request-ID
  • Docker + AWS ECS Fargate deployment with Terraform

Why Rust? (The Real Answer)

I've shipped services in Java, Kotlin, Python, and Go. The honest answer to "why Rust" isn't idealism — it's that this specific problem punishes GC pauses.

An AI gateway sits in the hot path of every LLM call. When your app fires 50 concurrent requests, the gateway's tail latency becomes your app's tail latency. GC-induced pauses at P99 compound badly in this scenario.
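A rough way to quantify that compounding, assuming request latencies are independent (an idealization; the function name is illustrative and not from the project):

```rust
/// Probability that at least one of `n` independent requests lands in the
/// slowest `tail` fraction of the latency distribution
/// (`tail = 0.01` means "slower than P99").
fn prob_any_in_tail(n: i32, tail: f64) -> f64 {
    // Chance that every request avoids the tail, subtracted from 1.
    1.0 - (1.0 - tail).powi(n)
}
```

For `prob_any_in_tail(50, 0.01)` this comes out a little under 0.40, so roughly four in ten 50-request fan-outs include at least one request slower than the gateway's P99.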

Here's what I measured after building the Rust version:

| Metric | Rust (This Project) | Typical Node.js | Typical Go |
|---|---|---|---|
| Memory (idle) | 12 MB | 80–150 MB | 20–30 MB |
| Memory (10k connections) | 45 MB | 500+ MB | 100–150 MB |
| P99 latency (gateway overhead) | 1.2 ms | 8–15 ms | 2–4 ms |
| Startup time | 50 ms | 2–5 seconds | 200 ms |
| Binary size | 8 MB | 200+ MB (node_modules) | 15 MB |

The 12 MB idle footprint means I can run this on a small ECS Fargate task (0.5 vCPU / 1 GB RAM) and still have headroom for hundreds of concurrent requests.

But the real win was correctness under concurrency. The rate limiter touches Redis on every request across multiple gateway nodes. In Go or Node I'd be paranoid about races on shared in-process state. In safe Rust, the compiler rejects data races at compile time. That's not a cliché: it meant I could build the distributed rate limiter without a single Mutex in the business logic, because the ownership model forced me into the right design. Races between gateway nodes are a separate matter and have to be handled in Redis itself, since the compiler can't see them.


Architecture Deep Dive

The Request Pipeline

Every request flows through a layered middleware stack before hitting a provider:

Client Request
     │
     ▼
1. Request ID         ── Assigns UUID, injects X-Request-ID header
2. Tracing Layer      ── Structured span for each request
3. Compression        ── gzip on responses
4. 60s Timeout        ── Hard stop, prevents hung connections
5. Auth Middleware    ── Validates API key with Argon2id
6. Rate Limiter       ── Redis sliding window per key
7. Provider Router    ── Routes by model name → OpenAI/Anthropic/Ollama
8. Proxy Request      ── Forwards to upstream
9. Usage Logging      ── Async insert to PostgreSQL
10. Metrics Update    ── Prometheus counters + histograms

Steps 5 and 6 are the interesting ones.

Authentication: Why Argon2id?

Argon2 won the Password Hashing Competition in 2015, and its Argon2id variant is the usual recommendation for password hashing today. For API keys, though, most people just use bcrypt or, worse, store a plain SHA256 hash. Here I used Argon2id because:

  1. Memory-hard — GPU-based brute force is expensive
  2. Configurable parallelism and memory cost
  3. The argon2 crate is excellent in the Rust ecosystem

On key creation, only the hash is stored. On every request, the presented key is verified against the stored hash with verify_password() from the argon2 crate's PasswordVerifier trait. The raw key never hits the database, not even once.

use argon2::{
    password_hash::{rand_core::OsRng, PasswordHash, PasswordHasher, PasswordVerifier, SaltString},
    Argon2,
};

// Key creation: hash and store
let salt = SaltString::generate(&mut OsRng);
let argon2 = Argon2::default(); // Argon2id with the crate's default parameters
let hash = argon2.hash_password(raw_key.as_bytes(), &salt)?.to_string();
sqlx::query!(
    "INSERT INTO api_keys (user_id, key_hash, prefix, name) VALUES ($1, $2, $3, $4)",
    user_id, hash, &prefix, key_name
).execute(&pool).await?;

// Verification on every request: only the stored hash is read from the DB
let parsed_hash = PasswordHash::new(&stored_hash)?;
let is_valid = argon2
    .verify_password(presented_key.as_bytes(), &parsed_hash)
    .is_ok();

Rate Limiting: Redis Sliding Window

Fixed window rate limiting has an edge case that'll bite you: a burst of 2× the limit can happen right across a window boundary. Sliding window fixes this.
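To make the boundary problem concrete, here is a minimal in-memory fixed-window counter. It is a sketch for illustration only: the type and function names are hypothetical and nothing here is taken from the gateway's code.

```rust
/// Fixed-window counter: the count resets whenever a new window begins.
struct FixedWindow {
    limit: u32,
    window_ms: u64,
    current_window: u64, // index of the window the counter belongs to
    count: u32,
}

impl FixedWindow {
    fn new(limit: u32, window_ms: u64) -> Self {
        Self { limit, window_ms, current_window: 0, count: 0 }
    }

    /// Returns true if a request arriving at `now_ms` is allowed.
    fn allow(&mut self, now_ms: u64) -> bool {
        let window = now_ms / self.window_ms;
        if window != self.current_window {
            self.current_window = window;
            self.count = 0; // boundary crossed: everything before it is forgotten
        }
        if self.count < self.limit {
            self.count += 1;
            true
        } else {
            false
        }
    }
}

/// Sends `limit` requests in the last millisecond of one window and `limit`
/// more in the first millisecond of the next, returning how many got through.
fn burst_across_boundary(limit: u32, window_ms: u64) -> u32 {
    let mut limiter = FixedWindow::new(limit, window_ms);
    let before = window_ms - 1; // last millisecond of window 0
    let after = window_ms; // first millisecond of window 1
    let mut allowed = 0;
    for _ in 0..limit {
        if limiter.allow(before) {
            allowed += 1;
        }
    }
    for _ in 0..limit {
        if limiter.allow(after) {
            allowed += 1;
        }
    }
    allowed
}
```

With a limit of 100 per 60 seconds, `burst_across_boundary(100, 60_000)` lets 200 requests through within two milliseconds, which is the 2× burst described above.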

The implementation uses a Redis sorted set per API key. On each request:

  1. Remove all entries older than the window (currently 60 seconds)
  2. Count remaining entries
  3. If count ≥ limit → reject with 429
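  4. Otherwise add current timestamp and continue

use std::time::{SystemTime, UNIX_EPOCH};
use redis::AsyncCommands;

let now = SystemTime::now()
    .duration_since(UNIX_EPOCH)?
    .as_millis() as f64;
let window_start = now - (window_seconds * 1000) as f64;

// Evict expired entries and count the rest in one round trip
let (count,): (i64,) = redis::pipe()
    .zrembyscore(&key, 0.0, window_start)   // evict old entries
    .ignore()
    .zcard(&key)                              // count current
    .query_async(&mut conn)
    .await?;

if count >= limit as i64 {
    return Err(AppError::RateLimitExceeded { limit, window_seconds });
}

// Record this request. The member must be unique: with the bare timestamp as
// the member, two requests in the same millisecond collapse into one entry.
// `request_id` is assumed to be the X-Request-ID value already in scope.
let member = format!("{now}:{request_id}");
let _: () = conn.zadd(&key, member, now).await?;
let _: () = conn.expire(&key, window_seconds as usize + 1).await?;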

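The eviction and the count run in a single pipeline, and the ZADD that records the request follows as a separate call. The check-and-add sequence is therefore not strictly atomic: requests racing on the same key can slip slightly past the limit. Wrapping the steps in a Lua script or a MULTI/EXEC transaction closes that gap. Because all gateway nodes share the same Redis, the limit is enforced globally across horizontal replicas rather than per node.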
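The four steps can also be exercised without Redis. The following is a minimal single-process sketch of the same sliding-window log, assuming timestamps never go backwards. The `SlidingWindow` type is hypothetical and stands in for the sorted set of one API key.

```rust
use std::collections::VecDeque;

/// In-memory sliding-window log for a single key. Timestamps are stored in
/// arrival order, so the oldest entry is always at the front.
struct SlidingWindow {
    limit: usize,
    window_ms: u64,
    hits: VecDeque<u64>,
}

impl SlidingWindow {
    fn new(limit: usize, window_ms: u64) -> Self {
        Self { limit, window_ms, hits: VecDeque::new() }
    }

    /// Returns true if a request at `now_ms` is allowed.
    fn allow(&mut self, now_ms: u64) -> bool {
        // Step 1: evict entries at or before the start of the window
        // (inclusive, like a score range of 0..=window_start).
        if let Some(window_start) = now_ms.checked_sub(self.window_ms) {
            while matches!(self.hits.front(), Some(&t) if t <= window_start) {
                self.hits.pop_front();
            }
        }
        // Steps 2 and 3: count what is left and reject at the limit.
        if self.hits.len() >= self.limit {
            return false;
        }
        // Step 4: record this request.
        self.hits.push_back(now_ms);
        true
    }
}
```

With a limit of 100 per 60 seconds, 100 requests at t = 59,999 ms are allowed and a further 100 at t = 60,000 ms are all rejected, because the earlier entries are still inside the window. A request at t = 119,999 ms is allowed again once they have aged out.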

Multi-Provider Routing

Provider selection is purely model-name-driven — no configuration per-request:

pub fn select_provider(model: &str) -> Provider {
    if model.starts_with("gpt-") || model.starts_with("o1") || model.starts_with("o3") {
        Provider::OpenAI
    } else if model.starts_with("claude-") {
        Provider::Anthropic
    } else {
        Provider::Ollama  // default: local
    }
}

This means you can A/B test gpt-4 vs claude-3-5-sonnet vs llama3 from a single endpoint — just change the model field. No infrastructure changes, no redeploys.
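A table-driven check makes the routing rules easy to audit. This sketch repeats the function above so that it compiles on its own; the `Provider` derives and the `routing_cases` helper are assumptions added for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Provider {
    OpenAI,
    Anthropic,
    Ollama,
}

pub fn select_provider(model: &str) -> Provider {
    if model.starts_with("gpt-") || model.starts_with("o1") || model.starts_with("o3") {
        Provider::OpenAI
    } else if model.starts_with("claude-") {
        Provider::Anthropic
    } else {
        Provider::Ollama // default: local
    }
}

/// Model names paired with the provider they are expected to route to.
pub fn routing_cases() -> Vec<(&'static str, Provider)> {
    vec![
        ("gpt-4", Provider::OpenAI),
        ("o1-mini", Provider::OpenAI),
        ("claude-3-5-sonnet", Provider::Anthropic),
        ("llama3", Provider::Ollama),
        // A typo matches no prefix and falls through to the local default.
        ("gtp-4", Provider::Ollama),
    ]
}
```

The last case is worth noticing: because Ollama is the catch-all, a misspelled model name is routed locally instead of being rejected. Whether that is desirable depends on the deployment; returning an error for unknown prefixes would be the stricter alternative.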


The Workspace Architecture

The project is a Cargo workspace with three crates:

crates/
├── gateway/      # Main binary — routes, middleware, HTTP server
├── shared/       # Library — domain models, provider clients, errors
└── dashboard/    # HTMX metrics dashboard (separate binary)

Splitting shared out was intentional. The provider clients (OpenAI, Anthropic, Ollama) are pure reqwest + serde with no web framework dependency. This means I can test provider logic without spinning up an HTTP server. It also means the dashboard can import usage stats without duplicating types.

The gateway crate's state is a single AppState struct shared with handlers through Axum's State extractor:

#[derive(Clone)]
pub struct AppState {
    pub db: PgPool,
    pub redis: ConnectionManager,
    pub config: Arc<Config>,
}

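Cloning AppState is cheap. PgPool and ConnectionManager are cheaply cloneable handles to shared internals, and Config sits behind an Arc, so a clone copies handles and bumps reference counts instead of duplicating connections or configuration.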
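The Arc part of that can be checked with the standard library alone. In this sketch the `Config` field and the helper function are hypothetical, and the database and Redis handles are left out so that it has no dependencies.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the gateway's configuration.
struct Config {
    rate_limit_per_minute: u32,
}

#[derive(Clone)]
struct AppState {
    config: Arc<Config>,
}

/// Clones the state the way a per-request handler would receive it, then
/// reports whether both copies point at the same Config allocation and how
/// many owners exist while the clone is alive.
fn clone_shares_config(state: &AppState) -> (bool, usize) {
    let per_request = state.clone();
    (
        Arc::ptr_eq(&state.config, &per_request.config),
        Arc::strong_count(&state.config),
    )
}
```

For a freshly built state this returns `(true, 2)`: the clone points at the same allocation, and the reference count drops back to 1 once the per-request copy goes out of scope.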


Observability: What I Actually Instrument

Every production service eventually asks: "Why was that request slow?" You want to answer that without waking anyone up.

Structured logging — In production, every request emits a JSON log line:

{"timestamp":"2026-05-31T10:30:00.000Z","level":"INFO","message":"Chat completion completed","model":"gpt-4","provider":"openai","latency_ms":1250,"prompt_tokens":45,"completion_tokens":120,"request_id":"550e8400-e29b-41d4-a716-446655440000"}

That request_id is injected by the first middleware and propagated through the upstream provider call. When something goes wrong, you grep by request_id and get the full trace across logs.

Prometheus metrics — Four metrics I actually care about for an AI gateway:

| Metric | Why It Matters |
|---|---|
| gateway_requests_total{model,provider,status} | Per-model error rates |
| gateway_request_duration_ms (histogram) | P50/P99 latency breakdown |
| gateway_tokens_total{model} | Cost attribution |
| Rate limit rejection rate | Detect runaway clients |
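For readers new to Prometheus histograms, the sketch below shows the idea behind the duration metric: cumulative buckets, each counting observations at or below its upper bound. It is a simplified std-only model and not the prometheus crate's API. The bucket bounds are made up, and the real `histogram_quantile` function interpolates inside a bucket, while this version only reports the bucket's upper bound.

```rust
/// Cumulative histogram: `counts[i]` is the number of observations that were
/// less than or equal to `bounds_ms[i]`.
struct Histogram {
    bounds_ms: Vec<f64>, // ascending upper bounds
    counts: Vec<u64>,
    total: u64,
}

impl Histogram {
    fn new(bounds_ms: Vec<f64>) -> Self {
        let n = bounds_ms.len();
        Self { bounds_ms, counts: vec![0; n], total: 0 }
    }

    fn observe(&mut self, value_ms: f64) {
        self.total += 1;
        for (bound, count) in self.bounds_ms.iter().zip(self.counts.iter_mut()) {
            if value_ms <= *bound {
                *count += 1;
            }
        }
    }

    /// Upper bound of the first bucket that covers at least fraction `q` of
    /// all observations. Returns None if the histogram is empty or the
    /// quantile lies beyond the largest bucket.
    fn quantile_upper_bound(&self, q: f64) -> Option<f64> {
        if self.total == 0 {
            return None;
        }
        let target = (q * self.total as f64).ceil() as u64;
        for (bound, count) in self.bounds_ms.iter().zip(self.counts.iter()) {
            if *count >= target {
                return Some(*bound);
            }
        }
        None
    }
}
```

With buckets at 1, 5, 10, and 50 ms, 99 observations of 0.3 ms plus one of 8 ms give a median bound of 1 ms and a maximum bound of 10 ms. This is why bucket boundaries need to be chosen around the latencies you expect: the histogram cannot resolve anything finer than its buckets.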

Grafana comes pre-provisioned via monitoring/grafana/provisioning/. One docker-compose up and you have a live dashboard. No manual datasource setup.


Health Probes: Liveness vs Readiness

This distinction matters in Kubernetes and ECS, and most tutorials get it wrong.

GET /health/live — Returns 200 as long as the process is running. Never checks external deps. This is what your orchestrator uses to decide whether to kill and restart the task.

GET /health/ready — Returns 200 only if the database is reachable. This is what your load balancer uses to decide whether to send traffic. If the DB connection pool is exhausted, this fails and the LB stops routing to that instance.

// Readiness: check DB before accepting traffic
async fn health_ready(State(state): State<AppState>) -> impl IntoResponse {
    match sqlx::query("SELECT 1").fetch_one(&state.db).await {
        Ok(_) => StatusCode::OK,
        Err(_) => StatusCode::SERVICE_UNAVAILABLE,
    }
}

AWS Deployment (Terraform)

The infrastructure is fully defined in infra/main.tf:

Internet → ALB → ECS Fargate (2+ tasks) → RDS PostgreSQL
                                         → ElastiCache Redis

A few decisions worth calling out:

ECS Fargate over EC2 — No instance management, no capacity reservations. The gateway is stateless, so horizontal scaling is just bumping the desired task count. The Rust binary's 50ms cold start means new tasks are ready quickly.

RDS over self-managed Postgres — The usage logs table will grow. Automated backups, point-in-time recovery, and Multi-AZ are worth the cost for a service where audit trails matter.

ElastiCache Redis over self-managed — The rate limiter's correctness depends on Redis. Managed Redis with automatic failover means a Redis restart doesn't briefly disable rate limiting for all keys.


Testing: The Three Layers

The test suite has a clear pyramid:

Unit tests      — Zero I/O. Config parsing, provider routing logic,
                  request/response serialization.

Mock tests      — Wiremock HTTP server standing in for OpenAI/Ollama.
                  Provider clients tested in isolation.

Integration     — Real PostgreSQL + Redis (via Docker Compose).
                  Full auth, rate limit, stats endpoint flows.

The mock tests were the most valuable. I caught a bug where the Anthropic client was sending the max_tokens field in the wrong position for the Messages API. Unit tests wouldn't have caught it. Integration tests against the real Anthropic API would've been slow and flaky. Wiremock was the right tool.

# Run the full suite
./scripts/test.sh all

# Or just the mocks (fast, no external deps)
cargo test --package gateway --test provider_test

Performance Benchmarks

Gateway overhead only (not provider latency):

| Metric | Value |
|---|---|
| Throughput (health endpoint) | ~45,000 req/s |
| P50 latency | 0.3 ms |
| P99 latency | 1.2 ms |
| Memory at idle | 12 MB |
| Memory at 10k concurrent | 45 MB |
| Binary startup | < 50 ms |

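The 45 MB at 10k connections is the number that keeps surprising people. Node.js would be well past 500 MB at that concurrency. The difference isn't algorithmic: each Rust async task is a compiler-generated state machine sized to the data it holds across await points, often a few hundred bytes to a few kilobytes, with no garbage-collected heap behind it. A thread-per-connection model, by comparison, reserves megabytes of stack per connection.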
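The per-task figure is easy to inspect, because a future is an ordinary value whose size can be measured. This is a toy sketch: `handle_connection` is a hypothetical stand-in for a connection handler, and real handlers hold more state than one 256-byte buffer.

```rust
use std::future::ready;
use std::mem::size_of_val;

/// Stand-in for a per-connection task. The buffer is used after the await,
/// so it has to be stored inside the future's state machine.
async fn handle_connection(seed: u8) -> usize {
    let buf = [seed; 256];
    ready(()).await;
    buf.iter().map(|b| *b as usize).sum()
}

/// Size in bytes of the future returned by `handle_connection`.
/// Creating the future does not run it, so no async runtime is needed.
fn task_state_size() -> usize {
    let fut = handle_connection(1);
    size_of_val(&fut)
}
```

On a typical build the result is a few hundred bytes, dominated by the buffer. For comparison, threads spawned through Rust's standard library currently default to a 2 MiB stack on tier-1 platforms.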


What I'd Add Next

The project is production-ready today, but a few things are on the roadmap:

Streaming support — The current implementation buffers the full response before returning. For long Claude/GPT outputs, you want text/event-stream SSE forwarding so the client sees tokens as they arrive.

Token budget enforcement — Right now rate limiting is request-based. A more useful primitive for AI gateways is token-based budgets: "this key can spend 1M tokens/day."
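As a sketch of what that primitive could look like, here is an in-memory daily budget per key. Everything in it is hypothetical (names, the UTC day boundary, the in-process HashMap); a real version would need shared storage such as Redis so that all gateway nodes see the same counters.

```rust
use std::collections::HashMap;

/// Per-key daily token budget. Days are UTC days derived from a Unix
/// timestamp in seconds.
struct TokenBudgets {
    daily_limit: u64,
    usage: HashMap<String, (u64, u64)>, // key -> (day index, tokens used that day)
}

impl TokenBudgets {
    fn new(daily_limit: u64) -> Self {
        Self { daily_limit, usage: HashMap::new() }
    }

    /// Charges `tokens` to `key` if the budget allows it.
    /// Returns Ok(remaining) on success, or Err(remaining) without charging.
    fn try_spend(&mut self, key: &str, tokens: u64, now_secs: u64) -> Result<u64, u64> {
        let day = now_secs / 86_400;
        let entry = self.usage.entry(key.to_string()).or_insert((day, 0));
        if entry.0 != day {
            *entry = (day, 0); // new day: start from a fresh budget
        }
        let remaining = self.daily_limit - entry.1;
        if tokens > remaining {
            return Err(remaining);
        }
        entry.1 += tokens;
        Ok(self.daily_limit - entry.1)
    }
}
```

One design wrinkle: completion token counts are only known after the provider responds, so a gateway would likely check the budget with an estimate before the call and charge the actual usage afterwards.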

Retry with fallback — If OpenAI returns a 503, automatically retry with Ollama. The provider abstraction is already in place; it just needs a retry policy layer.
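A retry policy layer of that kind could be as small as the sketch below. The names and the "5xx is retryable, 4xx is not" rule are assumptions for illustration, and the provider call is passed in as a closure so that the policy stays independent of any HTTP client.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Provider {
    OpenAI,
    Ollama,
}

/// Tries each provider in order. A 5xx status moves on to the next provider;
/// any other error is returned immediately, since retrying a client error
/// elsewhere would not help. The error value is the last HTTP status seen.
fn call_with_fallback<F>(providers: &[Provider], mut call: F) -> Result<(Provider, String), u16>
where
    F: FnMut(Provider) -> Result<String, u16>,
{
    let mut last_status = 503; // reported if the provider list is empty
    for &provider in providers {
        match call(provider) {
            Ok(body) => return Ok((provider, body)),
            Err(status) if status >= 500 => last_status = status,
            Err(status) => return Err(status),
        }
    }
    Err(last_status)
}
```

A fallback across providers also needs a model mapping, because a name such as gpt-4 does not exist on Ollama. That mapping is left out here.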

Prompt caching metrics — Anthropic and OpenAI both have prompt caching. Tracking cache hit rates per key would give teams actionable cost reduction data.


Quick Start

git clone https://github.com/MihirMohapatra/rust-ai-gateway
cd rust-ai-gateway
cp .env.example .env
# Add your OPENAI_API_KEY and ANTHROPIC_API_KEY (both optional)
docker-compose up -d
curl http://localhost:3000/health/live

Grafana dashboard at http://localhost:3002 (admin/admin).

Register and get an API key:

curl -X POST http://localhost:3000/auth/register \
  -H "Content-Type: application/json" \
  -d '{"email": "you@example.com", "password": "secure-pass"}'

curl -X POST http://localhost:3000/auth/keys \
  -H "Content-Type: application/json" \
  -d '{"email": "you@example.com", "password": "secure-pass", "key_name": "dev"}'

Point your existing OpenAI SDK at it:

from openai import OpenAI

client = OpenAI(
    api_key="aig_your_key_here",
    base_url="http://localhost:3000/v1"  # ← Only this line changes
)
# Everything else stays the same

Closing Thoughts

Building this forced me to think seriously about what "production-grade" actually means for infrastructure software. It's not just "it works." It's:

  • Observability first — You need to know what it's doing before it fails, not after
  • Failure modes are designed, not discovered — Graceful shutdown, circuit breaking, liveness vs readiness
  • Security is structural — Argon2id isn't a detail; it's the only defensible choice
  • Benchmarks are hypotheses — The 45 MB memory number is only meaningful relative to alternatives

If you're building AI-powered products and you're hitting the "which provider do I use and how do I track it" question, I hope this gives you a usable starting point.

Star the repo if it's useful, open issues if you find gaps, and PRs are welcome — especially for the streaming feature.


I'm a Senior Backend Engineer with a background in FinTech (HSBC, Westpac, FedEx) currently building AI/ML infrastructure in Rust and Go. Open to senior backend and AI infra roles with US/EU remote companies.

Find me on GitHub or LinkedIn.

Source: dev.to
