When deploying large language models to production, measuring performance accurately is critical. Whether you're using vLLM, SGLang, TensorRT-LLM, or a custom inference stack, you need to understand:
- Throughput: How many requests per second can your system handle?
- Latency metrics: Time to First Token (TTFT), Inter-Token Latency (ITL), and end-to-end latency
- Token generation speed: Tokens per second under different concurrency levels
- Tail latency: P95 and P99 values that affect user experience
In this post, I'll walk through the key metrics for benchmarking language models and share why I built llmperf-rs, a Rust-based benchmarking tool that takes a different approach to measuring these metrics.
The Problem with Existing Tools
While working with ray-project/llmperf (now archived), I noticed that Inter-Token Latency (ITL) was calculated by averaging per request first, then aggregating those averages. Averaging per request smooths out the distribution, which works well for many use cases, but I needed to preserve individual latency spikes during testing.
There's also genai-perf, which is very comprehensive. My only issue was getting it to run on Ubuntu 22.04 without Docker. As of this update, genai-perf has been sunsetted in favor of aiperf.
vllm-bench is solid too, but it requires installing vLLM.
The goal was to build a simple binary that runs almost anywhere with minimal dependencies. It was also a learning project.
Metrics
This is a summary of the full metrics documentation.
Time To First Token (TTFT)
TTFT measures how quickly the model begins responding after receiving your request. For interactive applications, this is the perceived latency before the user sees any output. It's also important for RAG-based applications where a large chunk of processing happens at the prefill stage.
TTFT = first_token_timestamp - request_start_timestamp
Lower is better.
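As a minimal illustration (not llmperf-rs's actual implementation), here is how TTFT could be computed from per-token arrival timestamps; the timestamps below are fabricated for the example:

```rust
use std::time::{Duration, Instant};

// TTFT = first_token_timestamp - request_start_timestamp
fn ttft(token_timestamps: &[Instant], request_start: Instant) -> Option<Duration> {
    token_timestamps
        .first()
        .map(|first| first.duration_since(request_start))
}

fn main() {
    let request_start = Instant::now();
    // Pretend the server streamed three tokens at these offsets.
    let timestamps = vec![
        request_start + Duration::from_millis(120), // first token -> TTFT = 120 ms
        request_start + Duration::from_millis(150),
        request_start + Duration::from_millis(180),
    ];
    println!("TTFT: {:?}", ttft(&timestamps, request_start));
}
```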
Inter-Token Latency (ITL)
ITL is the time between consecutive tokens during generation. Spikes can reveal multiple issues, most commonly network problems. ITL is usually consistent because of how the KV cache and the decode computation work.
When testing against vLLM, I noticed that high ITL spikes happen when you benchmark close to the context limit. I suspect this is due to vLLM's eviction of requests if they exceed the KV cache size.
For example, if 3 requests each need 0.8x of the context length for their prompts and 0.2x for generation, they require 3.0x in total; if the GPU's KV cache only has room for 2.8x, one of the requests will be preempted.
Aggregation: concatenate ALL ITL values across all responses, then compute statistics. Each response produces (N-1) ITL values (where N is the token count). By aggregating raw values instead of per-request averages, you preserve the true distribution including outliers.
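To make the difference concrete, here is a small sketch (illustrative only, not llmperf-rs internals) comparing per-request averaging with raw aggregation; the latency values are made up:

```rust
// Each inner Vec holds the ITL values (ms) of one response: N tokens -> N-1 values.
fn main() {
    let per_request_itls: Vec<Vec<f64>> = vec![
        vec![10.0, 11.0, 9.0],         // well-behaved request
        vec![10.0, 250.0, 12.0, 11.0], // one large spike
    ];

    // Per-request averaging: the 250 ms spike is diluted into a ~70 ms mean.
    let per_request_means: Vec<f64> = per_request_itls
        .iter()
        .map(|itls| itls.iter().sum::<f64>() / itls.len() as f64)
        .collect();

    // Raw aggregation: concatenate everything, then compute stats,
    // so max and high percentiles still see the 250 ms outlier.
    let mut all_itls: Vec<f64> = per_request_itls.into_iter().flatten().collect();
    all_itls.sort_by(|a, b| a.partial_cmp(b).unwrap());

    println!("per-request means: {:?}", per_request_means);
    println!("raw max ITL: {:?}", all_itls.last());
}
```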
Throughput Metrics
Prefill TPS — tokens processed per second during the prefill phase:
Prefill TPS = input_tokens / TTFT
However, prefill TPS doesn't accurately reflect system performance because TTFT includes queue wait time, not just actual processing time. When a server is under load, your request might sit in a queue waiting for resources. The lower prefill TPS in that case reflects queue contention, not the system's processing capability.
Decode TPS — tokens generated per second during the decode phase:
Decode TPS = output_tokens / (final_time - decode_start_time)
This is the generation speed: how fast the model produces output.
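A quick sketch of both formulas with made-up numbers, assuming decode starts at the first-token timestamp:

```rust
// Sketch of the two throughput formulas; the numbers are fabricated for illustration.
fn main() {
    let input_tokens = 1024.0_f64;
    let output_tokens = 256.0_f64;

    let ttft_secs = 0.8_f64;   // request start -> first token (includes any queue wait)
    let decode_secs = 5.2_f64; // decode_start_time (~first token) -> final_time

    let prefill_tps = input_tokens / ttft_secs;   // Prefill TPS = input_tokens / TTFT
    let decode_tps = output_tokens / decode_secs; // Decode TPS = output_tokens / (final_time - decode_start_time)

    println!("prefill TPS: {prefill_tps:.1}, decode TPS: {decode_tps:.1}");
}
```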
What Matters Most
For production serving, focus on TTFT, ITL statistics, and possibly requests per minute (RPM).
TTFT measures how quickly users see their first token — this is the perceived responsiveness of your system.
ITL statistics reveal decode-phase issues that throughput metrics hide. The 99th percentile and max ITL values expose preemption events from KV cache limits and network issues between components.
ITL matters less for batch jobs or non-streaming APIs where users don't watch tokens arrive in real-time.
Token Counting
Accurate metrics require accurate token counts. llmperf-rs handles this in two ways:
- API response: Most OpenAI-compatible endpoints return token counts in the usage field. By default, llmperf-rs uses this as the priority source.
- Tokenizer: For exact input counts, pass a HuggingFace tokenizer. Note that chat templates may cause a variance of <10 tokens.
The original llmperf uses a single tokenizer for all models. Different models use different tokenizers, so llmperf-rs lets you specify the correct one or rely on API-reported counts.
For example, Llama-2 has a vocab size of 32000, while Qwen3-4B has 151936. In my own testing, setting input tokens to 8192 against a Qwen endpoint while using the default Llama tokenizer returned counts of around 7363-7376 tokens.
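For illustration, here is a sketch of counting input tokens with the model's own tokenizer via the HuggingFace tokenizers crate (from_pretrained needs the crate's http feature); the model id and prompt are placeholders, and this is not llmperf-rs's exact code:

```rust
// Count input tokens with the target model's tokenizer instead of a hard-coded one.
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder model id; swap in the tokenizer that matches your endpoint.
    let tokenizer = Tokenizer::from_pretrained("Qwen/Qwen3-4B", None)?;
    let prompt = "Benchmark prompt goes here...";
    let encoding = tokenizer.encode(prompt, false)?; // no special tokens added
    println!("input tokens: {}", encoding.get_ids().len());
    Ok(())
}
```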
Validating Your Results
All benchmark runs should end with finish_reason = length (meaning the model hit the max_tokens limit). If you see finish_reason = stop, the model stopped early, which skews metrics like RPM and E2E latency: a higher rate of early stops means shorter responses, and therefore higher RPM and lower latency.
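A minimal sketch of that check, assuming you keep the raw JSON body of each non-streaming OpenAI-compatible response (field names follow the OpenAI schema):

```rust
// Flag responses that stopped early instead of hitting max_tokens.
use serde_json::Value;

fn stopped_early(body: &str) -> bool {
    let v: Value = serde_json::from_str(body).unwrap_or(Value::Null);
    v["choices"]
        .as_array()
        .into_iter()
        .flatten()
        .any(|choice| choice["finish_reason"] == "stop")
}

fn main() {
    let body = r#"{"choices":[{"finish_reason":"length"}]}"#;
    println!("stopped early: {}", stopped_early(body)); // false -> run is valid
}
```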
When to Use llmperf-rs
Use llmperf-rs when you want to run benchmarks with minimal dependencies, test OpenAI-compatible endpoints, keep overhead low (Rust, no Ray/ZMQ), or quickly check an endpoint.
Consider alternatives when you need GPU-level metrics (use trtllm-bench or aiperf), are testing vLLM-specific features, require extensive reporting dashboards, or need distributed testing.
Why ITL Matters Even When Throughput Looks Good
High throughput with bad ITL means tokens arrive in bursts, and chat users notice the choppy streaming. ITL spikes (p99 >100ms) often indicate preemption, network issues, or other problems. For non-user-facing use cases like agentic coding, throughput may matter more than ITL specifics.
The full version with code examples, benchmarks, and installation instructions is on my blog.