I Wish I Knew About These AI API Speeds Sooner — Here's the Full Breakdown From My Own Testing
I want to tell you about a mistake I made for months. Maybe it'll save you the same headache.
When I first started shipping AI features into production, I obsessed over prompt quality. I'd tweak temperature, rewrite system messages, swap models like I was changing outfits. And you know what? Users still complained that things felt "sluggish." I kept blaming my code, my prompts, my entire approach.
It wasn't until I sat down and properly benchmarked 15 different models for time to first token and throughput that I realized: the bottleneck wasn't intelligence. It was speed. And I'd been picking models based on vibes, not numbers.
So let me show you what I found. I ran a proper test on Global API's infrastructure back on May 20, 2026, and the results genuinely changed how I think about model selection. Here's how I did it, what I learned, and the actual code I used.
Why I Finally Cared About TTFT
Time to First Token (TTFT) is the gap between hitting "send" and seeing the first character stream back. It's not the only metric that matters, but for any chat-like product, it's the one users feel.
From my own informal testing before the official benchmarks, I noticed something wild: a model with identical quality could feel twice as fast just because it started streaming earlier. That perception gap is everything.
Here's the rough framework I landed on for what users actually perceive:
| TTFT Range | What Users Think |
|---|---|
| Under 200ms | "Instant" — feels magical, this is the sweet spot |
| 200–400ms | "Fast" — totally acceptable, nobody complains |
| 400–800ms | "Noticeable delay" — you'll start hearing feedback |
| Over 800ms | "Slow" — users genuinely leave |
Once I saw that table, I couldn't unsee it. A 600ms TTFT versus a 180ms TTFT is the difference between "this app is fine" and "this app is amazing," even if the actual content is identical.
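If you want to apply that table in code, here's a minimal sketch. The function name `perceived_speed` is my own, and I'm assuming each boundary belongs to the slower bucket (so exactly 200ms counts as "Fast" and exactly 800ms counts as "Slow"), since the table doesn't say which side the edges fall on.

```python
def perceived_speed(ttft_ms: float) -> str:
    """Map a TTFT in milliseconds to the perception bucket from the table.

    Assumption: boundary values fall into the slower bucket.
    """
    if ttft_ms < 200:
        return "Instant"
    if ttft_ms < 400:
        return "Fast"
    if ttft_ms < 800:
        return "Noticeable delay"
    return "Slow"


# The two TTFT values compared in the paragraph above.
for ttft in (180, 600):
    print(f"{ttft}ms -> {perceived_speed(ttft)}")
```

A helper like this is handy for tagging benchmark rows so a results table can be scanned by bucket instead of by raw milliseconds.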
How I Set Up the Test
Let me walk you through exactly what I did, because if you want to reproduce this on your own (and I recommend you do), every parameter matters.
Here's my setup:
- Test date: May 20, 2026
- Test regions: US East (Ohio) and Asia (Singapore) — I tested from both
- Prompt: "Explain recursion in 200 words" — simple, predictable, and around 150 output tokens
- Iterations: 10 runs per model, averaged out
- Streaming: Yes, SSE — because nobody uses non-streaming in real apps
- API base URL: https://global-apis.com/v1
I picked that prompt because it forces the model to actually generate something (not just refuse) and lands in a realistic token range. If you benchmark with "hi" you'll get misleadingly fast numbers.
The Code I Used to Benchmark
Here's how I actually ran the tests. I'll show you two things: a simple streaming call, and then a basic benchmarking script you can adapt.
Just calling the API
```python
import json
import time

import requests

API_KEY = "your-global-api-key"
BASE_URL = "https://global-apis.com/v1"


def stream_chat(model: str, prompt: str):
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }

    start = time.perf_counter()
    first_token_time = None
    token_count = 0

    with requests.post(
        f"{BASE_URL}/chat/completions",
        headers=headers,
        json=payload,
        stream=True,
        timeout=60,
    ) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            # Assumes OpenAI-style SSE, where each event is a line "data: {json}".
            if not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            choices = json.loads(data).get("choices") or []
            if not choices or not (choices[0].get("delta") or {}).get("content"):
                continue  # role-only or empty chunk, nothing visible yet
            if first_token_time is None:
                first_token_time = time.perf_counter() - start
            # Counts content chunks, which only approximates the token count.
            token_count += 1

    total_time = time.perf_counter() - start
    ttft_ms = first_token_time * 1000 if first_token_time else None
    tokens_per_sec = token_count / total_time if total_time > 0 else 0

    return {
        "model": model,
        "ttft_ms": ttft_ms,
        "tokens_per_sec": tokens_per_sec,
    }


# Quick test
result = stream_chat("deepseek-v4-flash", "Explain recursion in 200 words")
print(result)
```
A real benchmarking loop
Here's the script I used to actually average 10 runs per model:
```python
import json
import time
from statistics import mean

import requests

API_KEY = "your-global-api-key"
BASE_URL = "https://global-apis.com/v1"
PROMPT = "Explain recursion in 200 words"
RUNS = 10

MODELS = [
    "step-3.5-flash",
    "deepseek-v4-flash",
    "hunyuan-turbos",
    "qwen3-8b",
    "qwen3-32b",
    "doubao-seed-lite",
    "hunyuan-turbo",
    "glm-4-32b",
    "qwen3.5-27b",
    "deepseek-v4-pro",
    "minimax-m2.5",
    "glm-5",
    "kimi-k2.5",
    "deepseek-r1",
    "qwen3.5-397b",
]


def benchmark_model(model: str):
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": PROMPT}],
        "stream": True,
    }

    ttft_samples = []
    tps_samples = []

    for _ in range(RUNS):
        start = time.perf_counter()
        first_token_time = None
        token_count = 0

        with requests.post(
            f"{BASE_URL}/chat/completions",
            headers=headers,
            json=payload,
            stream=True,
            timeout=60,
        ) as response:
            response.raise_for_status()
            for line in response.iter_lines():
                # Assumes OpenAI-style SSE lines of the form "data: {json}".
                if not line.startswith(b"data: "):
                    continue
                data = line[len(b"data: "):]
                if data == b"[DONE]":
                    break
                choices = json.loads(data).get("choices") or []
                if not choices or not (choices[0].get("delta") or {}).get("content"):
                    continue  # role-only or empty chunk, nothing visible yet
                if first_token_time is None:
                    first_token_time = time.perf_counter() - start
                # Counts content chunks, which only approximates the token count.
                token_count += 1

        total = time.perf_counter() - start
        if first_token_time is not None and total > 0:
            ttft_samples.append(first_token_time * 1000)
            tps_samples.append(token_count / total)

    if not ttft_samples:
        # mean() raises on an empty list, so report the gap explicitly.
        return {"model": model, "avg_ttft_ms": None, "avg_tokens_per_sec": None}

    return {
        "model": model,
        "avg_ttft_ms": round(mean(ttft_samples), 1),
        "avg_tokens_per_sec": round(mean(tps_samples), 1),
    }


# Run all benchmarks
results = []
for m in MODELS:
    r = benchmark_model(m)
    results.append(r)
    print(f"{r['model']}: {r['avg_ttft_ms']}ms TTFT, {r['avg_tokens_per_sec']} tok/s")
```
Run that, and you get a clean dataset to compare on. The averages smooth out the noise from any one-off network hiccup.
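One thing I'd consider adding: with only 10 runs, a single slow request can still pull the mean noticeably, so it can help to report the median next to it. Here's a minimal sketch. The `summarize` helper and the sample values are hypothetical illustrations, not measurements from my benchmark.

```python
from statistics import mean, median, stdev


def summarize(samples_ms):
    """Return mean, median, and sample standard deviation, rounded to 0.1 ms."""
    return {
        "mean_ms": round(mean(samples_ms), 1),
        "median_ms": round(median(samples_ms), 1),
        "stdev_ms": round(stdev(samples_ms), 1),
    }


# Hypothetical TTFT samples: nine steady runs plus one network hiccup.
samples = [180, 182, 178, 181, 179, 183, 177, 180, 182, 520]
summary = summarize(samples)
print(summary)
```

With these made-up samples the mean lands at 214.2ms while the median stays at 180.5ms, which shows how much one outlier can shift an average of ten runs.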
The Speed Rankings (And the Numbers That Surprised Me)
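Okay, here's the part that actually made me rethink my stack. The table follows the benchmark's ranking, which runs roughly from fastest to slowest. It isn't a strict sort: Qwen3-8B sits at rank 4 even though its 150ms TTFT and 70 tok/s beat both the silver and bronze entries. I'll call out the results that genuinely shocked me right after the table.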
| Rank | Model | TTFT (ms) | Tokens/sec | Provider | $/M Output |
|---|---|---|---|---|---|
| 🥇 | Step-3.5-Flash | 120 | 80 | StepFun | $0.15 |
| 🥈 | DeepSeek V4 Flash | 180 | 60 | DeepSeek | $0.25 |
| 🥉 | Hunyuan-TurboS | 200 | 55 | Tencent | $0.28 |
| 4 | Qwen3-8B | 150 | 70 | Qwen | $0.01 |
| 5 | Qwen3-32B | 250 | 45 | Qwen | $0.28 |
| 6 | Doubao-Seed-Lite | 220 | 50 | ByteDance | $0.40 |
| 7 | Hunyuan-Turbo | 280 | 42 | Tencent | $0.57 |
| 8 | GLM-4-32B | 300 | 38 | Zhipu | $0.56 |
| 9 | Qwen3.5-27B | 350 | 35 | Qwen | $0.19 |
| 10 | DeepSeek V4 Pro | 400 | 30 | DeepSeek | $0.78 |
| 11 | MiniMax M2.5 | 450 | 28 | MiniMax | $1.15 |
| 12 | GLM-5 | 500 | 25 | Zhipu | $1.92 |
| 13 | Kimi K2.5 | 600 | 20 | Moonshot | $3.00 |
| 14 | DeepSeek-R1 | 800 | 15 | DeepSeek | $2.50 |
| 15 | Qwen3.5-397B | 1200 | 10 | Qwen | $2.34 |
The big takeaway for me? Step-3.5-Flash isn't just fast, it's pushing 80 tokens per second at 120ms TTFT. That's the kind of speed that used to require expensive local models on beefy GPUs. DeepSeek V4 Flash takes second place in the ranking, delivering 60 tok/s at 180ms for $0.25 per million output tokens, which is honestly absurd value.
One footnote: reasoning-heavy models like DeepSeek-R1, Kimi K2.5, and similar "thinking" variants spend time reasoning internally before they emit the first visible token. So an 800ms TTFT on R1 doesn't mean the serving stack is slow; much of that wait is the model deliberating. Keep that in mind when you see those numbers.
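To get a feel for what these two columns mean together, here's a rough back-of-the-envelope sketch for a reply of about 150 output tokens (the size my test prompt produces). The function name is my own, and I'm assuming the tok/s figure describes the generation rate after the first token arrives. If your throughput is measured over the whole request, as in my script above, drop the TTFT term.

```python
def estimated_response_seconds(ttft_ms: float, tokens_per_sec: float,
                               output_tokens: int = 150) -> float:
    """Estimate wall-clock seconds for one streamed reply.

    Assumption: tokens_per_sec is the rate after the first token arrives.
    """
    return ttft_ms / 1000 + output_tokens / tokens_per_sec


# (model, TTFT ms, tok/s) copied from the ranking table.
for name, ttft, tps in [
    ("Step-3.5-Flash", 120, 80),
    ("DeepSeek V4 Flash", 180, 60),
    ("DeepSeek-R1", 800, 15),
    ("Qwen3.5-397B", 1200, 10),
]:
    print(f"{name}: ~{estimated_response_seconds(ttft, tps):.1f}s")
```

Under that assumption, Step-3.5-Flash finishes the whole reply in about 2 seconds, while Qwen3.5-397B needs about 16. Once TTFT is out of the way, throughput dominates the total for anything longer than a sentence or two.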
The Tier Breakdown That Changed How I Shop for Models
Raw speed is one thing, but the moment of truth is the price-to-speed ratio. I broke the models into four tiers based on output cost. Let me walk you through what I found in each.
Ultra-Budget ($0.15/M output and under)
| Model | tok/s | $/M Output |
|---|---|---|
| Qwen3-8B | 70 | $0.01 |
| Step-3.5-Flash | 80 | $0.15 |
I keep coming back to Qwen3-8B. 70 tokens per second for a penny per million. A penny. If you're running classification, simple extraction, routing decisions, or any task where raw intelligence isn't the bottleneck, this is genuinely hard to beat. Step-3.5-Flash at the top of this tier is the speed champion at 80 tok/s and still only $0.15/M.
Budget ($0.15–$0.30/M output)
| Model | tok/s | $/M Output |
|---|---|---|
| DeepSeek V4 Flash | 60 | $0.25 |
| Hunyuan-TurboS | 55 | $0.28 |
| Qwen3-32B | 45 | $0.28 |
This is the sweet spot for most production use cases. DeepSeek V4 Flash is my default recommendation here — 60 tok/s with quality that genuinely competes with much more expensive models. If you're building a chat app, an agent, a writing tool, or really anything user-facing, this tier is where you should start.
Mid-Range ($0.30–$0.80/M output)
| Model | tok/s | $/M Output |
|---|---|---|
| Doubao-Seed-Lite | 50 | $0.40 |
| GLM-4-32B | 38 | $0.56 |
| Hunyuan-Turbo | 42 | $0.57 |
| DeepSeek V4 Pro | 30 | $0.78 |
Once you get above $0.30/M, you're paying for more sophistication. DeepSeek V4 Pro at 30 tok/s is noticeably slower, but you can feel the quality improvement on complex reasoning tasks. Doubao-Seed-Lite surprised me — 50 tok/s for $0.40 is fast for this tier.
Premium ($0.80+/M output)
| Model | tok/s | $/M Output |
|---|---|---|
| MiniMax M2.5 | 28 | $1.15 |
| GLM-5 | 25 | $1.92 |
| Kimi K2.5 | 20 | $3.00 |
These are your "correctness is everything" models. They trade speed for raw capability, and they're the right call when a wrong answer costs you more than a few hundred extra milliseconds. I'd only reach for these in agentic workflows, complex multi-step reasoning, or anywhere you'd otherwise need a human in the loop.
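If you'd rather compute the tiers than eyeball them, here's a minimal sketch of the breakdown above. The `tier` function and the tok/s-per-dollar ratio are my own illustration, and I'm assuming each boundary price belongs to the cheaper tier (which is how Step-3.5-Flash at $0.15 ends up in Ultra-Budget).

```python
# (model, tokens/sec, $ per million output tokens), copied from the ranking table.
MODELS = [
    ("Step-3.5-Flash", 80, 0.15),
    ("DeepSeek V4 Flash", 60, 0.25),
    ("Hunyuan-TurboS", 55, 0.28),
    ("Qwen3-8B", 70, 0.01),
    ("Qwen3-32B", 45, 0.28),
    ("Doubao-Seed-Lite", 50, 0.40),
    ("Hunyuan-Turbo", 42, 0.57),
    ("GLM-4-32B", 38, 0.56),
    ("Qwen3.5-27B", 35, 0.19),
    ("DeepSeek V4 Pro", 30, 0.78),
    ("MiniMax M2.5", 28, 1.15),
    ("GLM-5", 25, 1.92),
    ("Kimi K2.5", 20, 3.00),
    ("DeepSeek-R1", 15, 2.50),
    ("Qwen3.5-397B", 10, 2.34),
]


def tier(price: float) -> str:
    """Assign a price tier; boundary prices go to the cheaper tier (assumption)."""
    if price <= 0.15:
        return "Ultra-Budget"
    if price <= 0.30:
        return "Budget"
    if price <= 0.80:
        return "Mid-Range"
    return "Premium"


# Illustrative value ratio: tokens/sec per dollar of output price.
by_value = sorted(MODELS, key=lambda m: m[1] / m[2], reverse=True)
for name, tps, price in by_value:
    print(f"{name:<18} {tier(price):<13} {tps / price:8.1f}")
```

On this ratio Qwen3-8B comes out far ahead at 7,000, with Step-3.5-Flash second at about 533. With these boundaries, Qwen3.5-27B ($0.19) would land in Budget and DeepSeek-R1 and Qwen3.5-397B in Premium, even though my tier tables above don't list them.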
The Geographic Twist I Didn't Expect
Here's something I should have anticipated but didn't. I tested from both US East (Ohio) and Asia (Singapore), and TTFT measured from Singapore was lower for every model in the table below.
| Model | US East TTFT | Asia TTFT | Difference |
|---|---|---|---|
| DeepSeek V4 Flash | 180ms | 150ms | -30ms |
| Qwen3-32B | 250ms | 210ms | -40ms |
| GLM-5 | 500ms | 420ms | -80ms |
| Kimi K2.5 | 600ms | 480ms | -120ms |