So you saw Mistral dropped their new open-weight 128B parameter model and thought "I should run this locally." You pulled the weights, fired up your inference server, and immediately got slapped with an OOM error. Yeah. Been there.
Serving large dense models is a different beast than the 7B or 13B models most of us cut our teeth on. Mistral Medium 3.5 128B is a fully dense 128 billion parameter model with a 256k token context window, vision capabilities, and native function calling. It's genuinely impressive on benchmarks — but none of that matters if you can't actually get it running.
Let me walk through the problems you'll hit and how to solve each one.
The Root Cause: Dense Models Are Memory Hogs
Here's the fundamental math that ruins your day. A 128B parameter model in BF16 (the higher-precision of the two formats Mistral ships) requires roughly 256 GB of GPU VRAM just for the model weights: 128 billion parameters × 2 bytes each. That's before you account for KV cache, activation memory, or any batching overhead.
A single H100 has 80 GB of VRAM. A single A100 tops out at 80 GB. You physically cannot fit this model on one GPU. Period.
This isn't a configuration problem or a skill issue — it's physics. You need to split the model across multiple GPUs using tensor parallelism.
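The arithmetic is worth internalizing because it generalizes to any dense model: one billion parameters costs one GB per byte of precision. A quick sketch in plain Python:

# Weights-only memory: 1B params × N bytes/param = N GB per billion parameters
def weight_gb(params_billions: float, bytes_per_param: int) -> float:
    return params_billions * bytes_per_param

for precision, nbytes in (("BF16", 2), ("FP8", 1)):
    print(f"{precision}: ~{weight_gb(128, nbytes):.0f} GB for weights alone")
# BF16: ~256 GB, FP8: ~128 GB; neither fits on one 80 GB card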
Step 1: Get Your Hardware Straight
You need a minimum of 4x 80GB GPUs (H100 or A100) to serve this model in FP8, and realistically 8 GPUs for comfortable BF16 serving with room for KV cache.
Here's the rough breakdown:
- FP8 (F8_E4M3): ~128 GB for weights → 4x H100 minimum, but tight
- BF16: ~256 GB for weights → 4x H100 won't cut it, go with 8
- KV cache for 256k context: this eats additional VRAM per concurrent request
Mistral ships the weights in both BF16 and F8_E4M3 formats on the Hugging Face repo. If you're memory-constrained, the FP8 weights are your friend.
Step 2: Use vLLM With Tensor Parallelism
vLLM is the recommended serving framework, and honestly it's the path of least resistance here. The key flag is --tensor-parallel-size, which splits the model across your GPUs.
# Serve with 8 GPUs using tensor parallelism
vllm serve mistralai/Mistral-Medium-3.5-128B \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.8 \
    --max-num-batched-tokens 16384 \
    --tool-call-parser mistral \
    --enable-auto-tool-choice
A few things to note:
- --gpu-memory-utilization 0.8 caps vLLM's memory usage at 80% of each GPU (covering weights and KV cache), leaving headroom for the CUDA context and other overhead. Don't crank this to 0.95 thinking you're being clever — you'll get sporadic OOM crashes under load.
- --max-num-batched-tokens 16384 caps how many tokens can be processed in a single batch. Lower this if you're still hitting memory issues.
- The --tool-call-parser mistral flag enables native function calling support, which is one of this model's strongest features.
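Once the weights finish loading (which takes a while at this size), here's a quick sanity check that the server is actually up, assuming the default port 8000:

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API on localhost:8000 by default
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# The served model should appear in the model list
for model in client.models.list():
    print(model.id)  # mistralai/Mistral-Medium-3.5-128B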
Step 3: Handle the Context Window Trap
The model supports 256k tokens of context. That doesn't mean you should use all of it.
Every token in the context window requires KV cache memory that scales linearly with sequence length. If you try to serve multiple concurrent requests each using 100k+ tokens, your VRAM will evaporate.
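To put numbers on that, here's a rough per-request KV cache estimator. Important hedge: the layer count, KV-head count, and head dimension below are illustrative placeholders, not this model's actual architecture; read the real values from the config.json in the Hugging Face repo.

# KV cache per request: 2 (K and V) × layers × kv_heads × head_dim × bytes × tokens
def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9

# Placeholder architecture values; substitute the real ones from config.json
LAYERS, KV_HEADS, HEAD_DIM = 88, 8, 128

for tokens in (32_768, 131_072, 262_144):
    print(f"{tokens:>7} tokens -> ~{kv_cache_gb(tokens, LAYERS, KV_HEADS, HEAD_DIM):.0f} GB per request")

With numbers in that ballpark, a handful of concurrent full-context requests exceeds the VRAM of an entire 8-GPU node. Hence the standard mitigation: cap sequence length at serve time.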
# Limit max sequence length if you don't need the full 256k
vllm serve mistralai/Mistral-Medium-3.5-128B \
    --tensor-parallel-size 8 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.85 \
    --max-num-batched-tokens 16384
Setting --max-model-len 32768 caps each request at 32k tokens. For most use cases — code generation, chat, function calling — 32k is more than enough. Only bump this up if you genuinely need long-context processing, and accept that you'll serve fewer concurrent requests.
Step 4: Configure Reasoning Effort Properly
Mistral Medium 3.5 has a configurable reasoning mode that a lot of people miss. You can set reasoning_effort to "high" or "none" per request.
This matters for performance because "high" reasoning generates internal chain-of-thought tokens before responding — which means more compute time and more memory per request.
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Quick factual query — no reasoning needed
fast_response = client.chat.completions.create(
    model="mistralai/Mistral-Medium-3.5-128B",
    messages=[{"role": "user", "content": "What's the HTTP status code for 'not found'?"}],
    extra_body={"reasoning_effort": "none"},  # skip chain-of-thought
    temperature=0.0,
)

# Complex debugging task — use full reasoning
deep_response = client.chat.completions.create(
    model="mistralai/Mistral-Medium-3.5-128B",
    messages=[{"role": "user", "content": "Analyze this stack trace and suggest fixes..."}],
    extra_body={"reasoning_effort": "high"},  # enable chain-of-thought
    temperature=0.7,  # recommended temp for reasoning mode
)
Route simple requests to "none" and save "high" for complex tasks. Your throughput will thank you.
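In practice that routing can be a thin wrapper around the client. The keyword heuristic below is a deliberately dumb placeholder; swap in whatever request classification you already have:

# Hypothetical routing heuristic: keyword matching stands in for a real classifier
COMPLEX_HINTS = ("debug", "analyze", "stack trace", "refactor", "prove")

def pick_effort(prompt: str) -> str:
    return "high" if any(hint in prompt.lower() for hint in COMPLEX_HINTS) else "none"

def ask(client, prompt: str):
    effort = pick_effort(prompt)
    return client.chat.completions.create(
        model="mistralai/Mistral-Medium-3.5-128B",
        messages=[{"role": "user", "content": prompt}],
        extra_body={"reasoning_effort": effort},
        temperature=0.7 if effort == "high" else 0.0,
    )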
Step 5: Speed Up Inference With Speculative Decoding
Mistral released an EAGLE model alongside the main weights specifically for speculative decoding. This is basically a small draft model that proposes tokens which the large model then verifies in parallel. It can significantly reduce latency.
Both vLLM and SGLang support this. Check the model card on Hugging Face for the specific EAGLE model reference — it's linked directly from the Mistral Medium 3.5 repo.
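I won't pretend to give you copy-paste config here, since the exact draft-model id lives on the model card, but the shape of the wiring in vLLM's offline API looks roughly like this (speculative_config is how recent vLLM versions express it; the draft model name below is a placeholder):

from vllm import LLM

# Sketch only. The EAGLE draft model id below is a placeholder, not a real repo;
# take the actual id from the Mistral Medium 3.5 model card on Hugging Face.
llm = LLM(
    model="mistralai/Mistral-Medium-3.5-128B",
    tensor_parallel_size=8,
    speculative_config={
        "method": "eagle",
        "model": "<eagle-draft-model-from-the-model-card>",  # placeholder
        "num_speculative_tokens": 5,  # draft tokens proposed per verification step
    },
)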
The SGLang Alternative
If vLLM gives you trouble, SGLang is a solid alternative that's also production-ready for this model. The setup is similar — you'll still need tensor parallelism across multiple GPUs. I've found SGLang sometimes handles high-concurrency workloads slightly better, but honestly both are good choices. Pick whichever your team already has experience with.
Prevention: Capacity Planning Before You Deploy
Before you commit to serving a 128B model, ask yourself these questions (a rough capacity calculator follows the list):
- What's your expected concurrent request count? Each concurrent request needs KV cache memory. More users = more VRAM.
- What context lengths will your users actually need? Don't pay for 256k if 8k covers 95% of your traffic.
- Is FP8 precision acceptable? For most applications, the quality difference between BF16 and FP8 is negligible. The memory savings are not.
- Do you actually need 128B? Mistral has smaller models that might solve your problem at a fraction of the cost. Don't serve a 128B model for tasks where a 24B model would do.
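Here's a rough planner that puts those questions into numbers. It reuses the same back-of-the-envelope arithmetic as above, and the default kv_gb_per_token is the same illustrative ~0.7 MB/token figure from the earlier estimator, not a measured value for this model:

import math

# Rough GPU-count planner: weights + steady-state KV cache, divided by usable VRAM
def gpus_needed(weight_gb: float, concurrent_requests: int, avg_tokens: int,
                kv_gb_per_token: float = 0.0007, gpu_gb: float = 80.0,
                utilization: float = 0.8) -> int:
    total_gb = weight_gb + concurrent_requests * avg_tokens * kv_gb_per_token
    return math.ceil(total_gb / (gpu_gb * utilization))

# Example: FP8 weights (~128 GB), 16 concurrent requests averaging 8k tokens
print(gpus_needed(128, concurrent_requests=16, avg_tokens=8_192))  # -> 4

Round the answer up to a tensor-parallel size your deployment actually supports (typically a power of two), and treat it as a floor, not a comfortable target.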
Quick Troubleshooting Reference
| Symptom | Likely Cause | Fix |
|---|---|---|
| OOM on startup | Not enough GPUs | Increase --tensor-parallel-size or use FP8 weights |
| OOM under load | KV cache overflow | Lower --max-model-len or --gpu-memory-utilization |
| Slow first response | Model loading | Expected — 128B takes time to load. Use model preloading |
| Garbled function calls | Wrong parser | Add --tool-call-parser mistral flag |
| Inconsistent outputs | Temperature too high | Use 0.0 for deterministic tasks, 0.7 for reasoning |
Wrapping Up
Serving a 128B dense model isn't something you stumble into successfully. It requires deliberate hardware planning, proper tensor parallelism configuration, and smart memory management. The good news is that with vLLM or SGLang and enough GPUs, it's entirely doable — and Mistral Medium 3.5 is genuinely worth the effort for agentic and function-calling workloads.
Start with FP8 weights on 4-8 GPUs, cap your context length conservatively, route requests by reasoning effort, and scale up from there. The model is open-weight under a modified MIT license, so you can actually inspect what you're deploying — which is more than I can say for most models at this capability level.