Last month I was sitting in a client's office — no Wi-Fi, a locked-down corporate laptop, and a deadline to rewrite 40 API endpoint descriptions. Every cloud-based AI tool I normally rely on was completely unreachable. So I pulled a Samsung T7 SSD from my bag, plugged it into the USB-C port, and had a portable LLM on a USB stick generating clean documentation in under two minutes. No internet. No API keys. No data leaving the room.
This wasn't a party trick. It's genuinely the most useful thing I built all year. And the setup is far simpler than you'd think.
Why a Portable LLM on a USB Stick Actually Matters
Running AI locally isn't new. I wrote a complete guide to running local LLMs earlier this year, and it's still one of my most-read posts. But there's a real difference between "local" and "portable." A local setup is tied to one machine. A portable setup lives on an external drive and works on whatever computer you plug it into.
I didn't appreciate that difference until I needed it. Here's when I actually reach for this thing:
- Air-gapped environments. Client sites, government offices, secure labs. Anywhere Wi-Fi is restricted or flat-out untrusted.
- Travel without connectivity. I wrote half this post on a flight to Vancouver with Llama 3.1 8B generating outlines for me. Planes and trains are where this setup really shines.
- Privacy-sensitive work. Legal documents, medical notes, proprietary code. The Electronic Frontier Foundation has consistently argued that running AI locally is a meaningful step for user privacy — your data never touches a third-party server, can't train future models, and can't get exposed in a breach.
- Machine-hopping. I move between a MacBook, a Windows desktop, and a Linux workstation. One USB drive serves all three.
After using this setup daily for six weeks, I'm comfortable saying: it's not gimmicky. It's just practical.
What You Need: Hardware and Software
The hardware requirements are straightforward, but the drive you choose makes or breaks the experience.
The USB drive matters enormously. A standard USB 3.0 flash drive with 150-250 MB/s read speeds will make model loading painfully slow. Don't do it. You want a portable NVMe SSD with USB 3.2 Gen 2 speeds. The Samsung T7 delivers up to 1,050 MB/s reads, and I've consistently seen 800-900 MB/s in real-world use. The 1TB model gives you room for 5-8 quantized models at once. The SanDisk Extreme Pro is another solid pick.
For reference, an internal NVMe SSD exceeds 3,000 MB/s, so you will notice the difference during model loading. But once the model is in RAM, inference speed is identical to a local install. The bottleneck shifts entirely to CPU and memory.
Minimum specs for the host machine:
- 16GB RAM (8GB technically works with small models, but you'll be swapping constantly)
- Any modern x86_64 or ARM processor
- A free USB-C or USB-A 3.0+ port
Software: Ollama is what makes this work. It's open-source, runs on macOS, Windows, and Linux, and supports a single environment variable — OLLAMA_MODELS — that redirects where models are stored. That one variable is the entire trick behind the portable setup. The Ollama GitHub repository documents all supported environment variables in its configuration.
Setting Up Your Portable AI Drive Step by Step
Here's the exact process I use. The whole thing takes about 20 minutes, and most of that is download time.
Step 1: Format and prepare the drive. Use exFAT if you need cross-platform compatibility between Windows, macOS, and Linux. It handles large files without the 4GB limitation of FAT32. Create a folder called ollama-models at the root of the drive.
Step 2: Install Ollama on your primary machine. Download from ollama.com, install normally. This only needs to happen once per machine. On macOS and Linux it's a single command. Windows is a standard installer.
Step 3: Point Ollama at your USB drive. Before pulling any models, set the OLLAMA_MODELS environment variable to your external drive. On macOS or Linux: export OLLAMA_MODELS=/Volumes/YourDrive/ollama-models (adjust the path for your mount point). On Windows, set it via System Properties or PowerShell. This tells Ollama to store and read all model files from the USB drive instead of your home directory.
Step 4: Pull your models. Run ollama pull llama3.1:8b to grab Meta's Llama 3.1 8B. In Q4_K_M quantization (Ollama's default), this downloads roughly 4.7GB to your USB drive. Then grab a second model for variety — I'd go with ollama pull gemma3:4b for something lighter that's surprisingly capable for its size.
Step 5: Create a launcher script. Write a small shell script (or batch file on Windows) that sets the OLLAMA_MODELS variable and starts Ollama. Save it on the USB drive. This is the "plug and play" piece — on any new machine with Ollama installed, you run the script from your drive and everything just works.
Here's a quick demo of the general Ollama workflow if you haven't used it before:
[YOUTUBE:AYvZJCo6D_Y|Run your own private LLM in 10 minutes]
The key: models live on the drive, not the machine. Plug into a different laptop, set the variable, and your entire AI toolkit is there.
Which Models Work Best on a Portable USB Setup
Not every model is a good fit for this. I tested about a dozen. Here are the ones that actually earned a spot on my drive:
| Model | Disk Size | RAM Needed | Best For |
|---|---|---|---|
| Llama 3.1 8B (Q4_K_M) | ~4.7 GB | 8 GB | General-purpose, coding, writing |
| Gemma 3 4B | ~3.0 GB | 6 GB | Fast responses, summarization |
| Llama 3.2 3B | ~2.0 GB | 4 GB | Low-RAM machines, quick tasks |
| Mistral 7B (Q4_K_M) | ~4.4 GB | 8 GB | Instruction-following, European languages |
| Phi-3 Mini (3.8B) | ~2.3 GB | 4 GB | Surprisingly good reasoning for its size |
My daily driver is Llama 3.1 8B. It handles coding questions, writing assistance, and data transformation well enough that I rarely miss cloud access. For machines with only 8GB RAM, Llama 3.2 3B is the right call. It won't match the 8B model's quality, but it's responsive and genuinely useful.
I benchmarked these on my portable setup against the same models running from an internal SSD. Model load time was 3-4x slower from USB — about 8 seconds vs 2 seconds for Llama 3.1 8B on the Samsung T7. But token generation speed? Identical. Around 12-15 tokens per second on an M2 MacBook Air. Inference is CPU and RAM-bound, not storage-bound. If you're curious how local models compare to cloud AI for coding specifically, I ran a detailed benchmark of local LLMs vs Claude that breaks down the real tradeoffs.
Can You Run an LLM Entirely From a USB Drive Without Installing Anything?
This is the question everyone asks, and the honest answer is: almost, but not quite.
Ollama needs to be installed on the host machine. The models are fully portable on the USB drive, but the Ollama binary requires local installation because it interacts with the OS's GPU drivers and memory management.
Linux users have a workaround — you can download the Ollama binary directly onto the USB drive and run it without a system-wide install. On macOS and Windows, you're looking at a quick install step the first time you use a new machine. Takes about 60 seconds.
The workflow I've settled on is simple: Ollama is already installed on every machine I regularly touch (it's tiny, under 150MB). The models — the big files at 2-5GB each — live on the USB drive. I plug in, run my launcher script, and I'm working within 10 seconds.
Privacy Isn't the Side Benefit. It's the Whole Point.
I'll be direct: the privacy case for portable, offline AI isn't theoretical to me. I've worked on projects where sending data to OpenAI or Anthropic's servers would have violated contractual obligations. NDAs, data residency requirements, HIPAA-adjacent sensitivity. If you do any enterprise consulting or work in regulated industries, you know these situations aren't rare. They're Tuesday.
With a portable LLM, the entire computation happens on the local machine's CPU and RAM. Model files sit on your encrypted USB drive (I use VeraCrypt on top of the drive's hardware encryption). Nothing traverses a network. No API call to log. No prompt stored on someone else's server. No training data contribution you didn't consent to.
As the team at Humanloop noted in their analysis of local LLM security benefits, running models locally prevents sensitive data from being sent to the cloud, giving users full control over their information. The simplest privacy architecture is one where sensitive data never leaves your machine. Full stop.
For anyone in cybersecurity, legal tech, or healthcare, this is quickly becoming a requirement, not a preference. I've tested LLM API latency across major providers, and while cloud APIs win on raw throughput, they come with data exposure tradeoffs that a USB-based local setup eliminates entirely.
What Comes Next for Portable AI
The models are shrinking and getting more capable at a rate that honestly surprises me. A year ago, you needed a 13B parameter model to get useful coding assistance. Today, Llama 3.1 8B and Gemma 3 4B handle most daily tasks competently. I tested Gemma 3 on a Raspberry Pi 5 — an $80 computer — and it produced coherent, useful output. Within 18 months, a 3B model will match what today's 8B models can do. I'd bet money on it.
That means this USB setup only gets better. Faster models, smaller files, more capable output. The 1TB Samsung T7 I'm using today could eventually hold a dozen production-quality models, each specialized for different tasks.
The most powerful AI isn't the one with the most parameters. It's the one you actually have access to when you need it.
If you're an engineer who works across multiple machines, travels regularly, or handles anything remotely sensitive — build this setup this weekend. It takes 20 minutes and a $100 USB SSD. The first time you're stuck somewhere without internet and your AI toolkit is sitting right in your pocket, you'll get it.
Carry your own intelligence. Stop renting it.
Originally published on kunalganglani.com