# Building a Voice-Controlled Local AI Agent with Whisper, Groq & Streamlit
For my Mem0 AI/ML internship assignment, I built a fully working voice-controlled
AI agent that accepts audio input, classifies intent, executes local tools, and
displays everything in a clean UI. Here's how I built it and what I learned.
## What It Does
You speak (or type) a command → the agent transcribes it → classifies your intent
→ executes the right action → shows the result. All in one pipeline.
Supported intents:
- create_file — creates a new file in the output/ folder
- write_code — generates code using LLM and saves it
- summarize — summarizes provided text
- general_chat — conversational Q&A
- compound — multiple commands in one utterance
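Routing these intents comes down to a dispatch table keyed on the classifier's output, with compound commands recursing over their sub-tasks. A minimal sketch (the handler names and return values here are illustrative, not the project's actual functions):

```python
# Illustrative stand-ins for the real tool handlers.
def handle_create_file(params): return f"created {params.get('filename', 'untitled.txt')}"
def handle_write_code(params): return "code generated"
def handle_summarize(params): return "summary"
def handle_general_chat(params): return "chat reply"

DISPATCH = {
    "create_file": handle_create_file,
    "write_code": handle_write_code,
    "summarize": handle_summarize,
    "general_chat": handle_general_chat,
}

def execute(intent_json: dict) -> list[str]:
    """Run one intent, or each sub-task of a compound command."""
    if intent_json.get("intent") == "compound":
        return [execute(sub)[0] for sub in intent_json.get("sub_tasks", [])]
    handler = DISPATCH.get(intent_json.get("intent"))
    if handler is None:
        return ["unknown intent"]  # graceful degradation for anything unexpected
    return [handler(intent_json)]
```

The recursion keeps compound handling trivial: "generate bubble sort and save it" just becomes two entries in `sub_tasks`, each routed like a normal command.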
## Architecture
Audio Input → STT (Whisper/Groq) → Intent Classification (LLM) → Tool Execution → Streamlit UI
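As a function, the pipeline above is just a composition of three stages. A sketch with the stages injected as callables (`transcribe`, `classify_intent`, and `execute` are stand-ins for the project's actual functions):

```python
def run_pipeline(audio_bytes, typed_text, transcribe, classify_intent, execute):
    """One pass through the pipeline: voice (or typed) input -> result.

    Typed input skips the STT stage entirely.
    """
    text = typed_text if typed_text else transcribe(audio_bytes)
    intent = classify_intent(text)
    return execute(intent)
```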
## Tech Stack
| Component | Tool |
|---|---|
| Speech-to-Text | Groq Whisper API |
| Intent + Generation | Groq (llama-3.3-70b) |
| UI | Streamlit |
| Language | Python |
## Model Choices & Why
STT — Groq Whisper API: I chose Groq's hosted Whisper over a local HuggingFace
model because my machine has no GPU. On Groq's free tier, whisper-large-v3
processes audio in under 1 second. The code also supports local Whisper via
HuggingFace Transformers as a fallback.
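The try-hosted-then-fall-back pattern can be sketched like this, assuming the `groq` and `transformers` packages are installed and `GROQ_API_KEY` is set (a sketch of the approach, not the project's exact code):

```python
def transcribe(audio_path: str) -> str:
    """Try Groq's hosted Whisper first; fall back to local HF Whisper."""
    try:
        from groq import Groq
        client = Groq()  # reads GROQ_API_KEY from the environment
        with open(audio_path, "rb") as f:
            result = client.audio.transcriptions.create(
                file=f, model="whisper-large-v3"
            )
        return result.text
    except Exception:
        # Local fallback: much slower on CPU, but works offline.
        from transformers import pipeline
        asr = pipeline("automatic-speech-recognition",
                       model="openai/whisper-small")
        return asr(audio_path)["text"]
```

Lazy imports keep the app usable even if only one of the two backends is installed.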
LLM — Groq (llama-3.3-70b): For intent classification, I needed reliably
structured JSON output. Groq's API with `response_format: json_object` gave
consistent results. The system prompt instructs the model to return intent,
filename, language, and sub_tasks for compound commands.
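The classification call looks roughly like this. It assumes the `groq` package and a `GROQ_API_KEY` env var; the model ID and the prompt wording are my reconstruction of the setup described above, not the project's exact values:

```python
import json

SYSTEM_PROMPT = """You are an intent classifier. Return ONLY a JSON object with
keys: intent (create_file | write_code | summarize | general_chat | compound),
filename, language, and sub_tasks (a list, used for compound commands)."""

def classify_intent(text: str) -> dict:
    """Classify a transcribed command into a structured intent object."""
    from groq import Groq
    client = Groq()
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # assumed Groq model ID
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": text}],
        response_format={"type": "json_object"},  # request JSON-only output
    )
    return json.loads(resp.choices[0].message.content)
```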
## Key Challenge — Intent Classification
Getting the LLM to return valid JSON every time was the hardest part. My
solution combined a strict system prompt with a defensive parser:
- The prompt defines every intent clearly
- The prompt forces JSON-only output
- A fallback parser strips markdown fences if the model adds them anyway
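The fence-stripping fallback is a few lines of regex. A minimal sketch of that idea:

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Parse LLM output as JSON, tolerating ```json ... ``` fences."""
    cleaned = raw.strip()
    cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)  # leading fence
    cleaned = re.sub(r"\s*```$", "", cleaned)           # trailing fence
    return json.loads(cleaned)
```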
## Bonus Features Implemented
- Compound commands — "Generate bubble sort and save it as bubble.py"
- Human-in-the-loop — confirmation prompt before any file operation
- Graceful degradation — handles LLM failures, bad audio, unknown intents
- Session memory — chat context preserved across turns
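The human-in-the-loop gate reduces to a simple rule: destructive actions run only after explicit confirmation. A pure-Python sketch of that gate (in the app, `confirmed` would come from a Streamlit button; the function name is mine, not the project's):

```python
def confirm_and_run(action, description: str, confirmed: bool) -> str:
    """Run a file operation only once the user has confirmed it."""
    if not confirmed:
        return f"Awaiting confirmation: {description}"
    return action()
```

Keeping the gate as a separate function means every tool handler goes through the same checkpoint instead of implementing its own confirmation logic.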
## Safety
All file operations are restricted to an output/ folder. The _safe_path()
function strips any directory traversal attempts and adds timestamps to filenames
to prevent overwrites.
## What I Learned
- Prompt engineering for structured output is more important than model size
- Groq's free tier is surprisingly powerful for production-quality inference
- Streamlit makes it incredibly fast to build AI demo UIs
- Always restrict file operations to a sandboxed directory