# Building a Voice-Controlled Local AI Agent with Whisper, Groq & Streamlit
For my Mem0 AI/ML internship assignment, I built a fully working voice-controlled
AI agent that accepts audio input, classifies intent, executes local tools, and
displays everything in a clean UI. Here's how I built it and what I learned.
## What It Does
You speak (or type) a command → the agent transcribes it → classifies your intent
→ executes the right action → shows the result. All in one pipeline.
Supported intents:
- create_file — creates a new file in the output/ folder
- write_code — generates code using LLM and saves it
- summarize — summarizes provided text
- general_chat — conversational Q&A
- compound — multiple commands in one utterance
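Routing these intents comes down to a dispatch table keyed on the classifier's output, with compound commands recursing over their sub-tasks. A minimal sketch (the handler names and return values here are illustrative, not the project's actual functions):

```python
# Illustrative stand-ins for the real tool handlers.
def handle_create_file(params): return f"created {params.get('filename', 'untitled.txt')}"
def handle_write_code(params): return "code generated"
def handle_summarize(params): return "summary"
def handle_general_chat(params): return "chat reply"

DISPATCH = {
    "create_file": handle_create_file,
    "write_code": handle_write_code,
    "summarize": handle_summarize,
    "general_chat": handle_general_chat,
}

def execute(intent_json: dict) -> list[str]:
    """Run one intent, or each sub-task of a compound command."""
    if intent_json.get("intent") == "compound":
        return [execute(sub)[0] for sub in intent_json.get("sub_tasks", [])]
    handler = DISPATCH.get(intent_json.get("intent"))
    if handler is None:
        return ["unknown intent"]  # graceful degradation for anything unexpected
    return [handler(intent_json)]
```

The recursion keeps compound handling trivial: "generate bubble sort and save it" just becomes two entries in `sub_tasks`, each routed like a normal command.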
## Architecture
Audio Input → STT (Whisper/Groq) → Intent Classification (LLM) → Tool Execution → Streamlit UI
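As a function, the pipeline above is just a composition of three stages. A sketch with the stages injected as callables (`transcribe`, `classify_intent`, and `execute` are stand-ins for the project's actual functions):

```python
def run_pipeline(audio_bytes, typed_text, transcribe, classify_intent, execute):
    """One pass through the pipeline: voice (or typed) input -> result.

    Typed input skips the STT stage entirely.
    """
    text = typed_text if typed_text else transcribe(audio_bytes)
    intent = classify_intent(text)
    return execute(intent)
```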
## Tech Stack
| Component | Tool |
|---|---|
| Speech-to-Text | Groq Whisper API |
| Intent + Generation | Groq (llama-3.3-70b) |
| UI | Streamlit |
| Language | Python |
## Model Choices & Why
STT — Groq Whisper API: I chose Groq's hosted Whisper over a local HuggingFace
model because my machine has no GPU. On Groq's free tier, whisper-large-v3
processes audio in under 1 second. The code also supports local Whisper via
HuggingFace Transformers as a fallback.
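The try-hosted-then-fall-back pattern can be sketched like this, assuming the `groq` and `transformers` packages are installed and `GROQ_API_KEY` is set (a sketch of the approach, not the project's exact code):

```python
def transcribe(audio_path: str) -> str:
    """Try Groq's hosted Whisper first; fall back to local HF Whisper."""
    try:
        from groq import Groq
        client = Groq()  # reads GROQ_API_KEY from the environment
        with open(audio_path, "rb") as f:
            result = client.audio.transcriptions.create(
                file=f, model="whisper-large-v3"
            )
        return result.text
    except Exception:
        # Local fallback: much slower on CPU, but works offline.
        from transformers import pipeline
        asr = pipeline("automatic-speech-recognition",
                       model="openai/whisper-small")
        return asr(audio_path)["text"]
```

Lazy imports keep the app usable even if only one of the two backends is installed.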
LLM — Groq (llama-3.3-70b): For intent classification, I needed reliably
structured JSON output. Groq's API with `response_format: json_object` gave
consistent results. The system prompt instructs the model to return intent,
filename, language, and sub_tasks for compound commands.
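The classification call looks roughly like this. It assumes the `groq` package and a `GROQ_API_KEY` env var; the model ID and the prompt wording are my reconstruction of the setup described above, not the project's exact values:

```python
import json

SYSTEM_PROMPT = """You are an intent classifier. Return ONLY a JSON object with
keys: intent (create_file | write_code | summarize | general_chat | compound),
filename, language, and sub_tasks (a list, used for compound commands)."""

def classify_intent(text: str) -> dict:
    """Classify a transcribed command into a structured intent object."""
    from groq import Groq
    client = Groq()
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # assumed Groq model ID
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": text}],
        response_format={"type": "json_object"},  # request JSON-only output
    )
    return json.loads(resp.choices[0].message.content)
```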
## Key Challenge — Intent Classification
Getting the LLM to return valid JSON every time was the hardest part. My
solution combined a strict system prompt with a defensive parser:
- The prompt defines every intent clearly
- The prompt forces JSON-only output
- A fallback parser strips markdown fences if the model adds them anyway
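The fence-stripping fallback is a few lines of regex. A minimal sketch of that idea:

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Parse LLM output as JSON, tolerating ```json ... ``` fences."""
    cleaned = raw.strip()
    cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)  # leading fence
    cleaned = re.sub(r"\s*```$", "", cleaned)           # trailing fence
    return json.loads(cleaned)
```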
## Bonus Features Implemented
- Compound commands — "Generate bubble sort and save it as bubble.py"
- Human-in-the-loop — confirmation prompt before any file operation
- Graceful degradation — handles LLM failures, bad audio, unknown intents
- Session memory — chat context preserved across turns
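The human-in-the-loop gate reduces to a simple rule: destructive actions run only after explicit confirmation. A pure-Python sketch of that gate (in the app, `confirmed` would come from a Streamlit button; the function name is mine, not the project's):

```python
def confirm_and_run(action, description: str, confirmed: bool) -> str:
    """Run a file operation only once the user has confirmed it."""
    if not confirmed:
        return f"Awaiting confirmation: {description}"
    return action()
```

Keeping the gate as a separate function means every tool handler goes through the same checkpoint instead of implementing its own confirmation logic.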
## Safety
All file operations are restricted to an output/ folder. The _safe_path()
function strips any directory traversal attempts and adds timestamps to filenames
to prevent overwrites.
## What I Learned
- Prompt engineering for structured output is more important than model size
- Groq's free tier is surprisingly powerful for production-quality inference
- Streamlit makes it incredibly fast to build AI demo UIs
- Always restrict file operations to a sandboxed directory