Building a Voice-Controlled Local AI Agent with Whisper, Groq & Streamlit

For my Mem0 AI/ML internship assignment, I built a fully working voice-controlled
AI agent that accepts audio input, classifies intent, executes local tools, and
displays everything in a clean UI. Here's how I built it and what I learned.

What It Does

You speak (or type) a command → the agent transcribes it → classifies your intent
→ executes the right action → shows the result. All in one pipeline.

Supported intents:

  • create_file — creates a new file in the output/ folder
  • write_code — generates code using LLM and saves it
  • summarize — summarizes provided text
  • general_chat — conversational Q&A
  • compound — multiple commands in one utterance

Architecture

Audio Input → STT (Whisper/Groq) → Intent Classification (LLM) → Tool Execution → Streamlit UI
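The four stages above can be sketched as a single composition. This is an illustrative reconstruction, not the author's actual code; the stage functions are injected so each can be swapped (typed input skips real STT, for example):

```python
# Hypothetical sketch of the four-stage pipeline described above.

def run_pipeline(user_input, transcribe, classify, execute):
    """Audio/text -> transcript -> intent dict -> tool result."""
    transcript = transcribe(user_input)   # STT stage (identity for typed text)
    intent = classify(transcript)          # LLM returns e.g. {"intent": "write_code"}
    result = execute(intent, transcript)   # dispatch to the matching local tool
    return {"transcript": transcript, "intent": intent["intent"], "result": result}
```

For typed commands, `transcribe` can simply be `lambda x: x`.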

Tech Stack

Component             Tool
Speech-to-Text        Groq Whisper API
Intent + Generation   Groq (llama-3.3-70b)
UI                    Streamlit
Language              Python

Model Choices & Why

STT — Groq Whisper API: I chose Groq over local HuggingFace Whisper because
my machine doesn't have a GPU. Groq transcribes audio in under a second using
whisper-large-v3 on its free tier. The code also supports local Whisper via
HuggingFace transformers as a fallback.
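A minimal sketch of the STT stage, assuming the `groq` Python SDK's OpenAI-style `audio.transcriptions.create` endpoint; the fallback model choice (`openai/whisper-small`) is my assumption, not necessarily what the project ships:

```python
import os

def transcribe(audio_path, client=None):
    """Transcribe via Groq's hosted whisper-large-v3, falling back to a
    local HuggingFace pipeline if the API is unavailable."""
    try:
        if client is None:
            from groq import Groq  # reads GROQ_API_KEY from the environment
            client = Groq()
        with open(audio_path, "rb") as f:
            resp = client.audio.transcriptions.create(
                file=(os.path.basename(audio_path), f.read()),
                model="whisper-large-v3",
            )
        return resp.text
    except Exception:
        # Fallback: local Whisper on CPU (slow, but no API key required)
        from transformers import pipeline
        asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
        return asr(audio_path)["text"]
```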

LLM — Groq (llama-3.3-70b): For intent classification, I needed structured
JSON output reliably. Groq's API with response_format: json_object gave
consistent results. The system prompt instructs the model to return intent,
filename, language, and sub_tasks for compound commands.
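A sketch of that classification call. The exact prompt wording and the Groq model id (`llama-3.3-70b-versatile`) are my assumptions based on the description above:

```python
import json

# Assumed prompt wording; the article only describes its required keys.
SYSTEM_PROMPT = (
    "You are an intent classifier for a voice agent. Respond with ONLY a JSON "
    "object with keys: intent (one of create_file, write_code, summarize, "
    "general_chat, compound), filename, language, and sub_tasks (a list of "
    "sub-commands, used for compound intents)."
)

def classify_intent(client, user_text):
    """response_format={"type": "json_object"} makes the model emit valid JSON."""
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```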

Key Challenge — Intent Classification

Getting the LLM to return valid JSON every time was the hardest part. My
solution combined a strict system prompt with a defensive parser:

  1. Define every intent clearly in the prompt
  2. Force JSON-only output
  3. Strip markdown fences in a fallback parser when the model adds them anyway
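The fence-stripping fallback (step 3) can be as small as this; the function name is illustrative:

```python
import json
import re

def parse_llm_json(raw):
    """Parse model output as JSON, stripping ``` fences if present."""
    cleaned = raw.strip()
    cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)  # leading ``` or ```json
    cleaned = re.sub(r"\s*```$", "", cleaned)           # trailing ```
    return json.loads(cleaned)
```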

Bonus Features Implemented

  • Compound commands — "Generate bubble sort and save it as bubble.py"
  • Human-in-the-loop — confirmation prompt before any file operation
  • Graceful degradation — handles LLM failures, bad audio, unknown intents
  • Session memory — chat context preserved across turns
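The human-in-the-loop check can be factored as a tiny wrapper. This is a sketch: in the real UI the confirmation is presumably a Streamlit button, whereas here `confirm` is injected as a callback for clarity:

```python
def execute_with_confirmation(action, description, confirm):
    """Run `action` only if `confirm` approves the described file operation."""
    if not confirm(f"About to {description}. Proceed?"):
        return "cancelled"
    return action()
```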

Safety

All file operations are restricted to an output/ folder. The _safe_path()
function strips any directory traversal attempts and adds timestamps to filenames
to prevent overwrites.

What I Learned

  • Prompt engineering for structured output is more important than model size
  • Groq's free tier is surprisingly powerful for production-quality inference
  • Streamlit makes it incredibly fast to build AI demo UIs
  • Always restrict file operations to a sandboxed directory

Links

Source: dev.to
