Recently I applied for the Mem0 AI - ML & Generative AI Developer Internship. As part of the selection process, I was given a challenging assignment: build a fully local voice-controlled AI agent that can understand spoken commands, classify intent, and perform real actions on my machine — all while keeping everything safe and visible in a clean UI.
Here’s how I built it.
Project Overview:-
The agent accepts audio input in two ways:
- Direct microphone recording
- Uploading an existing .wav or .mp3 file
It then:
- Converts speech to text using a local Whisper model
- Uses a local LLM to understand the user’s intent
- Executes the appropriate action (create file, write code, summarize text, or general chat)
- Shows the complete pipeline in a Streamlit UI
All file operations are strictly restricted to an output/ folder for safety.
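That restriction boils down to resolving every requested filename and rejecting anything that escapes the sandbox. A minimal sketch of the idea (the helper name `safe_path` is illustrative, not the repo's actual API):

```python
from pathlib import Path

# All writes must resolve inside output/ (hypothetical helper, not repo code)
OUTPUT_DIR = Path("output").resolve()

def safe_path(filename: str) -> Path:
    """Resolve a requested filename and reject anything outside output/."""
    candidate = (OUTPUT_DIR / filename).resolve()
    # resolve() collapses "../" tricks, so a simple ancestry check suffices
    if candidate != OUTPUT_DIR and OUTPUT_DIR not in candidate.parents:
        raise ValueError(f"refusing to write outside output/: {filename}")
    return candidate
```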
Architecture:-
The system follows a clean pipeline:
Audio Input → Speech-to-Text → Intent Understanding → Tool Execution → UI Display
I used Streamlit for the frontend because it’s fast to build and looks professional.
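Wired together, the pipeline is essentially four function calls in sequence. A sketch with stubbed stages (all function names and return shapes here are illustrative, not the repo's actual API):

```python
def transcribe(audio_path: str) -> str:
    """Stub: speech-to-text via the local Whisper model."""
    return "create a file called notes.txt"

def detect_intent(text: str) -> dict:
    """Stub: the local LLM maps the transcript to a structured intent."""
    return {"intent": "create_file", "filename": "notes.txt"}

def execute(intent: dict) -> str:
    """Stub: dispatch the intent to the matching tool."""
    return f"executed {intent['intent']}"

def run_pipeline(audio_path: str) -> dict:
    """Audio -> text -> intent -> action, returning each stage for the UI."""
    text = transcribe(audio_path)
    intent = detect_intent(text)
    result = execute(intent)
    return {"transcript": text, "intent": intent, "result": result}
```

Returning every intermediate stage, rather than just the final result, is what lets the Streamlit UI display the complete pipeline.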
Tech Stack & Model Choices:-
- For speech-to-text, I used faster-whisper (the small model), as it is fast, fully local, and runs well on CPU.
- For intent detection and code generation, I used Ollama + phi3:mini, as it is lightweight, fully local, and good at following instructions.
- For audio handling, I used sounddevice for microphone recording and pydub for processing uploaded audio files.
- For the UI, I used Streamlit for its simplicity and excellent support for real-time updates.
I chose phi3:mini as the main LLM because it’s small, fast on CPU, and performs surprisingly well for intent classification and simple code generation.
Challenges I Faced:-
The biggest challenge was Ollama’s GPU runner crashing repeatedly on my Ubuntu 22.04 laptop, even though it detected my NVIDIA GPU: I kept getting the “llama runner process has terminated” error. After multiple clean reinstalls and driver tweaks, I had to force CPU mode using num_gpu_layers=0. This made the agent stable, though it sacrificed some speed.
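The CPU-forced call can be sketched against Ollama's local HTTP API on port 11434 (a minimal sketch; note that in Ollama's own options dict the layer-offload key is spelled `num_gpu`, so the exact key name may vary by client wrapper):

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "phi3:mini") -> dict:
    """Request body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # 0 offloaded layers = pure CPU inference; this is what made it stable
        "options": {"num_gpu": 0},
    }

def generate(prompt: str, host: str = "http://localhost:11434") -> str:
    """Call a locally running Ollama server (requires `ollama serve`)."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```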
Another challenge came while adding one of the “Bonus (Optional)” functionalities, Human-in-the-Loop confirmation. I added a UI confirmation prompt before file execution, but the session was effectively reset as soon as the user confirmed, because of Streamlit’s button rendering behavior.
The problem is that Streamlit reruns the entire script whenever a widget changes: checking the confirmation checkbox triggers a rerun, the button’s state resets during that rerun, and by the time the script reads the button again the confirmation has been lost.
I couldn’t overcome this problem, so I simply removed this bonus functionality.
Another challenge was making the intent detection robust. LLMs sometimes return messy responses with extra text or markdown, so I had to implement custom JSON extraction logic with fallbacks to prevent the app from breaking on unexpected outputs.
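That extraction logic amounts to finding the first JSON object in the raw response and falling back to a default intent when parsing fails. A simplified sketch of the idea (not the repo's exact code; the fallback intent name is illustrative):

```python
import json
import re

# Hypothetical default when the model's output can't be parsed
FALLBACK = {"intent": "general_chat"}

def extract_json(raw: str) -> dict:
    """Pull the first JSON object out of a possibly messy LLM response."""
    # Strip markdown code fences the model sometimes adds
    cleaned = re.sub(r"```(?:json)?", "", raw)
    # Grab the outermost {...} span, if any
    match = re.search(r"\{.*\}", cleaned, flags=re.DOTALL)
    if not match:
        return FALLBACK
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return FALLBACK
```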
I also focused heavily on safety, as instructed in the assignment PDF, ensuring no file could ever be written outside the output/ folder by using filename validation and path checks.
Model Benchmarking & Learnings:-
While building this agent, I experimented with a few small LLMs available through Ollama to find the best fit for my hardware.
I tested gemma2:2b, llama3.2:3b, and phi3:mini on the same set of tasks (intent classification and simple code generation), running everything in CPU mode.
gemma2:2b was the fastest, often responding in under 2 seconds, but it occasionally gave less accurate or incomplete code. llama3.2:3b offered good quality and was reasonably fast, but it was the most unstable on my system. In the end, phi3:mini struck the best balance: it was fast enough, produced cleaner and more reliable outputs, and handled structured JSON prompts better than the others.
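The timing side of this comparison is easy to reproduce with a small harness, assuming each model is wrapped in a callable (the runner passed in here is a stand-in for the real Ollama call):

```python
import time

def benchmark(run_model, prompts):
    """Average wall-clock latency of a model callable over a prompt set."""
    timings = []
    for prompt in prompts:
        start = time.perf_counter()
        run_model(prompt)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```

In practice `run_model` would call the model under test through Ollama, keeping the prompts, machine, and CPU mode identical across all three models so the averages are comparable.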
For Speech-to-Text, I compared the tiny, base, and small versions of faster-whisper. The small model gave me the best accuracy on technical terms and function names, with transcription speed roughly 1.5× real-time on CPU. The tiny model was quicker but made more mistakes, especially when I spoke code-related commands.
This benchmarking exercise taught me that theoretical model size doesn’t always translate to real-world performance. Hardware constraints and prompt quality play a much bigger role than I initially expected.
Final Thoughts:-
This project gave me valuable hands-on experience in building end-to-end local AI systems, handling real hardware limitations, and making agents both functional and safe.
GitHub Repository:-
https://github.com/cHaMpIoN-37/voice-ai-agent
Feel free to clone the repo and try it yourself. Any feedback is welcome!