Most AI demos work perfectly — until you try to use them like a real system.
I built this as part of an AI/ML internship assignment, but it quickly turned into something deeper. What started as "just get voice input working with an LLM" became debugging audio pipelines, fixing session state bugs I didn't know existed, and figuring out how to make an LLM understand follow-up commands without losing context.
No OpenAI APIs. No usage costs. Just Python, Whisper, LLaMA 3, and a lot of time in the terminal.
What It Does
You give the agent a command — by voice or text — and it figures out what you want and executes it.
Under the hood, it:
- Transcribes audio using Whisper (running locally)
- Sends the text to LLaMA 3 via Ollama for intent classification
- Executes the corresponding action
- Stores context for follow-up commands
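The intent classification step hinges on getting structured output back from the model. A minimal sketch of parsing that response, assuming the classifier prompts LLaMA 3 to return JSON (the schema and the chat fallback here are assumptions for illustration, not the project's actual code):

```python
import json

def parse_intent(raw: str) -> tuple[str, str]:
    """Parse the LLM's classification output into (intent, content).

    Falls back to treating the whole reply as a chat message when the
    model drifts off-format -- local models do that more than you'd like.
    """
    try:
        data = json.loads(raw)
        if isinstance(data, dict):
            return data.get("intent", "chat"), data.get("content", "")
    except json.JSONDecodeError:
        pass
    return "chat", raw
```

Having one defensive parser at this boundary means everything downstream can trust the intent value.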
Supported Actions
- create_file — create a file with provided content
- write_code — generate and save code
- summarize — summarize text, optionally save it
- quiz — generate MCQs from previous content
- chat — general conversation with session memory
The most interesting part is compound and contextual commands. You can say "Summarize this and save it to summary.txt" as one prompt, then follow up with "Generate a quiz from the previous summary" — and it works, because the system maintains context across turns.
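Compound prompts work by running the classified intents in order and letting each result update shared context, so a later step can use what an earlier step produced. A minimal sketch of the idea (the handler signatures and the context dict are illustrative assumptions, not the project's exact code):

```python
def run_compound(intents, handlers, context):
    """Execute a list of (intent, content) pairs in order.

    Each handler receives the content plus the shared context, so a
    later step (save to file) can use what an earlier step (summarize)
    produced.
    """
    results = []
    for intent, content in intents:
        result = handlers[intent](content, context)
        context["last_result"] = result  # later intents can reference this
        results.append(result)
    return results
```

With "Summarize this and save it to summary.txt", the classifier would emit two intents, and the create_file handler reads the summary out of the shared context.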
Architecture
Voice/Text Input
↓
stt.py ← Whisper transcription
↓
classifier.py ← LLaMA 3 intent classification
↓
actions.py ← Action dispatcher
↓
ollama_client.py ← LLM interface + retry logic
↓
memory.py ← Session state (Streamlit)
↓
app.py ← UI (Streamlit)
Each module has one job. This separation didn't exist in the first version — it came after debugging and refactoring. Early on, LLM calls were embedded inside the classifier, which created messy dependencies across modules. Pulling things apart made everything easier to test and reason about.
Challenge 1: Whisper Integration Was Not Straightforward
At first I assumed transcription would be the easy part. It wasn't.
Whisper depends on ffmpeg for audio processing. My first solution was to monkey-patch Whisper's internal audio loader:
# original approach — don't do this
whisper_audio.load_audio = _load_audio_with_local_ffmpeg
This worked — but introduced a hidden global side effect. Any part of the codebase importing Whisper would unknowingly use the patched version with no warning. That's the kind of thing that causes weird bugs later when you've forgotten you did it.
Fix
Instead of patching globally, I switched to calling the custom loader directly:
def transcribe(audio_path: str, size: str = "base") -> str:
    model = load_model(size)
    audio = _load_audio_with_local_ffmpeg(audio_path)
    result = model.transcribe(audio, beam_size=5, temperature=0.0)
    return result["text"].strip()
Additional Improvements
Quiet recordings were causing Whisper to hallucinate or return garbage. Adding normalization before passing audio to the model helped a lot:
if np.max(np.abs(audio)) > 0:
    audio = audio / np.max(np.abs(audio))
Combined with beam_size=5 and temperature=0.0 in the decoding options, transcription got noticeably more reliable. I also used imageio-ffmpeg to bundle ffmpeg directly — so users don't need to install it separately, which matters when sharing a project.
Noisy environments still cause issues. That's a real limitation of running the tiny Whisper model locally.
Challenge 2: Streamlit Session State Is Not What You Think
This bug took longer than expected to figure out.
Streamlit reruns your entire script on every interaction. My original memory.py stored history in a plain module-level list:
# looked fine, wasn't fine
_history = []

def add(transcription, intents, results):
    _history.append(...)
On a local machine with one browser tab, this works perfectly. Open a second tab and both sessions share the same Python process — which means the same _history list. Two users would see each other's conversation history.
Fix
Moved everything into st.session_state:
def _get_history_store():
    if "memory_history" not in st.session_state:
        st.session_state["memory_history"] = []
    return st.session_state["memory_history"]
Each session now has isolated state. I also added a fallback list for test environments where Streamlit isn't running — which matters for unit testing without spinning up the full app.
Challenge 3: Context Resolution for Follow-Up Commands
This part took some trial and error to get right.
When a user says "Generate a quiz from the previous summary", the classifier correctly returns intent = quiz — but the content it extracts is just the literal phrase. That's useless as quiz material. The actual text to quiz from is the summary generated earlier in the conversation.
Solution
I introduced a context resolution layer that runs before any action executes. It checks whether the content looks like a reference to prior conversation, and if it does, it fetches the right content:
def _resolve_contextual_content(intent, content, chat_history, last_summary_text):
    if intent == "quiz" and _is_previous_text_reference(content):
        previous_text = (
            last_summary_text
            or _resolve_previous_assistant_text(chat_history)
            or _resolve_previous_conversation_text(chat_history)
        )
        if previous_text:
            return previous_text
    return content
last_summary_text is stored separately in session state every time a summarize action completes. So quiz follow-ups always have the right source material, even if other messages happened in between.
This made multi-step workflows actually reliable.
The Ollama Client Refactor
Early on, the Ollama API call lived inside classifier.py. When actions.py needed LLM access too, it imported from the classifier just for that one function — a clear layering problem.
Fix
Created a dedicated ollama_client.py:
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://localhost:11434/api/generate")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3")
OLLAMA_MAX_RETRIES = int(os.getenv("OLLAMA_MAX_RETRIES", "2"))
Added retry logic, timeout handling, and exponential backoff. Before this, a single Ollama timeout would crash the entire pipeline and surface a raw Python exception to the user. Now failures are caught and retried gracefully.
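A sketch of that retry loop, with the HTTP call and the sleep injectable so it can be unit-tested offline (the function name and parameters are mine; the payload follows Ollama's documented /api/generate format):

```python
import time

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
OLLAMA_MODEL = "llama3"

def call_ollama(prompt, max_retries=2, timeout=60,
                post=requests.post, sleep=time.sleep):
    payload = {"model": OLLAMA_MODEL, "prompt": prompt, "stream": False}
    for attempt in range(max_retries + 1):
        try:
            resp = post(OLLAMA_URL, json=payload, timeout=timeout)
            resp.raise_for_status()
            return resp.json()["response"]
        except requests.RequestException:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...
```

Injecting post and sleep looks fussy, but it's what lets the retry behavior be asserted in tests without a live Ollama server.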
Limitations
Worth being honest about what doesn't work perfectly:
- Noisy environments — transcription quality drops noticeably with background noise
- Ambiguous commands — intent classification can fail on vague input; the more specific the command, the better
- CPU inference — LLaMA 3 is usable on CPU but noticeably slow without a GPU
- No persistence — session history lives in memory and resets on page reload
What I Learned
This project shifted how I think about building AI systems:
- The model is not the hard part — the system around it is. The LLM worked fine. Making the pipeline around it reliable took most of the effort.
- Edge cases define reliability. The session state bug, the monkey-patch issue, the context resolution logic — none of it shows up in tutorials. You find it by building something and watching it break.
- Architecture decisions compound quickly. Separating ollama_client.py sounded like over-engineering for a small project, but it made debugging and testing significantly easier.
- Voice pipelines amplify errors across layers. A bad transcription leads to a wrong classification leads to the wrong action. Each layer multiplies the mistakes from the one before it.
The gap between "the model works" and "the system works" is where real engineering happens. That's what this project actually taught me.
What I'd Improve Next
- FastAPI backend — decouple the logic from the UI so it's reusable from any frontend
- Streaming responses — Ollama supports it; the UX improvement would be significant
- Intent confidence scoring — safer execution for ambiguous or destructive commands
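Streaming is mostly a parsing change on the client side. With stream=True, Ollama's /api/generate returns one JSON object per line; a sketch of a chunk parser for that format (the function name is mine):

```python
import json

def stream_chunks(lines):
    """Yield text fragments from Ollama's line-delimited JSON stream."""
    for line in lines:
        if not line:
            continue
        data = json.loads(line)
        if data.get("done"):
            break
        yield data.get("response", "")
```

In the UI, this generator would feed something like st.write_stream so tokens appear as they arrive instead of after the full generation.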
Running the Project
pip install -r requirements.txt
ollama serve
ollama pull llama3
streamlit run app.py
# run tests
python -m unittest -q
Python 3.11 required. Works on CPU — GPU recommended for speed. Whisper tiny is the default model.
GitHub: https://github.com/Akhilesh0605/voice-ai-agent.git
Final Thoughts
This started as an assignment and ended up being one of the more useful things I've built — not because of the features, but because of what broke along the way.
Building AI systems isn't about making the model work. It's about making everything around it reliable. That's where the actual challenge is, and honestly, where most of the learning is too.
Built with Python 3.11, Streamlit, OpenAI Whisper, LLaMA 3, and Ollama.
Questions or feedback — drop them in the comments.