📦 Full runnable example: github.com/sm1ck/honeychat/tree/main/tutorial/01-memory — clone,
docker compose up, chat with the demo bot on Telegram. Every code snippet below is pulled from that repo.
Most AI chatbots still struggle with reliable, queryable long-term recall. Character.AI has pinned and chat memories, but unpinned details can still fall out of the active conversation context. Replika remembers profile facts, preferences, and generated memories, but that is not the same as semantic recall over the full conversation. Even ChatGPT's Memory is built for useful preferences and details, not verbatim replay of long sessions.
I wanted a chat companion with practical persistent memory — not just the current conversation, but older facts and events surfaced when they matter. Here's the architecture that worked well for this use case.
TL;DR
- Hot layer (Redis) — recent messages per conversation, short TTL, low-latency reads.
- Cold layer (ChromaDB) holds summaries of chunks, not individual messages. Every N bot turns, a background task summarizes that window via a cheap LLM and stores the summary as a document. Keeps the vector index tiny, queries fast.
- On every user message, three retrieval paths fire in parallel via `asyncio.gather`: recent buffer, latest summary, top-K semantic search. All three get assembled into the system prompt.
- Result: substantially fewer tokens than full-history replay, while still making old context retrievable weeks later.
Run it yourself in 5 minutes
Before the architectural deep-dive, boot the demo so you can poke the memory layers live.
1. Clone and enter the folder
git clone https://github.com/sm1ck/honeychat
cd honeychat/tutorial/01-memory
2. Configure two tokens
cp .env.example .env
Open .env and fill:
- `TELEGRAM_BOT_TOKEN` — get it from @BotFather (30 seconds: `/newbot`, pick a name, copy the token)
- `OPENROUTER_API_KEY` — from openrouter.ai/keys. The default `LLM_MODEL` is a free-tier Llama 3.1 8B so you don't spend a cent.
3. Start the stack
docker compose up --build -d
docker compose logs -f bot # watch the bot come alive
Four containers: redis, chromadb, api (FastAPI inspector on localhost:8000), bot (your Telegram bot polling).
4. Talk to your bot
Open it on Telegram, hit /start, chat for 10–20 turns. Tell it things about yourself. Come back later and reference something you said earlier — it'll pull it from ChromaDB.
5. Peek at what each layer holds
# Replace 12345 with your own Telegram user ID (ask @userinfobot)
curl http://localhost:8000/memory/12345/demo/recent | jq
curl http://localhost:8000/memory/12345/demo/summary | jq
`recent` shows the raw Redis buffer; `summary` shows the latest ChromaDB document.
With the demo running, the rest of this post explains what you just booted.
Why rolling summaries alone don't work
A common pattern for chatbot memory is a rolling summary — every N messages, regenerate a compressed version of older context. It's cheap. It's also lossy in a very specific way: nuance dies in repeated compression.
Walk it through three regenerations:
Turn 1: "She said she hates her boss because he takes credit for her work"
Turn 2 summary: "User mentioned workplace frustration with manager"
Turn 3 summary: "User has job-related stress"
Turn 4 summary: "User has a job"
By turn 4, the reason is gone. A companion bot starts sounding generic. The fix used here: keep raw recent messages verbatim and only summarize chunks that are genuinely old, while being able to semantically retrieve any summary from the full history when the current conversation calls back.
Architecture
Two independent layers. Writes to Redis are synchronous on every turn; writes to ChromaDB are asynchronous, batched. Reads from both happen in parallel on every message.
The hot layer — Redis
Each (user_id, character_id) conversation is stored as a bounded Redis list:
import json
from datetime import datetime, timezone

async def save_message(user_id: int, char_id: str, role: str, content: str) -> None:
    r = get_redis()
    key = f"chat:{user_id}:{char_id}:messages"
    msg = json.dumps({
        "role": role,
        "content": content,
        "ts": datetime.now(timezone.utc).isoformat(),
    })
    # One pipelined round trip: append, trim to the bounded window, refresh the TTL.
    pipe = r.pipeline()
    pipe.rpush(key, msg)
    pipe.ltrim(key, -HOT_BUFFER_SIZE, -1)
    pipe.expire(key, 86400 * HOT_BUFFER_TTL_DAYS)
    await pipe.execute()
Three things matter here:
- `ltrim` on every write. The list is bounded. Memory per user is O(1), not O(conversation length).
- TTL extended on every write. Inactive users' history evicts automatically. Configure Redis with `allkeys-lru` so overflow evicts instead of refusing writes — `noeviction` is the default and it's a footgun.
- Pipelined writes. `rpush + ltrim + expire` in one round trip.
The cold layer — ChromaDB with summaries, not messages
A tempting implementation is to embed every message and run semantic search over them. Two problems: the index grows linearly with conversation volume, and individual messages are often too short or context-free to retrieve meaningfully ("yeah" returns a lot of "yeah" matches).
Instead: embed LLM-generated summaries of chunks. Every N bot turns, compress the window via a cheap LLM and write it as one document to a per-(user, character) ChromaDB collection. Ten weeks of active conversation is maybe 30–50 documents per collection, not tens of thousands.
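The write path can be sketched like this. `SUMMARY_EVERY_N_TURNS`, the `summarize` callable, and the collection handle are assumptions standing in for the repo's actual wiring; the ChromaDB calls (`col.add` with `ids`, `documents`, `metadatas`) are the real collection API:

```python
import uuid
from datetime import datetime, timezone
from typing import Optional

SUMMARY_EVERY_N_TURNS = 10  # assumed default; tune per workload

async def maybe_summarize(col, summarize, messages: list[dict], turn_count: int) -> Optional[str]:
    """Every N bot turns, compress the recent window into one ChromaDB document.

    `col` is a per-(user, character) ChromaDB collection and `summarize` is an
    async call into a cheap LLM; both are injected so the sketch stays small.
    """
    if turn_count % SUMMARY_EVERY_N_TURNS != 0:
        return None
    window = "\n".join(f'{m["role"]}: {m["content"]}' for m in messages)
    summary = await summarize(window)  # one cheap-LLM call per window
    if not summary:  # rate-limited / empty response: store nothing
        return None
    col.add(
        ids=[str(uuid.uuid4())],
        documents=[summary],  # ChromaDB embeds the document text itself
        metadatas=[{"ts": datetime.now(timezone.utc).isoformat()}],
    )
    return summary
```

One document per window is what keeps the index at tens of entries per collection instead of one per message.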
Retrieval — three paths in parallel
On every user message, the chat handler fires three reads in parallel via asyncio.gather:
import asyncio

async def build_prompt_context(user_id: int, char_id: str, user_query: str) -> dict:
    """Fire the three reads in parallel. Returns everything the handler needs."""
    recent, summary, memories = await asyncio.gather(
        get_recent(user_id, char_id),
        get_latest_summary(user_id, char_id),
        get_relevant_memories(user_id, char_id, user_query),
    )
    return {"recent": recent, "summary": summary, "memories": memories}
The fast path for the summary hits Redis. The slower path queries ChromaDB only when the Redis cache expired, then writes back so the next call is hot again.
Production issues that came up
Double-summarize race. Two concurrent messages for the same (user, character) pair both trigger summarization, writing overlapping summaries. Fix: per-key task tracking — cancel the pending task if a new one fires.
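A minimal sketch of that per-key tracking, assuming at most one summarization task per (user, character) pair:

```python
import asyncio

_pending: dict[tuple[int, str], asyncio.Task] = {}

def schedule_summarize(user_id: int, char_id: str, coro) -> asyncio.Task:
    """Schedule a summarization task, cancelling any in-flight one for the same pair."""
    key = (user_id, char_id)
    old = _pending.get(key)
    if old is not None and not old.done():
        old.cancel()  # the stale window would produce an overlapping summary
    task = asyncio.create_task(coro)
    _pending[key] = task
    # Only drop the entry if it still points at this task
    # (a newer task may already have replaced it).
    task.add_done_callback(
        lambda t, k=key: _pending.pop(k, None) if _pending.get(k) is t else None
    )
    return task
```

The done-callback guard matters: popping unconditionally would evict the replacement task's entry when a cancelled predecessor finishes.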
User clears history mid-summarize. A user hits "reset chat" while a summary is in flight. The summary then writes to a collection that just got deleted. Fix: re-check r.exists(key) before writing; bail if the list is gone.
Empty summaries cached. LLM rate-limited, returned empty content — and I was caching the empty string with a 3-day TTL. Fix: an `if summary:` guard before `setex`.
ChromaDB collection doesn't exist for new users. `col.query` raises on a non-existent collection. Wrap in `try/except` and return empty — normal for a user's first few messages.
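A sketch of the guarded retrieval, with the ChromaDB client passed in explicitly and the collection name an assumption; `query_texts`/`n_results` and the `res["documents"][0]` shape are the real collection-query API:

```python
import asyncio

TOP_K = 3  # assumed default

async def get_relevant_memories(client, user_id: int, char_id: str, query: str) -> list[str]:
    """Top-K semantic search over stored summaries.

    A brand-new user has no collection yet, and ChromaDB raises when you
    query one that doesn't exist; that's normal, so return an empty list.
    """
    def _query() -> list[str]:
        col = client.get_collection(f"mem_{user_id}_{char_id}")
        res = col.query(query_texts=[query], n_results=TOP_K)
        return res["documents"][0]
    try:
        # Chroma's client is synchronous; keep it off the event loop.
        return await asyncio.to_thread(_query)
    except Exception:
        return []
```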
What I'd change if starting over
- Skip pgvector for this shape of workload. Two weeks on it first; for my short-query summaries, recall was worse than ChromaDB and reindexing pain wasn't worth it.
- Don't embed per message. Index exploded, recall didn't improve. Summary-level is the right granularity.
- Summarize fixed-size windows, not time-based batches. Daily summaries are useless for users who chatted 500 times in one day.
- Build the cancellation pattern from day 1. Race conditions around user actions (clear history, switch character) became one of the top sources of production bugs.
Where this lives
HoneyChat — an AI companion that runs both as a Telegram bot and a web app on the same backend. The architecture above is in production. Try it: @HoneyChatAIBot on Telegram or honeychat.bot in the browser.
Public docs: github.com/sm1ck/honeychat — service topology, API surface, major flows.
Next in the series: LLM routing per tier — why one model doesn't fit all, and how to handle content_filter errors from reasoning models.
References
- ChromaDB docs
- Redis LTRIM
- aiogram
- OpenRouter
- Character.AI pinned memories
- Character.AI chat memories
- Replika memory docs
- ChatGPT Memory FAQ
If you're building something similar and have questions about the memory layout or the summarization pipeline, drop a comment. Especially curious how others handle race conditions around user-initiated state resets.