It was 1 a.m. when a DingTalk alert yanked me out of sleep. Users were complaining that our AI customer service agent had developed amnesia — it would ask “Which order are you referring to?” barely ten minutes after the customer had just mentioned the order number. My first guess was a broken context window, but after digging through the logs, I realized the truth: the agent’s memories were indeed stored in ChromaDB, but retrieval was completely failing.
Memory persistence is the backbone of our in‑house agent framework. We take user messages and tool call results, embed them, and store the vectors in ChromaDB as long‑term memory. Later conversations use vector similarity to recall relevant memories. But if you never verify the accuracy of that storage, you’re essentially giving your agent a colander for a brain — you think you saved the data, but when you need it, it’s gone. Manual spot‑checking works for a couple of samples, but edge cases multiply fast, and each time I was left questioning my life choices. That’s when I doubled down: automate the whole store‑and‑recall loop with Pytest and verify it down to the similarity score.
The problem: stored doesn’t mean retrievable
Here’s the scenario: during multi‑turn conversations, the agent summarizes key facts (order numbers, timestamps, preferences) into vectors and writes them into ChromaDB. Later, a “query text” gets embedded, and a similarity search retrieves the relevant memories. Sounds straightforward, but the devil is in the details.
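To make the moving parts concrete, here is a minimal sketch of that store‑and‑recall path, assuming ChromaDB's default embedding function; the collection name, ids, and texts are made up for illustration and are not our production module:

# memory_flow_sketch.py — illustrative only
import chromadb
from chromadb.config import Settings

client = chromadb.Client(Settings(chroma_db_impl="duckdb+parquet", persist_directory=None))
coll = client.create_collection("agent_memory", metadata={"hnsw:space": "cosine"})

# Write path: a summarized fact from the conversation becomes a document + metadata
coll.add(
    ids=["fact-001"],
    documents=["Customer's order SO-20240117-889 was reported damaged at 21:40"],
    metadatas=[{"session": "s-42", "kind": "order_issue"}],
)

# Read path: a later user turn is embedded and matched by cosine similarity
results = coll.query(query_texts=["the damaged order I mentioned earlier"], n_results=1)
print(results["ids"][0], results["distances"][0])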
The root cause had two layers:
- Too many implicit assumptions about vector comparison. Are you using Euclidean distance or cosine similarity? Is the embedding model output normalized? Is a threshold of 0.7 sufficient? If any of these parameters drift between the write and read paths, storage and retrieval end up living in two different universes (the short sketch after this list shows how a similarity threshold maps onto a cosine distance cutoff).
- Manual validation is a joke. I tried printing a few vectors straight from the Chroma client and comparing numbers by eye — pure self‑deception. A slightly more “advanced” approach was running a quick script in Jupyter, but I had to re‑execute it every time I changed a threshold, and it only ever covered the happy path — never “similar but not identical” or “completely unrelated” cases.
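To see why those assumptions bite, here is a toy illustration with hand-rolled vectors (nothing to do with our real embeddings) of how a similarity threshold turns into a distance cutoff:

# similarity_sketch.py — toy vectors for illustration only
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

a = [0.2, 0.9, 0.1]
b = [0.25, 0.85, 0.05]

sim = cosine_similarity(a, b)
# With a cosine-distance space, the reported distance is 1 - similarity,
# so a similarity threshold of 0.7 means asserting distance <= 0.3.
print(f"similarity={sim:.3f}, distance={1 - sim:.3f}, cutoff for 0.7 threshold: 0.3")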
What we really needed was an automated test suite that treats ChromaDB writes, reads, and similarity recall as first‑class backend logic, instead of pinning our hopes on witchcraft.
The plan: Pytest + ChromaDB in‑memory + similarity‑aware assertions
The tech stack was a no‑brainer: Pytest for test orchestration, ChromaDB’s chromadb.Client configured with Settings(chroma_db_impl="duckdb+parquet", persist_directory=None) to run entirely in memory. Tests start with a clean slate and leave zero trace.
Why not other approaches?
- Mock ChromaDB? That misses the whole point. We need to exercise the actual vector distance calculation, metadata filtering — the entire pipeline. Mocking it would be lying to ourselves.
- Unittest? It would work, but Pytest’s fixtures and @pytest.mark.parametrize are perfect for running matrix tests across multiple thresholds and input texts.
- Spin up a persistent ChromaDB for integration tests? Too heavy, and concurrent tests would step on each other. In‑memory mode sidesteps all of that.
The architecture is simple: each test gets its own collection via a fixture that injects a clean ChromaClient and a fresh collection. Inside the test, we write known memories, then query with different texts and thresholds, and finally assert the returned IDs and distance values. On top of that, I built a memory_verifier utility that wraps the “write → query → assert” mental model. Test cases read almost like natural‑language instructions.
Core implementation: from fixtures to a reusable verifier
The code below solves the “every test gets an isolated, reproducible ChromaDB sandbox” problem.
# conftest.py
import chromadb
import pytest
from chromadb.config import Settings


@pytest.fixture
def chroma_client():
    """Create a purely in-memory ChromaDB client, destroyed automatically when the test ends."""
    client = chromadb.Client(Settings(
        chroma_db_impl="duckdb+parquet",
        persist_directory=None  # no persistence
    ))
    yield client
    # Teardown: letting the client be garbage-collected is enough, but an explicit delete is safer
    del client


@pytest.fixture
def memory_collection(chroma_client):
    """Create an isolated collection for each test, so data never leaks between tests."""
    coll = chroma_client.create_collection(
        name="test_memory",
        metadata={"hnsw:space": "cosine"}  # declare cosine similarity as the distance space
    )
    return coll
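As a quick sanity check before reaching for the verifier, a test can consume the fixture directly. This is a minimal sketch (test name and texts are illustrative) confirming that a freshly written memory comes back as the nearest hit:

# test_fixture_smoke.py — minimal sketch of using the fixtures directly
def test_stored_memory_is_nearest_hit(memory_collection):
    memory_collection.add(
        ids=["m1"],
        documents=["Order SO-20240117-889 was reported damaged"],
        metadatas=[{"kind": "order_issue"}],
    )
    results = memory_collection.query(
        query_texts=["the damaged order"], n_results=1
    )
    assert results["ids"][0] == ["m1"]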
And here’s the part that encapsulates “write → query → assert” into a single readable sentence. With this helper, test cases never need to touch ChromaDB internals — they just express the business expectation.
# verifiers.py
from typing import List, Optional

from chromadb import Collection


def verify_memory_accuracy(
    coll: Collection,
    memories: List[dict],  # [{"id": "1", "document": "...", "metadata": {...}}]
    query_text: str,
    expected_ids: List[str],
    threshold: float = 0.7,
    top_k: Optional[int] = None,
    metadata_filter: Optional[dict] = None
):
    """
    Write the given memories -> query with query_text -> assert that the recalled ids
    are exactly expected_ids. Also check that every distance is <= (1 - threshold),
    so each recalled memory actually clears the similarity bar.
    """
    # 1. Write all memories
    ids = [m["id"] for m in memories]
    docs = [m["document"] for m in memories]
    metas = [m.get("metadata") for m in memories]
    coll.add(ids=ids, documents=docs, metadatas=metas)

    # 2. Query; without an explicit top_k, recall exactly as many results as we expect
    query_params = {
        "query_texts": [query_text],
        "n_results": top_k or len(expected_ids),
    }
    if metadata_filter:
        query_params["where"] = metadata_filter
    results = coll.query(**query_params)

    # 3. Recalled ids must be exactly the expected ones, in recall (distance) order
    got_ids = results["ids"][0]
    assert got_ids == expected_ids, f"Recalled {got_ids}, expected {expected_ids}"

    # 4. Every recalled memory must clear the similarity threshold.
    #    In cosine space the reported distance is 1 - similarity, so distance <= 1 - threshold.
    for memory_id, distance in zip(got_ids, results["distances"][0]):
        assert distance <= 1 - threshold, (
            f"Memory {memory_id} distance {distance:.4f} exceeds {1 - threshold:.4f}"
        )
The `top_k or len(expected_ids)` fallback is the detail that saved me from yet another 3‑hour debugging session: when a test doesn’t pass an explicit top_k, the query recalls exactly as many results as the test expects, so the strict id assertion stays meaningful instead of being diluted by extra near neighbors.
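Here is a sketch of how a test module could use the fixtures and the verifier together, assuming verifiers.py is importable from the test path; the file name, memory texts, and threshold values are illustrative and would need tuning against whichever embedding model you actually use:

# test_memory_recall.py — sketch; texts and thresholds are illustrative
import pytest

from verifiers import verify_memory_accuracy

MEMORIES = [
    {"id": "m1", "document": "User reported order SO-20240117-889 arrived damaged",
     "metadata": {"kind": "order_issue"}},
    {"id": "m2", "document": "User prefers refunds over replacement shipments",
     "metadata": {"kind": "preference"}},
]


@pytest.mark.parametrize("query_text,expected_ids,threshold", [
    # Near-verbatim mention of the order number should recall the order memory
    ("What happened with order SO-20240117-889?", ["m1"], 0.7),
    # A paraphrased preference should still recall the preference memory at a looser threshold
    ("Does this customer want a refund or a new shipment?", ["m2"], 0.5),
])
def test_recall_matches_expected(memory_collection, query_text, expected_ids, threshold):
    verify_memory_accuracy(
        coll=memory_collection,
        memories=MEMORIES,
        query_text=query_text,
        expected_ids=expected_ids,
        threshold=threshold,
    )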
With this foundation, every edge case — from “almost the same order number” to “a completely unrelated query” — becomes a simple, repeatable test that catches mismatched thresholds, embedding drift, and metadata filtering bugs before they reach production. The 3‑hour nightmare turned into a 30‑second pytest run, and the agent’s amnesia was finally cured.