SQLite at Scale: Million-Entry Knowledge Base
Tian AI demonstrates that you don't need a cloud database to build a powerful knowledge base. Using SQLite with FTS5 full-text search and custom optimizations, we achieve sub-0.05 second retrieval across a million entries -- all on your phone.
Data Generation Strategy
# Synthetic data generation with realistic patterns
import json

entries = []
for i in range(1_000_000):
    title = generate_title(i)       # project-specific generators
    content = generate_content(i)
    tags = random_tags(2, 5)        # 2 to 5 random tags per entry
    entries.append((title, content, json.dumps(tags)))
The key insight: batch insert 10,000 rows at a time with executemany(), wrapped in explicit transactions. Without an explicit transaction, SQLite commits (and syncs to disk) after every single INSERT; batching cuts per-insert overhead from ~10ms to ~0.02ms.
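The batching pattern can be sketched as follows -- a minimal, self-contained version with an assumed four-column schema (the real table layout may differ). sqlite3's connection context manager opens a transaction and commits it on success, so each batch costs one commit instead of one per row:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE knowledge_base (id INTEGER PRIMARY KEY, title TEXT, content TEXT, tags TEXT)"
)

def insert_batched(conn, rows, batch_size=10_000):
    """Insert rows in large batches, one explicit transaction per batch."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        with conn:  # BEGIN ... COMMIT around the whole batch
            conn.executemany(
                "INSERT INTO knowledge_base (title, content, tags) VALUES (?, ?, ?)",
                batch,
            )

# Small demo dataset standing in for the full million entries
rows = [(f"title {i}", f"content {i}", json.dumps(["demo"])) for i in range(25_000)]
insert_batched(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM knowledge_base").fetchone()[0]
```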
FTS5 with Chinese Text Segmentation
Chinese text doesn't have spaces between words, making full-text search challenging. The solution uses jieba for tokenization:
import jieba

def chinese_fts5_tokenize(text):
    # Precise mode; join with spaces so FTS5 sees word boundaries
    words = jieba.cut(text, cut_all=False)
    return ' '.join(words)
For each entry, we store both the raw text AND a space-separated tokenized version, allowing FTS5 to match Chinese terms effectively.
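To see why pre-segmentation works, here is a minimal sketch using only the standard library -- the tokenized string is hand-segmented for illustration, standing in for jieba's output. Once words are space-separated, FTS5's default unicode61 tokenizer treats each Chinese word as its own token, so query terms (tokenized the same way) match:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Index the space-separated tokenized copy; the raw text lives elsewhere
conn.execute("CREATE VIRTUAL TABLE kb_fts USING fts5(title, content_tok)")

# Hand-segmented stand-in for jieba output ("machine learning is fun")
tokenized = "机器 学习 很 有趣"
conn.execute("INSERT INTO kb_fts (title, content_tok) VALUES (?, ?)", ("demo", tokenized))

# Query terms must be segmented the same way before matching
hits = conn.execute(
    "SELECT title FROM kb_fts WHERE kb_fts MATCH ?", ("学习",)
).fetchall()
```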
Index Optimization
CREATE INDEX idx_knowledge_timestamp ON knowledge_base(created_at);
CREATE INDEX idx_knowledge_tags ON knowledge_base(tags, id);
CREATE VIRTUAL TABLE knowledge_fts USING fts5(title, content, tags, content=knowledge_base);
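One caveat worth noting: an external-content FTS5 table (content=...) is not updated automatically when the base table changes -- SQLite's documented pattern is to mirror writes with triggers. A sketch with an assumed schema (update trigger omitted for brevity; it combines the delete and insert forms):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE knowledge_base (
    id INTEGER PRIMARY KEY,
    title TEXT, content TEXT, tags TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE VIRTUAL TABLE knowledge_fts USING fts5(
    title, content, tags,
    content=knowledge_base, content_rowid=id
);
-- Mirror base-table writes into the FTS index
CREATE TRIGGER kb_ai AFTER INSERT ON knowledge_base BEGIN
    INSERT INTO knowledge_fts(rowid, title, content, tags)
    VALUES (new.id, new.title, new.content, new.tags);
END;
CREATE TRIGGER kb_ad AFTER DELETE ON knowledge_base BEGIN
    INSERT INTO knowledge_fts(knowledge_fts, rowid, title, content, tags)
    VALUES ('delete', old.id, old.title, old.content, old.tags);
END;
""")

conn.execute(
    "INSERT INTO knowledge_base (title, content, tags) VALUES ('t', 'hello world', '[]')"
)
n = conn.execute(
    "SELECT COUNT(*) FROM knowledge_fts WHERE knowledge_fts MATCH 'hello'"
).fetchone()[0]
```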
Performance Tuning
- Page size: PRAGMA page_size=4096 for better read performance (takes effect only before the database is populated, or after a VACUUM)
- Cache: PRAGMA cache_size=-8000 (8MB cache)
- MMAP: PRAGMA mmap_size=268435456 (256MB memory-mapped I/O)
- WAL mode: PRAGMA journal_mode=WAL for concurrent reads
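These pragmas are per-connection (except page_size, which is a property of the file), so a tuned connection factory is a natural place for them. A minimal sketch:

```python
import sqlite3

def open_tuned(path):
    """Open a connection with the tuning pragmas applied."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")     # readers don't block the writer
    conn.execute("PRAGMA cache_size=-8000")     # negative = size in KiB, ~8 MB
    conn.execute("PRAGMA mmap_size=268435456")  # 256 MB memory-mapped I/O
    return conn

conn = open_tuned(":memory:")
cache = conn.execute("PRAGMA cache_size").fetchone()[0]
```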
Retrieval Speed
- Single lookup by ID: 0.001s
- FTS5 search (top 10): 0.02s
- Complex join query: 0.04s
- Full-text search across 1M entries: 0.04s
The entire knowledge base occupies only 380MB on disk, making it perfectly viable for mobile deployment.
Architecture
The knowledge base integrates seamlessly with the Thinker module's Deep mode -- when the LLM needs factual context, the KB retrieves relevant entries, formats them as context, and injects them into the prompt. The entire pipeline completes in under 100ms.
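The retrieve-format-inject pipeline can be sketched as below. This is an illustrative outline, not the project's actual code: the function names, the prompt template, and the identity tokenize_query (standing in for the jieba-based tokenizer described earlier) are all assumptions:

```python
import sqlite3

def tokenize_query(q):
    # Stand-in for the jieba-based segmentation described above
    return q

def build_prompt(conn, question, top_k=10):
    """Sketch of the Deep-mode pipeline: retrieve KB hits, format as context, inject."""
    hits = conn.execute(
        "SELECT title, content FROM kb_fts WHERE kb_fts MATCH ? LIMIT ?",
        (tokenize_query(question), top_k),
    ).fetchall()
    context = "\n".join(f"[{title}] {content}" for title, content in hits)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Demo with a tiny in-memory index
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE kb_fts USING fts5(title, content)")
conn.execute("INSERT INTO kb_fts VALUES ('sqlite', 'sqlite is an embedded database')")
prompt = build_prompt(conn, "embedded")
```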