Tian AI Architecture Deep Dive: Building a Multi-Engine AI System
This post takes a deep technical look at the architecture of Tian AI — an open-source, self-evolving local AI system. If you haven't read the overview, check out Tian AI: The Self-Evolving AI System Powered by Qwen2.5.
Project Architecture Overview
Tian AI is organized as a multi-engine system with six core modules. Here's the complete architecture:
┌──────────────────────────────────────────────────────────────────┐
│ CLI / Web / Gradio UI │
├──────────────────────────────────────────────────────────────────┤
│ Flask API Layer │
├──────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Thinker (LLM Engine) │ │
│ │ ┌────────────┐ ┌────────────┐ ┌──────────────────┐ │ │
│ │ │ Fast Mode │ │ CoT Mode │ │ Deep Mode │ │ │
│ │ │ (single │ │ (step-by- │ │ (multi-view + │ │ │
│ │ │ pass) │ │ step) │ │ reflection) │ │ │
│ │ └────────────┘ └────────────┘ └──────────────────┘ │ │
│ └─────────────────────────┬────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────┴────────────────────────────────┐ │
│ │ Talker (Dialog) │ │
│ │ Short-term Memory | Long-term Memory | Emotion │ │
│ └─────────────────────────┬────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────┴────────────────────────────────┐ │
│ │ Knowledge Retriever (RAG) │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Concept │ │ Pattern │ │ LLM-Augment │ │ │
│ │ │ Extraction │ │ Matching │ │ Generation │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────┬────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────┴────────────────────────────────┐ │
│ │ Agent Scheduler │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ TaskQueue │ │ Priority │ │ Security │ │ │
│ │ │ (dependency │ │ Scheduler │ │ Whitelist │ │ │
│ │ │ sorting) │ │ │ │ │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────┬────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────┴────────────────────────────────┐ │
│ │ Self-Evolution System │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ AST Analysis │ │ LLM Suggest │ │ Auto-Patch │ │ │
│ │ │ (code scan) │ │ (improvement)│ │ (backup + │ │ │
│ │ │ │ │ │ │ verify) │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ XP System │ │ Version │ │ │
│ │ │ (leveling) │ │ Manager │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
├──────────────────────────────────────────────────────────────────┤
│ LLMManager (Process Lifecycle) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Process Spawn│ │ Health Check │ │ Auto-Restart │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
├──────────────────────────────────────────────────────────────────┤
│ llama.cpp Backend (Qwen2.5-1.5B GGUF) │
└──────────────────────────────────────────────────────────────────┘
The system uses Flask as the central web server, with Gradio as an alternative UI frontend. All communication between modules goes through Python function calls (direct in-process calls) rather than microservice RPC, keeping latency minimal on constrained devices.
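As a sketch of that direct-call pattern (route and function names here are assumptions, and the stub stands in for the real Thinker), a Flask handler simply invokes the engine function in-process:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def fast_think(query: str) -> str:
    # Placeholder for the real Thinker engine call
    return f"answer to: {query}"

@app.route("/chat", methods=["POST"])
def chat():
    query = request.get_json().get("query", "")
    # Direct in-process call — no serialization, no RPC hop
    return jsonify({"response": fast_think(query)})
```

Because every module shares one interpreter, a request touches the Thinker, Retriever, and Scheduler without ever crossing a process boundary.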
1. Thinker — The Three-Layer Reasoning Engine
The Thinker is the most critical module. It wraps the local LLM with three distinct reasoning strategies, each optimized for different query types.
Fast Mode (Default)
For simple queries — greetings, fact lookup, straightforward questions. Single pass through the LLM with minimal context.
# Simplified Fast Mode flow
def fast_think(query: str, context: str = "") -> str:
prompt = _build_simple_prompt(query, context)
# Single LLM call, no chaining
response = llm.generate(prompt, max_tokens=256, temperature=0.3)
return response
Characteristics:
- Single LLM call, no chaining
- Low temperature (0.3) for deterministic answers
- 256 max tokens for speed
- ~1-3 seconds on mobile hardware
Chain-of-Thought Mode
For problems that benefit from step-by-step reasoning. The LLM is prompted to reason aloud before answering.
# Simplified CoT Mode flow
def cot_think(query: str) -> str:
prompt = f"""Question: {query}
Let's solve this step by step:
1) First, I need to understand what's being asked...
2) Let me break down the key components...
3) Considering each part...
4) Therefore, the answer is:"""
full_response = llm.generate(prompt, max_tokens=512, temperature=0.5)
# Extract final answer from reasoning chain
answer = _extract_final_answer(full_response)
return answer
Key implementation details:
- Higher temperature (0.5) allows creative reasoning paths
- 512 max tokens to accommodate the reasoning chain
- Answer extraction uses regex patterns to find the final conclusion
- Context window management: the reasoning process is truncated if it exceeds the model's limit
Deep Mode
For complex analysis requiring multi-perspective evaluation and reflection. This is the most sophisticated mode.
# Simplified Deep Mode flow
def deep_think(query: str) -> dict:
# Step 1: Generate multiple perspectives
perspectives = [
_ask_perspective(query, "technical"),
_ask_perspective(query, "ethical"),
_ask_perspective(query, "practical")
]
# Step 2: Cross-perspective synthesis
synthesis_prompt = f"""
Original query: {query}
Perspectives gathered:
1. Technical: {perspectives[0]}
2. Ethical: {perspectives[1]}
3. Practical: {perspectives[2]}
Synthesize these perspectives into a comprehensive answer.
Note areas of agreement and disagreement.
Provide a balanced final assessment.
"""
synthesis = llm.generate(synthesis_prompt, max_tokens=768, temperature=0.7)
# Step 3: Self-reflection
reflection_prompt = f"""
Original query: {query}
My synthesized answer: {synthesis}
Critically evaluate your own answer. What might be missing?
What assumptions were made? Is the reasoning sound?
"""
reflection = llm.generate(reflection_prompt, max_tokens=256, temperature=0.4)
return {
"perspectives": perspectives,
"synthesis": synthesis,
"reflection": reflection,
"final": _combine(synthesis, reflection)
}
Implementation notes:
- 3-5 independent perspective generations using different prompt frames
- Each perspective call can run independently (parallelizable)
- Synthesis phase combines viewpoints and identifies conflicts
- Reflection phase adds a meta-cognitive layer
- Total: 4-6 LLM calls per deep query
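Since each perspective call is independent, the fan-out can be sketched with a thread pool (names assumed; the stub stands in for the real LLM call):

```python
from concurrent.futures import ThreadPoolExecutor

def _ask_perspective(query: str, frame: str) -> str:
    # Stand-in for an llm.generate(...) call with a frame-specific prompt
    return f"[{frame}] view on: {query}"

def gather_perspectives(query, frames=("technical", "ethical", "practical")):
    # pool.map preserves input order, so results line up with the frames
    with ThreadPoolExecutor(max_workers=len(frames)) as pool:
        return list(pool.map(lambda f: _ask_perspective(query, f), frames))
```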
The Thinker module also handles prompt caching via an LRU+TTL cache (PromptCache), which avoids regenerating responses for identical queries within a configurable time window.
2. Knowledge Retriever — RAG Implementation
The Knowledge Retriever implements a Retrieval-Augmented Generation (RAG) pipeline using a local SQLite database as the document store.
Database Architecture
The knowledge base is a pre-built SQLite database containing millions of entries:
-- Core tables
CREATE TABLE concepts (
id INTEGER PRIMARY KEY,
name TEXT UNIQUE,
domain TEXT,
description TEXT
);
CREATE TABLE qa_pairs (
id INTEGER PRIMARY KEY,
concept_id INTEGER REFERENCES concepts(id),
pattern_id INTEGER,
question TEXT,
answer TEXT,
confidence REAL DEFAULT 1.0
);
-- Full-text search index
CREATE VIRTUAL TABLE qa_fts USING fts5(
question, answer, concept_name,
content='qa_pairs',
content_rowid='id'
);
Retrieval Flow
User Query
↓
[1] Concept Extraction (keyword matching + NER)
↓
[2] FTS5 Search on SQLite
↓
[3] Score & Rank Results (BM25 + confidence weighting)
↓
┌─────────────────┐
│ confidence > 0.8 │──Yes──→ Return KB answer directly
└────────┬────────┘
│ No
↓
[4] Context Assembly (top-3 results as context)
↓
[5] LLM Augmented Generation
↓
Final Response
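The flow above can be condensed into a single function (assumed names; `search_kb` and `llm_generate` stand in for the real FTS5 search and LLM call):

```python
CONFIDENCE_THRESHOLD = 0.8

def retrieve(query, search_kb, llm_generate):
    hits = search_kb(query)  # [(answer, confidence), ...] ranked best-first
    if hits and hits[0][1] > CONFIDENCE_THRESHOLD:
        return hits[0][0]                  # High confidence: return KB answer directly
    # Low confidence: assemble top-3 hits as context and augment with the LLM
    context = "\n".join(answer for answer, _ in hits[:3])
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm_generate(prompt)
```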
Performance
| Operation | Time | Notes |
|---|---|---|
| Concept extraction | ~0.01s | Regex + keyword matching |
| FTS5 search | ~0.02s | Indexed full-text search |
| Result ranking | ~0.01s | BM25 scoring |
| Total retrieval | ~0.04s | Without LLM call |
| LLM augmentation | ~1-3s | Depends on context size |
Key Design Decisions
SQLite over vector DB: No need for embeddings or vector similarity search. The structured QA pairs with FTS5 provide faster and more predictable results than embedding-based retrieval on a mobile device.
Confidence threshold of 0.8: Tuned empirically. Below this, the LLM augmentation adds significant value. Above it, the KB answer is already reliable.
30 question patterns per concept: Each concept has 30 pre-written question templates (e.g., "What is X?", "Explain X", "How does X work?"), ensuring flexible matching against diverse user inputs.
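A hypothetical sketch of how those per-concept templates expand (only three of the templates are quoted in the source, so the list here is illustrative):

```python
# Illustrative subset — the real system uses 30 templates per concept
PATTERNS = [
    "What is {x}?",
    "Explain {x}",
    "How does {x} work?",
]

def expand_patterns(concept: str) -> list:
    # Each concept fans out into one question row per template
    return [p.format(x=concept) for p in PATTERNS]
```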
3. Agent Scheduler — TaskQueue + Security Whitelist
The Agent Scheduler is the orchestration layer that routes tasks between engines, manages concurrency, and enforces security policies.
TaskQueue with Dependency Sorting
import threading
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

class Task:
def __init__(self, task_id, func, args, kwargs,
depends_on=None, priority=0, timeout=30):
self.id = task_id
self.func = func # Callable
self.args = args
self.kwargs = kwargs
self.depends_on = depends_on or [] # List of Task IDs
self.priority = priority
self.timeout = timeout
self.status = "pending" # pending → running → done/failed
self.result = None
class TaskQueue:
def __init__(self):
self.tasks = {} # task_id → Task
self.results = {} # task_id → result
self._lock = threading.Lock()
def add_task(self, task):
self.tasks[task.id] = task
def get_ready_tasks(self):
"""Return tasks whose dependencies are all met."""
ready = []
for tid, task in self.tasks.items():
if task.status != "pending":
continue
deps_met = all(
dep_id in self.results
for dep_id in task.depends_on
)
if deps_met:
ready.append(task)
# Sort by priority (higher = first)
ready.sort(key=lambda t: -t.priority)
return ready
def execute(self, max_workers=4):
"""Execute ready tasks with thread pool."""
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = {}
            # Loop until every task has produced a result (done or failed);
            # `while self.tasks:` would spin forever since tasks are never removed
            while len(self.results) < len(self.tasks):
ready = self.get_ready_tasks()
if not ready:
if not futures:
break # Deadlock or done
# Wait for some futures to complete
done, _ = wait(futures, return_when=FIRST_COMPLETED)
self._collect_results(done, futures)
continue
for task in ready:
task.status = "running"
future = pool.submit(self._run_task, task)
futures[future] = task.id
done, _ = wait(futures, return_when=FIRST_COMPLETED)
self._collect_results(done, futures)
Security Whitelist
All Agent actions are filtered through a security whitelist that prevents unauthorized system operations:
class SecurityError(Exception):
    """Raised when an action violates the whitelist."""

class SecurityWhitelist:
ALLOWED_FUNCTIONS = {
"thinker.fast_think",
"thinker.cot_think",
"thinker.deep_think",
"knowledge.search",
"knowledge.query",
"memory.store",
"memory.retrieve",
"evolution.add_xp",
"evolution.check_level",
"evolution.apply_patch",
}
ALLOWED_PATHS = {
"/data/data/com.termux/files/home/miniGPT_project/Tian AI/",
"/tmp/tian_ai/",
}
ALLOWED_IMPORTS = {
"json", "os", "re", "datetime",
"sqlite3", "threading", "subprocess"
}
@classmethod
def validate_action(cls, action_name, path=None, imports=None):
if action_name not in cls.ALLOWED_FUNCTIONS:
raise SecurityError(f"Function {action_name} not allowed")
if path and not any(path.startswith(p) for p in cls.ALLOWED_PATHS):
raise SecurityError(f"Path {path} not allowed")
if imports:
bad = set(imports) - cls.ALLOWED_IMPORTS
if bad:
raise SecurityError(f"Import(s) not allowed: {bad}")
return True
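One step the source doesn't show is how a candidate snippet's imports are collected before the whitelist check; a sketch of that integration using the stdlib `ast` module:

```python
import ast

def collect_imports(source: str) -> set:
    # Gather top-level module names from both `import x` and `from x import y`
    tree = ast.parse(source)
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found

ALLOWED_IMPORTS = {"json", "os", "re", "datetime", "sqlite3", "threading", "subprocess"}
bad = collect_imports("import socket\nimport json") - ALLOWED_IMPORTS
```

Anything left in `bad` would trigger the `SecurityError` path in `validate_action`.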
4. Self-Evolution System — AST Analysis + Auto-Patching
The Self-Evolution system is the most distinctive component. It enables Tian AI to analyze its own source code and apply improvements autonomously.
AST Analysis Pipeline
import ast
import asttokens # Optional, for source-level annotations
class CodeAnalyzer:
def __init__(self, project_root):
self.root = project_root
self.report = {
"files": 0,
"total_lines": 0,
"functions": 0,
"classes": 0,
"complexity": {},
"duplications": [],
"issues": []
}
def analyze_file(self, filepath):
with open(filepath) as f:
source = f.read()
tree = ast.parse(source)
# Count functions and classes
functions = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
classes = [n for n in ast.walk(tree) if isinstance(n, ast.ClassDef)]
# Calculate cyclomatic complexity per function
for func in functions:
complexity = self._calc_complexity(func)
self.report["complexity"][f"{filepath}:{func.name}"] = complexity
# Detect long functions (>50 lines)
for func in functions:
if func.end_lineno - func.lineno > 50:
self.report["issues"].append({
"type": "long_function",
"file": filepath,
"function": func.name,
"lines": func.end_lineno - func.lineno
})
# Detect duplicate code blocks
self._detect_duplicates(filepath, tree)
self.report["files"] += 1
self.report["total_lines"] += len(source.splitlines())
self.report["functions"] += len(functions)
self.report["classes"] += len(classes)
def _calc_complexity(self, func_node):
"""McCabe cyclomatic complexity."""
base = 1
for node in ast.walk(func_node):
if isinstance(node, (ast.If, ast.While, ast.For,
ast.ExceptHandler, ast.With,
ast.Assert)):
base += 1
elif isinstance(node, ast.BoolOp):
base += len(node.values) - 1
return base
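To see the counting rule in action, here is a standalone demo of the same McCabe logic applied to a small snippet:

```python
import ast

def calc_complexity(func_node) -> int:
    # Same rule as _calc_complexity above: +1 per branch point,
    # +N-1 for each boolean expression with N operands
    base = 1
    for node in ast.walk(func_node):
        if isinstance(node, (ast.If, ast.While, ast.For,
                             ast.ExceptHandler, ast.With, ast.Assert)):
            base += 1
        elif isinstance(node, ast.BoolOp):
            base += len(node.values) - 1
    return base

source = """
def check(x):
    if x > 0 and x < 10:
        return "small"
    for i in range(x):
        if i % 2:
            return i
"""
func = ast.parse(source).body[0]
print(calc_complexity(func))  # two ifs + one for + one `and` + base
```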
Auto-Patching System
import os
import shutil
from datetime import datetime

class PatchError(Exception):
    """Raised when a patch cannot be applied safely."""

class PatchEngine:
def __init__(self):
self.backup_dir = "/data/data/com.termux/files/home/miniGPT_project/Tian AI/backups/"
def apply_patch(self, filepath, old_code, new_code):
"""Apply a code patch with automatic backup."""
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
backup_path = f"{self.backup_dir}{os.path.basename(filepath)}.{timestamp}.bak"
# 1. Backup original
shutil.copy2(filepath, backup_path)
# 2. Apply the change
with open(filepath, 'r') as f:
full_source = f.read()
if old_code not in full_source:
raise PatchError("Old code not found — patch rejected")
new_source = full_source.replace(old_code, new_code, 1)
# 3. Verify syntax before saving
try:
compile(new_source, filepath, 'exec')
except SyntaxError as e:
raise PatchError(f"Syntax error in patch: {e}")
# 4. Save
with open(filepath, 'w') as f:
f.write(new_source)
return {
"status": "applied",
"backup": backup_path,
"file": filepath
}
The Full Evolution Loop
1. SCAN ──→ AST walk all .py files
↓
2. ANALYZE ──→ Complexity metrics, duplication detection, code smells
↓
3. SUGGEST ──→ Send analysis report to LLM with structured prompt
↓
4. DECIDE ──→ LLM returns patches (old code → new code)
↓
5. APPLY ──→ PatchEngine applies with backup + syntax verification
↓
6. VERIFY ──→ Run compile(), run basic assertions
↓
7. COMMIT ──→ If verified, git commit with auto-generated message
Each successful evolution cycle grants XP to the system, which contributes to leveling up and unlocking new capabilities.
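The seven steps can be condensed into a single driver function (the function names and the XP rate here are assumptions; the real implementations would be injected):

```python
def evolution_cycle(scan, analyze, suggest, apply_patch, verify, commit, add_xp):
    report = analyze(scan())             # 1-2: SCAN + ANALYZE
    patches = suggest(report)            # 3-4: SUGGEST + DECIDE (LLM)
    applied = []
    for patch in patches:
        result = apply_patch(patch)      # 5: APPLY (backup + syntax check)
        if verify(result):               # 6: VERIFY
            applied.append(result)
        # Failed patches are restored from backup by the PatchEngine
    if applied:
        commit(applied)                  # 7: COMMIT
        add_xp(len(applied) * 10)        # XP per successful patch (assumed rate)
    return applied
```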
5. LLMManager — Process Lifecycle Management
The LLMManager is responsible for starting, monitoring, and restarting the llama.cpp server process. This is critical because the LLM backend is a separate C++ process that can crash under memory pressure.
import logging
import subprocess
import time

import requests

logger = logging.getLogger(__name__)

class LLMManager:
def __init__(self, model_path, port=8080, threads=4, context=2048):
self.model_path = model_path
self.port = port
self.threads = threads
self.context = context
self.process = None
self._health_url = f"http://localhost:{port}/health"
self.restart_count = 0
self.max_restarts = 5
def start(self):
"""Spawn the llama-server process."""
cmd = [
"llama-server",
"-m", self.model_path,
"--port", str(self.port),
"-t", str(self.threads),
"-c", str(self.context),
"--mlock", # Prevent swapping
"--no-mmap", # Use malloc instead of mmap
]
self.process = subprocess.Popen(
cmd,
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL
)
def health_check(self):
"""Check if the LLM process is responsive."""
try:
resp = requests.get(self._health_url, timeout=3)
return resp.status_code == 200
        except requests.RequestException:
return False
def wait_until_ready(self, timeout=60):
"""Poll health endpoint until the model is loaded."""
start = time.time()
while time.time() - start < timeout:
if self.health_check():
return True
time.sleep(2)
return False
def restart(self):
"""Graceful restart with automatic retry."""
if self.restart_count >= self.max_restarts:
raise RuntimeError("Max restarts exceeded")
self.stop()
self.start()
if self.wait_until_ready():
self.restart_count += 1
return True
return False
def auto_recover(self):
"""Monitor and auto-restart on crash."""
while True:
if not self.health_check():
logger.warning("LLM process unresponsive — restarting...")
if not self.restart():
logger.error("Failed to restart LLM process")
break
time.sleep(10) # Check every 10 seconds
def stop(self):
"""Terminate the LLM process."""
if self.process and self.process.poll() is None:
self.process.terminate()
try:
self.process.wait(timeout=5)
except subprocess.TimeoutExpired:
self.process.kill()
Key design decisions:
- `--mlock` + `--no-mmap` prevents the OS from swapping the model to disk, which would cause catastrophic slowdowns
- A separate health-check thread runs every 10 seconds
- Max 5 restart attempts before giving up (prevents infinite crash loops)
- Process is launched with stdout/stderr suppressed to avoid filling up logs on the phone
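A wiring sketch (assumed; the source doesn't show this step): the `auto_recover` loop belongs on a daemon thread so it never blocks the main server:

```python
import threading

def start_monitor(manager):
    # Daemon thread: dies with the main process, never blocks shutdown
    t = threading.Thread(target=manager.auto_recover, daemon=True)
    t.start()
    return t
```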
6. PromptCache — LRU + TTL Caching Strategy
To avoid regenerating responses for identical queries (common in multi-turn conversations), the system implements a combined LRU+TTL cache:
import time
from collections import OrderedDict

class PromptCache:
def __init__(self, max_size=100, ttl_seconds=300):
self.cache = OrderedDict() # LRU ordering
self.max_size = max_size
self.ttl = ttl_seconds
self.timestamps = {} # key → timestamp
def get(self, key):
if key not in self.cache:
return None
# Check TTL
if time.time() - self.timestamps[key] > self.ttl:
self._evict(key)
return None
# Move to end (LRU update)
self.cache.move_to_end(key)
return self.cache[key]
def put(self, key, value):
# Evict oldest if at capacity
if len(self.cache) >= self.max_size:
oldest = next(iter(self.cache))
self._evict(oldest)
self.cache[key] = value
self.timestamps[key] = time.time()
self.cache.move_to_end(key)
def _evict(self, key):
del self.cache[key]
del self.timestamps[key]
Cache key composition: f"{mode}:{query}:{context_hash}" where mode is the thinking mode, query is the user input, and context_hash is a SHA256 of the conversation context.
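A sketch of that key recipe (truncating the digest to 16 hex characters is an assumption for readability, not something the source specifies):

```python
import hashlib

def make_cache_key(mode: str, query: str, context: str) -> str:
    # SHA-256 of the conversation context, truncated for compact keys
    context_hash = hashlib.sha256(context.encode("utf-8")).hexdigest()[:16]
    return f"{mode}:{query}:{context_hash}"
```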
Project Statistics
Here are the raw numbers from the codebase:
| Metric | Value |
|---|---|
| Python files | 770 |
| Total lines of code | 170,041 |
| Core modules | 6 (Thinker, Talker, Knowledge, Agent, Evolution, LLMManager) |
| Extension languages | 3 (C, C++, Java) |
| C files | 1 (tian_hash — fast hashing) |
| C++ files | 1 (tian_engine — performance engine) |
| Java files | 1 (tian_tools — Android tooling) |
| Knowledge Base size | Millions of indexed concepts |
| Supported LLM | Qwen2.5-1.5B (GGUF quantized) |
| Backend framework | Flask |
| Alternative UI | Gradio |
The project is hosted on GitHub at github.com/3969129510/tian-ai.
What's Next
The architecture is designed for extensibility. Future directions include:
- Plugin system — Hot-loadable agent plugins with sandboxed execution
- Multi-modal pipeline — Image/audio understanding via local models
- Distributed agents — Multiple Tian AI instances collaborating over LAN
- Federated evolution — Privacy-preserving code improvement across instances
- Android APK — Standalone app packaging (no Termux required)
Getting Involved
Tian AI is fully open source. Contributions, issues, and forks are welcome.
git clone https://github.com/3969129510/tian-ai
cd tian-ai
# Explore the architecture
ls -la modules/
head -100 Thinker.py
Support development:
USDT (TRC-20): TNeUMpbwWFcv6v7tYHmkFkE7gC5eWzqbrs
BTC: bc1ph7qnaqkx4pkg4fmucvudlu3ydzgwnfmxy7dkv3nyl48wwa03kmnsvpc2xv
Tian AI — Open Source. Local. Self-Evolving.
GitHub: github.com/3969129510/tian-ai