Introduction
Through Chapter 4 (Security), we implemented Evals, Observability, and Security as individual components. In this chapter, we integrate them into a system for continuous operations.
LLMOps shares DNA with MLOps but faces fundamentally different challenges. Prompts are code, Evals replace unit tests, provider switching is routine, and costs are unpredictable.
[Before] Manually executed scripts
python evals/eval_rag.py ← run by hand
python security/secure_rag.py ← run by hand
[Now — LLMOps] Automated on every GitHub push
→ Evals check quality
→ Security validation
→ Auto-deploy when quality bar is met
The 2026 MLOps Maturity Model identifies Level 2 (CI/CD automation) as delivering the highest ROI — and most organizations sit between Level 1 and 2.
How LLMOps Differs from MLOps
LLMOps adds LLM-specific concerns on top of MLOps: prompt versioning and evaluation, hallucination monitoring, RAG retrieval quality measurement, token cost management, and content safety monitoring.
| MLOps | LLMOps |
|---|---|
| Version control model files | Version control prompts |
| Test with accuracy / loss | Test with Evals (LLM-as-a-Judge) |
| Deploy models | Deploy prompt configurations |
| Monitor data drift | Monitor answer quality and costs |
Directory Structure
pgvector-tutorial/
├── existing files
│
├── llmops/
│ ├── prompt_registry.py # ★ Prompt version management
│ ├── ci_eval.py # ★ CI evaluation script
│ └── cost_tracker.py # ★ API cost tracking
│
└── .github/
└── workflows/
└── llmops.yml # ★ GitHub Actions CI/CD pipeline
1. Prompt Version Management — llmops/prompt_registry.py
Prompts are code. They need version control, diff views, approval workflows, and rollback capability.
# llmops/prompt_registry.py
"""
Prompt version management
Changing a prompt can drastically affect RAG answer quality.
Track which version of each prompt is in use.
"""
import hashlib
from dataclasses import dataclass
PROMPTS = {
"rag_answer": {
"v1.0.0": {
"template": """Answer the question based on the following documents.
# Reference Documents
{context}
# Question
{question}
# Answer""",
"description": "Initial version",
},
"v1.1.0": {
"template": """Answer the question based on the following documents.
If the information is not in the documents, say "This information is not available in the documents."
# Reference Documents
{context}
# Question
{question}
# Answer (concise, based on the documents)""",
"description": "Anti-hallucination: explicitly instruct to not answer outside documents",
},
"v1.2.0": {
"template": """You are a document search assistant.
Answer questions based solely on the following documents.
Constraints:
- Do not answer information not in the documents
- Do not speculate or fill gaps
- If unknown, say "This information is not available in the documents"
# Reference Documents
{context}
# Question
{question}
# Answer""",
"description": "Security hardening: explicit role + constraints in system prompt style",
},
}
}
def get_prompt(name: str, version: str = "latest") -> str:
"""Retrieve a prompt template."""
if name not in PROMPTS:
raise ValueError(f"Prompt '{name}' not found")
versions = PROMPTS[name]
if version == "latest":
version = sorted(versions.keys())[-1]
if version not in versions:
raise ValueError(f"Version '{version}' not found")
return versions[version]["template"]
def list_versions(name: str) -> list[dict]:
"""Return version list for a prompt."""
if name not in PROMPTS:
raise ValueError(f"Prompt '{name}' not found")
result = []
for version, info in PROMPTS[name].items():
template_hash = hashlib.md5(info["template"].encode()).hexdigest()[:8]
result.append({
"version": version,
"description": info["description"],
"hash": template_hash,
})
return result
def compare_versions(name: str, v1: str, v2: str) -> dict:
"""Compare the diff between two versions."""
t1 = get_prompt(name, v1)
t2 = get_prompt(name, v2)
lines1 = set(t1.split("\n"))
lines2 = set(t2.split("\n"))
added = lines2 - lines1
removed = lines1 - lines2
return {
"added_lines": len(added),
"removed_lines": len(removed),
"char_diff": len(t2) - len(t1),
"sample_added": list(added)[:3],
}
if __name__ == "__main__":
print("=== Prompt Version List ===\n")
for version_info in list_versions("rag_answer"):
print(f"{version_info['version']} [{version_info['hash']}] - {version_info['description']}")
print("\n=== Diff: v1.0.0 → v1.2.0 ===")
diff = compare_versions("rag_answer", "v1.0.0", "v1.2.0")
print(f" Lines added: {diff['added_lines']}")
print(f" Lines removed: {diff['removed_lines']}")
print(f" Char delta: {diff['char_diff']:+d}")
print("\n=== Latest (v1.2.0) Prompt ===")
print(get_prompt("rag_answer", "latest"))
mkdir llmops
python llmops/prompt_registry.py
2. CI Evaluation Script — llmops/ci_eval.py
This script runs automatically on every GitHub push. It fails CI if quality drops below the threshold.
# llmops/ci_eval.py
"""
CI/CD evaluation script
Runs on every push. Returns exit code 1 if quality
falls below thresholds, causing CI to fail.
"""
import sys
import os
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
import psycopg2
from google import genai
from google.genai import types
from dotenv import load_dotenv
import time
import json
from llmops.prompt_registry import get_prompt
from evals.dataset import EVAL_DATASET
load_dotenv()
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
conn = psycopg2.connect(
host=os.getenv("DB_HOST"),
port=os.getenv("DB_PORT"),
dbname=os.getenv("DB_NAME"),
user=os.getenv("DB_USER"),
password=os.getenv("DB_PASSWORD"),
)
cur = conn.cursor()
# ── Quality thresholds (CI fails if below these) ─────────────
QUALITY_THRESHOLDS = {
"context_recall": 0.80,
"answer_relevancy": 0.70,
"overall": 0.75,
}
PROMPT_VERSION = os.getenv("PROMPT_VERSION", "latest")
def get_embedding(text: str) -> list[float]:
result = client.models.embed_content(
model="gemini-embedding-001",
contents=text,
config=types.EmbedContentConfig(
task_type="RETRIEVAL_QUERY",
output_dimensionality=768,
),
)
return result.embeddings[0].values
def search(query: str, top_k: int = 3) -> list[dict]:
query_embedding = get_embedding(query)
cur.execute("""
SELECT title, body,
1 - (embedding <=> %s::vector) AS similarity
FROM documents
ORDER BY embedding <=> %s::vector
LIMIT %s;
""", (query_embedding, query_embedding, top_k))
rows = cur.fetchall()
return [
{"title": r[0], "body": r[1], "similarity": round(r[2], 4)}
for r in rows
]
def rag_answer_with_prompt(question: str, prompt_version: str) -> tuple[str, list[dict]]:
docs = search(question, top_k=3)
context = "\n\n".join([f"[{d['title']}]\n{d['body']}" for d in docs])
prompt_template = get_prompt("rag_answer", prompt_version)
prompt = prompt_template.format(context=context, question=question)
for attempt in range(3):
try:
response = client.models.generate_content(
model="gemini-2.5-flash",
contents=prompt,
)
return response.text, docs
except Exception as e:
if ("503" in str(e) or "429" in str(e)) and attempt < 2:
time.sleep((attempt + 1) * 15)
else:
raise
def eval_context_recall(retrieved_docs, expected_docs):
retrieved_titles = [d["title"] for d in retrieved_docs]
hit = sum(1 for expected in expected_docs if expected in retrieved_titles)
return hit / len(expected_docs) if expected_docs else 0.0
def eval_answer_relevancy(answer, keywords):
hit = sum(1 for kw in keywords if kw.lower() in answer.lower())
return hit / len(keywords) if keywords else 0.0
def run_ci_eval(prompt_version: str = "latest") -> dict:
print(f"CI evaluation started: prompt version={prompt_version}")
print("=" * 60)
results = []
for item in EVAL_DATASET:
print(f"\n[{item['id']}] {item['question']}")
try:
answer, retrieved_docs = rag_answer_with_prompt(item["question"], prompt_version)
time.sleep(3)
context_recall = eval_context_recall(retrieved_docs, item["expected_docs"])
answer_relevancy = eval_answer_relevancy(answer, item["expected_answer_keywords"])
overall = (context_recall + answer_relevancy) / 2
results.append({
"id": item["id"],
"context_recall": context_recall,
"answer_relevancy": answer_relevancy,
"overall": overall,
})
status = "✓" if overall >= QUALITY_THRESHOLDS["overall"] else "✗"
print(f"{status} Context Recall: {context_recall:.2f} | Relevancy: {answer_relevancy:.2f} | Overall: {overall:.2f}")
except Exception as e:
print(f" ERROR: {e}")
results.append({
"id": item["id"],
"context_recall": 0.0,
"answer_relevancy": 0.0,
"overall": 0.0,
"error": str(e),
})
avg_recall = sum(r["context_recall"] for r in results) / len(results)
avg_relevancy = sum(r["answer_relevancy"] for r in results) / len(results)
avg_overall = sum(r["overall"] for r in results) / len(results)
report = {
"prompt_version": prompt_version,
"timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
"metrics": {
"context_recall": round(avg_recall, 3),
"answer_relevancy": round(avg_relevancy, 3),
"overall": round(avg_overall, 3),
},
"thresholds": QUALITY_THRESHOLDS,
"passed": (
avg_recall >= QUALITY_THRESHOLDS["context_recall"] and
avg_relevancy >= QUALITY_THRESHOLDS["answer_relevancy"] and
avg_overall >= QUALITY_THRESHOLDS["overall"]
),
"results": results,
}
return report
if __name__ == "__main__":
report = run_ci_eval(PROMPT_VERSION)
print("\n" + "=" * 60)
print("CI Evaluation Report")
print("=" * 60)
print(f"Prompt version: {report['prompt_version']}")
print(f"Context Recall: {report['metrics']['context_recall']:.3f} (threshold: {QUALITY_THRESHOLDS['context_recall']})")
print(f"Answer Relevancy: {report['metrics']['answer_relevancy']:.3f} (threshold: {QUALITY_THRESHOLDS['answer_relevancy']})")
print(f"Overall: {report['metrics']['overall']:.3f} (threshold: {QUALITY_THRESHOLDS['overall']})")
with open("llmops/eval_report.json", "w") as f:
json.dump(report, f, ensure_ascii=False, indent=2)
print("\nReport saved to llmops/eval_report.json")
if report["passed"]:
print("\n✅ CI Evaluation: PASSED — Ready to deploy")
sys.exit(0)
else:
print("\n❌ CI Evaluation: FAILED — Quality thresholds not met")
sys.exit(1)
3. GitHub Actions CI/CD Pipeline — .github/workflows/llmops.yml
A pipeline that automatically runs evaluations on every push to GitHub.
# .github/workflows/llmops.yml
name: LLMOps CI/CD Pipeline
on:
push:
branches: [main]
paths:
- "*.py"
- "llmops/**"
- "evals/**"
- "security/**"
pull_request:
branches: [main]
jobs:
# ── Job 1: Security validation ───────────────────────────────
security-test:
name: Security Validation
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
- name: Install dependencies
run: pip install -r requirements.txt
- name: Run input validator tests
run: python security/input_validator.py
env:
GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
# ── Job 2: RAG quality gate (Evals) ─────────────────────────
eval-gate:
name: RAG Quality Gate
runs-on: ubuntu-latest
needs: security-test
services:
postgres:
image: pgvector/pgvector:pg16
env:
POSTGRES_PASSWORD: password
POSTGRES_DB: vectordb
ports:
- 5432:5432
options: >-
--health-cmd pg_isready
--health-interval 10s
--health-timeout 5s
--health-retries 5
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
- name: Install dependencies
run: pip install -r requirements.txt
- name: Setup DB and seed data
run: |
python 01_setup_db.py
python 02_create_index.py
python 03_ingest.py
env:
GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
DB_HOST: localhost
DB_PORT: 5432
DB_NAME: vectordb
DB_USER: postgres
DB_PASSWORD: password
- name: Run CI Eval (Quality Gate)
run: python llmops/ci_eval.py
env:
GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
DB_HOST: localhost
DB_PORT: 5432
DB_NAME: vectordb
DB_USER: postgres
DB_PASSWORD: password
PROMPT_VERSION: latest
- name: Upload eval report
uses: actions/upload-artifact@v4
if: always()
with:
name: eval-report
path: llmops/eval_report.json
# ── Job 3: Deploy (only if quality gate passes) ──────────────
deploy:
name: Deploy to Render
runs-on: ubuntu-latest
needs: eval-gate
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
steps:
- name: Deploy to Render
run: curl -X POST "${{ secrets.RENDER_DEPLOY_HOOK_URL }}"
4. The LLMOps Pipeline in Full
Developer pushes code
↓
GitHub Actions triggered
↓
[Job 1] Security test
→ Test prompt injection detection with input_validator.py
↓ Pass
[Job 2] RAG quality evaluation (Evals gate)
→ Measure Context Recall and Answer Relevancy with ci_eval.py
→ Verify quality meets threshold (Overall ≥ 75%)
→ Save eval report as artifact
↓ Pass
[Job 3] Production deployment
→ Auto-deploy to Render
→ Langfuse tracing begins
5. Comparing Prompt Versions
# Evaluate with v1.1.0
PROMPT_VERSION=v1.1.0 python llmops/ci_eval.py
# Evaluate with v1.2.0
PROMPT_VERSION=v1.2.0 python llmops/ci_eval.py
# Compare scores and set the better version as "latest"
Common Errors
| Error | Cause | Fix |
|---|---|---|
ModuleNotFoundError: llmops |
Path not configured | Check sys.path.append(...)
|
| CI eval FAILED | Score below threshold | Improve prompt or adjust thresholds |
| GitHub Actions timeout | Gemini rate limit | Increase time.sleep()
|
| Render Deploy Hook not triggering | Secret not configured | Check GitHub Secrets |
Next Steps
- [Chapter 6: Fine-tuning] — Specialize a model for your domain using LoRA
- Multi-agent — Design systems where multiple Agents collaborate
- Governance — EU AI Act compliance, risk management, audit logs