Introduction
In Chapter 5 (MLOps), we built a CI/CD pipeline. This chapter explores an alternative to the retrieval-based (RAG) setup used so far: fine-tuning, which trains the model itself on your own data.
[RAG]
Question → search DB → pass results to LLM → answer
→ Requires documents, search costs apply
[Fine-tuning]
Question → Fine-tuned LLM → answer
→ Model carries the knowledge itself — no retrieval needed
When to Use RAG vs Fine-tuning
| | RAG | Fine-tuning |
|---|---|---|
| Best for | Latest info, internal document search | Specific styles, formats, specialized vocabulary |
| Knowledge updates | Just add documents | Requires retraining |
| Cost | API cost + DB cost | Training cost (one-time) |
| Hallucination | Less (grounded) | Somewhat more |
| Practical examples | Internal FAQ, policy search | Customer support tone, code generation |
On Gemini API Fine-tuning:
Since May 2025, fine-tuning is no longer available on the Gemini API free tier.
Vertex AI (paid) still supports it.
This tutorial uses Hugging Face + LoRA (completely free).
Types of Fine-tuning
Full Fine-tuning vs LoRA
Full Fine-tuning:
Updates all parameters (billions) of the model
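→ High accuracy, but expensive: in practice it requires one or more high-memory GPUs
→ A 7B model needs about 14 GB of VRAM just to hold its weights in fp16; gradients and optimizer state add several times that during training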
LoRA (Low-Rank Adaptation):
Trains only 0.1–1% of all parameters
→ Works on CPU, low cost
→ Accuracy close to full fine-tuning
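To get a feel for why full fine-tuning costs so much more than LoRA, here is a rough back-of-the-envelope sketch. The byte counts (fp32 weights, fp32 gradients, two fp32 Adam moment buffers) are assumptions about a typical setup, activations are ignored, and `estimate_training_gb` is a hypothetical helper, not part of any library.

```python
def estimate_training_gb(total_params: float, trainable_params: float,
                         weight_bytes: int = 4) -> float:
    """Rough lower bound on training memory in GB (activations ignored).

    Assumes every parameter is stored once at `weight_bytes`, and each
    trainable parameter additionally needs an fp32 gradient (4 bytes)
    plus two fp32 Adam moment buffers (8 bytes).
    """
    weights = total_params * weight_bytes
    grads_and_optimizer = trainable_params * (4 + 8)
    return (weights + grads_and_optimizer) / 1e9


if __name__ == "__main__":
    seven_b = 7e9
    full = estimate_training_gb(seven_b, trainable_params=seven_b)
    lora = estimate_training_gb(seven_b, trainable_params=0.001 * seven_b)
    print(f"Full fine-tuning, 7B: ~{full:.0f} GB")
    print(f"LoRA (0.1% trainable), 7B: ~{lora:.0f} GB")
```

Under these assumptions the frozen weights dominate LoRA's footprint, while full fine-tuning pays the gradient and optimizer cost for every parameter.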
How LoRA works:
Original model weight matrix W (frozen, unchanged)
↓
Add low-rank matrices B × A, where r is the rank (only these are trained)
↓
At inference: compute W + (α / r) × (B × A)
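The steps above can be sketched with toy matrices in plain Python. This is a minimal illustration, not how PEFT implements it. The low-rank product is written here as B × A (B of shape d × r, A of shape r × d). The scaling α / r and the zero initialization of B follow the usual LoRA formulation, and the sizes are made up for readability.

```python
import random


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_effective_weight(w, b, a, alpha, r):
    """Return W + (alpha / r) * (B @ A) without modifying W."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


random.seed(0)
d, r, alpha = 6, 2, 8  # toy sizes; real layers have d in the thousands

w = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen
a = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # r x d, trained
b = [[0.0] * r for _ in range(d)]                                  # d x r, starts at zero

# Because B starts at zero, the adapted weight equals W before any training.
assert lora_effective_weight(w, b, a, alpha, r) == w

full_params = d * d
lora_params = r * d + d * r
print(f"full: {full_params} params, LoRA: {lora_params} params")
```

The saving grows with layer size: the full matrix scales with d², the two adapters only with 2 × r × d.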
Directory Structure
pgvector-tutorial/
├── existing files
└── finetuning/
    ├── prepare_dataset.py   # ★ Dataset preparation
    ├── train_lora.py        # ★ LoRA fine-tuning
    ├── evaluate.py          # ★ Comparison with base model
    └── inference.py         # ★ Inference with trained model
1. Install Libraries
pip install transformers datasets peft accelerate torch
pip freeze > requirements.txt
Note:
torch is a large library (several GB), so installation takes time.
For CPU-only: pip install torch --index-url https://download.pytorch.org/whl/cpu
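Before moving on, it can help to confirm that everything installed. The following is a small sketch using only the standard library; `find_missing` is a hypothetical helper, and the package list simply mirrors the pip command above.

```python
import importlib.util

REQUIRED = ["transformers", "datasets", "peft", "accelerate", "torch"]


def find_missing(packages: list[str]) -> list[str]:
    """Return the packages that the import system cannot locate."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED)
    if missing:
        print("Missing:", ", ".join(missing))
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All required libraries are installed.")
```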
2. Dataset Preparation — finetuning/prepare_dataset.py
We'll use hand-written question and answer pairs, based on the pgvector documents created earlier, as fine-tuning data.
# finetuning/prepare_dataset.py
"""
Fine-tuning dataset preparation
Create question → answer pairs from pgvector documents.
"""
import json
import os
# ── Manually created Q&A pairs ────────────────────────────────
# Data quality is paramount for fine-tuning.
# In production, create 100–1000 high-quality pairs by hand.
TRAINING_DATA = [
{
"instruction": "What is the F1 score? Explain briefly.",
"input": "",
"output": "The F1 score is the harmonic mean of Precision (positive predictive value) "
"and Recall (sensitivity). "
"F1 = 2 × Precision × Recall ÷ (Precision + Recall). "
"It's especially useful when class imbalance is present.",
},
{
"instruction": "Show me code for cross-validation with scikit-learn.",
"input": "",
"output": "from sklearn.model_selection import cross_val_score\n"
"from sklearn.ensemble import RandomForestClassifier\n\n"
"model = RandomForestClassifier()\n"
"scores = cross_val_score(model, X, y, cv=5, scoring='f1')\n"
"print(f'F1 Score: {scores.mean():.3f} ± {scores.std():.3f}')",
},
{
"instruction": "What are the main methods for handling missing values in Pandas?",
"input": "",
"output": "There are three main approaches:\n"
"1. Drop: df.dropna() removes rows with missing values\n"
"2. Fill: df.fillna(value) replaces with a specified value (mean, median, etc.)\n"
"3. Interpolate: df.interpolate() fills using surrounding values\n"
"Numeric columns typically use mean imputation; categorical columns use mode.",
},
{
"instruction": "Give me three ways to reduce EC2 costs on AWS.",
"input": "",
"output": "1. Use Spot Instances: up to 90% cheaper than On-Demand. "
"Best for interruptible batch workloads.\n"
"2. Reserved Instances: 1–3 year commitments for up to 75% discount.\n"
"3. Delete unused resources: regularly remove idle EBS volumes, EIPs, and snapshots.",
},
{
"instruction": "What is a Kubernetes Pod?",
"input": "",
"output": "A Pod is the smallest deployable unit in Kubernetes.\n"
"It groups one or more containers that run on the same node, "
"sharing storage, networking, and execution specifications.\n"
"Pods are defined using YAML manifest files.",
},
{
"instruction": "What is the difference between Precision and Recall?",
"input": "",
"output": "Precision (positive predictive value): of all predicted positives, "
"the fraction that are actually positive. TP / (TP + FP). "
"Prioritize when false positives are costly.\n"
"Recall (sensitivity): of all actual positives, "
"the fraction correctly detected. TP / (TP + FN). "
"Prioritize when false negatives are costly.",
},
{
"instruction": "How do you select specific columns from a Pandas DataFrame?",
"input": "",
"output": "Several approaches are available:\n"
"Single column: df['column_name'] or df.column_name\n"
"Multiple columns: df[['col1', 'col2']]\n"
"Label-based: df.loc[condition, 'column']\n"
"Integer-based: df.iloc[row_index, col_index]",
},
{
"instruction": "What is the difference between AWS S3 and EBS?",
"input": "",
"output": "S3 (Simple Storage Service): Object storage. "
"Manages files by key. Scalable and inexpensive. Ideal for static files and backups.\n"
"EBS (Elastic Block Store): Block storage. "
"Attached to EC2 instances like an HDD/SSD. "
"Used for databases and OS disks.",
},
]
def save_dataset(data: list[dict], output_file: str):
    """Save dataset in JSONL format with 80/20 train/val split."""
    split_idx = int(len(data) * 0.8)
    train_data = data[:split_idx]
    val_data = data[split_idx:]

    train_file = output_file.replace(".jsonl", "_train.jsonl")
    val_file = output_file.replace(".jsonl", "_val.jsonl")

    with open(train_file, "w", encoding="utf-8") as f:
        for item in train_data:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
    with open(val_file, "w", encoding="utf-8") as f:
        for item in val_data:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

    print(f"Training data: {len(train_data)} samples → {train_file}")
    print(f"Validation data: {len(val_data)} samples → {val_file}")
    return train_file, val_file


def format_prompt(item: dict) -> str:
    """Convert to Alpaca format."""
    if item["input"]:
        return (
            f"### Instruction:\n{item['instruction']}\n\n"
            f"### Input:\n{item['input']}\n\n"
            f"### Response:\n{item['output']}"
        )
    return (
        f"### Instruction:\n{item['instruction']}\n\n"
        f"### Response:\n{item['output']}"
    )


if __name__ == "__main__":
    os.makedirs("finetuning", exist_ok=True)
    print(f"Dataset size: {len(TRAINING_DATA)} samples")
    print("\n--- Sample ---")
    print(format_prompt(TRAINING_DATA[0]))
    train_file, val_file = save_dataset(
        TRAINING_DATA,
        "finetuning/dataset.jsonl"
    )
    print("\nDataset preparation complete")
mkdir finetuning
python finetuning/prepare_dataset.py
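Note that save_dataset slices the list in order, so with the eight samples above the validation set is always the last two entries (Pandas column selection and S3 vs EBS). One possible variation, sketched here as an assumption rather than part of the original script, is to shuffle with a fixed seed before splitting; `shuffled_split` is a hypothetical helper.

```python
import random


def shuffled_split(data: list[dict], val_ratio: float = 0.2, seed: int = 42):
    """Shuffle a copy of `data` reproducibly, then split into (train, val)."""
    items = list(data)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(items)  # fixed seed keeps the split repeatable
    n_val = max(1, round(len(items) * val_ratio))
    return items[n_val:], items[:n_val]


if __name__ == "__main__":
    toy = [{"instruction": f"q{i}", "input": "", "output": f"a{i}"} for i in range(8)]
    train, val = shuffled_split(toy)
    print(len(train), "train /", len(val), "val")
```

With a dataset grouped by topic, a shuffled split makes it more likely that validation covers the same topics as training.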
3. LoRA Fine-tuning — finetuning/train_lora.py
# finetuning/train_lora.py
"""
LoRA Fine-tuning
Model: microsoft/phi-2 (2.7B, runs on CPU)
LoRA config: rank=8, alpha=32
"""
import os
import json
import torch
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
TrainingArguments,
Trainer,
DataCollatorForLanguageModeling,
)
from peft import LoraConfig, get_peft_model, TaskType
from datasets import Dataset
MODEL_NAME = "microsoft/phi-2"
OUTPUT_DIR = "finetuning/lora_output"
MAX_LENGTH = 512
def load_dataset_from_jsonl(file_path: str) -> Dataset:
    data = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            data.append(json.loads(line))
    return Dataset.from_list(data)


def format_prompt(item: dict) -> str:
    if item.get("input"):
        return (
            f"### Instruction:\n{item['instruction']}\n\n"
            f"### Input:\n{item['input']}\n\n"
            f"### Response:\n{item['output']}"
        )
    return (
        f"### Instruction:\n{item['instruction']}\n\n"
        f"### Response:\n{item['output']}"
    )
def main():
    print("=== LoRA Fine-tuning Start ===\n")

    # ── 1. Load model and tokenizer ───────────────────────────
    print(f"Loading model: {MODEL_NAME}")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        dtype=torch.float32,  # float32 for CPU (torch_dtype is deprecated)
        trust_remote_code=True,
    )

    # ── 2. LoRA configuration ─────────────────────────────────
    # r (rank): dimension of low-rank matrices. Larger = more expressive but more memory
    # alpha: scaling factor, typically 2–4× rank
    # target_modules: layers to apply LoRA (usually Attention layers)
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,
        lora_alpha=32,
        lora_dropout=0.1,
        target_modules=["q_proj", "v_proj"],
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    # => trainable params: ~2.6M / total: ~2.7B (~0.09%)

    # ── 3. Prepare datasets ───────────────────────────────────
    print("\nLoading datasets...")
    train_dataset = load_dataset_from_jsonl("finetuning/dataset_train.jsonl")
    val_dataset = load_dataset_from_jsonl("finetuning/dataset_val.jsonl")

    def tokenize_function(examples):
        texts = [format_prompt({
            "instruction": inst,
            "input": inp,
            "output": out,
        }) for inst, inp, out in zip(
            examples["instruction"],
            examples["input"],
            examples["output"],
        )]
        tokenized = tokenizer(
            texts,
            truncation=True,
            max_length=MAX_LENGTH,
            padding="max_length",
        )
        tokenized["labels"] = tokenized["input_ids"].copy()
        return tokenized

    train_tokenized = train_dataset.map(tokenize_function, batched=True)
    val_tokenized = val_dataset.map(tokenize_function, batched=True)

    # ── 4. Training configuration ─────────────────────────────
    training_args = TrainingArguments(
        output_dir=OUTPUT_DIR,
        num_train_epochs=3,
        per_device_train_batch_size=1,  # 1 for CPU
        per_device_eval_batch_size=1,
        gradient_accumulation_steps=4,  # effective batch size = 4
        learning_rate=2e-4,
        warmup_steps=10,
        logging_steps=10,
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        report_to="none",
        use_cpu=True,  # CPU mode (no_cuda is deprecated)
    )

    # ── 5. Run training ───────────────────────────────────────
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_tokenized,
        eval_dataset=val_tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    print("\nTraining started...")
    print("(On CPU, this may take tens of minutes to hours)")
    trainer.train()

    # ── 6. Save model ─────────────────────────────────────────
    model.save_pretrained(OUTPUT_DIR)
    tokenizer.save_pretrained(OUTPUT_DIR)
    print(f"\nModel saved: {OUTPUT_DIR}")


if __name__ == "__main__":
    main()
python finetuning/train_lora.py
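It is worth checking how many optimizer updates a run this small actually performs. The sketch below approximates the count from the settings in train_lora.py. `optimizer_steps` is a hypothetical helper, and the exact rounding the Trainer applies may differ between transformers versions.

```python
import math


def optimizer_steps(n_samples: int, batch_size: int, grad_accum: int, epochs: int) -> int:
    """Approximate number of optimizer updates for a single-device run."""
    batches_per_epoch = math.ceil(n_samples / batch_size)
    updates_per_epoch = max(1, math.ceil(batches_per_epoch / grad_accum))
    return updates_per_epoch * epochs


if __name__ == "__main__":
    # Values from train_lora.py: 6 training samples (80% of 8),
    # batch size 1, gradient accumulation 4, 3 epochs.
    total = optimizer_steps(n_samples=6, batch_size=1, grad_accum=4, epochs=3)
    print(f"~{total} optimizer steps in total")
    print("warmup_steps=10 exceeds this" if total < 10 else "warmup fits")
```

With only a handful of updates, warmup_steps=10 and logging_steps=10 are both longer than the whole run, so the learning rate likely never reaches 2e-4 and no intermediate training loss is logged. These settings become meaningful once the dataset grows to hundreds of samples.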
4. Inference — finetuning/inference.py
Run inference with the fine-tuned model and compare it to the base model.
# finetuning/inference.py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
MODEL_NAME = "microsoft/phi-2"
LORA_DIR = "finetuning/lora_output"
def load_base_model():
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        dtype=torch.float32,
        trust_remote_code=True,
    )
    return tokenizer, model


def load_finetuned_model():
    tokenizer = AutoTokenizer.from_pretrained(LORA_DIR, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    base_model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        dtype=torch.float32,
        trust_remote_code=True,
    )
    model = PeftModel.from_pretrained(base_model, LORA_DIR)
    return tokenizer, model


def generate(tokenizer, model, instruction: str, max_new_tokens: int = 200) -> str:
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    generated = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True)


if __name__ == "__main__":
    test_questions = [
        "What is the F1 score?",
        "How do you handle missing values in Pandas?",
    ]
    print("=== Base Model vs Fine-tuned Model Comparison ===\n")
    print("Loading base model...")
    base_tokenizer, base_model = load_base_model()
    print("Loading fine-tuned model...")
    ft_tokenizer, ft_model = load_finetuned_model()

    for question in test_questions:
        print(f"\nQuestion: {question}")
        print("-" * 50)
        base_answer = generate(base_tokenizer, base_model, question)
        print(f"[Base Model]\n{base_answer}")
        ft_answer = generate(ft_tokenizer, ft_model, question)
        print(f"\n[Fine-tuned Model]\n{ft_answer}")
python finetuning/inference.py
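The inference prompt in generate() only works well if it matches the template used during training. One way to keep the two in sync, sketched here as an assumption rather than part of the original scripts, is a single template function shared by both; `build_prompt` is a hypothetical name.

```python
def build_prompt(instruction: str, input_text: str = "", output: str = "") -> str:
    """Alpaca-style template; leave `output` empty to get the inference prompt."""
    if input_text:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            f"### Response:\n{output}"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n{output}"


if __name__ == "__main__":
    question = "What is the F1 score?"
    train_text = build_prompt(question, output="The harmonic mean of Precision and Recall.")
    infer_text = build_prompt(question)
    # The inference prompt should be an exact prefix of the training text.
    print(train_text.startswith(infer_text))
```

If the check prints False after a template change, the model is being prompted in a format it never saw during training.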
5. Data Quality Guidelines
Data quality is the most important factor in fine-tuning.
Characteristics of Good Data
✅ Consistency: Same question → same style of answer
✅ Accuracy: No incorrect information
✅ Diversity: Cover a wide variety of question patterns
✅ Conciseness: No unnecessary information
✅ Quantity: Minimum 100 samples, ideally 1000+
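Some of these checks can be automated. The sketch below covers only the mechanical ones (sample count, missing fields, empty answers, duplicate questions). Accuracy and consistency of style still need a human reviewer. `check_dataset` and its default threshold are illustrative assumptions.

```python
from collections import Counter

REQUIRED_KEYS = ("instruction", "input", "output")


def check_dataset(data: list[dict], min_samples: int = 100) -> list[str]:
    """Return a list of human-readable problems found in `data`."""
    problems = []
    if len(data) < min_samples:
        problems.append(f"only {len(data)} samples (want at least {min_samples})")
    for i, item in enumerate(data):
        missing = [k for k in REQUIRED_KEYS if k not in item]
        if missing:
            problems.append(f"sample {i}: missing keys {missing}")
        elif not item["instruction"].strip() or not item["output"].strip():
            problems.append(f"sample {i}: empty instruction or output")
    # Case-insensitive duplicate detection on the instruction text.
    counts = Counter(item.get("instruction", "").strip().lower() for item in data)
    for text, n in counts.items():
        if text and n > 1:
            problems.append(f"duplicate instruction ({n}x): {text!r}")
    return problems


if __name__ == "__main__":
    sample = [
        {"instruction": "What is a Pod?", "input": "", "output": "The smallest deployable unit."},
        {"instruction": "what is a pod?", "input": "", "output": ""},
    ]
    for problem in check_dataset(sample):
        print("-", problem)
```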
Data Size vs Expected Improvement
| Dataset Size | Expected Improvement |
|---|---|
| 10–50 samples | Minimal change |
| 100–500 samples | Clear improvement in style and format |
| 1,000+ samples | Domain knowledge and terminology retention |
| 10,000+ samples | Expert-level domain specialization |
6. Reading the Training Results
What the output tells you:
| Metric | Value | Meaning |
|---|---|---|
| trainable% | 0.09% | Only 2.6M of 2.7B parameters trained (LoRA effect) |
| eval_loss | 1.802 → 1.785 | Improving each epoch → learning correctly |
| train_runtime | 109s | Completed in under 2 minutes on CPU |
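The trainable% figure can be reproduced by hand. The sketch below assumes phi-2's shape (hidden size 2560, 32 layers) and an approximate base parameter count of 2.78B; these values are not stated in this chapter, so treat them as assumptions. `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(hidden_size: int, num_layers: int, rank: int,
                     modules_per_layer: int) -> int:
    """Parameters added by LoRA on square (hidden x hidden) projections."""
    per_module = rank * (hidden_size + hidden_size)  # A: r x in, B: out x r
    return per_module * modules_per_layer * num_layers


if __name__ == "__main__":
    # Assumed phi-2 shape: hidden size 2560, 32 layers; LoRA on q_proj and v_proj.
    trainable = lora_param_count(hidden_size=2560, num_layers=32, rank=8,
                                 modules_per_layer=2)
    total = 2.78e9  # approximate base parameter count (assumption)
    print(f"trainable params: {trainable:,}")
    print(f"trainable%: {100 * trainable / (total + trainable):.2f}")
```

Doubling the rank or adding more target modules scales the trainable count linearly, which is a quick way to sanity-check what print_trainable_parameters() reports.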
7. Add to .gitignore
Training outputs should not be version-controlled: full model weights run to several GB, and even small LoRA adapters add up once checkpoints accumulate.
cat >> .gitignore << 'EOF'
# Fine-tuning outputs
finetuning/lora_output/
finetuning/*.jsonl
EOF
For production use, upload to Hugging Face Hub:
model.push_to_hub("your-username/my-lora-model")
Common Errors
| Error | Cause | Fix |
|---|---|---|
| unexpected keyword argument 'no_cuda' | Old parameter name | Use use_cpu=True |
| torch_dtype is deprecated | Old parameter name | Use dtype=torch.float32 |
| CUDA out of memory | Insufficient GPU memory | Reduce per_device_train_batch_size to 1 |
| ModuleNotFoundError: peft | Not installed | pip install peft |
| Training too slow | Running on CPU | Use GPU or reduce dataset |
| Answer quality unchanged | Too little data | Use 100+ samples, or switch to a base model that matches the language of your data |
Next Steps
- [Chapter 7: Multi-Agent] — Design systems where Orchestrators and Workers collaborate
- RAG + Fine-tuning hybrid — Fine-tune for style, use RAG for knowledge
- Vertex AI Fine-tuning (paid) — For Gemini 2.5 Flash fine-tuning on your own data