Build Your Own Second Brain: RAG-Powered Knowledge Tools That Never Leave Your Machine


Tags: #ai #python #rag #productivity

Every day, we generate an enormous volume of personal knowledge — research papers we read, journal entries we write, PDFs we annotate, news articles we bookmark. Most of this knowledge ends up scattered across folders, apps, and cloud services, never to be retrieved when we actually need it.

What if you could build AI-powered tools that understand your knowledge, answer questions about your documents, and run entirely on your local machine — no API keys, no cloud costs, no data leaving your laptop?

That is exactly what I have been building. Over the past several months, I have created a suite of open-source RAG (Retrieval-Augmented Generation) tools for personal knowledge management, all powered by local LLMs through Ollama. In this post, I will walk you through the RAG architecture patterns behind these tools, share working code, and explain why local-first AI is the future of personal productivity.

Why RAG? Why Local?

Large Language Models are powerful, but they have a fundamental limitation: they only know what they were trained on. Ask a vanilla LLM about your research notes from last Tuesday, and it will hallucinate an answer.

RAG solves this by giving the LLM access to your actual documents at inference time. The pattern is straightforward:

  1. Chunk your documents into manageable pieces
  2. Embed each chunk into a vector representation
  3. Store the vectors in a local database
  4. Retrieve the most relevant chunks for a given query
  5. Generate an answer grounded in the retrieved context

The "local" part matters enormously for personal knowledge. Your journal entries, medical records, research notes, and private documents should never need to leave your machine. Running everything locally with Ollama and Gemma 3 means zero data exposure, zero API costs, and full control.

The Core RAG Pipeline

Here is the foundational pipeline I use across all five projects. Every tool builds on this same pattern:

import ollama
import chromadb
from pathlib import Path

# Initialize local vector store
client = chromadb.PersistentClient(path="./knowledge_db")
collection = client.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
)

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks for better retrieval."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start += chunk_size - overlap
    return chunks

def embed_and_store(doc_id, text):
    """Embed document chunks and store in ChromaDB."""
    chunks = chunk_text(text)
    for i, chunk in enumerate(chunks):
        response = ollama.embed(model="gemma3", input=chunk)
        collection.add(
            ids=[f"{doc_id}_chunk_{i}"],
            embeddings=[response["embeddings"][0]],
            documents=[chunk],
            metadatas=[{"source": doc_id, "chunk_index": i}]
        )

def retrieve(query, n_results=5):
    """Retrieve the most relevant chunks for a query."""
    query_embedding = ollama.embed(model="gemma3", input=query)
    results = collection.query(
        query_embeddings=[query_embedding["embeddings"][0]],
        n_results=n_results
    )
    return results["documents"][0]

def generate_answer(query, context_chunks):
    """Generate an answer grounded in retrieved context."""
    context = "\n\n---\n\n".join(context_chunks)
    prompt = f"""Based on the following context, answer the question.
If the answer is not in the context, say so.

Context:
{context}

Question: {query}

Answer:"""
    response = ollama.chat(
        model="gemma3",
        messages=[{"role": "user", "content": prompt}]
    )
    return response["message"]["content"]

This is the skeleton. Each project extends it with domain-specific parsing, chunking strategies, and prompt engineering. Let me walk through all five.

Project 1: Personal Knowledge Base

Repo: personal-knowledge-base

This is the central hub — a system that ingests markdown notes, text files, and documents into a searchable, queryable knowledge base. Think of it as a second brain you can actually have a conversation with.

The key architectural decision here is hierarchical chunking. Rather than treating every 500-character block equally, the system preserves document structure:

import re

def hierarchical_chunk(markdown_text, source_file):
    """Chunk markdown while preserving heading hierarchy."""
    sections = re.split(r'(^#{1,3}\s+.+$)', markdown_text, flags=re.MULTILINE)
    chunks = []
    current_heading = "Introduction"

    for section in sections:
        if re.match(r'^#{1,3}\s+', section):
            current_heading = section.strip('# \n')
        else:
            if section.strip():
                for chunk in chunk_text(section.strip(), chunk_size=400):
                    chunks.append({
                        "text": chunk,
                        "heading": current_heading,
                        "source": source_file
                    })
    return chunks

This means when you ask "What did I write about transformer architectures?", the retrieval step returns chunks with their original heading context — dramatically improving answer quality.
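To see the effect concretely, here is a self-contained version you can run as-is. A simplified fixed-size chunker stands in for `chunk_text`, and the sample note is invented purely for illustration:

```python
import re

def simple_chunks(text, chunk_size=400):
    # Simplified fixed-size chunker (no overlap) for demonstration.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def hierarchical_chunk(markdown_text, source_file):
    """Chunk markdown while tracking the nearest heading."""
    sections = re.split(r'(^#{1,3}\s+.+$)', markdown_text, flags=re.MULTILINE)
    chunks = []
    current_heading = "Introduction"
    for section in sections:
        if re.match(r'^#{1,3}\s+', section):
            current_heading = section.strip('# \n')
        elif section.strip():
            for chunk in simple_chunks(section.strip()):
                chunks.append({"text": chunk,
                               "heading": current_heading,
                               "source": source_file})
    return chunks

note = """Intro text before any heading.

## Transformers

Attention is all you need.

## Retrieval

Vectors and neighbors."""

for c in hierarchical_chunk(note, "notes.md"):
    print(c["heading"], "->", c["text"])
```

Each chunk carries the heading it was written under, so the retriever can surface "Transformers" context even when the chunk text itself never repeats the word.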

Project 2: PDF Chat Assistant

Repo: pdf-chat-assistant

PDFs are the lingua franca of professional knowledge — research papers, contracts, reports, whitepapers. This tool lets you drop in a PDF and start asking questions.

The interesting RAG challenge here is table and figure handling. Raw PDF text extraction often mangles tables. The solution is a multi-pass extraction strategy:

import fitz  # PyMuPDF

def extract_pdf_with_structure(pdf_path):
    """Extract text from PDF preserving structural elements."""
    doc = fitz.open(pdf_path)
    pages = []

    for page_num, page in enumerate(doc):
        blocks = page.get_text("dict")["blocks"]
        page_content = []

        for block in blocks:
            if block["type"] == 0:  # Text block
                lines = []
                for line in block["lines"]:
                    text = "".join(span["text"] for span in line["spans"])
                    font_size = line["spans"][0]["size"] if line["spans"] else 12
                    lines.append({"text": text, "font_size": font_size})
                page_content.append({
                    "type": "text",
                    "lines": lines,
                    "bbox": block["bbox"]
                })

        pages.append({
            "page_num": page_num + 1,
            "content": page_content
        })
    return pages

Each chunk retains its page number and structural role (heading, body, caption), so when the LLM generates an answer, it can cite "page 7, section 3" — making the tool genuinely useful for academic research.
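One way to assign those structural roles is a font-size heuristic: treat the most common size on a page as body text, and flag noticeably larger lines as headings. The `classify_lines` helper below is a sketch of that idea (not the repo's actual implementation), operating on the dict structure that `extract_pdf_with_structure` produces:

```python
from collections import Counter

def classify_lines(pages, heading_ratio=1.2):
    """Tag each extracted line as 'heading' or 'body' by font size.

    Heuristic: the most frequent font size on a page is assumed to be
    body text; lines at least `heading_ratio` times larger are headings.
    """
    classified = []
    for page in pages:
        sizes = [round(line["font_size"])
                 for block in page["content"]
                 for line in block["lines"]]
        if not sizes:
            continue
        body_size = Counter(sizes).most_common(1)[0][0]
        for block in page["content"]:
            for line in block["lines"]:
                role = ("heading"
                        if line["font_size"] >= body_size * heading_ratio
                        else "body")
                classified.append({"page": page["page_num"],
                                   "role": role,
                                   "text": line["text"]})
    return classified

# Sample data shaped like extract_pdf_with_structure's output:
sample = [{"page_num": 1, "content": [
    {"type": "text", "lines": [
        {"text": "3. Methods", "font_size": 16.0},
        {"text": "We evaluate retrieval quality", "font_size": 10.0},
        {"text": "across three corpora.", "font_size": 10.0},
    ], "bbox": (0, 0, 0, 0)}]}]

print(classify_lines(sample))
```

The ratio of 1.2 is an illustrative starting point; real papers vary, and captions (often *smaller* than body text) would need a third class.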

Project 3: News Digest Generator

Repo: news-digest-generator

This project flips the RAG pattern slightly. Instead of querying a static corpus, it continuously ingests news feeds and generates personalized digests. The RAG component handles deduplication and topic clustering:

def generate_digest(articles, user_interests):
    """Generate a personalized news digest using RAG."""
    # Embed and store today's articles
    for article in articles:
        embed_and_store(article["id"], article["content"])

    # Retrieve articles matching user interests
    relevant = []
    for interest in user_interests:
        chunks = retrieve(interest, n_results=3)
        relevant.extend(chunks)

    # Generate digest
    prompt = f"""Create a concise news digest from these articles.
Group by topic. Highlight key insights.

Articles:
{chr(10).join(relevant)}

Digest:"""
    response = ollama.chat(
        model="gemma3",
        messages=[{"role": "user", "content": prompt}]
    )
    return response["message"]["content"]

The power here is that your interests drive the retrieval, so the digest is automatically personalized — no algorithm deciding what you should see.
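The deduplication step mentioned above can be done directly on the article embeddings. A common approach (sketched here under the assumption that each article already has a vector, e.g. from `ollama.embed`; the 0.92 threshold is illustrative, not tuned) is to drop any article whose embedding is near-identical to one already kept:

```python
def cosine(a, b):
    """Cosine similarity between two vectors, implemented by hand."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def dedupe_articles(articles, embeddings, threshold=0.92):
    """Keep only articles whose embedding is not a near-duplicate
    of an already-kept article. embeddings[i] belongs to articles[i]."""
    kept, kept_vecs = [], []
    for article, vec in zip(articles, embeddings):
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(article)
            kept_vecs.append(vec)
    return kept

# Two wire-service rewrites of the same story plus one unrelated piece:
kept = dedupe_articles(["story-a", "story-a-rewrite", "story-b"],
                       [[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
print(kept)
```

Because the near-duplicate check is greedy, the first version of a story wins; sorting articles by publication time first keeps the earliest copy.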

Project 4: Diary Journal Organizer

Repo: diary-journal-organizer

This is perhaps the most personal of the five tools. It ingests your journal entries and lets you explore patterns in your own thinking over time. The RAG pattern here emphasizes temporal retrieval — not just semantic similarity, but time-aware search:

from datetime import datetime

def temporal_retrieve(query, date_range=None, n_results=5):
    """Retrieve journal entries with optional date filtering."""
    query_embedding = ollama.embed(model="gemma3", input=query)
    where_filter = None

    if date_range:
        where_filter = {
            "$and": [
                {"date": {"$gte": date_range[0]}},
                {"date": {"$lte": date_range[1]}}
            ]
        }

    results = collection.query(
        query_embeddings=[query_embedding["embeddings"][0]],
        n_results=n_results,
        where=where_filter
    )
    return results

Ask it "What was I stressed about in January?" and it retrieves semantically relevant entries scoped to that month. This is deeply personal data — exactly the kind of information that should never touch a cloud API.
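One practical detail: ChromaDB's `$gte`/`$lte` operators compare numeric metadata, so for the filter above to work, each entry's `date` field needs to be stored as a number. A simple encoding (the helper names here are illustrative, not from the repo) is the integer form `YYYYMMDD`, which sorts the same way the dates do:

```python
def date_key(iso_date):
    """Encode an ISO date string as an int (YYYYMMDD) so numeric
    range filters can compare it chronologically."""
    return int(iso_date.replace("-", ""))

def january_range(year):
    # (start, end) pair in the form temporal_retrieve expects
    return (date_key(f"{year}-01-01"), date_key(f"{year}-01-31"))

print(date_key("2024-01-15"))   # 20240115
print(january_range(2024))      # (20240101, 20240131)
```

You would apply `date_key` once at ingestion time (in the `metadatas` dict passed to `collection.add`) and again when building `date_range` for a query.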

Project 5: Research Paper QA

Repo: research-paper-qa

Built for the workflow of reading dozens of papers for a literature review. Drop a folder of PDFs, and the system builds a cross-paper knowledge base you can query:

def cross_paper_query(question, paper_collection):
    """Query across multiple research papers with source attribution."""

    query_embedding = ollama.embed(model="gemma3", input=question)
    results = paper_collection.query(
        query_embeddings=[query_embedding["embeddings"][0]],
        n_results=8,
        include=["documents", "metadatas"]
    )

    context_with_sources = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        citation = f"[{meta['paper_title']}, p.{meta['page']}]"
        context_with_sources.append(f"{doc}\n— {citation}")

    answer = generate_answer(question, context_with_sources)
    return answer

The killer feature is cross-paper synthesis. Ask "How do different authors define retrieval-augmented generation?" and it pulls relevant definitions from every paper in your collection, with proper citations.
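Synthesis quality improves when the model sees one labeled section per paper instead of an interleaved chunk list. Here is a sketch of that grouping step (the metadata keys match the ones used in `cross_paper_query`; the grouping itself is my suggestion, not necessarily what the repo does):

```python
from collections import defaultdict

def group_context_by_paper(documents, metadatas):
    """Group retrieved chunks under per-paper headers for the prompt.

    documents and metadatas are the parallel lists returned by
    paper_collection.query (first query's results).
    """
    by_paper = defaultdict(list)
    for doc, meta in zip(documents, metadatas):
        by_paper[meta["paper_title"]].append(f"(p.{meta['page']}) {doc}")
    sections = [f"### {title}\n" + "\n".join(chunks)
                for title, chunks in by_paper.items()]
    return "\n\n".join(sections)

docs = ["RAG combines retrieval and generation.",
        "We define RAG as a hybrid memory architecture."]
metas = [{"paper_title": "Lewis et al.", "page": 2},
         {"paper_title": "Survey 2023", "page": 11}]
print(group_context_by_paper(docs, metas))
```

The grouped string can then be passed to `generate_answer` in place of the flat chunk list, nudging the model toward per-author attribution.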

Getting Started

All five projects run on the same local stack. Here is the setup:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Gemma 3
ollama pull gemma3

# Install Python dependencies
pip install ollama chromadb pymupdf

Clone any of the repos from my GitHub and you are up and running. No API keys. No cloud accounts. Just your machine, your data, and an LLM that respects your privacy.

What I Have Learned

Building these tools has reinforced a few convictions:

Chunking strategy matters more than model size. A well-chunked document with a smaller model consistently outperforms sloppy chunking with a larger model. Invest time in understanding your document structure.

Overlap in chunks is not optional. Without overlap, you lose context at chunk boundaries. A 10-15% overlap catches most cross-boundary information.
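You can verify this with the `chunk_text` function from the pipeline above (reproduced here so the snippet runs on its own). A marker phrase placed across the first 500-character boundary survives intact only when overlap is enabled:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Same fixed-size chunker as in the core pipeline."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Place a marker so it straddles the boundary at index 500.
text = "x" * 490 + "IMPORTANT FACT" + "x" * 500

with_overlap = chunk_text(text, chunk_size=500, overlap=50)
without = chunk_text(text, chunk_size=500, overlap=0)

print(any("IMPORTANT FACT" in c for c in with_overlap))  # True
print(any("IMPORTANT FACT" in c for c in without))       # False
```

Without overlap the phrase is split as "IMPORTANT " / "FACT" and neither chunk can answer a question about it; with a 50-character overlap, the chunk starting at index 450 contains it whole.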

Local LLMs are good enough. Gemma 3 running through Ollama handles personal knowledge tasks remarkably well. You do not need GPT-4 to search your own notes.

The best knowledge system is the one you control. Cloud AI services are powerful, but for personal data — journals, health records, private research — local is the only option that makes sense.

All of these projects are open source and actively maintained. I would love for you to try them, break them, and contribute. The future of personal AI is local, private, and open.


Written by Nrk Raju Guthikonda, Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team. Builder of 110+ open-source AI projects. Find more on dev.to/kennedyraju55 and connect on LinkedIn.

Source: dev.to
