If you've ever tried using ChatGPT to answer questions about your company's docs or codebase, you know the pain: hallucinations, half-right answers, or just plain nonsense. Retrieval-Augmented Generation (RAG) is supposed to fix this, right? But what happens when you swap out the fancy OpenAI API for open-source models? That’s where things got interesting for me—and not always in a good way.
Why Go Open-Source for RAG?
I wanted to build a RAG pipeline that didn’t rely on proprietary APIs or cloud costs. Maybe you’ve hit rate limits, or your data can’t leave your network. Open-source LLMs promise freedom, but they come with their own quirks.
The thing is, RAG isn’t a magic bullet. The idea is simple: retrieve relevant context, shove it into an LLM, and hope for better answers. But when you switch to open-source models, you start seeing the seams.
Building the Pipeline: What Actually Works
I’ll walk through the stack I landed on, with code you can actually run. For context, I used sentence-transformers for embeddings and llama.cpp (via llama-cpp-python) for the LLM. I chose these because they’re popular, actively maintained, and don’t require a GPU (though you’ll want one if your docs are big).
Step 1: Chunking and Embedding Documents
You have to chop your docs up before embedding. If you feed giant blobs, retrieval gets fuzzy and slow. Here’s a basic way to chunk and embed using sentence-transformers:
```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a pre-trained embedding model (all-MiniLM-L6-v2 is small and fast)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Example document (could be your README, docs, etc.)
doc = """
RAG combines retrieval and text generation.
It retrieves relevant info from your docs, then generates answers.
Open-source LLMs can be used instead of OpenAI.
"""

# Simple chunking: one sentence per line, so split on newlines
chunks = doc.strip().split('\n')

# Embed each chunk
embeddings = model.encode(chunks)  # shape: (num_chunks, embedding_dim)

# Store embeddings for later retrieval
chunk_db = dict(zip(chunks, embeddings))
```
Key lines explained:

- `SentenceTransformer('all-MiniLM-L6-v2')` loads a small, fast embedding model.
- `chunks` is just splitting by line/sentence. For real docs, you'll want smarter chunking (paragraphs, sliding windows).
- `embeddings` gives you a vector for each chunk—these are what you'll use to find relevant context.
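Since sliding windows come up again later, here's a minimal, dependency-free sketch of overlap chunking. The `chunk_text` helper and its naive period-based sentence split are my own illustration (not part of the original pipeline); real docs usually need a proper sentence splitter.

```python
def chunk_text(text, window=3, overlap=1):
    """Split text into overlapping windows of sentences.

    Assumes window > overlap. The '. ' split is a crude stand-in
    for real sentence segmentation.
    """
    sentences = [s.strip() for s in text.replace("\n", " ").split(". ") if s.strip()]
    step = window - overlap  # how far the window slides each time
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(". ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break  # last window already covers the tail
    return chunks
```

The overlap means a sentence near a chunk boundary appears in two chunks, so a query matching it can retrieve either side's context.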
Step 2: Retrieval — Matching Queries to Chunks
Now, when a user asks a question, you want to find the most relevant chunk(s). Cosine similarity is your friend.
```python
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_context(query, chunk_db, model, top_k=2):
    # Embed the query
    query_vec = model.encode([query])
    # Calculate similarities to each chunk
    chunk_texts = list(chunk_db.keys())
    chunk_vecs = np.array(list(chunk_db.values()))
    sims = cosine_similarity(query_vec, chunk_vecs)[0]
    # Get the top_k most similar chunks, best first
    top_indices = np.argsort(sims)[-top_k:][::-1]
    return [chunk_texts[i] for i in top_indices]

# Example usage
user_query = "How does RAG use open-source LLMs?"
context_chunks = retrieve_context(user_query, chunk_db, model)
print("Retrieved context:", context_chunks)
```
Key lines explained:

- `cosine_similarity` finds which chunks are closest to your query.
- `top_k` controls how many of the most relevant pieces you get back. If your docs are big, tune this.
- The retrieval step is surprisingly fast even on a laptop.
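For intuition: cosine similarity is just a dot product between length-normalized vectors. A dependency-free sketch of what sklearn computes (its version does the same thing, vectorized over whole matrices):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Identical directions score 1.0, orthogonal vectors score 0.0, so higher means more relevant.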
Step 3: Generation — Using llama.cpp as the LLM
Here’s where things get real. Open-source LLMs are slower and have stricter input limits than OpenAI’s APIs. You often have to trim context or pick smaller models.
I used llama-cpp-python to run a local Llama 2 model. Here’s a basic generation example:
```python
from llama_cpp import Llama

# Initialize the model (point model_path at your downloaded weights,
# e.g. 'llama-2-7b.Q4_0.gguf')
llm = Llama(model_path="llama-2-7b.Q4_0.gguf", n_ctx=2048)

def rag_answer(query, context_chunks, llm):
    # Construct prompt (simple, but effective)
    context = "\n".join(context_chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    # Generate, capped at 200 new tokens, stopping at the first newline
    output = llm(prompt, max_tokens=200, stop=["\n"])
    return output['choices'][0]['text'].strip()

# Example usage
answer = rag_answer(user_query, context_chunks, llm)
print("LLM Answer:", answer)
```
Key lines explained:

- `Llama(model_path, n_ctx=2048)` loads the model with a 2048-token context window. If you go bigger, you need more RAM.
- The prompt is simple: paste the retrieved context, add the question, ask for an answer.
- `llm(prompt, max_tokens=200, stop=["\n"])` generates text, capped at 200 tokens and cut off at the first newline.
Heads up: Running Llama 2 locally is slower than the OpenAI API, especially on CPU. Small models (like 7B) are faster, but less capable. Don’t expect miracles.
What Surprised Me
I expected RAG to be plug-and-play. It’s not. Here are a few things that caught me off guard:
- Context window limits are tighter. I loaded Llama 2 (7B) with a 2048-token window (`n_ctx=2048`). You have to be careful not to overload it, or your prompt gets truncated.
- Prompt formatting matters more. OpenAI’s models are forgiving, but open-source LLMs really care how you phrase things. A small tweak can make or break your answers.
- Retrieval quality makes or breaks it. If your chunks are too big or too small, retrieval gets noisy. I spent a weekend fiddling with chunk sizes and overlap.
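On prompt formatting: a stricter template than the bare `Context:/Question:/Answer:` one often helps smaller models stay grounded. This `build_prompt` helper is my own sketch, and the exact wording is model-specific, so treat it as a starting point to test, not a recipe:

```python
def build_prompt(context, question):
    """Stricter prompt template: pins the model to the retrieved context."""
    return (
        "You are a helpful assistant. Answer using ONLY the context below.\n"
        "If the context does not contain the answer, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Explicitly allowing "I don't know" tends to cut down on confident nonsense when retrieval misses.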
Common Mistakes
Here are a few pitfalls I’ve seen (and fallen into myself):
- Ignoring token limits. You paste in a ton of context, but the model quietly ignores half of it. Always check your model’s max context length.
- Bad chunking strategy. If you just split by lines or random sizes, retrieval gets messy. Use semantic chunking—paragraphs, or even sentence windows with overlap.
- Unclear prompts. Open-source LLMs aren’t as robust as GPT-4. If your prompt doesn’t clearly separate context from question, or you don’t specify what kind of answer you want, you’ll get garbage.
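The token-limit pitfall above is easy to guard against mechanically: keep retrieved chunks until a token estimate hits your budget, reserving room for the question and the answer. The sketch below uses a crude tokens-per-word heuristic of my own; for accuracy, count with your model's real tokenizer (llama-cpp-python exposes one via `llm.tokenize`).

```python
def fit_context(chunks, max_tokens=2048, reserve=300, tokens_per_word=1.3):
    """Greedily keep chunks (most relevant first) within a rough token budget.

    tokens_per_word is a coarse heuristic; real token counts vary by model.
    reserve leaves headroom for the question and the generated answer.
    """
    budget = max_tokens - reserve
    kept, used = [], 0
    for chunk in chunks:
        cost = int(len(chunk.split()) * tokens_per_word) + 1  # +1 for the joining newline
        if used + cost > budget:
            break  # chunks are ordered by relevance, so stop here
        kept.append(chunk)
        used += cost
    return kept
```

Because `retrieve_context` returns chunks best-first, truncating from the tail drops the least relevant context first.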
Key Takeaways
- Open-source RAG pipelines are doable, but you need to tune chunking, retrieval, and prompts more carefully than with OpenAI.
- Context window size is a hard limit—don’t ignore it, or your answers will suffer.
- Retrieval quality directly impacts generation quality. Invest time in good chunking and embedding strategies.
- Running LLMs locally is slower and less “magic”—you trade API convenience for control and privacy.
- Prompt engineering is not optional: test and iterate to get reliable answers.
Building a RAG pipeline with open-source tools taught me a lot about the nitty-gritty details you never see in demos. If you’re willing to tinker, you’ll get a system that’s yours—and honestly, that’s pretty satisfying.