Vibe-Memory Part 2: Which Embedding Model Should You Actually Use for AI Semantic Memory? My 3-Week Benchmark Results

Honestly, I thought building semantic memory was just "throw some embeddings in Postgres and call it a day." I was wrong. So wrong.

After building Vibe-Memory, the open-source semantic memory that fixes ChatGPT's amnesia, I spent the last three weeks benchmarking 8 embedding setups (7 models, one of them at two dimension settings) to answer a question that nobody actually answers straight: which one should you use for your AI semantic memory?

I tested everything from OpenAI's text-embedding-3-small to local Ollama models, from 4096 dimensions down to 384, and the results surprised even me. If you're building anything with semantic memory, this one's for you.

Let's recap: What is Vibe-Memory again?

Quick recap for the new folks: Vibe-Memory is my side project that gives your AI conversations persistent semantic memory. Instead of ChatGPT forgetting everything after each chat, Vibe-Memory stores all your conversation snippets, embeds them, and pulls the relevant ones into context automatically.

It fixes this exact problem:

You: Remember when I told you about that cool hiking trail I found last month?
ChatGPT: I'm sorry, I don't have access to previous conversations...

Yeah, we've all been there. Vibe-Memory fixes that. And the core of it all? Embeddings. Get embeddings wrong, and your memory feels dumb. Get them right, and it feels like magic.
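To make "embed them and pull the relevant ones" concrete, here's a rough sketch of the retrieval step: rank stored memories by cosine similarity to the query embedding and keep the top k. The Memory type, the function names, and the toy 3-dimensional vectors are all made up for illustration; this is not Vibe-Memory's actual code, and real vectors would come from an embedding model.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Memory is a hypothetical stored snippet plus its embedding vector.
type Memory struct {
	Content   string
	Embedding []float64
}

// cosine returns the cosine similarity of two vectors, or 0 if the
// lengths differ or either vector is empty or all zeros.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK returns the k memories most similar to the query embedding.
func topK(query []float64, memories []Memory, k int) []Memory {
	ranked := make([]Memory, len(memories))
	copy(ranked, memories)
	sort.SliceStable(ranked, func(i, j int) bool {
		return cosine(query, ranked[i].Embedding) > cosine(query, ranked[j].Embedding)
	})
	if k > len(ranked) {
		k = len(ranked)
	}
	return ranked[:k]
}

func main() {
	// Toy 3-dimensional vectors stand in for real model output.
	memories := []Memory{
		{"Found a great hiking trail near the lake", []float64{0.9, 0.1, 0.0}},
		{"Debugged a Postgres index issue", []float64{0.0, 0.2, 0.9}},
		{"Bought new trail running shoes", []float64{0.7, 0.6, 0.1}},
	}
	query := []float64{1.0, 0.2, 0.0} // pretend embedding of "that hiking trail I liked"
	for _, m := range topK(query, memories, 2) {
		fmt.Println(m.Content)
	}
}
```

A linear scan like this is fine for a few thousand memories; a vector index only starts to matter at much larger sizes.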

The benchmark setup

I tested 8 embedding setups (7 popular models, with text-embedding-3-small at two dimension settings) on real-world conversation data (500+ actual conversation snippets from my own usage), asking:

  1. Relevance@k: When I ask "that hiking trail I liked," does it actually find the hiking trail conversation?
  2. Speed: How long does it take to embed 100 snippets?
  3. Cost: What's the actual cost per 1,000 memories?
  4. Quality: Does it actually understand semantic meaning, or just keyword matching?
  5. Dimensions: Do smaller dimensions actually hurt quality that much?

All tests were done on the same dataset, same hardware, same queries. No hypothetical nonsense — just real conversations, real questions.
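For anyone who wants to run the same kind of test on their own data, a minimal sketch of a Relevance@k scorer could look like this. It assumes each query has exactly one "correct" memory ID, which is how the metric is described above; the function name and the sample IDs are hypothetical.

```go
package main

import "fmt"

// relevanceAtK returns the fraction of queries whose expected memory ID
// appears among the first k retrieved IDs.
// results[i] is the ranked list of IDs retrieved for query i;
// expected[i] is the ID that should have been found.
func relevanceAtK(results [][]string, expected []string, k int) float64 {
	if len(expected) == 0 || k <= 0 {
		return 0
	}
	hits := 0
	for i, want := range expected {
		if i >= len(results) {
			break // queries without results count as misses
		}
		ranked := results[i]
		if len(ranked) > k {
			ranked = ranked[:k]
		}
		for _, id := range ranked {
			if id == want {
				hits++
				break
			}
		}
	}
	return float64(hits) / float64(len(expected))
}

func main() {
	results := [][]string{
		{"m7", "m2", "m9"},
		{"m4", "m1", "m3"},
		{"m5", "m8", "m6"},
		{"m2", "m3", "m1"},
	}
	expected := []string{"m2", "m1", "m0", "m1"}
	fmt.Printf("Relevance@2: %.2f\n", relevanceAtK(results, expected, 2)) // 0.50
	fmt.Printf("Relevance@3: %.2f\n", relevanceAtK(results, expected, 3)) // 0.75
}
```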

The contenders

Here's what I tested:

| Model | Dimensions | Provider | Cost per 1M tokens |
|---|---|---|---|
| text-embedding-3-small | 1536 | OpenAI | $0.02 |
| text-embedding-3-small | 512 | OpenAI | $0.02 (same price, smaller dims) |
| text-embedding-3-large | 3072 | OpenAI | $0.13 |
| text-embedding-ada-002 | 1536 | OpenAI | $0.10 |
| all-MiniLM-L6-v2 | 384 | Local (HuggingFace) | Free |
| nomic-embed-text-v1.5 | 768 | Local (Ollama) | Free |
| mxbai-embed-large | 1024 | Local (Ollama) | Free |
| llama3:8b (embedding mode) | 4096 | Local (Ollama) | Free |

Let's get to the results.

The results: Relevance@5 matters more than you think

I scored each model based on how often the correct conversation snippet was in the top 5 results. Because in AI memory, if the right memory isn't in the top 5, it doesn't get into context — so it might as well not exist.

Here's what I found:

1. OpenAI text-embedding-3-small (512 dim) — Winner for most people

Relevance@5: 89%

Speed: ~200ms per 100 snippets (API)

Cost: $0.00000002 per token → at roughly 100 tokens per memory, 10,000 memories ≈ $0.02

I didn't expect this. Cutting dimensions from 1536 to 512 barely hurt quality (relevance dropped only 2 points, from 91% to 89%) but cut storage size by two-thirds. For AI semantic memory, that's a huge win: smaller embeddings mean faster queries and cheaper storage.

Pros:

  • Shockingly good quality for the price
  • 512 dimensions is the sweet spot — smaller storage, faster queries
  • OpenAI API is reliable, no maintenance
  • Cheaper than coffee → at that rate you'd have to store about 5 million memories to spend $10

Cons:

  • Closed source, your data goes to OpenAI
  • Rate limits if you're embedding a ton at once
  • Still requires internet

Who should use this: 90% of people building personal AI semantic memory. Just use this. You'll be happier.

// Example using OpenAI 512-dim embeddings with Vibe-Memory
package main

import (
    "context"
    "log"

    "github.com/kevinten10/vibe-memory/pkg/vibe"
)

func main() {
    ctx := context.Background()

    config := vibe.DefaultConfig()
    config.Embedding.Model = "text-embedding-3-small"
    config.Embedding.Dimensions = 512 // This is all you need

    memory, err := vibe.New(config)
    if err != nil {
        log.Fatal(err)
    }

    // Add a memory
    err = memory.AddMemory(ctx, vibe.Memory{
        Content: "Remember when we talked about that hiking trail...",
    })
    if err != nil {
        log.Fatal(err)
    }

    // Search for the 5 most relevant memories
    results, err := memory.Search(ctx, "hiking trail", 5)
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("results: %v", results)
}

2. Local all-MiniLM-L6-v2 — Best free option

Relevance@5: 81%

Speed: ~80ms per 100 snippets (local)

Cost: Free, runs on your laptop

This little model punches way above its weight class. 81% relevance with only 384 dimensions? That's insane. It's not as good as OpenAI's best, but for personal use it's good enough — and it's 100% local, no API calls, no cost.

What surprised me: I thought "small local model" meant "bad quality." But for conversation snippets — which are pretty short anyway — it works way better than I expected. If you're privacy-focused and okay with slightly worse recall, this is incredible.

Pros:

  • 100% local, no API keys, no network calls
  • Tiny (only ~90MB download)
  • Fast enough for interactive use
  • Totally free

Cons:

  • Lower relevance than the best OpenAI models
  • Doesn't handle long paragraphs as well

Who should use this: Privacy fanatics, people who want to run everything offline, tinkerers who don't want to pay anything.

3. OpenAI text-embedding-3-large — Best quality, but overkill for most

Relevance@5: 94%

Speed: ~300ms per 100 snippets

Cost: $0.13 per 1M tokens → 10,000 memories = ~$0.13

Yes, it's better. But is it 6.5x better than text-embedding-3-small? No way. The quality gain is marginal — from 89% to 94% — but the cost is 6.5x higher. For most personal use cases, it's just not worth it.

That said, if you're building something for production and you need every last bit of recall, go for it. But for your personal AI assistant? Save your money.
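The cost comparison is simple enough to check yourself. This sketch assumes about 100 tokens per memory, which is what the "10,000 memories" figures above imply at the listed per-token prices; swap in your own average snippet length.

```go
package main

import "fmt"

// embeddingCostUSD estimates the one-time cost of embedding a set of memories.
// tokensPerMemory is an assumption; ~100 matches the figures quoted in the article.
func embeddingCostUSD(memories, tokensPerMemory int, pricePerMillionTokens float64) float64 {
	totalTokens := float64(memories) * float64(tokensPerMemory)
	return totalTokens / 1_000_000 * pricePerMillionTokens
}

func main() {
	const memories, tokensPerMemory = 10_000, 100
	small := embeddingCostUSD(memories, tokensPerMemory, 0.02)
	large := embeddingCostUSD(memories, tokensPerMemory, 0.13)
	// prints: 3-small: $0.02, 3-large: $0.13, ratio: 6.5x
	fmt.Printf("3-small: $%.2f, 3-large: $%.2f, ratio: %.1fx\n", small, large, large/small)
}
```

At these volumes both bills are tiny in absolute terms, so the ratio only starts to matter once you're embedding millions of snippets.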

4. Nomic Embed Text v1.5 (local via Ollama) — Great middle ground for local

Relevance@5: 86%

Speed: ~200ms per 100 snippets

Cost: Free

This is my pick of the local models I tested. 86% relevance is really close to OpenAI's text-embedding-3-small, and it's totally free running locally. At 768 dimensions, it's bigger than all-MiniLM but still manageable.

If you want local and you want better quality than all-MiniLM, this is your pick. I was honestly impressed — it handles nuance better than the smaller local models.

5. text-embedding-ada-002 — Don't use this anymore

Relevance@5: 84%

Cost: $0.10 per 1M tokens → 5x more expensive than text-embedding-3-small, worse quality.

OpenAI replaced this model for a reason. text-embedding-3-small is cheaper and better. Just don't use ada-002 for new projects. Move on.

6. MXBAI Embed Large (Ollama local) — Good but heavy

Relevance@5: 87%

Size: ~1.5GB

Speed: ~350ms per 100 snippets

Slightly better than Nomic, but much bigger and slower. If you've got the hardware and want the best local quality, it's great. But for most laptops, Nomic is a better sweet spot.

7. Llama 3 8B (embedding mode) — Don't do this

Relevance@5: 82%

Speed: ~1200ms per 100 snippets

Memory: ~5GB VRAM

Look, Llama 3 is great for generating text. But using it for embeddings? It's slow, big, and the quality isn't better than much smaller embedding-specific models. Save your GPU for generation, use a dedicated embedding model. I fell into this trap — don't do it.

The big surprise: Dimensions don't matter as much as you think

OpenAI's text-embedding-3 models accept a dimensions parameter that shortens the vectors they return. I tested 1536 vs 512, and here's what happened:

  • 1536 dim: 91% relevance
  • 512 dim: 89% relevance

That's a 2-point drop in relevance for a two-thirds (about 67%) reduction in size. Are you kidding me? That's a no-brainer. For conversation memory, you almost never need the full dimensions.

The takeaway: Start small, go bigger only if you actually need it. Most people reading this should just use 512 dimensions with text-embedding-3-small. You won't notice the quality difference, but your database will thank you.
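If you already have full-size vectors stored, you don't necessarily have to re-embed to try a smaller size. For models trained to support shortened embeddings (the text-embedding-3 family is), a common approach is to keep the first N values and rescale the result back to unit length. Here's a minimal sketch of that, plus the raw storage math. It assumes float32 storage and ignores per-row and index overhead; the function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// truncateAndNormalize keeps the first dims values of v and rescales the
// result to unit length so cosine and dot-product scores stay comparable.
func truncateAndNormalize(v []float32, dims int) []float32 {
	if dims < 0 {
		dims = 0
	}
	if dims > len(v) {
		dims = len(v)
	}
	out := make([]float32, dims)
	copy(out, v[:dims])

	var sumSq float64
	for _, x := range out {
		sumSq += float64(x) * float64(x)
	}
	if sumSq == 0 {
		return out // nothing to rescale
	}
	norm := float32(math.Sqrt(sumSq))
	for i := range out {
		out[i] /= norm
	}
	return out
}

// rawBytes is the size of count float32 vectors, ignoring row and index overhead.
func rawBytes(count, dims int) int {
	return count * dims * 4
}

func main() {
	full := []float32{3, 4, 12, 0} // toy 4-dim vector standing in for a real embedding
	fmt.Println(truncateAndNormalize(full, 2))

	big, small := rawBytes(10_000, 1536), rawBytes(10_000, 512)
	fmt.Printf("1536 dims: %.1f MB, 512 dims: %.1f MB, saved: %.0f%%\n",
		float64(big)/1e6, float64(small)/1e6, 100*(1-float64(small)/float64(big)))
}
```

Don't try this on models that weren't trained for it; chopping dimensions off an arbitrary embedding usually costs a lot more quality.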

My actual production setup right now

After all this testing, what am I actually using daily?

Model: OpenAI text-embedding-3-small (512 dimensions)
Database: PostgreSQL + pgvector
Memory cost: ~$0.10 per month for my 5,000+ conversations
Storage: ~10MB of raw vectors for 5,000 embeddings (512 float32 values ≈ 2KB each) → yes, megabytes, not gigabytes
Average search time: ~15ms on my laptop

It's perfect. Fast enough that I never wait, cheap enough that I don't think about the bill, and quality is good enough that it almost always finds what I'm looking for.

When I'm traveling without internet, I switch to Nomic Embed local via Ollama — still works great, just slightly lower recall.

Pros & Cons: The honest breakdown

Let me cut through the marketing and tell you straight:

What works really well with this approach:

  • It's cheap AF: you can run a personal semantic memory system for pennies a month
  • Small models are good enough: you don't need a 70B model to embed conversation snippets
  • 512 dimensions is the sweet spot: massive size reduction, minimal quality loss
  • Local options are actually viable now: you don't have to use OpenAI if you don't want to
  • Semantic search > keyword search: it actually understands what you mean, not just what words you used

What still sucks:

  • Nothing gets 100%: sometimes the right memory still doesn't show up. That's just embedding search, it's not perfect.
  • Local models need RAM: even the small ones need ~1-2GB RAM. Not a big deal for laptops, but worth noting.
  • OpenAI = privacy tradeoff: your conversation snippets go to OpenAI. If that's a problem for you, use local.
  • Embeddings aren't magic: if your query is super vague ("that thing I liked last month"), even the best model won't read your mind.

What I learned the hard way

I started this project thinking "embeddings are embeddings, just pick one and go." Here's what I wish I knew three weeks ago:

  1. Dedicated embedding models beat general-purpose LLMs for embeddings — stop using your 7B chat model to do embeddings. It's slower and worse. Use a model built for the job.

  2. Dimensions are overhyped — cutting dimensions from 1536 to 512 barely affects recall for short text like conversations. I wish I'd started with 512.

  3. Local is actually good enough now: a 90MB local model hitting 81% relevance is plenty for personal use. The future is here.

  4. You don't need a fancy vector database — PostgreSQL + pgvector handles 10,000+ embeddings totally fine. No need for Elasticsearch or Pinecone for personal use. Keep it simple.

  5. Relevance@k is what actually matters — all the fancy benchmark numbers in the world don't matter if the right memory isn't in your top 5 results. Test with your data.

Try it yourself

Vibe-Memory is open source, MIT license, and it's super easy to get started:

git clone https://github.com/kevinten10/vibe-memory
cd vibe-memory
cp .env.example .env
# Add your OpenAI API key to .env
go run cmd/main.go examples/add-conversation.go

You can swap out the embedding model in 5 lines of code. It supports OpenAI, local HuggingFace models, and Ollama out of the box.

Your turn

I've been living with semantic memory for a couple months now, and honestly, I can't go back. Having an AI that actually remembers who I am, what we've talked about, and what I care about changes everything. It's not perfect, but it's game-changing.

But enough from me — I want to hear from you:

  • Are you building semantic memory into your AI workflows? What embedding model are you using?
  • Have you tried the new OpenAI text-embedding-3 models? What's your experience?
  • Do you prefer local embeddings for privacy, or are you okay with cloud for convenience?

Drop a comment below and let's compare notes. I'm always looking for better approaches — maybe you've found a model that beats everything I tested here.


If you found this helpful, give Vibe-Memory on GitHub a ⭐ — it helps other people find the project. And follow me for more raw, honest side project build logs.

Source: dev.to
