I built a conversational RAG memory for my wife's LinkedIn agency for $44


My wife runs an agency that helps tech founders grow their LinkedIn presence. That means dozens of calls every week — clients, their teams, new leads, partner agencies, and more. All recorded through TL;DV.

The problem: when she needs to pull together context from an interview with a founder, something the CTO mentioned two months ago, and an insight from last week's call with a marketer — there's no good way to do it. Pasting full transcripts into Claude burns through context limits fast, and every new session starts from scratch.


What it does

TLBrain indexes call transcripts and gives Claude a persistent, searchable memory of every client conversation.

When my wife asks "what did we discuss with Acme last month?" — Claude queries the index, retrieves the relevant transcript segments, and answers with actual context from the calls.

The flow looks like this:

  1. TL;DV records a call and fires a webhook
  2. An import service converts the transcript into a Google Doc and places it in the right client folder in Google Drive — automatically
  3. A sync service picks up new and changed documents, parses them, generates summaries and facts via Gemini, and indexes everything into Qdrant
  4. A remote MCP server connects to Claude — accessible as a tool in both Claude.ai chat and Claude Cowork, on any device

By default, Claude only gets the relevant segments retrieved for each query — but there's also a tool to fetch the full transcript when needed.


What it costs

232 transcripts indexed for $44 — one-time cost. Each new transcript costs ~$0.19 to index.
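The per-transcript figure follows directly from those totals. A quick check of the arithmetic, where the monthly call volume is a made-up number used only to show how the cost scales:

```python
# Figures from the post: 232 transcripts indexed for $44 in total.
TOTAL_COST_USD = 44.0
TRANSCRIPTS = 232

cost_per_transcript = TOTAL_COST_USD / TRANSCRIPTS
print(f"per transcript: ${cost_per_transcript:.2f}")   # per transcript: $0.19

# Hypothetical volume, not from the post: 40 new calls in a month.
calls_per_month = 40
monthly_cost = calls_per_month * cost_per_transcript
print(f"{calls_per_month} calls/month: ${monthly_cost:.2f}")   # 40 calls/month: $7.59
```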

Infrastructure runs entirely on free tiers:

  • Cloud Run
  • Firestore
  • Google Drive
  • Qdrant Cloud

All four stay within free tier limits for a small agency workload.

The only meaningful cost is the Gemini API (Tier 1), used for generation and embeddings during indexing. The free tier's rate limits are strict enough to make indexing 200+ transcripts impractical. Embeddings are generated only for summaries and facts, not for utterances. Utterances are stored with BM25 sparse vectors and fetched by index range, which keeps both cost and vector storage size low.

A few details keep the cost low. Embeddings use gemini-embedding-001 with output_dimensionality=768, a quarter of the default 3072 dimensions, so each stored vector is 4× smaller. The summary and facts for a window are generated in a single Gemini request. And if a file hasn't changed, it's skipped entirely: Gemini is never called again for the same content.
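To put the dimensionality choice in numbers, here is a rough storage estimate. It assumes float32 vectors (4 bytes per dimension) and a hypothetical count of embedded points; neither figure comes from the post, and real usage also depends on index and payload overhead in Qdrant.

```python
BYTES_PER_FLOAT32 = 4  # assumption: dense vectors stored as float32


def vector_bytes(dimensions: int, points: int) -> int:
    """Raw size of `points` dense vectors, ignoring index and payload overhead."""
    return dimensions * BYTES_PER_FLOAT32 * points


points = 50_000  # hypothetical number of embedded summaries and facts

small = vector_bytes(768, points)
large = vector_bytes(3072, points)
print(f"768 dims:  {small / 1e6:.1f} MB")   # 768 dims:  153.6 MB
print(f"3072 dims: {large / 1e6:.1f} MB")   # 3072 dims: 614.4 MB
print(f"ratio: {large // small}x")          # ratio: 4x
```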

The $0.19 per transcript is a one-time indexing cost — you pay to embed the conversation once, not every time Claude searches it. Stop recording calls for a month, go on vacation, pause the business — the system costs nothing during that time. You only pay again when new transcripts are indexed. That said, this assumes you stay within the free tiers — once your call volume grows beyond those limits, infrastructure costs will kick in.

Without an index, asking Claude about a specific client means pasting entire transcripts into the context window — hitting message limits fast, and starting from scratch every session. TLBrain sends Claude only the relevant fragments: up to 75 utterances out of hundreds.


Why not just use Claude Projects?

Claude Projects is the obvious first answer. But it has two hard limits for this use case:

  • Context ceiling — paste enough transcripts and you hit the limit. Every new session starts from scratch.
  • No structure — there's no concept of clients, dates, or searchable facts. Everything is a flat pile of documents.

Imagine she needs to find what a client said about budget across three calls from different months. With Claude Projects, she'd need to manually find the right transcripts, paste them one by one, and hope they fit in context. With 232 calls in the archive, that's not a workflow — that's a research project. TLBrain returns the relevant fragments in seconds.


Architecture

TL;DV API  ←─────────────────  Reconciliation (Cloud Run, daily)
     ↓
Webhook Handler (Cloud Function)
     ↓
Import Service (Cloud Run)
     ↓
Google Drive + Firestore  ←───  Sync Checker (Cloud Function, every 15 min)
     ↓
Vector Sync (Cloud Run)
     ↓
Qdrant Cloud + Firestore
     ↓
MCP Server (Cloud Run)
     ↓
Claude (chat / Cowork)

Note: Firestore is used throughout as the state store — tracking import status, content hashes, and sync state.

Six services sounds like a lot — but each split is intentional. The webhook handler must respond to TL;DV in under 2 seconds or the delivery is marked as failed, so import runs in a separate service. The MCP server is isolated from the sync pipeline so a slow indexing job never blocks Claude's queries. Services are also split by runtime pattern: Cloud Functions wake up, check something, and go back to sleep — no idle cost. Cloud Run containers handle long-running tasks and stay warm longer: the MCP server keeps its instance alive for 15 minutes after the last request, so there are no cold starts during an active session.

Client folders in Google Drive are the source of truth for data organization. Each subfolder under the root is a client name. The sync service doesn't know about TL;DV — it only reads Google Docs from Drive. The import service and sync service are fully decoupled.

A daily reconciliation job cross-checks TL;DV's API against Firestore and queues anything the webhook missed — so no transcript gets lost silently.
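At its core, this reconciliation is a set difference between what TL;DV reports and what Firestore has recorded. A minimal sketch, with in-memory collections standing in for the TL;DV API response and the Firestore state; the function name and meeting IDs are hypothetical:

```python
def find_missed_meetings(tldv_ids: list[str], imported_ids: set[str]) -> list[str]:
    """Return meeting IDs that TL;DV lists but that were never imported.

    The result keeps the order of the TL;DV listing so the queue is deterministic.
    """
    return [meeting_id for meeting_id in tldv_ids if meeting_id not in imported_ids]


tldv_ids = ["m-101", "m-102", "m-103", "m-104"]   # stand-in for the API listing
imported_ids = {"m-101", "m-103"}                  # stand-in for Firestore records

to_queue = find_missed_meetings(tldv_ids, imported_ids)
print(to_queue)  # ['m-102', 'm-104']
```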


One unexpected benefit

If a transcript contains transcription errors, I simply edit the Google Doc.

The sync service detects the change, regenerates summaries, re-embeds affected chunks, and updates Qdrant automatically.
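Change detection of this kind can be done by comparing a content hash against the one stored at the last sync. The sketch below assumes SHA-256 over the document text (in line with the "version" field in the utterance record later in the post) and uses a plain dict in place of Firestore. The function names are illustrative, not taken from the TLBrain code.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reindex(doc_id: str, text: str, stored: dict[str, str]) -> bool:
    """True when the document is new or its text differs from the last sync."""
    return stored.get(doc_id) != content_hash(text)


stored_hashes: dict[str, str] = {}  # stand-in for the Firestore state store

original = "Alice: We're planning to launch in Q3."
first_seen = needs_reindex("doc-1", original, stored_hashes)   # True: never indexed
stored_hashes["doc-1"] = content_hash(original)                # record after indexing

unchanged = needs_reindex("doc-1", original, stored_hashes)    # False: skip Gemini
edited = original.replace("Q3", "Q4")                          # a fix made in the Google Doc
changed = needs_reindex("doc-1", edited, stored_hashes)        # True: re-index
print(first_seen, unchanged, changed)
```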

No admin panel required.


Why traditional RAG chunking fails on conversations

Standard RAG splits text by token count. That works for documents but breaks on transcripts:

  • speaker boundaries are lost
  • replies get split mid-thought
  • retrieval returns fragments without context

The fix: treat each utterance as the atomic unit.

{
    "type": "utterance",
    "doc_id": "1BxK...drive_file_id",
    "client_name": "Acme Corp",
    "dialog_date": "2025-03-12",
    "speaker": "Alice",
    "text": "We're planning to launch in Q3, budget is around 5000.",
    "order_index": 42,
    "version": "sha256_of_content",
}
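A sketch of how a transcript could be split into utterance records like the one above. It assumes one "Speaker: text" turn per line, which may not match the actual Google Doc layout, and parse_utterances is a hypothetical helper rather than a function from the project.

```python
import hashlib


def parse_utterances(
    transcript: str, doc_id: str, client_name: str, dialog_date: str
) -> list[dict]:
    # One hash for the whole document, reused as the version of every utterance.
    version = hashlib.sha256(transcript.encode("utf-8")).hexdigest()
    utterances: list[dict] = []
    for line in transcript.splitlines():
        speaker, sep, text = line.partition(":")
        if not sep or not text.strip():
            continue  # skip blank lines and lines without a "Speaker:" prefix
        utterances.append({
            "type": "utterance",
            "doc_id": doc_id,
            "client_name": client_name,
            "dialog_date": dialog_date,
            "speaker": speaker.strip(),
            "text": text.strip(),
            "order_index": len(utterances),  # position among parsed turns
            "version": version,
        })
    return utterances


transcript = """Alice: So what's the timeline looking like?
Bob: We need to be live by end of Q3.

Alice: Budget is around 5000."""

utterances = parse_utterances(transcript, "doc-1", "Acme Corp", "2025-03-12")
print([(u["order_index"], u["speaker"]) for u in utterances])
# [(0, 'Alice'), (1, 'Bob'), (2, 'Alice')]
```

Because each record carries its own speaker and order_index, a retrieved utterance never loses track of who said it or where it sat in the call.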

But utterances alone lose context. "Price is 5000" is meaningless without the surrounding conversation. So I generate summaries over sliding windows using anchor-based windowing:

def generate_windows(
    utterances: list[dict],
    anchor_step: int = 3,
    half_window: int = 2,
) -> list[dict]:
    if not utterances:
        return []
    windows = []
    n = len(utterances)
    for i in range(0, n, anchor_step):
        start = max(0, i - half_window)
        end = min(n - 1, i + half_window)
        window_utterances = utterances[start : end + 1]
        windows.append({
            "center_index": utterances[i]["order_index"],
            "covered_range": [
                utterances[start]["order_index"],
                utterances[end]["order_index"],
            ],
            "utterances": window_utterances,
        })
    return windows
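To see what the windowing produces, here is a small run over seven dummy utterances. The function is restated in compact form so the snippet runs on its own; the utterance texts are placeholders.

```python
def generate_windows(utterances, anchor_step=3, half_window=2):
    # Compact restatement of the function above, same behavior.
    n = len(utterances)
    windows = []
    for i in range(0, n, anchor_step):
        start, end = max(0, i - half_window), min(n - 1, i + half_window)
        windows.append({
            "center_index": utterances[i]["order_index"],
            "covered_range": [
                utterances[start]["order_index"],
                utterances[end]["order_index"],
            ],
            "utterances": utterances[start : end + 1],
        })
    return windows


utterances = [{"order_index": k, "text": f"utterance {k}"} for k in range(7)]
windows = generate_windows(utterances)
for w in windows:
    print(w["center_index"], w["covered_range"])
# 0 [0, 2]
# 3 [1, 5]
# 6 [4, 6]
```

With the default step of 3 and half-window of 2, neighboring windows share two utterances (here indices 1-2 and 4-5), so a reply that straddles a window edge still appears whole in at least one summary.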

Summaries and facts are generated in English regardless of the original language — so Claude always queries in English for consistent retrieval quality.


Why semantic search alone isn't enough

With summaries indexed, semantic search works well — most of the time.

The failure case: "what was the price she mentioned?" Specific numbers, names, short factual statements — these rarely survive summarization. The summary might say "discussed pricing" but the actual figure only lives in the raw utterance.

Semantic miss. Keyword hit.

The fix: three complementary searches, two of which run in parallel.

Facts handle structured values like prices and dates. BM25 catches what semantic search misses — exact company names, abbreviations, or foreign words that don't survive summarization. If a client mentioned a specific vendor by name, semantic search might return "discussed partnerships" — BM25 finds the exact utterance.

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=2) as executor:
    semantic_future = executor.submit(
        search_summaries_and_facts,
        query=query, client_name=client_name,
        date_from=date_from, date_to=date_to,
        top_k=15,
    )
    keyword_future = executor.submit(
        keyword_search_utterances,
        query=query, client_name=client_name,
        date_from=date_from, date_to=date_to,
    )
semantic_hits = semantic_future.result()   # dense, score >= 0.6
keyword_hits  = keyword_future.result()    # BM25, no threshold

# Pin: user_facts bypass score threshold entirely
user_fact_hits = search_user_facts(query, client_name=client_name)
pinned_hits = []
if user_fact_hits:
    hits_by_doc = {}
    for h in user_fact_hits:
        hits_by_doc[h["doc_id"]] = hits_by_doc.get(h["doc_id"], 0) + 1
    for doc_id, hit_count in hits_by_doc.items():
        pinned_hits.extend(search_summaries_for_doc(doc_id, query, top_k=hit_count * 5))

Note: this uses ThreadPoolExecutor rather than asyncio.gather because the search functions call Qdrant through the synchronous client. The concurrency comes from a thread pool, not coroutines: the threads overlap while each one waits on network I/O.

Each search serves a different purpose:

  • Dense (semantic) over summaries and facts — finds topically relevant conversations
  • BM25 (keyword) over raw utterances — catches exact matches that don't survive summarization
  • Pin over user-added facts — forces specific documents into results regardless of score
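To illustrate why the keyword pass catches exact names, here is a toy BM25 scorer in plain Python. It is not how TLBrain or Qdrant implement it (the post stores BM25 sparse vectors in Qdrant). The whitespace tokenizer, the k1 and b values, the vendor name "Zentrix", and the sample utterances are all assumptions made for the demonstration.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive tokenizer: split on whitespace, trim basic punctuation, lowercase.
    return [t.strip(".,?!").lower() for t in text.split() if t.strip(".,?!")]


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with BM25. Expects a non-empty docs list."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


utterances = [
    "We talked about possible partnerships for next year.",
    "Zentrix quoted us 5000 for the pilot.",
    "Let's follow up on pricing after the holidays.",
]
scores = bm25_scores("What did Zentrix quote?", utterances)
best = max(range(len(scores)), key=scores.__getitem__)
print(utterances[best])  # Zentrix quoted us 5000 for the pilot.
```

Only the utterance containing the literal token "zentrix" gets a non-zero score, which is the behavior the keyword pass relies on when a summary has reduced the same exchange to "discussed partnerships".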

Results are merged, overlapping ranges within the same document are combined, and the final utterances are fetched by index range, so no second search is needed.
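The range-combining step can be sketched as a standard interval merge per document. The input shape (a doc_id plus an inclusive [start, end] range of order_index values) is an assumption based on the covered_range field shown earlier, and merging directly adjacent ranges as well as overlapping ones is a choice made for this sketch, not something the post specifies.

```python
from collections import defaultdict


def merge_ranges(hits: list[dict]) -> dict[str, list[list[int]]]:
    """Group hit ranges by doc_id and merge those that overlap or touch."""
    by_doc: dict[str, list[list[int]]] = defaultdict(list)
    for hit in hits:
        by_doc[hit["doc_id"]].append(list(hit["range"]))

    merged: dict[str, list[list[int]]] = {}
    for doc_id, ranges in by_doc.items():
        ranges.sort()
        result = [ranges[0]]
        for start, end in ranges[1:]:
            last = result[-1]
            if start <= last[1] + 1:        # overlapping or directly adjacent
                last[1] = max(last[1], end)
            else:
                result.append([start, end])
        merged[doc_id] = result
    return merged


hits = [
    {"doc_id": "doc-a", "range": [40, 44]},   # e.g. a summary window
    {"doc_id": "doc-a", "range": [43, 47]},   # overlaps the first
    {"doc_id": "doc-a", "range": [90, 92]},   # separate segment
    {"doc_id": "doc-b", "range": [5, 5]},     # e.g. a single keyword hit
]
print(merge_ranges(hits))
# {'doc-a': [[40, 47], [90, 92]], 'doc-b': [[5, 5]]}
```

Each merged range then maps to one fetch of the utterances whose order_index falls inside it.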

The output is a list of segments:

{
  "doc_id": "1BxKmN9vQ2rTzAp_drivefile",
  "client_name": "Acme Corp",
  "dialog_date": "2025-03-12",
  "segments": [
    {
      "range": [40, 44],
      "dialog": [
        {"speaker": "Alice", "text": "So what's the timeline looking like?", "order_index": 40},
        {"speaker": "Bob", "text": "I need to be live by end of Q3.", "order_index": 41},
        {"speaker": "Alice", "text": "We're planning to launch in Q3, budget is 5000.", "order_index": 42},
        {"speaker": "Bob", "text": "That works. Can you send a proposal by Friday?", "order_index": 43},
        {"speaker": "Alice", "text": "Sure, I'll have it over by Thursday.", "order_index": 44}
      ]
    }
  ]
}

This is what gets sent to Claude as context: not the full transcript, just the relevant fragments.


What's next

Today TLBrain indexes 232 conversations and gives Claude access to years of client history — without loading entire transcripts into context.

The whole project is open source: TLBrain on GitHub

In the next post I'll cover how I turned this into a production remote MCP server with Google OAuth on Cloud Run.

Source: dev.to
