How I Built a Code Review Agent That Gets Smarter Every Time a Developer Hits Accept


Most AI tools forget everything the moment you close the tab. I wanted to build one that didn't.

The idea was straightforward: a code review agent that reads your pull request diff, flags real issues, and — crucially — remembers which suggestions your team actually liked and which ones they ignored. After a few reviews, it stops nagging you about trailing whitespace if you've been rejecting those comments all along. It starts sounding less like a linter and more like a senior engineer who's worked with your team for six months.

That's the system I built with Hindsight agent memory, Groq, and FastAPI. Here's how it works, where it got interesting, and what I learned.


What the system does

The agent sits in front of your pull requests. A developer opens a PR, clicks "Run AI Review," and the system does three things in order:

  1. Recall — it fetches team standards and past review patterns from Hindsight, the memory layer
  2. Review — it sends the diff plus that memory context to Groq's qwen/qwen3-32b model, which returns structured JSON comments
  3. Retain — when the developer clicks Accept or Reject on a comment, that decision gets stored back into Hindsight

The frontend is a three-panel layout: PR sidebar on the left, syntax-highlighted diff in the middle, review comments with Accept/Reject buttons on the right. Each comment shows severity (critical, warning, suggestion, praise), the file and line it refers to, and an optional code fix.

The loop closes when a developer rejects "use logging instead of print" for the third time. The next review won't bring it up.



The memory layer: retain() and recall()

The part I found most interesting to build was the memory integration. Most agents I've seen treat each request as stateless — the LLM gets a system prompt, does its job, and forgets. Hindsight changes that with two primitives: retain() and recall().

Here's the actual wrapper I wrote:

import httpx

# BASE_URL and HEADERS (the Hindsight endpoint and auth) are configured elsewhere in the app.

async def retain_feedback(repo: str, pr_id: str, comment: str, file: str, action: str):
    payload = {
        "collection": f"reviews:{repo}",
        "content": f"PR #{pr_id} | File: {file} | Comment: {comment} | Developer {action} this suggestion.",
        "metadata": {"pr_id": pr_id, "file": file, "action": action},
    }
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{BASE_URL}/retain", json=payload, headers=HEADERS)
        resp.raise_for_status()

And recall:

async def recall_context(repo: str) -> dict:
    async with httpx.AsyncClient() as client:
        patterns_resp = await client.post(
            f"{BASE_URL}/recall",
            json={"collection": f"reviews:{repo}", "query": "accepted and rejected review comments", "top_k": 10},
            headers=HEADERS,
        )
        patterns_resp.raise_for_status()
    past_patterns = "\n".join(r["content"] for r in patterns_resp.json().get("results", []))
    # Team standards come from a second recall (falling back to the hardcoded defaults); elided here.
    return {
        "team_standards": TEAM_STANDARDS,
        "past_patterns": past_patterns or "No past patterns yet.",
    }

The key design decision was storing human-readable strings rather than embeddings or structured records. The content field reads like a sentence — "PR #42 | File: auth.py | Comment: Use parameterized queries | Developer accepted this suggestion." — because that's what gets injected into the LLM prompt later. The model can read it and understand it without any additional parsing.
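Concretely, those recalled sentences get pasted straight into the system prompt. A minimal sketch of what that assembly might look like (the instruction wording here is my own, not the app's actual prompt):

```python
def build_system_prompt(team_standards: str, past_patterns: str) -> str:
    """Assemble the review prompt; recalled memory is injected as plain text, no parsing needed."""
    return (
        "You are a senior code reviewer. Return ONLY a JSON array of comments.\n\n"
        f"Team standards:\n{team_standards}\n\n"
        f"Past review feedback (adjust your suggestions accordingly):\n{past_patterns}\n"
    )
```

Because the memory is already prose, this function is pure string concatenation — there is no deserialization step between recall and prompt.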


The review pipeline

The backend is a FastAPI app with three endpoints that matter:

  • GET /prs — serves the pull request list to the frontend
  • POST /review — runs the full pipeline: recall → parse diff → call Groq → return structured comments
  • POST /feedback — called on each Accept/Reject click, stores the decision via retain()

The /review endpoint is where everything connects:

@app.post("/review")
async def review_pr(request: ReviewRequest):
    memory = await recall_context(request.repo)
    chunks = parse_diff(request.diff)
    comments = await generate_review(
        pr_id=request.pr_id,
        title=request.title,
        chunks=chunks,
        team_standards=memory["team_standards"],
        past_patterns=memory["past_patterns"],
    )
    return {"pr_id": request.pr_id, "comments": comments, "memory_used": memory}

The diff parser splits a unified diff into per-file chunks with addition and deletion counts. This matters because the LLM prompt includes each file's diff separately with a header — it makes the model's references to specific line numbers more reliable.
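Here is roughly what that parser can look like — a simplified sketch rather than the exact implementation, including a whole-input fallback for diffs with no recognizable headers:

```python
import re

def parse_diff(diff: str) -> list[dict]:
    """Split a unified diff into per-file chunks with addition/deletion counts (sketch)."""
    chunks: list[dict] = []
    current = None
    for line in diff.splitlines():
        m = re.match(r"^diff --git a/(\S+) b/(\S+)", line)
        if m:
            current = {"file": m.group(2), "lines": [], "additions": 0, "deletions": 0}
            chunks.append(current)
        if current is None:
            continue
        current["lines"].append(line)
        if line.startswith("+") and not line.startswith("+++"):
            current["additions"] += 1
        elif line.startswith("-") and not line.startswith("---"):
            current["deletions"] += 1
    if not chunks:
        # Fallback: no file headers detected, treat the entire input as one chunk.
        lines = diff.splitlines()
        chunks = [{
            "file": "(unknown)",
            "lines": lines,
            "additions": sum(1 for l in lines if l.startswith("+") and not l.startswith("+++")),
            "deletions": sum(1 for l in lines if l.startswith("-") and not l.startswith("---")),
        }]
    return chunks
```

Each chunk then gets its own header in the LLM prompt, which keeps the model's line-number references anchored to one file at a time.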

The Groq call uses qwen/qwen3-32b with a system prompt that instructs the model to return only a JSON array. Each element has a file, line number, severity, category, comment, and optional suggestion field. I added a fallback that wraps raw text in a single comment object if the JSON parse fails — which happened more often than I expected during early testing before I tightened the system prompt.
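The parse-with-fallback step is easy to sketch (the field defaults in the fallback object are my own, not the app's exact values):

```python
import json

def parse_model_output(raw: str) -> list[dict]:
    """Parse the model's JSON array; wrap raw text in one comment object on failure."""
    try:
        comments = json.loads(raw)
        if isinstance(comments, list):
            return comments
    except json.JSONDecodeError:
        pass
    # Fallback: the model didn't return valid JSON, surface its text as a single comment.
    return [{"file": "general", "line": 0, "severity": "suggestion",
             "category": "general", "comment": raw.strip()}]
```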


What a real review looks like

Here's what the agent returned on a PR adding a user authentication endpoint:

Critical — auth.py:9 — security

User input is interpolated directly into the SQL query string — this is a textbook SQL injection vulnerability.
Fix: cursor.execute('SELECT * FROM users WHERE username = ? AND password = ?', (username, password))

Critical — auth.py:12 — security

MD5 is a broken hashing algorithm and must not be used for tokens or passwords. It is trivially reversible with rainbow tables.
Fix: Use secrets.token_hex(32) for session tokens, bcrypt or argon2 for passwords.

Warning — auth.py:7 — bug

The database connection is never closed. On high traffic this exhausts the SQLite file handle limit.
Fix: Use a context manager: with sqlite3.connect('users.db') as conn:

Praise — auth.py:5 — documentation

Good docstring — clearly describes what the function returns. This makes it easy to use correctly.

The praise comment matters. A reviewer that only flags problems is easy to tune out. Including something genuine when the code does something well makes the rest of the review land differently.


Lessons learned

1. Human-readable memory beats structured records — for LLMs.
I initially considered storing feedback as JSON objects with typed fields. I switched to plain sentences because the LLM uses the recalled context directly in the prompt. It doesn't need to deserialize anything; it just reads "Developer rejected this suggestion" and adjusts accordingly.

2. The feedback loop is the product.
The first review an agent produces is mediocre. The tenth is genuinely useful. If you're building an agent like this, the Accept/Reject UI is not a nice-to-have — it is the core mechanism that makes everything else worth building.

3. Diff parsing is harder than it looks.
Unified diffs have edge cases everywhere — files with no standard headers, diffs that start mid-hunk, filenames with spaces. I ended up with a fallback that treats the entire input as a single chunk if no file headers are detected. It is not elegant but it is resilient.

4. Groq is genuinely fast.
The full pipeline — recall from Hindsight, parse diff, call Groq, return JSON — completes in about 2–3 seconds end to end. For a code review tool that needs to feel interactive, that matters.

5. Design for the no-API-key case from the start.
Both Hindsight and Groq integrations have graceful fallbacks — hardcoded team standards and mock comments respectively. This made local development and demos dramatically easier. The app works end to end without any keys, which meant I could build and test the frontend without ever blocking on API setup.
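The pattern itself is simple — a sketch, with illustrative names and mock contents rather than the app's actual fallback data:

```python
import os

# Hardcoded mock review returned when no key is configured (contents illustrative).
MOCK_COMMENTS = [{
    "file": "auth.py", "line": 9, "severity": "critical",
    "category": "security", "comment": "Possible SQL injection via string interpolation.",
}]

def generate_review_with_fallback(prompt: str, call_model=None) -> list[dict]:
    """Route to the real model only when a GROQ_API_KEY is present."""
    if call_model is None or not os.environ.get("GROQ_API_KEY"):
        return MOCK_COMMENTS        # keyless local-dev / demo path
    return call_model(prompt)       # real Groq call, injected by the caller
```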


What's next

The obvious next step is connecting real GitHub PRs via the API instead of the static fake PR data. The diff format is identical — it's just a matter of fetching it from api.github.com/repos/{repo}/pulls/{number} with an Accept: application/vnd.github.v3.diff header.

The more interesting extension is per-team memory segmentation. Right now, all feedback goes into a single reviews:{repo} collection. With multiple teams working on the same repo, you could store feedback in team-specific collections and recall from the right one based on who opened the PR.

The core insight though is simpler: agent memory turns a one-shot tool into something that compounds. Every accept or reject makes the next review marginally more accurate. Across hundreds of reviews and thousands of decisions, that compounding adds up to an agent that actually understands how your team writes code.

That's worth building for.

Source: dev.to
