A pattern I keep seeing in early AI-in-the-SDLC teams: someone wires an LLM into the PR-review pipeline as a quality gate, the LLM marks one perfectly fine PR as "risky" two weeks in, the team lead overrides it once and grumbles about it twice, and within a month the AI gate is silently disabled.
You can't recover from that. Once a TL has spent a Friday afternoon explaining to engineering why the AI thinks their PR is dangerous when it isn't, "AI dev tools" become a punchline in their next 1:1 with the CTO. And the AI was probably right some of the time — you just lost the chance to find out which times.
I'm building a SaaS that touches this problem (Codens, an AI dev harness — happy to talk about it but it's not the point of this post). When I designed the PR risk evaluation service for it, I started with five non-negotiable design rules that I think apply to any AI-in-workflow product:
- AI is advisory only. Never auto-blocks a merge.
- The TL owns code quality. Not the AI.
- OK needs no reason; NG (a rejection) needs one line. Don't ask humans to defend approvals.
- Raw events, derived facts, and AI suggestions are stored separately. Audit trail must survive any single layer being wrong.
- Configuration > code. Risk rules live in repository-editable config, not in the AI service's source tree.
This post is the implementation of those constraints — what the engine actually looks like, with the code that ships in production.
The architecture: rules first, AI second
The flow for a PR walks through three stages, in order:
PR webhook
↓
[1] Rules engine ← pure deterministic, no LLM
↓ rule matches + risk score
[2] Threshold check ← rule-driven "is this high-risk"
↓ if high-risk
[3] AI advisory layer ← LLM looks at the matched rules and PR diff,
offers an opinion, never blocks
The TL still presses the merge button. The AI's role is at most "raise a concern in a comment that the TL can dismiss." The rules engine's role is "tell the TL to look closer at certain files." Nothing fans out from the AI to anything that gates a deploy.
The rules engine: 4 match types, and that's it
Every rule has one of four match types. I deliberately resisted adding a fifth (regex was tempting; I kept saying no).
from dataclasses import dataclass
from enum import Enum

class RuleMatchType(Enum):
    FILE_PATH_GLOB = "file_path_glob"    # auth/**, billing/**, migrations/**
    KEYWORD = "keyword"                  # "permission", "PII", "encryption"
    LABEL = "label"                      # GitHub label "risk:high"
    FILE_EXTENSION = "file_extension"    # .sql, .tf, .pem

class RiskArea(Enum):                    # trimmed to the areas this post mentions
    AUTH = "auth"
    BILLING = "billing"
    DATA = "data"

@dataclass
class RiskRule:
    pattern: str
    match_type: RuleMatchType
    risk_area: RiskArea                  # "auth" / "billing" / "data" / etc.
    weight: float                        # 0.5, 1.0, 2.0
Why only four? Because the people writing these rules are repository owners, not the AI service author. If I add regex, the rules become unmaintainable for anyone who isn't a regex hobbyist. With glob + keyword + label + extension, you cover roughly 90% of useful "this PR touches a sensitive area" detection, and the remaining 10% probably needs a human anyway.
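To make "repository-editable" concrete, here's a hypothetical ruleset built from the types above; the patterns and weights are illustrative, not a shipped default:

example_ruleset = [
    RiskRule("auth/**", RuleMatchType.FILE_PATH_GLOB, RiskArea.AUTH, weight=2.0),
    RiskRule("permission", RuleMatchType.KEYWORD, RiskArea.AUTH, weight=0.5),
    RiskRule("risk:high", RuleMatchType.LABEL, RiskArea.DATA, weight=2.0),
    RiskRule(".sql", RuleMatchType.FILE_EXTENSION, RiskArea.DATA, weight=1.0),
]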
The evaluation is dead simple — fnmatch for the path glob, lowercased substring match for keywords, label lookup for labels:
async def assess_pr(self, pr: GitHubPullRequest) -> AssessmentResult:
    rulesets = await self.ruleset_repository.get_applicable_rulesets(
        project_id=pr.project_id,
        repository_id=pr.repository_id,
    )

    matches: list[RuleMatch] = []
    for ruleset in rulesets:
        for rule in ruleset.rules:
            if rule.match_type == RuleMatchType.FILE_PATH_GLOB:
                matches.extend(self._match_file_paths(pr, rule))
            elif rule.match_type == RuleMatchType.KEYWORD:
                matches.extend(self._match_keywords(pr, rule))
            elif rule.match_type == RuleMatchType.LABEL:
                matches.extend(self._match_labels(pr, rule))
            elif rule.match_type == RuleMatchType.FILE_EXTENSION:
                matches.extend(self._match_file_extensions(pr, rule))

    total_score = sum(m.weight for m in matches)
    is_high_risk = total_score >= self.high_risk_threshold  # default 1.0
Path matching uses fnmatch.fnmatch with ** collapsed to *:
def _match_file_paths(self, pr: GitHubPullRequest, rule: RiskRule) -> list[RuleMatch]:
    matches: list[RuleMatch] = []
    # fnmatch has no recursive-glob syntax, but its * already crosses "/",
    # so collapsing ** to * preserves the intended "anything under this dir" behavior
    glob_pattern = rule.pattern.replace("**", "*")
    for file_path in pr.changed_file_paths:
        if fnmatch.fnmatch(file_path, glob_pattern):
            matches.append(RuleMatch(
                rule_pattern=rule.pattern,
                match_type=rule.match_type.value,
                risk_area=rule.risk_area,
                matched_value=file_path,
                weight=rule.weight,
            ))
    return matches
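The keyword matcher is the same shape. A minimal sketch, assuming the PR title and body are the surfaces searched (those field names aren't shown above, so treat them as placeholders):

def _match_keywords(self, pr: GitHubPullRequest, rule: RiskRule) -> list[RuleMatch]:
    matches: list[RuleMatch] = []
    needle = rule.pattern.lower()
    # Assumption: title and body are the searched surfaces.
    for surface_name, text in (("title", pr.title), ("body", pr.body or "")):
        if needle in text.lower():
            matches.append(RuleMatch(
                rule_pattern=rule.pattern,
                match_type=rule.match_type.value,
                risk_area=rule.risk_area,
                matched_value=surface_name,  # audit trail: where the keyword hit
                weight=rule.weight,
            ))
    return matches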
Up to this point: zero LLM calls. A PR touching migrations/0042_add_user_pii.sql with a risk:high label and "permission" in the title gets a deterministic score derived from the rule weights, and that score is reproducible by anyone reading the ruleset.
High-risk → precheck profile, not AI judgment
When is_high_risk is true, the next thing that happens is not an LLM call. It's a precheck profile lookup:
if is_high_risk and assessment.risk_areas:
    precheck_profile = await self.profile_repository.get_best_profile_for_areas(
        risk_areas=assessment.risk_areas,
        project_id=pr.project_id,
    )
    if precheck_profile:
        assessment.precheck_results = [
            PrecheckItemResult(
                item_description=item.description,
                is_required=item.is_required,
            )
            for item in precheck_profile.items
        ]
A precheck profile is a checklist. For each risk area, the team has authored a list of "if you touched this area, did you remember to do X?" items. Examples:
- auth area: "Did you isolate OAuth scope changes into a separate PR?"
- billing area: "Does the price calculation route through bcp_price()?"
- migrations area: "Did you verify the migration is reversible with --dry-run?"
The PR author or reviewer ticks ✅ / ❌ + a one-line reason for ❌. The AI doesn't tick these. Ever. It can't, by design — the precheck profile is a human checklist, and the AI has no API to mark items.
If the engineer wants to know "is my migration reversible," they run --dry-run. If they want a second opinion, they ask the TL. The AI sees this whole exchange, but its participation is bounded.
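For reference, here's a minimal sketch of the profile model, consistent with the fields the snippet above reads (item.description, item.is_required); the production model presumably carries more metadata:

from dataclasses import dataclass, field

@dataclass
class PrecheckItem:
    description: str   # "Did you verify the migration is reversible with --dry-run?"
    is_required: bool  # required items must be resolved before the TL signs off

@dataclass
class PrecheckProfile:
    risk_areas: list[RiskArea]                       # areas this checklist covers
    items: list[PrecheckItem] = field(default_factory=list)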
Where the AI finally shows up
After the rules-first stages run, the AI gets to chime in. Two narrow things only:
1. Per-precheck-item advisory. The AI can attach a comment to a checklist item: "Looking at the related files, I think items 2 and 3 are the ones worth reviewing closely." It cannot mark them done. It cannot add new items. It can only annotate.
2. Anomaly detection. Things like "this PR is 4× larger than the median PR for this area" or "the test count went down by 30% but the implementation grew." Statistical pattern matching on top of historical data, surfaced as a comment.
@dataclass
class AISuggestion:
    suggestion_type: str  # "anomaly" | "precheck_advisory" | "scope_warning"
    severity: str         # "info" | "warning" ← never "blocking"
    description: str
    confidence: float     # 0.0-1.0
    is_dismissed: bool    # TL can dismiss → never re-shown for this PR
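The anomaly check is plain statistics before any LLM sees it. A sketch of the size check, assuming you have a list of merged-PR sizes for the same risk area (the threshold, confidence, and function name are illustrative):

import statistics

def detect_size_anomaly(pr_size: int, historical_sizes: list[int]) -> AISuggestion | None:
    if len(historical_sizes) < 10:  # too little history to call anything an outlier
        return None
    median_size = statistics.median(historical_sizes)
    if median_size > 0 and pr_size >= 4 * median_size:
        return AISuggestion(
            suggestion_type="anomaly",
            severity="warning",  # advisory only; "blocking" doesn't exist
            description=f"This PR is {pr_size / median_size:.1f}x the median size for this area.",
            confidence=0.7,      # illustrative
            is_dismissed=False,
        )
    return None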
The is_dismissed flag does a lot of work. A TL who decides the AI's anomaly detection is too noisy on a particular repo can dismiss its suggestions for a few PRs in a row, and they stop appearing. The AI gets quieter automatically when the team finds it unhelpful, instead of making them open a settings page to silence it.
This is, I think, the single most important UX move in any AI-in-workflow product: let the human turn it down without leaving the workflow.
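The quieting behavior itself is just a streak counter. A minimal sketch, assuming a repository method that returns the most recent suggestions of one type for one repo (hypothetical interface):

async def should_suppress(self, repository_id: str, suggestion_type: str, streak: int = 3) -> bool:
    recent = await self.suggestion_repository.get_recent(
        repository_id=repository_id,
        suggestion_type=suggestion_type,
        limit=streak,
    )
    # If the TL dismissed the last `streak` suggestions of this type,
    # stop emitting them for this repo until the team engages again.
    return len(recent) == streak and all(s.is_dismissed for s in recent)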
Three-table audit log
The other constraint that paid off in practice was separating the storage of three things that look similar but aren't:
GitHub webhook
↓
raw_events (immutable, the bytes that arrived)
↓ parser
gate_events / pr_stats (derived facts: "PR opened", "review submitted")
↓ AI evaluation (non-blocking)
ai_suggestions (the AI's commentary on the facts)
Why three tables instead of one row-per-PR-event with everything denormalized:
- A bug in the parser (the path from raw_events → derived) shows up as wrong facts. You can re-run the parser against raw_events and regenerate gate_events without touching the AI's history. Recoverable (see the backfill sketch after this list).
- A bad AI prompt (the path from facts → suggestions) shows up as wrong opinions. You can re-run the AI evaluator with a corrected prompt and regenerate ai_suggestions. The facts are unchanged, the events are unchanged. Recoverable.
- A compliance question ("show me what the AI told the TL on PR #4912") returns from ai_suggestions directly. The fact that something was suggested, the fact that it was dismissed, the fact that the TL still merged anyway: all separately queryable.
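What "recoverable" means mechanically is replay. A sketch of the parser backfill, with hypothetical repository interfaces standing in for the real ones:

async def rebuild_gate_events(raw_event_repo, parser, gate_event_repo) -> None:
    # Derived facts are disposable; raw events are not. Wipe and replay.
    await gate_event_repo.delete_all()               # clears gate_events only
    async for event in raw_event_repo.stream_all():  # immutable bytes, arrival order
        facts = parser.parse(event)                  # the corrected parser
        await gate_event_repo.insert_many(facts)
    # ai_suggestions is untouched: the AI's history stays intact for audit.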
If you put it all in one table, your "the AI was wrong here" investigation becomes "we lost the original event because we overwrote it with the corrected derivation". You can't get back to ground truth.
This costs more storage (raw events are big) and a slightly more complex backfill story, but I'd take that trade every time over having to apologize to a customer with "we don't actually know what happened."
What I'd push back on if I were rebuilding from scratch
A 5th match type. I still want regex sometimes. I haven't added it because the day I do, three teams will write regexes, two will be wrong, and one will write a catastrophically slow one that DoSes the rule engine. If I do add it, it'll be safe_regex with a strict timeout and a max-length cap, and I'll fight the urge to allow lookarounds.
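If I ever do ship it, the static validation would look something like the sketch below. Stdlib re has no match timeout, so a real version would also need a runtime guard (the third-party regex package's timeout argument, or matching in a subprocess); this only shows the cheap checks:

import re

MAX_PATTERN_LEN = 128
LOOKAROUND_TOKENS = ("(?=", "(?!", "(?<=", "(?<!")

def compile_safe_regex(pattern: str) -> re.Pattern:
    if len(pattern) > MAX_PATTERN_LEN:
        raise ValueError("pattern exceeds max length")
    if any(tok in pattern for tok in LOOKAROUND_TOKENS):
        raise ValueError("lookarounds are not allowed")
    return re.compile(pattern)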
More AI in the audit suggestions. Current prompts are conservative on purpose — short context, explicit "advisory only" framing, low temperature. There's a part of me that wants to let the AI chase suspicious patterns more aggressively across the codebase. But every time I've drafted that more-ambitious prompt, my mental simulation ends with the same false-positive-kills-trust failure mode that started this post.
Three takeaways
1. An AI in your workflow that can be dismissed at the workflow level (one click, no settings page) survives much longer than an AI that requires reconfiguration to ignore. The dismiss button is the trust-preserving move.
2. Rules + thresholds + checklists do most of the gating work. AI is an excellent annotator for "these specific files in this specific PR are worth a closer look", but it's a bad replacement for "should this merge." Don't promote it past advisory.
3. Separate raw events, derived facts, and AI suggestions in storage from day one. It's cheap to do upfront and impossible to retrofit when you discover your AI was systematically wrong about a class of PRs for two months and you need to figure out where to draw the rollback line.
I'm building this engine inside a larger AI dev harness called Codens: 5 specialized agents that share a credit pool across PRD writing, error auto-fix, test gen, and engineering activity tracking. The risk evaluation in this post is the "Yellow Codens" service. Landing page: https://www.codens.ai/
If you've shipped an AI gate that survived team contact for >6 months, I'd love to hear what kept it alive. Or if you've shipped one that died from a single false positive — even more interested in that.