Grammarly costs $12/mo — a local LLM does it for free (Chrome + Ollama)

I write a lot in the browser — email, GitHub comments, contact forms — and I wanted proofreading without uploading every keystroke to a company's cloud. My workplace bans Grammarly for exactly that reason.

So I built inline-scribe: a Chrome extension that proofreads your text with an AI that runs on your own machine (Ollama). Nothing leaves your computer. And the fixes show up like Word's Track Changes — accept or reject each one individually with ✓ / ✕.

This post covers the two design decisions I found most interesting while building it. Both apply to anyone wiring a local LLM into a product:

The LLM never produces the diff.
Silencing Ollama's 403 with declarativeNetRequest — zero-config, no OLLAMA_ORIGINS.

The missing ingredient was never the AI

If you write in a browser today, you pick one of three bad options:

Grammarly — great UX, but every keystroke goes to their cloud, the good features are behind a subscription, and many workplaces ban it (legal docs, unreleased code, patient data).
Paste into ChatGPT — you get one big rewritten blob back. Which words changed? Did it alter your meaning? You re-read everything, every time, and your text still went to someone else's server.
Nothing — and ship the typos.

The thing is, the AI isn't the hard part anymore. Anyone can run a capable model locally with Ollama in two commands, for free. What's missing is the interface. The reason Grammarly was worth paying for was never the grammar engine — it was the friendly diff that lets you see and control each change.

That interface, on top of a model you own, is the whole product.

tool	corrections	your text goes to	inline diff, per-fix accept/reject	price
Grammarly	cloud AI	their servers	✅ (the reason people pay)	$12+/mo
Harper (10k★)	local, rule-based	nowhere ✅	❌ underlines typos only — can't rewrite a clumsy sentence	free
scramble / Typollama	local LLM ✅	nowhere ✅	❌ whole-text replacement or popup	free
inline-scribe	local LLM ✅	nowhere ✅	✅	free

Design decision #1: the LLM never produces the diff

This is the one I most want to share.

The intuitive move is to ask the model for structured output: "return the changes as JSON," something like [{ "delete": "...", "insert": "..." }, ...], and pipe it straight into the UI.

But small local models break when you do this. A model like llama3.2 (3B) is surprisingly good at fixing prose and terrible at structured output: it breaks the JSON, adds explanations, wraps everything in a code fence, renames your keys. A chatty 3B model means a broken UI.

So I split the responsibilities:

The model's job: return corrected prose — just text.
The extension's job: compute the changes (hunks) from (original, corrected) with a deterministic algorithm.

you press Alt+G in a text field
   │
   ▼
the extension sends your text to YOUR endpoint     ← default: Ollama on 127.0.0.1
(an OpenAI-compatible /chat/completions API)          model: llama3.2 (~2GB, free)
   │
   ▼
the model returns corrected prose — just text
   │
   ▼
inline-scribe computes a word-level diff           ← deterministic algorithm,
between your text and the correction                  NOT the LLM's opinion
   │
   ▼
review panel: accept ✓ / reject ✕ each change → Apply writes back only what you approved
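In case the request format is unfamiliar: an OpenAI-compatible /chat/completions call takes plain JSON in and returns plain JSON out. The extension only needs the text content of the first choice. Here's a minimal sketch, assuming Ollama's OpenAI-compatible path `/v1/chat/completions` on its default port 11434. The helper names `buildCheckRequest` / `extractCorrection` and the system prompt wording are hypothetical, not inline-scribe's actual code.

```typescript
// Hypothetical helpers: the names and the prompt text are illustrative only.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
  temperature: number;
  stream: boolean;
}

// Assumed prompt; the real extension's wording may differ.
const SYSTEM_PROMPT =
  "You are a proofreader. Fix spelling, grammar, and clumsy phrasing. " +
  "Reply with the corrected text only, no explanations.";

// Build the JSON body for POST {endpoint}/chat/completions.
function buildCheckRequest(text: string, model: string): ChatRequest {
  return {
    model,
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      { role: "user", content: text },
    ],
    temperature: 0, // keep corrections as repeatable as the model allows
    stream: false, // one JSON reply instead of server-sent events
  };
}

// Pull the corrected prose out of an OpenAI-style response.
// Only the plain-text content is used; the model is never asked for structure.
function extractCorrection(json: unknown): string {
  const content = (json as { choices?: { message?: { content?: unknown } }[] })
    ?.choices?.[0]?.message?.content;
  if (typeof content !== "string") {
    throw new Error("Unexpected /chat/completions response shape");
  }
  return content;
}

// Usage (in the service worker):
// const res = await fetch("http://127.0.0.1:11434/v1/chat/completions", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(buildCheckRequest(text, "llama3.2")),
// });
// const corrected = extractCorrection(await res.json());
```

Nothing here asks the model for JSON. The only structure is the transport envelope, and the API server produces that, not the model.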

The diff tokenizes into words + whitespace + punctuation runs, then does an LCS (longest common subsequence) walk:

// Tokenize into words/whitespace/punctuation, preserving everything
export function tokenize(text: string): string[] {
  return text.match(/\s+|[^\s\w]+|\w+/gu) ?? [];
}

export function diffText(original: string, corrected: string): Hunk[] {
  const a = tokenize(original);
  const b = tokenize(corrected);
  // DP table of LCS lengths over a × b (Uint32Array rows)
  // Walk the table emitting equal / delete / insert, merging adjacent ops.
  // Collapse delete+insert neighbours into one `replace` so a phrase rewrite
  // reads as a single reviewable hunk instead of three.
  ...
}
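The body of `diffText` is elided above. For readers who want to see the walk end to end, here is a minimal, self-contained sketch of the same idea. It is my own illustration, not the repo's code, and it rests on a few assumptions:

- A `Hunk` is `{ kind, original, corrected }`, with kinds `equal` / `delete` / `insert` / `replace`.
- `tokenize` is repeated so the block runs standalone.

The sketch fills a suffix-LCS table, then walks forward from the start. Every run of deletions and insertions between two equal spans is folded into a single hunk.

```typescript
type HunkKind = "equal" | "delete" | "insert" | "replace";

// Assumed hunk shape: the text on each side of one reviewable change.
interface Hunk {
  kind: HunkKind;
  original: string;
  corrected: string;
}

// Same tokenizer as above: whitespace, punctuation, and word runs cover every character.
function tokenize(text: string): string[] {
  return text.match(/\s+|[^\s\w]+|\w+/gu) ?? [];
}

function diffText(original: string, corrected: string): Hunk[] {
  const a = tokenize(original);
  const b = tokenize(corrected);
  const n = a.length;
  const m = b.length;

  // lcs[i][j] = length of the LCS of the suffixes a[i..] and b[j..].
  const lcs = Array.from({ length: n + 1 }, () => new Uint32Array(m + 1));
  for (let i = n - 1; i >= 0; i--) {
    for (let j = m - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  const hunks: Hunk[] = [];
  // Equal tokens extend a preceding equal hunk; deleted/inserted tokens extend a
  // preceding change hunk, so each run of edits becomes one reviewable hunk.
  const push = (equal: boolean, orig: string, corr: string): void => {
    const last = hunks[hunks.length - 1];
    if (last !== undefined && (last.kind === "equal") === equal) {
      last.original += orig;
      last.corrected += corr;
    } else {
      hunks.push({ kind: equal ? "equal" : "replace", original: orig, corrected: corr });
    }
  };

  // Walk forward: take matching tokens together, otherwise step in the
  // direction that preserves the remaining LCS length.
  let i = 0;
  let j = 0;
  while (i < n || j < m) {
    if (i < n && j < m && a[i] === b[j]) {
      push(true, a[i], b[j]);
      i++;
      j++;
    } else if (i < n && (j === m || lcs[i + 1][j] >= lcs[i][j + 1])) {
      push(false, a[i], "");
      i++;
    } else {
      push(false, "", b[j]);
      j++;
    }
  }

  // Label each change hunk by which side is empty.
  for (const h of hunks) {
    if (h.kind !== "equal") {
      h.kind = h.original === "" ? "insert" : h.corrected === "" ? "delete" : "replace";
    }
  }
  return hunks;
}
```

Every token lands in exactly one hunk. So joining all `original` fields gives back the input, and joining all `corrected` fields gives the model's text. That invariant is what makes accept/reject a plain concatenation, and it's a cheap property to unit-test. The table is O(n·m) in tokens. That's fine for a paragraph or an email, but worth capping for very long fields.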

This split has a lot of happy side effects:

Model-agnostic. Any OpenAI-compatible endpoint works (llama.cpp, LM Studio, vLLM, or a hosted API with your own key). Since nothing depends on structured-output quality, the UI behaves the same whether you run 3B or 70B.
Deterministic, so reproducible. Same input/output → same hunks. Easy to unit-test.
Accept/reject is trivial. A hunk is { kind, original, corrected }. Accepted hunks take corrected, rejected take original, concatenate — done.

export function applyDecisions(hunks: Hunk[], accepted: boolean[]): string {
  let result = '';
  hunks.forEach((h, idx) => {
    if (h.kind === 'equal') result += h.original;
    else result += accepted[idx] ? h.corrected : h.original;
  });
  return result;
}

Even with "return only text," small models still wrap output in fences or quotes. I gave up on prompting that away and instead strip the obvious wrappers in post — without touching real content:

export function stripWrapping(reply: string, original: string): string {
  let out = reply.replace(/\r\n/g, '\n');
  const fence = out.match(/^\s*```[a-z]*\n([\s\S]*?)\n```\s*$/);
  if (fence) out = fence[1];                       // strip ```...```
  out = out.replace(/^\s+|\s+$/g, '');
  if (/^".*"$/s.test(out) && !/^"/.test(original.trim())) out = out.slice(1, -1); // strip whole-reply quotes
  if (original.endsWith('\n') && !out.endsWith('\n')) out += '\n'; // preserve trailing-newline convention
  return out;
}

Takeaway: let small local models do what they're good at (return natural language) and keep the structure — diffs, JSON, state — in deterministic code. This isn't specific to proofreading; it's a general principle for putting a local LLM into a product.

Design decision #2: silencing Ollama's 403 with declarativeNetRequest

This is the pothole every Chrome-extension × Ollama project hits.

Stock Ollama checks each request's Origin header against an allowlist, and chrome-extension://... origins aren't on it by default, so the request is rejected with a 403. The official workaround is to have the user add the extension's origin through the OLLAMA_ORIGINS env var. But asking for that means:

exporting OLLAMA_ORIGINS=chrome-extension://<id-that-changes>
different steps depending on shell config and how Ollama is launched
the single biggest "I installed it and it doesn't work" drop-off point

In other words, the moment your README documents an env var, you've lost. It should just work with a stock ollama serve.

The fix: use MV3's declarativeNetRequest (DNR) to strip the Origin header from requests to the configured endpoint with a dynamic rule. No Origin, no 403.

async function syncOriginRule(): Promise<void> {
  const stored = await chrome.storage.sync.get('config');
  const endpoint = stored.config?.endpoint ?? DEFAULT_CONFIG.endpoint;
  const host = new URL(endpoint).host; // e.g. 127.0.0.1:11434

  await chrome.declarativeNetRequest.updateDynamicRules({
    removeRuleIds: [1],
    addRules: [{
      id: 1,
      priority: 1,
      condition: { urlFilter: `||${host}/`, resourceTypes: ['xmlhttprequest'] },
      action: {
        type: 'modifyHeaders',
        requestHeaders: [{ header: 'origin', operation: 'remove' }],
      },
    }],
  });
}

Key points:

Scope the rule to the user's configured host only (urlFilter). It's not a dangerous "strip Origin everywhere" rule.
The endpoint is configurable, so watch chrome.storage.onChanged and re-apply the rule whenever config changes.
The only permissions this trick adds are declarativeNetRequest plus host_permissions for localhost.
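The second point takes only a few lines of wiring. Here is a minimal sketch, assuming the config is stored under the `config` key of `chrome.storage.sync` with an `endpoint` field, the same way `syncOriginRule` reads it. The helpers `endpointHost` and `originRuleNeedsSync` are hypothetical names. They avoid `chrome.*` calls so they can be tested in plain Node, and the listener registration is shown as comments. Catching the `new URL()` error also stops a half-typed endpoint in the options page from throwing inside the listener.

```typescript
// Minimal stand-in for chrome.storage.StorageChange, so this runs outside Chrome.
interface StorageChange {
  oldValue?: unknown;
  newValue?: unknown;
}

// Host (hostname:port) of a stored config's endpoint, or null if missing/invalid.
// new URL() throws on malformed input, so the error is caught here.
function endpointHost(config: unknown): string | null {
  const endpoint = (config as { endpoint?: unknown } | undefined)?.endpoint;
  if (typeof endpoint !== "string") return null;
  try {
    return new URL(endpoint).host;
  } catch {
    return null;
  }
}

// True when a storage change could alter which host the Origin-strip rule targets.
function originRuleNeedsSync(
  changes: Record<string, StorageChange>,
  areaName: string,
): boolean {
  if (areaName !== "sync" || !("config" in changes)) return false;
  const { oldValue, newValue } = changes["config"];
  return endpointHost(oldValue) !== endpointHost(newValue);
}

// Wiring in the service worker (sketch):
// chrome.runtime.onInstalled.addListener(() => void syncOriginRule());
// chrome.storage.onChanged.addListener((changes, areaName) => {
//   if (originRuleNeedsSync(changes, areaName)) void syncOriginRule();
// });
```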

// manifest.json (excerpt)
"permissions": ["storage", "activeTab", "declarativeNetRequest", "contextMenus"],
"host_permissions": ["http://127.0.0.1/*", "http://localhost/*"],

The result: the user's steps are "install Ollama, ollama serve, install the extension." Zero env vars.

MV3 architecture: do the fetch in the service worker

One more thing. The actual fetch (the request to 127.0.0.1) happens in the service worker, not the content script:

// content → background message; background runs the check and replies
chrome.runtime.onMessage.addListener((msg, _sender, sendResponse) => {
  if (msg?.type !== 'inline-scribe:check') return undefined;
  (async () => {
    const config = { ...DEFAULT_CONFIG, ...(await chrome.storage.sync.get('config')).config };
    try {
      const corrected = await new OllamaChecker(config).check(msg.text);
      sendResponse({ ok: true, corrected, model: config.model });
    } catch (err) {
      // e.g. a CheckerError's message; fall back to String() for non-Error throws
      sendResponse({ ok: false, error: err instanceof Error ? err.message : String(err) });
    }
  })();
  return true; // keep the channel open for the async response
});

Two reasons:

It isn't bound by the page's CSP. A fetch from a content script gets blocked when the page's Content-Security-Policy restricts connect-src. The service worker runs in the extension's context and is unaffected.
It plays well with the DNR Origin strip. The request now originates from the extension's service worker as an xmlhttprequest, so the rule above applies cleanly.
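On the other end of that channel, the content script just sends the text and unwraps the reply. A minimal sketch follows. The `CheckResponse` union mirrors the objects the listener above sends. `requestCheck` and its injected `send` parameter are hypothetical, added so the logic runs outside Chrome. In the extension you'd pass a wrapper around `chrome.runtime.sendMessage`, which returns a Promise in MV3 when called without a callback.

```typescript
// Message and reply shapes, mirroring the service-worker listener above.
interface CheckRequest {
  type: "inline-scribe:check";
  text: string;
}

type CheckResponse =
  | { ok: true; corrected: string; model: string }
  | { ok: false; error: string };

// `send` stands in for chrome.runtime.sendMessage (Promise-returning in MV3).
type SendMessage = (msg: CheckRequest) => Promise<unknown>;

// Ask the service worker to proofread `text`; resolves to the corrected prose.
async function requestCheck(text: string, send: SendMessage): Promise<string> {
  const reply = (await send({ type: "inline-scribe:check", text })) as CheckResponse | undefined;
  if (reply === undefined) {
    // Defensive: treat a missing reply as an error rather than an empty correction.
    throw new Error("No response from the background service worker");
  }
  if (reply.ok === false) throw new Error(reply.error);
  return reply.corrected;
}

// In the content script:
// const corrected = await requestCheck(field.value, (m) => chrome.runtime.sendMessage(m));
```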

The UI (review panel, the ✎ selection icon, in-place replacement) is rendered in a shadow DOM from the content script so it doesn't collide with the page's CSS.

Wrapping up

inline-scribe is, at its core, "Grammarly's diff UX on top of a local LLM." The design decisions that made it work:

Don't let the LLM build the diff — use a deterministic algorithm. The UI never breaks on small models, it's model-agnostic, and it's easy to test.
Strip the Origin with DNR. No OLLAMA_ORIGINS, zero config.
Fetch in the service worker. Not bound by page CSP.

Swap the system prompt and the same diff UI becomes a translator or a tone-shifter. The source is MIT-licensed.

Repo: https://github.com/mk668a/inline-scribe

If you're putting a local LLM into a product, the leverage is in deciding what the model does — and what it doesn't.