Parsing Claude Code's JSONL: patterns for a schema that keeps moving

Every conversation you have with Claude Code is written to disk as JSONL, under ~/.claude/projects/. Your decisions, your dead ends, the bug hunt that took three sessions: it is all there. You have probably never opened one.

The catch: the format is an internal implementation detail. No documentation, no version field, no stability guarantee. The schema can change with any CLI update, and updates ship, at the current pace, almost daily.

The patterns below come from building a read-only replay and search tool on top of these files, and from keeping it alive through a dozen CLI releases. Each pattern survived contact with real data. Four were learned the hard way, from bugs worth retelling.

The ground rule: you are an archaeologist, not a validator

A parser for someone else's internal format has a different job than a parser for your own API. Rejecting malformed input is not an option: the input is the historical record, and whatever is on disk is all there will ever be.

The contract that follows from this:

A bad line never kills the file. Skip it, count it, move on.
An unknown shape is preserved, not dropped. You can re-parse later; you cannot un-drop data.
Normalize at the boundary, once. Downstream code (search index, UI, export) should never see the mess.

Everything below is one of these rules meeting reality.

Pattern 1: tolerant line parsing with an explicit whitelist

The naive loop (JSON.parse each line, switch on type) works on day one. The question is what happens when a CLI update introduces a type nobody has seen before. This is not hypothetical; a real batch of such schema changes appears at the end of this post.

The approach that holds up: keep an explicit whitelist of known types, and treat everything outside it as "parse failed, but preserved":

const KNOWN_MESSAGE_TYPES = new Set([
  'user', 'assistant', 'system',
  'queue-operation', 'last-prompt',
  'progress', 'attachment', 'file-history-snapshot',
  'permission-mode', 'custom-title', 'ai-title',
  'agent-name', 'pr-link',
])

function parseLine(line: string): ParsedLine | null {
  if (!line.trim()) return null

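  let parsed: unknown
  try {
    parsed = JSON.parse(line)
  } catch {
    return null // malformed line: skip, never throw
  }
  // valid JSON that is not an object (null, a number, an array) would throw on obj.type below
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) return null
  const obj = parsed as Record<string, unknown>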

  const type = typeof obj.type === 'string' ? obj.type : 'unknown'
  const parseFailed = !KNOWN_MESSAGE_TYPES.has(type)
  // unknown type → keep the raw JSON string for later re-parse
  // known type → extracted fields are enough, raw can be dropped
  ...
}

The whitelist does double duty as a storage policy. For known types, the extracted columns are sufficient and the raw JSON can be discarded; that alone reclaims most of the disk space. For unknown types, the raw line goes into an archive table untouched. When a future version of the parser learns the new shape, the evidence is still there.
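The contract says "skip it, count it", and a parser that returns null for both blank and malformed lines leaves the caller unable to count. One way to sketch the file-level loop is to classify each line first. Everything named here (LineOutcome, classifyLine, ingestFile, the shortened KNOWN_TYPES set, the in-memory archive array standing in for the archive table) is hypothetical and not taken from the tool's source:

```typescript
// Hypothetical sketch: names and shapes are illustrative assumptions.
type LineOutcome =
  | { kind: 'blank' }
  | { kind: 'malformed' }
  | { kind: 'known'; type: string; obj: Record<string, unknown> }
  | { kind: 'unknown'; type: string; raw: string }

// Shortened stand-in for the full whitelist.
const KNOWN_TYPES = new Set(['user', 'assistant', 'system'])

function classifyLine(line: string): LineOutcome {
  if (!line.trim()) return { kind: 'blank' }
  let parsed: unknown
  try {
    parsed = JSON.parse(line)
  } catch {
    return { kind: 'malformed' }
  }
  // Valid JSON that is not an object is treated as malformed too.
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    return { kind: 'malformed' }
  }
  const obj = parsed as Record<string, unknown>
  const type = typeof obj.type === 'string' ? obj.type : 'unknown'
  return KNOWN_TYPES.has(type)
    ? { kind: 'known', type, obj }
    : { kind: 'unknown', type, raw: line } // keep the raw line for a later re-parse
}

interface IngestResult {
  known: Array<{ type: string; obj: Record<string, unknown> }>
  archive: string[] // stand-in for the archive table
  skipped: number   // malformed lines, counted rather than thrown
}

function ingestFile(text: string): IngestResult {
  const result: IngestResult = { known: [], archive: [], skipped: 0 }
  for (const line of text.split('\n')) {
    const outcome = classifyLine(line)
    switch (outcome.kind) {
      case 'blank':
        break
      case 'malformed':
        result.skipped++
        break
      case 'known':
        result.known.push({ type: outcome.type, obj: outcome.obj })
        break
      case 'unknown':
        result.archive.push(outcome.raw)
        break
    }
  }
  return result
}
```

A skipped count per file is also a cheap health signal: if it jumps after a CLI update, something other than the schema may have changed.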

One more detail that pays off: cap the length of identifier fields (uuid, requestId) at something sane like 128 chars before trusting them. Parsing files you do not control calls for a little paranoia at the boundary, and it is cheap.
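A minimal sketch of that check follows. The text says "cap", which could mean truncate or reject; this version rejects, on the assumption that a truncated identifier could collide with a legitimate one. The helper name safeId and the constant MAX_ID_LENGTH are hypothetical:

```typescript
// Hypothetical helper: accepts an identifier only if it is a non-empty
// string no longer than the cap; anything else becomes null.
const MAX_ID_LENGTH = 128

function safeId(value: unknown, maxLength: number = MAX_ID_LENGTH): string | null {
  if (typeof value !== 'string') return null
  if (value.length === 0 || value.length > maxLength) return null
  return value
}

// Usage on a parsed line of unknown shape:
const entry: Record<string, unknown> = JSON.parse('{"uuid":"abc-123","requestId":42}')
const uuid = safeId(entry.uuid)           // "abc-123"
const requestId = safeId(entry.requestId) // null, because it is not a string
```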

Pattern 2: version your derived data, not just your schema

Preserving unknowns only matters if you can act on them later. The mechanism is a SUMMARY_VERSION integer stored per session. When the parser learns new tricks, bump the version; the indexer sees stale versions and re-parses those sessions automatically on the next sync.

This turns "the schema changed again" from a migration crisis into a routine: extend the parser, bump the version, let the backfill run. No manual steps, no data loss, no "please delete your index and start over" release notes.
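The backfill routine can be sketched in a few lines. Only the SUMMARY_VERSION idea comes from the text; the value 3, the SessionRecord shape, and the function names are assumptions, and an in-memory array stands in for whatever table holds per-session state:

```typescript
// Hypothetical sketch of version-driven re-parsing.
const SUMMARY_VERSION = 3 // assumed value; bumped whenever the parser learns a new shape

interface SessionRecord {
  sessionId: string
  summaryVersion: number // parser version that produced the stored data
}

function findStaleSessions(
  sessions: readonly SessionRecord[],
  current: number = SUMMARY_VERSION,
): SessionRecord[] {
  return sessions.filter((s) => s.summaryVersion < current)
}

// Re-parses every stale session and returns how many were refreshed.
function backfill(
  sessions: readonly SessionRecord[],
  reparse: (sessionId: string) => void,
  current: number = SUMMARY_VERSION,
): number {
  let refreshed = 0
  for (const session of findStaleSessions(sessions, current)) {
    try {
      reparse(session.sessionId)
    } catch {
      continue // leave the old version stamp so the next sync retries
    }
    session.summaryVersion = current // stamp only after a successful re-parse
    refreshed++
  }
  return refreshed
}
```

Stamping the version after the re-parse, not before, means a crash mid-backfill costs nothing: the session still looks stale on the next sync.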

War story 1: the lone surrogates

One day the indexer started producing strings that crashed downstream consumers. The cause: some JSONL lines contained unpaired UTF-16 surrogates. Half an emoji, lurking in a tool-error message.

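How does half an emoji end up on disk? Older Claude Code versions (up to around 2.1.132, judging by the archived sessions) apparently truncated long tool outputs at a fixed length counted in UTF-16 code units, which is what slicing a JavaScript string does, and the cut sometimes landed between the two halves of an emoji's surrogate pair. (A cut by raw bytes would have produced invalid UTF-8, not a lone surrogate.) JSON.stringify happily writes the lone surrogate as a \udXXX escape, the file stays valid UTF-8 because the escape itself is plain ASCII, and JSON.parse faithfully reconstructs the broken string at read time. The corruption stays invisible until something refuses it: SQLite, an IPC bridge, a TextEncoder.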

The fix is one line, if it lands in the right place:

// at the parser's exit boundary, applied to every extracted string
export function ensureWellFormed(s: string): string {
  return s.toWellFormed() // lone surrogates → U+FFFD
}

The placement is the actual lesson. Normalize once, at ingestion, and every consumer downstream (search index, renderer, Markdown export) gets to assume well-formed strings forever. Unicode normalization (NFC/NFD) deliberately stays out of this step: it would change user-visible text, which an archival tool has no business doing. Fix what is broken, touch nothing else.

(String.prototype.toWellFormed() needs Node.js 20+. Before that, the surrogate scan has to be written by hand.)
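For runtimes without toWellFormed(), a hand-written scan might look like the sketch below. The function name toWellFormedFallback is hypothetical; the intent is to match the built-in's behavior of replacing each lone surrogate with U+FFFD and leaving valid pairs alone:

```typescript
// Hypothetical fallback: replaces every unpaired surrogate with U+FFFD.
function toWellFormedFallback(s: string): string {
  let out = ''
  for (let i = 0; i < s.length; i++) {
    const code = s.charCodeAt(i)
    if (code >= 0xd800 && code <= 0xdbff) {
      // High surrogate: valid only when a low surrogate follows immediately.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += s.charAt(i) + s.charAt(i + 1)
        i++ // consume the low half of the pair
      } else {
        out += '\ufffd'
      }
    } else if (code >= 0xdc00 && code <= 0xdfff) {
      // Low surrogate that was not consumed as part of a pair.
      out += '\ufffd'
    } else {
      out += s.charAt(i)
    }
  }
  return out
}

// JSON.parse reconstructs a lone surrogate from its escape; the scan repairs it.
const broken: string = JSON.parse('"tool error \\ud83d"')
const repaired = toWellFormedFallback(broken)
```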

War story 2: the tokens that counted themselves twice

The tool's token dashboard once reported usage numbers roughly 2.3× higher than reality, measured across a few hundred real sessions. The cause is a JSONL quirk worth knowing even if you never touch tokens:

One API response can become several JSONL lines. When a response contains multiple content blocks (text plus tool calls, for instance), Claude Code writes one assistant entry per block, and each entry carries a copy of the same usage object. Sum them naively and every multi-block turn is counted once per block.

The entries share a requestId, which is the dedup key. But there is a trap inside the trap: it is tempting to merge the entries into one logical message. Don't: entries of different requests can interleave on disk (streaming order), and merging would scramble the conversation. The entries themselves are fine; only the usage is duplicated.

So: keep every entry, and null out the usage on all but the last entry per requestId:

function deduplicateTokensByRequestId(lines: ParsedLine[]): ParsedLine[] {
  const lastIndex = new Map<string, number>()
  lines.forEach((line, i) => {
    if (line.role === 'assistant' && line.requestId) {
      lastIndex.set(line.requestId, i)
    }
  })
  return lines.map((line, i) =>
    line.role === 'assistant' && line.requestId && lastIndex.get(line.requestId) !== i
      ? { ...line, inputTokens: null, outputTokens: null,
          cacheReadTokens: null, cacheCreationTokens: null }
      : line,
  )
}

The general lesson: one JSONL line is not one logical event. Never assume a 1:1 mapping between physical lines and semantic units in a format you do not own.
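To make the arithmetic concrete, here is a small self-contained sketch that sums output tokens both ways. The UsageEntry shape, the request IDs, and the token numbers are all invented for illustration:

```typescript
// Hypothetical minimal shape: just enough to show the double counting.
interface UsageEntry {
  requestId?: string
  outputTokens: number | null
}

function naiveTotal(entries: UsageEntry[]): number {
  return entries.reduce((sum, e) => sum + (e.outputTokens ?? 0), 0)
}

// Counts usage only on the last entry of each requestId.
// Entries without a requestId are counted as-is.
function dedupedTotal(entries: UsageEntry[]): number {
  const lastIndex = new Map<string, number>()
  entries.forEach((e, i) => {
    if (e.requestId) lastIndex.set(e.requestId, i)
  })
  return entries.reduce((sum, e, i) => {
    if (e.requestId && lastIndex.get(e.requestId) !== i) return sum
    return sum + (e.outputTokens ?? 0)
  }, 0)
}

// One response with three content blocks (req_A), interleaved with a
// single-block response (req_B). Each req_A entry repeats the same usage.
const entries: UsageEntry[] = [
  { requestId: 'req_A', outputTokens: 120 },
  { requestId: 'req_B', outputTokens: 40 },
  { requestId: 'req_A', outputTokens: 120 },
  { requestId: 'req_A', outputTokens: 120 },
]

const naive = naiveTotal(entries)     // 120 * 3 + 40 = 400
const correct = dedupedTotal(entries) // 120 + 40 = 160
```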

War story 3: resumed sessions replay the past

When a Claude Code session is resumed, the new JSONL file starts with copies of messages from the original session: same uuid, same content, written again. Index both files naively and every resumed conversation shows up with duplicated history.

The remedy is UUID-level dedup against what is already indexed. The trap hiding inside that fix: the dedup query must exclude the session currently being indexed. Otherwise, re-indexing an existing session matches its own previously-indexed messages, concludes that every line is a duplicate, and quietly drops the entire session. A dedup check that can self-match is a data-loss machine with good intentions.
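The self-match guard is easiest to see in code. This is a sketch over an in-memory array instead of a real database; the Row shape and both function names are hypothetical, and it assumes sessions are indexed oldest first, so that a replayed message is already owned by its original session:

```typescript
// Hypothetical in-memory stand-in for the message index.
type Row = { uuid: string; sessionId: string }

// A message is a replay only if some OTHER session already owns its uuid.
function isReplay(index: readonly Row[], uuid: string, currentSessionId: string): boolean {
  return index.some((r) => r.uuid === uuid && r.sessionId !== currentSessionId)
}

// Returns a new index in which the given session's rows are replaced.
function indexSession(index: readonly Row[], sessionId: string, uuids: string[]): Row[] {
  const fresh = uuids
    .filter((uuid) => !isReplay(index, uuid, sessionId))
    .map((uuid) => ({ uuid, sessionId }))
  const others = index.filter((r) => r.sessionId !== sessionId)
  return [...others, ...fresh]
}

let index: Row[] = []
index = indexSession(index, 's1', ['a', 'b'])      // original session
index = indexSession(index, 's2', ['a', 'b', 'c']) // resumed: a and b are replays
index = indexSession(index, 's1', ['a', 'b'])      // re-index s1: must keep both rows
```

Drop the `r.sessionId !== currentSessionId` condition and the third call empties s1, which is exactly the silent data loss described above.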

War story 4: screenshots will eat your index

Claude Code conversations can contain images: screenshots pasted into the prompt, arriving as content blocks with base64 data inline. Store message content verbatim and a handful of screenshots will outweigh thousands of text messages in the database.

The pattern: strip the payload, keep the shape.

if (block.type === 'image' && block.source?.type === 'base64') {
  return { ...block, source: { ...block.source, data: '[base64-stripped]' } }
}

The block structure survives, so the UI can still render an "image was here" placeholder at the right position, and a has_image flag stays queryable. Only the megabytes are gone. Same archaeology principle as everywhere else: preserve the evidence of structure, not necessarily every byte of payload.
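Applied to a whole message, the same idea might look like this sketch. The ContentBlock type is a loose assumption about the block shape, and stripImages with its hasImage result is a hypothetical helper, not the tool's actual function:

```typescript
// Loose, assumed shape for a content block; real entries carry more fields.
type ContentBlock = {
  type: string
  source?: { type?: string; data?: string; [key: string]: unknown }
  [key: string]: unknown
}

// Replaces inline base64 payloads with a marker and reports whether any were found.
function stripImages(blocks: ContentBlock[]): { blocks: ContentBlock[]; hasImage: boolean } {
  let hasImage = false
  const stripped = blocks.map((block): ContentBlock => {
    if (block.type === 'image' && block.source?.type === 'base64') {
      hasImage = true
      return { ...block, source: { ...block.source, data: '[base64-stripped]' } }
    }
    return block // non-image blocks pass through untouched
  })
  return { blocks: stripped, hasImage }
}

const message: ContentBlock[] = [
  { type: 'text', text: 'Why is this button misaligned?' },
  { type: 'image', source: { type: 'base64', media_type: 'image/png', data: 'iVBORw0KGgo...' } },
]
const result = stripImages(message)
```

Block order and the other source fields survive, so a renderer can still place the placeholder where the screenshot was.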

The schema will move again

In case "the schema keeps moving" sounds abstract, here is what diffing real session files before and after one CLI release (v2.1.168, June 2026) turned up: top-level attribution fields on assistant entries (which skill, plugin, or MCP server produced a reply), image content blocks, structured system subtypes carrying API error status codes, and an edited_text_file attachment type. Four schema extensions, zero announcements. A normal month.

With the patterns above, absorbing that release was: extend the whitelist, extract the new fields, bump SUMMARY_VERSION, ship. The sessions written before the parser update backfilled themselves on the next sync. Nothing was lost in the weeks where the parser did not yet understand the new shapes: the unknown parts were sitting in the archive table, waiting.

Takeaways

Skip bad lines, never throw. The file is the historical record; the parser's opinion of it is irrelevant.
Whitelist known shapes; archive unknown ones raw. Storage policy and forward compatibility in one mechanism.
Version your derived data. Re-parsing should be a routine background event, not a migration.
Normalize at the ingestion boundary, exactly once, and only what is actually broken (toWellFormed, yes; NFC, no).
Distrust the line/event mapping. Duplicated usage across entries, replayed messages across files: physical lines lie.

To see these patterns in production context, the tool is open source: ccRewind on GitHub, a read-only, offline replay and search tool for Claude Code history. It never writes a byte to ~/.claude/. Why that constraint exists, and what it cost, is the next post.

Disclosure: this post was drafted with Claude and edited by the human who debugged every story in it. The drafting sessions are, naturally, JSONL files under ~/.claude/projects/ now.