Originally published on andrew.ooo — visit the original for any updates, code snippets that aged out, or follow-up posts.
TL;DR
LiteParse v2 is LlamaIndex's June 2026 rewrite of their open-source document parser — the same spatial-layout extraction core that powers their LlamaParse cloud product, but now written entirely in Rust and shipped as native packages for Python, Node.js/TypeScript, and the browser via WASM. It's positioned as the answer to a real, specific problem: agents need to read PDFs fast during a reasoning loop, and existing tools either choke on layout (pypdf, pdfplumber) or block on a VLM call (Docling, LlamaParse cloud).
Key facts:
- Rust core built on top of the PDFium C library for native text extraction
- Multi-language bindings: Rust crate, Python (pip install liteparse), Node.js (@llamaindex/liteparse), browser WASM
- One CLI (lit) that ships with every package, with the same flags whether you installed via cargo, npm, or pip
- Formats: PDF natively; DOCX/XLSX/PPTX via LibreOffice; PNG/JPG/TIFF via ImageMagick
- Selective OCR: bundled Tesseract.js, or plug in PaddleOCR/EasyOCR HTTP servers for higher accuracy
- Spatial output: text + bounding boxes, layout-preserved plain text, or rendered PNG page screenshots for multimodal LLM follow-up
- 8,557 GitHub stars, ~3,006 stars this week — currently trending on the GitHub Rust board
- Apache-2.0 license, zero cloud calls, zero Python dependencies at the core
- The README claims up to 100x faster than the v1 Python implementation
If you've been writing one-off pypdf + regex hacks every time an agent needs to skim a PDF, LiteParse is the first OSS parser that's clearly designed for agents first and humans second.
Quick Reference
# Install (pick one — they all give you the same `lit` CLI)
npm i -g @llamaindex/liteparse
pip install liteparse
cargo install liteparse
# Parse a PDF, write layout-preserved text
lit parse report.pdf -o report.txt
# Structured JSON with bounding boxes
lit parse report.pdf --format json -o report.json
# Just pages 1–5 and page 10
lit parse report.pdf --target-pages "1-5,10"
# Generate page screenshots for visual reasoning
lit screenshot report.pdf -o ./screens --pages "1-3"
Why LiteParse Exists
The pitch from LlamaIndex's launch post is unusually specific. They've spent years building LlamaParse into a production document-intelligence cloud service, and along the way noticed that most of the time, agents don't need the heavyweight VLM pipeline. They need text — quickly — to decide their next move.
The current landscape forces a bad choice:
- Fast but inaccurate: pypdf, pdfminer, and MarkItDown will hand you a string of text in milliseconds but mangle tables, lose column boundaries, and silently skip scanned pages with no OCR fallback.
- Accurate but slow: Docling, MarkItDown-pro, and LlamaParse cloud all run a vision model. Quality is great, but a 50-page PDF can take 30–120 seconds, which is forever inside an agent loop.
- Local but ugly: Tesseract on its own gives you OCR but no spatial reconstruction, no screenshot fallback, and no agent-friendly API.
LiteParse picks a deliberate middle: native text extraction from PDFium with grid-based spatial projection (so columns and tables survive), selective OCR for scanned pages, and a screenshot command for the moments when the agent decides it wants to look at the page itself. The whole thing is a CLI subprocess away from any agent framework, no API key, no network.
"LiteParse is for coding agents and real-time pipelines where speed, simplicity, and local execution matter. It's the core processing from LlamaParse, open-sourced." — Logan Markewich, LlamaIndex
Installation
The smartest design choice is that the CLI is identical across runtimes. Pick whatever your project already has:
Node.js / TypeScript:
npm i -g @llamaindex/liteparse
# or as a library
npm i @llamaindex/liteparse
Python:
pip install liteparse
Rust:
# CLI
cargo install liteparse
# As a library
cargo add liteparse
Browser (WASM):
npm i @llamaindex/liteparse-wasm
The first run will download the PDFium native library (≈10 MB) and the Tesseract.js OCR data files (~22 MB for English; more per added language). After that everything is offline.
A nice detail: the Python package ships as a thin wrapper over the same Rust binary, so Python, Node, Rust, and the CLI all hit the same code path. Same trick ruff and uv use.
Real Code Examples
Node.js / TypeScript
import { LiteParse } from "@llamaindex/liteparse";
const parser = new LiteParse({ ocrEnabled: true });
const result = await parser.parse("./contract.pdf");
console.log(result.text); // layout-preserved plain text
console.log(result.pages[0].bbox); // bounding boxes per page
// Render specific pages to PNG for VLM follow-up
const screenshots = await parser.screenshot("./contract.pdf", {
  pages: [1, 2, 3],
  outputDir: "./screens",
});
Python
from liteparse import LiteParse
parser = LiteParse(ocr_enabled=True)
result = parser.parse("contract.pdf")
# Plain text, layout preserved
print(result.text)
# Per-page structured data with bounding boxes
for page in result.pages:
    for block in page.blocks:
        print(block.text, block.bbox)
Inside an Agent Loop
This is the pattern LlamaIndex actually built it for — text-first, screenshot-fallback. Pseudocode for any agent framework (LangGraph, LlamaIndex agents, OpenClaw, raw OpenAI tool-calling):
from liteparse import LiteParse

# model_can_answer_from() and vlm_describe() are placeholders for your own LLM/VLM calls.
def read_document(path: str, question: str) -> str:
    parser = LiteParse()
    # Fast pass: try text only
    text = parser.parse(path).text
    if model_can_answer_from(text, question):
        return text
    # Slow path: render screenshots, send to a VLM
    screenshots = parser.screenshot(path, pages="all")
    return vlm_describe(screenshots, question=question)
The text pass on a 50-page PDF takes ~1.5 seconds on a 2023 MacBook Pro. That's a number you can put inside an agent's reasoning step without thinking about it.
Layout Preservation in Practice
The key idea is "preserve layout rather than detect structure." Most parsers try to recognize "this is a table" and convert it to markdown — which adds failure modes. LiteParse projects text onto a spatial grid and emits something like:
Name    Age    City
John    25     NYC
Jane    30     LA
Modern LLMs already read ASCII tables, code indentation, and READMEs natively. Skipping the table-detection step makes the parser faster, simpler, and — counterintuitively — often more accurate for downstream LLM reading.
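To make the grid-projection idea concrete, here is a toy sketch in Python. It is not LiteParse's actual algorithm: the input shape (text, x, y), the function name project_to_grid, and the fixed char_width and line_height values are all assumptions chosen for illustration. Each text run is snapped to a row and column on a character grid, then padded with spaces so columns stay aligned.

```python
from typing import Dict, List, Tuple

# (text, x, y): x and y are the top-left position of a text run in points,
# with y growing downward. This input shape is an assumption for the sketch.
Item = Tuple[str, float, float]


def project_to_grid(items: List[Item], char_width: float = 6.0,
                    line_height: float = 12.0) -> str:
    """Snap each text run to a character grid and return the rendered lines."""
    rows: Dict[int, Dict[int, str]] = {}
    for text, x, y in items:
        row = round(y / line_height)
        col = round(x / char_width)
        # Two runs landing on the same cell: the later one wins (toy behaviour).
        rows.setdefault(row, {})[col] = text

    lines = []
    for row in sorted(rows):
        line = ""
        for col in sorted(rows[row]):
            if line and len(line) >= col:
                # The previous run overflowed into this column; keep one
                # space so neighbouring words never fuse together.
                line += " "
            else:
                # Pad with spaces up to the target column.
                line = line.ljust(col)
            line += rows[row][col]
        lines.append(line)
    return "\n".join(lines)


cells = [
    ("Name", 0, 0), ("Age", 60, 0), ("City", 108, 0),
    ("John", 0, 12), ("25", 60, 12), ("NYC", 108, 12),
    ("Jane", 0, 24), ("30", 60, 24), ("LA", 108, 24),
]
print(project_to_grid(cells))
```

Because rows and columns are sorted before rendering, the input order of the cells does not matter. A real implementation would also have to cope with variable font sizes, overlapping runs, and rotated text.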
Architecture, Briefly
The pipeline is short: PDFs pass through directly; DOCX/XLSX/PPTX convert via LibreOffice; images via ImageMagick — everything becomes a PDF internally. PDFium pulls native text and positions in one pass; pages without an extractable text layer get routed to Tesseract (or your configured external OCR server); results merge with positions preserved and project onto a 2D grid that reconstructs columns and tables as ASCII. Output is JSON with bounding boxes, plain text, or PNG screenshots. No model weights, no GPU, no Python interpreter for the Python package.
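The routing step in that pipeline (native text first, OCR only for pages without a text layer) can be sketched roughly as follows. The Page and TextItem types, the extract_pages name, and the injected ocr callable are hypothetical stand-ins for illustration, not LiteParse's internal API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class TextItem:
    text: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Page:
    number: int
    native_items: List[TextItem]  # whatever the native text layer yielded


def extract_pages(pages: List[Page],
                  ocr: Callable[[Page], List[TextItem]]) -> Dict[int, List[TextItem]]:
    """Use native text where it exists; call `ocr` only for pages without it."""
    results: Dict[int, List[TextItem]] = {}
    for page in pages:
        # A page counts as "scanned" if it has no non-whitespace native text.
        has_text = any(item.text.strip() for item in page.native_items)
        results[page.number] = page.native_items if has_text else ocr(page)
    return results
```

Injecting the OCR step as a callable mirrors the article's point that the bundled engine and an external OCR server are interchangeable, as long as both return text with positions.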
Community Reactions
The reception has been notably warm for an LLM-adjacent tool launch:
- The Show HN post from March 2026 for LiteParse v1 stayed on the front page most of the day, with commenters comparing it favorably to pypdf and MarkItDown for agent use cases. The recurring sentiment: "finally a PDF parser that doesn't need to be a microservice."
- LocalLLaMA Reddit picked up the v2 Rust rewrite within hours of release. The comment thread immediately compared latency numbers to MarkItDown and pymupdf4llm, with one user reporting 80–100x speedups on a 200-page legal contract. Caveat: speedups are largest on text-heavy PDFs; scanned-image PDFs are bound by Tesseract, which most local parsers also rely on.
- The r/LangChain crowd flagged the OpenAI-skill packaging as the smartest part: npx skills add run-llama/llamaparse-agent-skills --skill liteparse drops a SKILL.md straight into Claude Code, Codex, or any other agent harness that follows the agentskills.io spec. No glue code.
- A few sharper critiques on HN centered on the LibreOffice dependency for DOCX/XLSX/PPTX: it's heavy (~400 MB install) and a known source of crashes in containers. LiteParse's authors acknowledge this and have an open issue exploring a pure-Rust DOCX path.
Benchmarks
The team published their own benchmark dataset on HuggingFace along with the eval pipeline in-repo. The methodology is honest about its limits: they generate Q&A pairs from page screenshots, manually audit the dataset, then evaluate parsers with an LLM judge.
Two takeaways from their numbers:
- Against non-VLM parsers (pypdf, PyMuPDF, MarkItDown), LiteParse wins on QA accuracy across most document types and is the latency leader on large documents.
- They explicitly don't claim to beat VLM-based parsers like LlamaParse cloud, Docling, or Mistral OCR on hard layouts (dense tables, multi-column scientific papers, charts). The README routes you to LlamaParse for those — fair and refreshingly upfront.
If you want to verify the claims on your own corpus, the eval scripts will run on your local PDFs in 10–15 minutes.
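For a feel of what a QA-style parser evaluation measures, here is a minimal sketch. It is not the team's eval pipeline: the qa_accuracy name, the answer and judge callables, and the substring-match judge in the usage lines are all assumptions, with the two callables standing in for the LLM calls a real run would make.

```python
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, reference answer)


def qa_accuracy(parsed_text: str, qa_pairs: List[QAPair],
                answer: Callable[[str, str], str],
                judge: Callable[[str, str], bool]) -> float:
    """Fraction of questions whose answer, given only parsed_text, the judge accepts."""
    if not qa_pairs:
        return 0.0
    correct = 0
    for question, reference in qa_pairs:
        predicted = answer(parsed_text, question)
        if judge(predicted, reference):
            correct += 1
    return correct / len(qa_pairs)


# Stand-ins for real model calls: "answer" just echoes the parsed text and
# "judge" accepts it if the reference string appears anywhere inside.
echo_answer = lambda text, question: text
substring_judge = lambda predicted, reference: reference.lower() in predicted.lower()

score = qa_accuracy(
    "Name Age\nJohn 25",
    [("How old is John?", "25"), ("Where does John live?", "NYC")],
    echo_answer,
    substring_judge,
)
```

Running the same question set against the output of each parser gives one comparable number per parser, which is the shape of comparison the article describes.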
Honest Limitations
LiteParse is a focused tool, and the things it doesn't do are mostly intentional. Plan around these:
- Complex tables get flattened, not structured. A multi-row-header pivot table from an SEC 10-K will become ASCII with column alignment, not a JSON schema. For structured table extraction, you want LlamaParse cloud, Docling, or a dedicated table pipeline such as Unstructured.io's hi_res strategy.
- Handwritten or low-quality scans are Tesseract-limited. Pointing it at a PaddleOCR or EasyOCR HTTP server helps a lot, but you're now running a Python service alongside, which partially defeats the "single binary" pitch.
- LibreOffice is the DOCX/XLSX/PPTX backend. It works, but it's a ~400 MB system dependency. In Docker, you'll want apt install libreoffice-core and accept the image bloat, or pre-render Office docs to PDF elsewhere.
- No semantic block detection. It won't tell you "this is a heading", "this is a caption", "this is a footnote". You get spatial text. If you need structural roles, that's a downstream step (a small LLM call on the parsed output usually works).
- No streaming output. The CLI buffers the entire result before printing. For very large documents (1,000+ pages) you'll want to chunk via --target-pages.
- Bounding-box coordinates are in PDF user-space units, not pixels. Fine once you understand it, but the docs could be clearer for first-time users plotting boxes on rendered screenshots.
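As a rough guide to plotting boxes on a rendered page, the sketch below converts a box from PDF points to pixel coordinates. It assumes the default PDF user-space unit of 1/72 inch and, by default, a bottom-left origin as in the PDF specification. Whether LiteParse's JSON uses that origin or an already-flipped top-left one, and what DPI lit screenshot renders at, are things to check against the docs; the function name and parameters here are illustrative only.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

POINTS_PER_INCH = 72.0  # default PDF user-space unit is 1/72 inch


def pdf_box_to_pixels(box: Box, page_height_pt: float, dpi: float,
                      origin_bottom_left: bool = True) -> Box:
    """Scale a box from PDF points to pixels, flipping the y-axis if needed."""
    scale = dpi / POINTS_PER_INCH
    x0, y0, x1, y1 = box
    if origin_bottom_left:
        # Image rows grow downward from the top edge, so the box's upper
        # edge (y1 in PDF space) becomes the smaller pixel y value.
        y0, y1 = page_height_pt - y1, page_height_pt - y0
    return (x0 * scale, y0 * scale, x1 * scale, y1 * scale)


# A box one inch in from the left on a US Letter page (612 x 792 pt),
# mapped onto a 144 DPI render.
pixel_box = pdf_box_to_pixels((72.0, 720.0, 144.0, 756.0),
                              page_height_pt=792.0, dpi=144.0)
```

If the boxes turn out to use a top-left origin already, passing origin_bottom_left=False reduces the conversion to a plain scale.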
How It Compares
A quick honest map of where LiteParse fits versus the tools you might already be using:
- vs pypdf/pdfminer.six: LiteParse wins on layout, OCR fallback, and agent ergonomics. pypdf wins on zero dependencies and pure-Python install (if that matters for your stack).
- vs Microsoft's markitdown (covered in our skim-list): MarkItDown is broader (covers audio transcripts, HTML, Outlook) but slower on PDFs and Markdown-output-oriented. Use both: MarkItDown for non-PDF formats, LiteParse for PDFs in the hot path.
- vs LlamaParse cloud: LiteParse is for agents in a reasoning loop; LlamaParse is for "I need to nail every table on every page once at ingest time." Use LiteParse for runtime reads, LlamaParse for nightly batch ingestion of your knowledge corpus.
- vs Docling: Docling is more accurate on hard layouts because it uses a vision model, but it requires GPU compute or a long CPU run. LiteParse is the right default; reach for Docling when LiteParse's output isn't enough.
- vs Mistral OCR API: Cloud-only, paid, but excellent on hard scans. Use LiteParse first; fall back to Mistral OCR only on pages where Tesseract failed.
For a longer treatment of how parsing fits into an AI-agent retrieval pipeline, see our review of PageIndex's vectorless RAG approach and our walkthrough of CocoIndex's incremental indexer — both pair naturally with LiteParse as the document-extraction stage.
FAQ
Is LiteParse a replacement for LlamaParse cloud?
No, and LlamaIndex is explicit about that. LiteParse handles the fast-path text-extraction case for agents. LlamaParse cloud is the heavyweight VLM-powered pipeline for production document-intelligence work where you need perfect tables, structured JSON outputs, and premium OCR. They're complementary: use LiteParse in your agent loop, LlamaParse at batch-ingest time for your most valuable documents.
Do I need a GPU to run LiteParse?
No. LiteParse is pure CPU. PDFium and Tesseract both run on CPU and parallelize across cores automatically. A 50-page text-only PDF takes ~1–2 seconds on a modern laptop. Scanned PDFs are slower because Tesseract is the bottleneck — a 50-page scanned PDF takes 30–60 seconds depending on core count. There's no cuda flag because there's no model to accelerate.
Can I use it in a serverless function?
Yes for PDF-only workflows — the Node.js package bundles PDFium and Tesseract as native binaries that work on AWS Lambda's Amazon Linux 2 environment. Cold start is around 800 ms because of the binary unpack. For DOCX/XLSX/PPTX, you'll need LibreOffice in the runtime, which usually means a container-image-based Lambda. The browser WASM build is also viable for client-side parsing — handy when documents shouldn't leave the user's machine.
How does it handle scanned PDFs?
Automatically. If a page has no extractable text layer, LiteParse routes it to Tesseract.js. You can opt in to a higher-accuracy external OCR by running PaddleOCR or EasyOCR as an HTTP server and passing --ocr-server http://localhost:8000/ocr. The repo includes reference servers for both. Any OCR engine that returns text + bounding boxes works.
Will it work with LangChain / LlamaIndex / OpenClaw / Claude Code?
Yes, in a few ways. As a library, import it directly in Python or TypeScript. As a CLI, shell out from any agent that can run subprocesses. As a skill file, drop it into any AgentSkills-compatible harness (Claude Code, Codex, OpenCode, Cursor) with one command. For LlamaIndex specifically, there's a first-party LiteParseReader that mirrors the existing LlamaParseReader interface.
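For the CLI route, a thin Python wrapper might look like the sketch below. It uses only the flags shown earlier in this article (--format, --target-pages, -o). The helper names build_lit_parse_argv, page_ranges, and run_lit_parse are made up for illustration, and page_ranges is one assumed way to split a very large document into --target-pages strings.

```python
import subprocess
from typing import List, Optional


def build_lit_parse_argv(pdf_path: str, out_path: str,
                         fmt: Optional[str] = None,
                         target_pages: Optional[str] = None) -> List[str]:
    """Assemble an argv list for `lit parse` from flags shown in the article."""
    argv = ["lit", "parse", pdf_path]
    if fmt is not None:
        argv += ["--format", fmt]
    if target_pages is not None:
        argv += ["--target-pages", target_pages]
    argv += ["-o", out_path]
    return argv


def page_ranges(total_pages: int, chunk_size: int) -> List[str]:
    """Split pages 1..total_pages into "start-end" strings, one per chunk."""
    ranges = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A one-page chunk is written as a single number, like "10".
        ranges.append(str(start) if start == end else f"{start}-{end}")
    return ranges


def run_lit_parse(argv: List[str], timeout_s: float = 60.0) -> None:
    """Run the CLI; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(argv, check=True, timeout=timeout_s)
```

Passing an argv list instead of a shell string avoids quoting problems with file paths that contain spaces, and the timeout keeps a stuck parse from stalling the agent loop.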
What's the license and can I use it commercially?
Apache 2.0 — as permissive as it gets. Commercial use, redistribution, and modification are all explicitly allowed. The PDFium dependency is BSD-licensed; Tesseract is Apache-2.0; LibreOffice is LGPLv3 (but only used as an external subprocess for Office-format conversion, which keeps your application licensing clean). No usage caps, no telemetry, no API key, no rate limits.
The Bottom Line
LiteParse v2 is the first OSS PDF parser I'd actually drop into an agent without a wrapper. The combination of one CLI across four runtimes, a single Rust binary, no Python interpreter for Python users, sub-2-second runtimes on real-world PDFs, and the agent-first design (text-first, screenshot-fallback) hits a sweet spot the existing tools have all missed in one direction or another.
If you're building anything where an LLM needs to skim a PDF inside a reasoning loop — RAG ingestion, contract review agents, research assistants, invoice readers — install it today and replace your pypdf calls. For batch ingestion of your most valuable corpus, keep LlamaParse cloud (or Docling) for the heavy lifting. The two are designed to coexist, and the LlamaIndex team built LiteParse exactly so you don't have to round-trip to a cloud service every time an agent gets curious about a document.
Related reads on andrew.ooo: PageIndex Review: Vectorless RAG That Actually Works, CocoIndex: Incremental RAG Engine for AI Agents, RAG-Anything by HKU.