Stop Hand-Rolling PDF Parsers: Build Intelligent Document Pipelines with LangExtract

LangExtract brings schema enforcement, source grounding, and multi-pass extraction to any LLM. Here's how to build a real PDF intelligence pipeline.

PDF extraction is one of those problems that looks simple until you're three weeks deep in a pile of brittle regex, broken table parsers, and LLM responses that refuse to stay in the schema you asked for.

If you've been there, LangExtract is worth your attention. It's an open-source Python library from Google that sits on top of the LLM of your choice and handles the unglamorous middle layer: structured output enforcement, few-shot prompting, smart chunking, multi-pass extraction, and — crucially — grounding every extracted value back to its exact character offset in the source text.

This article walks through how it works, how to pair it with a PDF parser for end-to-end document intelligence, and how to design a pipeline that scales past toy examples.


Why PDF Data Extraction Is Still a Mess

Extracting the structured data locked inside PDFs is a solved problem in theory. In practice, it's a nightmare.

The layout varies wildly. An invoice from one vendor might have line items in a clean table. Another dumps them in a paragraph. A third uses a custom font that breaks half your text extraction tools. Add scanned documents, multi-column layouts, embedded headers and footers, and you've got a problem space that rewards flexibility over rigid rules.

Regex-based and template-based approaches work when you control the source format. They collapse the moment variability enters the picture. And when they fail, they fail silently — you get a wrong value with no indication that anything went wrong.

LLM prompts help with the variability, but raw "return JSON only" prompts are fragile too. The model drifts from your schema. It paraphrases when you needed an exact span. You can't trace where it got the value. Debugging becomes archaeology.

LangExtract attacks these problems at the orchestration layer.


What LangExtract Actually Does

LangExtract positions itself as the "extraction orchestrator" rather than the model. You still choose the LLM — OpenAI, Gemini, Anthropic, or a local model via Ollama. LangExtract handles everything else:

  • Schema enforcement via few-shot examples, not fragile system prompts
  • Structured output with Extraction objects that carry the extracted class, text span, and custom attributes
  • Source grounding — every extraction maps back to character offsets in the original text
  • Smart chunking and multi-pass extraction for long documents
  • HTML visualization so you can see exactly what was extracted and where

The key design decision is requiring examples, not just a prompt. You define what good extractions look like by showing the model actual text and actual expected outputs. This is more work upfront than writing a vague "extract invoice fields" prompt, but it produces dramatically more consistent results at scale.


Basic LangExtract Workflow

Here's a minimal working example for invoice extraction:

import langextract as lx
import os

# Sample input; in a real pipeline this text comes from your PDF parser
invoice_text = "Invoice #INV-98765 from Globex Inc dated 2025-10-15. Total: $1,850.00"

prompt = """
Extract invoice metadata and line items.
Use exact spans from the source text. Do not summarize or paraphrase.
"""

examples = [
    lx.data.ExampleData(
        text="Invoice #INV-12345 from ACME Corp dated 2025-09-01. Total: $4,200.00",
        extractions=[
            lx.data.Extraction(
                extraction_class="invoice_header",
                extraction_text="Invoice #INV-12345 from ACME Corp dated 2025-09-01",
                attributes={
                    "invoice_number": "INV-12345",
                    "supplier_name": "ACME Corp",
                    "invoice_date": "2025-09-01",
                    "total": "4200.00"
                },
            )
        ],
    )
]

result = lx.extract(
    text_or_documents=invoice_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gpt-4o",
    api_key=os.environ["OPENAI_API_KEY"],
    extraction_passes=2,
    fence_output=True,
)

for extraction in result.extractions:
    print(extraction.extraction_class)
    print(extraction.attributes)
    interval = extraction.char_interval
    print(f"Source span: {interval.start_pos}-{interval.end_pos}")

A few things worth noting:

fence_output=True tells LangExtract to expect the model's JSON wrapped in code fences, which is the right setting for OpenAI-style chat models and keeps output from bleeding outside the schema. extraction_passes=2 makes a second pass over the document, which improves recall on complex documents where the first pass misses edge cases. And the char_interval on each extraction (its start and end character positions in the source) is what makes traceability possible: you always know exactly what text produced the result.


Adding PDFs: A Two-Stage Architecture

LangExtract expects text input. PDFs require a separate parsing step before you can use it, and how you handle that step determines whether you get real traceability or just extracted strings floating in a void.

The architecture that works well in practice looks like this:

Stage 1: PDF → Rich Text with Metadata

Use a PDF parser that gives you more than raw text. You want structural metadata: page numbers, block IDs, bounding boxes, reading order, and table boundaries. Docling is a strong open-source option here. PyMuPDF and pdfplumber are solid alternatives depending on your needs.

The output of this stage isn't just a string — it's a structured representation you can map back to physical positions on the page.

Stage 2: Rich Text → LangExtract Input

Feed the parsed text (or chunks of it) into lx.extract(). As you build each chunk, maintain a mapping structure that records the document ID, page number, and character offsets so you can trace any LangExtract result back to its source.

Here's what that mapping might look like:

# Docling's converter API is sketched from its documented surface here;
# check the current Docling docs for exact class and field names.
from docling.document_converter import DocumentConverter

def build_chunks_from_pdf(pdf_path):
    """Parse a PDF with Docling and build text chunks with provenance."""
    doc = DocumentConverter().convert(pdf_path).document
    chunks = []

    for item in doc.texts:          # text items carry provenance metadata
        prov = item.prov[0]         # page number, bounding box, char span
        chunks.append({
            "doc_id": str(pdf_path),
            "page": prov.page_no,
            "block_id": item.self_ref,
            "text": item.text,
            "charspan": prov.charspan,
            "bbox": prov.bbox,
        })

    return chunks

def remap_extraction_to_pdf(extraction, chunk):
    """Map a LangExtract extraction back to a PDF page and bounding box."""
    return {
        "extraction_class": extraction.extraction_class,
        "attributes": extraction.attributes,
        "page": chunk["page"],
        "block_id": chunk["block_id"],
        "bbox": chunk["bbox"],
        "source_span": extraction.extraction_text,
    }

This gives you something genuinely useful in production: an extracted field that you can point to on a specific page, in a specific region, with the exact text that produced it. That's the difference between an extraction pipeline and an auditable document intelligence system.


Scaling Up: Chunking, Long Documents, and Multi-Pass Extraction

Single-page documents are easy. The interesting engineering starts when you have 50-page contracts, multi-section reports, or hundreds of PDFs in a processing queue.

Chunking strategy depends on your document type. For invoices, one invoice per chunk works well — each invoice is a logical unit. For contracts, chunking by section heading preserves semantic boundaries. For reports, chunking by page is a reasonable default when you don't have structural metadata to work with.

Don't chunk blindly by token count without considering logical boundaries. If your chunk splits a table in half, you'll get inconsistent extractions on both sides.
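As a sketch of boundary-aware chunking for documents with numbered section headings, something like this works (the heading regex is an assumption; adapt it to your documents):

```python
import re

def chunk_by_sections(text, max_chars=4000):
    """Split at section headings, then pack whole sections into chunks
    of at most max_chars, so no section is cut in half."""
    # Assumed heading shape: "3. Termination" at the start of a line
    sections = re.split(r"(?m)^(?=\d+\.\s+[A-Z])", text)
    chunks, current = [], ""
    for section in sections:
        if current and len(current) + len(section) > max_chars:
            chunks.append(current)
            current = ""
        current += section
    if current:
        chunks.append(current)
    return chunks
```

A single section longer than max_chars still becomes its own oversized chunk; that's the point at which you fall back to splitting within the section.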

Multi-pass extraction (extraction_passes > 1) is worth enabling for complex documents. The first pass gets the obvious extractions. Subsequent passes catch things the model missed — edge-case formats, fields buried in footnotes, or data that requires context from earlier in the document. The cost overhead is real, so tune based on your document complexity and budget.

Cross-model strategies make sense at scale. For high-volume, low-complexity documents — utility bills, standard receipts — a cheaper, faster model handles the bulk of the work. Reserve GPT-4o or Claude Sonnet for documents flagged as ambiguous by validation rules or low-confidence extractions.

LangExtract supports multiple providers through a uniform interface, so this kind of routing is straightforward to implement at the pipeline level without rewriting your extraction logic.
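A routing function for that pattern can be as small as this (the model IDs and document types here are illustrative, not prescribed by LangExtract):

```python
def choose_model(doc_type, flagged_ambiguous=False, low_confidence=False):
    """Pick a model tier for a document. Model IDs are illustrative."""
    bulk_tier = {"utility_bill", "receipt"}  # high-volume, low-complexity
    if flagged_ambiguous or low_confidence:
        return "gpt-4o"       # reserve the stronger model for hard cases
    if doc_type in bulk_tier:
        return "gpt-4o-mini"  # cheaper, faster model handles the bulk
    return "gpt-4o"
```

Because the decision resolves to nothing more than a model_id string, it plugs straight into the extraction call.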


Post-Processing: Validation, De-duplication, and Storage

Extraction is only the beginning. Raw LangExtract output needs validation before it goes anywhere near a database or downstream system.

Domain rules catch what the LLM won't. Date fields should parse as valid dates. Totals should be numeric. Cross-field consistency checks — line item subtotals summing to the invoice total, for example — catch model hallucinations that stay inside the schema.
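A minimal validator for the invoice fields from the earlier example might look like this (the 0.01 tolerance is an assumption for currency rounding):

```python
from datetime import datetime

def validate_invoice(fields, line_items):
    """Apply domain rules to extracted invoice fields.
    Returns a list of problems; an empty list means the invoice passed."""
    problems = []
    try:
        datetime.strptime(fields.get("invoice_date", ""), "%Y-%m-%d")
    except ValueError:
        problems.append("invoice_date is not a valid YYYY-MM-DD date")
    try:
        total = float(fields.get("total", ""))
    except ValueError:
        problems.append("total is not numeric")
        return problems
    # Cross-field consistency: line items should sum to the invoice total
    subtotal = sum(float(item["amount"]) for item in line_items)
    if abs(subtotal - total) > 0.01:
        problems.append(f"line items sum to {subtotal:.2f}, not {total:.2f}")
    return problems
```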

De-duplication becomes relevant when you run multiple passes or process overlapping chunks. The same entity might be extracted twice. Build a merging step that deduplicates based on span overlap and attribute similarity.
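One way to sketch that merging step, representing each extraction as a (class, start, end, attributes) tuple (a simplification of LangExtract's objects, for illustration):

```python
def dedupe_extractions(extractions):
    """Merge same-class extractions whose character spans overlap,
    keeping the widest span and unioning their attributes."""
    extractions = sorted(extractions, key=lambda e: e[1])
    merged = []
    for cls, start, end, attrs in extractions:
        if merged:
            last_cls, last_start, last_end, last_attrs = merged[-1]
            if cls == last_cls and start < last_end:  # spans overlap
                merged[-1] = (cls, last_start, max(last_end, end),
                              {**last_attrs, **attrs})
                continue
        merged.append((cls, start, end, attrs))
    return merged
```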

Second-pass enrichment is an optional pattern worth knowing about. After your primary extraction, you can run a second LLM call that operates on the extracted fields as context (not the raw document) to perform classification, linking, or enrichment. Classifying document type, linking line items to their parent invoice header, or normalizing supplier names to a canonical registry all fit this pattern cleanly.
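A sketch of what keeps the second pass cheap: its prompt is built from the extracted fields alone, never the raw document (the prompt wording here is illustrative):

```python
import json

def build_enrichment_prompt(extractions):
    """Build a second-pass prompt from extracted fields only, asking the
    model to classify the document and link items to their parent header."""
    payload = [{"class": e["class"], "attributes": e["attributes"]}
               for e in extractions]
    return ("Classify the document type and link each line_item to its "
            "parent invoice_header. Extracted fields:\n"
            + json.dumps(payload, indent=2))
```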

For storage, the most flexible approach is keeping the raw JSONL output per document alongside structured rows in a relational database:

CREATE TABLE extracted_fields (
    id SERIAL PRIMARY KEY,
    document_id VARCHAR(255) NOT NULL,
    extraction_class VARCHAR(100) NOT NULL,
    field_name VARCHAR(100) NOT NULL,
    field_value TEXT,
    page_number INTEGER,
    source_span TEXT,
    confidence FLOAT,
    extracted_at TIMESTAMP DEFAULT NOW()
);

Keep the source_span and page_number columns. They're what make audits possible six months from now when someone questions where a number came from.
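As a sketch of the write path, here is the same schema in SQLite form with an insert helper (SERIAL and NOW() become AUTOINCREMENT and CURRENT_TIMESTAMP; the row-dict keys are an assumption about your pipeline's shape):

```python
import sqlite3

def store_extractions(conn, document_id, rows):
    """Create the extracted_fields table if needed and insert rows,
    preserving source_span and page_number for later audits."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS extracted_fields (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            document_id TEXT NOT NULL,
            extraction_class TEXT NOT NULL,
            field_name TEXT NOT NULL,
            field_value TEXT,
            page_number INTEGER,
            source_span TEXT,
            extracted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )""")
    conn.executemany(
        """INSERT INTO extracted_fields
           (document_id, extraction_class, field_name, field_value,
            page_number, source_span)
           VALUES (?, ?, ?, ?, ?, ?)""",
        [(document_id, r["class"], r["field"], r["value"],
          r["page"], r["span"]) for r in rows])
    conn.commit()
```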


Design Patterns Worth Knowing

Schema-first extraction is the most productive way to approach a new document type. Start from your target schema — what fields do you actually need downstream — then design your prompt and few-shot examples to mirror that schema. This eliminates the mismatch between what the model extracts and what your database expects, and it forces you to be explicit about every field before you write a line of extraction code.
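One way to make that concrete is to define the target schema as code and derive everything else from it (a sketch; the dataclass fields mirror the invoice example earlier):

```python
from dataclasses import dataclass, fields

@dataclass
class InvoiceHeader:
    """Target schema: exactly the fields downstream systems expect."""
    invoice_number: str
    supplier_name: str
    invoice_date: str
    total: str

def schema_attribute_names():
    """Few-shot example attributes are derived from the schema, so the
    prompt and the database can't silently drift apart."""
    return [f.name for f in fields(InvoiceHeader)]

def matches_schema(attributes):
    """True when an extraction's attribute keys exactly cover the schema."""
    return set(attributes) == set(schema_attribute_names())
```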

Separation of concerns keeps the pipeline maintainable. Your PDF parser doesn't need to know about LangExtract. Your LangExtract runner doesn't need to know about your database schema. Your validation layer doesn't need to know about the model. Build these as independent stages with clean interfaces and you'll spend a lot less time debugging when something breaks in production.

Human-in-the-loop review closes the feedback loop. LangExtract's HTML visualization shows extracted entities highlighted in context — use it to build a review interface where humans can correct extractions on documents that failed validation. Feed those corrections back into your few-shot examples over time. The system gets better with use rather than drifting toward entropy.


Getting Started

LangExtract is open-source and available on GitHub at github.com/google/langextract. The documentation covers provider setup for OpenAI, Gemini, Anthropic, and Ollama.

The honest starting point is picking one document type you actually need to process, building two or three high-quality few-shot examples, and running it against a real batch before worrying about scale. The patterns above — chunking, multi-pass extraction, validation, remapping to PDF coordinates — add real value, but they're not required on day one.

What LangExtract gives you from the start is something that most hand-rolled extraction pipelines never achieve: every extracted value traced back to exactly where it came from in the source. That alone makes debugging and auditing a fundamentally different experience than staring at a JSON blob with no idea how the model produced it.


Shane is the founder of Grizzly Peak Software — a technical resource hub for software engineers, written from a cabin in Caswell Lakes, Alaska.
