Context
A loan servicing firm whose internal staff process loans that arrive as stacks of unstructured PDFs. Mixed templates, scanned pages, multi-document packets. Everything downstream waits on someone reading the docs.
Problem
The bottleneck was document review. Every reporting and customer workflow downstream of it inherited the latency. The data that mattered was inside the PDFs and effectively invisible to the rest of the company.
What we tried first
OCR plus a single extraction prompt over the raw text. It worked on clean docs. It fell apart on the messy ones, which were most of them. Multi-page packets and mixed templates broke the positional assumptions the prompt was quietly making. The prototype was good enough to show us that we needed structured extraction with validation, not a one-shot prompt.
What we shipped
Phase 1: a document understanding pipeline with structured extraction, schema validation, and confidence scoring. Failures route to staff with the exact field that needs review, not the whole packet.
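A minimal sketch of the Phase 1 routing rule, not the production code. It assumes the extractor returns a value and a confidence per field; the schema, the field names, and the 0.85 threshold are all hypothetical. Each failing check flags that one field for review, never the whole packet.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExtractedField:
    name: str
    value: Optional[str]
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical extractor


# Hypothetical schema: field name -> validator that returns True for a well-formed value.
SCHEMA: dict[str, Callable[[str], bool]] = {
    "loan_id": lambda v: v.isalnum() and len(v) >= 6,
    "principal": lambda v: v.replace(",", "").replace(".", "", 1).isdigit(),
    "borrower_name": lambda v: len(v.strip()) > 0,
}

CONFIDENCE_THRESHOLD = 0.85  # assumed cutoff, would be tuned against real data


def fields_needing_review(fields: list[ExtractedField]) -> list[tuple[str, str]]:
    """Return (field name, reason) pairs for every field a person should check."""
    by_name = {f.name: f for f in fields}
    review = []
    for name, is_valid in SCHEMA.items():
        field = by_name.get(name)
        if field is None or field.value is None:
            review.append((name, "missing"))
        elif not is_valid(field.value):
            review.append((name, "failed schema validation"))
        elif field.confidence < CONFIDENCE_THRESHOLD:
            review.append((name, "low confidence"))
    return review


# Made-up packet: one clean field, one low-confidence field, one invalid field.
packet = [
    ExtractedField("loan_id", "LN004521", 0.98),
    ExtractedField("principal", "250,000.00", 0.61),
    ExtractedField("borrower_name", "", 0.90),
]
review_queue = fields_needing_review(packet)
```

Here `review_queue` holds only `principal` and `borrower_name`, each with its reason. The reviewer opens two fields, not the packet.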
Phase 2: a natural-language retrieval assistant grounded in the indexed loan corpus, with LangSmith traces on every retrieval. Staff query loans in plain English; the assistant cites the source page.
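A toy illustration of the citation behavior in Phase 2. It swaps in plain keyword overlap for whatever retrieval the real assistant uses, and it makes no LangSmith calls. The point it shows: if every indexed chunk carries its document and page, the answer can always cite its source. The names and the sample corpus are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageChunk:
    document: str
    page: int
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}


def retrieve(query: str, corpus: list[PageChunk], top_k: int = 1) -> list[PageChunk]:
    """Rank chunks by how many query words they share; drop chunks with no overlap."""
    q = tokenize(query)
    scored = [(len(q & tokenize(c.text)), c) for c in corpus]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [c for _, c in scored[:top_k]]


def answer_with_citation(query: str, corpus: list[PageChunk]) -> str:
    hits = retrieve(query, corpus)
    if not hits:
        return "No supporting page found."
    top = hits[0]
    return f"{top.text} [source: {top.document}, p. {top.page}]"


# Made-up corpus for illustration.
corpus = [
    PageChunk("packet_0017.pdf", 3, "Maturity date is 2031-06-01."),
    PageChunk("packet_0017.pdf", 5, "Escrow balance is 1,240.00."),
]
reply = answer_with_citation("What is the maturity date?", corpus)
```

The design choice that matters is in the data model, not the ranking: page metadata travels with the text from indexing through to the answer.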
Phase 3: process automation on top of the now-structured data. Recurring reports run themselves. Email workflows that used to live in someone's head are agent-driven, with a human approval step on anything customer-facing.
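The Phase 3 approval step can be sketched as a gate in front of the send path. This is an assumed shape, not the shipped design: `DraftAction` and `Dispatcher` are hypothetical names, and "sending" here just appends to a list.

```python
from dataclasses import dataclass, field


@dataclass
class DraftAction:
    kind: str              # e.g. "report" or "email"
    customer_facing: bool  # True means a human must approve before sending
    body: str


@dataclass
class Dispatcher:
    sent: list = field(default_factory=list)
    pending_approval: list = field(default_factory=list)

    def submit(self, action: DraftAction) -> str:
        """Internal actions go straight out; customer-facing ones wait for a human."""
        if action.customer_facing:
            self.pending_approval.append(action)
            return "pending_approval"
        self.sent.append(action)
        return "sent"

    def approve(self, action: DraftAction) -> None:
        """Called by a staff member; raises ValueError if the action is not pending."""
        self.pending_approval.remove(action)
        self.sent.append(action)


dispatcher = Dispatcher()
report = DraftAction("report", False, "Weekly delinquency summary")
email = DraftAction("email", True, "Your payment was received.")

status_report = dispatcher.submit(report)  # sent immediately
status_email = dispatcher.submit(email)    # held for review
dispatcher.approve(email)                  # human sign-off releases it
```

The agent can only call `submit`. Nothing customer-facing reaches `sent` without passing through `approve`.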
Outcome
Three phases delivered. Loan review time on packets that previously needed a full read moved from days to hours. The retrieval assistant runs in production for internal staff. Recurring reports and email automations replaced work that used to live in calendars and inboxes.
What we got wrong
We pushed the eval scaffolding to Phase 2. It should have been part of Phase 1. Once golden examples existed for extraction, debugging a regression went from "find the bad PDF" to "look at the failing eval." Build the test harness with the system, not after.
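A minimal sketch of what a golden-based eval harness can look like. `extract_stub` stands in for the real extractor, and the golden cases are made up. Each case pairs an input with the expected fields, and a failure names the case and the field.

```python
def extract_stub(text: str) -> dict[str, str]:
    """Stand-in extractor: parses 'Key: value' lines into snake_case fields."""
    out = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            out[key.strip().lower().replace(" ", "_")] = value.strip()
    return out


# Hypothetical goldens: hand-checked inputs with the exact expected output.
GOLDENS = [
    {
        "id": "clean_single_page",
        "input": "Loan ID: LN004521\nPrincipal: 250000",
        "expected": {"loan_id": "LN004521", "principal": "250000"},
    },
    {
        "id": "extra_whitespace",
        "input": "  Loan ID :  LN009001 \nPrincipal:  18000 ",
        "expected": {"loan_id": "LN009001", "principal": "18000"},
    },
]


def run_evals(extract, goldens):
    """Return (case id, field, expected, actual) for every mismatch."""
    failures = []
    for case in goldens:
        actual = extract(case["input"])
        for field_name, expected in case["expected"].items():
            got = actual.get(field_name)
            if got != expected:
                failures.append((case["id"], field_name, expected, got))
    return failures


failures = run_evals(extract_stub, GOLDENS)

# Simulated regression: an extractor change that lowercases every value.
regressed = run_evals(
    lambda text: {k: v.lower() for k, v in extract_stub(text).items()},
    GOLDENS,
)
```

`failures` is empty for the stub. `regressed` points at `loan_id` in both cases, which is the "look at the failing eval" experience: the harness names the field before anyone opens a PDF.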