Turning inbound PDFs into structured leads with an AI intake agent
A professional services firm was manually reading and re-keying 200+ inbound PDFs per week into their CRM. We deployed a bounded AI extraction agent that handles 94% of them without human touch.
- Inbound volume: 200+ PDFs/wk
- Auto-processed: 94%
- Time saved: 18 hrs/wk
- Field-level accuracy: 99.1%
- Time to CRM: 2 hrs → 20 min
The situation
The client was a regional professional services firm that received most of its inbound work as PDF packages — a mix of submission forms, supporting documents, and structured cover sheets, usually 10 to 40 pages each. About 200 of these arrived per week.
Every package had to be read, summarized, and re-entered as structured fields into the CRM before the deal could enter the pipeline. One person did almost all of it. She was 4 to 5 days behind on any given morning, and any package that slipped through the cracks meant the firm missed a response window and lost the opportunity.
What we found in discovery
Two weeks of shadowing and stakeholder interviews surfaced the real picture:
- The packages were 80% structured. Most fields appeared in consistent places across submissions. Only a long tail of edge cases required human judgment.
- Response time was the revenue metric. Deals that received a response in under an hour closed at 2.3× the rate of deals that took more than a day. The current 4–5 day lag was leaving meaningful money on the table.
- The team was nervous about AI. Previous experiments with generic AI tools had produced embarrassing hallucinations. The bar to earn trust was high.
The approach
We scoped a bounded extraction agent with three guardrails in place from day one:
- Confidence gating: every field extracted came with a confidence score. Anything below threshold routed to a human review queue.
- Evaluation set: before we shipped, we built a labeled set of 120 real packages with ground-truth extractions. Every iteration was measured against that set.
- Shadow mode for two weeks: the agent ran in parallel to the human workflow and its outputs were compared, not used. This earned the team's trust in the accuracy numbers before anything touched the CRM.
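The confidence-gating guardrail can be sketched roughly as follows. This is an illustrative sketch, not the client's implementation: the threshold value, field names, and routing labels are all assumptions.

```python
from dataclasses import dataclass

# Assumed threshold; in practice this would be tuned against the eval set.
CONFIDENCE_THRESHOLD = 0.90

@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # 0.0-1.0, reported by the extraction model

def route_package(fields: list[ExtractedField]) -> str:
    """Return 'auto' if every field clears the threshold, else 'review'."""
    if all(f.confidence >= CONFIDENCE_THRESHOLD for f in fields):
        return "auto"    # safe to insert into the CRM directly
    return "review"      # at least one low-confidence field: a human checks it

# Hypothetical example: one field is low-confidence, so the whole
# package routes to the review queue.
fields = [
    ExtractedField("client_name", "Acme LLC", 0.98),
    ExtractedField("deadline", "2024-06-01", 0.72),
]
print(route_package(fields))  # → review
```

Gating at the package level (one weak field sends the whole package to review) is the conservative choice that makes the 94% auto-approval number trustworthy.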
Build sequence:
- Week 1: Labeling the eval set and defining the extraction schema.
- Weeks 2–3: Building the extraction pipeline — document parsing, field-level prompts with structured output, confidence scoring.
- Week 4: Shadow mode evaluation. We hit 99.1% field-level accuracy on the eval set and 94% "fully auto-approvable" on full packages.
- Week 5: Go-live with a review queue, dashboards, and weekly calibration.
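The Week 1 schema work is the foundation the rest of the pipeline validates against. A minimal sketch of what an extraction schema plus validation step might look like, with entirely hypothetical field names (the real schema came from the client's CRM):

```python
# Hypothetical extraction schema: each CRM field the agent must fill,
# with a type and a required flag. Field names are illustrative.
EXTRACTION_SCHEMA = {
    "client_name":     {"type": "string", "required": True},
    "submission_date": {"type": "string", "required": True},  # ISO 8601
    "service_type":    {"type": "string", "required": True},
    "deal_value":      {"type": "number", "required": False},
}

def validate(record: dict) -> list[str]:
    """Return a list of schema violations; an empty list means valid."""
    errors = []
    for field, spec in EXTRACTION_SCHEMA.items():
        if field not in record:
            if spec["required"]:
                errors.append(f"missing required field: {field}")
            continue
        expected = str if spec["type"] == "string" else (int, float)
        if not isinstance(record[field], expected):
            errors.append(f"wrong type for {field}")
    return errors
```

Validating every extracted record against an explicit schema, before any confidence check, is what makes "structured output" a guarantee rather than a hope.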
The outcomes
Measured eight weeks after go-live:
- 94% of inbound PDFs are fully auto-processed — extracted, structured, and inserted into the CRM without human touch.
- 6% route to the review queue where a human approves or corrects in ~90 seconds per package.
- Time to CRM: from 2 hours (best case, single coordinator) to 20 minutes (worst case, during a review queue spike).
- Field-level accuracy: 99.1% measured against a rolling eval set that's refreshed monthly.
- Labor savings: ~18 hours per week redeployed to client response and follow-up — the high-leverage work the team wanted to do.
- Response time on qualified deals: down from days to under an hour for most inbound packages.
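The field-level accuracy number is the fraction of individual fields that exactly match the labeled ground truth across the rolling eval set. A minimal sketch of that metric, with the record layout assumed:

```python
def field_accuracy(predictions: list[dict], labels: list[dict]) -> float:
    """Fraction of fields where the extraction exactly matches ground truth.

    Each dict maps field name -> value; predictions[i] is scored
    against labels[i]. Layout is an assumption for illustration.
    """
    correct = total = 0
    for pred, truth in zip(predictions, labels):
        for field, expected in truth.items():
            total += 1
            if pred.get(field) == expected:
                correct += 1
    return correct / total if total else 0.0

# Toy example: one of two fields matches, so accuracy is 0.5.
print(field_accuracy([{"client_name": "Acme", "service_type": "tax"}],
                     [{"client_name": "Acme", "service_type": "audit"}]))
```

Refreshing the eval set monthly (as the team does) keeps this number honest as the mix of inbound packages drifts.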
What didn't happen
- The coordinator didn't get replaced. Her role moved from data entry to owning the review queue, the evaluation set, and the calibration process. She's now the firm's internal AI-ops lead.
- No hallucinations in production. Confidence gating plus structured output plus the review queue combined to make "wrong answer in the CRM" effectively impossible. In eight weeks post-launch, zero such incidents.
- No runaway spend. Token costs settled at about $60/week — less than a rounding error against the labor redeployed.
Why it worked
Three disciplines.
First, we picked a bounded workflow — extraction from structured documents — not an ambitious end-to-end agent. The scope is boring. The reliability is not.
Second, we built the evaluation set before the agent. Every decision about model, prompt, and threshold was anchored to measured accuracy on real data. Not demos.
Third, we kept a human in the loop by design. The review queue isn't a fallback for failure — it's the architecture. That's what let us confidently route 94% of work to full automation.
This is what "AI at work" looks like when it's built to last: narrow scope, strong evaluation, measured impact, and a team that owns the system.