A production-ready invoice extraction pipeline. Takes raw documents in any format and returns structured, validated JSON via forced tool use, two-layer validation, retry with feedback, confidence-based routing, and batch processing with selective retry. Covers the full D4 + D5 exam content.
| Domain | Name | Weight | Coverage |
|---|---|---|---|
| D4 | Prompt Engineering | 20% | Forced tool use, few-shot examples (4 types), normalization rules, retry prompt anatomy |
| D5 | Context & Reliability | 15% | JSON Schema design, Pydantic semantic validation, retry loop, confidence routing, Batch API |
Defines the InvoiceExtractionTool — the tool definition passed to tools parameter in messages.create(). Contains both the JSON Schema (API-level syntactic enforcement) and the Pydantic models (application-level semantic enforcement). Two separate validation layers for two different failure modes.
# Design rule for nullable vs required # Make nullable when "I don't know" is a valid answer "payment_terms": {"type": ["string", "null"]} # Required because extraction is useless without it "total_amount": {"type": "number"} # in "required" array
Builds the extraction prompt with 4 few-shot examples and a normalization rules block. Few-shot outperforms textual instructions for document extraction because the model pattern-matches to demonstrated transformations under uncertainty rather than re-deriving abstract rules.
| Example | Type | What It Teaches |
|---|---|---|
| 1 — Skyline Office Supplies | Happy path | Baseline extraction, date normalization ("March 12, 2024"→"2024-03-12"), currency_detail=null for USD |
| 2 — Jake's Freelance Design | Format variation | "five hundred dollars"→500.00, "no rush"→null, quantity=1 for "a dozen variations", confidence=0.78 |
| 3 — Scholarium Academic | Edge case | Non-standard layout (bibliography style), numbered-list format, "09/30/2024"→"2024-09-30" |
| 4 — Maria's Catering (draft) | Missing data | invoice_number="UNKNOWN", date="1900-01-01" sentinel, confidence=0.42 for ambiguous document |
NORMALIZATION_RULES block embedded in every prompt: dates→YYYY-MM-DD, strip currency symbols, written-out numbers→numeric, null over fabrication<example_document>, <example_extraction>, <example_notes>Core extraction module. Uses tool_choice = {"type": "tool", "name": "extract_invoice_data"} to force Claude to always call exactly that tool. No prose output, no markdown, no choice — guaranteed structured output every call.
MockClaudeClient has 3 modes: normal (happy path), force_error (line items sum ≠ total, tests retry), force_semantic_error (confidence=0.97 for ambiguous doc, proves tool_use ≠ semantic correctness)# Problems solved by forced tool use: # 1. Claude might write a text response instead of JSON # 2. Claude might add explanatory text before/after JSON # 3. JSON might be wrapped in ```json ... ``` code blocks tool_choice = {"type": "tool", "name": "extract_invoice_data"}
Two-layer validation architecture and retry loop. Layer 1 (JSON Schema, API-level) catches syntax errors before you receive the response. Layer 2 (Pydantic, application-level) catches semantic errors after. The retry loop provides specific, quantified feedback to maximize correction success.
Routes extractions to AUTO_PROCESS, HUMAN_REVIEW, or REJECT based on confidence score, validation errors/warnings, and invoice amount. Confidence scores are self-assessments, not calibrated probabilities — requires regular auditing.
| Route | Condition | Human Involvement |
|---|---|---|
| AUTO_PROCESS | confidence ≥ 0.85 AND valid AND no warnings AND amount < $10K | 5% random audit only |
| HUMAN_REVIEW | 0.60 ≤ confidence < 0.85 OR warnings OR amount ≥ $10K | Explicit review required |
| REJECT | confidence < 0.60 OR any validation errors | Investigate and re-extract |
Batch processing with selective retry. Uses the Anthropic Message Batches API for ~50% cost reduction on large-volume, non-time-critical workloads.
| Parameter | Value |
|---|---|
| Cost reduction | ~50% vs. standard API (~$0.003→~$0.0015 per 1K input tokens) |
| Maximum processing time | 24 hours (hard SLA — not a typical time) |
| Maximum batch size | 100,000 requests per batch |
| Result order | May differ from submission order — use custom_id to correlate |
"doc-{document_type}-{index:04d}" — encodes type (targeted failure analysis) + zero-padded index (lexicographic sorting)"{original_id}-retry"Demonstrates the full pipeline on 3 sample documents from sample_docs/, covering the complete range of extraction scenarios.
Pipeline configuration: routing thresholds (0.85 for auto-process, 0.60 for reject), retry limits (max_retries=2), batch settings, model selection. Centralizes all tunable parameters so thresholds can be adjusted as ground truth calibration data accumulates.
Source: explanation ex3.md