⚙️ Exercise 3 of 4

Structured Data Extraction Pipeline

A production-ready invoice extraction pipeline. Takes raw documents in any format and returns structured, validated JSON via forced tool use, two-layer validation, retry with feedback, confidence-based routing, and batch processing with selective retry. Covers the full D4 + D5 exam content.

D4 Prompt Engineering — 20% D5 Context & Reliability — 15%

Exam Domains Covered

Domain	Name	Weight	Coverage
D4	Prompt Engineering	20%	Forced tool use, few-shot examples (4 types), normalization rules, retry prompt anatomy
D5	Context & Reliability	15%	JSON Schema design, Pydantic semantic validation, retry loop, confidence routing, Batch API

Project Files — Code Walkthrough

📐 schemas.py D4 · D5 · JSON Schema Design

Defines the InvoiceExtractionTool — the tool definition passed to tools parameter in messages.create(). Contains both the JSON Schema (API-level syntactic enforcement) and the Pydantic models (application-level semantic enforcement). Two separate validation layers for two different failure modes.

Required fields (6): vendor_name, invoice_number, invoice_date, total_amount, currency, confidence_score — without any of these the extraction is useless
Nullable fields: currency_detail, line_items, payment_terms — "I don't know" is a legitimate real-world answer; null > hallucinated value
"other" enum + currency_detail pattern: prevents silent data loss for unlisted currencies (JPY, CAD, etc.); free-form strings would create normalization chaos downstream
Pydantic catches: date not in ISO 8601 format, confidence_score outside [0,1], currency=="other" requires currency_detail, line_item arithmetic (quantity × unit_price = line_total ± 0.01), sum(line_items) ≈ total_amount ±1%

Exam Trap: "tool_use with a JSON schema guarantees the extracted values are correct." Wrong. Tool_use guarantees syntactic validity (fields present, correct types, valid enum). It does not guarantee semantic correctness (total_amount could be 0.0, confidence could be 0.97 on an ambiguous document). Semantic validation = Pydantic layer.

# Design rule for nullable vs required
# Make nullable when "I don't know" is a valid answer
"payment_terms": {"type": ["string", "null"]}

# Required because extraction is useless without it
"total_amount": {"type": "number"}  # in "required" array

📚 few_shot_examples.py D4 · Few-Shot Prompting

Builds the extraction prompt with 4 few-shot examples and a normalization rules block. Few-shot outperforms textual instructions for document extraction because the model pattern-matches to demonstrated transformations under uncertainty rather than re-deriving abstract rules.

Example	Type	What It Teaches
1 — Skyline Office Supplies	Happy path	Baseline extraction, date normalization ("March 12, 2024"→"2024-03-12"), currency_detail=null for USD
2 — Jake's Freelance Design	Format variation	"five hundred dollars"→500.00, "no rush"→null, quantity=1 for "a dozen variations", confidence=0.78
3 — Scholarium Academic	Edge case	Non-standard layout (bibliography style), numbered-list format, "09/30/2024"→"2024-09-30"
4 — Maria's Catering (draft)	Missing data	invoice_number="UNKNOWN", date="1900-01-01" sentinel, confidence=0.42 for ambiguous document

NORMALIZATION_RULES block embedded in every prompt: dates→YYYY-MM-DD, strip currency symbols, written-out numbers→numeric, null over fabrication
Prompt order: task description → normalization rules → examples → target document → closing instruction
XML tags delimit examples: <example_document>, <example_extraction>, <example_notes>

Exam Trap: "Adding more examples always improves accuracy." Not necessarily. 4 carefully chosen examples (happy path, informal, unusual format, missing data) outperform 20 slight variations of the same standard invoice. More examples = more context window tokens + diminishing returns.

⚗️ extractor.py D4 · Forced Tool Use

Core extraction module. Uses tool_choice = {"type": "tool", "name": "extract_invoice_data"} to force Claude to always call exactly that tool. No prose output, no markdown, no choice — guaranteed structured output every call.

tool_choice: {"type": "tool"} — Claude MUST call this exact tool; stop_reason is always "tool_use"; block.input is already a Python dict
tool_choice: {"type": "auto"} — Claude may write text instead of calling a tool
tool_choice: {"type": "any"} — Claude must call one of the tools but can choose which
MockClaudeClient has 3 modes: normal (happy path), force_error (line items sum ≠ total, tests retry), force_semantic_error (confidence=0.97 for ambiguous doc, proves tool_use ≠ semantic correctness)

# Problems solved by forced tool use:
# 1. Claude might write a text response instead of JSON
# 2. Claude might add explanatory text before/after JSON
# 3. JSON might be wrapped in ```json ... ``` code blocks

tool_choice = {"type": "tool", "name": "extract_invoice_data"}

✅ validator.py D5 · Two-Layer Validation + Retry Loop

Two-layer validation architecture and retry loop. Layer 1 (JSON Schema, API-level) catches syntax errors before you receive the response. Layer 2 (Pydantic, application-level) catches semantic errors after. The retry loop provides specific, quantified feedback to maximize correction success.

Document ↓ [extract_document()] ↓ [validate_extraction()] ├── VALID → return (attempts=1, final_valid=True) └── INVALID → [retry_with_feedback(doc + bad_extraction + SPECIFIC errors)] ↓ [validate_extraction()] ├── VALID → return (attempts=2, final_valid=True) └── INVALID → return best result (final_valid=False)

Good retry feedback: "sum(line_items)=$145, total_amount=$200, difference=$55. Look for additional line items, fees, or taxes."
Bad retry feedback: "Your extraction failed. Please try again." — Claude has nothing to work with
Retry helps: arithmetic errors, date format errors — info exists in document but was missed
Retry doesn't help: missing invoice number, ambiguous amounts — information doesn't exist; retrying risks hallucination
max_retries=2 — beyond 2 attempts, rare improvement; adds cost and latency

Exam Trap: "If validation fails, always retry until the model gets it right." Wrong for 3 reasons: (1) missing data can't be fixed by retrying; (2) beyond 2 attempts unlikely to succeed; (3) retrying missing data can produce hallucinated values that pass validation but are factually wrong.

🔀 confidence_router.py D5 · Confidence-Based Routing

Routes extractions to AUTO_PROCESS, HUMAN_REVIEW, or REJECT based on confidence score, validation errors/warnings, and invoice amount. Confidence scores are self-assessments, not calibrated probabilities — requires regular auditing.

Route	Condition	Human Involvement
AUTO_PROCESS	confidence ≥ 0.85 AND valid AND no warnings AND amount < $10K	5% random audit only
HUMAN_REVIEW	0.60 ≤ confidence < 0.85 OR warnings OR amount ≥ $10K	Explicit review required
REJECT	confidence < 0.60 OR any validation errors	Investigate and re-extract

5% random audit of AUTO_PROCESS items catches systematic miscalibration before it compounds
97% overall accuracy can hide 40% error rate on a specific document subtype (3% of volume)
Stratified sampling: audit across document types, not just from the full pool proportionally
NOT valid routing signals: sentiment/tone of document, response length, secondary classifiers

📦 batch_processor.py D5 · Message Batches API

Batch processing with selective retry. Uses the Anthropic Message Batches API for ~50% cost reduction on large-volume, non-time-critical workloads.

Parameter	Value
Cost reduction	~50% vs. standard API (~$0.003→~$0.0015 per 1K input tokens)
Maximum processing time	24 hours (hard SLA — not a typical time)
Maximum batch size	100,000 requests per batch
Result order	May differ from submission order — use custom_id to correlate

custom_id pattern: "doc-{document_type}-{index:04d}" — encodes type (targeted failure analysis) + zero-padded index (lexicographic sorting)
Selective retry: 1,000 docs, 5% failure = 50 failures → re-submit only 50; saves 950 API calls (95% retry cost reduction)
context_length_exceeded failure: apply 50% truncation (headers at top), resubmit with "{original_id}-retry"
SLA calculation: deadline 30h from now → submit within 6h. Deadline 18h → batch API cannot guarantee delivery; use synchronous API

Exam Trap: "Use Batch API for real-time document processing during user sessions." Wrong. Batch API has up to 24h processing time. Real-time sessions need seconds/minutes — use the synchronous messages.create() API.

▶️ run_demo.py Demo — 3 Sample Documents

Demonstrates the full pipeline on 3 sample documents from sample_docs/, covering the complete range of extraction scenarios.

invoice_standard.txt (Northstar Consulting) — clean professional invoice, happy path, confidence ~0.95
invoice_informal.txt (Dan Kowalski) — handwritten-style, written-out numbers, no formal structure, confidence ~0.78
invoice_ambiguous.txt (Waverly Creative Studio) — draft with missing fields, approximate amounts, "DO NOT PROCESS" warning, routes to REJECT

⚙️ config.py Configuration

Pipeline configuration: routing thresholds (0.85 for auto-process, 0.60 for reject), retry limits (max_retries=2), batch settings, model selection. Centralizes all tunable parameters so thresholds can be adjusted as ground truth calibration data accumulates.

Thresholds are starting points, not universal constants — calibrate against labeled data
Lower auto-process threshold (e.g., 0.90) for high-stakes contexts (regulatory, healthcare)
Increase audit rate (above 5%) for document types with historically higher error rates

Practice Questions (15)

Source: explanation ex3.md

An invoice extraction pipeline uses Claude's tool_use with a JSON schema marking total_amount as a required number. In production, 3% of extractions have total_amount = 0.0 despite invoices clearly stating totals. The developer concludes the JSON schema is not working. Are they correct?

A) Correct — if the schema marks total_amount as required, Claude must return the correct total
B) Incorrect — tool_use guarantees total_amount is a number (syntactic validity), but 0.0 is a valid number; the schema cannot enforce accuracy
C) Switch from tool_use to JSON-in-prompt for better semantic validation
D) The problem resolves by adding "minimum": 0.01 to the JSON schema

Correct: B — Tool_use guarantees syntactic validity. Zero (0.0) is a valid number; the schema accepts it. This is a semantic error: structure is valid, value is wrong. JSON-in-prompt (C) provides no validation — strictly worse. "minimum": 0.01 (D) catches explicit zeros but not misread amounts (150.00 vs correct 1500.00). Fix: Pydantic layer checking total_amount > 0 and line items sum.

Your schema marks payment_terms as a required string. In production, Claude returns "Net 30" for invoices with no payment terms. A colleague suggests making it nullable. Your manager argues that will cause downstream systems to crash on null values. What is correct?

A) Keep as required; add a Pydantic validator to check if the value was explicitly in the document
B) Make nullable; downstream systems should be updated to handle null because null is more honest than fabricated "Net 30"
C) Keep as required; add a few-shot example showing empty string "" when absent
D) Make nullable; use a default "Net 30" in the downstream system so null never reaches it

Correct: B — Fabricated "Net 30" flowing silently into downstream systems is a data integrity problem — vendor might have "Due on receipt" terms. Making the field nullable is correct; the downstream system fix is the right engineering response. Option A can't work — Pydantic validates structure, not provenance. Option D solves the crash but preserves false data.

Your currency schema uses "enum": ["USD","EUR","GBP"]. In testing, Japanese vendor invoices return currency: "USD" despite the invoice clearly stating "¥50,000". A colleague proposes a free-form currency_string field. What is the correct fix?

A) The free-form currency_string approach — allows Claude to capture any currency
B) Add "JPY" to the enum — fixes the specific problem without changing structure
C) Add "other" to the enum plus a currency_detail field for the ISO 4217 code
D) The current schema is correct; use post-processing to detect currency: "USD" from Japanese vendors

Correct: C — Free-form strings (A) cause normalization chaos ("Japanese Yen", "JPY", "Yen", "¥" all mean the same thing). Adding only "JPY" (B) fixes one case but not hundreds of others. Post-processing (D) is fragile and doesn't scale. The "other" + currency_detail pattern maintains standardization for common currencies with a structured escape hatch for all others.

You're building a document feature where users can either extract structured data or ask natural language questions. Which tool_choice is appropriate for each use case?

A) Both should use {"type": "tool", "name": "extract_invoice_data"} for consistent output
B) Extraction: {"type": "tool", "name": "extract_invoice_data"}; Q&A: {"type": "auto"}
C) Both should use {"type": "auto"} for flexibility
D) Extraction: {"type": "any"}; Q&A: {"type": "auto"}

Correct: B — For extraction, always force the tool call (type: "tool") to guarantee structured output. For Q&A, Claude should respond naturally in text — type: "auto" allows text or tool based on what's appropriate. Option A breaks Q&A by forcing structured output. Option C breaks extraction by allowing Claude to skip the tool. Option D uses "any" for extraction — less specific, allows calling wrong tools.

Your prompt includes: "Convert all amounts to numeric values." In production, Claude handles "$1,250.00" correctly but fails to normalize "eight hundred and fifty dollars" to 850.00. Most effective fix?

A) Rewrite the instruction: "Convert written-out numbers such as 'eight hundred and fifty dollars' to numeric values like 850.00"
B) Add a few-shot example showing the transformation: "eight hundred and fifty dollars" → total_amount: 850.00
C) Add a Pydantic validator that uses a number-words library to detect non-normalized values
D) Switch to chain-of-thought prompting before extracting

Correct: B — Textual instructions work for clear cases but fail on novel variants. A few-shot example that demonstrates the specific transformation removes all ambiguity — Claude pattern-matches to the example. Option A is better than the original but still abstract. Option C catches the problem after extraction but doesn't fix it. Option D (CoT) can help with reasoning but the core issue is demonstrating normalization concretely.

Your pipeline needs to handle invoices with "about a dozen poster designs" and "roughly forty hours of work." Which few-shot example type best addresses this?

A) A happy-path example with a clearly formatted standard invoice
B) A format-variation example showing informal amounts normalized to numeric values with lower confidence
C) An edge-case example showing an academic license invoice
D) A missing-data example showing null handling for absent fields

Correct: B — Format-variation examples specifically address informal language. Example 2 (Jake's Freelance) demonstrates: "five hundred dollars"→500.00, "a dozen variations"→quantity=1 with description, confidence_score=0.78 for informal nature. Happy-path (A) establishes the baseline only. Academic license (C) addresses unusual layouts. Missing-data (D) addresses null handling.

After validation you have two errors: (1) sum(line_items)=$245, total_amount=$300 — difference $55; (2) invoice_number is null — document states "invoice number to be filled in later." How should you handle these?

A) Retry for both — Claude can find correct values with specific feedback
B) Retry only for error (1); route error (2) to human review with note that invoice number is pending assignment
C) Don't retry for either — both indicate missing data and retrying produces hallucinations
D) Retry for error (2); don't retry for error (1) because arithmetic can't be corrected by retry

Correct: B — Error (1) is arithmetic: the $55 difference suggests Claude missed a line item that IS in the document. Specific feedback gives Claude a target to find — retry is appropriate. Error (2) is genuine missing data: document explicitly says invoice number will be assigned later. It doesn't exist yet. Retrying produces the same null or a hallucinated number. Human review is correct.

Which retry prompt structure is most effective after a validation failure?

A) "Your previous extraction failed. Please try again and be more careful."
B) "Please re-extract this invoice. Pay special attention to payment terms and line items."
C) "Your previous extraction had this error: sum(line_items)=$145, total_amount=$200 (difference: $55). Re-read the document for additional line items, fees, or taxes. Your previous extraction: [JSON]. Original document follows."
D) "The line items in your previous extraction were incorrect. Please recalculate them from scratch."

Correct: C — Effective retry prompts include: (1) specific error with quantified values, (2) actionable guidance on where to look, (3) previous incorrect extraction for comparison, (4) original document for re-reading. Option A is vague. Option B is slightly better but still abstract. Option D tells Claude the items are wrong but gives no information about how wrong or where to look.

A customer support team needs to acknowledge incoming claim documents within 2 hours with an initial extraction result. Which processing approach is correct?

A) Use the Batch API — it offers 50% cost savings and large volume
B) Use the Batch API with a 1-hour submission window — results should arrive within 24 hours
C) Use the synchronous API — the 2-hour requirement cannot be met by the Batch API with its 24-hour maximum
D) Use Batch API for claims before noon, synchronous for afternoon claims

Correct: C — The Batch API has a maximum processing time of 24 hours — it cannot guarantee results within 2 hours. Any use case with a response SLA under 24 hours must use the synchronous API. Option A identifies the cost benefit but ignores the SLA. Option B: submitting immediately doesn't change the 24-hour maximum. Option D is operational complexity without solving the fundamental SLA mismatch.

You submit 1,000 invoice extractions as a batch. When results arrive 8 hours later, they are in a different order than submitted. Which feature handles this correctly?

A) The batch ID ensures results are in submission order
B) The custom_id you assigned to each request allows matching each result to its source document regardless of return order
C) The timestamp in each result indicates the original submission order
D) The Anthropic API guarantees FIFO ordering for batch results

Correct: B — Batch results may arrive in any order — the API explicitly states this. custom_id is the mechanism for correlating each result to its source document. Index the results by custom_id, then reconstruct original order using the index embedded in the ID (e.g., "doc-standard-0042" → index 42). Batch ID (A) identifies the batch, not individual items. FIFO ordering (D) is not guaranteed.

You submit a batch of 2,000 invoices. Results arrive with 1,900 successes and 100 failures (all context_length_exceeded). How should you handle the retry?

A) Re-submit all 2,000 with a smaller max_tokens setting
B) Re-submit only the 100 failed documents with truncation applied (take first 50% of each document's text)
C) Don't retry — context_length_exceeded means the documents are too complex
D) Re-submit all 2,000 with a different model with a larger context window

Correct: B — Selective retry is the core efficiency of the custom_id pattern. Re-submitting all 2,000 (A, D) wastes 1,900 API calls on already-successful documents. Option C incorrectly treats context failures as permanent — truncation frequently recovers the most important fields (headers are usually at the top). Note: max_tokens affects output length, not input context window size (A is also wrong on this point).

A finance team needs all invoice extractions by 9:00 AM Monday. They receive invoices Friday afternoon and plan to use Batch API (max 24h). What is the latest time Friday they can submit?

A) 9:00 AM Friday — 24 hours before Sunday 9:00 AM, which is before the Monday deadline
B) 5:00 PM Friday — end of business
C) Any time Friday — the batch always completes well before 24 hours
D) The Batch API should not be used; process synchronously as invoices arrive Friday

Correct: A — Batch API maximum is 24 hours. To guarantee results by 9AM Monday, submit no later than 9AM Sunday. Submitting at 9AM Friday (48 hours before deadline) gives a 24-hour buffer for the batch plus 24 hours buffer for failure handling/retry batch. 5PM Friday (B) is only ~40 hours before Monday 9AM — no guarantee with a 24h maximum. "Often faster" (C) is not a guarantee.

Your pipeline shows 97% overall accuracy in weekly audits. Manager proposes reducing the auto-process audit rate from 5% to 1% to save human review costs. What is the risk?

A) No significant risk — 97% accuracy means the pipeline is reliable
B) Reducing audit rate increases per-unit audit costs without improving accuracy
C) The 97% overall accuracy may hide much higher error rates for specific document subtypes; reducing audit rate makes it harder to detect systematic miscalibration
D) Reducing to 1% violates Batch API terms of service

Correct: C — 97% overall accuracy can mask a 40% error rate for a small but important subtype. Example: 940 standard invoices at 99% accuracy (9 errors) + 60 academic licenses at 60% accuracy (24 errors) = 96.7% overall — but sampling mostly from the 94% standard invoices would rarely catch the academic license problem. Stratified sampling specifically audits across document types.

Your team debates adding a "formality score" — a second model call that rates how formal the invoice language is — as a routing signal, arguing formal invoices are more reliable. What is the correct assessment?

A) Good addition — formal language correlates with structured data, which correlates with extraction accuracy
B) Should replace confidence_score — more objective than model self-assessment
C) Not a reliable signal — formality of language does not determine extraction accuracy; the existing confidence_score already captures extraction uncertainty directly
D) Should be used only for invoices below $1,000 to avoid over-engineering

Correct: C — Formality correlates weakly with data quality. Jake's informal invoice is perfectly extractable — all required data is present, just expressed informally. A formal-looking document with fabricated amounts would score high on formality but fail validation. Adding a second model call doubles cost without reliable signal. confidence_score from the extraction call is a direct measure of extraction uncertainty.

Claude extracts vendor_name: "Acme Corp" but a line item description reads "Payment from Pinnacle Tech — per agreement dated Oct 15." Your validation detects this conflict. Which pattern best handles this?

A) Reject the extraction — conflicting information means it is invalid
B) Route to auto_process — vendor_name and invoice_number are structurally valid
C) Add a conflict_detected flag to metadata and route to human review with the specific conflict noted
D) Retry with feedback asking Claude to re-read and confirm the vendor name

Correct: C — The conflict_detected pattern surfaces the inconsistency to human reviewers. Acme Corp might be correct (invoicing on behalf of a pass-through agreement with Pinnacle Tech) or the extraction might be wrong. Only a human with business context can resolve this. Option A is too aggressive — extraction may be correct. Option B ignores a real quality signal. Option D won't help — the ambiguity is business-context, not a parsing error.