Exercise 3 — Structured Data Extraction Pipeline D4 · D5 ← Back to Study Guide
⚙️ Exercise 3 of 4

Structured Data Extraction Pipeline

A production-ready invoice extraction pipeline. Takes raw documents in any format and returns structured, validated JSON via forced tool use, two-layer validation, retry with feedback, confidence-based routing, and batch processing with selective retry. Covers the full D4 + D5 exam content.

D4 Prompt Engineering — 20% D5 Context & Reliability — 15%

Exam Domains Covered

DomainNameWeightCoverage
D4Prompt Engineering20%Forced tool use, few-shot examples (4 types), normalization rules, retry prompt anatomy
D5Context & Reliability15%JSON Schema design, Pydantic semantic validation, retry loop, confidence routing, Batch API

Project Files — Code Walkthrough

📐 schemas.py D4 · D5 · JSON Schema Design

Defines the InvoiceExtractionTool — the tool definition passed to tools parameter in messages.create(). Contains both the JSON Schema (API-level syntactic enforcement) and the Pydantic models (application-level semantic enforcement). Two separate validation layers for two different failure modes.

  • Required fields (6): vendor_name, invoice_number, invoice_date, total_amount, currency, confidence_score — without any of these the extraction is useless
  • Nullable fields: currency_detail, line_items, payment_terms — "I don't know" is a legitimate real-world answer; null > hallucinated value
  • "other" enum + currency_detail pattern: prevents silent data loss for unlisted currencies (JPY, CAD, etc.); free-form strings would create normalization chaos downstream
  • Pydantic catches: date not in ISO 8601 format, confidence_score outside [0,1], currency=="other" requires currency_detail, line_item arithmetic (quantity × unit_price = line_total ± 0.01), sum(line_items) ≈ total_amount ±1%
Exam Trap: "tool_use with a JSON schema guarantees the extracted values are correct." Wrong. Tool_use guarantees syntactic validity (fields present, correct types, valid enum). It does not guarantee semantic correctness (total_amount could be 0.0, confidence could be 0.97 on an ambiguous document). Semantic validation = Pydantic layer.
# Design rule for nullable vs required
# Make nullable when "I don't know" is a valid answer
"payment_terms": {"type": ["string", "null"]}

# Required because extraction is useless without it
"total_amount": {"type": "number"}  # in "required" array
📚 few_shot_examples.py D4 · Few-Shot Prompting

Builds the extraction prompt with 4 few-shot examples and a normalization rules block. Few-shot outperforms textual instructions for document extraction because the model pattern-matches to demonstrated transformations under uncertainty rather than re-deriving abstract rules.

ExampleTypeWhat It Teaches
1 — Skyline Office SuppliesHappy pathBaseline extraction, date normalization ("March 12, 2024"→"2024-03-12"), currency_detail=null for USD
2 — Jake's Freelance DesignFormat variation"five hundred dollars"→500.00, "no rush"→null, quantity=1 for "a dozen variations", confidence=0.78
3 — Scholarium AcademicEdge caseNon-standard layout (bibliography style), numbered-list format, "09/30/2024"→"2024-09-30"
4 — Maria's Catering (draft)Missing datainvoice_number="UNKNOWN", date="1900-01-01" sentinel, confidence=0.42 for ambiguous document
  • NORMALIZATION_RULES block embedded in every prompt: dates→YYYY-MM-DD, strip currency symbols, written-out numbers→numeric, null over fabrication
  • Prompt order: task description → normalization rules → examples → target document → closing instruction
  • XML tags delimit examples: <example_document>, <example_extraction>, <example_notes>
Exam Trap: "Adding more examples always improves accuracy." Not necessarily. 4 carefully chosen examples (happy path, informal, unusual format, missing data) outperform 20 slight variations of the same standard invoice. More examples = more context window tokens + diminishing returns.
⚗️ extractor.py D4 · Forced Tool Use

Core extraction module. Uses tool_choice = {"type": "tool", "name": "extract_invoice_data"} to force Claude to always call exactly that tool. No prose output, no markdown, no choice — guaranteed structured output every call.

  • tool_choice: {"type": "tool"} — Claude MUST call this exact tool; stop_reason is always "tool_use"; block.input is already a Python dict
  • tool_choice: {"type": "auto"} — Claude may write text instead of calling a tool
  • tool_choice: {"type": "any"} — Claude must call one of the tools but can choose which
  • MockClaudeClient has 3 modes: normal (happy path), force_error (line items sum ≠ total, tests retry), force_semantic_error (confidence=0.97 for ambiguous doc, proves tool_use ≠ semantic correctness)
# Problems solved by forced tool use:
# 1. Claude might write a text response instead of JSON
# 2. Claude might add explanatory text before/after JSON
# 3. JSON might be wrapped in ```json ... ``` code blocks

tool_choice = {"type": "tool", "name": "extract_invoice_data"}
validator.py D5 · Two-Layer Validation + Retry Loop

Two-layer validation architecture and retry loop. Layer 1 (JSON Schema, API-level) catches syntax errors before you receive the response. Layer 2 (Pydantic, application-level) catches semantic errors after. The retry loop provides specific, quantified feedback to maximize correction success.

Document ↓ [extract_document()] ↓ [validate_extraction()] ├── VALID → return (attempts=1, final_valid=True) └── INVALID → [retry_with_feedback(doc + bad_extraction + SPECIFIC errors)] ↓ [validate_extraction()] ├── VALID → return (attempts=2, final_valid=True) └── INVALID → return best result (final_valid=False)
  • Good retry feedback: "sum(line_items)=$145, total_amount=$200, difference=$55. Look for additional line items, fees, or taxes."
  • Bad retry feedback: "Your extraction failed. Please try again." — Claude has nothing to work with
  • Retry helps: arithmetic errors, date format errors — info exists in document but was missed
  • Retry doesn't help: missing invoice number, ambiguous amounts — information doesn't exist; retrying risks hallucination
  • max_retries=2 — beyond 2 attempts, rare improvement; adds cost and latency
Exam Trap: "If validation fails, always retry until the model gets it right." Wrong for 3 reasons: (1) missing data can't be fixed by retrying; (2) beyond 2 attempts unlikely to succeed; (3) retrying missing data can produce hallucinated values that pass validation but are factually wrong.
🔀 confidence_router.py D5 · Confidence-Based Routing

Routes extractions to AUTO_PROCESS, HUMAN_REVIEW, or REJECT based on confidence score, validation errors/warnings, and invoice amount. Confidence scores are self-assessments, not calibrated probabilities — requires regular auditing.

RouteConditionHuman Involvement
AUTO_PROCESSconfidence ≥ 0.85 AND valid AND no warnings AND amount < $10K5% random audit only
HUMAN_REVIEW0.60 ≤ confidence < 0.85 OR warnings OR amount ≥ $10KExplicit review required
REJECTconfidence < 0.60 OR any validation errorsInvestigate and re-extract
  • 5% random audit of AUTO_PROCESS items catches systematic miscalibration before it compounds
  • 97% overall accuracy can hide 40% error rate on a specific document subtype (3% of volume)
  • Stratified sampling: audit across document types, not just from the full pool proportionally
  • NOT valid routing signals: sentiment/tone of document, response length, secondary classifiers
📦 batch_processor.py D5 · Message Batches API

Batch processing with selective retry. Uses the Anthropic Message Batches API for ~50% cost reduction on large-volume, non-time-critical workloads.

ParameterValue
Cost reduction~50% vs. standard API (~$0.003→~$0.0015 per 1K input tokens)
Maximum processing time24 hours (hard SLA — not a typical time)
Maximum batch size100,000 requests per batch
Result orderMay differ from submission order — use custom_id to correlate
  • custom_id pattern: "doc-{document_type}-{index:04d}" — encodes type (targeted failure analysis) + zero-padded index (lexicographic sorting)
  • Selective retry: 1,000 docs, 5% failure = 50 failures → re-submit only 50; saves 950 API calls (95% retry cost reduction)
  • context_length_exceeded failure: apply 50% truncation (headers at top), resubmit with "{original_id}-retry"
  • SLA calculation: deadline 30h from now → submit within 6h. Deadline 18h → batch API cannot guarantee delivery; use synchronous API
Exam Trap: "Use Batch API for real-time document processing during user sessions." Wrong. Batch API has up to 24h processing time. Real-time sessions need seconds/minutes — use the synchronous messages.create() API.
▶️ run_demo.py Demo — 3 Sample Documents

Demonstrates the full pipeline on 3 sample documents from sample_docs/, covering the complete range of extraction scenarios.

  • invoice_standard.txt (Northstar Consulting) — clean professional invoice, happy path, confidence ~0.95
  • invoice_informal.txt (Dan Kowalski) — handwritten-style, written-out numbers, no formal structure, confidence ~0.78
  • invoice_ambiguous.txt (Waverly Creative Studio) — draft with missing fields, approximate amounts, "DO NOT PROCESS" warning, routes to REJECT
⚙️ config.py Configuration

Pipeline configuration: routing thresholds (0.85 for auto-process, 0.60 for reject), retry limits (max_retries=2), batch settings, model selection. Centralizes all tunable parameters so thresholds can be adjusted as ground truth calibration data accumulates.

  • Thresholds are starting points, not universal constants — calibrate against labeled data
  • Lower auto-process threshold (e.g., 0.90) for high-stakes contexts (regulatory, healthcare)
  • Increase audit rate (above 5%) for document types with historically higher error rates

Practice Questions (15)

Source: explanation ex3.md

1
An invoice extraction pipeline uses Claude's tool_use with a JSON schema marking total_amount as a required number. In production, 3% of extractions have total_amount = 0.0 despite invoices clearly stating totals. The developer concludes the JSON schema is not working. Are they correct?
D5
+
  • A) Correct — if the schema marks total_amount as required, Claude must return the correct total
  • B) Incorrect — tool_use guarantees total_amount is a number (syntactic validity), but 0.0 is a valid number; the schema cannot enforce accuracy
  • C) Switch from tool_use to JSON-in-prompt for better semantic validation
  • D) The problem resolves by adding "minimum": 0.01 to the JSON schema
Correct: B — Tool_use guarantees syntactic validity. Zero (0.0) is a valid number; the schema accepts it. This is a semantic error: structure is valid, value is wrong. JSON-in-prompt (C) provides no validation — strictly worse. "minimum": 0.01 (D) catches explicit zeros but not misread amounts (150.00 vs correct 1500.00). Fix: Pydantic layer checking total_amount > 0 and line items sum.
2
Your schema marks payment_terms as a required string. In production, Claude returns "Net 30" for invoices with no payment terms. A colleague suggests making it nullable. Your manager argues that will cause downstream systems to crash on null values. What is correct?
D5
+
  • A) Keep as required; add a Pydantic validator to check if the value was explicitly in the document
  • B) Make nullable; downstream systems should be updated to handle null because null is more honest than fabricated "Net 30"
  • C) Keep as required; add a few-shot example showing empty string "" when absent
  • D) Make nullable; use a default "Net 30" in the downstream system so null never reaches it
Correct: B — Fabricated "Net 30" flowing silently into downstream systems is a data integrity problem — vendor might have "Due on receipt" terms. Making the field nullable is correct; the downstream system fix is the right engineering response. Option A can't work — Pydantic validates structure, not provenance. Option D solves the crash but preserves false data.
3
Your currency schema uses "enum": ["USD","EUR","GBP"]. In testing, Japanese vendor invoices return currency: "USD" despite the invoice clearly stating "¥50,000". A colleague proposes a free-form currency_string field. What is the correct fix?
D4
+
  • A) The free-form currency_string approach — allows Claude to capture any currency
  • B) Add "JPY" to the enum — fixes the specific problem without changing structure
  • C) Add "other" to the enum plus a currency_detail field for the ISO 4217 code
  • D) The current schema is correct; use post-processing to detect currency: "USD" from Japanese vendors
Correct: C — Free-form strings (A) cause normalization chaos ("Japanese Yen", "JPY", "Yen", "¥" all mean the same thing). Adding only "JPY" (B) fixes one case but not hundreds of others. Post-processing (D) is fragile and doesn't scale. The "other" + currency_detail pattern maintains standardization for common currencies with a structured escape hatch for all others.
4
You're building a document feature where users can either extract structured data or ask natural language questions. Which tool_choice is appropriate for each use case?
D4
+
  • A) Both should use {"type": "tool", "name": "extract_invoice_data"} for consistent output
  • B) Extraction: {"type": "tool", "name": "extract_invoice_data"}; Q&A: {"type": "auto"}
  • C) Both should use {"type": "auto"} for flexibility
  • D) Extraction: {"type": "any"}; Q&A: {"type": "auto"}
Correct: B — For extraction, always force the tool call (type: "tool") to guarantee structured output. For Q&A, Claude should respond naturally in text — type: "auto" allows text or tool based on what's appropriate. Option A breaks Q&A by forcing structured output. Option C breaks extraction by allowing Claude to skip the tool. Option D uses "any" for extraction — less specific, allows calling wrong tools.
5
Your prompt includes: "Convert all amounts to numeric values." In production, Claude handles "$1,250.00" correctly but fails to normalize "eight hundred and fifty dollars" to 850.00. Most effective fix?
D4
+
  • A) Rewrite the instruction: "Convert written-out numbers such as 'eight hundred and fifty dollars' to numeric values like 850.00"
  • B) Add a few-shot example showing the transformation: "eight hundred and fifty dollars" → total_amount: 850.00
  • C) Add a Pydantic validator that uses a number-words library to detect non-normalized values
  • D) Switch to chain-of-thought prompting before extracting
Correct: B — Textual instructions work for clear cases but fail on novel variants. A few-shot example that demonstrates the specific transformation removes all ambiguity — Claude pattern-matches to the example. Option A is better than the original but still abstract. Option C catches the problem after extraction but doesn't fix it. Option D (CoT) can help with reasoning but the core issue is demonstrating normalization concretely.
6
Your pipeline needs to handle invoices with "about a dozen poster designs" and "roughly forty hours of work." Which few-shot example type best addresses this?
D4
+
  • A) A happy-path example with a clearly formatted standard invoice
  • B) A format-variation example showing informal amounts normalized to numeric values with lower confidence
  • C) An edge-case example showing an academic license invoice
  • D) A missing-data example showing null handling for absent fields
Correct: B — Format-variation examples specifically address informal language. Example 2 (Jake's Freelance) demonstrates: "five hundred dollars"→500.00, "a dozen variations"→quantity=1 with description, confidence_score=0.78 for informal nature. Happy-path (A) establishes the baseline only. Academic license (C) addresses unusual layouts. Missing-data (D) addresses null handling.
7
After validation you have two errors: (1) sum(line_items)=$245, total_amount=$300 — difference $55; (2) invoice_number is null — document states "invoice number to be filled in later." How should you handle these?
D5
+
  • A) Retry for both — Claude can find correct values with specific feedback
  • B) Retry only for error (1); route error (2) to human review with note that invoice number is pending assignment
  • C) Don't retry for either — both indicate missing data and retrying produces hallucinations
  • D) Retry for error (2); don't retry for error (1) because arithmetic can't be corrected by retry
Correct: B — Error (1) is arithmetic: the $55 difference suggests Claude missed a line item that IS in the document. Specific feedback gives Claude a target to find — retry is appropriate. Error (2) is genuine missing data: document explicitly says invoice number will be assigned later. It doesn't exist yet. Retrying produces the same null or a hallucinated number. Human review is correct.
8
Which retry prompt structure is most effective after a validation failure?
D4
+
  • A) "Your previous extraction failed. Please try again and be more careful."
  • B) "Please re-extract this invoice. Pay special attention to payment terms and line items."
  • C) "Your previous extraction had this error: sum(line_items)=$145, total_amount=$200 (difference: $55). Re-read the document for additional line items, fees, or taxes. Your previous extraction: [JSON]. Original document follows."
  • D) "The line items in your previous extraction were incorrect. Please recalculate them from scratch."
Correct: C — Effective retry prompts include: (1) specific error with quantified values, (2) actionable guidance on where to look, (3) previous incorrect extraction for comparison, (4) original document for re-reading. Option A is vague. Option B is slightly better but still abstract. Option D tells Claude the items are wrong but gives no information about how wrong or where to look.
9
A customer support team needs to acknowledge incoming claim documents within 2 hours with an initial extraction result. Which processing approach is correct?
D5
+
  • A) Use the Batch API — it offers 50% cost savings and large volume
  • B) Use the Batch API with a 1-hour submission window — results should arrive within 24 hours
  • C) Use the synchronous API — the 2-hour requirement cannot be met by the Batch API with its 24-hour maximum
  • D) Use Batch API for claims before noon, synchronous for afternoon claims
Correct: C — The Batch API has a maximum processing time of 24 hours — it cannot guarantee results within 2 hours. Any use case with a response SLA under 24 hours must use the synchronous API. Option A identifies the cost benefit but ignores the SLA. Option B: submitting immediately doesn't change the 24-hour maximum. Option D is operational complexity without solving the fundamental SLA mismatch.
10
You submit 1,000 invoice extractions as a batch. When results arrive 8 hours later, they are in a different order than submitted. Which feature handles this correctly?
D5
+
  • A) The batch ID ensures results are in submission order
  • B) The custom_id you assigned to each request allows matching each result to its source document regardless of return order
  • C) The timestamp in each result indicates the original submission order
  • D) The Anthropic API guarantees FIFO ordering for batch results
Correct: B — Batch results may arrive in any order — the API explicitly states this. custom_id is the mechanism for correlating each result to its source document. Index the results by custom_id, then reconstruct original order using the index embedded in the ID (e.g., "doc-standard-0042" → index 42). Batch ID (A) identifies the batch, not individual items. FIFO ordering (D) is not guaranteed.
11
You submit a batch of 2,000 invoices. Results arrive with 1,900 successes and 100 failures (all context_length_exceeded). How should you handle the retry?
D5
+
  • A) Re-submit all 2,000 with a smaller max_tokens setting
  • B) Re-submit only the 100 failed documents with truncation applied (take first 50% of each document's text)
  • C) Don't retry — context_length_exceeded means the documents are too complex
  • D) Re-submit all 2,000 with a different model with a larger context window
Correct: B — Selective retry is the core efficiency of the custom_id pattern. Re-submitting all 2,000 (A, D) wastes 1,900 API calls on already-successful documents. Option C incorrectly treats context failures as permanent — truncation frequently recovers the most important fields (headers are usually at the top). Note: max_tokens affects output length, not input context window size (A is also wrong on this point).
12
A finance team needs all invoice extractions by 9:00 AM Monday. They receive invoices Friday afternoon and plan to use Batch API (max 24h). What is the latest time Friday they can submit?
D5
+
  • A) 9:00 AM Friday — 24 hours before Sunday 9:00 AM, which is before the Monday deadline
  • B) 5:00 PM Friday — end of business
  • C) Any time Friday — the batch always completes well before 24 hours
  • D) The Batch API should not be used; process synchronously as invoices arrive Friday
Correct: A — Batch API maximum is 24 hours. To guarantee results by 9AM Monday, submit no later than 9AM Sunday. Submitting at 9AM Friday (48 hours before deadline) gives a 24-hour buffer for the batch plus 24 hours buffer for failure handling/retry batch. 5PM Friday (B) is only ~40 hours before Monday 9AM — no guarantee with a 24h maximum. "Often faster" (C) is not a guarantee.
13
Your pipeline shows 97% overall accuracy in weekly audits. Manager proposes reducing the auto-process audit rate from 5% to 1% to save human review costs. What is the risk?
D5
+
  • A) No significant risk — 97% accuracy means the pipeline is reliable
  • B) Reducing audit rate increases per-unit audit costs without improving accuracy
  • C) The 97% overall accuracy may hide much higher error rates for specific document subtypes; reducing audit rate makes it harder to detect systematic miscalibration
  • D) Reducing to 1% violates Batch API terms of service
Correct: C — 97% overall accuracy can mask a 40% error rate for a small but important subtype. Example: 940 standard invoices at 99% accuracy (9 errors) + 60 academic licenses at 60% accuracy (24 errors) = 96.7% overall — but sampling mostly from the 94% standard invoices would rarely catch the academic license problem. Stratified sampling specifically audits across document types.
14
Your team debates adding a "formality score" — a second model call that rates how formal the invoice language is — as a routing signal, arguing formal invoices are more reliable. What is the correct assessment?
D5
+
  • A) Good addition — formal language correlates with structured data, which correlates with extraction accuracy
  • B) Should replace confidence_score — more objective than model self-assessment
  • C) Not a reliable signal — formality of language does not determine extraction accuracy; the existing confidence_score already captures extraction uncertainty directly
  • D) Should be used only for invoices below $1,000 to avoid over-engineering
Correct: C — Formality correlates weakly with data quality. Jake's informal invoice is perfectly extractable — all required data is present, just expressed informally. A formal-looking document with fabricated amounts would score high on formality but fail validation. Adding a second model call doubles cost without reliable signal. confidence_score from the extraction call is a direct measure of extraction uncertainty.
15
Claude extracts vendor_name: "Acme Corp" but a line item description reads "Payment from Pinnacle Tech — per agreement dated Oct 15." Your validation detects this conflict. Which pattern best handles this?
D5
+
  • A) Reject the extraction — conflicting information means it is invalid
  • B) Route to auto_process — vendor_name and invoice_number are structurally valid
  • C) Add a conflict_detected flag to metadata and route to human review with the specific conflict noted
  • D) Retry with feedback asking Claude to re-read and confirm the vendor name
Correct: C — The conflict_detected pattern surfaces the inconsistency to human reviewers. Acme Corp might be correct (invoicing on behalf of a pass-through agreement with Pinnacle Tech) or the extraction might be wrong. Only a human with business context can resolve this. Option A is too aggressive — extraction may be correct. Option B ignores a real quality signal. Option D won't help — the ambiguity is business-context, not a parsing error.