Exercise 4: Multi-Agent Research Pipeline (D1 · D2 · D5)
🤖 Exercise 4 of 4

Multi-Agent Research Pipeline

A production multi-agent system for researching competitive landscapes. A coordinator spawns parallel subagents for company research, market analysis, and financial data gathering. Results are deduplicated, conflicts surfaced, coverage gaps annotated, and state persisted for fault-tolerant resumption. Covers D1, D2, and D5 exam content comprehensively.

D1 Agent Architecture — 27% D2 Tool & MCP Design — 18% D5 Context & Reliability — 15%

Exam Domains Covered

| Domain | Name | Weight | Coverage |
|---|---|---|---|
| D1 | Agent Architecture | 27% | Hub-and-spoke topology, coordinator/subagent roles, parallel spawning, anti-abort pattern, context isolation |
| D2 | Tool & MCP Design | 18% | AgentDefinition config, allowed-tools least privilege, explicit context passing in Task prompts |
| D5 | Context & Reliability | 15% | Conflict detection (epistemic honesty), coverage annotations, state persistence with manifest.json |

System Architecture

               ┌─────────────────────────────┐
               │         COORDINATOR         │
               │   (hub: orchestrates all)   │
               │   Task tool × N subagents   │
               └────────────┬────────────────┘
                            │ spawns (parallel)
        ┌───────────────────┼───────────────────┐
        ▼                   ▼                   ▼
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│    COMPANY    │   │    MARKET     │   │   FINANCIAL   │
│  RESEARCHER   │   │    ANALYST    │   │    ANALYST    │
│               │   │               │   │               │
│ web_search    │   │ web_search    │   │ web_search    │
│ web_fetch     │   │ market_data   │   │ financial_data│
└───────────────┘   └───────────────┘   └───────────────┘
        │                   │                   │
        └───────────────────┼───────────────────┘
                            ▼
                  ┌───────────────────┐
                  │   synthesis.py    │
                  │  conflict_detect  │
                  │  coverage_annot   │
                  └───────────────────┘

Project Files — Code Walkthrough

📋 models.py D1 · D2 · Data Models

Defines all shared data structures: AgentDefinition, ResearchTask, AgentResult, and ResearchReport. The AgentDefinition config pattern is the exam's primary tool design concept — it enforces least-privilege access at the structural level.

  • AgentDefinition fields: name, role, system_prompt, allowed_tools (list), max_tokens, temperature — all configuration for one subagent type
  • allowed_tools: company researcher gets ["web_search", "web_fetch"]; financial analyst gets ["web_search", "financial_data"]; market analyst gets ["web_search", "market_data"] — no agent can use tools outside its mandate
  • ResearchTask fields: task_id, task_type, query, context (injected explicitly, not inherited), priority, max_retries
  • AgentResult fields: task_id, agent_type, success, content, sources, coverage (full/partial/none), error_type (for failure analysis)
Exam note: Allowed-tools least privilege appears in both D1 (agent design) and D2 (tool design). An agent with web_search but without file_system tools cannot leak sensitive data by accidentally writing to disk — the restriction is architectural, not just a prompt instruction.
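The field lists above can be sketched as Python dataclasses. This is a minimal illustration under stated assumptions, not the exercise's actual models.py: the field names come from the bullets, while the types, default values, the tuple used for allowed_tools (the text describes a list), and the can_use helper are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class AgentDefinition:
    """Configuration for one subagent type (types and defaults are assumed)."""
    name: str
    role: str
    system_prompt: str
    allowed_tools: tuple[str, ...]  # tuple so the frozen config cannot be mutated
    max_tokens: int = 2048
    temperature: float = 0.0

    def can_use(self, tool: str) -> bool:
        # Hypothetical helper: the allow-list is checked in code, so the
        # restriction does not depend on the prompt being obeyed.
        return tool in self.allowed_tools


@dataclass
class ResearchTask:
    task_id: str
    task_type: str
    query: str
    context: dict[str, str]  # injected explicitly, never inherited
    priority: int = 0
    max_retries: int = 3


@dataclass
class AgentResult:
    task_id: str
    agent_type: str
    success: bool
    content: str = ""
    sources: list[str] = field(default_factory=list)
    coverage: str = "none"            # "full" | "partial" | "none"
    error_type: Optional[str] = None  # e.g. "TRANSIENT", "PERMISSION", "BUSINESS"


financial_analyst = AgentDefinition(
    name="financial_analyst",
    role="Financial data gathering",
    system_prompt="You analyze company financials.",
    allowed_tools=("web_search", "financial_data"),
)
```

With this shape, a tool dispatcher can call financial_analyst.can_use("file_system") before executing any tool request and refuse when it returns False, which is what makes the restriction architectural rather than a prompt instruction.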
🤖 subagents.py D1 · Subagent Execution + Context Isolation

The SubagentExecutor class runs individual subagent tasks. The critical exam concept here is context isolation — each subagent is a fresh Claude instance that receives only its system prompt and the Task prompt. It has no memory of coordinator decisions, other subagents, or previous pipeline runs.

  • Context isolation: no shared memory between subagents; coordinator state is invisible to subagents; each subagent starts fresh
  • Explicit context injection: company_name, industry, focus areas are passed in the Task prompt text — not assumed from coordinator conversation history
  • Context budget management: company_researcher max_tokens=4096 (broad coverage); financial_analyst max_tokens=2048 (structured data); market_analyst max_tokens=3072
# WRONG: assuming subagent inherits coordinator context
task_prompt = "Research the company's financials."

# RIGHT: inject all context explicitly in the Task prompt
company_name = "TechCorp Inc."  # example values; supplied by the coordinator
industry = "enterprise SaaS"
task_prompt = f"""
Company: {company_name}
Industry: {industry}
Focus areas: revenue, growth rate, profitability
Task: Research recent financial performance.
Sources: SEC filings, earnings reports.
Return format: JSON with fields: revenue, growth, key_metrics
"""
Exam Trap: "Subagents can access the coordinator's conversation history via shared memory." False. Each subagent is a completely isolated Claude instance. It only knows what was passed in its system prompt + the Task prompt. The coordinator must explicitly pass all relevant context.
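One way to make explicit context injection hard to forget is to build every Task prompt through a helper that refuses to render when context is missing. The sketch below is illustrative only: build_task_prompt and REQUIRED_CONTEXT are hypothetical names, and the required keys are assumed from the context fields listed above.

```python
# Keys assumed from the "explicit context injection" bullet above.
REQUIRED_CONTEXT = ("company_name", "industry", "focus_areas")


def build_task_prompt(task: str, context: dict[str, str]) -> str:
    """Render a self-contained Task prompt; fail fast if context is missing."""
    missing = [key for key in REQUIRED_CONTEXT if not context.get(key)]
    if missing:
        raise ValueError(f"Task prompt is missing context: {', '.join(missing)}")
    return (
        f"Company: {context['company_name']}\n"
        f"Industry: {context['industry']}\n"
        f"Focus areas: {context['focus_areas']}\n"
        f"Task: {task}\n"
    )
```

Failing in the coordinator is cheaper than discovering the gap when a subagent replies with a question about which company it should research.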
🎯 coordinator.py D1 · Hub-and-Spoke + Parallel Spawning

The ResearchCoordinator is the hub. It orchestrates the pipeline: checks state, spawns parallel subagents via Task tool, waits for results, handles failures via the anti-abort pattern, then triggers synthesis. The exam's most tested D1 concept: parallel spawning and failure handling.

Parallel Spawning Rule: Multiple Task calls in ONE coordinator response run concurrently. This is the difference between 3 × 45-second tasks taking 45 seconds total (parallel) vs. 135 seconds (sequential). The coordinator must emit all Task calls in a single response to get parallelism.
  • Anti-abort pattern: if 1 of 3 subagents fails, coordinator uses the 2 successful results, annotates gaps, continues pipeline rather than rejecting all work done
  • Failure triage: TRANSIENT errors (rate limit, timeout) → retry with exponential backoff; PERMISSION errors → skip with annotation; BUSINESS errors → annotate and continue
  • State checkpoint: save results to manifest.json before synthesis so a crash during synthesis doesn't re-run subagent tasks
  • Hub role: only the coordinator has the Task tool; subagents are workers with domain-specific tools only
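import asyncio

# Parallel spawning: ALL Task calls in ONE response.
# These run concurrently, not sequentially.
# (await must live inside an async function; the wrapper name is illustrative.)
async def spawn_parallel(executor, company_task, market_task, financial_task):
    results = await asyncio.gather(
        executor.run(company_task),    # Task call 1
        executor.run(market_task),     # Task call 2
        executor.run(financial_task),  # Task call 3
        return_exceptions=True,        # a failure is returned as an exception object, not raised
    )
    return results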
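The failure triage and anti-abort bullets above can be sketched as follows. This is a hedged illustration, not the exercise's coordinator.py: the exception classes, run_with_retry, run_pipeline, and fake_run are hypothetical names, and the backoff delays are shortened so the example finishes quickly.

```python
import asyncio


class TransientError(Exception):
    """Rate limit or timeout: worth retrying."""


class PermissionDenied(Exception):
    """Missing credentials or subscription: retrying cannot help."""


async def run_with_retry(run, task_id, max_retries=3, base_delay=0.01):
    """Retry TRANSIENT failures with exponential backoff; let anything else propagate."""
    for attempt in range(max_retries + 1):
        try:
            return await run(task_id)
        except TransientError:
            if attempt == max_retries:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)


async def run_pipeline(run, task_ids):
    """Spawn all tasks concurrently and keep whatever succeeded (anti-abort)."""
    outcomes = await asyncio.gather(
        *(run_with_retry(run, t) for t in task_ids),
        return_exceptions=True,
    )
    results, gaps = {}, {}
    for task_id, outcome in zip(task_ids, outcomes):
        if isinstance(outcome, Exception):
            # Annotate the gap instead of discarding the other agents' work.
            gaps[task_id] = f"coverage: none ({type(outcome).__name__}: {outcome})"
        else:
            results[task_id] = outcome
    return results, gaps


async def fake_run(task_id):
    # Stand-in for a real subagent call.
    if task_id == "financial":
        raise PermissionDenied("financial_data API subscription required")
    return f"{task_id} findings"


results, gaps = asyncio.run(run_pipeline(fake_run, ["company", "market", "financial"]))
```

Here the company and market results survive, and the financial task ends up in gaps with the reason attached, ready to be turned into a coverage annotation during synthesis.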
⚠️ conflict_detector.py D5 · Conflict Detection + Epistemic Honesty

Detects contradictions between subagent results for the same factual claim. When two sources disagree, the system preserves both values rather than silently resolving to one. This is the "epistemic honesty" principle — annotated uncertainty is better than false confidence.

| Conflict Type | Example | Resolution |
|---|---|---|
| Numeric divergence >10% | Source A: revenue $2.1B; Source B: $3.4B | Preserve both, add conflict_flag, route to human |
| Temporal: sources 1+ year apart | 2022 report: 500 employees; 2024 report: 1,200 employees | Keep both, add temporal_explanation, both valid |
| Factual contradiction | Founded 2015 vs. Founded 2018 | Preserve both, flag conflict, don't guess |
  • ConflictingClaim structure: field_name, value_a (with source_a, date_a), value_b (with source_b, date_b), conflict_type, recommended_action
  • Temporal threshold: sources less than 1 year (365 days) apart → flag as conflict; sources 1+ year apart → temporal_explanation (both may be correct at their respective times)
  • 10% numeric threshold: small variation (<10%) for revenue/users = rounding differences, not a real conflict; large variation (>10%) = flag for review
Exam Trap: "When two sources conflict, pick the more recent source." Wrong. More recent is usually better for time-varying facts (employee count), but not for fixed facts (founding date), where recency says nothing about which source is right. The system should preserve both values and let humans or downstream logic decide based on context.
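The two thresholds can be combined into a small classifier for numeric fields. This is a sketch under assumptions: Claim and classify are hypothetical names, divergence is measured relative to the smaller value (which is what yields roughly 62% for $2.1B vs. $3.4B), and a difference of exactly 10% is treated as non-conflicting because the text only specifies ">10%" and "<10%". It covers magnitude-type fields only; categorical facts such as a founding year need a plain equality check instead.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

NUMERIC_THRESHOLD = 0.10        # relative difference above which values diverge
TEMPORAL_THRESHOLD_DAYS = 365   # sources at least this far apart may both be right


@dataclass
class Claim:
    value: float
    source: str
    as_of: date


def classify(a: Claim, b: Claim) -> Optional[str]:
    """Return None (consistent), "temporal_explanation", or "numeric_conflict"."""
    low, high = sorted((abs(a.value), abs(b.value)))
    if high == 0:
        return None  # both values are zero
    # Relative to the smaller value; a zero baseline counts as infinite divergence.
    divergence = (high - low) / low if low else float("inf")
    if divergence <= NUMERIC_THRESHOLD:
        return None  # treated as rounding noise
    if abs((a.as_of - b.as_of).days) >= TEMPORAL_THRESHOLD_DAYS:
        return "temporal_explanation"  # both may be valid at their own dates
    return "numeric_conflict"  # preserve both values and route to human review
```

Whatever the classifier returns, both values are kept; the label only decides how the pair is annotated in the report.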
📊 synthesis.py D5 · Coverage Annotations + Lost-in-Middle

Synthesizes results from all subagents into a ResearchReport. Implements coverage annotations for incomplete data and the "KEY FINDINGS at top, ACTION ITEMS at bottom" layout for long-context reliability.

Coverage Annotation Protocol: Every section gets a coverage field: "full" (complete, verified), "partial" (some data missing or unverified), or "none" (completely unavailable). Silent gaps — missing sections with no annotation — are worse than annotated gaps because they are invisible.
  • Lost-in-the-middle mitigation: KEY FINDINGS at top of report, ACTION ITEMS at bottom, CONFLICTS at top of relevant sections — the model attends to beginning and end of context most reliably
  • Partial section annotation: "Company Overview [partial — financial_data agent failed; revenue data missing; market position data complete]"
  • None annotation: "Regulatory Compliance [coverage: none — no relevant tool available for regulatory database lookup]"
  • Human review trigger: any conflict → add to review_required list with explanation; any partial coverage on critical fields → flag
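A minimal sketch of the annotation protocol and report layout described above. The function names and the exact bracket format are assumptions made for illustration; the point is that a section cannot be emitted without a coverage level, and that partial or missing coverage must carry a reason.

```python
def annotate(title: str, coverage: str, reason: str = "") -> str:
    """Return a section heading with its coverage annotation."""
    if coverage not in ("full", "partial", "none"):
        raise ValueError(f"unknown coverage level: {coverage}")
    if coverage == "full":
        return f"{title} [coverage: full]"
    if not reason:
        # A gap without an explanation is nearly as opaque as a silent gap.
        raise ValueError("partial/none coverage needs a reason")
    return f"{title} [coverage: {coverage}; {reason}]"


def build_report(key_findings, sections, action_items):
    """Lay out the report with KEY FINDINGS first and ACTION ITEMS last."""
    lines = ["KEY FINDINGS", *(f"- {k}" for k in key_findings), ""]
    for title, coverage, reason, body in sections:
        lines += [annotate(title, coverage, reason), body, ""]
    lines += ["ACTION ITEMS", *(f"- {a}" for a in action_items)]
    return "\n".join(lines)


report = build_report(
    key_findings=["Revenue figures conflict between sources"],
    sections=[
        ("Regulatory Compliance", "none",
         "no relevant tool available for regulatory database lookup", ""),
    ],
    action_items=["Route revenue conflict to human review"],
)
```

Placing the key findings and action items at the two ends of the report follows the lost-in-the-middle guidance: the content most likely to be acted on sits where attention is most reliable.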
💾 state_persistence.py D5 · State Persistence + Fault Tolerance

Fault-tolerant state management. Saves each agent's result to a per-agent JSON file and tracks completion in manifest.json. On resumption, completed tasks are skipped, failed/running tasks are restarted, and tasks not in the manifest run fresh — enabling cheap incremental recovery from crashes.

  • manifest.json: top-level index of task IDs → status (completed/running/failed/not_started), timestamps, file paths to results
  • Per-agent result files: results/{task_id}.json — complete AgentResult object, loaded on resume
  • Resume logic: "completed" → skip and load from file; "running" → restart (crash during execution); "failed" → restart; not in manifest → run fresh
  • Crash safety: save to file BEFORE updating manifest → prevents corrupted partial state
# Resume logic from manifest
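# all_task_ids: every task the pipeline expects (assumed to be defined by the
# coordinator); iterating it, not the manifest, catches tasks with no entry.
for task_id in all_task_ids:
    status = manifest.get(task_id)  # None → not in manifest
    if status == "completed":
        results[task_id] = load_result(task_id)  # skip re-run
    else:
        # "running" (crashed mid-execution), "failed", "not_started",
        # or missing from the manifest → run it (again)
        pending.append(task_id)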
Why a "running" status matters: A "running" status found on resume means the task started but never wrote a result, because the process was killed mid-execution. It's safer to restart the task than to treat it as complete with no result file.
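The write ordering and resume rules can be sketched end to end as below. This is an assumption-laden illustration rather than the exercise's state_persistence.py: the function names and manifest entry shape are hypothetical, and the temp-file-plus-rename step is a common technique added here to avoid half-written files (os.replace is atomic when source and destination are on the same filesystem).

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write to a temp file, then rename, so readers never see a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    os.replace(tmp_name, path)


def load_manifest(root: Path) -> dict:
    manifest_path = root / "manifest.json"
    return json.loads(manifest_path.read_text()) if manifest_path.exists() else {}


def save_result(root: Path, task_id: str, result: dict) -> None:
    """Persist the result file FIRST, then mark the task completed in the manifest."""
    result_path = root / "results" / f"{task_id}.json"
    write_json_atomic(result_path, result)
    manifest = load_manifest(root)
    manifest[task_id] = {"status": "completed", "path": str(result_path)}
    write_json_atomic(root / "manifest.json", manifest)


def plan_resume(root: Path, task_ids: list[str]) -> tuple[dict, list[str]]:
    """Load completed results; everything else goes back on the pending list."""
    manifest = load_manifest(root)
    results, pending = {}, []
    for task_id in task_ids:
        entry = manifest.get(task_id, {})
        if entry.get("status") == "completed":
            results[task_id] = json.loads(Path(entry["path"]).read_text())
        else:
            pending.append(task_id)  # running, failed, or never started
    return results, pending
```

If the process dies between the two writes in save_result, the manifest still lacks a "completed" entry, so the task is simply re-run; the reverse order could leave a "completed" entry pointing at a file that does not exist.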
⚙️ config.py Configuration

Pipeline configuration: max parallel agents (3), retry limits per agent (3), conflict numeric threshold (10%), temporal conflict threshold (365 days), output paths. The 3-agent parallel limit reduces the risk of hitting API rate limits while still letting all three research agents run concurrently.

  • Increase max_parallel_agents for data pipelines with many independent tasks
  • Lower numeric_conflict_threshold (e.g., 5%) for high-stakes financial analysis where small discrepancies matter
  • Adjust temporal_conflict_threshold based on industry — fast-moving tech sectors may warrant 180 days
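The settings above might be grouped into a single validated config object. This is a sketch only: max_parallel_agents, numeric_conflict_threshold, and temporal_conflict_threshold come from the text, while the class name, the other field names, the "_days" suffix, and the default output path are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    max_parallel_agents: int = 3
    max_retries_per_agent: int = 3
    numeric_conflict_threshold: float = 0.10     # 10%
    temporal_conflict_threshold_days: int = 365
    output_dir: str = "output"                   # assumed path

    def __post_init__(self):
        # Reject values that would silently disable a safeguard.
        if self.max_parallel_agents < 1:
            raise ValueError("max_parallel_agents must be at least 1")
        if not 0 < self.numeric_conflict_threshold < 1:
            raise ValueError("numeric_conflict_threshold must be a fraction between 0 and 1")


default_config = PipelineConfig()
strict_finance = PipelineConfig(numeric_conflict_threshold=0.05)
fast_moving_tech = PipelineConfig(temporal_conflict_threshold_days=180)
```

Keeping the thresholds in one immutable object means the conflict detector and the coordinator read the same values, and a tuning change is a one-line override.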
▶️ run_demo.py Demo — 3 Research Scenarios

Demonstrates the pipeline on 3 scenarios: fresh run (all agents succeed), resume (simulated crash after 2 of 3 agents), and conflict detection (company researcher and financial analyst return conflicting revenue figures).

  • Scenario 1 — fresh_run: TechCorp Inc., all 3 agents succeed, full coverage synthesis, no conflicts
  • Scenario 2 — resume_after_crash: StartupXYZ, agents 1+2 complete and saved, agent 3 crashes → on resume, agents 1+2 loaded from disk, agent 3 re-executed
  • Scenario 3 — conflict_detection: MegaCorp Ltd., company research says revenue $2.1B, financial agent says $3.4B → ConflictingClaim generated, flagged in report, human review required

Practice Questions (15)

Source: explanation Ex4.md

1. (D1) A research pipeline has 3 parallel subagents — company, market, and financial. The financial agent fails with a rate limit error. What is the correct coordinator response?
  • A) Abort the entire pipeline and report failure — incomplete data cannot produce a valid report
  • B) Retry the financial agent up to 3 times; if still failing, abort the pipeline
  • C) Use results from company and market agents, annotate the financial section as coverage: "none" with the specific reason, continue synthesis
  • D) Ask the company and market agents to supply the missing financial data
Correct: C. This is the anti-abort pattern. Rate limit is a TRANSIENT error, so retrying is appropriate; but if the retries also fail, the coordinator should not discard the completed work from the other two agents. A report with a coverage annotation ("none") is more valuable than no report at all. Option A discards all work. Option B is partially right (retry TRANSIENT errors) but wrong to abort when retries fail. Option D breaks context isolation: agents cannot be given tasks outside their AgentDefinition.
2. (D1) A coordinator is redesigned so subagents can read from a shared conversation history object. What is the primary problem with this design?
  • A) It increases API cost because the full history is sent to every subagent
  • B) It violates context isolation — subagents influence each other's outputs, making each result dependent on execution order rather than objective research
  • C) The shared history object must be serialized/deserialized, adding latency
  • D) The Anthropic API doesn't support shared state between agent instances
Correct: B — Context isolation is a feature, not a limitation. Isolation ensures each subagent's research reflects its own tools and findings, not what another agent said. If the financial agent sees the company agent's (possibly wrong) revenue figure first, it may anchor on that value. Independent research then synthesis is more reliable. Option A is also true but is a secondary concern. Option D is incorrect — you can architect shared state, but shouldn't.
3. (D1) The coordinator emits Task calls to all 3 subagents in separate sequential responses (Task 1, wait for result, Task 2, wait, Task 3, wait). Total time is 180 seconds. What is wrong, and how can it be fixed?
  • A) Nothing is wrong — sequential Task calls are the correct pattern
  • B) The calls are sequential instead of parallel. Emit all 3 Task calls in ONE coordinator response — they run concurrently, reducing 180s to ~60s
  • C) The coordinator should use asyncio.gather() for parallel execution within each Task call
  • D) Only 2 Task calls can be parallel; the third must be sequential
Correct: B — The parallel spawning rule: multiple Task calls in ONE coordinator response run concurrently. Emitting them in separate responses forces sequential execution. asyncio.gather() (C) is how the Python code waits for results, but the actual parallelism comes from all Task calls being in the same response. There's no 2-Task limit (D).
4. (D2) The financial analyst subagent is given access to ["web_search", "financial_data", "web_fetch", "file_system"]. What is wrong with this tool assignment?
  • A) Nothing — more tools give the agent more flexibility for unexpected research needs
  • B) Violates least-privilege. file_system and web_fetch are not needed for financial analysis; file_system particularly creates data leakage and unintended write risks
  • C) The agent definition only supports up to 3 tools per agent
  • D) web_search and financial_data overlap; only one should be included
Correct: B — Least-privilege means an agent gets only the tools it needs to do its job. financial_analyst needs web_search (for news) and financial_data (structured API data). web_fetch could be useful for specific pages but adds broad attack surface. file_system has no role in financial analysis and could silently write data to disk. More tools = more failure modes and security surface. No 3-tool limit (C). web_search and financial_data serve different purposes (D).
5. (D2) A Task prompt reads: "Research the company's recent product launches." A subagent asks in its output: "What company should I research?" What caused this?
  • A) The subagent's system prompt didn't specify what type of agent it was
  • B) Context was not injected into the Task prompt. The subagent has no access to coordinator context; company name must be explicitly included in the Task prompt
  • C) The model needs a higher temperature setting to make reasonable assumptions
  • D) The Task tool doesn't support multi-turn conversations; the subagent can't ask follow-up questions
Correct: B — Each subagent is a fresh instance. It has zero access to coordinator conversation history. "Research the company" references context that the subagent simply doesn't have. Fix: "Research TechCorp Inc.'s recent product launches in the enterprise SaaS space (Q1-Q3 2024). Focus on: features announced, pricing changes, competitive positioning." All context must travel in the Task prompt.
6. (D5) Company researcher returns employee count: 1,200 (source: 2024 annual report). Market analyst returns: 500 (source: 2021 Crunchbase). How should conflict_detector.py handle this?
Correct: B — A 3-year gap with a growing tech company is a temporal explanation, not a conflict. Both 500 (2021) and 1,200 (2024) can be simultaneously correct. The system adds a temporal note explaining that both values are valid at their respective dates. Option A (flag as conflict) is appropriate when sources are <1 year apart. Option C discards useful historical context. Option D is mathematically wrong and meaningless.
7. (D5) Two subagents from the same month return revenue: $2.1B (company_research) and $3.4B (financial_data). What should the system do?
Correct: C — A 62% revenue discrepancy from the same time period is a genuine conflict. Both specialized APIs (A) and primary sources (B) can be wrong (different revenue definitions: GAAP vs. non-GAAP, gross vs. net, different fiscal years). The system should never silently pick a winner. Preserve both, surface the conflict, let a human with business context decide. Averaging (D) produces a value neither source reported.
8. (D5) The pipeline crashes after the company and market agents complete, but before the financial agent and synthesis. On restart, which tasks should run?
Correct: B — The manifest shows company_research=completed and market_analysis=completed → load from per-agent result files. financial_analysis=not_started (or not in manifest) → run fresh. Then synthesis runs on all three. Option A discards 2/3 of the completed work and re-bills the API. Option C skips a required data source. Option D: synthesis doesn't generate data, it combines agent results.
9. (D5) The regulatory compliance section has no data because no tool can access the regulatory database. How should synthesis.py annotate this?
Correct: C — Silent gaps (A) are worse than annotated gaps because they're invisible. A reader doesn't know whether regulatory info is clean or simply missing. Coverage: "none" with explanation tells downstream users exactly what's missing and why. "Partial" (D) is incorrect — no specific regulatory data was obtained. Option B provides generic filler text rather than honest reporting.
10. (D5) A 15,000-token synthesis prompt has KEY FINDINGS at position 8,000 tokens (middle). A colleague proposes moving them to position 1,000 (top). What's the reasoning?
Correct: B — Research shows LLM attention degrades for content in the middle of long contexts ("lost-in-the-middle"). For a 15K-token prompt, content around token 8,000 is most at risk. Critical information (KEY FINDINGS, ACTION ITEMS, CONFLICTS) should be at the beginning or end where attention is highest. Moving tokens doesn't reduce count (C). Attention isn't uniformly highest for earlier tokens (D) — it's highest at start AND end.
11. (D2) You need to add a legal research agent to the pipeline. A colleague suggests giving it access to all tools since legal work requires broad knowledge. What's the correct approach?
Correct: B — "Broad tool access" is the anti-pattern. The legal agent needs: web_search (news, articles), legal_database (case law, filings), and web_fetch (specific regulatory pages). It doesn't need financial_data, market_data, or file_system. More tools = more ways to make mistakes. A shared tool pool (C) breaks the isolation model — different agents have different access needs. Updating existing agent prompts (D) doesn't add new specialized tools.
12. (D1) The financial agent errors with "PERMISSION_DENIED: financial_data API subscription required." After 3 retries, it still fails. What's the correct coordinator response?
Correct: C — PERMISSION errors are non-retryable (isRetryable=False). Continuing to retry wastes API calls and time — the credential issue requires human intervention, not more requests. The anti-abort pattern applies: use the 2 successful agent results, annotate the gap, produce a partial report. Option B discards work done. Option D breaks context isolation and agent scope — the company researcher can't substitute for a specialized financial API.
13. (D1) A pipeline runs 3 agents sequentially in 135 seconds. After refactoring to parallel execution, it takes 45 seconds. A PM asks if splitting into 6 agents (2 per domain) would reduce time to 22.5 seconds. Is this reasoning correct?
Correct: A — Parallel execution time equals the slowest agent, not average. If the company_research agent takes 45s and web_search responses vary, splitting into 2 company researchers both taking 30-45s doesn't halve the time — they're parallel, not serial. Also, coordination overhead, rate limiting, and synthesis complexity grow with more agents. The 22.5s prediction assumes perfect workload splitting, which rarely holds. No 3-request parallel limit in the API (D).
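The timing argument can be checked with a few lines of arithmetic. This is a simplified model that ignores coordination overhead and rate limiting; the per-agent durations for the 6-agent case are made-up illustrative values.

```python
def sequential_time(durations):
    """Agents run one after another: total is the sum."""
    return sum(durations)


def parallel_time(durations):
    """Agents run concurrently: total is set by the slowest agent."""
    return max(durations)


three_agents = [45, 45, 45]             # seconds, from the question
six_agents = [30, 45, 40, 45, 35, 45]   # hypothetical: splitting rarely halves each task

t_seq = sequential_time(three_agents)
t_par = parallel_time(three_agents)
t_six = parallel_time(six_agents)
```

With these numbers t_seq is 135, t_par is 45, and t_six is still 45: as long as any one of the six agents needs 45 seconds, adding agents does not shorten the run.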
14. (D5) The state persistence system uses only an in-memory dictionary instead of manifest.json. The coordinator crashes. What is the consequence?
Correct: B — In-memory state is volatile. A crash wipes the dictionary. Without manifest.json, on restart the system has no way to know which tasks completed without reading and validating all result files — potentially complex and error-prone. Option A is wrong: without the manifest, the system doesn't know which per-agent JSON files to load. Option C would require timestamp-based heuristics (fragile). Option D: subagents are fire-and-forget; they don't maintain state or respond to queries.
15. (D1) A single monolithic agent replaces the hub-and-spoke design: one agent with all tools does all research sequentially. It's simpler. Why is the multi-agent design preferred for large-scale research pipelines?
Correct: C — Multi-agent advantages are concrete, not philosophical: (1) Parallelism: 45s vs 135s; (2) Specialization: a financial analyst with financial_data tool and finance-specific system prompt outperforms a generalist with broad instructions; (3) Context isolation: financial errors don't contaminate company research; (4) Fault isolation: anti-abort pattern; (5) Selective retry of only failed components. Option D is true but not the primary reason. Option B is false — poorly designed multi-agent can underperform a good single agent.