🤖 Exercise 4 of 4

Multi-Agent Research Pipeline

A production multi-agent system for researching competitive landscapes. A coordinator spawns parallel subagents for company research, market analysis, and financial data gathering. Results are deduplicated, conflicts surfaced, coverage gaps annotated, and state persisted for fault-tolerant resumption. Covers D1, D2, and D5 exam content comprehensively.

D1 Agent Architecture (27%) · D2 Tool & MCP Design (18%) · D5 Context & Reliability (15%)

Exam Domains Covered

Domain	Name	Weight	Coverage
D1	Agent Architecture	27%	Hub-and-spoke topology, coordinator/subagent roles, parallel spawning, anti-abort pattern, context isolation
D2	Tool & MCP Design	18%	AgentDefinition config, allowed-tools least privilege, explicit context passing in Task prompts
D5	Context & Reliability	15%	Conflict detection (epistemic honesty), coverage annotations, state persistence with manifest.json

System Architecture

                 ┌─────────────────────────────┐
                 │         COORDINATOR         │
                 │   (hub: orchestrates all)   │
                 │   Task tool × N subagents   │
                 └────────────┬────────────────┘
                              │ spawns (parallel)
          ┌───────────────────┼───────────────────┐
          ▼                   ▼                   ▼
  ┌───────────────┐   ┌───────────────┐   ┌───────────────┐
  │    COMPANY    │   │    MARKET     │   │   FINANCIAL   │
  │  RESEARCHER   │   │    ANALYST    │   │    ANALYST    │
  │               │   │               │   │               │
  │ web_search    │   │ web_search    │   │ web_search    │
  │ web_fetch     │   │ market_data   │   │ financial_data│
  └───────────────┘   └───────────────┘   └───────────────┘
          │                   │                   │
          └───────────────────┼───────────────────┘
                              ▼
                    ┌───────────────────┐
                    │   synthesis.py    │
                    │  conflict_detect  │
                    │  coverage_annot   │
                    └───────────────────┘

Project Files — Code Walkthrough

📋 models.py D1 · D2 · Data Models

Defines all shared data structures: AgentDefinition, ResearchTask, AgentResult, and ResearchReport. The AgentDefinition config pattern is the exam's primary tool design concept — it enforces least-privilege access at the structural level.

AgentDefinition fields: name, role, system_prompt, allowed_tools (list), max_tokens, temperature — all configuration for one subagent type
allowed_tools: company researcher gets ["web_search", "web_fetch"]; financial analyst gets ["web_search", "financial_data"]; market analyst gets ["web_search", "market_data"] — no agent can use tools outside its mandate
ResearchTask fields: task_id, task_type, query, context (injected explicitly, not inherited), priority, max_retries
AgentResult fields: task_id, agent_type, success, content, sources, coverage (full/partial/none), error_type (for failure analysis)
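
The field lists above map naturally onto dataclasses. The following is a minimal sketch, not the project's actual models.py: the field names come from the walkthrough, while the types, default values, and the example system prompt are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

Coverage = Literal["full", "partial", "none"]


@dataclass
class AgentDefinition:
    """All configuration for one subagent type."""
    name: str
    role: str
    system_prompt: str
    allowed_tools: list[str]          # least privilege: only these tools
    max_tokens: int = 2048            # default value is an assumption
    temperature: float = 0.0          # default value is an assumption


@dataclass
class ResearchTask:
    task_id: str
    task_type: str
    query: str
    context: dict[str, str]           # injected explicitly, never inherited
    priority: int = 0                 # default value is an assumption
    max_retries: int = 3


@dataclass
class AgentResult:
    task_id: str
    agent_type: str
    success: bool
    content: str = ""
    sources: list[str] = field(default_factory=list)
    coverage: Coverage = "none"
    error_type: Optional[str] = None  # populated only on failure


# Example definition; the system prompt text is hypothetical.
FINANCIAL_ANALYST = AgentDefinition(
    name="financial_analyst",
    role="Financial data gathering",
    system_prompt="You gather and summarize financial data.",
    allowed_tools=["web_search", "financial_data"],
    max_tokens=2048,
)
```

Defaulting coverage to "none" means a result that never gets filled in is reported as a gap rather than mistaken for complete data.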

Exam note: Allowed-tools least privilege appears in both D1 (agent design) and D2 (tool design). An agent with web_search but without file_system tools cannot leak sensitive data by accidentally writing to disk — the restriction is architectural, not just a prompt instruction.
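
One way to make the restriction architectural is to check every tool call against the agent's mandate before it is dispatched. The sketch below is an assumption about how such a gate could look; `authorize_tool_call` and `ToolNotAllowedError` are hypothetical names, and only the tool lists come from the walkthrough.

```python
class ToolNotAllowedError(PermissionError):
    """Raised when an agent requests a tool outside its allowed_tools."""


# Tool mandates as listed in the models.py walkthrough.
ALLOWED_TOOLS = {
    "company_researcher": {"web_search", "web_fetch"},
    "market_analyst": {"web_search", "market_data"},
    "financial_analyst": {"web_search", "financial_data"},
}


def authorize_tool_call(agent_name: str, tool_name: str) -> None:
    """Reject the call before dispatch if the tool is not in the agent's mandate."""
    allowed = ALLOWED_TOOLS.get(agent_name, set())  # unknown agent gets no tools
    if tool_name not in allowed:
        raise ToolNotAllowedError(
            f"{agent_name} may not call {tool_name}; allowed: {sorted(allowed)}"
        )
```

Because the check runs in code, a prompt that talks the model into requesting file_system still cannot reach the tool.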

🤖 subagents.py D1 · Subagent Execution + Context Isolation

The SubagentExecutor class runs individual subagent tasks. The critical exam concept here is context isolation — each subagent is a fresh Claude instance that receives only its system prompt and the Task prompt. It has no memory of coordinator decisions, other subagents, or previous pipeline runs.

Context isolation: no shared memory between subagents; coordinator state is invisible to subagents; each subagent starts fresh
Explicit context injection: company_name, industry, focus areas are passed in the Task prompt text — not assumed from coordinator conversation history
Context budget management: company_researcher max_tokens=4096 (broad coverage); financial_analyst max_tokens=2048 (structured data); market_analyst max_tokens=3072

# WRONG: assumes the subagent inherits coordinator context
task_prompt = "Research the company's financials."

# RIGHT: inject all context explicitly in the Task prompt
company_name = "TechCorp Inc."  # example value
industry = "enterprise SaaS"    # example value
task_prompt = f"""
Company: {company_name}
Industry: {industry}
Focus areas: revenue, growth rate, profitability
Task: Research recent financial performance.
Sources: SEC filings, earnings reports.
Return format: JSON with fields: revenue, growth, key_metrics
"""

Exam Trap: "Subagents can access the coordinator's conversation history via shared memory." False. Each subagent is a completely isolated Claude instance. It only knows what was passed in its system prompt + the Task prompt. The coordinator must explicitly pass all relevant context.
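
Since a forgotten context field only shows up later as a confused subagent, it can help to build Task prompts through a helper that refuses to render an incomplete one. This is a sketch under assumptions: `build_task_prompt` and the `REQUIRED_CONTEXT` key names are hypothetical, chosen to mirror the fields in the example prompt above.

```python
REQUIRED_CONTEXT = ("company_name", "industry", "focus_areas")  # hypothetical key names


def build_task_prompt(task: str, context: dict[str, str]) -> str:
    """Render a self-contained Task prompt; fail loudly if context is missing."""
    missing = [key for key in REQUIRED_CONTEXT if not context.get(key)]
    if missing:
        raise ValueError(f"Task prompt is missing context: {missing}")
    return (
        f"Company: {context['company_name']}\n"
        f"Industry: {context['industry']}\n"
        f"Focus areas: {context['focus_areas']}\n"
        f"Task: {task}\n"
    )
```

The coordinator then fails at prompt-construction time, before any subagent tokens are spent on an unanswerable task.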

🎯 coordinator.py D1 · Hub-and-Spoke + Parallel Spawning

The ResearchCoordinator is the hub. It orchestrates the pipeline: checks state, spawns parallel subagents via Task tool, waits for results, handles failures via the anti-abort pattern, then triggers synthesis. The exam's most tested D1 concept: parallel spawning and failure handling.

Parallel Spawning Rule: Multiple Task calls in ONE coordinator response run concurrently. This is the difference between 3 × 45-second tasks taking 45 seconds total (parallel) vs. 135 seconds (sequential). The coordinator must emit all Task calls in a single response to get parallelism.

Anti-abort pattern: if 1 of 3 subagents fails, coordinator uses the 2 successful results, annotates gaps, continues pipeline rather than rejecting all work done
Failure triage: TRANSIENT errors (rate limit, timeout) → retry with exponential backoff; PERMISSION errors → skip with annotation; BUSINESS errors → annotate and continue
State checkpoint: save results to manifest.json before synthesis so a crash during synthesis doesn't re-run subagent tasks
Hub role: only the coordinator has the Task tool; subagents are workers with domain-specific tools only
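
The failure triage rule can be sketched as a small retry wrapper. The exception classes, the `run_with_triage` name, and the delay values are assumptions for illustration; only the three error categories and the exponential backoff policy come from the list above.

```python
import asyncio
import random


class TransientError(Exception):
    """Rate limit or timeout: worth retrying."""


class PermissionDenied(Exception):
    """Access refused: retrying will not help."""


class BusinessError(Exception):
    """Valid request, but no usable answer (e.g. company not found)."""


async def run_with_triage(run_task, max_retries=3, base_delay=1.0):
    """Return (result, annotation); result is None when the task is given up on."""
    for attempt in range(max_retries + 1):
        try:
            return await run_task(), None
        except TransientError as exc:
            if attempt == max_retries:
                return None, f"transient error persisted after {max_retries} retries: {exc}"
            # Exponential backoff (1x, 2x, 4x, ...) plus a little jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay * 0.1)
            await asyncio.sleep(delay)
        except PermissionDenied as exc:
            return None, f"skipped (permission): {exc}"
        except BusinessError as exc:
            return None, f"business error: {exc}"
```

In every failure branch the wrapper returns an annotation instead of raising, so the coordinator can continue the pipeline and record the gap.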

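import asyncio

async def run_all(executor, company_task, market_task, financial_task):
    # Parallel spawning: ALL Task calls in ONE response.
    # These run concurrently, not sequentially.
    results = await asyncio.gather(
        executor.run(company_task),    # Task call 1
        executor.run(market_task),     # Task call 2
        executor.run(financial_task),  # Task call 3
        return_exceptions=True,        # a failure is returned as an exception object, not raised
    )
    return results  # same order as the calls above; check each item for exceptions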

⚠️ conflict_detector.py D5 · Conflict Detection + Epistemic Honesty

Detects contradictions between subagent results for the same factual claim. When two sources disagree, the system preserves both values rather than silently resolving to one. This is the "epistemic honesty" principle — annotated uncertainty is better than false confidence.

Conflict Type	Example	Resolution
Numeric divergence >10%	Source A: revenue $2.1B; Source B: $3.4B	Preserve both, add conflict_flag, route to human
Temporal: sources 1+ year apart	2022 report: 500 employees; 2024 report: 1,200 employees	Keep both, add temporal_explanation, both valid
Factual contradiction	Founded 2015 vs. Founded 2018	Preserve both, flag conflict, don't guess

ConflictingClaim structure: field_name, value_a (with source_a, date_a), value_b (with source_b, date_b), conflict_type, recommended_action
Temporal threshold: sources dated less than a year apart → flag as conflict; sources 1+ year apart → temporal_explanation (both may be correct at their respective times)
10% numeric threshold: small variation (<10%) for revenue/users = rounding differences, not a real conflict; large variation (>10%) = flag for review
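
The two thresholds can be combined into one classification function. The sketch below is an assumption about how that might be coded: the source does not say how the 10% is measured, so divergence is computed here relative to the larger value, and the `Claim` type and function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

NUMERIC_THRESHOLD = 0.10        # 10% relative divergence
TEMPORAL_THRESHOLD_DAYS = 365   # sources 1+ year apart


@dataclass
class Claim:
    value: float
    source: str
    as_of: date


def classify_numeric_conflict(a: Claim, b: Claim) -> Optional[str]:
    """Return None (no conflict), 'temporal', or 'numeric_divergence'."""
    baseline = max(abs(a.value), abs(b.value))
    if baseline == 0:
        return None  # both values are zero
    # Assumption: divergence is measured against the larger value.
    divergence = abs(a.value - b.value) / baseline
    if divergence <= NUMERIC_THRESHOLD:
        return None  # treated as rounding, not a real conflict
    if abs((a.as_of - b.as_of).days) >= TEMPORAL_THRESHOLD_DAYS:
        return "temporal"  # both values may be valid for their dates
    return "numeric_divergence"  # preserve both and route to human review
```

In both conflict cases the caller keeps value_a and value_b; the return value only selects which annotation to attach.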

Exam Trap: "When two sources conflict, pick the more recent source." Wrong. A more recent source is usually better for facts that change over time (employee count), but not for fixed facts (founding date), where recency says nothing about which source is right. The system should preserve both values and let humans or downstream logic decide based on context.

📊 synthesis.py D5 · Coverage Annotations + Lost-in-Middle

Synthesizes results from all subagents into a ResearchReport. Implements coverage annotations for incomplete data and the "KEY FINDINGS at top, ACTION ITEMS at bottom" layout for long-context reliability.

Coverage Annotation Protocol: Every section gets a coverage field: "full" (complete, verified), "partial" (some data missing or unverified), or "none" (completely unavailable). Silent gaps — missing sections with no annotation — are worse than annotated gaps because they are invisible.

Lost-in-the-middle mitigation: KEY FINDINGS at top of report, ACTION ITEMS at bottom, CONFLICTS at top of relevant sections — the model attends to beginning and end of context most reliably
Partial section annotation: "Company Overview [partial: financial_analyst agent failed; revenue data missing; market position data complete]"
None annotation: "Regulatory Compliance [coverage: none — no relevant tool available for regulatory database lookup]"
Human review trigger: any conflict → add to review_required list with explanation; any partial coverage on critical fields → flag
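
The layout and annotation rules above could be rendered along these lines. This is a sketch, not the actual synthesis.py: the `Section` type, the function names, and the exact bracket format of the annotation are assumptions.

```python
from dataclasses import dataclass, field
from typing import Literal


@dataclass
class Section:
    title: str
    coverage: Literal["full", "partial", "none"]
    body: str = ""
    note: str = ""                                  # why coverage is not full
    conflicts: list[str] = field(default_factory=list)


def render_heading(section: Section) -> str:
    """Annotate every section that is not fully covered; never leave a silent gap."""
    if section.coverage == "full":
        return section.title
    return f"{section.title} [coverage: {section.coverage}; {section.note}]"


def render_report(key_findings, sections, action_items) -> str:
    """KEY FINDINGS first, ACTION ITEMS last, conflicts at the top of each section."""
    parts = ["KEY FINDINGS", *key_findings]
    for section in sections:
        parts.append(render_heading(section))
        parts.extend(f"CONFLICT: {c}" for c in section.conflicts)
        parts.append(section.body)
    parts += ["ACTION ITEMS", *action_items]
    return "\n".join(part for part in parts if part)  # drop empty bodies
```

A section with coverage "none" still appears in the report with its heading and reason, so the gap stays visible to the reader.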

💾 state_persistence.py D5 · State Persistence + Fault Tolerance

Fault-tolerant state management. Saves each agent's result to a per-agent JSON file and tracks completion in manifest.json. On resumption, completed tasks are skipped, failed/running tasks are restarted, and tasks not in the manifest run fresh — enabling cheap incremental recovery from crashes.

manifest.json: top-level index of task IDs → status (completed/running/failed/not_started), timestamps, file paths to results
Per-agent result files: results/{task_id}.json — complete AgentResult object, loaded on resume
Resume logic: "completed" → skip and load from file; "running" or "failed" → restart ("running" at startup means the process crashed during execution); not in manifest → run fresh
Crash safety: save to file BEFORE updating manifest → prevents corrupted partial state

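# Resume logic from manifest
# all_task_ids: every task ID in this run (name assumed for illustration)
results, pending = {}, []
for task_id in all_task_ids:
    status = manifest.get(task_id)  # None means the task is not in the manifest
    if status == "completed":
        results[task_id] = load_result(task_id)  # skip re-run
    else:
        # "running" (crashed mid-run), "failed", or not in manifest: run it
        pending.append(task_id)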

Why crash at "running" matters: A "running" status means the task started but never wrote a result — the process was killed mid-execution. It's safer to restart than to treat it as complete with no result file.
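
The crash-safety ordering (result file first, manifest second) can be sketched as follows. The helper names and the temp-file-then-rename approach are assumptions rather than the project's actual code, and the manifest here stores only a status string per task for brevity.

```python
import json
import os
import tempfile
from pathlib import Path


def _atomic_write_json(path: Path, payload: dict) -> None:
    """Write to a temp file in the same directory, then rename it over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    os.replace(tmp_name, path)  # readers see the old file or the new one, not a partial write


def save_result(state_dir: Path, task_id: str, result: dict) -> None:
    """Persist one agent result, then mark it completed in manifest.json."""
    result_path = state_dir / "results" / f"{task_id}.json"
    _atomic_write_json(result_path, result)          # 1. result file first
    manifest_path = state_dir / "manifest.json"
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    manifest[task_id] = "completed"
    _atomic_write_json(manifest_path, manifest)      # 2. manifest second
```

If the process dies between the two steps, the manifest still shows the task as unfinished and it is simply re-run; the reverse order could leave a "completed" entry pointing at a missing file.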

⚙️ config.py Configuration

Pipeline configuration: max parallel agents (3), retry limits per agent (3), conflict numeric threshold (10%), temporal conflict threshold (365 days), output paths. The 3-agent parallel limit helps the pipeline stay under API rate limits while still running all three research agents concurrently.

Increase max_parallel_agents for data pipelines with many independent tasks
Lower numeric_conflict_threshold (e.g., 5%) for high-stakes financial analysis where small discrepancies matter
Adjust temporal_conflict_threshold based on industry — fast-moving tech sectors may warrant 180 days
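
These settings could be grouped in a single frozen dataclass so that tuned variants are derived rather than edited in place. The sketch below assumes this shape; `PipelineConfig`, `max_retries_per_agent`, and the default `output_dir` are hypothetical, while the other names and default values come from the text above.

```python
from dataclasses import dataclass, replace
from pathlib import Path


@dataclass(frozen=True)
class PipelineConfig:
    max_parallel_agents: int = 3
    max_retries_per_agent: int = 3             # hypothetical field name
    numeric_conflict_threshold: float = 0.10   # 10% relative divergence
    temporal_conflict_threshold: int = 365     # days
    output_dir: Path = Path("output")          # hypothetical default path

    def __post_init__(self):
        if self.max_parallel_agents < 1:
            raise ValueError("max_parallel_agents must be at least 1")
        if not 0 < self.numeric_conflict_threshold < 1:
            raise ValueError("numeric_conflict_threshold must be between 0 and 1")


DEFAULT_CONFIG = PipelineConfig()

# Stricter variant for high-stakes financial analysis in a fast-moving sector.
STRICT_CONFIG = replace(
    DEFAULT_CONFIG,
    numeric_conflict_threshold=0.05,
    temporal_conflict_threshold=180,
)
```

Using `replace` leaves the default configuration untouched and re-runs the validation in `__post_init__` for the new values.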

▶️ run_demo.py Demo — 3 Research Scenarios

Demonstrates the pipeline on 3 scenarios: fresh run (all agents succeed), resume (simulated crash after 2 of 3 agents), and conflict detection (company researcher and financial analyst return conflicting revenue figures).

Scenario 1 — fresh_run: TechCorp Inc., all 3 agents succeed, full coverage synthesis, no conflicts
Scenario 2 — resume_after_crash: StartupXYZ, agents 1+2 complete and saved, agent 3 crashes → on resume, agents 1+2 loaded from disk, agent 3 re-executed
Scenario 3 — conflict_detection: MegaCorp Ltd., company research says revenue $2.1B, financial agent says $3.4B → ConflictingClaim generated, flagged in report, human review required

Practice Questions (15)

Source: explanation Ex4.md

A research pipeline has 3 parallel subagents — company, market, and financial. The financial agent fails with a rate limit error. What is the correct coordinator response?

A) Abort the entire pipeline and report failure — incomplete data cannot produce a valid report
B) Retry the financial agent up to 3 times; if still failing, abort the pipeline
C) Use results from company and market agents, annotate the financial section as coverage: "none" with the specific reason, continue synthesis
D) Ask the company and market agents to supply the missing financial data

Correct: C. This is the anti-abort pattern. A rate limit is a TRANSIENT error, so retrying with backoff is appropriate, but if retries are exhausted the coordinator should not discard the completed work of the other two agents. A report with the financial section annotated coverage: "none" is more valuable than no report at all. Option A discards all work. Option B is partially right (retry TRANSIENT errors) but wrong to abort when retries fail. Option D violates least privilege: the company and market agents have no financial_data tool, and agents should not be given tasks outside their AgentDefinition.

A coordinator is redesigned so subagents can read from a shared conversation history object. What is the primary problem with this design?

A) It increases API cost because the full history is sent to every subagent
B) It violates context isolation — subagents influence each other's outputs, making each result dependent on execution order rather than objective research
C) The shared history object must be serialized/deserialized, adding latency
D) The Anthropic API doesn't support shared state between agent instances

Correct: B — Context isolation is a feature, not a limitation. Isolation ensures each subagent's research reflects its own tools and findings, not what another agent said. If the financial agent sees the company agent's (possibly wrong) revenue figure first, it may anchor on that value. Independent research then synthesis is more reliable. Option A is also true but is a secondary concern. Option D is incorrect — you can architect shared state, but shouldn't.

The coordinator emits Task calls to all 3 subagents in separate sequential responses (Task 1, wait for result, Task 2, wait, Task 3, wait). Total time is 180 seconds. What is wrong and how to fix it?

A) Nothing is wrong — sequential Task calls are the correct pattern
B) The calls are sequential instead of parallel. Emit all 3 Task calls in ONE coordinator response — they run concurrently, reducing 180s to ~60s
C) The coordinator should use asyncio.gather() for parallel execution within each Task call
D) Only 2 Task calls can be parallel; the third must be sequential

Correct: B — The parallel spawning rule: multiple Task calls in ONE coordinator response run concurrently. Emitting them in separate responses forces sequential execution. asyncio.gather() (C) is how the Python code waits for results, but the actual parallelism comes from all Task calls being in the same response. There's no 2-Task limit (D).

The financial analyst subagent is given access to ["web_search", "financial_data", "web_fetch", "file_system"]. What is wrong with this tool assignment?

A) Nothing — more tools give the agent more flexibility for unexpected research needs
B) Violates least-privilege. file_system and web_fetch are not needed for financial analysis; file_system particularly creates data leakage and unintended write risks
C) The agent definition only supports up to 3 tools per agent
D) web_search and financial_data overlap; only one should be included

Correct: B — Least-privilege means an agent gets only the tools it needs to do its job. financial_analyst needs web_search (for news) and financial_data (structured API data). web_fetch could be useful for specific pages but adds broad attack surface. file_system has no role in financial analysis and could silently write data to disk. More tools = more failure modes and security surface. No 3-tool limit (C). web_search and financial_data serve different purposes (D).

A Task prompt reads: "Research the company's recent product launches." A subagent asks in its output: "What company should I research?" What caused this?

A) The subagent's system prompt didn't specify what type of agent it was
B) Context was not injected into the Task prompt. The subagent has no access to coordinator context; company name must be explicitly included in the Task prompt
C) The model needs a higher temperature setting to make reasonable assumptions
D) The Task tool doesn't support multi-turn conversations; the subagent can't ask follow-up questions

Correct: B — Each subagent is a fresh instance. It has zero access to coordinator conversation history. "Research the company" references context that the subagent simply doesn't have. Fix: "Research TechCorp Inc.'s recent product launches in the enterprise SaaS space (Q1-Q3 2024). Focus on: features announced, pricing changes, competitive positioning." All context must travel in the Task prompt.