Planner Graph Refactor Proposals: Ranked Analysis

Author: GLM-5.1 (Sisyphus) evaluation
Date: 2026-06-11
Scope: All 11 proposals in docs/planner-graph-ref/proposals/
Source: plan_node() in src/venturescope/planner/agent.py:846-1192 (~350 LOC)

Evaluation Criteria

Criterion	Weight	Description
Graph readability	High	Does the Mermaid diagram honestly represent control flow?
Testability	High	Can each node be tested in isolation without LLM mocking?
Migration risk	High	Can the change be staged safely? Does it minimize checkpoint churn?
Correctness	Critical	Does it preserve all current behavior (all 15+ early-return paths)?
Decomposition quality	High	Right granularity — neither too many nodes nor too few?
State surface	Medium	How many new state fields? Are they transient or persistent?
Post-LLM handling	High	Are decision correctors well-organized?
Naming clarity	Low	Are node names intuitive and consistent?

Ranked Proposals

1. Fable 5 — Best Overall

Nodes: tick → prepare → select → decide → guard

Aspect	Assessment
Decomposition	★★★★★ — Five nodes is the sweet spot. Each has one clear job: `tick` (iteration + early-exit guards), `prepare` (schema composition + optional decomposition LLM), `select` (deterministic decision: calculator gates, acquisition fast-path, auto-finish), `decide` (pure LLM planner call), `guard` (all correctors).
Correctness	★★★★★ — `decision_origin` state field (`"deterministic" | "llm"`) is the key insight. It lets `guard` apply only the relevant correctors: deterministic decisions get `_adjust_calculation_decision` only, while LLM decisions get the full pipeline (redirect, search cap, ask cap, calculator adjust). This matches current behavior precisely.
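Testability	★★★★★ — Each node is independently testable. `tick` makes zero LLM calls, and `select` is deterministic apart from the possible blocked-field decomposition call noted under Weakness. `decide` is the only planner LLM call. `guard` is pure transformation on a `PlannerDecision`.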
Migration	★★★★☆ — Three-phase plan (extract → rewire → docs) is solid. Phase 1 keeps `plan_node` as a thin wrapper. Checkpoint namespace bump (`planner:v2`) acknowledged.
State surface	★★★★★ — Only 2 new fields: `decision_origin` and `llm_failed`, both runtime-only and serializer-compatible. No schema changes.
Post-LLM	★★★★★ — All correctors consolidated in `guard`, keyed by `decision_origin`. This is the cleanest post-LLM handling of any proposal.
Weakness	`select` contains a possible LLM call (blocked-field decomposition, lines 960-968). The proposal acknowledges this as a follow-up extraction. Also, `tick` handles region/currency bootstrap questions, which conceptually feel different from iteration guards — but the proposal explicitly notes that these decisions bypass `guard`, matching current behavior.

Key Insight: The decision_origin field is the single most important architectural contribution across all proposals. It solves the "which correctors apply?" problem without branching by node path.
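A minimal sketch of how `guard` might select correctors by `decision_origin`. The `PlannerDecision` stand-in, the state keys, and the bodies of both correctors are placeholders invented for illustration; only the idea of keying the pipeline on the origin tag comes from the proposal.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class PlannerDecision:
    """Hypothetical stand-in for the real PlannerDecision."""
    action: str
    reason: str = ""


Corrector = Callable[[PlannerDecision, dict], PlannerDecision]


def cap_searches(decision: PlannerDecision, state: dict) -> PlannerDecision:
    # Placeholder search cap: once the budget is spent, ask the user instead.
    if decision.action == "search" and state.get("searches", 0) >= state.get("max_searches", 3):
        return replace(decision, action="ask_user", reason="search cap reached")
    return decision


def adjust_calculation(decision: PlannerDecision, state: dict) -> PlannerDecision:
    # Placeholder for _adjust_calculation_decision: no calculation while inputs are missing.
    if decision.action == "calculate" and state.get("missing_inputs"):
        return replace(decision, action="search", reason="calculator inputs missing")
    return decision


# Deterministic decisions get the calculator adjustment only;
# LLM decisions get the caps first, then the calculator adjustment.
CORRECTORS: Dict[str, Tuple[Corrector, ...]] = {
    "deterministic": (adjust_calculation,),
    "llm": (cap_searches, adjust_calculation),
}


def guard(decision: PlannerDecision, origin: str, state: dict) -> PlannerDecision:
    """Apply, in order, only the correctors registered for this origin."""
    for corrector in CORRECTORS[origin]:
        decision = corrector(decision, state)
    return decision
```

With this shape, adding a corrector means editing one table entry rather than adding a graph branch, and an unknown origin fails loudly with a `KeyError`.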

2. DeepSeek 4 Pro — Strong Runner-up

Nodes: guard → prepare → plan → adjust

Aspect	Assessment
Decomposition	★★★★☆ — Four nodes, cleanly delineated. `prepare` handles both schema composition AND calculator gates, which mixes two concerns, but the resulting node size (~50 LOC) is manageable.
Correctness	★★★★★ — Explicit line-by-line mapping (846-887, 889-1035, 1037-1088, 1090-1192). Every early-return path accounted for.
Testability	★★★★☆ — Good, but `prepare` mixes deterministic gates (calculator cap/success) with schema transformation, making it a slightly harder unit test target.
Migration	★★★★★ — Best migration plan of all proposals. Four phases with clear steps. Phase 1 is pure extraction (no graph change). Phase 2 rewires. Phase 3 deletes old code. Phase 4 verifies.
State surface	★★★★☆ — Adds `recipes` to state for caching. Small addition, but introduces a cache-invalidation concern: `recipes` must stay synchronized with `dynamic_decompositions`.
Post-LLM	★★★★☆ — `adjust` consolidates all correctors. Clean, but doesn't distinguish which correctors apply to deterministic vs. LLM decisions. In the current code, `_adjust_calculation_decision` applies to acquisition fast-path decisions AND LLM decisions, while search/ask caps only apply to LLM decisions. The proposal's `adjust` would need to handle this distinction internally.
Weakness	Mixing calculator gates into `prepare` means the node has two different conceptual responsibilities: "compose the schema" and "is the calculator done?". These could drift apart in future refactoring.
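One possible way to contain the cache-invalidation concern is to store a fingerprint of `dynamic_decompositions` next to `recipes` and rebuild only when it changes. This is a sketch under assumptions: `recipes_fingerprint`, `build_recipes`, and the dict shapes are hypothetical and not part of the DeepSeek proposal, and the extra field would itself enlarge the state surface.

```python
import hashlib
import json


def fingerprint(dynamic_decompositions: dict) -> str:
    """Stable hash of the decompositions the recipes were built from."""
    payload = json.dumps(dynamic_decompositions, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_recipes(dynamic_decompositions: dict) -> dict:
    # Placeholder for the real recipe composition step.
    return {field: sorted(parts) for field, parts in dynamic_decompositions.items()}


def ensure_recipes(state: dict) -> dict:
    """Return a partial state update; empty when the cached recipes are still valid."""
    decompositions = state.get("dynamic_decompositions", {})
    current = fingerprint(decompositions)
    if "recipes" in state and state.get("recipes_fingerprint") == current:
        return {}
    return {"recipes": build_recipes(decompositions), "recipes_fingerprint": current}
```

The check costs one hash per pass, which keeps `recipes` from silently drifting when a later node mutates `dynamic_decompositions`.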

3. GPT 5.4 — Best Principles, Good Naming

Nodes: tick → bootstrap_gate → prepare_context → acquisition_gate → llm_plan → decision_policy

Aspect	Assessment
Decomposition	★★★★☆ — Six nodes with clear single responsibilities. `bootstrap_gate` is a cleaner split for region/currency than lumping into `tick`.
Correctness	★★★★★ — Explicit anti-patterns section ("do not move state mutation into edge functions", "do not increment iterations in every gate") shows deep understanding of LangGraph constraints.
Testability	★★★★★ — Each node is small and focused. `tick` has zero LLM, zero schema, zero decisions (only iteration + event).
Migration	★★★★★ — Six-stage migration (one node per stage) is the safest incremental approach of any proposal. Each stage can ship independently.
Post-LLM	★★★★☆ — `decision_policy` consolidates all correctors. Same concern as DeepSeek: no `decision_origin` tracking. The proposal says "one `decision_policy` node is enough; splitting every rewrite into separate nodes would add graph noise without improving clarity." This is correct.
Naming	★★★★★ — Best node naming across all proposals. `tick`, `bootstrap_gate`, `prepare_context`, `acquisition_gate`, `llm_plan`, `decision_policy` — each name immediately tells you what the node does.
Weakness	Six nodes means six graph hops per iteration in the worst case, which increases checkpoint writes. The proposal acknowledges this but doesn't quantify the latency impact. Also, `bootstrap_gate` routing to `ask_user` directly (not through a decision-pipeline) means region/currency decisions skip `decision_policy`, which is correct — but needs explicit documentation.
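The two anti-patterns quoted under Correctness can be illustrated with plain functions. This is a sketch only: the state keys (`iterations`, `events`, `pending_question`) and the event string are assumptions, and no LangGraph API is used.

```python
def tick(state: dict) -> dict:
    """Node: advance the iteration counter exactly once per planner pass."""
    iteration = state.get("iterations", 0) + 1
    return {
        "iterations": iteration,
        "events": state.get("events", []) + [f"planner_tick:{iteration}"],
    }


def bootstrap_gate(state: dict) -> dict:
    """Gate node: may request a question, but never touches the counter."""
    if not state.get("region"):
        return {"pending_question": "region"}
    if not state.get("currency"):
        return {"pending_question": "currency"}
    # Clear the field explicitly so an old question cannot linger.
    return {"pending_question": None}


def route_after_bootstrap(state: dict) -> str:
    """Edge function: reads state, returns a node name, mutates nothing."""
    return "ask_user" if state.get("pending_question") else "prepare_context"
```

Because only `tick` writes `iterations`, adding or removing gates later cannot change how many iterations a run consumes.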

4. Opus — Most Detailed Analysis, Too Fine-Grained

Nodes: g_iter_cap → g_region → emit_region → g_currency → emit_currency → enrich_schema → g_calc_caps → emit_calc_abort → emit_calc_done → acquire → plan_llm → c_target_decompose → c_redirect_derived → c_redirect_web → c_search_cap → c_ask_cap → c_calc_adjust → dispatch

Aspect	Assessment
Decomposition	★★★☆☆ — Each concern has its own node, which is maximally granular. But the resulting graph has 10+ nodes in the hot path per iteration, which means 10+ LangGraph checkpoint writes per planner tick. This is likely worse for a chat-paced agent that already has latency concerns, although no proposal measures the overhead.
Correctness	★★★★★ — The line-by-line mapping table (§4) is the most meticulous of all proposals. Every line range is accounted for. The `g_`/`c_`/`emit_` naming convention is systematic.
Testability	★★★★★ — Maximally testable. Each gate and corrector is a pure function.
Migration	★★★☆☆ — Three-step plan is too coarse for 10+ new nodes. Step 2 ("hoist gates into graph") introduces many nodes simultaneously, which is higher risk.
Post-LLM	★★★★★ — Every corrector is its own node. `dispatch` serves as the single logging/routing point. This is the most explicit and auditable post-LLM pipeline.
Weakness	The `c_` corrector chain (`c_target_decompose → c_redirect_derived → c_redirect_web → c_search_cap → c_ask_cap → c_calc_adjust`) adds 6 sequential hops after every LLM decision. These are all deterministic mutations on a single `PlannerDecision` object. Making each a separate node means 6 extra `PostgresSaver` writes per iteration for operations that could be a single function call. The proposal acknowledges this concern but dismisses it. In practice, for a chat-paced agent, these writes add up. Recommendation: Use this proposal's line-by-line mapping as a reference, but consolidate the `c_*` chain into a single `guard` or `decision_policy` node.
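Consolidating the chain does not have to cost the auditability that separate nodes provide. The sketch below runs two stand-in correctors inside one function and records which of them changed the decision. The cap policies and state keys are placeholders, and a plain dict stands in for `PlannerDecision`.

```python
from typing import Callable, List, Tuple

Decision = dict  # hypothetical stand-in for PlannerDecision
Corrector = Callable[[Decision, dict], Decision]


def c_search_cap(decision: Decision, state: dict) -> Decision:
    # Placeholder policy: over-budget searches become a user question.
    if decision["action"] == "search" and state["searches"] >= state["max_searches"]:
        return {**decision, "action": "ask_user"}
    return decision


def c_ask_cap(decision: Decision, state: dict) -> Decision:
    # Placeholder policy: over-budget questions end the run.
    if decision["action"] == "ask_user" and state["asks"] >= state["max_asks"]:
        return {**decision, "action": "finish"}
    return decision


CHAIN: Tuple[Corrector, ...] = (c_search_cap, c_ask_cap)


def guard(decision: Decision, state: dict) -> Tuple[Decision, List[str]]:
    """Run the whole chain in one call and report which correctors fired."""
    fired: List[str] = []
    for corrector in CHAIN:
        updated = corrector(decision, state)
        if updated != decision:
            fired.append(corrector.__name__)
        decision = updated
    return decision, fired
```

The `fired` list can be logged or emitted as an event, giving a per-corrector trail from a single graph hop.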

5. GLM 5.1 — Most Conservative, Safest Starting Point

Nodes: pre_check → acquire_or_plan

Aspect	Assessment
Decomposition	★★★☆☆ — Two nodes is the minimum viable decomposition. `pre_check` (~150 LOC) still has 7 exit paths. `acquire_or_plan` (~200 LOC) still has sequential LLM + correctors. The graph topology barely changes.
Correctness	★★★★★ — The `PreCheckResult` dataclass for routing is explicit. The proposal acknowledges that `acquire_or_plan` is still substantial but "reads top-to-bottom without early returns for guard conditions."
Testability	★★★☆☆ — `pre_check` is testable without LLM. `acquire_or_plan` still requires LLM mocking for full testing.
Migration	★★★★★ — Lowest risk of all proposals. Phase 1 (extract `pre_check`) is trivial. Phase 2 (rename). Phase 3 (extract `_build_plan_output` helper). Phase 4 optional `validate_decision` extraction.
State surface	★★★★★ — Proposes `_pre_check_route` as a routing tag OR extending the `Action` literal. Recommends the routing tag to keep `Action` clean. Good principle.
Weakness	The two-node split doesn't significantly improve graph readability. The diagram still shows a large `acquire_or_plan` node with hidden routing. `PreCheckResult` dataclass is overengineered — a simple string literal or existing `PlannerDecision.action` would suffice. The `Option B: Extract validate_decision later` fallback is wise.

Role in final recommendation: Use as the starting approach for migration, not the end state. Phase 1 extraction of pre_check is a safe first step that all proposals agree on.
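As a rough illustration of that first step, `pre_check` can return an existing action name (or `None`) instead of a custom result type, while `plan_node` stays in the graph as a thin wrapper. Only three guards are shown. The state keys, the default iteration cap, and the `acquire_or_plan` body are assumptions made for the sketch.

```python
from typing import Optional


def pre_check(state: dict) -> Optional[str]:
    """Return an early-exit action name, or None to continue planning.

    Illustrative subset: the real function would cover every guard-style
    early return currently inside plan_node.
    """
    if state.get("iterations", 0) >= state.get("max_iterations", 10):
        return "finish"
    if not state.get("region"):
        return "ask_user"
    if not state.get("currency"):
        return "ask_user"
    return None


def acquire_or_plan(state: dict) -> str:
    # Placeholder for the acquisition fast-path, the LLM call, and the correctors.
    return "search"


def plan_node(state: dict) -> dict:
    """Thin wrapper: same graph node as before, logic now lives in helpers."""
    early_exit = pre_check(state)
    action = early_exit if early_exit is not None else acquire_or_plan(state)
    return {"decision": {"action": action}}
```

Since `pre_check` takes a dict and returns a string or `None`, it can be unit-tested without constructing a graph or mocking an LLM.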

6. GPT 5.5 — Thorough but Over-Specified

Nodes: enter_iteration → prepare_schema → require_region → require_currency → acquisition_gate → plan → normalize_decision → retry_gate → maybe_calculate

Aspect	Assessment
Decomposition	★★★☆☆ — Nine nodes in the main path. `require_region` and `require_currency` as separate nodes from `enter_iteration` is over-granular — these are one-condition checks that could be a single `bootstrap_gate`. `maybe_calculate` naming is unclear (it's actually calculator adjustment, not "maybe calculate").
Correctness	★★★★★ — Very precise about which lines map to which node. Explicit about state changes (`llm_failed`, `decision_origin`).
Testability	★★★★☆ — Each node is testable. But `require_region`/`require_currency` as separate nodes adds unnecessary test surface.
Migration	★★★★☆ — Good step-by-step approach. Step 1 (extract pure helpers first, no graph change) is the safest possible start.
Post-LLM	★★★★☆ — `normalize_decision` + `retry_gate` + `maybe_calculate` splits the corrector pipeline into three. `normalize_decision` handles redirects, `retry_gate` handles caps, `maybe_calculate` handles calculator. This is a defensible split.
Weakness	Nine nodes with names like `maybe_calculate` and `retry_gate` require learning the naming convention. The `require_region`/`require_currency` split into separate nodes and separate `observe_region`/`observe_currency` handler nodes is a nice idea in theory (making bootstrap questions explicit in the graph) but in practice it creates 4 new nodes for what is currently 2 inline conditionals. The cost/benefit isn't justified.

7. Kimi 2.6 — Good Decision Matrix, Odd Node Choices

Nodes: tick → ask_region → ask_currency → prepare → calc_gate → acquire → route_direct → decide → enforce_policy

Aspect	Assessment
Decomposition	★★★☆☆ — The `calc_gate → acquire → route_direct` chain is oddly organized. `calc_gate` routes to `acquire`, which then routes to `route_direct` or `decide`. But `route_direct` is described as "a tiny adapter node" that just converts an `AcquisitionTask` to a `PlannerDecision` and applies `_adjust_calculation_decision`. This is a ~5 line function, not a graph node.
Correctness	★★★★☆ — Good line-by-line mapping in the appendix. However, `route_direct` feeds into `enforce_policy`, which means acquisition-fast-path decisions also go through search/ask cap enforcement. In current code, acquisition decisions skip search/ask caps (they are deterministic). This may be a subtle behavioral change.
Testability	★★★★☆ — Good testability per node.
Naming	★★★☆☆ — `route_direct` and `enforce_policy` are fine, but `tick` and `calc_gate` are inconsistent in formality.
Weakness	`ask_region`/`ask_currency` as separate interrupt nodes diverges from the current `ask_user`/`observe_user` pattern. This creates new interrupt nodes that need their own handler logic, rather than reusing the existing `ask_user` node. The `route_direct` node is too small to justify its own graph node.

8. Qwen 3.6 Plus — Good Ideas, Flawed Routing

Nodes: preflight → ask_region → ask_currency → decompose → plan → route_finish_check

Aspect	Assessment
Decomposition	★★★☆☆ — Six nodes is reasonable. `decompose` as a separate node is a good idea (isolating the LLM decomposition call). `route_finish_check` as a separate validation node before `finish` is interesting.
Correctness	★★☆☆☆ — Critical flaw: The proposal moves redirect logic (`_redirect_derived_direct_decision`, `_redirect_premature_ask_for_web_field`, `_adjust_calculation_decision`) and cap enforcement into `route_after_plan` as a routing function. LangGraph routing functions should be pure state readers that return a node name. Putting business logic (decision rewriting) inside a routing function is an anti-pattern that violates the "routing reads, nodes write" principle. This would make the routing function untestable in isolation and confusing to maintain.
Testability	★★☆☆☆ — Due to the routing function anti-pattern, testing the redirect logic requires routing function invocation, which is less direct than testing a node.
Weakness	Creating `observe_region`/`observe_currency` as separate handler nodes for bootstrap questions means the existing `observe_user` node logic must be duplicated or refactored to handle these cases. This increases maintenance surface for a marginal gain in diagram clarity.
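The "routing reads, nodes write" split can be sketched as follows. The redirect rule is a placeholder loosely modelled on `_redirect_premature_ask_for_web_field`; the `field` key and the `web_fields` set are assumptions.

```python
def redirect_premature_ask(decision: dict, web_fields: set) -> dict:
    # Placeholder rule: a question about a web-sourced field becomes a search.
    if decision["action"] == "ask_user" and decision.get("field") in web_fields:
        return {**decision, "action": "search"}
    return decision


def decision_policy(state: dict) -> dict:
    """Node: the only place where the decision is rewritten."""
    rewritten = redirect_premature_ask(state["decision"], state["web_fields"])
    return {"decision": rewritten}


def route_after_policy(state: dict) -> str:
    """Router: inspects the final decision and returns the next node name."""
    return state["decision"]["action"]
```

Each redirect is then testable as a function of `(decision, web_fields)`, and the router stays a one-line lookup that cannot change state.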

9. Qwen 3.7 Max — Verbose, `_route` Anti-Pattern

Nodes: check_termination → check_region_currency → compose_schema → check_calculator → acquisition_routing → check_completion → llm_decide → post_process → enforce_caps → route_decision

Aspect	Assessment
Decomposition	★★★☆☆ — 10 decision nodes is too many for the hot path. Every planner iteration traverses at least 7 of these nodes, adding checkpoint overhead.
Correctness	★★☆☆☆ — Anti-pattern: Every node writes a `_route` transient state field. This contradicts LangGraph's design where state should be meaningful, not a routing scratchpad. Proper LangGraph routing uses conditional edge functions that read existing state fields, not a throwaway `_route` field written by the previous node. If `_route` is accidentally persisted or not cleared, it becomes a stale routing hint.
Testability	★★☆☆☆ — The `_route` pattern means every test must set `_route` on input state, making tests verbose and fragile.
Naming	★★☆☆☆ — `check_termination`, `check_region_currency`, `check_calculator`, `check_completion` — the `check_` prefix is redundant. `compose_schema` and `post_process` are vague.
Weakness	The proposal includes complete pseudocode for every node, which is admirable for documentation but the pseudocode doesn't match the real implementation (e.g., `_llm_failed` handling is oversimplified). The `_route` field pattern is a design anti-pattern that should not be adopted. Also proposes `ask_region`/`ask_currency` as separate interrupt nodes (same issue as Kimi).
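The stale-hint failure mode is easy to reproduce in miniature. The sketch assumes a simple last-write-wins merge of partial updates as a simplification of how state updates are applied. The node bodies are invented, and `llm_decide` deliberately forgets to refresh `_route`.

```python
def merge(state: dict, update: dict) -> dict:
    """Last-write-wins merge of a partial update into the state."""
    return {**state, **update}


def check_calculator(state: dict) -> dict:
    # Writes a routing tag, as in the `_route` pattern under review.
    return {"_route": "finish" if state["calculator_done"] else "acquisition_routing"}


def llm_decide(state: dict) -> dict:
    # Deliberate bug: updates the decision but does not refresh `_route`.
    return {"decision": {"action": "search"}}


def route_by_tag(state: dict) -> str:
    # Trusts whatever the most recent writer left behind.
    return state["_route"]


def route_by_decision(state: dict) -> str:
    # Derives the route from meaningful state, so there is no tag to go stale.
    return state["decision"]["action"]


state = {"calculator_done": False}
state = merge(state, check_calculator(state))
state = merge(state, llm_decide(state))
stale = route_by_tag(state)        # still the tag written two nodes earlier
fresh = route_by_decision(state)   # reflects the latest decision
```

Here `stale` still points at `acquisition_routing` while `fresh` follows the new decision, which is the divergence the review warns about.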

10. Gemini 3.1 Pro — Too Shallow

Nodes: prepare_state → evaluate_rules → llm_plan

Aspect	Assessment
Decomposition	★★☆☆☆ — Three nodes is insufficient. `evaluate_rules` (~100 LOC) combines calculator gates, acquisition logic, AND auto-finish into one node. `llm_plan` still has all post-LLM corrections. The graph topology barely changes from the current monolith.
Correctness	★★★☆☆ — No line mapping. No explicit handling of post-LLM corrections. The proposal says "you can now test business rules natively as pure python functions" but doesn't specify how `evaluate_rules` handles the calculator-blocked → acquisition task path, which includes LLM calls for dynamic decomposition.
Testability	★★★☆☆ — `prepare_state` and `evaluate_rules` are testable. But `llm_plan` still has 100+ LOC of correctors.
Weakness	The 3-node decomposition doesn't justify the migration risk. The graph diagram shows `evaluate_rules` routing to 5 destinations, but the routing logic is still hidden inside the node (same problem as current `plan_node`, just smaller). No post-LLM handling specification.

11. Mimo 2.5 Pro — Minimal but Insufficient

Nodes: guards → acquire → decide

Aspect	Assessment
Decomposition	★★☆☆☆ — Three nodes. `guards` (~40 LOC) is fine. `acquire` (~60 LOC) is fine. But `decide` (~80 LOC) still has all correctors, all cap enforcement, and `_adjust_calculation_decision` crammed in. The stated benefit says `decide` is "the most complex node," which means the decomposition hasn't actually simplified the hardest part.
Correctness	★★★☆☆ — The line estimates (40/60/80 LOC) don't add up to 350. The `acquire` node handles decomposition, composition, calculator gates, AND acquisition task selection — that's not ~60 lines. The `decide` node handles LLM + 5 correctors + cap enforcement + logging, which is well over 80 lines.
Testability	★★★☆☆ — `guards` and `acquire` are testable. `decide` still requires LLM mocking for full testing.
Weakness	The proposal has the best summary table (Before/After metrics) but the numbers don't match the actual code complexity. The open questions at the end reveal the key uncertainty: should `decide` include cap enforcement or should that be a separate node? This is the question that Fable 5 answers definitively with `decision_origin`.

Recommendation

Primary: Adopt Fable 5's Architecture

Target topology:

START → tick → prepare → select → decide → guard → [action nodes]
tick → (early exit) → finish
[action nodes] → tick (loop back)

Why Fable 5:

decision_origin is the key insight. No other proposal cleanly solves the problem of "which correctors apply to which decisions." The "deterministic" vs. "llm" tag lets guard apply _adjust_calculation_decision to both paths (matching current behavior) but search/ask caps only to LLM decisions (also matching current behavior). This is precise, minimal, and correct.
Five nodes is the right granularity. Four or fewer means one node still has >100 LOC. Ten or more means excessive checkpoint overhead per iteration. Five hits the sweet spot where each node has a clear single responsibility and the graph diagram is readable.
Migration is safe. Phase 1 (extract helpers → keep plan_node as a thin wrapper) leaves the graph topology untouched, so it is low-risk. Phase 2 (rewire graph) is a single breaking change to the graph topology. Phase 3 (docs update) is mechanical.
select consolidates all deterministic routing. Calculator gates, acquisition tasks, and auto-finish (all the "we know what to do without LLM" logic) live in one place. Once the blocked-field decomposition call is extracted, this is the most testable unit: no LLM mocking needed.
guard is a single pass-through for correctors. Instead of the six separate corrector nodes in the Opus proposal, guard applies correctors sequentially based on decision_origin. This is one function call per corrector, one graph hop, and one checkpoint write, not six.
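The target topology can be exercised as a tiny plain-Python state machine, which may help when reviewing the routing before any LangGraph wiring exists. Everything here is an assumption made for the sketch: the node bodies, the state keys (`early_exit`, `missing_fields`), the single `search` action node, and the `select → guard` shortcut for deterministic decisions.

```python
def tick(state: dict) -> dict:
    n = state.get("iterations", 0) + 1
    return {"iterations": n, "early_exit": n > state.get("max_iterations", 3)}


def prepare(state: dict) -> dict:
    return {"schema_ready": True}  # schema composition would happen here


def select(state: dict) -> dict:
    # Deterministic path: finish once nothing is missing; otherwise defer to decide.
    if not state.get("missing_fields"):
        return {"decision": "finish", "decision_origin": "deterministic"}
    return {"decision": None, "decision_origin": None}


def decide(state: dict) -> dict:
    return {"decision": "search", "decision_origin": "llm"}  # stands in for the LLM call


def guard(state: dict) -> dict:
    return {}  # correctors keyed by decision_origin would run here


def search(state: dict) -> dict:
    return {"missing_fields": state["missing_fields"][1:]}  # pretend one field was found


NODES = {"tick": tick, "prepare": prepare, "select": select,
         "decide": decide, "guard": guard, "search": search}


def next_node(current: str, state: dict) -> str:
    """Pure routing: reads state and returns the next node name."""
    if current == "tick":
        return "finish" if state["early_exit"] else "prepare"
    if current == "prepare":
        return "select"
    if current == "select":
        return "guard" if state["decision"] else "decide"
    if current == "decide":
        return "guard"
    if current == "guard":
        return state["decision"]
    return "tick"  # every action node loops back to tick


def run(state: dict) -> list:
    """Drive the graph until finish and return the visited node names."""
    trace, current = [], "tick"
    while current != "finish":
        trace.append(current)
        state = {**state, **NODES[current](state)}
        current = next_node(current, state)
    return trace
```

Calling `run({"missing_fields": ["price"]})` walks one LLM-driven pass, loops back through `tick`, and finishes on the deterministic path without visiting `decide` a second time.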

Secondary: Incorporate GPT 5.4's Naming and Migration Strategy

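GPT 5.4's naming convention is generally clearer than Fable 5's, so each node name is compared against it. For the five nodes retained here, the Fable 5 name is shorter or more accurate in every case: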

Fable 5	GPT 5.4 equivalent	Recommended name
`tick`	`tick`	`tick`
`prepare`	`prepare_context`	`prepare`
`select`	`acquisition_gate`	`select` (more accurate — not just acquisition)
`decide`	`llm_plan`	`decide` (shorter, clearer)
`guard`	`decision_policy`	`guard` (shorter, established term)

Adopt GPT 5.4's staged migration: extract one node at a time, starting with tick, then prepare, then rewire.

Adoption Strategy

Phase 0: Extract pure helper functions from plan_node without changing graph topology (all proposals agree on this). Because the graph is untouched, this is the lowest-risk step and should land first.

Phase 1: Add tick node with START → tick → plan edge. All other nodes still route through plan. Verify tests pass.

Phase 2: Add prepare node. Route tick → prepare → plan. Verify tests pass.

Phase 3: Add select node between prepare and plan. Add decision_origin state field. Verify tests pass.

Phase 4: Extract guard from plan_node post-LLM corrections. Rewire: select → plan → guard where plan becomes decide. Verify tests pass.

Phase 5: Bump planner thread namespace to v2. Remove old plan_node. Update docs.

Each phase is a separate PR. Each phase has its own test updates. The decision_origin field is added in Phase 3; together with llm_failed from Fable 5's proposal, it is one of only two new state fields.

What NOT to Adopt

Separate region/currency nodes (Qwen 3.6, Qwen 3.7 Max, Kimi 2.6, Opus, GPT 5.5): Up to four new nodes for two simple conditionals. Keeping the inline check in tick preserves current behavior and is simpler and less error-prone.
_route transient state field (Qwen 3.7 Max): Anti-pattern. Routing functions should read existing state, not a throwaway routing tag.
Separate corrector nodes (Opus's c_* chain): Six extra graph hops per iteration for deterministic functions that run in microseconds. Consolidate into guard.
route_finish_check as a node (Qwen 3.6): Auto-finish logic belongs in select as a deterministic routing decision, not a separate node between plan and finish.
PreCheckResult dataclass for routing (GLM 5.1): Overengineered. Use decision_origin + decision.action as the routing signals, not a custom result type.

Summary Table

Rank	Proposal	Nodes	Best At	Weakest At	Verdict
1	Fable 5	5	Correctness, decomposition, `decision_origin` insight	`select` has possible LLM call (blocked-field decomposition)	Adopt
2	DeepSeek 4 Pro	4	Migration plan, line precision	`prepare` mixes calculator gates with schema composition	Strong runner-up
3	GPT 5.4	6	Naming, anti-patterns, migration staging	Six hops per iteration	Adopt naming + migration strategy
4	Opus	10+	Analysis depth, line mapping	Too many nodes, 6 corrector hops	Use as reference, not as architecture
5	GLM 5.1	2	Lowest risk, conservatism	Doesn't solve the problem enough	Starting approach only
6	GPT 5.5	9	Thoroughness, state design	Over-granular bootstrap split	Some ideas worth borrowing
7	Kimi 2.6	9	Decision matrix	`route_direct` too small for node, separate interrupt nodes	Partially useful
8	Qwen 3.6	6	`decompose` isolation, `route_finish_check`	Business logic in routing function	Routing anti-pattern
9	Qwen 3.7	10	Pseudocode completeness	`_route` anti-pattern, too many `check_` nodes	Do not adopt
10	Gemini 3.1	3	Simplicity	No post-LLM handling, too shallow	Do not adopt
11	Mimo 2.5	3	Summary table	LOC estimates wrong, `decide` still too complex	Do not adopt