Planner Graph Refactor Proposals: Ranked Analysis

Author: GLM-5.1 (Sisyphus) evaluation
Date: 2026-06-11
Scope: All 11 proposals in docs/planner-graph-ref/proposals/
Source: plan_node() in src/venturescope/planner/agent.py:846-1192 (~350 LOC)


Evaluation Criteria

| Criterion | Weight | Description |
|---|---|---|
| Graph readability | High | Does the Mermaid diagram honestly represent control flow? |
| Testability | High | Can each node be tested in isolation without LLM mocking? |
| Migration risk | High | Can the change be staged safely? Does it minimize checkpoint churn? |
| Correctness | Critical | Does it preserve all current behavior (all 15+ early-return paths)? |
| Decomposition quality | High | Right granularity: neither too many nodes nor too few? |
| State surface | Medium | How many new state fields? Are they transient or persistent? |
| Post-LLM handling | High | Are decision correctors well-organized? |
| Naming clarity | Low | Are node names intuitive and consistent? |

Ranked Proposals

1. Fable 5 — Best Overall

Nodes: tick → prepare → select → decide → guard

Aspect Assessment
Decomposition ★★★★★ — Five nodes is the sweet spot. Each has one clear job: tick (iteration + early-exit guards), prepare (schema composition + optional decomposition LLM), select (deterministic decision: calculator gates, acquisition fast-path, auto-finish), decide (pure LLM planner call), guard (all correctors).
Correctness ★★★★★ — decision_origin state field ("deterministic" | "llm") is the key insight. It lets guard apply only the relevant correctors: deterministic decisions get _adjust_calculation_decision only, while LLM decisions get the full pipeline (redirect, search cap, ask cap, calculator adjust). This matches current behavior precisely.
Testability ★★★★★ — Each node is independently testable. tick makes no LLM calls, and select makes none apart from the blocked-field decomposition call noted under Weakness. decide holds the only planner LLM call. guard is a pure transformation on a PlannerDecision.
Migration ★★★★☆ — Three-phase plan (extract → rewire → docs) is solid. Phase 1 keeps plan_node as thin wrapper. Checkpoint namespace bump (planner:v2) acknowledged.
State surface ★★★★★ — Only 2 new fields: decision_origin and llm_failed, both runtime-only and serializer-compatible. No schema changes.
Post-LLM ★★★★★ — All correctors consolidated in guard, keyed by decision_origin. This is the cleanest post-LLM handling of any proposal.
Weakness select contains a possible LLM call (blocked-field decomposition, lines 960-968). The proposal acknowledges this as a follow-up extraction. Also, tick handles region/currency bootstrap questions, which conceptually feel different from iteration guards — but the proposal explicitly notes that these decisions bypass guard, matching current behavior.

Key Insight: The decision_origin field is the single most important architectural contribution across all proposals. It solves the "which correctors apply?" problem without branching by node path.
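To make the decision_origin idea concrete, here is a minimal sketch of a guard that picks its corrector list from the origin tag. The PlannerDecision stand-in, the corrector bodies, the counters (searches, asks) and the "finish" fallback are all assumptions for illustration; only the split between "deterministic" and "llm" and the corrector order come from the proposal as summarized above.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class PlannerDecision:
    """Hypothetical stand-in for the real PlannerDecision type."""
    action: str
    reason: str = ""


Corrector = Callable[[PlannerDecision, dict], PlannerDecision]


def redirect(decision: PlannerDecision, state: dict) -> PlannerDecision:
    # Placeholder for the redirect correctors; the real rules live in agent.py.
    return decision


def search_cap(decision: PlannerDecision, state: dict) -> PlannerDecision:
    # Assumed rule: once the search budget is spent, stop searching.
    if decision.action == "search" and state.get("searches", 0) >= state.get("max_searches", 3):
        return replace(decision, action="finish", reason="search cap reached")
    return decision


def ask_cap(decision: PlannerDecision, state: dict) -> PlannerDecision:
    # Assumed rule: once the ask budget is spent, stop asking.
    if decision.action == "ask_user" and state.get("asks", 0) >= state.get("max_asks", 2):
        return replace(decision, action="finish", reason="ask cap reached")
    return decision


def adjust_calculation(decision: PlannerDecision, state: dict) -> PlannerDecision:
    # Placeholder for _adjust_calculation_decision.
    return decision


# Deterministic decisions get the calculator adjustment only;
# LLM decisions get the full pipeline, in the order listed in the review.
CORRECTORS: Dict[str, Tuple[Corrector, ...]] = {
    "deterministic": (adjust_calculation,),
    "llm": (redirect, search_cap, ask_cap, adjust_calculation),
}


def guard(state: dict) -> dict:
    """Apply only the correctors that match the decision's origin."""
    decision = state["decision"]
    for corrector in CORRECTORS[state["decision_origin"]]:
        decision = corrector(decision, state)
    return {"decision": decision}
```

With this shape, the same capped "search" decision is rewritten when it came from the LLM and left alone when it came from a deterministic path, so the choice of correctors does not depend on which node path the decision took.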


2. DeepSeek 4 Pro — Strong Runner-up

Nodes: guard → prepare → plan → adjust

Aspect Assessment
Decomposition ★★★★☆ — Four nodes, cleanly delineated. prepare handles both schema composition AND calculator gates, which mixes two concerns, but the resulting node size (~50 LOC) is manageable.
Correctness ★★★★★ — Explicit line-by-line mapping (846-887, 889-1035, 1037-1088, 1090-1192). Every early-return path accounted for.
Testability ★★★★☆ — Good, but prepare mixes deterministic gates (calculator cap/success) with schema transformation, making it a slightly harder unit test target.
Migration ★★★★★ — Best migration plan of all proposals. Four phases with clear steps. Phase 1 is pure extraction (no graph change). Phase 2 rewires. Phase 3 deletes old code. Phase 4 verifies.
State surface ★★★★☆ — Adds recipes to state for caching. Small addition, but introduces a cache-invalidation concern: recipes must stay synchronized with dynamic_decompositions.
Post-LLM ★★★★☆ — adjust consolidates all correctors. Clean, but doesn't distinguish which correctors apply to deterministic vs. LLM decisions. In the current code, _adjust_calculation_decision applies to acquisition fast-path decisions AND LLM decisions, while search/ask caps only apply to LLM decisions. The proposal's adjust would need to handle this distinction internally.
Weakness Mixing calculator gates into prepare means the node has two different conceptual responsibilities: "compose the schema" and "is the calculator done?". These could drift apart in future refactoring.

3. GPT 5.4 — Best Principles, Good Naming

Nodes: tick → bootstrap_gate → prepare_context → acquisition_gate → llm_plan → decision_policy

Aspect Assessment
Decomposition ★★★★☆ — Six nodes with clear single responsibilities. bootstrap_gate is a cleaner split for region/currency than lumping into tick.
Correctness ★★★★★ — Explicit anti-patterns section ("do not move state mutation into edge functions", "do not increment iterations in every gate") shows deep understanding of LangGraph constraints.
Testability ★★★★★ — Each node is small and focused. tick has zero LLM, zero schema, zero decisions (only iteration + event).
Migration ★★★★★ — Six-stage migration (one node per stage) is the safest incremental approach of any proposal. Each stage can ship independently.
Post-LLM ★★★★☆ — decision_policy consolidates all correctors. Same concern as DeepSeek: no decision_origin tracking. The proposal says "one decision_policy node is enough; splitting every rewrite into separate nodes would add graph noise without improving clarity." This is correct.
Naming ★★★★★ — Best node naming across all proposals. tick, bootstrap_gate, prepare_context, acquisition_gate, llm_plan, decision_policy — each name immediately tells you what the node does.
Weakness Six nodes means six graph hops per iteration in the worst case, which increases checkpoint writes. The proposal acknowledges this but doesn't quantify the latency impact. Also, bootstrap_gate routing to ask_user directly (not through a decision-pipeline) means region/currency decisions skip decision_policy, which is correct — but needs explicit documentation.

4. Opus — Most Detailed Analysis, Too Fine-Grained

Nodes: g_iter_cap → g_region → emit_region → g_currency → emit_currency → enrich_schema → g_calc_caps → emit_calc_abort → emit_calc_done → acquire → plan_llm → c_target_decompose → c_redirect_derived → c_redirect_web → c_search_cap → c_ask_cap → c_calc_adjust → dispatch

Aspect Assessment
Decomposition ★★★☆☆ — Each concern has its own node, which is maximally granular. But the resulting graph has 10+ nodes in the hot path per iteration, which means 10+ LangGraph checkpoint writes per planner tick. This is measurably worse for a chat-paced agent that already has latency concerns.
Correctness ★★★★★ — The line-by-line mapping table (§4) is the most meticulous of all proposals. Every line range is accounted for. The g_/c_/emit_ naming convention is systematic.
Testability ★★★★★ — Maximally testable. Each gate and corrector is a pure function.
Migration ★★★☆☆ — Three-step plan is too coarse for 10+ new nodes. Step 2 ("hoist gates into graph") introduces many nodes simultaneously, which is higher risk.
Post-LLM ★★★★★ — Every corrector is its own node. dispatch serves as the single logging/routing point. This is the most explicit and auditable post-LLM pipeline.
Weakness The c_* corrector chain (c_target_decompose → c_redirect_derived → c_redirect_web → c_search_cap → c_ask_cap → c_calc_adjust) adds 6 sequential hops after every LLM decision. These are all deterministic mutations on a single PlannerDecision object. Making each a separate node means 6 extra PostgresSaver writes per iteration for operations that could be a single function call. The proposal acknowledges this concern but dismisses it. In practice, for a chat-paced agent, these writes add up. Recommendation: Use this proposal's line-by-line mapping as a reference, but consolidate the c_* chain into a single guard or decision_policy node.
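One way to consolidate the c_* chain without losing its auditability is to run the correctors inside a single function and record which of them changed the decision. The sketch below assumes a simplified PlannerDecision and two made-up cap rules; the corrector names follow the Opus proposal, but their bodies are placeholders, not the real logic.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class PlannerDecision:
    """Hypothetical stand-in for the real PlannerDecision type."""
    action: str


Corrector = Callable[[PlannerDecision, dict], PlannerDecision]


def c_search_cap(decision: PlannerDecision, state: dict) -> PlannerDecision:
    # Assumed rule: a capped search falls back to asking the user.
    if decision.action == "web_search" and state["searches"] >= state["max_searches"]:
        return replace(decision, action="ask_user")
    return decision


def c_ask_cap(decision: PlannerDecision, state: dict) -> PlannerDecision:
    # Assumed rule: a capped ask falls back to finishing.
    if decision.action == "ask_user" and state["asks"] >= state["max_asks"]:
        return replace(decision, action="finish")
    return decision


def apply_correctors(
    decision: PlannerDecision,
    state: dict,
    correctors: Sequence[Corrector],
) -> Tuple[PlannerDecision, List[str]]:
    """Run correctors in order; return the final decision and the names of those that changed it."""
    applied: List[str] = []
    for corrector in correctors:
        updated = corrector(decision, state)
        if updated != decision:
            applied.append(corrector.__name__)
        decision = updated
    return decision, applied
```

The returned list of corrector names can be logged or emitted as an event from the single node, which keeps the pipeline traceable with one graph hop and one checkpoint write instead of six.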

5. GLM 5.1 — Most Conservative, Safest Starting Point

Nodes: pre_check → acquire_or_plan

Aspect Assessment
Decomposition ★★★☆☆ — Two nodes is the minimum viable decomposition. pre_check (~150 LOC) still has 7 exit paths. acquire_or_plan (~200 LOC) still has sequential LLM + correctors. The graph topology barely changes.
Correctness ★★★★★ — The PreCheckResult dataclass for routing is explicit. The proposal acknowledges that acquire_or_plan is still substantial but "reads top-to-bottom without early returns for guard conditions."
Testability ★★★☆☆ — pre_check is testable without LLM. acquire_or_plan still requires LLM mocking for full testing.
Migration ★★★★★ — Lowest risk of all proposals. Phase 1 (extract pre_check) is trivial. Phase 2 (rename). Phase 3 (extract _build_plan_output helper). Phase 4 optional validate_decision extraction.
State surface ★★★★★ — Proposes _pre_check_route as a routing tag OR extending the Action literal. Recommends the routing tag to keep Action clean. Good principle.
Weakness The two-node split doesn't significantly improve graph readability: the diagram still shows a large acquire_or_plan node with hidden routing. The PreCheckResult dataclass is overengineered; a simple string literal or the existing PlannerDecision.action would suffice. The fallback labeled "Option B: Extract validate_decision later" is wise.

Role in final recommendation: Use as the starting approach for migration, not the end state. Phase 1 extraction of pre_check is a safe first step that all proposals agree on.


6. GPT 5.5 — Thorough but Over-Specified

Nodes: enter_iteration → prepare_schema → require_region → require_currency → acquisition_gate → plan → normalize_decision → retry_gate → maybe_calculate

Aspect Assessment
Decomposition ★★★☆☆ — Nine nodes in the main path. require_region and require_currency as separate nodes from enter_iteration is over-granular — these are one-condition checks that could be a single bootstrap_gate. maybe_calculate naming is unclear (it's actually calculator adjustment, not "maybe calculate").
Correctness ★★★★★ — Very precise about which lines map to which node. Explicit about state changes (llm_failed, decision_origin).
Testability ★★★★☆ — Each node is testable. But require_region/require_currency as separate nodes adds unnecessary test surface.
Migration ★★★★☆ — Good step-by-step approach. Step 1 (extract pure helpers first, no graph change) is the safest possible start.
Post-LLM ★★★★☆ — normalize_decision + retry_gate + maybe_calculate splits the corrector pipeline into three. normalize_decision handles redirects, retry_gate handles caps, maybe_calculate handles calculator. This is a defensible split.
Weakness Nine nodes with names like maybe_calculate and retry_gate require learning the naming convention. The require_region/require_currency split into separate nodes and separate observe_region/observe_currency handler nodes is a nice idea in theory (making bootstrap questions explicit in the graph) but in practice it creates 4 new nodes for what is currently 2 inline conditionals. The cost/benefit isn't justified.

7. Kimi 2.6 — Good Decision Matrix, Odd Node Choices

Nodes: tick → ask_region → ask_currency → prepare → calc_gate → acquire → route_direct → decide → enforce_policy

Aspect Assessment
Decomposition ★★★☆☆ — The calc_gate → acquire → route_direct chain is oddly organized. calc_gate routes to acquire, which then routes to route_direct or decide. But route_direct is described as "a tiny adapter node" that just converts an AcquisitionTask to a PlannerDecision and applies _adjust_calculation_decision. This is a ~5 line function, not a graph node.
Correctness ★★★★☆ — Good line-by-line mapping in the appendix. However, route_direct feeds into enforce_policy, which means acquisition-fast-path decisions also go through search/ask cap enforcement. In current code, acquisition decisions skip search/ask caps (they are deterministic). This may be a subtle behavioral change.
Testability ★★★★☆ — Good testability per node.
Naming ★★★☆☆ — route_direct and enforce_policy are fine, but tick and calc_gate are inconsistent in formality.
Weakness ask_region/ask_currency as separate interrupt nodes diverges from the current ask_user/observe_user pattern. This creates new interrupt nodes that need their own handler logic, rather than reusing the existing ask_user node. The route_direct node is too small to justify its own graph node.

8. Qwen 3.6 Plus — Good Ideas, Flawed Routing

Nodes: preflight → ask_region → ask_currency → decompose → plan → route_finish_check

Aspect Assessment
Decomposition ★★★☆☆ — Six nodes is reasonable. decompose as a separate node is a good idea (isolating the LLM decomposition call). route_finish_check as a separate validation node before finish is interesting.
Correctness ★★☆☆☆ — Critical flaw: The proposal moves redirect logic (_redirect_derived_direct_decision, _redirect_premature_ask_for_web_field, _adjust_calculation_decision) and cap enforcement into route_after_plan as a routing function. LangGraph routing functions should be pure state readers that return a node name. Putting business logic (decision rewriting) inside a routing function is an anti-pattern that violates the "routing reads, nodes write" principle. This would make the routing function untestable in isolation and confusing to maintain.
Testability ★★☆☆☆ — Due to the routing function anti-pattern, testing the redirect logic requires routing function invocation, which is less direct than testing a node.
Weakness Creating observe_region/observe_currency as separate handler nodes for bootstrap questions means the existing observe_user node logic must be duplicated or refactored to handle these cases. This increases maintenance surface for a marginal gain in diagram clarity.
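The "routing reads, nodes write" principle can be sketched as follows: the decision rewrite lives in a node that returns a state update, and the routing function only maps the resulting state to a node name. The web_answerable flag and the node names are hypothetical; the rewrite stands in for a corrector such as _redirect_premature_ask_for_web_field. In LangGraph, a function shaped like route_after_guard is what would be passed to add_conditional_edges.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PlannerDecision:
    """Hypothetical stand-in for the real PlannerDecision type."""
    action: str


def guard_node(state: dict) -> dict:
    """Node: the only place where the decision is rewritten."""
    decision = state["decision"]
    # Assumed rule: a premature ask for a web-answerable field becomes a search.
    if decision.action == "ask_user" and state.get("web_answerable", False):
        decision = replace(decision, action="web_search")
    return {"decision": decision}


def route_after_guard(state: dict) -> str:
    """Routing function: reads state and returns a node name, nothing else."""
    return state["decision"].action
```

Because the rewrite is a plain node function, it can be unit-tested by calling guard_node with a small state dict, and the routing function stays trivial enough that it needs little testing of its own.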

9. Qwen 3.7 Max — Verbose, _route Anti-Pattern

Nodes: check_termination → check_region_currency → compose_schema → check_calculator → acquisition_routing → check_completion → llm_decide → post_process → enforce_caps → route_decision

Aspect Assessment
Decomposition ★★★☆☆ — 10 decision nodes is too many for the hot path. Every planner iteration traverses at least 7 of these nodes, adding checkpoint overhead.
Correctness ★★☆☆☆ — Anti-pattern: Every node writes a _route transient state field. This contradicts LangGraph's design where state should be meaningful, not a routing scratchpad. Proper LangGraph routing uses conditional edge functions that read existing state fields, not a throwaway _route field written by the previous node. If _route is accidentally persisted or not cleared, it becomes a stale routing hint.
Testability ★★☆☆☆ — The _route pattern means every test must set _route on input state, making tests verbose and fragile.
Naming ★★☆☆☆ — check_termination, check_region_currency, check_calculator, check_completion — the check_ prefix is redundant. compose_schema and post_process are vague.
Weakness The proposal includes complete pseudocode for every node, which is admirable for documentation but the pseudocode doesn't match the real implementation (e.g., _llm_failed handling is oversimplified). The _route field pattern is a design anti-pattern that should not be adopted. Also proposes ask_region/ask_currency as separate interrupt nodes (same issue as Kimi).
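The stale-hint risk of a _route field is easy to reproduce in a few lines. The sketch below uses plain dicts as state and hypothetical node names; it compares a router that trusts a leftover _route tag with one that reads the current decision.

```python
def route_by_tag(state: dict) -> str:
    """Router that trusts a scratchpad field written by the previous node."""
    return state.get("_route", "llm_decide")


def route_by_state(state: dict) -> str:
    """Router that derives the next node from state that is meaningful on its own."""
    decision = state.get("decision")
    return decision["action"] if decision else "llm_decide"


# State as it might look after a resume: _route was written during an earlier
# iteration and never cleared, while the current decision says "finish".
resumed = {"_route": "ask_user", "decision": {"action": "finish"}}

stale_target = route_by_tag(resumed)      # follows the leftover tag
current_target = route_by_state(resumed)  # follows the actual decision
```

Here the tag-based router sends the graph to ask_user even though the planner has decided to finish, which is the failure mode described above.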

10. Gemini 3.1 Pro — Too Shallow

Nodes: prepare_state → evaluate_rules → llm_plan

Aspect Assessment
Decomposition ★★☆☆☆ — Three nodes is insufficient. evaluate_rules (~100 LOC) combines calculator gates, acquisition logic, AND auto-finish into one node. llm_plan still has all post-LLM corrections. The graph topology barely changes from the current monolith.
Correctness ★★★☆☆ — No line mapping. No explicit handling of post-LLM corrections. The proposal says "you can now test business rules natively as pure python functions" but doesn't specify how evaluate_rules handles the calculator-blocked → acquisition task path, which includes LLM calls for dynamic decomposition.
Testability ★★★☆☆ — prepare_state and evaluate_rules are testable. But llm_plan still has 100+ LOC of correctors.
Weakness The 3-node decomposition doesn't justify the migration risk. The graph diagram shows evaluate_rules routing to 5 destinations, but the routing logic is still hidden inside the node (same problem as current plan_node, just smaller). No post-LLM handling specification.

11. Mimo 2.5 Pro — Minimal but Insufficient

Nodes: guards → acquire → decide

Aspect Assessment
Decomposition ★★☆☆☆ — Three nodes. guards (~40 LOC) is fine. acquire (~60 LOC) is fine. But decide (~80 LOC) still has all correctors, all cap enforcement, and _adjust_calculation_decision crammed in. The stated benefit says decide is "the most complex node," which means the decomposition hasn't actually simplified the hardest part.
Correctness ★★★☆☆ — The line estimates (40/60/80 LOC) don't add up to 350. The acquire node handles decomposition, composition, calculator gates, AND acquisition task selection — that's not ~60 lines. The decide node handles LLM + 5 correctors + cap enforcement + logging, which is well over 80 lines.
Testability ★★★☆☆ — guards and acquire are testable. decide still requires LLM mocking for full testing.
Weakness The proposal has the best summary table (Before/After metrics) but the numbers don't match the actual code complexity. The open questions at the end reveal the key uncertainty: should decide include cap enforcement or should that be a separate node? This is the question that Fable 5 answers definitively with decision_origin.
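The size of the gap in Mimo's estimates can be checked with simple arithmetic. The line range comes from the Source field of this document and the per-node figures from the proposal. The raw span includes blank lines and comments, so the result is a rough upper bound on the gap, not an exact count.

```python
# plan_node() spans lines 846-1192 of agent.py (inclusive).
first_line, last_line = 846, 1192
source_loc = last_line - first_line + 1

# Per-node estimates quoted from the Mimo 2.5 Pro proposal.
estimates = {"guards": 40, "acquire": 60, "decide": 80}
estimated_total = sum(estimates.values())

unaccounted = source_loc - estimated_total
coverage = estimated_total / source_loc

print(f"source: {source_loc} lines, estimated: {estimated_total}, "
      f"unaccounted: {unaccounted} ({coverage:.0%} covered)")
```

The three estimates cover roughly half of the original function, which supports the point that acquire and decide must be larger than stated.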

Recommendation

Primary: Adopt Fable 5's Architecture

Target topology:

START → tick → prepare → select → decide → guard → [action nodes]
             ↘ (early exit) → finish      ↑         ↗
                                            └─ [action nodes loop back to tick]

Why Fable 5:

  1. decision_origin is the key insight. No other proposal cleanly solves the problem of "which correctors apply to which decisions." The "deterministic" vs. "llm" tag lets guard apply _adjust_calculation_decision to both paths (matching current behavior) but search/ask caps only to LLM decisions (also matching current behavior). This is precise, minimal, and correct.

  2. Five nodes is the right granularity. Four or fewer means one node still has >100 LOC. Ten or more means excessive checkpoint overhead per iteration. Five hits the sweet spot where each node has a clear single responsibility and the graph diagram is readable.

  3. Migration is safe. Phase 1 (extract helpers → keep plan_node as thin wrapper) is risk-free. Phase 2 (rewire graph) is a single breaking change to the graph topology. Phase 3 (docs update) is mechanical.

  4. select consolidates all deterministic routing. Calculator gates, acquisition tasks, and auto-finish (all the "we know what to do without LLM" logic) live in one place. Once the blocked-field decomposition call is extracted as a follow-up, this becomes the most testable unit, with no LLM mocking needed.

  5. guard is a single pass-through for correctors. Instead of the six separate corrector nodes in the Opus proposal, guard applies correctors sequentially based on decision_origin. This is one function call per corrector, one graph hop, and one checkpoint write, not six.

Secondary: Incorporate GPT 5.4's Naming and Migration Strategy

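GPT 5.4 has the clearest naming convention overall, so each Fable 5 node name is checked against its GPT 5.4 equivalent. In all five cases the shorter Fable 5 name holds up: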

| Fable 5 | GPT 5.4 equivalent | Recommended name |
|---|---|---|
| tick | tick | tick |
| prepare | prepare_context | prepare |
| select | acquisition_gate | select (more accurate: not just acquisition) |
| decide | llm_plan | decide (shorter, clearer) |
| guard | decision_policy | guard (shorter, established term) |

Adopt GPT 5.4's staged migration: extract one node at a time, starting with tick, then prepare, then rewire.

Adoption Strategy

Phase 0: Extract pure helper functions from plan_node without changing graph topology (all proposals agree on this). This is risk-free and should land first.

Phase 1: Add tick node with START → tick → plan edge. All other nodes still route through plan. Verify tests pass.

Phase 2: Add prepare node. Route tick → prepare → plan. Verify tests pass.

Phase 3: Add select node between prepare and plan. Add decision_origin state field. Verify tests pass.

Phase 4: Extract guard from plan_node post-LLM corrections. Rewire: select → plan → guard where plan becomes decide. Verify tests pass.

Phase 5: Bump planner thread namespace to v2. Remove old plan_node. Update docs.

Each phase is a separate PR with its own test updates. The only new state fields are decision_origin (added in Phase 3) and llm_failed, the two runtime-only fields from Fable 5's proposal.
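Phase 0 and the start of Phase 1 can be pictured with a small sketch: the iteration guard moves into a pure tick helper while plan_node stays in the graph as a thin wrapper. The state keys, the cap comparison, and the plan_rest callable (standing in for the not-yet-extracted remainder of plan_node) are assumptions for illustration, not the real signatures.

```python
from typing import Callable


def tick(state: dict) -> dict:
    """Iteration bookkeeping plus the iteration-cap early exit (no LLM, no schema)."""
    iterations = state.get("iterations", 0) + 1
    update = {"iterations": iterations}
    # Assumed cap semantics; the real comparison may differ.
    if iterations > state.get("max_iterations", 10):
        update["decision"] = {"action": "finish", "reason": "iteration cap"}
    return update


def plan_node(state: dict, plan_rest: Callable[[dict], dict]) -> dict:
    """Thin wrapper: same graph topology, but the guard logic now lives in tick()."""
    update = tick(state)
    if "decision" in update:
        return update  # early exit decided by tick
    merged = {**state, **update}
    return {**update, **plan_rest(merged)}
```

Since tick takes and returns plain dicts, it can be tested before any graph rewiring happens, and promoting it to its own node in Phase 1 changes only the wiring, not the function.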

What NOT to Adopt

  • Separate region/currency nodes (Qwen 3.6, Qwen 3.7 Max, Kimi 2.6, Opus, GPT 5.5): Four new nodes for two simple conditionals. Keeping the inline check in tick preserves current behavior and is simpler and less error-prone.
  • _route transient state field (Qwen 3.7 Max): Anti-pattern. Routing functions should read existing state, not a throwaway routing tag.
  • Separate corrector nodes (Opus's c_* chain): Six extra graph hops per iteration for deterministic functions that run in microseconds. Consolidate into guard.
  • route_finish_check as a node (Qwen 3.6): Auto-finish logic belongs in select as a deterministic routing decision, not a separate node between plan and finish.
  • PreCheckResult dataclass for routing (GLM 5.1): Overengineered. Use decision_origin + decision.action as the routing signals, not a custom result type.

Summary Table

| Rank | Proposal | Nodes | Best At | Weakest At | Verdict |
|---|---|---|---|---|---|
| 1 | Fable 5 | 5 | Correctness, decomposition, decision_origin insight | select has possible LLM call (blocked-field decomposition) | Adopt |
| 2 | DeepSeek 4 Pro | 4 | Migration plan, line precision | prepare mixes calculator gates with schema composition | Strong runner-up |
| 3 | GPT 5.4 | 6 | Naming, anti-patterns, migration staging | Six hops per iteration | Adopt naming + migration strategy |
| 4 | Opus | 10+ | Analysis depth, line mapping | Too many nodes, 6 corrector hops | Use as reference, not as architecture |
| 5 | GLM 5.1 | 2 | Lowest risk, conservatism | Doesn't solve the problem enough | Starting approach only |
| 6 | GPT 5.5 | 9 | Thoroughness, state design | Over-granular bootstrap split | Some ideas worth borrowing |
| 7 | Kimi 2.6 | 9 | Decision matrix | route_direct too small for node, separate interrupt nodes | Partially useful |
| 8 | Qwen 3.6 Plus | 6 | decompose isolation, route_finish_check | Business logic in routing function | Routing anti-pattern |
| 9 | Qwen 3.7 Max | 10 | Pseudocode completeness | _route anti-pattern, too many check_ nodes | Do not adopt |
| 10 | Gemini 3.1 Pro | 3 | Simplicity | No post-LLM handling, too shallow | Do not adopt |
| 11 | Mimo 2.5 Pro | 3 | Summary table | LOC estimates wrong, decide still too complex | Do not adopt |