Planner Graph Refactor Proposals: Ranked Analysis
Author: GLM-5.1 (Sisyphus) evaluation
Date: 2026-06-11
Scope: All 11 proposals in docs/planner-graph-ref/proposals/
Source: plan_node() in src/venturescope/planner/agent.py:846-1192 (~350 LOC)
Evaluation Criteria
| Criterion | Weight | Description |
|---|---|---|
| Graph readability | High | Does the Mermaid diagram honestly represent control flow? |
| Testability | High | Can each node be tested in isolation without LLM mocking? |
| Migration risk | High | Can the change be staged safely? Does it minimize checkpoint churn? |
| Correctness | Critical | Does it preserve all current behavior (all 15+ early-return paths)? |
| Decomposition quality | High | Right granularity — neither too many nodes nor too few? |
| State surface | Medium | How many new state fields? Are they transient or persistent? |
| Post-LLM handling | High | Are decision correctors well-organized? |
| Naming clarity | Low | Are node names intuitive and consistent? |
Ranked Proposals
1. Fable 5: Best Overall
Nodes: tick → prepare → select → decide → guard
| Aspect | Assessment |
|---|---|
| Decomposition | ★★★★★ — Five nodes is the sweet spot. Each has one clear job: tick (iteration + early-exit guards), prepare (schema composition + optional decomposition LLM), select (deterministic decision: calculator gates, acquisition fast-path, auto-finish), decide (pure LLM planner call), guard (all correctors). |
| Correctness | ★★★★★ — decision_origin state field ("deterministic" | "llm") is the key insight. It lets guard apply only the relevant correctors: deterministic decisions get _adjust_calculation_decision only, while LLM decisions get the full pipeline (redirect, search cap, ask cap, calculator adjust). This matches current behavior precisely. |
| Testability | ★★★★★: Each node is independently testable. tick makes no LLM calls, and select makes none outside the blocked-field decomposition path noted under Weakness. decide is the only planner LLM call. guard is a pure transformation on a PlannerDecision. |
| Migration | ★★★★☆ — Three-phase plan (extract → rewire → docs) is solid. Phase 1 keeps plan_node as thin wrapper. Checkpoint namespace bump (planner:v2) acknowledged. |
| State surface | ★★★★★ — Only 2 new fields: decision_origin and llm_failed, both runtime-only and serializer-compatible. No schema changes. |
| Post-LLM | ★★★★★ — All correctors consolidated in guard, keyed by decision_origin. This is the cleanest post-LLM handling of any proposal. |
| Weakness | select contains a possible LLM call (blocked-field decomposition, lines 960-968). The proposal acknowledges this as a follow-up extraction. Also, tick handles region/currency bootstrap questions, which conceptually feel different from iteration guards — but the proposal explicitly notes that these decisions bypass guard, matching current behavior. |
Key Insight: The decision_origin field is the single most important architectural contribution across all proposals. It solves the "which correctors apply?" problem without branching by node path.
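A minimal sketch of how a guard node could select correctors by decision_origin. The PlannerDecision shape, the MAX_SEARCHES cap, the state keys, and both corrector bodies are illustrative assumptions, not the real implementation; only two of the correctors named above are stubbed, and the redirect and ask-cap steps are omitted for brevity.

```python
from dataclasses import dataclass, replace
from typing import Callable, Literal

DecisionOrigin = Literal["deterministic", "llm"]


@dataclass(frozen=True)
class PlannerDecision:
    # Hypothetical shape; the real PlannerDecision carries more fields.
    action: str
    reason: str = ""


Corrector = Callable[[PlannerDecision, dict], PlannerDecision]

MAX_SEARCHES = 3  # assumed cap, for illustration only


def cap_searches(decision: PlannerDecision, state: dict) -> PlannerDecision:
    """Stand-in for the search cap: rewrite a search once the cap is reached."""
    if decision.action == "search" and state.get("search_count", 0) >= MAX_SEARCHES:
        return replace(decision, action="ask_user", reason="search cap reached")
    return decision


def adjust_calculation(decision: PlannerDecision, state: dict) -> PlannerDecision:
    """Stand-in for _adjust_calculation_decision (real logic not shown)."""
    if decision.action == "finish" and state.get("calculation_pending", False):
        return replace(decision, action="calculate", reason="calculation pending")
    return decision


# Deterministic decisions get the calculator adjustment only;
# LLM decisions get the caps first, then the calculator adjustment.
CORRECTORS: dict[str, tuple[Corrector, ...]] = {
    "deterministic": (adjust_calculation,),
    "llm": (cap_searches, adjust_calculation),
}


def guard(state: dict) -> dict:
    """Apply the correctors selected by decision_origin in one graph hop."""
    decision = state["decision"]
    for corrector in CORRECTORS[state["decision_origin"]]:
        decision = corrector(decision, state)
    return {"decision": decision}
```

Keeping the origin-to-corrector mapping in one table means a new corrector is added in a single place, and the "which correctors apply?" question can be answered by reading CORRECTORS rather than tracing node paths.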
2. DeepSeek 4 Pro: Strong Runner-up
Nodes: guard → prepare → plan → adjust
| Aspect | Assessment |
|---|---|
| Decomposition | ★★★★☆ — Four nodes, cleanly delineated. prepare handles both schema composition AND calculator gates, which mixes two concerns, but the resulting node size (~50 LOC) is manageable. |
| Correctness | ★★★★★ — Explicit line-by-line mapping (846-887, 889-1035, 1037-1088, 1090-1192). Every early-return path accounted for. |
| Testability | ★★★★☆ — Good, but prepare mixes deterministic gates (calculator cap/success) with schema transformation, making it a slightly harder unit test target. |
| Migration | ★★★★★ — Best migration plan of all proposals. Four phases with clear steps. Phase 1 is pure extraction (no graph change). Phase 2 rewires. Phase 3 deletes old code. Phase 4 verifies. |
| State surface | ★★★★☆ — Adds recipes to state for caching. Small addition, but introduces a cache-invalidation concern: recipes must stay synchronized with dynamic_decompositions. |
| Post-LLM | ★★★★☆ — adjust consolidates all correctors. Clean, but doesn't distinguish which correctors apply to deterministic vs. LLM decisions. In the current code, _adjust_calculation_decision applies to acquisition fast-path decisions AND LLM decisions, while search/ask caps only apply to LLM decisions. The proposal's adjust would need to handle this distinction internally. |
| Weakness | Mixing calculator gates into prepare means the node has two different conceptual responsibilities: "compose the schema" and "is the calculator done?". These could drift apart in future refactoring. |
3. GPT 5.4: Best Principles, Good Naming
Nodes: tick → bootstrap_gate → prepare_context → acquisition_gate → llm_plan → decision_policy
| Aspect | Assessment |
|---|---|
| Decomposition | ★★★★☆ — Six nodes with clear single responsibilities. bootstrap_gate is a cleaner split for region/currency than lumping into tick. |
| Correctness | ★★★★★ — Explicit anti-patterns section ("do not move state mutation into edge functions", "do not increment iterations in every gate") shows deep understanding of LangGraph constraints. |
| Testability | ★★★★★ — Each node is small and focused. tick has zero LLM, zero schema, zero decisions (only iteration + event). |
| Migration | ★★★★★ — Six-stage migration (one node per stage) is the safest incremental approach of any proposal. Each stage can ship independently. |
| Post-LLM | ★★★★☆ — decision_policy consolidates all correctors. Same concern as DeepSeek: no decision_origin tracking. The proposal says "one decision_policy node is enough; splitting every rewrite into separate nodes would add graph noise without improving clarity." This is correct. |
| Naming | ★★★★★ — Best node naming across all proposals. tick, bootstrap_gate, prepare_context, acquisition_gate, llm_plan, decision_policy — each name immediately tells you what the node does. |
| Weakness | Six nodes means six graph hops per iteration in the worst case, which increases checkpoint writes. The proposal acknowledges this but doesn't quantify the latency impact. Also, bootstrap_gate routing to ask_user directly (not through a decision-pipeline) means region/currency decisions skip decision_policy, which is correct — but needs explicit documentation. |
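The "do not increment iterations in every gate" rule can be sketched as follows. This is an illustration under assumed state keys (iterations, events, region, currency, pending_question), not the proposal's code: tick is the only function that touches the iteration counter, and the gate returns a partial update that never includes it.

```python
from typing import TypedDict


class PlannerState(TypedDict, total=False):
    # Hypothetical state keys, chosen for this sketch only.
    iterations: int
    events: list[str]
    region: str
    currency: str
    pending_question: str


def tick(state: PlannerState) -> PlannerState:
    """Advance the iteration counter once and emit one event. Nothing else."""
    iteration = state.get("iterations", 0) + 1
    return {"iterations": iteration, "events": [f"planner_tick:{iteration}"]}


def bootstrap_gate(state: PlannerState) -> PlannerState:
    """Ask for region or currency if missing; never touches iterations."""
    if not state.get("region"):
        return {"pending_question": "region"}
    if not state.get("currency"):
        return {"pending_question": "currency"}
    return {}
```

Because each function returns only the keys it owns, a test can assert on the returned update directly without building a full planner state.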
4. Opus: Most Detailed Analysis, Too Fine-Grained
Nodes: g_iter_cap → g_region → emit_region → g_currency → emit_currency → enrich_schema → g_calc_caps → emit_calc_abort → emit_calc_done → acquire → plan_llm → c_target_decompose → c_redirect_derived → c_redirect_web → c_search_cap → c_ask_cap → c_calc_adjust → dispatch
| Aspect | Assessment |
|---|---|
| Decomposition | ★★★☆☆ — Each concern has its own node, which is maximally granular. But the resulting graph has 10+ nodes in the hot path per iteration, which means 10+ LangGraph checkpoint writes per planner tick. This is measurably worse for a chat-paced agent that already has latency concerns. |
| Correctness | ★★★★★ — The line-by-line mapping table (§4) is the most meticulous of all proposals. Every line range is accounted for. The g_/c_/emit_ naming convention is systematic. |
| Testability | ★★★★★ — Maximally testable. Each gate and corrector is a pure function. |
| Migration | ★★★☆☆ — Three-step plan is too coarse for 10+ new nodes. Step 2 ("hoist gates into graph") introduces many nodes simultaneously, which is higher risk. |
| Post-LLM | ★★★★★ — Every corrector is its own node. dispatch serves as the single logging/routing point. This is the most explicit and auditable post-LLM pipeline. |
| Weakness | The c_* corrector chain (c_target_decompose → c_redirect_derived → c_redirect_web → c_search_cap → c_ask_cap → c_calc_adjust) adds 6 sequential hops after every LLM decision. These are all deterministic mutations on a single PlannerDecision object. Making each a separate node means 6 extra PostgresSaver writes per iteration for operations that could be a single function call. The proposal acknowledges this concern but dismisses it. In practice, for a chat-paced agent, these writes add up. Recommendation: Use this proposal's line-by-line mapping as a reference, but consolidate the c_* chain into a single guard or decision_policy node. |
5. GLM 5.1: Most Conservative, Safest Starting Point
Nodes: pre_check → acquire_or_plan
| Aspect | Assessment |
|---|---|
| Decomposition | ★★★☆☆ — Two nodes is the minimum viable decomposition. pre_check (~150 LOC) still has 7 exit paths. acquire_or_plan (~200 LOC) still has sequential LLM + correctors. The graph topology barely changes. |
| Correctness | ★★★★★ — The PreCheckResult dataclass for routing is explicit. The proposal acknowledges that acquire_or_plan is still substantial but "reads top-to-bottom without early returns for guard conditions." |
| Testability | ★★★☆☆ — pre_check is testable without LLM. acquire_or_plan still requires LLM mocking for full testing. |
| Migration | ★★★★★ — Lowest risk of all proposals. Phase 1 (extract pre_check) is trivial. Phase 2 (rename). Phase 3 (extract _build_plan_output helper). Phase 4 optional validate_decision extraction. |
| State surface | ★★★★★ — Proposes _pre_check_route as a routing tag OR extending the Action literal. Recommends the routing tag to keep Action clean. Good principle. |
| Weakness | The two-node split doesn't significantly improve graph readability. The diagram still shows a large acquire_or_plan node with hidden routing. The PreCheckResult dataclass is overengineered: a simple string literal or the existing PlannerDecision.action would suffice. The proposal's fallback, "Option B: Extract validate_decision later", is wise. |
Role in final recommendation: Use as the starting approach for migration, not the end state. Phase 1 extraction of pre_check is a safe first step that all proposals agree on.
6. GPT 5.5: Thorough but Over-Specified
Nodes: enter_iteration → prepare_schema → require_region → require_currency → acquisition_gate → plan → normalize_decision → retry_gate → maybe_calculate
| Aspect | Assessment |
|---|---|
| Decomposition | ★★★☆☆ — Nine nodes in the main path. require_region and require_currency as separate nodes from enter_iteration is over-granular — these are one-condition checks that could be a single bootstrap_gate. maybe_calculate naming is unclear (it's actually calculator adjustment, not "maybe calculate"). |
| Correctness | ★★★★★ — Very precise about which lines map to which node. Explicit about state changes (llm_failed, decision_origin). |
| Testability | ★★★★☆ — Each node is testable. But require_region/require_currency as separate nodes adds unnecessary test surface. |
| Migration | ★★★★☆ — Good step-by-step approach. Step 1 (extract pure helpers first, no graph change) is the safest possible start. |
| Post-LLM | ★★★★☆ — normalize_decision + retry_gate + maybe_calculate splits the corrector pipeline into three. normalize_decision handles redirects, retry_gate handles caps, maybe_calculate handles calculator. This is a defensible split. |
| Weakness | Nine nodes with names like maybe_calculate and retry_gate require learning the naming convention. The require_region/require_currency split into separate nodes and separate observe_region/observe_currency handler nodes is a nice idea in theory (making bootstrap questions explicit in the graph) but in practice it creates 4 new nodes for what is currently 2 inline conditionals. The cost/benefit isn't justified. |
7. Kimi 2.6: Good Decision Matrix, Odd Node Choices
Nodes: tick → ask_region → ask_currency → prepare → calc_gate → acquire → route_direct → decide → enforce_policy
| Aspect | Assessment |
|---|---|
| Decomposition | ★★★☆☆ — The calc_gate → acquire → route_direct chain is oddly organized. calc_gate routes to acquire, which then routes to route_direct or decide. But route_direct is described as "a tiny adapter node" that just converts an AcquisitionTask to a PlannerDecision and applies _adjust_calculation_decision. This is a ~5 line function, not a graph node. |
| Correctness | ★★★★☆ — Good line-by-line mapping in the appendix. However, route_direct feeds into enforce_policy, which means acquisition-fast-path decisions also go through search/ask cap enforcement. In current code, acquisition decisions skip search/ask caps (they are deterministic). This may be a subtle behavioral change. |
| Testability | ★★★★☆ — Good testability per node. |
| Naming | ★★★☆☆ — route_direct and enforce_policy are fine, but tick and calc_gate are inconsistent in formality. |
| Weakness | ask_region/ask_currency as separate interrupt nodes diverges from the current ask_user/observe_user pattern. This creates new interrupt nodes that need their own handler logic, rather than reusing the existing ask_user node. The route_direct node is too small to justify its own graph node. |
8. Qwen 3.6 Plus: Good Ideas, Flawed Routing
Nodes: preflight → ask_region → ask_currency → decompose → plan → route_finish_check
| Aspect | Assessment |
|---|---|
| Decomposition | ★★★☆☆: Six main nodes is reasonable. decompose as a separate node is a good idea (isolating the LLM decomposition call). route_finish_check as a separate validation node before finish is interesting. |
| Correctness | ★★☆☆☆ — Critical flaw: The proposal moves redirect logic (_redirect_derived_direct_decision, _redirect_premature_ask_for_web_field, _adjust_calculation_decision) and cap enforcement into route_after_plan as a routing function. LangGraph routing functions should be pure state readers that return a node name. Putting business logic (decision rewriting) inside a routing function is an anti-pattern that violates the "routing reads, nodes write" principle. This would make the routing function untestable in isolation and confusing to maintain. |
| Testability | ★★☆☆☆ — Due to the routing function anti-pattern, testing the redirect logic requires routing function invocation, which is less direct than testing a node. |
| Weakness | Creating observe_region/observe_currency as separate handler nodes for bootstrap questions means the existing observe_user node logic must be duplicated or refactored to handle these cases. This increases maintenance surface for a marginal gain in diagram clarity. |
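The "routing reads, nodes write" split described in the Correctness row might look like the sketch below. The state layout, the web_fields key, the ROUTES table, and the body of redirect_node are assumptions made for illustration; redirect_node only stands in for a corrector such as _redirect_premature_ask_for_web_field.

```python
from types import MappingProxyType
from typing import Mapping

# Hypothetical action-to-node table for this sketch.
ROUTES = {"search": "search", "ask_user": "ask_user", "calculate": "calculate"}


def redirect_node(state: Mapping) -> dict:
    """Node: rewrites the decision and returns it as a state update."""
    decision = dict(state["decision"])
    if decision["action"] == "ask_user" and decision.get("field") in state.get("web_fields", ()):
        decision["action"] = "search"  # the field can be found on the web instead
    return {"decision": decision}


def route_after_plan(state: Mapping) -> str:
    """Router: reads the already-corrected decision and returns a node name."""
    return ROUTES.get(state["decision"]["action"], "finish")


# The router works on a read-only view, which shows it performs no writes.
frozen = MappingProxyType({"decision": MappingProxyType({"action": "search"})})
next_node = route_after_plan(frozen)
```

With this split the redirect logic is tested by calling redirect_node directly, and the router needs only a table-driven test over action values.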
9. Qwen 3.7 Max: Verbose, _route Anti-Pattern
Nodes: check_termination → check_region_currency → compose_schema → check_calculator → acquisition_routing → check_completion → llm_decide → post_process → enforce_caps → route_decision
| Aspect | Assessment |
|---|---|
| Decomposition | ★★★☆☆ — 10 decision nodes is too many for the hot path. Every planner iteration traverses at least 7 of these nodes, adding checkpoint overhead. |
| Correctness | ★★☆☆☆ — Anti-pattern: Every node writes a _route transient state field. This contradicts LangGraph's design where state should be meaningful, not a routing scratchpad. Proper LangGraph routing uses conditional edge functions that read existing state fields, not a throwaway _route field written by the previous node. If _route is accidentally persisted or not cleared, it becomes a stale routing hint. |
| Testability | ★★☆☆☆ — The _route pattern means every test must set _route on input state, making tests verbose and fragile. |
| Naming | ★★☆☆☆ — check_termination, check_region_currency, check_calculator, check_completion — the check_ prefix is redundant. compose_schema and post_process are vague. |
| Weakness | The proposal includes complete pseudocode for every node, which is admirable for documentation but the pseudocode doesn't match the real implementation (e.g., _llm_failed handling is oversimplified). The _route field pattern is a design anti-pattern that should not be adopted. Also proposes ask_region/ask_currency as separate interrupt nodes (same issue as Kimi). |
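The stale-hint risk can be illustrated with a small sketch. Both routers and the state keys are hypothetical; the point is only that a router reading a leftover _route tag can disagree with a router that derives the next node from the fields that actually matter.

```python
def route_by_tag(state: dict) -> str:
    """Anti-pattern: trust whatever the previous node left in _route."""
    return state.get("_route", "compose_schema")


def route_by_state(state: dict) -> str:
    """Preferred: derive the route from meaningful state fields."""
    if not state.get("region"):
        return "ask_region"
    if not state.get("currency"):
        return "ask_currency"
    return "compose_schema"


# A resumed thread where the user has already answered, but the old
# routing tag was persisted and never cleared (assumed scenario).
resumed = {"region": "EU", "currency": "EUR", "_route": "ask_region"}

stale = route_by_tag(resumed)      # follows the leftover tag
derived = route_by_state(resumed)  # follows the actual state
```

Here the tag-based router would ask for the region again, while the state-based router moves on to schema composition.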
10. Gemini 3.1 Pro: Too Shallow
Nodes: prepare_state → evaluate_rules → llm_plan
| Aspect | Assessment |
|---|---|
| Decomposition | ★★☆☆☆ — Three nodes is insufficient. evaluate_rules (~100 LOC) combines calculator gates, acquisition logic, AND auto-finish into one node. llm_plan still has all post-LLM corrections. The graph topology barely changes from the current monolith. |
| Correctness | ★★★☆☆ — No line mapping. No explicit handling of post-LLM corrections. The proposal says "you can now test business rules natively as pure python functions" but doesn't specify how evaluate_rules handles the calculator-blocked → acquisition task path, which includes LLM calls for dynamic decomposition. |
| Testability | ★★★☆☆ — prepare_state and evaluate_rules are testable. But llm_plan still has 100+ LOC of correctors. |
| Weakness | The 3-node decomposition doesn't justify the migration risk. The graph diagram shows evaluate_rules routing to 5 destinations, but the routing logic is still hidden inside the node (same problem as current plan_node, just smaller). No post-LLM handling specification. |
11. Mimo 2.5 Pro: Minimal but Insufficient
Nodes: guards → acquire → decide
| Aspect | Assessment |
|---|---|
| Decomposition | ★★☆☆☆ — Three nodes. guards (~40 LOC) is fine. acquire (~60 LOC) is fine. But decide (~80 LOC) still has all correctors, all cap enforcement, and _adjust_calculation_decision crammed in. The stated benefit says decide is "the most complex node," which means the decomposition hasn't actually simplified the hardest part. |
| Correctness | ★★★☆☆ — The line estimates (40/60/80 LOC) don't add up to 350. The acquire node handles decomposition, composition, calculator gates, AND acquisition task selection — that's not ~60 lines. The decide node handles LLM + 5 correctors + cap enforcement + logging, which is well over 80 lines. |
| Testability | ★★★☆☆ — guards and acquire are testable. decide still requires LLM mocking for full testing. |
| Weakness | The proposal has the best summary table (Before/After metrics) but the numbers don't match the actual code complexity. The open questions at the end reveal the key uncertainty: should decide include cap enforcement or should that be a separate node? This is the question that Fable 5 answers definitively with decision_origin. |
Recommendation
Primary: Adopt Fable 5's Architecture
Target topology:
START → tick → prepare → select → decide → guard → [action nodes]
tick → (early exit) → finish
[action nodes] → tick (loop back)
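The target topology can be written down as a plain adjacency map and checked mechanically. This is a sketch, not LangGraph wiring: the "action" key stands for all action nodes, and the select → guard edge (deterministic decisions skipping decide) is an assumption inferred from guard handling both decision origins. Bootstrap questions raised in tick are left out.

```python
# Adjacency map for the proposed planner topology (sketch).
EDGES: dict[str, list[str]] = {
    "START": ["tick"],
    "tick": ["prepare", "finish"],   # finish = early exit
    "prepare": ["select"],
    "select": ["decide", "guard"],   # assumed: deterministic decisions skip decide
    "decide": ["guard"],
    "guard": ["action"],
    "action": ["tick"],              # action nodes loop back to tick
    "finish": [],
}


def iteration_path(early_exit: bool = False, deterministic: bool = False) -> list[str]:
    """Nodes visited in one planner iteration, starting at tick."""
    path = ["tick"]
    if early_exit:
        return path + ["finish"]
    path += ["prepare", "select"]
    if not deterministic:
        path.append("decide")
    return path + ["guard", "action"]


def is_valid(path: list[str]) -> bool:
    """True if every consecutive pair in the path is an edge in EDGES."""
    return all(dst in EDGES[src] for src, dst in zip(path, path[1:]))
```

Under these assumptions the worst case is five planner hops per iteration (tick, prepare, select, decide, guard) before an action node runs.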
Why Fable 5:
- decision_origin is the key insight. No other proposal cleanly solves the problem of "which correctors apply to which decisions." The "deterministic" vs. "llm" tag lets guard apply _adjust_calculation_decision to both paths (matching current behavior) but search/ask caps only to LLM decisions (also matching current behavior). This is precise, minimal, and correct.
- Five nodes is the right granularity. Four or fewer means one node still has >100 LOC. Ten or more means excessive checkpoint overhead per iteration. Five hits the sweet spot where each node has a clear single responsibility and the graph diagram is readable.
- Migration is safe. Phase 1 (extract helpers → keep plan_node as thin wrapper) is risk-free. Phase 2 (rewire graph) is a single breaking change to the graph topology. Phase 3 (docs update) is mechanical.
- select consolidates all deterministic routing. Calculator gates, acquisition tasks, and auto-finish, all the "we know what to do without LLM" logic, live in one place. This is the most testable unit: no LLM mocking is needed once the blocked-field decomposition call is extracted.
- guard is a single pass-through for correctors. Instead of the six separate corrector nodes in the Opus proposal, guard applies correctors sequentially based on decision_origin. This is one function call per corrector, one graph hop, and one checkpoint write rather than six.
Secondary: Incorporate GPT 5.4's Naming and Migration Strategy
Adopt GPT 5.4's naming convention where it is clearer than Fable 5's:
| Fable 5 | GPT 5.4 equivalent | Recommended name |
|---|---|---|
| tick | tick | tick |
| prepare | prepare_context | prepare |
| select | acquisition_gate | select (more accurate: not just acquisition) |
| decide | llm_plan | decide (shorter, clearer) |
| guard | decision_policy | guard (shorter, established term) |
Adopt GPT 5.4's staged migration: extract one node at a time, starting with tick, then prepare, then rewire.
Adoption Strategy
Phase 0: Extract pure helper functions from plan_node without changing graph topology (all proposals agree on this). This is risk-free and should land first.
Phase 1: Add tick node with START → tick → plan edge. All other nodes still route through plan. Verify tests pass.
Phase 2: Add prepare node. Route tick → prepare → plan. Verify tests pass.
Phase 3: Add select node between prepare and plan. Add decision_origin state field. Verify tests pass.
Phase 4: Extract guard from plan_node post-LLM corrections. Rewire: select → plan → guard where plan becomes decide. Verify tests pass.
Phase 5: Bump planner thread namespace to v2. Remove old plan_node. Update docs.
Each phase is a separate PR with its own test updates. The only new state fields are Fable 5's two: decision_origin (added in Phase 3) and llm_failed.
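Phase 0 can be pictured as plan_node shrinking to a thin wrapper over extracted helpers. The sketch below is illustrative only: the three helper names, the state keys, and the stubbed planner call are assumptions, and the real function has many more exit paths. Each helper returns a state update when it decides and None when it defers, so the wrapper keeps the original early-return order.

```python
from typing import Callable, Optional

Step = Callable[[dict], Optional[dict]]


def check_iteration_cap(state: dict) -> Optional[dict]:
    """Extracted guard: finish once the (assumed) iteration cap is hit."""
    if state.get("iterations", 0) >= state.get("max_iterations", 10):
        return {"decision": {"action": "finish", "reason": "iteration cap"}}
    return None


def check_region(state: dict) -> Optional[dict]:
    """Extracted guard: ask for the region if it is missing."""
    if not state.get("region"):
        return {"decision": {"action": "ask_user", "field": "region"}}
    return None


def call_planner(state: dict) -> Optional[dict]:
    """Stub for the LLM planner call; always decides in this sketch."""
    return {"decision": {"action": "search"}}


# Order matters: it mirrors the early-return order of the monolith.
STEPS: tuple[Step, ...] = (check_iteration_cap, check_region, call_planner)


def plan_node(state: dict) -> dict:
    """Thin wrapper: the first helper that returns an update wins."""
    for step in STEPS:
        update = step(state)
        if update is not None:
            return update
    raise RuntimeError("no step produced a decision")
```

Since the graph still sees a single plan_node, this stage changes no topology and no checkpoint format; later phases can promote individual helpers to nodes one at a time.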
What NOT to Adopt
- Separate region/currency nodes (Qwen 3.6, Kimi 2.6, Opus, GPT 5.5): Four new nodes for two simple conditionals. Keeping the inline check in tick preserves current behavior and is simpler and less error-prone.
- _route transient state field (Qwen 3.7 Max): Anti-pattern. Routing functions should read existing state, not a throwaway routing tag.
- Separate corrector nodes (Opus's c_* chain): Six extra graph hops per iteration for deterministic functions that run in microseconds. Consolidate into guard.
- route_finish_check as a node (Qwen 3.6): Auto-finish logic belongs in select as a deterministic routing decision, not a separate node between plan and finish.
- PreCheckResult dataclass for routing (GLM 5.1): Overengineered. Use decision_origin + decision.action as the routing signals, not a custom result type.
Summary Table
| Rank | Proposal | Nodes | Best At | Weakest At | Verdict |
|---|---|---|---|---|---|
| 1 | Fable 5 | 5 | Correctness, decomposition, decision_origin insight | select has possible LLM call (blocked-field decomposition) | Adopt |
| 2 | DeepSeek 4 Pro | 4 | Migration plan, line precision | prepare mixes calculator gates with schema composition | Strong runner-up |
| 3 | GPT 5.4 | 6 | Naming, anti-patterns, migration staging | Six hops per iteration | Adopt naming + migration strategy |
| 4 | Opus | 10+ | Analysis depth, line mapping | Too many nodes, 6 corrector hops | Use as reference, not as architecture |
| 5 | GLM 5.1 | 2 | Lowest risk, conservatism | Doesn't solve the problem enough | Starting approach only |
| 6 | GPT 5.5 | 9 | Thoroughness, state design | Over-granular bootstrap split | Some ideas worth borrowing |
| 7 | Kimi 2.6 | 9 | Decision matrix | route_direct too small for node, separate interrupt nodes | Partially useful |
| 8 | Qwen 3.6 | 6 | decompose isolation, route_finish_check | Business logic in routing function | Routing anti-pattern |
| 9 | Qwen 3.7 | 10 | Pseudocode completeness | _route anti-pattern, too many check_ nodes | Do not adopt |
| 10 | Gemini 3.1 | 3 | Simplicity | No post-LLM handling, too shallow | Do not adopt |
| 11 | Mimo 2.5 | 3 | Summary table | LOC estimates wrong, decide still too complex | Do not adopt |