
Planner Graph Refactor Proposals — Evaluation & Ranking

Evaluator: mimo-2.5-pro (Sisyphus)
Date: 2026-06-11
Scope: 11 proposals in docs/planner-graph-ref/proposals/
Reference: current-graph.md, src/venturescope/planner/agent.py:846-1192


Evaluation Criteria

Each proposal is assessed on:

| Criterion | Weight | What it measures |
| --- | --- | --- |
| Problem accuracy | 15% | Does the proposal correctly identify all 10+ responsibilities in plan_node? |
| Decomposition quality | 25% | Are node boundaries drawn at the right seams? Single-responsibility? |
| Graph honesty | 20% | Does the resulting Mermaid diagram actually represent the control flow? |
| Migration safety | 15% | Can this ship incrementally without breaking existing planner behavior? |
| Test seam quality | 10% | Can each new node be unit-tested in isolation? |
| LangGraph alignment | 10% | Does it follow LangGraph idioms (nodes do work, edges do routing)? |
| Practical completeness | 5% | Are state changes, risks, and edge cases addressed? |
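The evaluation does not state how each overall score is derived from the per-criterion scores. As a sanity check, the sketch below combines the weights above into a plain weighted sum; the names WEIGHTS and weighted_score are illustrative only. For fable-5 this gives roughly 9.28 against the reported 9.2, so the reported totals appear to be rounded or adjusted by judgment rather than computed exactly this way.

```python
# Weights from the criteria table, expressed as fractions of 1.0.
WEIGHTS = {
    "problem_accuracy": 0.15,
    "decomposition_quality": 0.25,
    "graph_honesty": 0.20,
    "migration_safety": 0.15,
    "test_seam_quality": 0.10,
    "langgraph_alignment": 0.10,
    "practical_completeness": 0.05,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (0-10) into one weighted total."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Per-criterion scores for fable-5, copied from its table in the ranked list.
fable_5 = {
    "problem_accuracy": 9,
    "decomposition_quality": 9.5,
    "graph_honesty": 9.5,
    "migration_safety": 9,
    "test_seam_quality": 9.5,
    "langgraph_alignment": 9,
    "practical_completeness": 9,
}

total = weighted_score(fable_5)
print(round(total, 3))
```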

Ranked List

1. fable-5 — "Decompose plan_node into Graph-Level Stages"

Score: 9.2/10

| Criterion | Score | Notes |
| --- | --- | --- |
| Problem accuracy | 9 | Maps all 9 responsibility blocks with line ranges |
| Decomposition quality | 9.5 | tick / prepare / select / decide / guard: clean, well-bounded |
| Graph honesty | 9.5 | Mermaid diagram shows real control flow including deterministic bypass |
| Migration safety | 9 | Two-phase: extract without rewiring, then rewire. Thread namespace bump for in-flight checkpoints. |
| Test seam quality | 9.5 | decision_origin field cleanly separates deterministic vs LLM paths for guard |
| LangGraph alignment | 9 | Nodes own mutations, edges own routing, tick as loop entry |
| Practical completeness | 9 | State changes documented, checkpoint compatibility addressed |

What's good:
- The tick → prepare → select → decide → guard pipeline is the most natural decomposition. Each node has exactly one job.
- decision_origin: Literal["deterministic", "llm"] is elegant: it tells guard which rewrite subset to apply without guard having to reconstruct how the decision was produced.
- The select node correctly captures the "deterministic decision ladder" as a distinct phase before the LLM is ever called.
- Checkpoint compatibility is explicitly addressed with thread namespace bumping.
- The "Follow-up" section honestly defers the decompose loop node and Command(goto=...) as out of scope.

What's bad:
- select still has one LLM call (decomposition for a blocked field without a recipe, lines 960-968). This slightly violates the "deterministic" labeling. The proposal acknowledges this.
- The guard node's decision_origin-based branching adds a small amount of complexity that needs documentation.
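The decision_origin branching in guard can be pictured as a lookup from origin to a list of rewrites. The following is a minimal sketch under stated assumptions, not the proposal's code: the two rewrites (redirect_to_web, cap_ask_user), the counters, the web_preferred flag, and the cap value are all hypothetical stand-ins for whatever rewrites the real guard applies.

```python
from typing import Callable, Literal, TypedDict


class Decision(TypedDict):
    action: str  # e.g. "search", "ask_user", "finish"


class PlannerState(TypedDict):
    decision: Decision
    decision_origin: Literal["deterministic", "llm"]
    ask_user_count: int  # hypothetical counter
    web_preferred: bool  # hypothetical flag


Rewrite = Callable[[PlannerState, Decision], Decision]

MAX_ASK_USER = 2  # hypothetical cap, not taken from the codebase


def redirect_to_web(state: PlannerState, decision: Decision) -> Decision:
    """Hypothetical LLM-only rewrite: prefer a web search over asking the user."""
    if decision["action"] == "ask_user" and state["web_preferred"]:
        return {"action": "search"}
    return decision


def cap_ask_user(state: PlannerState, decision: Decision) -> Decision:
    """Hypothetical shared rewrite: stop asking once the cap is reached."""
    if decision["action"] == "ask_user" and state["ask_user_count"] >= MAX_ASK_USER:
        return {"action": "finish"}
    return decision


# Deterministic decisions skip the LLM-specific rewrites.
REWRITES_BY_ORIGIN: dict[str, list[Rewrite]] = {
    "deterministic": [cap_ask_user],
    "llm": [redirect_to_web, cap_ask_user],
}


def guard(state: PlannerState) -> dict:
    """Node: apply the rewrite subset for this origin, return a state update."""
    decision = state["decision"]
    for rewrite in REWRITES_BY_ORIGIN[state["decision_origin"]]:
        decision = rewrite(state, decision)
    return {"decision": decision}
```

Keeping the mapping in one table also gives the documentation burden noted above a natural home: the table itself lists which rewrites apply to which origin.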


2. gpt-5.5 — "Move Planner Control Logic to Graph Level"

Score: 9.0/10

| Criterion | Score | Notes |
| --- | --- | --- |
| Problem accuracy | 9 | Lists all 9 responsibility categories |
| Decomposition quality | 9 | enter_iteration / prepare_schema / require_region / require_currency / acquisition_gate / plan / normalize_decision / retry_gate / maybe_calculate |
| Graph honesty | 9 | Mermaid is detailed and accurate |
| Migration safety | 8.5 | 4-step staged migration, but many nodes to wire at once |
| Test seam quality | 9.5 | Each policy node is independently testable |
| LangGraph alignment | 9 | Clean separation of concerns |
| Practical completeness | 9 | State changes, test strategy, non-goals all documented |

What's good:
- Splitting normalize_decision, retry_gate, and maybe_calculate into separate nodes is the most granular correct decomposition. Each post-LLM policy is independently testable.
- require_region and require_currency as dedicated nodes (not just ask_user) make the bootstrap flow explicit in the graph.
- The "Non-goals" section is precise: no outer graph changes, no SQL persistence changes.
- The need for an llm_failed or decision_origin state field for calculator loop protection is correctly identified.

What's bad:
- 9 new nodes is a lot. The graph becomes visually complex even though behavior is the same. For a team that reads Mermaid diagrams, this may be harder to follow than a 5-node pipeline.
- The migration plan ("Step 1: extract helpers, Step 2: add graph nodes one group at a time") is good but doesn't address checkpoint compatibility for in-flight conversations.
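The checkpoint gap noted above could be closed with the thread namespace bump credited to fable-5. The sketch below shows one plausible reading of that idea, assuming checkpoints are keyed by the thread_id passed in LangGraph's run config. GRAPH_VERSION, thread_config, and the prefix format are hypothetical and not taken from either proposal.

```python
# Hypothetical version tag; bump it whenever the graph topology changes so
# that old checkpoints are never resumed against the new node layout.
GRAPH_VERSION = "planner-v2"


def thread_config(conversation_id: str, graph_version: str = GRAPH_VERSION) -> dict:
    """Build a run config whose thread_id is namespaced by graph version."""
    return {"configurable": {"thread_id": f"{graph_version}:{conversation_id}"}}


# The same conversation maps to different threads before and after the bump.
old = thread_config("conv-42", graph_version="planner-v1")
new = thread_config("conv-42")
print(old["configurable"]["thread_id"], new["configurable"]["thread_id"])
```

The trade-off is that in-flight conversations restart from a fresh thread rather than resuming mid-graph, which is usually acceptable for a planner loop but should be confirmed against product expectations.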


3. gpt-5.4 — "Planner Graph 5.4 Proposal"

Score: 8.8/10

| Criterion | Score | Notes |
| --- | --- | --- |
| Problem accuracy | 8.5 | Lists 7 responsibility categories (slightly less granular) |
| Decomposition quality | 9 | tick / bootstrap_gate / prepare_context / acquisition_gate / llm_plan / decision_policy |
| Graph honesty | 9 | Clean Mermaid with 6 pipeline stages |
| Migration safety | 9 | 6-stage migration, each stage is a standalone commit |
| Test seam quality | 8.5 | Good, but decision_policy bundles multiple rewrites |
| LangGraph alignment | 9 | Explicitly warns against state mutation in edge functions |
| Practical completeness | 8.5 | Anti-patterns section is excellent |

What's good:
- The "anti-patterns to avoid" section is unique and valuable: "Do not move state mutation into edge functions", "Do not increment iterations in every gate", "Do not turn every policy rule into a node", "Do not move interrupts into policy gates".
- The 6-stage migration plan is the most detailed and safest.
- tick as the sole owner of the iterations increment is correctly called out as critical.
- "deterministic orchestration belongs in graph phases; business logic stays in helpers; LLM planning stays small" is the clearest design principle statement.

What's bad:
- decision_policy bundles derived-field redirect, web-preferred redirect, search cap, ask-user cap, and calc adjustment into one node. This is a pragmatic choice, but it means the node is still ~80 lines with multiple concerns.
- bootstrap_gate as a single node that routes to ask_user for either region or currency loses some explicitness compared to separate require_region/require_currency nodes.
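Two of the anti-patterns quoted above (no state mutation in edge functions, only tick increments iterations) can be illustrated with a pair of plain functions. This is a minimal sketch assuming LangGraph's usual convention that a node returns a partial state update and a conditional-edge function returns the name of the next node. MAX_ITERATIONS and the route_after_tick name are hypothetical.

```python
from typing import TypedDict

MAX_ITERATIONS = 10  # hypothetical cap, not taken from the codebase


class PlannerState(TypedDict):
    iterations: int
    aborted: bool


def tick(state: PlannerState) -> dict:
    """Node: the only place where the iteration counter is advanced."""
    return {"iterations": state["iterations"] + 1}


def route_after_tick(state: PlannerState) -> str:
    """Edge function: reads state and returns a node name, writes nothing."""
    if state["aborted"] or state["iterations"] > MAX_ITERATIONS:
        return "finish"
    return "bootstrap_gate"
```

Because the edge function is pure, calling it twice on the same state gives the same answer, which keeps checkpoint replay and unit tests straightforward.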


4. deepseek-4-pro — "Decompose plan_node into Graph-Level Logic"

Score: 8.5/10

| Criterion | Score | Notes |
| --- | --- | --- |
| Problem accuracy | 9 | Lists all 12 responsibility items with line references |
| Decomposition quality | 8.5 | guard / prepare / plan / adjust (4 nodes) |
| Graph honesty | 8.5 | Good Mermaid, but prepare routing is complex |
| Migration safety | 9 | 4-phase migration, Phase 1 is extraction without graph changes |
| Test seam quality | 8 | Good, but prepare has 5 exit paths |
| LangGraph alignment | 8.5 | Mostly good, but plan self-loop for decomposition is unusual |
| Practical completeness | 8.5 | Risk assessment is honest |

What's good:
- The guard → prepare → plan → adjust pipeline is the simplest 4-node decomposition that captures the essential phases.
- recipes: dict[str, FieldAcquisition] in state is a good optimization that avoids recomputing in multiple nodes.
- The migration path is clear: Phase 1 (extract helpers), Phase 2 (rewire graph), Phase 3 (cleanup).
- Risk assessment is honest about recipes synchronization concerns.

What's bad:
- prepare has 5 exit conditions (calculator cap, calculator success, blocked→acquire, auto-complete, pass-through). This is more routing complexity than ideal for a "preparation" node.
- plan self-looping for decomposition generation is architecturally unusual: a node looping to itself with a conditional edge feels like a workaround.
- adjust at ~80 lines is still substantial for a "correction" node.


5. opus — "Hoist plan_node Policy Into Graph Edges"

Score: 8.3/10

| Criterion | Score | Notes |
| --- | --- | --- |
| Problem accuracy | 9.5 | Most detailed responsibility table (17 items with line ranges) |
| Decomposition quality | 8 | g_iter_cap / g_region / g_currency / enrich_schema / g_calc_caps / acquire / plan_llm / 6 c_* correctors / dispatch |
| Graph honesty | 8.5 | Very detailed Mermaid with naming conventions |
| Migration safety | 8 | 3-step migration, but Step 2 is "hoist gates into graph", which is a big change |
| Test seam quality | 8.5 | Each corrector is independently testable |
| LangGraph alignment | 7.5 | 10+ nodes with add_edge chains for correctors, an unusual pattern |
| Practical completeness | 8.5 | State surface changes, routing rules documented |

What's good:
- The g_* (gate) / enrich / acquire / plan_llm / c_* (corrector) / dispatch naming convention is the clearest taxonomy.
- The "one invariant ties everything together" section is excellent: every path deposits a PlannerDecision, and dispatch reads decision.action.
- The 17-item responsibility mapping is the most thorough problem analysis.
- The state surface changes table (which node writes which fields) is unique and valuable.

What's bad:
- 10+ graph nodes is too many. The c_* corrector chain as 6 separate nodes with add_edge (not add_conditional_edges) creates a long linear chain that adds checkpoint writes without adding routing intelligence.
- The g_region → emit_region split (predicate vs emitter) is over-engineering for a 2-line check.
- Migration Step 2 ("hoist gates into graph") is a large change: multiple new nodes and edges at once.
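One way to keep the correctors independently testable without paying for a node per corrector is to compose them inside a single node. The sketch below is an illustration under stated assumptions, not the proposal's design: the two correctors, their rules, and the caps are hypothetical, and only the c_ prefix follows the proposal's naming convention.

```python
from functools import reduce
from typing import Callable

State = dict
Decision = dict
Corrector = Callable[[State, Decision], Decision]

MAX_SEARCHES = 3   # hypothetical cap
MAX_ASK_USER = 2   # hypothetical cap


def c_search_cap(state: State, decision: Decision) -> Decision:
    """Hypothetical corrector: past the search cap, ask the user instead."""
    if decision["action"] == "search" and state["search_count"] >= MAX_SEARCHES:
        return {**decision, "action": "ask_user"}
    return decision


def c_ask_user_cap(state: State, decision: Decision) -> Decision:
    """Hypothetical corrector: past the ask-user cap, finish."""
    if decision["action"] == "ask_user" and state["ask_user_count"] >= MAX_ASK_USER:
        return {**decision, "action": "finish"}
    return decision


# Order matters: a search rewritten to ask_user can then hit the ask-user cap.
CORRECTORS: list[Corrector] = [c_search_cap, c_ask_user_cap]


def correct_decision(state: State) -> dict:
    """Single node: run every corrector in order, emit one state update."""
    decision = reduce(lambda d, c: c(state, d), CORRECTORS, state["decision"])
    return {"decision": decision}
```

Each corrector can still be unit-tested as a pure function, while the graph records one state update for the whole chain instead of one per corrector.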


6. kimi-2.6 — "Decompose plan_node into Graph-Level Pipeline"

Score: 8.1/10

| Criterion | Score | Notes |
| --- | --- | --- |
| Problem accuracy | 8.5 | Lists 9 responsibility categories |
| Decomposition quality | 8 | tick / ask_region / ask_currency / prepare / calc_gate / acquire / route_direct / decide / enforce_policy |
| Graph honesty | 8 | Good Mermaid with clear node labels |
| Migration safety | 7.5 | 3-step migration but lacks checkpoint compatibility details |
| Test seam quality | 8 | Each node is testable |
| LangGraph alignment | 8 | Good separation |
| Practical completeness | 7.5 | Decision matrix appendix is helpful but risks section is thin |

What's good:
- ask_region / ask_currency as dedicated nodes (not generic ask_user) is correct: bootstrap questions are structural, not LLM decisions.
- The decision matrix appendix (Logic Block → Current Location → Proposed Location → Reason) is excellent for implementation.
- route_direct as a tiny adapter node bridging deterministic acquisition into the post-processing pipeline is a clean pattern.

What's bad:
- enforce_policy bundles 7 different rewrites into one node. The proposal acknowledges this ("still the most complex node") but doesn't offer a clear path to further decomposition.
- The migration plan is thin: "Extract helper functions" → "Register as nodes" → "Delete plan_node" is too high-level.
- No discussion of checkpoint compatibility or in-flight conversation handling.


7. mimo-2.5-pro — "Refactor plan_node — Move Routing Logic to Graph Level"

Score: 7.8/10

| Criterion | Score | Notes |
| --- | --- | --- |
| Problem accuracy | 8 | Lists 10 responsibility categories |
| Decomposition quality | 7.5 | guards / acquire / decide (3 nodes) |
| Graph honesty | 7.5 | Simple Mermaid, but decide bundles too much |
| Migration safety | 8 | 3-phase migration with risk assessment |
| Test seam quality | 7.5 | decide node is still ~80 lines with multiple concerns |
| LangGraph alignment | 7.5 | Mostly good, but plan_router as a routing function (not a node) is inconsistent |
| Practical completeness | 7.5 | Open questions section is honest |

What's good:
- The 3-node decomposition (guards → acquire → decide) is the simplest proposal. Easy to understand and implement.
- Phase 1 (extract without changing topology) is the safest migration approach.
- The "Expected Benefits" table is honest about LOC counts.

What's bad:
- decide bundles LLM call + redirectors + loop/cap detection. This is still a complex node (~80 lines) with multiple concerns.
- plan_router as a routing function (not a node) is inconsistent with the rest of the graph; it is a function that decide calls, not a graph-level construct.
- Open questions (should decide include loop detection? should plan_router be a node?) suggest the design isn't fully resolved.


8. gemini-3.1-pro — "Refactoring Planner Logic to the Graph Level"

Score: 7.5/10

| Criterion | Score | Notes |
| --- | --- | --- |
| Problem accuracy | 7.5 | Lists 8 responsibility items (less granular) |
| Decomposition quality | 7 | prepare_state / evaluate_rules / llm_plan (3 nodes) |
| Graph honesty | 7 | Simple Mermaid, but evaluate_rules bundles too much |
| Migration safety | 7.5 | 3-step implementation plan |
| Test seam quality | 7 | evaluate_rules is still complex |
| LangGraph alignment | 7.5 | Good separation of concerns |
| Practical completeness | 6.5 | Benefits section is thin, no risks section |

What's good:
- The 3-node split (prepare_state / evaluate_rules / llm_plan) is the simplest decomposition that captures the essential separation.
- "Reduced Latency/Cost Risk: The LLM node becomes completely isolated" is a good insight.
- The implementation steps are concrete and actionable.

What's bad:
- evaluate_rules bundles 7 different checks (max_iters, aborted, region, currency, calculator, acquisition, auto-finish) into one node. This is still a complex function.
- The "route_after_rules" edge function handles both deterministic routing and the "needs_llm" flag, which mixes concerns.
- No discussion of state changes, checkpoint compatibility, or risks.
- The Mermaid diagram uses route_after_rules and route_after_llm as diamond nodes, but in LangGraph these are edge functions, not nodes. This is slightly misleading.


9. glm-5.1 — "Decompose plan Node into Graph-Level Routing"

Score: 7.3/10

| Criterion | Score | Notes |
| --- | --- | --- |
| Problem accuracy | 7.5 | Lists 6 responsibility categories |
| Decomposition quality | 7 | pre_check / acquire_or_plan (2 nodes) |
| Graph honesty | 7 | Simple Mermaid, but acquire_or_plan is still complex |
| Migration safety | 7.5 | 4-phase migration with optional Phase 4 |
| Test seam quality | 7 | pre_check is testable, acquire_or_plan is still ~200 LOC |
| LangGraph alignment | 7 | Good separation |
| Practical completeness | 7 | Risks section is present but thin |

What's good:
- The pre_check extraction is the minimal valuable change: it pulls out all no-LLM guards into a testable node.
- Phase 4 (optional validate_decision node) is honest about what's deferred.
- The _build_plan_output helper idea is practical, collapsing 9 conditional-return patterns.

What's bad:
- acquire_or_plan at ~200 LOC is still a substantial node with 7 sequential steps. The proposal acknowledges this but doesn't offer a clear path to further decomposition.
- Only 2 new nodes makes this the least ambitious decomposition. It solves the "guard interleaving" problem but leaves the "post-LLM rewriting" problem inside acquire_or_plan.
- The _pre_check_route state field uses a string literal instead of a typed enum, which is less safe.
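The typed-enum concern in the last point can be made concrete. The sketch below assumes the field stays in state under the name _pre_check_route; the PreCheckRoute members and their values are hypothetical, since the evaluation does not list the routes pre_check can emit.

```python
from enum import Enum


class PreCheckRoute(str, Enum):
    """Hypothetical set of routes that pre_check may emit."""
    ASK_USER = "ask_user"
    FINISH = "finish"
    CONTINUE = "acquire_or_plan"


def route_after_pre_check(state: dict) -> str:
    """Edge function: validate the stored route, then return the node name.

    PreCheckRoute(...) raises ValueError for any value that is not a
    declared member, so a typo fails loudly instead of silently misrouting.
    """
    return PreCheckRoute(state["_pre_check_route"]).value
```

Because the enum also subclasses str, its members serialize as ordinary strings, so checkpointed state should stay readable while the set of legal routes is declared in one place.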


10. qwen-3.6-plus — "Decompose plan_node into Graph-Level Routing"

Score: 7.0/10

| Criterion | Score | Notes |
| --- | --- | --- |
| Problem accuracy | 7.5 | Lists 12 responsibility items |
| Decomposition quality | 6.5 | preflight / decompose / plan + route_after_plan with redirects + route_finish_check |
| Graph honesty | 7 | Good Mermaid, but route_after_plan as edge function doing redirects is unusual |
| Migration safety | 7 | 5-phase migration |
| Test seam quality | 6.5 | route_after_plan as an edge function doing redirects is hard to test |
| LangGraph alignment | 6 | Edge functions doing state mutation (redirects) violates LangGraph idioms |
| Practical completeness | 7 | Open questions section is thoughtful |

What's good:
- ask_region / ask_currency with dedicated observe_region / observe_currency nodes is the most explicit bootstrap handling.
- route_finish_check as a validation node before END is a unique insight: it catches cases where finish is premature.
- The 5-phase migration plan is detailed.

What's bad:
- Moving post-LLM redirects into route_after_plan (an edge function) is an anti-pattern in LangGraph. Edge functions should be pure routing decisions, not state mutations. This makes checkpoint behavior harder to reason about.
- route_after_plan doing 6 different redirect checks is complex for an edge function; it should be a node.
- The proposal creates observe_region / observe_currency as separate nodes, which adds graph complexity without clear benefit (they share most logic with observe_user).
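The fix suggested in the first two points is to let a node perform the redirect and leave the edge function as a pure read. A minimal sketch, assuming one hypothetical redirect rule (an ask_user for a derived field becomes a calculate action); apply_redirects, derived_fields, and the field key are illustrative names, not taken from the proposal.

```python
def apply_redirects(state: dict) -> dict:
    """Node: rewrite the decision and return it as a state update.

    Hypothetical rule: a derived field should be calculated, not asked for.
    The incoming state is never mutated in place.
    """
    decision = state["decision"]
    if decision["action"] == "ask_user" and decision.get("field") in state["derived_fields"]:
        decision = {**decision, "action": "calculate"}
    return {"decision": decision}


def route_after_plan(state: dict) -> str:
    """Edge function: only reads the already-corrected decision."""
    return state["decision"]["action"]


state = {
    "decision": {"action": "ask_user", "field": "gross_margin"},
    "derived_fields": {"gross_margin"},
}
update = apply_redirects(state)
next_node = route_after_plan({**state, **update})
print(next_node)
```

With this split the redirect is visible in the recorded state update, and the routing step can be tested with a one-line assertion on decision.action.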


11. qwen-3.7-max — "Decomposing the plan Node"

Score: 6.5/10

| Criterion | Score | Notes |
| --- | --- | --- |
| Problem accuracy | 7 | Lists 11 responsibility items |
| Decomposition quality | 6 | 10 nodes including check_termination, check_region_currency, compose_schema, check_calculator, acquisition_routing, check_completion, llm_decide, post_process, enforce_caps, route_decision |
| Graph honesty | 6.5 | Detailed Mermaid but overly granular |
| Migration safety | 6 | 5-phase migration, but 10 nodes is a big change |
| Test seam quality | 6.5 | Each node is testable, but graph complexity is high |
| LangGraph alignment | 5.5 | _route state field as transient routing hint is an anti-pattern |
| Practical completeness | 6.5 | State schema changes documented, but _route field approach is problematic |

What's good:
- The most granular decomposition: every check is its own node.
- Code examples for each node are complete and runnable.
- The _route state field approach is at least explicit about what it's doing.

What's bad:
- 10 new nodes is too many. The graph becomes harder to read than the original monolith. The Mermaid diagram has 20+ edges.
- Using _route: NotRequired[str] as a transient routing hint stored in state is an anti-pattern. LangGraph conditional edges should read existing state and return a string; they shouldn't need a special routing field. This adds state mutation that the checkpointer sees.
- check_termination and check_region_currency as separate nodes add a hop for what could be a single preflight node.
- route_decision as a node that just reads decision.action is redundant, since this is what route_after_plan already does as an edge function.


Summary Table

| Rank | Proposal | Author | Nodes | Score | Key Strength | Key Weakness |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | fable-5 | fable-5 | 5 new | 9.2 | decision_origin for deterministic vs LLM separation | select still has one LLM call |
| 2 | gpt-5.5 | gpt-5.5 | 9 new | 9.0 | Most granular correct decomposition | Graph visually complex |
| 3 | gpt-5.4 | gpt-5.4 | 6 new | 8.8 | Best anti-patterns section, safest migration | decision_policy bundles multiple rewrites |
| 4 | deepseek-4-pro | deepseek-v4-pro | 4 new | 8.5 | Simplest pipeline that captures essential phases | plan self-loop for decomposition |
| 5 | opus | opus-4.7 | 10+ new | 8.3 | Best naming convention and responsibility mapping | Too many nodes, corrector chain overhead |
| 6 | kimi-2.6 | kimi-k2.6 | 8 new | 8.1 | Decision matrix appendix | enforce_policy still complex |
| 7 | mimo-2.5-pro | mimo-2.5-pro | 3 new | 7.8 | Simplest proposal, easiest to understand | decide bundles too much |
| 8 | gemini-3.1-pro | gemini-3.1-pro | 3 new | 7.5 | Simplest viable decomposition | evaluate_rules still complex, thin docs |
| 9 | glm-5.1 | glm-5.1 | 2 new | 7.3 | Minimal valuable change | acquire_or_plan still ~200 LOC |
| 10 | qwen-3.6-plus | qwen-3.6-plus | 5 new | 7.0 | route_finish_check validation node | Edge functions doing state mutation |
| 11 | qwen-3.7-max | qwen-3.7-max | 10 new | 6.5 | Most granular | Too many nodes, _route anti-pattern |

Recommendation

Best single proposal: fable-5. It has the best balance of decomposition quality, migration safety, and LangGraph alignment. The decision_origin field is an elegant solution for separating deterministic and LLM paths.

Best combination: fable-5 core + gpt-5.4 anti-patterns + gpt-5.5 post-LLM granularity.

Concretely:

  1. Adopt fable-5's 5-node pipeline (tick → prepare → select → decide → guard) as the core architecture.
  2. Adopt gpt-5.4's anti-patterns section as implementation constraints (no state mutation in edges, only tick increments iterations, don't turn every rule into a node, keep ask_user as sole interrupt).
  3. Consider splitting guard into normalize_decision + retry_gate + maybe_calculate (from gpt-5.5) if post-LLM policy testing proves difficult after the initial refactor. This can be a follow-up PR.
  4. Adopt gpt-5.4's 6-stage migration plan for safety (each stage is a standalone commit).

The combined approach gives:
- 5 nodes in the initial refactor (fable-5)
- Clear design constraints (gpt-5.4 anti-patterns)
- Optional further decomposition path (gpt-5.5 post-LLM nodes)
- Safest possible migration (gpt-5.4 staged approach)
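To make the recommended control flow concrete, the sketch below wires the five stages together with a tiny plain-Python runner. It is a stand-in for the graph wiring, not LangGraph API and not fable-5's code: the node bodies, the region check in select, and the hard-coded decision in decide are hypothetical placeholders. What it does show is the deterministic bypass, where select can hand a decision straight to guard without visiting decide.

```python
from typing import Callable

State = dict


def tick(s: State) -> dict:
    return {"iterations": s["iterations"] + 1}


def prepare(s: State) -> dict:
    return {"schema_ready": True}  # placeholder for schema preparation


def select(s: State) -> dict:
    # Hypothetical deterministic rule: without a region, ask the user.
    if s.get("region") is None:
        return {"decision": {"action": "ask_user"}, "decision_origin": "deterministic"}
    return {"decision": None}


def decide(s: State) -> dict:
    # Stand-in for the LLM call; a real node would invoke the model here.
    return {"decision": {"action": "finish"}, "decision_origin": "llm"}


def guard(s: State) -> dict:
    return {}  # placeholder: rewrites keyed on decision_origin would go here


NODES: dict[str, Callable[[State], dict]] = {
    "tick": tick, "prepare": prepare, "select": select,
    "decide": decide, "guard": guard,
}

# Edge functions only read state and return the next node name.
EDGES: dict[str, Callable[[State], str]] = {
    "tick": lambda s: "prepare",
    "prepare": lambda s: "select",
    "select": lambda s: "guard" if s["decision"] is not None else "decide",
    "decide": lambda s: "guard",
    "guard": lambda s: "END",
}


def run_once(initial: State) -> tuple[State, list[str]]:
    """Run one pass through the pipeline and record the visited nodes."""
    state, current, visited = dict(initial), "tick", []
    while current != "END":
        visited.append(current)
        state.update(NODES[current](state))
        current = EDGES[current](state)
    return state, visited
```

Running it with no region takes the bypass (tick, prepare, select, guard), while a state that already has a region goes through decide as well.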

What to avoid:
- qwen-3.7-max's 10-node decomposition and _route state field pattern
- qwen-3.6-plus's edge functions doing state mutation
- opus's 10+ node corrector chain (too many checkpoint writes per iteration)


Appendix: Convergence Analysis

Most proposals agree on these architectural decisions:

| Decision | Consensus | Dissent |
| --- | --- | --- |
| Extract iteration/abort guards into a tick node | 10/11 agree | glm-5.1 calls it pre_check |
| Extract schema preparation (decomposition + composition) into its own node | 9/11 agree | qwen-3.7-max calls it compose_schema |
| Isolate the LLM call into a slim node | 11/11 agree | |
| Extract post-LLM decision correction into a separate phase | 8/11 agree | glm-5.1 keeps it in acquire_or_plan |
| Region/currency bootstrap should bypass normal planning | 7/11 agree | Some keep it in tick/guard |
| Loop-back edges should target the first guard node, not plan | 9/11 agree | |
| llm_failed or decision_origin state field for calculator loop protection | 6/11 agree | Some handle it inline |

The strongest consensus is on isolating the LLM call and extracting iteration guards. The weakest consensus is on how granular post-LLM correction should be (one node vs multiple).