Planner Graph Refactor Proposals: Evaluation & Ranking
Evaluator: mimo-2.5-pro (Sisyphus)
Date: 2026-06-11
Scope: 11 proposals in docs/planner-graph-ref/proposals/
Reference: current-graph.md, src/venturescope/planner/agent.py:846-1192
Evaluation Criteria
Each proposal is assessed on:
| Criterion | Weight | What it measures |
|---|---|---|
| Problem accuracy | 15% | Does the proposal correctly identify all 10+ responsibilities in plan_node? |
| Decomposition quality | 25% | Are node boundaries drawn at the right seams? Single-responsibility? |
| Graph honesty | 20% | Does the resulting Mermaid diagram actually represent the control flow? |
| Migration safety | 15% | Can this ship incrementally without breaking existing planner behavior? |
| Test seam quality | 10% | Can each new node be unit-tested in isolation? |
| LangGraph alignment | 10% | Does it follow LangGraph idioms (nodes do work, edges do routing)? |
| Practical completeness | 5% | Are state changes, risks, and edge cases addressed? |
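The review does not say how the per-criterion scores are combined into the overall score. One plausible reading is a plain weighted average using the weights above. The sketch below assumes that reading; the function name `weighted_score` is illustrative and not part of the review.

```python
# Weights copied from the criteria table above (they sum to 100%).
WEIGHTS = {
    "Problem accuracy": 0.15,
    "Decomposition quality": 0.25,
    "Graph honesty": 0.20,
    "Migration safety": 0.15,
    "Test seam quality": 0.10,
    "LangGraph alignment": 0.10,
    "Practical completeness": 0.05,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Return the weighted average of per-criterion scores (0-10 scale).

    Assumption: the overall score is a simple weighted sum. The review
    itself does not state its aggregation rule.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Per-criterion scores as listed for the top-ranked proposal (fable-5).
fable_5 = {
    "Problem accuracy": 9,
    "Decomposition quality": 9.5,
    "Graph honesty": 9.5,
    "Migration safety": 9,
    "Test seam quality": 9.5,
    "LangGraph alignment": 9,
    "Practical completeness": 9,
}

if __name__ == "__main__":
    print(round(weighted_score(fable_5), 3))
```

Under this assumption the fable-5 inputs give roughly 9.28, close to but not identical to the reported 9.2, so the published scores may have been truncated or adjusted by judgment.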
Ranked List
1. fable-5: "Decompose plan_node into Graph-Level Stages"
Score: 9.2/10
| Criterion | Score | Notes |
|---|---|---|
| Problem accuracy | 9 | Maps all 9 responsibility blocks with line ranges |
| Decomposition quality | 9.5 | tick / prepare / select / decide / guard — clean, well-bounded |
| Graph honesty | 9.5 | Mermaid diagram shows real control flow including deterministic bypass |
| Migration safety | 9 | Two-phase: extract without rewiring, then rewire. Thread namespace bump for in-flight checkpoints. |
| Test seam quality | 9.5 | decision_origin field cleanly separates deterministic vs LLM paths for guard |
| LangGraph alignment | 9 | Nodes own mutations, edges own routing, tick as loop entry |
| Practical completeness | 9 | State changes documented, checkpoint compatibility addressed |
What's good:
- The tick → prepare → select → decide → guard pipeline is the most natural decomposition. Each node has exactly one job.
- decision_origin: Literal["deterministic", "llm"] is elegant: it tells guard which rewrite subset to apply, so guard does not have to reconstruct how the decision was produced.
- The select node correctly captures "deterministic decision ladder" as a distinct phase before the LLM is ever called.
- Checkpoint compatibility is explicitly addressed with thread namespace bumping.
- The "Follow-up" section honestly defers decompose loop node and Command(goto=...) as out-of-scope.
What's bad:
- select still has one LLM call (decomposition for blocked field without recipe, lines 960-968). This slightly violates "deterministic" labeling. The proposal acknowledges this.
- The guard node's decision_origin-based branching adds a small amount of complexity that needs documentation.
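To make the decision_origin idea concrete, here is a minimal sketch of a guard that picks its rewrite subset from that field. Only `decision_origin` and its two literal values come from the proposal; the state shape, the two rewrite functions, `KNOWN_ACTIONS`, and `MAX_SEARCHES` are hypothetical stand-ins.

```python
from typing import Callable, Literal, TypedDict

DecisionOrigin = Literal["deterministic", "llm"]


class PlannerState(TypedDict, total=False):
    # Hypothetical, simplified state; the real planner state has more fields.
    action: str
    search_count: int
    decision_origin: DecisionOrigin


Rewrite = Callable[[PlannerState], dict]

KNOWN_ACTIONS = {"search", "calculate", "ask_user", "finish"}  # hypothetical
MAX_SEARCHES = 3  # hypothetical cap


def normalize_llm_action(state: PlannerState) -> dict:
    """Hypothetical rewrite: replace an unknown LLM action with ask_user."""
    if state.get("action") not in KNOWN_ACTIONS:
        return {"action": "ask_user"}
    return {}


def cap_searches(state: PlannerState) -> dict:
    """Hypothetical rewrite: stop searching once the cap is reached."""
    if state.get("action") == "search" and state.get("search_count", 0) >= MAX_SEARCHES:
        return {"action": "ask_user"}
    return {}


# Deterministic decisions skip LLM-specific normalization.
REWRITES_BY_ORIGIN: dict[str, list[Rewrite]] = {
    "deterministic": [cap_searches],
    "llm": [normalize_llm_action, cap_searches],
}


def guard(state: PlannerState) -> dict:
    """Node: apply the rewrite subset selected by decision_origin.

    Returns a partial state update and never mutates the input.
    """
    update: dict = {}
    for rewrite in REWRITES_BY_ORIGIN[state["decision_origin"]]:
        # Each rewrite sees the state as amended by earlier rewrites.
        update.update(rewrite({**state, **update}))
    return update
```

The lookup table is also where the documentation burden noted above would sit: which rewrites run for which origin is visible in one place.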
2. gpt-5.5: "Move Planner Control Logic to Graph Level"
Score: 9.0/10
| Criterion | Score | Notes |
|---|---|---|
| Problem accuracy | 9 | Lists all 9 responsibility categories |
| Decomposition quality | 9 | enter_iteration / prepare_schema / require_region / require_currency / acquisition_gate / plan / normalize_decision / retry_gate / maybe_calculate |
| Graph honesty | 9 | Mermaid is detailed and accurate |
| Migration safety | 8.5 | 4-step staged migration, but many nodes to wire at once |
| Test seam quality | 9.5 | Each policy node is independently testable |
| LangGraph alignment | 9 | Clean separation of concerns |
| Practical completeness | 9 | State changes, test strategy, non-goals all documented |
What's good:
- normalize_decision, retry_gate, maybe_calculate as separate nodes is the most granular correct decomposition. Each post-LLM policy is independently testable.
- require_region and require_currency as dedicated nodes (not just ask_user) makes bootstrap flow explicit in the graph.
- The "Non-goals" section is precise: no outer graph changes, no SQL persistence changes.
- The need for an llm_failed or decision_origin state field for calculator loop protection is correctly identified.
What's bad:
- 9 new nodes is a lot. The graph becomes visually complex even though behavior is the same. For a team that reads Mermaid diagrams, this may be harder to follow than a 5-node pipeline.
- The migration plan ("Step 1: extract helpers, Step 2: add graph nodes one group at a time") is good but doesn't address checkpoint compatibility for in-flight conversations.
3. gpt-5.4: "Planner Graph 5.4 Proposal"
Score: 8.8/10
| Criterion | Score | Notes |
|---|---|---|
| Problem accuracy | 8.5 | Lists 7 responsibility categories (slightly less granular) |
| Decomposition quality | 9 | tick / bootstrap_gate / prepare_context / acquisition_gate / llm_plan / decision_policy |
| Graph honesty | 9 | Clean Mermaid with 6 pipeline stages |
| Migration safety | 9 | 6-stage migration, each stage is a standalone commit |
| Test seam quality | 8.5 | Good, but decision_policy bundles multiple rewrites |
| LangGraph alignment | 9 | Explicitly warns against state mutation in edge functions |
| Practical completeness | 8.5 | Anti-patterns section is excellent |
What's good:
- The "anti-patterns to avoid" section is unique and valuable: "Do not move state mutation into edge functions", "Do not increment iterations in every gate", "Do not turn every policy rule into a node", "Do not move interrupts into policy gates".
- 6-stage migration plan is the most detailed and safest.
- tick as the sole owner of iterations increment is correctly called out as critical.
- "deterministic orchestration belongs in graph phases; business logic stays in helpers; LLM planning stays small" is the clearest design principle statement.
What's bad:
- decision_policy bundles derived-field redirect, web-preferred redirect, search cap, ask-user cap, and calc adjustment into one node. This is a pragmatic choice but means that node is still ~80 lines with multiple concerns.
- bootstrap_gate as a single node that routes to ask_user for either region or currency loses some explicitness compared to separate require_region/require_currency nodes.
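Two of the anti-patterns quoted above (no state mutation in edge functions, and tick as the sole owner of the iterations increment) can be sketched with plain functions that follow the usual node/edge contract: a node returns a partial state update, an edge function returns the name of the next node. `MAX_ITERATIONS`, the state fields, and the route names are assumptions for illustration, and the bootstrap check is folded into the edge function here for brevity.

```python
from typing import TypedDict

MAX_ITERATIONS = 10  # hypothetical cap


class PlannerState(TypedDict, total=False):
    # Hypothetical, simplified state.
    iterations: int
    aborted: bool
    region: str
    currency: str


def tick(state: PlannerState) -> dict:
    """Node: the only place that increments the iteration counter."""
    return {"iterations": state.get("iterations", 0) + 1}


def route_after_tick(state: PlannerState) -> str:
    """Edge function: reads state and returns a route name, nothing else.

    It neither increments iterations nor writes any other field.
    """
    if state.get("aborted") or state.get("iterations", 0) > MAX_ITERATIONS:
        return "end"
    if not state.get("region") or not state.get("currency"):
        return "ask_user"
    return "prepare_context"
```

Because the counter has a single writer, a test can count loop passes by calling tick alone, and the routing function can be tested with hand-built state dictionaries.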
4. deepseek-4-pro: "Decompose plan_node into Graph-Level Logic"
Score: 8.5/10
| Criterion | Score | Notes |
|---|---|---|
| Problem accuracy | 9 | Lists all 12 responsibility items with line references |
| Decomposition quality | 8.5 | guard / prepare / plan / adjust — 4 nodes |
| Graph honesty | 8.5 | Good Mermaid, but prepare routing is complex |
| Migration safety | 9 | 4-phase migration, Phase 1 is extraction without graph changes |
| Test seam quality | 8 | Good, but prepare has 5 exit paths |
| LangGraph alignment | 8.5 | Mostly good, but plan self-loop for decomposition is unusual |
| Practical completeness | 8.5 | Risk assessment is honest |
What's good:
- The guard → prepare → plan → adjust pipeline is the simplest 4-node decomposition that captures the essential phases.
- recipes: dict[str, FieldAcquisition] in state is a good optimization — avoids recomputing in multiple nodes.
- The migration path is clear: Phase 1 (extract helpers), Phase 2 (rewire graph), Phase 3 (cleanup).
- Risk assessment is honest about recipes synchronization concerns.
What's bad:
- prepare has 5 exit conditions (calculator cap, calculator success, blocked→acquire, auto-complete, pass-through). This is more routing complexity than ideal for a "preparation" node.
- plan self-looping for decomposition generation is architecturally unusual — a node looping to itself with a conditional edge feels like a workaround.
- adjust at ~80 lines is still substantial for a "correction" node.
5. opus: "Hoist plan_node Policy Into Graph Edges"
Score: 8.3/10
| Criterion | Score | Notes |
|---|---|---|
| Problem accuracy | 9.5 | Most detailed responsibility table (17 items with line ranges) |
| Decomposition quality | 8 | g_iter_cap / g_region / g_currency / enrich_schema / g_calc_caps / acquire / plan_llm / 6 c_* correctors / dispatch |
| Graph honesty | 8.5 | Very detailed Mermaid with naming conventions |
| Migration safety | 8 | 3-step migration, but Step 2 is "hoist gates into graph" which is a big change |
| Test seam quality | 8.5 | Each corrector is independently testable |
| LangGraph alignment | 7.5 | 10+ nodes with add_edge chains for correctors — unusual pattern |
| Practical completeness | 8.5 | State surface changes, routing rules documented |
What's good:
- The g_* (gate) / enrich / acquire / plan_llm / c_* (corrector) / dispatch naming convention is the clearest taxonomy.
- The "one invariant ties everything together" section is excellent: every path deposits a PlannerDecision, dispatch reads decision.action.
- 17-item responsibility mapping is the most thorough problem analysis.
- State surface changes table (which node writes which fields) is unique and valuable.
What's bad:
- 10+ graph nodes is too many. The c_* corrector chain as 6 separate nodes with add_edge (not add_conditional_edges) creates a long linear chain that adds checkpoint writes without adding routing intelligence.
- The g_region → emit_region split (predicate vs emitter) is over-engineering for a 2-line check.
- Migration Step 2 ("hoist gates into graph") is a large change — multiple new nodes and edges at once.
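One way to keep the per-corrector test seams while avoiding a six-node linear chain is to run the correctors as an ordered list inside a single node, with dispatch still reading only decision.action. This is a sketch of that alternative, not something the opus proposal contains: the `PlannerDecision` fields, the two corrector functions, and both caps are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class PlannerDecision:
    # Hypothetical shape; only the `action` attribute is named in the review.
    action: str
    reason: str = ""


Corrector = Callable[[PlannerDecision, dict], PlannerDecision]

MAX_SEARCHES = 3  # hypothetical cap
MAX_ASKS = 2      # hypothetical cap


def c_search_cap(decision: PlannerDecision, state: dict) -> PlannerDecision:
    """Hypothetical corrector: redirect search to ask_user at the cap."""
    if decision.action == "search" and state.get("search_count", 0) >= MAX_SEARCHES:
        return replace(decision, action="ask_user", reason="search cap reached")
    return decision


def c_ask_user_cap(decision: PlannerDecision, state: dict) -> PlannerDecision:
    """Hypothetical corrector: finish instead of asking again at the cap."""
    if decision.action == "ask_user" and state.get("ask_count", 0) >= MAX_ASKS:
        return replace(decision, action="finish", reason="ask-user cap reached")
    return decision


# Order matters: later correctors see the output of earlier ones.
CORRECTORS: tuple[Corrector, ...] = (c_search_cap, c_ask_user_cap)


def correct_decision(state: dict) -> dict:
    """Single node: run every corrector in order, emit one state update."""
    decision = state["decision"]
    for corrector in CORRECTORS:
        decision = corrector(decision, state)
    return {"decision": decision}


def dispatch(state: dict) -> str:
    """Edge function: the invariant is that a decision is always present."""
    return state["decision"].action
```

Each `c_*` function stays unit-testable on its own, while the graph gains one node rather than six.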
6. kimi-2.6: "Decompose plan_node into Graph-Level Pipeline"
Score: 8.1/10
| Criterion | Score | Notes |
|---|---|---|
| Problem accuracy | 8.5 | Lists 9 responsibility categories |
| Decomposition quality | 8 | tick / ask_region / ask_currency / prepare / calc_gate / acquire / route_direct / decide / enforce_policy |
| Graph honesty | 8 | Good Mermaid with clear node labels |
| Migration safety | 7.5 | 3-step migration but lacks checkpoint compatibility details |
| Test seam quality | 8 | Each node is testable |
| LangGraph alignment | 8 | Good separation |
| Practical completeness | 7.5 | Decision matrix appendix is helpful but risks section is thin |
What's good:
- ask_region / ask_currency as dedicated nodes (not generic ask_user) is correct — bootstrap questions are structural, not LLM decisions.
- The decision matrix appendix (Logic Block → Current Location → Proposed Location → Reason) is excellent for implementation.
- route_direct as a tiny adapter node bridging deterministic acquisition into the post-processing pipeline is a clean pattern.
What's bad:
- enforce_policy bundles 7 different rewrites into one node. The proposal acknowledges this ("still the most complex node") but doesn't offer a clear path to further decomposition.
- The migration plan is thin — "Extract helper functions" → "Register as nodes" → "Delete plan_node" is too high-level.
- No discussion of checkpoint compatibility or in-flight conversation handling.
7. mimo-2.5-pro: "Refactor plan_node — Move Routing Logic to Graph Level"
Score: 7.8/10
| Criterion | Score | Notes |
|---|---|---|
| Problem accuracy | 8 | Lists 10 responsibility categories |
| Decomposition quality | 7.5 | guards / acquire / decide — 3 nodes |
| Graph honesty | 7.5 | Simple Mermaid, but decide bundles too much |
| Migration safety | 8 | 3-phase migration with risk assessment |
| Test seam quality | 7.5 | decide node is still ~80 lines with multiple concerns |
| LangGraph alignment | 7.5 | Mostly good, but plan_router as a routing function (not a node) is inconsistent |
| Practical completeness | 7.5 | Open questions section is honest |
What's good:
- The 3-node decomposition (guards → acquire → decide) is the simplest proposal. Easy to understand and implement.
- Phase 1 (extract without changing topology) is the safest migration approach.
- The "Expected Benefits" table is honest about LOC counts.
What's bad:
- decide bundles LLM call + redirectors + loop/cap detection. This is still a complex node (~80 lines) with multiple concerns.
- plan_router as a routing function (not a node) is inconsistent with the rest of the graph — it's a function that decide calls, not a graph-level construct.
- Open questions (should decide include loop detection? should plan_router be a node?) suggest the design isn't fully resolved.
8. gemini-3.1-pro: "Refactoring Planner Logic to the Graph Level"
Score: 7.5/10
| Criterion | Score | Notes |
|---|---|---|
| Problem accuracy | 7.5 | Lists 8 responsibility items (less granular) |
| Decomposition quality | 7 | prepare_state / evaluate_rules / llm_plan — 3 nodes |
| Graph honesty | 7 | Simple Mermaid, but evaluate_rules bundles too much |
| Migration safety | 7.5 | 3-step implementation plan |
| Test seam quality | 7 | evaluate_rules is still complex |
| LangGraph alignment | 7.5 | Good separation of concerns |
| Practical completeness | 6.5 | Benefits section is thin, no risks section |
What's good:
- The 3-node split (prepare_state / evaluate_rules / llm_plan) is the simplest decomposition that captures the essential separation.
- "Reduced Latency/Cost Risk: The LLM node becomes completely isolated" is a good insight.
- The implementation steps are concrete and actionable.
What's bad:
- evaluate_rules bundles 7 different checks (max_iters, aborted, region, currency, calculator, acquisition, auto-finish) into one node. This is still a complex function.
- The "route_after_rules" edge function handles both deterministic routing AND the "needs_llm" flag — this mixes concerns.
- No discussion of state changes, checkpoint compatibility, or risks.
- The Mermaid diagram uses route_after_rules and route_after_llm as diamond nodes, but in LangGraph these are edge functions, not nodes. This is slightly misleading.
9. glm-5.1: "Decompose plan Node into Graph-Level Routing"
Score: 7.3/10
| Criterion | Score | Notes |
|---|---|---|
| Problem accuracy | 7.5 | Lists 6 responsibility categories |
| Decomposition quality | 7 | pre_check / acquire_or_plan — 2 nodes |
| Graph honesty | 7 | Simple Mermaid, but acquire_or_plan is still complex |
| Migration safety | 7.5 | 4-phase migration with optional Phase 4 |
| Test seam quality | 7 | pre_check is testable, acquire_or_plan is still ~200 LOC |
| LangGraph alignment | 7 | Good separation |
| Practical completeness | 7 | Risks section is present but thin |
What's good:
- The pre_check extraction is the minimal valuable change — it pulls out all no-LLM guards into a testable node.
- Phase 4 (optional validate_decision node) is honest about what's deferred.
- The _build_plan_output helper idea is practical — collapsing 9 conditional-return patterns.
What's bad:
- acquire_or_plan at ~200 LOC is still a substantial node with 7 sequential steps. The proposal acknowledges this but doesn't offer a clear path to further decomposition.
- Only 2 new nodes is the least ambitious decomposition. It solves the "guard interleaving" problem but leaves the "post-LLM rewriting" problem inside acquire_or_plan.
- The _pre_check_route state field uses a string literal instead of a typed enum, which is less safe.
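The typing concern in the last bullet could be addressed with a small enum. The sketch below assumes three routes and a simplified pre_check; only the `_pre_check_route` field name and the `pre_check` / `acquire_or_plan` node names come from the proposal, everything else is illustrative.

```python
from enum import Enum


class PreCheckRoute(str, Enum):
    """Closed set of routes; the member values double as route names."""
    END = "end"
    ASK_USER = "ask_user"
    ACQUIRE_OR_PLAN = "acquire_or_plan"


def pre_check(state: dict) -> dict:
    """Node: record which branch the no-LLM guards selected (simplified)."""
    if state.get("aborted"):
        route = PreCheckRoute.END
    elif not state.get("region"):
        route = PreCheckRoute.ASK_USER
    else:
        route = PreCheckRoute.ACQUIRE_OR_PLAN
    return {"_pre_check_route": route}


def route_after_pre_check(state: dict) -> str:
    """Edge function: validate the stored route and return its name."""
    # PreCheckRoute(...) raises ValueError for any value outside the enum,
    # so a misspelled route fails loudly here instead of being passed on.
    return PreCheckRoute(state["_pre_check_route"]).value
```

Because the enum mixes in `str`, a plain string restored from a checkpoint still validates through the same call.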
10. qwen-3.6-plus: "Decompose plan_node into Graph-Level Routing"
Score: 7.0/10
| Criterion | Score | Notes |
|---|---|---|
| Problem accuracy | 7.5 | Lists 12 responsibility items |
| Decomposition quality | 6.5 | preflight / decompose / plan + route_after_plan with redirects + route_finish_check |
| Graph honesty | 7 | Good Mermaid, but route_after_plan as edge function doing redirects is unusual |
| Migration safety | 7 | 5-phase migration |
| Test seam quality | 6.5 | route_after_plan as an edge function doing redirects is hard to test |
| LangGraph alignment | 6 | Edge functions doing state mutation (redirects) violates LangGraph idioms |
| Practical completeness | 7 | Open questions section is thoughtful |
What's good:
- ask_region / ask_currency with dedicated observe_region / observe_currency nodes is the most explicit bootstrap handling.
- route_finish_check as a validation node before END is a unique insight — it catches cases where finish is premature.
- The 5-phase migration plan is detailed.
What's bad:
- Moving post-LLM redirects into route_after_plan (an edge function) is an anti-pattern in LangGraph. Edge functions should be pure routing decisions, not state mutations. This makes checkpoint behavior harder to reason about.
- route_after_plan doing 6 different redirect checks is complex for an edge function — it should be a node.
- The proposal creates observe_region / observe_currency as separate nodes, which adds graph complexity without clear benefit (they share most logic with observe_user).
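The difference between a redirecting edge function and a node-plus-pure-edge split can be shown with a small purity check. This is a toy illustration under assumed names: `route_after_plan` is named in the proposal, while `apply_redirects`, `route_after_plan_mutating`, `is_pure_edge`, the state fields, and the cap are hypothetical.

```python
import copy

MAX_SEARCHES = 3  # hypothetical cap


def route_after_plan_mutating(state: dict) -> str:
    """Anti-pattern: an edge function that rewrites the decision in place."""
    if state["action"] == "search" and state["search_count"] >= MAX_SEARCHES:
        state["action"] = "ask_user"  # state mutation hidden inside routing
    return state["action"]


def apply_redirects(state: dict) -> dict:
    """Node: the same redirect, expressed as an explicit state update."""
    if state["action"] == "search" and state["search_count"] >= MAX_SEARCHES:
        return {"action": "ask_user"}
    return {}


def route_after_plan(state: dict) -> str:
    """Edge function: pure routing on the already-corrected decision."""
    return state["action"]


def is_pure_edge(edge_fn, state: dict) -> bool:
    """Test helper: True if calling edge_fn leaves the state unchanged."""
    snapshot = copy.deepcopy(state)
    edge_fn(state)
    return state == snapshot
```

A check like `is_pure_edge` could run in the unit tests for every routing function, which turns the "edges only route" rule into something enforceable.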
11. qwen-3.7-max: "Decomposing the plan Node"
Score: 6.5/10
| Criterion | Score | Notes |
|---|---|---|
| Problem accuracy | 7 | Lists 11 responsibility items |
| Decomposition quality | 6 | 10 nodes including check_termination, check_region_currency, compose_schema, check_calculator, acquisition_routing, check_completion, llm_decide, post_process, enforce_caps, route_decision |
| Graph honesty | 6.5 | Detailed Mermaid but overly granular |
| Migration safety | 6 | 5-phase migration, but 10 nodes is a big change |
| Test seam quality | 6.5 | Each node is testable, but graph complexity is high |
| LangGraph alignment | 5.5 | _route state field as transient routing hint is an anti-pattern |
| Practical completeness | 6.5 | State schema changes documented, but _route field approach is problematic |
What's good:
- The most granular decomposition — every check is its own node.
- Code examples for each node are complete and runnable.
- The _route state field approach is at least explicit about what it's doing.
What's bad:
- 10 new nodes is too many. The graph becomes harder to read than the original monolith. The Mermaid diagram has 20+ edges.
- Using _route: NotRequired[str] as a transient routing hint stored in state is an anti-pattern. LangGraph conditional edges should read existing state and return a string — they shouldn't need a special routing field. This adds state mutation that the checkpointer sees.
- check_termination and check_region_currency as separate nodes adds a hop for what could be a single preflight node.
- route_decision as a node that just reads decision.action is redundant — this is what route_after_plan already does as an edge function.
Summary Table
| Rank | Proposal | Author | Nodes | Score | Key Strength | Key Weakness |
|---|---|---|---|---|---|---|
| 1 | fable-5 | fable-5 | 5 new | 9.2 | decision_origin for deterministic vs LLM separation | select still has one LLM call |
| 2 | gpt-5.5 | gpt-5.5 | 9 new | 9.0 | Most granular correct decomposition | Graph visually complex |
| 3 | gpt-5.4 | gpt-5.4 | 6 new | 8.8 | Best anti-patterns section, safest migration | decision_policy bundles multiple rewrites |
| 4 | deepseek-4-pro | deepseek-v4-pro | 4 new | 8.5 | Simplest pipeline that captures essential phases | plan self-loop for decomposition |
| 5 | opus | opus-4.7 | 10+ new | 8.3 | Best naming convention and responsibility mapping | Too many nodes, corrector chain overhead |
| 6 | kimi-2.6 | kimi-k2.6 | 8 new | 8.1 | Decision matrix appendix | enforce_policy still complex |
| 7 | mimo-2.5-pro | mimo-2.5-pro | 3 new | 7.8 | Simplest proposal, easiest to understand | decide bundles too much |
| 8 | gemini-3.1-pro | gemini-3.1-pro | 3 new | 7.5 | Simplest viable decomposition | evaluate_rules still complex, thin docs |
| 9 | glm-5.1 | glm-5.1 | 2 new | 7.3 | Minimal valuable change | acquire_or_plan still ~200 LOC |
| 10 | qwen-3.6-plus | qwen-3.6-plus | 5 new | 7.0 | route_finish_check validation node | Edge functions doing state mutation |
| 11 | qwen-3.7-max | qwen-3.7-max | 10 new | 6.5 | Most granular | Too many nodes, _route anti-pattern |
Recommendation
Best single proposal: fable-5. It has the best balance of decomposition quality, migration safety, and LangGraph alignment. The decision_origin field is an elegant solution for separating deterministic and LLM paths.
Best combination: fable-5 core + gpt-5.4 anti-patterns + gpt-5.5 post-LLM granularity.
Concretely:
- Adopt fable-5's 5-node pipeline (tick → prepare → select → decide → guard) as the core architecture.
- Adopt gpt-5.4's anti-patterns section as implementation constraints (no state mutation in edges, only tick increments iterations, don't turn every rule into a node, keep ask_user as sole interrupt).
- Consider splitting guard into normalize_decision + retry_gate + maybe_calculate (from gpt-5.5) if post-LLM policy testing proves difficult after the initial refactor. This can be a follow-up PR.
- Adopt gpt-5.4's 6-stage migration plan for safety (each stage is a standalone commit).
The combined approach gives:
- 5 nodes in the initial refactor (fable-5)
- Clear design constraints (gpt-5.4 anti-patterns)
- Optional further decomposition path (gpt-5.5 post-LLM nodes)
- Safest possible migration (gpt-5.4 staged approach)
What to avoid:
- qwen-3.7-max's 10-node decomposition and _route state field pattern
- qwen-3.6-plus's edge functions doing state mutation
- opus's 10+ node graph, in particular its 6-node corrector chain (too many checkpoint writes per iteration)
Appendix: Convergence Analysis
Most proposals agree on these architectural decisions:
| Decision | Consensus | Dissent |
|---|---|---|
| Extract iteration/abort guards into a tick node | 10/11 agree | glm-5.1 calls it pre_check |
| Extract schema preparation (decomposition + composition) into its own node | 9/11 agree | qwen-3.7-max calls it compose_schema |
| Isolate the LLM call into a slim node | 11/11 agree | none |
| Extract post-LLM decision correction into a separate phase | 8/11 agree | glm-5.1 keeps it in acquire_or_plan |
| Region/currency bootstrap should bypass normal planning | 7/11 agree | Some keep it in tick/guard |
| Loop-back edges should target the first guard node, not plan | 9/11 agree | none |
| llm_failed or decision_origin state field for calculator loop protection | 6/11 agree | Some handle it inline |
The strongest consensus is on isolating the LLM call and extracting iteration guards. The weakest consensus is on how granular post-LLM correction should be (one node vs multiple).
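The table does not say why loop-back edges should target the first guard node. One plausible reason is that the iteration cap only applies if every pass re-enters through the node that enforces it. The toy runner below illustrates that reasoning under assumed names; it is not LangGraph code, and `MAX_ITERATIONS`, the three-node loop, and `run` are hypothetical.

```python
MAX_ITERATIONS = 3  # hypothetical cap enforced only by the tick node


def run(loop_back_target: str, max_steps: int = 50) -> tuple[int, bool]:
    """Walk a toy tick -> plan -> search loop.

    After search, control returns to `loop_back_target`. Returns the number
    of plan calls and whether the loop terminated within max_steps.
    """
    state = {"iterations": 0}
    node, plan_calls = "tick", 0
    for _ in range(max_steps):
        if node == "tick":
            # The guard node: counts the pass and enforces the cap.
            state["iterations"] += 1
            node = "plan" if state["iterations"] <= MAX_ITERATIONS else "end"
        elif node == "plan":
            plan_calls += 1
            node = "search"  # toy planner: always asks for another search
        elif node == "search":
            node = loop_back_target
        if node == "end":
            return plan_calls, True
    return plan_calls, False


if __name__ == "__main__":
    print(run("tick"))  # cap is enforced on every pass
    print(run("plan"))  # cap is bypassed after the first pass
```

With the loop-back aimed at tick, the planner runs exactly `MAX_ITERATIONS` times and stops; aimed at plan, the guard is visited once and the loop only ends when the runner's own step budget runs out.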