Planner Graph Refactor: Proposal Evaluation
Author: Sisyphus (qwen-3.7-max)
Date: 2026-06-11
Scope: Evaluation of 11 proposals in docs/planner-graph-ref/proposals/
Evaluation Criteria
Each proposal is assessed on eight axes:
| Axis | Weight | Description |
|---|---|---|
| Decomposition quality | High | Are node boundaries natural? Does each node have one clear job? |
| Completeness | High | Does it cover all 12 concerns currently inside plan_node? |
| Migration safety | High | Is the path incremental? Can each step ship green? |
| State management | Medium | Are state changes minimal, safe, and serializer-compatible? |
| Testability gain | Medium | Does it actually improve test isolation per concern? |
| Graph honesty | Medium | Does the Mermaid diagram become a truthful map of control flow? |
| LangGraph alignment | Medium | Does it follow LangGraph conventions (side-effect-free edges, node-owned mutations)? |
| Practical feasibility | Medium | Is the node count manageable? Is checkpoint overhead acceptable? |
Ranked List
1. fable-5: tick → prepare → select → decide → guard
Decomposition: 5 nodes. The select/decide split cleanly separates deterministic from LLM-originated decisions, which is the single most valuable architectural seam in the entire problem space.
Strengths:
- decision_origin field is the cleanest abstraction any proposal introduces. It lets guard apply different rewrite subsets based on where the decision came from — exactly matching the current code's implicit branching.
- Best checkpoint reasoning of all proposals. Explicitly addresses that one plan superstep becomes up to 4 checkpoint writes, and explains why that's acceptable for a chat-paced agent.
- Only proposal that addresses in-flight checkpoint compatibility — proposes bumping the planner thread namespace (planner:v2) so old checkpoints degrade to clean re-bootstrap. Every other proposal silently breaks resume for pending planner threads.
- llm_failed as a state field (instead of a local variable) is the right call — it lets guard skip _adjust_calculation_decision after structured-output failure, preserving the current infinite-loop protection.
- Follow-up section defers decompose loop node and Command(goto=...) returns with clear reasoning — shows maturity about what NOT to do now.
Weaknesses:
- select still contains one LLM call (blocked-path decomposition, lines 960-968). The proposal acknowledges this and defers it, but it means select is not purely deterministic.
- Test churn is acknowledged as "the bulk of the diff", but the proposal offers no concrete test migration strategy beyond "target the new stage nodes."
Verdict: Best overall balance of decomposition depth, state management, and migration realism. The decision_origin abstraction alone justifies adoption.
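To make the decision_origin idea concrete, the sketch below shows one way a guard stage could pick its rewrite subset from decision_origin and llm_failed. It is plain Python rather than LangGraph code, and the rewrite functions, the MAX_SEARCHES cap, and the choice of which rewrites apply to which origin are illustrative assumptions, not taken from the fable-5 proposal or the current plan_node.

```python
from typing import Callable, Literal, Optional, TypedDict

Origin = Literal["deterministic", "llm"]


class Decision(TypedDict):
    action: str
    reason: str


class PlannerState(TypedDict, total=False):
    decision: Decision
    decision_origin: Optional[Origin]
    llm_failed: bool
    searches: int


Rewrite = Callable[[PlannerState, Decision], Decision]

MAX_SEARCHES = 3  # hypothetical cap, for illustration only


def cap_searches(state: PlannerState, decision: Decision) -> Decision:
    # Assumed to apply to every decision, whatever its origin.
    if decision["action"] == "search" and state.get("searches", 0) >= MAX_SEARCHES:
        return {"action": "ask_user", "reason": "search cap reached"}
    return decision


def adjust_calculation(state: PlannerState, decision: Decision) -> Decision:
    # Stand-in for _adjust_calculation_decision: calculate before finishing.
    if decision["action"] == "finish":
        return {"action": "calculate", "reason": "calculate before finish"}
    return decision


def guard(state: PlannerState) -> dict:
    """Apply the rewrite subset that matches where the decision came from."""
    decision = state["decision"]
    rewrites: list[Rewrite] = [cap_searches]
    # Skip the calculation adjustment after a structured-output failure.
    if state.get("decision_origin") == "llm" and not state.get("llm_failed", False):
        rewrites.append(adjust_calculation)
    for rewrite in rewrites:
        decision = rewrite(state, decision)
    return {"decision": decision}
```

Because the subset is chosen from two state fields, each branch can be tested by constructing a small state dict, without invoking an LLM.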
2. deepseek-4-pro: guard → prepare → plan → adjust
Decomposition: 4 nodes. The most intuitive naming of any proposal — guard/prepare/plan/adjust reads like a sentence.
Strengths:
- Cleanest migration plan: 4 phases, each mechanically verifiable, tests pass at each step.
- recipes in state is a reasonable optimization — currently recomputed in plan_node and observe_user_node. Caching avoids redundant work.
- The plan → plan self-loop for decomposition regeneration is elegant — avoids adding a separate decompose node while keeping the LLM re-prompting behavior explicit in the graph.
- Best "Alternatives Considered" section. Explicitly rejects 16-node decomposition, calculator subgraph, and minimal-change options with clear reasoning.
- Risk assessment is honest: low risk for structural decomposition, medium risk for recipes synchronization.
Weaknesses:
- recipes in state introduces a new serialization concern — FieldAcquisition objects must be serializer-compatible. The proposal mentions this but doesn't fully resolve it.
- adjust node still handles 5+ rewrites sequentially. Less granular than fable-5's guard with decision_origin-based subsetting.
- Doesn't address in-flight checkpoint compatibility.
Verdict: Close second. Simpler than fable-5 (4 vs 5 nodes) but loses the decision_origin insight. Best choice if the team prefers minimal node count.
3. gpt-5.4: tick → bootstrap_gate → prepare_context → acquisition_gate → llm_plan → decision_policy
Decomposition: 6 nodes. The strongest architectural guidance of any proposal.
Strengths:
- Best anti-pattern section. Explicitly warns against: state mutation in edge functions, iteration increment in multiple gates, turning every policy rule into a node, and moving interrupts into policy gates. These are the exact mistakes a team would make without guidance.
- 6-stage migration order is well-sequenced: tick first (cleanest seam), bootstrap_gate second (preserves interrupt behavior), then progressively deeper extractions.
- "What should not change" section is uniquely valuable — explicitly preserves ask_user as the only interrupt node, planner thread namespacing, outer run_planner_step() contract, and one-decision-per-iteration invariant.
- The principle "deterministic orchestration belongs in graph phases; business logic stays in helpers; LLM planning stays small" is the best one-sentence design rule across all proposals.
Weaknesses:
- 6 nodes is at the upper edge of what is practical. bootstrap_gate as a separate node from tick adds a graph hop for what is essentially two if checks.
- No state schema changes proposed, which means routing relies heavily on reading decision from state — works but less explicit than fable-5's decision_origin.
- Less detailed on post-LLM rewriting placement than fable-5 or deepseek-4-pro.
Verdict: Best for teams that need architectural guardrails. The anti-pattern guidance alone prevents months of regret. Slightly over-decomposed compared to fable-5.
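The first two anti-patterns (state mutation in edge functions, iteration increments in multiple places) can be illustrated with a minimal sketch. It uses plain dicts and functions rather than the LangGraph API; route_after_policy is a hypothetical name, and the "finish" fallback is an assumption made only so the example is total.

```python
from typing import TypedDict


class State(TypedDict, total=False):
    iterations: int
    decision: dict


def tick(state: State) -> dict:
    """Node: the single place that increments iterations.

    Returns a partial update instead of mutating the input state.
    """
    return {"iterations": state.get("iterations", 0) + 1}


def route_after_policy(state: State) -> str:
    """Edge function: reads state, returns a route label, mutates nothing."""
    decision = state.get("decision") or {}
    # Assumed fallback: with no decision present, route to "finish".
    return decision.get("action", "finish")
```

Keeping the edge function read-only means the route taken is fully determined by the state that was checkpointed before it ran.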
4. opus: g_iter_cap → g_region → g_currency → enrich_schema → g_calc_caps → acquire → plan_llm → c_target_decompose → c_redirect_derived → c_redirect_web → c_search_cap → c_ask_cap → c_calc_adjust → dispatch
Decomposition: 16+ nodes. The most thorough analysis, but impractical as-is.
Strengths:
- Best problem analysis. The 17-row responsibility table with line numbers and kind classification (guard/pre-flight/enrichment/corrector) is the definitive map of plan_node.
- Gate/corrector taxonomy (g_* / c_*) is the right mental model. Gates emit decisions and short-circuit; correctors mutate existing decisions sequentially. This distinction is implicit in other proposals but explicit here.
- State ownership table (which node writes which fields) eliminates the 8 separate if schema_changed: out["schema"] = schema_dict blocks — a real code quality win.
- 3-step migration is pragmatic: extract helpers first (no graph change), hoist gates second, split correctors third. Each step ships green.
- Honest about checkpoint overhead: "up to 10 node-to-node hops per iteration" with mitigation via add_edge for the statically-known corrector chain.
Weaknesses:
- 16+ nodes is too many. The Mermaid diagram is a wall of boxes. LangGraph visualization becomes noise rather than signal.
- Splitting predicate from emission into separate emit_region/emit_currency nodes is over-engineering: the predicate is 3 lines, the emission is 5 lines.
- g_region and g_currency as separate nodes (vs. one bootstrap_gate) adds two graph hops for what is structurally identical logic.
- Open question about whether enrich_schema should always run is left unresolved.
Verdict: Best reference document, worst implementation target. Use the gate/corrector taxonomy as a mental model, implement fable-5's 5-node structure. The 3-step migration strategy is worth borrowing.
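The gate/corrector distinction can be kept as a mental model inside a single node. The sketch below is one possible shape, assuming gates return either a decision or None and correctors take and return a decision. The g_* and c_* names echo the opus proposal, but the bodies, default caps, and rewrite targets are illustrative assumptions.

```python
from typing import Callable, Optional

Decision = dict
Gate = Callable[[dict], Optional[Decision]]
Corrector = Callable[[dict, Decision], Decision]


def run_gates(state: dict, gates: list[Gate]) -> Optional[Decision]:
    """Gates emit a decision and short-circuit; None means 'pass through'."""
    for gate in gates:
        decision = gate(state)
        if decision is not None:
            return decision
    return None


def run_correctors(state: dict, decision: Decision, correctors: list[Corrector]) -> Decision:
    """Correctors rewrite an existing decision, one after another."""
    for corrector in correctors:
        decision = corrector(state, decision)
    return decision


def g_iter_cap(state: dict) -> Optional[Decision]:
    if state.get("iterations", 0) >= state.get("max_iters", 10):
        return {"action": "finish", "reason": "iteration cap"}
    return None


def g_region(state: dict) -> Optional[Decision]:
    if not state.get("region"):
        return {"action": "ask_user", "reason": "region missing"}
    return None


def c_search_cap(state: dict, decision: Decision) -> Decision:
    if decision["action"] == "search" and state.get("searches", 0) >= state.get("max_searches", 3):
        return {**decision, "action": "ask_user", "reason": "search cap reached"}
    return decision


def c_ask_cap(state: dict, decision: Decision) -> Decision:
    if decision["action"] == "ask_user" and state.get("asks", 0) >= state.get("max_asks", 2):
        return {**decision, "action": "finish", "reason": "ask cap reached"}
    return decision


def plan(state: dict, llm_decision: Decision) -> Decision:
    gated = run_gates(state, [g_iter_cap, g_region])
    if gated is not None:
        return gated
    return run_correctors(state, llm_decision, [c_search_cap, c_ask_cap])
```

Both lists stay ordinary Python sequences, so the ordering is explicit and testable without adding a graph hop per rule.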
5. gpt-5.5: enter_iteration → prepare_schema → require_region/currency → acquisition_gate → plan → normalize_decision → retry_gate → maybe_calculate
Decomposition: 8 nodes. Most granular "practical" proposal.
Strengths:
- Most detailed test plan. Explicitly maps existing test assertions to new node targets: calculator-before-finish → maybe_calculate, search cap → retry_gate, etc.
- normalize_decision / retry_gate / maybe_calculate as separate post-LLM nodes makes each policy correction independently testable.
- Comprehensive "Non-goals" section prevents scope creep.
- State change guidance is precise: llm_failed or decision_origin recommended, prepared_recipes explicitly rejected unless serialization is proven.
Weaknesses:
- 8 nodes is over-decomposed for the actual complexity. require_region and require_currency as separate nodes is graph noise — they're structurally identical 5-line checks.
- maybe_calculate as a separate node from retry_gate splits what is conceptually one "decision policy" pass into two graph hops.
- Migration plan has 7 node-introduction steps — high coordination cost.
Verdict: Good test plan, over-engineered graph. Borrow the test migration strategy, use a simpler node structure.
6. kimi-2.6: tick → ask_region/ask_currency → prepare → calc_gate → acquire → route_direct → decide → enforce_policy
Decomposition: 8 nodes. Clean naming, good appendix.
Strengths:
- Decision matrix appendix (what goes where) is the best quick-reference of any proposal.
- route_direct as a tiny adapter node (acquisition task → PlannerDecision) is a nice touch — bridges deterministic and LLM-originated decisions into the same post-processing pipeline.
- "Each node should do ONE thing" principle is stated clearly and applied consistently.
- Migration path (extract helpers → register as nodes → delete plan_node) is the standard 3-phase approach, well-explained.
Weaknesses:
- ask_region and ask_currency as separate nodes (vs. routing through the existing ask_user node) duplicates interrupt infrastructure.
- calc_gate as a separate node from acquire splits calculator lifecycle into two hops when they're tightly coupled.
- Less detailed on state management than fable-5 or gpt-5.5.
Verdict: Solid middle-tier proposal. The decision matrix is worth borrowing. Graph structure is less optimal than fable-5 or deepseek-4-pro.
7. glm-5.1: pre_check → acquire_or_plan
Decomposition: 2 nodes. Most conservative approach.
Strengths:
- Lowest migration risk. Only one new node (pre_check) plus a rename (plan → acquire_or_plan).
- _build_plan_output helper to collapse the 9-way conditional-return pattern is a good standalone improvement.
- Option A (keep post-LLM rewriting inline) vs Option B (extract validate_decision) gives the team a choice based on risk tolerance.
- Risk table is thorough and honest.
Weaknesses:
- Doesn't go far enough. acquire_or_plan is still ~200 LOC with acquisition + LLM + rewriting fused. The core problem (hidden control flow) is only half-solved.
- _pre_check_route as a state field is an anti-pattern — routing should live in edge functions, not state.
- No decision_origin equivalent means acquire_or_plan still needs to know whether it produced a deterministic or LLM decision for correct rewriting.
Verdict: Good first step, insufficient end state. Could be Phase 1 of a deeper refactor but shouldn't be the final architecture.
8. qwen-3.7-max: check_termination → check_region_currency → compose_schema → check_calculator → acquisition_routing → check_completion → llm_decide → post_process → enforce_caps → route_decision
Decomposition: 10 nodes. Most implementation-ready with full code examples.
Strengths:
- Full code for every node — copy-paste ready. Useful as an implementation reference regardless of which architecture is chosen.
- Graph construction code is complete and runnable.
- Transient _route / _llm_failed state fields are explicitly marked as non-persisted.
Weaknesses:
- 10 nodes is excessive. check_termination and check_region_currency should be one node. check_calculator and check_completion should be one node.
- _route as a transient state field is an anti-pattern. LangGraph conditional edges should read state and return route labels — routing hints in state defeat the purpose of declarative edges.
- route_decision as a separate node (vs. a routing function) adds a graph hop for what is effectively a one-line return of decision.action.
- No discussion of checkpoint overhead with 10 nodes per iteration.
Verdict: Useful as implementation reference code, but the architecture is over-engineered. Borrow code snippets, use a simpler structure.
9. qwen-3.6-plus: preflight → ask_region/currency → observe_region/currency → decompose → plan → route_finish_check
Decomposition: 6+ nodes. Interesting approach with dedicated observe nodes.
Strengths:
- Dedicated observe_region/observe_currency nodes remove special-case branching from observe_user_node — a real simplification.
- 5-phase migration is the most granular of any proposal.
- route_finish_check as a validation step before END is a good safety net.
Weaknesses:
- Moving post-LLM redirects into route_after_plan (the routing function) violates the "side-effect-free edges" principle. Redirects mutate decision — that's state mutation, which belongs in a node.
- Dedicated observe nodes for region/currency mean 4 new nodes (ask_region, ask_currency, observe_region, observe_currency) for what is structurally one interrupt-and-parse pattern.
- Doesn't address how _redirect_derived_direct_decision and friends fit into the new structure if they're in a routing function.
Verdict: Interesting ideas buried in a problematic structure. The dedicated observe nodes are worth considering; the routing-function redirects are not.
10. mimo-2.5-pro: guards → acquire → decide
Decomposition: 3 nodes. Minimal approach.
Strengths:
- Smallest diff. Lowest coordination cost.
- Phase 1 (extract without changing topology) is the safest possible first step.
- Open questions section is honest about unresolved design choices.
Weaknesses:
- decide node still handles LLM call + 3 redirectors + loop/cap detection — ~80 lines with 4+ concerns. The core problem is not solved.
- plan_router as a separate routing node is unnecessary — should be a routing function.
- Doesn't address post-LLM decomposition generation placement.
- Among the least detailed analyses of the proposals.
Verdict: Insufficient decomposition. The decide node is still a mini-monolith. Better than nothing but not a real solution.
11. gemini-3.1-pro: prepare_state → evaluate_rules → llm_plan
Decomposition: 3 nodes. Simplest proposal.
Strengths:
- Short and readable. Can be understood in 5 minutes.
- "True Graph Visibility" and "Reduced Latency/Cost Risk" benefits are correctly identified.
Weaknesses:
- Missing post-LLM rewriting entirely. The proposal doesn't address where _redirect_derived_direct_decision, search/ask caps, or _adjust_calculation_decision go.
- evaluate_rules still handles 5+ distinct concerns (max_iters, region, currency, calculator, acquisition, auto-finish) — only marginally better than the current plan_node.
- No migration plan beyond "extract three functions."
- No state schema changes, no routing function design, no test strategy.
- Least thorough analysis of any proposal.
Verdict: Insufficient. Reads like a first-draft sketch rather than a real proposal. Missing too many details to be implementable.
Recommendation
Primary recommendation: fable-5 with targeted borrowings
Adopt fable-5's 5-node structure (tick → prepare → select → decide → guard) as the base architecture, with these specific borrowings from other proposals:
- From deepseek-4-pro: The recipes in state optimization (if FieldAcquisition serialization is proven safe). This avoids redundant build_dynamic_recipes() calls across nodes.
- From gpt-5.4: The anti-pattern guidance. Pin these to the team's working agreement:
  - No state mutation in edge functions
  - Only tick increments iterations
  - One decision_policy node, not a chain of per-rewrite nodes
  - ask_user remains the single interrupt node
- From opus: The gate/corrector mental model. Even though we're not implementing 16 nodes, thinking in terms of "gates emit decisions, correctors mutate decisions" helps keep node responsibilities clean.
- From gpt-5.5: The test migration strategy. Map each existing test_planner_decisions.py assertion to its new target node before starting implementation.
- From fable-5 itself: Bump the planner thread namespace (planner:v2) to handle in-flight checkpoint compatibility. This is the only proposal that addresses this real-world concern.
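The namespace bump can be sketched as a thread-id helper. The id layout (namespace, colon, session id) and the old "planner" prefix are assumptions for illustration; only the planner:v2 name comes from the fable-5 proposal. The config dict follows the usual LangGraph convention of passing thread_id under "configurable".

```python
OLD_NAMESPACE = "planner"      # assumed previous prefix, for illustration
NEW_NAMESPACE = "planner:v2"   # namespace proposed by fable-5


def planner_thread_id(session_id: str, namespace: str = NEW_NAMESPACE) -> str:
    """Derive the planner's checkpoint thread id from the session id."""
    return f"{namespace}:{session_id}"


def planner_config(session_id: str) -> dict:
    """Run config that points the checkpointer at the versioned thread."""
    return {"configurable": {"thread_id": planner_thread_id(session_id)}}
```

Because the new id never matches an id written under the old prefix, a pending planner thread finds no checkpoint and starts from a clean bootstrap instead of resuming into a graph whose node names no longer exist.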
Why not more nodes?
The opus and qwen-3.7-max proposals show that 10-16 nodes is technically possible but practically counterproductive:
- Checkpoint overhead: Each node boundary is a PostgresSaver write. At chat pace this is fine for 5 nodes, but 10+ nodes per iteration adds measurable latency.
- Graph readability: The Mermaid diagram should be a useful map, not a wall of boxes. 5 nodes fits on a screen; 16 does not.
- Coordination cost: Each new node is a test target, a logging scope, and a potential failure point. The marginal benefit of splitting guard into g_region + g_currency + g_calc_caps does not justify the coordination overhead.
Why not fewer nodes?
The glm-5.1 and mimo-2.5-pro proposals show that 2-3 nodes leaves the core problem unsolved:
- acquire_or_plan at ~200 LOC is still a mini-monolith.
- decide at ~80 lines with 4+ concerns is still hard to test.
- The Mermaid diagram still lies about where routing happens.
Implementation order
Follow gpt-5.4's staged approach with fable-5's node structure:
1. Extract helpers (no graph change). Pull tick, prepare, select, decide, guard bodies out of plan_node as module-level functions. plan_node becomes a thin sequential composition. All existing tests pass.
2. Add state fields. Add decision_origin: Literal["deterministic", "llm"] | None and llm_failed: bool to State and PlannerState. Old checkpoints deserialize with defaults.
3. Rewire graph. Register the 5 nodes, add conditional edges, point action-node returns at tick. Bump planner thread namespace. Delete plan_node and route_after_plan.
4. Update tests. Move assertions from test_planner_decisions.py to per-node test files. Keep integration tests for run_planner_step().
5. Update docs. Regenerate current-graph.md from the new topology.
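The first two steps can be sketched together: stage functions that each return a partial update, a plan_node that composes them sequentially, and defaults for the two new fields so state loaded from an older checkpoint still works. The stage bodies, the pending_task field, and the reset performed in tick are placeholders chosen for illustration, not the real planner logic.

```python
from typing import Literal, Optional, TypedDict


class PlannerState(TypedDict, total=False):
    iterations: int
    pending_task: Optional[str]  # hypothetical field, for illustration
    decision: Optional[dict]
    decision_origin: Optional[Literal["deterministic", "llm"]]
    llm_failed: bool


# Defaults applied when an older checkpoint lacks the new fields.
NEW_FIELD_DEFAULTS = {"decision_origin": None, "llm_failed": False}


def tick(state: PlannerState) -> dict:
    # Assumed: tick also clears the previous iteration's decision fields.
    return {
        "iterations": state.get("iterations", 0) + 1,
        "decision": None,
        "decision_origin": None,
        "llm_failed": False,
    }


def prepare(state: PlannerState) -> dict:
    return {}  # placeholder for schema and recipe preparation


def select(state: PlannerState) -> dict:
    task = state.get("pending_task")
    if task:
        return {"decision": {"action": task}, "decision_origin": "deterministic"}
    return {}


def decide(state: PlannerState) -> dict:
    if state.get("decision") is not None:
        return {}  # select already produced a decision
    # Placeholder for the LLM call.
    return {"decision": {"action": "finish"}, "decision_origin": "llm"}


def guard(state: PlannerState) -> dict:
    return {}  # placeholder for post-decision rewrites


def plan_node(state: PlannerState) -> dict:
    """Thin sequential composition of the five stages (step 1)."""
    merged = {**NEW_FIELD_DEFAULTS, **state}
    update: dict = {}
    for stage in (tick, prepare, select, decide, guard):
        delta = stage(merged)
        merged.update(delta)
        update.update(delta)
    return update
```

Once the stages exist in this form, step 3 is mostly a matter of registering the same functions as graph nodes and replacing the `decision is not None` check in decide with a conditional edge.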
What to defer
- decompose as a loop node (from fable-5 follow-up). The two remaining in-stage LLM decomposition calls can stay inline initially. Extract later if they cause observability issues.
- Command(goto=...) returns (from fable-5 follow-up). Keep routing functions for now; they make the topology declaratively visible in _build_state_graph.
- recipes in state (from deepseek-4-pro). Only if FieldAcquisition serialization is proven safe. Otherwise rebuild recipes per-node (cheap).
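One way to "prove" serialization before caching recipes in state is a round-trip test. The sketch below uses a hypothetical FieldAcquisition shape (the real class's fields are not given here) and uses JSON as a stand-in for whatever serializer the checkpointer actually applies, so it demonstrates the shape of the check rather than a guarantee about PostgresSaver.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FieldAcquisition:
    # Hypothetical fields; the real class is assumed to differ.
    field: str
    source: str
    depends_on: tuple = ()


def to_state(recipes: list) -> list:
    """Convert recipes to plain dicts before placing them in state."""
    return [asdict(r) for r in recipes]


def from_state(raw: list) -> list:
    """Rebuild recipe objects from the plain-dict form."""
    return [
        FieldAcquisition(
            field=r["field"],
            source=r["source"],
            depends_on=tuple(r["depends_on"]),  # JSON turns tuples into lists
        )
        for r in raw
    ]


def survives_round_trip(recipes: list) -> bool:
    encoded = json.dumps(to_state(recipes))
    return from_state(json.loads(encoded)) == list(recipes)
```

If a check like this fails or needs special cases, that is a signal to keep rebuilding recipes per node rather than storing them.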
Summary Table
| Rank | Proposal | Nodes | Key Insight | Verdict |
|---|---|---|---|---|
| 1 | fable-5 | 5 | decision_origin + checkpoint compat | Adopt as base |
| 2 | deepseek-4-pro | 4 | recipes in state + alternatives analysis | Borrow recipes optimization |
| 3 | gpt-5.4 | 6 | Anti-pattern guidance + staged migration | Borrow migration discipline |
| 4 | opus | 16+ | Gate/corrector taxonomy | Borrow mental model only |
| 5 | gpt-5.5 | 8 | Test migration strategy | Borrow test plan |
| 6 | kimi-2.6 | 8 | Decision matrix appendix | Borrow quick-reference format |
| 7 | glm-5.1 | 2 | Conservative first step | Good Phase 1, insufficient end state |
| 8 | qwen-3.7-max | 10 | Full implementation code | Borrow code snippets |
| 9 | qwen-3.6-plus | 6+ | Dedicated observe nodes | Interesting but flawed routing |
| 10 | mimo-2.5-pro | 3 | Minimal diff | Insufficient decomposition |
| 11 | gemini-3.1-pro | 3 | Shortest proposal | Insufficient analysis |