Planner Graph Refactor: Proposal Evaluation

Author: Sisyphus (qwen-3.7-max)
Date: 2026-06-11
Scope: Evaluation of 11 proposals in docs/planner-graph-ref/proposals/

Evaluation Criteria

Each proposal is assessed on eight axes:

| Axis | Weight | Description |
|---|---|---|
| Decomposition quality | High | Are node boundaries natural? Does each node have one clear job? |
| Completeness | High | Does it cover all 12 concerns currently inside plan_node? |
| Migration safety | High | Is the path incremental? Can each step ship green? |
| State management | Medium | Are state changes minimal, safe, and serializer-compatible? |
| Testability gain | Medium | Does it actually improve test isolation per concern? |
| Graph honesty | Medium | Does the Mermaid diagram become a truthful map of control flow? |
| LangGraph alignment | Medium | Does it follow LangGraph conventions (side-effect-free edges, node-owned mutations)? |
| Practical feasibility | Medium | Is the node count manageable? Is checkpoint overhead acceptable? |

Ranked List

1. fable-5 — tick → prepare → select → decide → guard

Decomposition: 5 nodes. The select/decide split cleanly separates deterministic from LLM-originated decisions, which is the single most valuable architectural seam in the entire problem space.

Strengths:

  • The decision_origin field is the cleanest abstraction any proposal introduces. It lets guard apply different rewrite subsets based on where the decision came from, exactly matching the current code's implicit branching.
  • Best checkpoint reasoning of all proposals. It explicitly addresses that one plan superstep becomes up to 4 checkpoint writes, and explains why that is acceptable for a chat-paced agent.
  • Only proposal that addresses in-flight checkpoint compatibility: it proposes bumping the planner thread namespace (planner:v2) so old checkpoints degrade to a clean re-bootstrap. Every other proposal silently breaks resume for pending planner threads.
  • llm_failed as a state field (instead of a local variable) is the right call. It lets guard skip _adjust_calculation_decision after structured-output failure, preserving the current infinite-loop protection.
  • The follow-up section defers the decompose loop node and Command(goto=...) returns with clear reasoning, which shows maturity about what not to do now.

Weaknesses:

  • select still contains one LLM call (blocked-path decomposition, lines 960-968). The proposal acknowledges this and defers it, but it means select is not purely deterministic.
  • Test churn is acknowledged as "the bulk of the diff", but there is no concrete test migration strategy beyond "target the new stage nodes."

Verdict: Best overall balance of decomposition depth, state management, and migration realism. The decision_origin abstraction alone justifies adoption.
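
How decision_origin and llm_failed might steer guard can be sketched as follows. This is an illustration only, not fable-5's code: the rewrite functions are stubs, the dict-shaped state is a simplification, and the mapping from origin to rewrite subset is an assumption made for the example. The one behavior taken from the evaluation above is that the calculation adjustment is skipped after a structured-output failure.

```python
from typing import Callable, Literal, Optional, TypedDict


class Decision(TypedDict):
    action: str


class GuardState(TypedDict, total=False):
    decision: Decision
    decision_origin: Optional[Literal["deterministic", "llm"]]
    llm_failed: bool


Rewrite = Callable[[Decision], Decision]


def redirect_derived(decision: Decision) -> Decision:
    # Stub for a post-LLM redirect; returns the decision unchanged.
    return decision


def enforce_caps(decision: Decision) -> Decision:
    # Stub for search/ask cap enforcement; returns the decision unchanged.
    return decision


def adjust_calculation(decision: Decision) -> Decision:
    # Stub: pretend a calculation must run before the planner may finish.
    if decision["action"] == "finish":
        return {"action": "calculate"}
    return decision


def rewrites_for(state: GuardState) -> list[Rewrite]:
    """Pick the rewrite subset from where the decision came from."""
    if state.get("decision_origin") == "llm":
        chain: list[Rewrite] = [redirect_derived, enforce_caps]
        # Skip the calculation adjustment after a structured-output failure.
        if not state.get("llm_failed", False):
            chain.append(adjust_calculation)
        return chain
    # Assumption: deterministic decisions only need the calculation adjustment.
    return [adjust_calculation]


def guard(state: GuardState) -> dict:
    """Node body: apply the selected rewrites in order, return a state update."""
    decision = state["decision"]
    for rewrite in rewrites_for(state):
        decision = rewrite(decision)
    return {"decision": decision}
```

Because the subset is chosen by a plain function over state, each branch can be unit-tested without running the LLM or the graph.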


2. deepseek-4-pro — guard → prepare → plan → adjust

Decomposition: 4 nodes. The most intuitive naming of any proposal — guard/prepare/plan/adjust reads like a sentence.

Strengths:

  • Cleanest migration plan: 4 phases, each mechanically verifiable, tests pass at each step.
  • recipes in state is a reasonable optimization: recipes are currently recomputed in plan_node and observe_user_node, so caching avoids redundant work.
  • The plan → plan self-loop for decomposition regeneration is elegant. It avoids adding a separate decompose node while keeping the LLM re-prompting behavior explicit in the graph.
  • Best "Alternatives Considered" section. Explicitly rejects 16-node decomposition, calculator subgraph, and minimal-change options with clear reasoning.
  • Risk assessment is honest: low risk for structural decomposition, medium risk for recipes synchronization.

Weaknesses:

  • recipes in state introduces a new serialization concern: FieldAcquisition objects must be serializer-compatible. The proposal mentions this but doesn't fully resolve it.
  • The adjust node still handles 5+ rewrites sequentially. Less granular than fable-5's guard with decision_origin-based subsetting.
  • Doesn't address in-flight checkpoint compatibility.

Verdict: Close second. Simpler than fable-5 (4 vs 5 nodes) but loses the decision_origin insight. Best choice if the team prefers minimal node count.
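
The plan → plan self-loop praised above could be expressed as a routing function along these lines. The field names needs_decomposition and decompose_attempts and the attempt bound are hypothetical; the bound is an addition for this sketch (a self-loop needs some exit condition), not something the proposal is known to specify.

```python
from typing import TypedDict


class PlanLoopState(TypedDict, total=False):
    needs_decomposition: bool  # hypothetical flag set by the plan node
    decompose_attempts: int    # hypothetical counter, incremented by the plan node


MAX_DECOMPOSE_ATTEMPTS = 2  # hypothetical bound so the self-loop cannot spin forever


def route_after_plan(state: PlanLoopState) -> str:
    """Edge function: loop back into plan for regeneration, otherwise move on."""
    wants_retry = state.get("needs_decomposition", False)
    attempts = state.get("decompose_attempts", 0)
    if wants_retry and attempts < MAX_DECOMPOSE_ATTEMPTS:
        return "plan"
    return "adjust"
```

The function only reads state and returns a label, so the loop stays visible in the graph without hiding any mutation in the edge.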


3. gpt-5.4 — tick → bootstrap_gate → prepare_context → acquisition_gate → llm_plan → decision_policy

Decomposition: 6 nodes. The strongest architectural guidance of any proposal.

Strengths:

  • Best anti-pattern section. Explicitly warns against state mutation in edge functions, iteration increment in multiple gates, turning every policy rule into a node, and moving interrupts into policy gates. These are the exact mistakes a team would make without guidance.
  • The 6-stage migration order is well-sequenced: tick first (cleanest seam), bootstrap_gate second (preserves interrupt behavior), then progressively deeper extractions.
  • The "What should not change" section is uniquely valuable. It explicitly preserves ask_user as the only interrupt node, planner thread namespacing, the outer run_planner_step() contract, and the one-decision-per-iteration invariant.
  • The principle "deterministic orchestration belongs in graph phases; business logic stays in helpers; LLM planning stays small" is the best one-sentence design rule across all proposals.

Weaknesses:

  • 6 nodes is at the upper edge of what is practical. bootstrap_gate as a separate node from tick adds a graph hop for what is essentially two if checks.
  • No state schema changes are proposed, which means routing relies heavily on reading decision from state. This works but is less explicit than fable-5's decision_origin.
  • Less detailed on post-LLM rewriting placement than fable-5 or deepseek-4-pro.

Verdict: Best for teams that need architectural guardrails. The anti-pattern guidance alone prevents months of regret. Slightly over-decomposed compared to fable-5.
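
Two of the anti-patterns above (state mutation in edge functions, iteration increments in multiple places) can be illustrated with a node/edge pair. This is a sketch under assumptions: the state is a plain dict, and the "finish" label and default cap of 10 are placeholders rather than values from any proposal.

```python
from typing import TypedDict


class LoopState(TypedDict, total=False):
    iterations: int
    max_iters: int


def tick(state: LoopState) -> dict:
    """Node: the only place that increments the iteration counter."""
    return {"iterations": state.get("iterations", 0) + 1}


def route_after_tick(state: LoopState) -> str:
    """Edge: reads state and returns a label; it never writes to state."""
    if state.get("iterations", 0) > state.get("max_iters", 10):
        return "finish"  # placeholder label for the termination branch
    return "bootstrap_gate"


# Usage: the node returns an update, the caller merges it, the edge only inspects.
state: LoopState = {"iterations": 0, "max_iters": 1}
state = {**state, **tick(state)}
label = route_after_tick(state)
```

Keeping the increment in one node means a replayed or re-evaluated edge cannot change the iteration count.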


4. opus — g_iter_cap → g_region → g_currency → enrich_schema → g_calc_caps → acquire → plan_llm → c_target_decompose → c_redirect_derived → c_redirect_web → c_search_cap → c_ask_cap → c_calc_adjust → dispatch

Decomposition: 16+ nodes. The most thorough analysis, but impractical as-is.

Strengths:

  • Best problem analysis. The 17-row responsibility table with line numbers and kind classification (guard/pre-flight/enrichment/corrector) is the definitive map of plan_node.
  • The gate/corrector taxonomy (g_* / c_*) is the right mental model. Gates emit decisions and short-circuit; correctors mutate existing decisions sequentially. This distinction is implicit in other proposals but explicit here.
  • The state ownership table (which node writes which fields) eliminates the 8 separate if schema_changed: out["schema"] = schema_dict blocks, a real code quality win.
  • The 3-step migration is pragmatic: extract helpers first (no graph change), hoist gates second, split correctors third. Each step ships green.
  • Honest about checkpoint overhead: "up to 10 node-to-node hops per iteration", with mitigation via add_edge for the statically-known corrector chain.

Weaknesses:

  • 16+ nodes is too many. The Mermaid diagram is a wall of boxes. LangGraph visualization becomes noise rather than signal.
  • Separate emit_region/emit_currency nodes (splitting predicate from emission) are over-engineering: the predicate is 3 lines, the emission is 5 lines.
  • g_region and g_currency as separate nodes (vs. one bootstrap_gate) add two graph hops for what is structurally identical logic.
  • The open question about whether enrich_schema should always run is left unresolved.

Verdict: Best reference document, worst implementation target. Use the gate/corrector taxonomy as a mental model, implement fable-5's 5-node structure. The 3-step migration strategy is worth borrowing.
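
The gate/corrector distinction can be kept as a mental model inside a small number of nodes. The sketch below shows the two shapes as plain helpers: gates may emit a decision and short-circuit, correctors rewrite an existing decision in sequence. The g_* and c_* names follow opus's naming, but their bodies, the state fields, and the cap of 3 searches are invented for illustration.

```python
from typing import Callable, Optional, Sequence

Decision = dict
State = dict
Gate = Callable[[State], Optional[Decision]]
Corrector = Callable[[State, Decision], Decision]


def run_gates(state: State, gates: Sequence[Gate]) -> Optional[Decision]:
    """Return the first decision any gate emits; None means no gate fired."""
    for gate in gates:
        decision = gate(state)
        if decision is not None:
            return decision  # short-circuit: later gates never run
    return None


def run_correctors(state: State, decision: Decision,
                   correctors: Sequence[Corrector]) -> Decision:
    """Thread one decision through every corrector, in order."""
    for corrector in correctors:
        decision = corrector(state, decision)
    return decision


# Toy gates and a toy corrector (bodies are assumptions, not the real rules).
def g_iter_cap(state: State) -> Optional[Decision]:
    if state.get("iterations", 0) >= state.get("max_iters", 10):
        return {"action": "finish", "reason": "iteration cap"}
    return None


def g_region(state: State) -> Optional[Decision]:
    if not state.get("region"):
        return {"action": "ask_user", "reason": "region missing"}
    return None


def c_search_cap(state: State, decision: Decision) -> Decision:
    if decision["action"] == "search" and state.get("searches", 0) >= 3:
        return {"action": "ask_user", "reason": "search cap reached"}
    return decision
```

A 5-node graph can call run_gates inside an early node and run_correctors inside guard, which keeps the taxonomy without paying a graph hop per rule.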


5. gpt-5.5 — enter_iteration → prepare_schema → require_region/currency → acquisition_gate → plan → normalize_decision → retry_gate → maybe_calculate

Decomposition: 8 nodes. Most granular "practical" proposal.

Strengths:

  • Most detailed test plan. Explicitly maps existing test assertions to new node targets: calculator-before-finish → maybe_calculate, search cap → retry_gate, etc.
  • normalize_decision / retry_gate / maybe_calculate as separate post-LLM nodes makes each policy correction independently testable.
  • A comprehensive "Non-goals" section prevents scope creep.
  • State change guidance is precise: llm_failed or decision_origin recommended, prepared_recipes explicitly rejected unless serialization is proven.

Weaknesses:

  • 8 nodes is over-decomposed for the actual complexity. require_region and require_currency as separate nodes are graph noise: they are structurally identical 5-line checks.
  • maybe_calculate as a separate node from retry_gate splits what is conceptually one "decision policy" pass into two graph hops.
  • The migration plan has 7 node-introduction steps, which is a high coordination cost.

Verdict: Good test plan, over-engineered graph. Borrow the test migration strategy, use a simpler node structure.


6. kimi-2.6 — tick → ask_region/ask_currency → prepare → calc_gate → acquire → route_direct → decide → enforce_policy

Decomposition: 8 nodes. Clean naming, good appendix.

Strengths:

  • The decision matrix appendix (what goes where) is the best quick-reference of any proposal.
  • route_direct as a tiny adapter node (acquisition task → PlannerDecision) is a nice touch. It bridges deterministic and LLM-originated decisions into the same post-processing pipeline.
  • The "Each node should do ONE thing" principle is stated clearly and applied consistently.
  • The migration path (extract helpers → register as nodes → delete plan_node) is the standard 3-phase approach, well-explained.

Weaknesses:

  • ask_region and ask_currency as separate nodes (vs. routing through the existing ask_user node) duplicate interrupt infrastructure.
  • calc_gate as a separate node from acquire splits the calculator lifecycle into two hops when they are tightly coupled.
  • Less detailed on state management than fable-5 or gpt-5.5.

Verdict: Solid middle-tier proposal. The decision matrix is worth borrowing. Graph structure is less optimal than fable-5 or deepseek-4-pro.


7. glm-5.1 — pre_check → acquire_or_plan

Decomposition: 2 nodes. Most conservative approach.

Strengths:

  • Lowest migration risk. Only one new node (pre_check) plus a rename (plan → acquire_or_plan).
  • The _build_plan_output helper, which collapses the 9-way conditional-return pattern, is a good standalone improvement.
  • Option A (keep post-LLM rewriting inline) vs Option B (extract validate_decision) gives the team a choice based on risk tolerance.
  • The risk table is thorough and honest.

Weaknesses:

  • Doesn't go far enough. acquire_or_plan is still ~200 LOC with acquisition + LLM + rewriting fused. The core problem (hidden control flow) is only half-solved.
  • _pre_check_route as a state field is an anti-pattern: routing should live in edge functions, not state.
  • No decision_origin equivalent means acquire_or_plan still needs to know whether it produced a deterministic or LLM decision for correct rewriting.

Verdict: Good first step, insufficient end state. Could be Phase 1 of a deeper refactor but shouldn't be the final architecture.
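
The idea behind an output-building helper can be sketched independently of which architecture is chosen. The signature below is a guess for illustration, not glm-5.1's actual _build_plan_output; only the if schema_changed: out["schema"] = schema_dict pattern comes from the proposals discussed here.

```python
from typing import Any, Optional


def build_plan_output(
    decision: dict,
    *,
    schema_dict: Optional[dict] = None,
    schema_changed: bool = False,
    **extra: Any,
) -> dict:
    """Assemble a node's state update in one place.

    Hypothetical signature: replaces repeated
    `if schema_changed: out["schema"] = schema_dict` blocks at each return site.
    """
    out: dict = {"decision": decision}
    if schema_changed:
        out["schema"] = schema_dict
    # Only include optional fields the caller actually set.
    out.update({key: value for key, value in extra.items() if value is not None})
    return out
```

Each early return then becomes a single call, for example build_plan_output(decision, schema_dict=schema_dict, schema_changed=True), instead of a hand-built dict.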


8. qwen-3.7-max — check_termination → check_region_currency → compose_schema → check_calculator → acquisition_routing → check_completion → llm_decide → post_process → enforce_caps → route_decision

Decomposition: 10 nodes. Most implementation-ready with full code examples.

Strengths:

  • Full code for every node, copy-paste ready. Useful as an implementation reference regardless of which architecture is chosen.
  • Graph construction code is complete and runnable.
  • Transient _route / _llm_failed state fields are explicitly marked as non-persisted.

Weaknesses:

  • 10 nodes is excessive. check_termination and check_region_currency should be one node. check_calculator and check_completion should be one node.
  • _route as a transient state field is an anti-pattern. LangGraph conditional edges should read state and return route labels; routing hints in state defeat the purpose of declarative edges.
  • route_decision as a separate node (vs. a routing function) adds a graph hop for what is effectively return decision.action.
  • No discussion of checkpoint overhead with 10 nodes per iteration.

Verdict: Useful as implementation reference code, but the architecture is over-engineered. Borrow code snippets, use a simpler structure.


9. qwen-3.6-plus — preflight → ask_region/currency → observe_region/currency → decompose → plan → route_finish_check

Decomposition: 6+ nodes. Interesting approach with dedicated observe nodes.

Strengths:

  • Dedicated observe_region/observe_currency nodes remove special-case branching from observe_user_node, a real simplification.
  • The 5-phase migration is the most granular of any proposal.
  • route_finish_check as a validation step before END is a good safety net.

Weaknesses:

  • Moving post-LLM redirects into route_after_plan (the routing function) violates the "side-effect-free edges" principle. Redirects mutate decision, and state mutation belongs in a node.
  • Dedicated observe nodes for region/currency mean 4 new nodes (ask_region, ask_currency, observe_region, observe_currency) for what is structurally one interrupt-and-parse pattern.
  • Doesn't address how _redirect_derived_direct_decision and friends fit into the new structure if they live in a routing function.

Verdict: Interesting ideas buried in a problematic structure. The dedicated observe nodes are worth considering; the routing-function redirects are not.


10. mimo-2.5-pro — guards → acquire → decide

Decomposition: 3 nodes. Minimal approach.

Strengths:

  • Smallest diff. Lowest coordination cost.
  • Phase 1 (extract without changing topology) is the safest possible first step.
  • The open questions section is honest about unresolved design choices.

Weaknesses:

  • The decide node still handles the LLM call, 3 redirectors, and loop/cap detection: ~80 lines with 4+ concerns. The core problem is not solved.
  • plan_router as a separate routing node is unnecessary; it should be a routing function.
  • Doesn't address post-LLM decomposition generation placement.
  • Least detailed analysis of any proposal.

Verdict: Insufficient decomposition. The decide node is still a mini-monolith. Better than nothing but not a real solution.


11. gemini-3.1-pro — prepare_state → evaluate_rules → llm_plan

Decomposition: 3 nodes. Simplest proposal.

Strengths:

  • Short and readable. Can be understood in 5 minutes.
  • "True Graph Visibility" and "Reduced Latency/Cost Risk" benefits are correctly identified.

Weaknesses:

  • Missing post-LLM rewriting entirely. The proposal doesn't address where _redirect_derived_direct_decision, search/ask caps, or _adjust_calculation_decision go.
  • evaluate_rules still handles 5+ distinct concerns (max_iters, region, currency, calculator, acquisition, auto-finish), which is only marginally better than the current plan_node.
  • No migration plan beyond "extract three functions."
  • No state schema changes, no routing function design, no test strategy.
  • Least thorough analysis of any proposal.

Verdict: Insufficient. Reads like a first-draft sketch rather than a real proposal. Missing too many details to be implementable.


Recommendation

Primary recommendation: fable-5 with targeted borrowings

Adopt fable-5's 5-node structure as the base architecture:

tick → prepare → select → decide → guard

With these specific borrowings from other proposals:

  1. From deepseek-4-pro: The recipes in state optimization (if FieldAcquisition serialization is proven safe). This avoids redundant build_dynamic_recipes() calls across nodes.

  2. From gpt-5.4: The anti-pattern guidance. Pin these to the team's working agreement:

       • No state mutation in edge functions
       • Only tick increments iterations
       • One decision_policy node, not a chain of per-rewrite nodes
       • ask_user remains the single interrupt node

  3. From opus: The gate/corrector mental model. Even though we're not implementing 16 nodes, thinking in terms of "gates emit decisions, correctors mutate decisions" helps keep node responsibilities clean.

  4. From gpt-5.5: The test migration strategy. Map each existing test_planner_decisions.py assertion to its new target node before starting implementation.

  5. From fable-5 itself: Bump the planner thread namespace (planner:v2) to handle in-flight checkpoint compatibility. This is the only proposal that addresses this real-world concern.
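
The namespace bump can be pictured as a change to how planner thread IDs are derived. The ID format and helper names below are hypothetical (the real scheme is not shown in this evaluation); the only LangGraph-specific detail relied on is that a checkpointed run is addressed via config["configurable"]["thread_id"].

```python
PLANNER_NAMESPACE = "planner:v2"  # bumped from an assumed earlier "planner:v1"


def planner_thread_id(conversation_id: str, namespace: str = PLANNER_NAMESPACE) -> str:
    """Build the checkpoint thread ID for a conversation (hypothetical format)."""
    return f"{namespace}:{conversation_id}"


def planner_config(conversation_id: str) -> dict:
    """Config dict in the shape LangGraph expects for checkpointed runs."""
    return {"configurable": {"thread_id": planner_thread_id(conversation_id)}}
```

Under this scheme, checkpoints written under the old namespace are never looked up again, so a pending thread starts from a clean bootstrap instead of resuming into a node that no longer exists.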

Why not more nodes?

The opus and qwen-3.7-max proposals show that 10-16 nodes is technically possible but practically counterproductive:

  • Checkpoint overhead: Each node boundary is a PostgresSaver write. At chat pace this is fine for 5 nodes, but 10+ nodes per iteration adds measurable latency.
  • Graph readability: The Mermaid diagram should be a useful map, not a wall of boxes. 5 nodes fits on a screen; 16 does not.
  • Coordination cost: Each new node is a test target, a logging scope, and a potential failure point. The marginal benefit of splitting guard into g_region + g_currency + g_calc_caps does not justify the coordination overhead.

Why not fewer nodes?

The glm-5.1 and mimo-2.5-pro proposals show that 2-3 nodes leaves the core problem unsolved:

  • acquire_or_plan at ~200 LOC is still a mini-monolith.
  • decide at ~80 lines with 4+ concerns is still hard to test.
  • The Mermaid diagram still lies about where routing happens.

Implementation order

Follow gpt-5.4's staged approach with fable-5's node structure:

  1. Extract helpers (no graph change). Pull tick, prepare, select, decide, guard bodies out of plan_node as module-level functions. plan_node becomes a thin sequential composition. All existing tests pass.

  2. Add state fields. Add decision_origin: Literal["deterministic", "llm"] | None and llm_failed: bool to State and PlannerState. Old checkpoints deserialize with defaults.

  3. Rewire graph. Register the 5 nodes, add conditional edges, point action-node returns at tick. Bump planner thread namespace. Delete plan_node and route_after_plan.

  4. Update tests. Move assertions from test_planner_decisions.py to per-node test files. Keep integration tests for run_planner_step().

  5. Update docs. Regenerate current-graph.md from the new topology.
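
Steps 1 and 2 above can be sketched together: the stage functions become module-level callables that return partial state updates, and plan_node shrinks to a loop over them. Everything below is a stand-in under stated assumptions. The stage bodies are stubs, pending_task is a hypothetical field, and the state is modelled as a TypedDict with total=False so that missing keys behave like defaults.

```python
from typing import Literal, Optional, TypedDict


class PlannerState(TypedDict, total=False):
    iterations: int
    pending_task: Optional[str]  # hypothetical field used only by the stub below
    decision: Optional[dict]
    decision_origin: Optional[Literal["deterministic", "llm"]]  # step 2
    llm_failed: bool                                            # step 2


def tick(state: PlannerState) -> dict:
    return {"iterations": state.get("iterations", 0) + 1}


def prepare(state: PlannerState) -> dict:
    return {}  # stub: schema enrichment would go here


def select(state: PlannerState) -> dict:
    # Stub: emit a deterministic decision when an acquisition task is pending.
    task = state.get("pending_task")
    if task:
        return {"decision": {"action": task}, "decision_origin": "deterministic"}
    return {}


def decide(state: PlannerState) -> dict:
    if state.get("decision") is not None:
        return {}  # select already decided; skip the LLM
    # Stub standing in for the LLM call.
    return {"decision": {"action": "finish"}, "decision_origin": "llm", "llm_failed": False}


def guard(state: PlannerState) -> dict:
    return {}  # stub: post-decision rewrites would go here


STAGES = (tick, prepare, select, decide, guard)


def plan_node(state: PlannerState) -> dict:
    """Step 1: a thin sequential composition with no graph change yet."""
    working = dict(state)
    out: dict = {}
    for stage in STAGES:
        update = stage(working)
        working.update(update)  # later stages see earlier updates
        out.update(update)      # the node returns the accumulated update
    return out
```

Because every stage already takes state and returns an update, step 3 can register the same functions as graph nodes without rewriting their bodies.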

What to defer

  • decompose as a loop node (from fable-5 follow-up). The two remaining in-stage LLM decomposition calls can stay inline initially. Extract later if they cause observability issues.
  • Command(goto=...) returns (from fable-5 follow-up). Keep routing functions for now — they make the topology declaratively visible in _build_state_graph.
  • recipes in state (from deepseek-4-pro). Only if FieldAcquisition serialization is proven safe. Otherwise rebuild recipes per-node (cheap).
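
One way to gather evidence on the serialization question is a round-trip test. The sketch below assumes a hypothetical FieldAcquisition shape and uses JSON as a stand-in; a real check would need to go through whatever serializer the checkpointer actually uses.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FieldAcquisition:
    # Hypothetical shape; the real class is not shown in this evaluation.
    name: str
    method: str
    depends_on: list = field(default_factory=list)


def round_trips(recipe: FieldAcquisition) -> bool:
    """True if the recipe survives serialize -> deserialize unchanged."""
    payload = json.dumps(asdict(recipe))
    restored = FieldAcquisition(**json.loads(payload))
    return restored == recipe
```

A check like this tends to surface lossy types early. For example, a tuple-typed field would come back from JSON as a list and compare unequal, which is the kind of mismatch that should be ruled out before recipes move into state.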

Summary Table

| Rank | Proposal | Nodes | Key Insight | Verdict |
|---|---|---|---|---|
| 1 | fable-5 | 5 | decision_origin + checkpoint compat | Adopt as base |
| 2 | deepseek-4-pro | 4 | recipes in state + alternatives analysis | Borrow recipes optimization |
| 3 | gpt-5.4 | 6 | Anti-pattern guidance + staged migration | Borrow migration discipline |
| 4 | opus | 16+ | Gate/corrector taxonomy | Borrow mental model only |
| 5 | gpt-5.5 | 8 | Test migration strategy | Borrow test plan |
| 6 | kimi-2.6 | 8 | Decision matrix appendix | Borrow quick-reference format |
| 7 | glm-5.1 | 2 | Conservative first step | Good Phase 1, insufficient end state |
| 8 | qwen-3.7-max | 10 | Full implementation code | Borrow code snippets |
| 9 | qwen-3.6-plus | 6+ | Dedicated observe nodes | Interesting but flawed routing |
| 10 | mimo-2.5-pro | 3 | Minimal diff | Insufficient decomposition |
| 11 | gemini-3.1-pro | 3 | Shortest proposal | Insufficient analysis |