Planner Graph Refactor Proposals — Ranked Evaluation

Context: 11 proposals from different LLM models to decompose the monolithic plan_node() (~350 lines, src/venturescope/planner/agent.py:846-1192) into smaller graph-level nodes.

Current state: plan_node() is a god node containing 10+ interleaved concerns: iteration guards, region/currency bootstrap, decomposition generation, schema composition, calculator lifecycle, acquisition task routing, auto-finish logic, LLM decision, post-LLM redirects, and cap enforcement. The graph diagram shows START → plan → action, but the real control flow is hidden inside the node.


Evaluation Criteria

| Criterion | Weight | Description |
|---|---|---|
| Testability | High | Can each concern be tested independently without LLM mocking? |
| Observability | High | Does the graph diagram honestly reflect control flow? |
| Migration Risk | High | Is the migration staged and reversible? |
| Checkpoint Overhead | Medium | How many checkpointer writes per iteration? |
| Practicality | High | Is the result maintainable by a human team? |
| LangGraph Alignment | Medium | Does it follow LangGraph idioms (conditional edges, single-responsibility nodes)? |
| Granularity Balance | High | Is the decomposition "just right" (neither too coarse nor too fine)? |

Ranked List (Best to Worst)

1. fable-5-proposal — "The Goldilocks Split" (5 nodes)

Architecture: tick → prepare → select → decide → guard

Why it ranks #1:

  • Perfect granularity. The 5-node split maps exactly to the 5 phases of a planner iteration: (1) iteration bookkeeping, (2) state preparation, (3) deterministic routing, (4) LLM decision, (5) policy enforcement. No node is too small (not a single if-block) or too large (not a 200-line monster).
  • Decision origin tracking. Introduces decision_origin: Literal["deterministic", "llm"] and llm_failed: bool — these are state-level abstractions that clarify the contract between nodes. The guard node knows whether to apply full policy corrections or just calculator adjustment.
  • Honest checkpoint boundaries. Each LLM call (proactive decomposition in prepare, blocked-field decomposition in select, planner LLM in decide) is a separate graph step. If the process crashes mid-tick, LangGraph resumes from the last completed node boundary, so LLM calls that already finished are not repeated.
  • Action nodes return to tick, not plan. This is semantically correct: every iteration should re-check abort/max-iters conditions. The current graph (and several proposals) route back to plan, which buries the guard check.
  • Staged migration is explicit. Phase 1: extract helpers without graph changes. Phase 2: rewire. Phase 3: docs. Each phase is testable.
  • Thread namespace bump. Explicitly recommends f"{conversation_id}:planner:v2" to handle in-flight checkpoint compatibility — a real-world concern most proposals skip.

Trade-offs:

  • More checkpoint writes per iteration (4 vs 1). Acceptable for chat-paced agents; the trade-off is explicit and justified.
  • decision_origin and llm_failed add transient state fields. Clean and backward-compatible (old checkpoints deserialize with defaults).

Verdict: Best balance of decomposition, observability, and migration safety. The decision_origin concept is the key insight that makes the post-LLM policy node (guard) cleanly separable from deterministic routing (select).
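The decision_origin contract described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposal's code: the PlannerState shape, the MAX_SEARCHES cap, and the helpers apply_llm_policy and adjust_calculation are hypothetical stand-ins for the real redirects, caps, and _adjust_calculation_decision. Skipping the LLM policy when llm_failed is set is likewise an assumption about how the flag would be used.

```python
from typing import Literal, TypedDict


class Decision(TypedDict):
    action: str


class PlannerState(TypedDict, total=False):
    decision: Decision
    decision_origin: Literal["deterministic", "llm"]
    llm_failed: bool
    search_count: int


MAX_SEARCHES = 3  # hypothetical cap, for illustration only


def apply_llm_policy(state: PlannerState, decision: Decision) -> Decision:
    """Stand-in for the post-LLM redirects and caps."""
    if decision["action"] == "search" and state.get("search_count", 0) >= MAX_SEARCHES:
        return {"action": "ask_user"}
    return decision


def adjust_calculation(state: PlannerState, decision: Decision) -> Decision:
    """Stand-in for the calculator adjustment; a no-op in this sketch."""
    return decision


def guard(state: PlannerState) -> dict:
    """Policy node: corrections depend on where the decision came from."""
    decision = state["decision"]
    # Full policy corrections apply only to decisions the LLM actually produced.
    if state.get("decision_origin") == "llm" and not state.get("llm_failed", False):
        decision = apply_llm_policy(state, decision)
    # The calculator adjustment applies to every decision, whatever its origin.
    decision = adjust_calculation(state, decision)
    return {"decision": decision}
```

With this shape, a deterministic decision from select passes through guard untouched by the LLM-only rules, which is the separation the verdict refers to.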


2. deepseek-4-pro-proposal — "The Structured Pipeline" (4 nodes)

Architecture: guard → prepare → plan → adjust

Why it ranks #2:

  • Cleanest conceptual model. The names (guard, prepare, plan, adjust) map intuitively to "check → setup → decide → correct." Anyone reading the graph can understand the lifecycle at a glance.
  • Auto-complete routes through adjust. Small but important detail: the proposal recognizes that auto_finish and acquisition_task decisions must still pass through _adjust_calculation_decision, so it routes them through adjust rather than directly to action nodes. This avoids duplicating calculator logic.
  • Explicit state caching. Recommends adding recipes to State to avoid redundant recomputation across prepare, plan, observe_user, etc. Several other proposals miss this optimization.
  • Good migration path. Phase 1 extraction, Phase 2 graph rewire, Phase 3 cleanup, Phase 4 verify. Mechanical and low-risk.

Trade-offs:

  • The plan node still contains the on-demand decomposition loop (lines 1072-1088), a single LLM call that could be its own node. The fable-5 proposal handles this better by putting decomposition in prepare/select.
  • No decision_origin tracking, so adjust must apply all corrections to every decision (including deterministic ones). This is slightly wasteful but not wrong.
  • Loop-back edges go to guard (correct), but the plan node can still loop to itself for decomposition. This is a bit odd topologically.

Verdict: Very close to #1. Slightly less refined on the decomposition/resumability front, but conceptually the cleanest 4-node split. The recipes caching idea is a nice practical touch.
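The recipes caching idea might look roughly like the sketch below. The State fields, compute_recipes, and the toy plan node are hypothetical; the only point illustrated is that prepare writes recipes into state once and later nodes read it instead of recomputing.

```python
from typing import Any, TypedDict


class State(TypedDict, total=False):
    schema_fields: list[str]
    recipes: dict[str, Any]  # cached by prepare, read by later nodes


def compute_recipes(fields: list[str]) -> dict[str, Any]:
    """Hypothetical stand-in for the real (expensive) recipe computation."""
    return {name: {"field": name} for name in fields}


def prepare(state: State) -> dict:
    # Recompute only when the cache is missing; otherwise emit no state update.
    if "recipes" in state:
        return {}
    return {"recipes": compute_recipes(state.get("schema_fields", []))}


def plan(state: State) -> dict:
    recipes = state["recipes"]  # read the cached value, no recomputation
    return {"decision": {"action": "search" if recipes else "finish"}}
```

A real version would also need to invalidate the cached recipes whenever the schema they were derived from changes; this sketch leaves that out.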


3. gpt-5.4-proposal — "The Conservative Staged Split" (6 nodes)

Architecture: tick → bootstrap_gate → prepare_context → acquisition_gate → llm_plan → decision_policy

Why it ranks #3:

  • Best migration order. The staged extraction (tick → bootstrap_gate → prepare_context → acquisition_gate → llm_plan → decision_policy) is the most conservative and lowest-risk. Each stage is a standalone PR.
  • Explicitly warns against over-decomposition. "Do not turn every policy rule into a node" — this is a mature architectural judgment. The decision_policy node keeps all post-LLM rewrites together, avoiding a graph that is harder to follow than the original code.
  • Clear contract boundaries. tick owns iteration counting. bootstrap_gate owns region/currency. prepare_context owns schema mutation. acquisition_gate owns deterministic routing. llm_plan owns the LLM. decision_policy owns corrections.
  • Action nodes return to tick. Correct semantic choice.

Trade-offs:

  • The decision_policy node is still a ~80-line amalgamation of redirects, caps, and calculator adjustments. It could be split further (as fable-5 does with guard), but the proposal explicitly rejects this to avoid graph noise.
  • bootstrap_gate as a separate node is a bit thin (~20 lines). It might be better fused into tick (as fable-5 does) or prepare (as deepseek-4 does).
  • No decision_origin concept, so decision_policy applies all corrections unconditionally.

Verdict: Excellent if your team prioritizes low-risk staged migration over maximum observability. The conservative philosophy is appropriate for a production system.


4. gemini-3.1-pro-proposal — "The Minimalist Split" (3 nodes)

Architecture: prepare_state → evaluate_rules → llm_plan

Why it ranks #4:

  • Simplicity. Only 3 new nodes. The cognitive overhead of the refactor is minimal — a junior developer can understand the new graph in 5 minutes.
  • Clear separation of concerns. prepare_state = pure state mutation. evaluate_rules = deterministic routing. llm_plan = heuristic routing. This is the cleanest 3-way split.
  • Good for quick wins. If the team has limited bandwidth, this delivers 80% of the benefit with 20% of the change.

Trade-offs:

  • evaluate_rules is still a ~100-line node with 6-7 interleaved concerns (max_iters, region, currency, calculator caps, blocked calc, auto-finish). It's better than the current 350-line monolith, but not as testable as a finer split.
  • llm_plan still contains the on-demand decomposition loop and post-LLM redirects. The fable-5 proposal handles this better by separating decide and guard.
  • Loop-back edges go to prepare_state rather than evaluate_rules. This means evaluate_rules doesn't re-run on every iteration, a minor semantic gap (max_iters won't be re-checked after a search/action cycle unless the graph routes back through evaluate_rules).

Verdict: Best "bang for buck" if you need a quick refactor. The 3-node split is a genuine improvement over the current state, but leaves the most complex parts (deterministic routing + post-LLM policy) still somewhat monolithic.


5. glm-5.1-proposal — "The Two-Step Split" (2 nodes)

Architecture: pre_check → acquire_or_plan

Why it ranks #5:

  • Lowest migration risk. Only 2 new nodes. The graph change is minimal: START → pre_check → acquire_or_plan → action.
  • Clear two-stage boundary. pre_check holds the guards and prerequisites that run before any planning. acquire_or_plan holds everything after that: acquisition, the LLM call, and the post-LLM rewrites. This is the simplest conceptual model.
  • Good PreCheckResult abstraction. The PreCheckResult dataclass with a routing tag is a clean way to communicate between node and router without overloading PlannerDecision.
  • Option B for future extraction. Explicitly recommends extracting a validate_decision node later if needed. This shows foresight.

Trade-offs:

  • pre_check is still ~150 lines (guards + prerequisites + preprocessing + calculator checks). It solves the "all logic in one node" problem, but pre_check itself is still a medium-sized monolith.
  • acquire_or_plan is ~200 lines (acquisition + LLM + rewriting). The post-LLM rewrites are still buried inside the LLM node.
  • The _pre_check_route transient field is a bit hacky. The fable-5 decision_origin approach is cleaner.
  • Loop-back edges go to pre_check (correct), but the pre_check node itself is still doing a lot of work.

Verdict: A solid "first step" that the team can build on. The 2-node split is too coarse to be the final state, but it is the lowest-risk starting point. Consider it as Phase 1 of a deeper refactor.
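A PreCheckResult with a routing tag could be shaped along these lines. The field names, the three route values, and the target node names in the router are assumptions made for illustration; the proposal's actual dataclass may differ.

```python
from dataclasses import dataclass
from typing import Literal, Optional

Route = Literal["abort", "ask_user", "continue"]  # hypothetical routing tags


@dataclass(frozen=True)
class PreCheckResult:
    route: Route
    reason: Optional[str] = None


def pre_check(state: dict) -> PreCheckResult:
    """Guards and prerequisites, reduced to two toy rules."""
    if state.get("iteration", 0) >= state.get("max_iters", 10):
        return PreCheckResult("abort", "max iterations reached")
    if not state.get("region"):
        return PreCheckResult("ask_user", "region missing")
    return PreCheckResult("continue")


def route_after_pre_check(result: PreCheckResult) -> str:
    """Router: maps the tag to a node name without touching PlannerDecision."""
    targets = {"abort": "finish", "ask_user": "ask_user", "continue": "acquire_or_plan"}
    return targets[result.route]
```

Because the tag lives in its own small type, the router stays a lookup and PlannerDecision is not overloaded with pre-check outcomes.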


6. mimo-2.5-pro-proposal — "The Three-Way Split" (3 nodes)

Architecture: guards → acquire → decide

Why it ranks #6:

  • Very similar to glm-5.1 but with a slightly different boundary. The guards node is iteration + region/currency + abort/max_iters. acquire is decomposition + composition + task selection. decide is LLM + redirects + cap enforcement.
  • Good node naming. guards, acquire, decide are intuitive.
  • Phase 1 extraction is well-specified. Extract helpers first, then wire.

Trade-offs:

  • guards does NOT include calculator caps (those are in acquire in this proposal, or in decide). This is a slightly odd split: calculator caps are guards, not acquisition logic.
  • decide is still ~80 lines of LLM + ~100 lines of redirects + caps. This is the heaviest node and the one that most needs further splitting.
  • The route_after_guards / route_after_acquire routing functions are simple, but the decide node is still a black box.
  • No decision_origin concept. decide applies all redirects to deterministic decisions too.

Verdict: Similar to glm-5.1 but with a slightly less clean boundary. The decide node is too heavy. The guards node is too light (missing calculator logic). The 3-node split is better than 2-node, but the boundaries are less well-chosen than gemini-3.1.


7. gpt-5.5-proposal — "The Maximalist Split" (7+ nodes)

Architecture: enter_iteration → prepare_schema → require_region/currency → acquisition_gate → plan → normalize_decision → retry_gate → maybe_calculate

Why it ranks #7:

  • Very granular. Each concern has its own node. normalize_decision, retry_gate, maybe_calculate are separate nodes — this is the most refined post-LLM policy split.
  • Excellent test seams. Each policy (redirects, caps, calculator routing) can be unit-tested independently.
  • decision_origin equivalent. Uses llm_failed or decision_origin to preserve the current safeguard against infinite loops after LLM failure.

Trade-offs:

  • Too granular. The graph has ~10 nodes before reaching an action node. This is approaching the "16 nodes" alternative that deepseek-4 explicitly rejected.
  • High checkpoint overhead. Up to 8 small writes per iteration under PostgresSaver. For a chat-paced agent this might be negligible, but for a high-throughput scenario it could matter.
  • require_region / require_currency as separate nodes. These are ~20 lines each. They could be merged into bootstrap_gate (as gpt-5.4 does) or tick (as fable-5 does) without loss of clarity.
  • retry_gate and maybe_calculate as separate nodes. These are sequential policy applications. The fable-5 proposal merges them into guard (one node) with decision_origin tracking. The gpt-5.5 approach is more testable but also more verbose.
  • Visual complexity. The Mermaid diagram for this graph is large and harder to read than a 4-5 node graph. The goal was to make the graph "honest," but this may make it "honest but overwhelming."

Verdict: Best for teams with very strong testing culture and high tolerance for graph complexity. The granularity is arguably excessive for a production system. The fable-5 guard node achieves similar testability with less graph noise.


8. kimi-2.6-proposal (self) — "The Pipeline Split" (8 nodes)

Architecture: tick → ask_region/ask_currency → prepare → calc_gate → acquire → route_direct → decide → enforce_policy

Why it ranks #8:

  • Honest diagram. The 8-node pipeline is very explicit about what happens when.
  • route_direct adapter node. Good insight: deterministic acquisition tasks need to be converted to PlannerDecision and then pass through the same post-processing as LLM decisions.
  • calc_gate as a separate conditional node. Makes calculator lifecycle visible in the graph.

Trade-offs:

  • ask_region / ask_currency as separate nodes from ask_user. This adds 2 nodes and 2 edges for a ~20-line concern. The existing ask_user node already handles region/currency via _handle_region_answer / _handle_currency_answer. Splitting them is overkill; the fable-5 approach (routing ask_user from tick with a decision) is simpler.
  • enforce_policy is still ~80 lines. The post-LLM redirects, caps, and calculator adjustments are all in one node. The fable-5 guard node is the same size but applies corrections conditionally based on decision_origin.
  • No decision_origin tracking. enforce_policy applies all corrections to every decision.
  • The graph is large but not as well-organized as gpt-5.5. The acquire node does blocked-calc + auto-finish + open-task resolution: three concerns in one node. The gpt-5.5 acquisition_gate is similarly complex, but at least gpt-5.5 splits the post-LLM policy further.

Verdict: A solid but slightly over-granular proposal. The ask_region/ask_currency split is unnecessary. The enforce_policy node is still too heavy. The decision_origin concept (from fable-5) would improve this design significantly.


9. qwen-3.6-plus-proposal — "The Preflight Pattern" (6 nodes)

Architecture: preflight → ask_region/ask_currency → decompose → plan → route_after_plan → route_finish_check

Why it ranks #9:

  • Good preflight concept. The preflight node is a clear guard + prerequisite gate.
  • Explicit route_finish_check node. Makes the auto-finish validation visible as a graph step.
  • route_after_plan carries the redirect logic. Moving post-LLM redirects into the routing function is an interesting idea: it makes the graph edges express the policy.

Trade-offs:

  • route_after_plan as a routing function with complex logic. The proposal moves _redirect_derived_direct_decision, _redirect_premature_ask_for_web_field, _adjust_calculation_decision, and cap enforcement into route_after_plan. This is a significant anti-pattern in LangGraph: conditional edge functions should be simple predicates (3-8 lines), not complex policy engines. The proposal explicitly states "LangGraph conditional edges can't express complex logic — the routing functions are pure Python," but LangGraph best practice is that edge functions should be trivial. Moving 80 lines of policy into an edge function defeats the purpose of graph-level visibility.
  • ask_region / ask_currency as separate nodes. Same over-granularity as kimi-2.6.
  • observe_region / observe_currency as separate nodes. These are ~40 lines each and duplicate the existing observe_user logic. The existing observe_user already handles region/currency via target_field check. This is unnecessary duplication.
  • The plan node is still ~80 lines. Not as slim as it could be (the fable-5 decide node is ~40 lines).

Verdict: The preflight and route_finish_check nodes are good ideas, but the proposal is fundamentally flawed in its treatment of route_after_plan. Moving complex policy into edge functions violates LangGraph conventions and makes the graph harder to debug, not easier.
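The difference between policy in a node and policy in an edge function can be shown with a small sketch. The cap rule, the ask_count and max_asks fields, and the function names are hypothetical; in LangGraph, a function like route_after_policy is the kind of callable one would pass to add_conditional_edges.

```python
def decision_policy(state: dict) -> dict:
    """Node: owns the policy and records the corrected decision in state."""
    decision = dict(state["decision"])
    # Hypothetical cap rule: stop asking the user once the ask budget is spent.
    if decision["action"] == "ask_user" and state.get("ask_count", 0) >= state.get("max_asks", 2):
        decision["action"] = "finish"
    return {"decision": decision}


def route_after_policy(state: dict) -> str:
    """Edge function: a trivial lookup with no policy of its own."""
    return state["decision"]["action"]
```

Because the correction happens inside the node, the rewritten decision is part of the state update that gets checkpointed, whereas a rewrite done inside the edge function would leave no trace in state.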


10. qwen-3.7-max-proposal — "The Maximal Implementation" (10+ nodes)

Architecture: check_termination → check_region_currency → compose_schema → check_calculator → acquisition_routing → check_completion → llm_decide → post_process → enforce_caps → route_decision

Why it ranks #10:

  • Most detailed implementation. The proposal includes full Python code for every node and routing function. This is useful as a reference implementation.
  • Good node naming. check_termination, check_region_currency, compose_schema, check_calculator, acquisition_routing, check_completion, llm_decide, post_process, enforce_caps, route_decision — very explicit.

Trade-offs:

  • Massive graph. 10+ nodes before reaching an action. This comes close to the "16 nodes" alternative that deepseek-4 explicitly rejected. The Mermaid diagram is a maze.
  • High checkpoint overhead. 8-9 small writes per iteration.
  • route_decision as a node that only routes. The proposal adds a route_decision node whose only job is to read decision.action and emit an event. This should be a routing function (as in every other proposal), not a node. A node that does no state mutation and only returns a routing string is pure overhead.
  • _route transient state fields. The proposal adds _route and _llm_failed to State. The _route field is a routing hack: it exists because the graph is too granular to express routing naturally. The fable-5 decision_origin is a better abstraction.
  • observe_user loop-back goes to check_region_currency, not check_termination. This means a user answering a region question skips the iteration increment and abort check on the next cycle. This is a subtle bug: every loop-back should go to the first guard node (check_termination).
  • check_completion does acquisition task selection. This is a weird split: acquisition routing is in both acquisition_routing and check_completion. The auto-finish logic is split across two nodes.
  • The proposal is 722 lines. This is longer than the current plan_node itself. It may be more "correct" but it is also more complex to understand than the original code.

Verdict: Useful as a reference for what each node could look like, but the graph is too granular and has a subtle routing bug. The route_decision node is a clear anti-pattern. The _route transient fields are a smell of over-decomposition.
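The loop-back rule stated above (every loop-back should target the first guard node) is easy to check mechanically. The sketch below assumes the loop-back edges are available as a plain mapping; the search and calculate entries are hypothetical, and only the observe_user edge is taken from the proposal as described.

```python
ENTRY_GUARD = "check_termination"

# Loop-back edges: action/observe node -> node it returns to.
LOOP_BACKS = {
    "search": "check_termination",      # hypothetical entry
    "calculate": "check_termination",   # hypothetical entry
    "observe_user": "check_region_currency",  # the edge flagged above
}


def misrouted_loop_backs(loop_backs: dict[str, str], entry: str) -> list[str]:
    """Return the source nodes whose loop-back edge bypasses the entry guard."""
    return sorted(src for src, dst in loop_backs.items() if dst != entry)


print(misrouted_loop_backs(LOOP_BACKS, ENTRY_GUARD))  # ['observe_user']
```

A check like this can run as a unit test against the graph's edge table, so a loop-back that bypasses the termination guard fails the build rather than surfacing at runtime.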


11. opus-proposal — "The Ultimate Granularity" (~16 nodes)

Architecture: g_iter_cap → g_region → g_currency → enrich → g_calc_caps → acquire → plan_llm → c_target_decompose → c_redirect_derived → c_redirect_web → c_search_cap → c_ask_cap → c_calc_adjust → dispatch

Why it ranks #11:

  • Most honest diagram. The graph diagram is the most accurate representation of the actual control flow.
  • One invariant ties everything together. Every gate/corrector deposits a PlannerDecision into state["decision"]. dispatch reads it. This is elegant.
  • Good naming convention. g_* = gate, c_* = corrector, enrich = enrichment, dispatch = router. Consistent and readable.

Trade-offs:

  • Way too granular. The corrector chain (c_target_decompose → c_redirect_derived → c_redirect_web → c_search_cap → c_ask_cap → c_calc_adjust) is 6 sequential nodes, each mutating a single field. This is the extreme opposite of the monolith: a "nanolith" problem. The fable-5 guard node (one node, conditional corrections) is vastly more practical.
  • Up to 10 node hops per iteration. Each hop is a checkpointer write. For a Postgres-backed checkpointer, this is non-trivial overhead.
  • The dispatch node is read-only but still a node. Same anti-pattern as qwen-3.7's route_decision: a node that only routes and logs. This should be a routing function.
  • The g_region / g_currency split. These are ~5 lines each. As separate nodes, they add visual noise without improving clarity.
  • The c_* corrector chain uses add_edge (not add_conditional_edges). This is correct (no branching needed), but it means the chain is always traversed, even when no correction is needed. This is wasteful.
  • Maintaining this graph is harder than maintaining the original code. The original 350-line function is at least in one place. This graph is 16 files/nodes. The cognitive overhead of navigating the graph is higher than navigating a single long function.

Verdict: An interesting theoretical exercise in maximum decomposition, but not practical for a production system. The graph is larger than the code it replaces. The checkpoint overhead is real. The maintainability is worse than the original. Rejected for production use.
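One way to keep the per-corrector test seams without six graph nodes is to run the correctors as an ordered list of plain functions inside a single node. The sketch below is an assumption about how that could look: the two correctors, their cap values, and the state fields are hypothetical, and the real chain would contain the proposal's six c_* steps.

```python
from typing import Callable

Corrector = Callable[[dict, dict], dict]  # (state, decision) -> decision


def cap_searches(state: dict, decision: dict) -> dict:
    """Hypothetical search cap: fall back to asking the user."""
    if decision["action"] == "search" and state.get("search_count", 0) >= 3:
        return {**decision, "action": "ask_user"}
    return decision


def cap_asks(state: dict, decision: dict) -> dict:
    """Hypothetical ask cap: finish instead of asking again."""
    if decision["action"] == "ask_user" and state.get("ask_count", 0) >= 2:
        return {**decision, "action": "finish"}
    return decision


# Order matters: a later corrector sees the output of the earlier ones.
CORRECTORS: list[Corrector] = [cap_searches, cap_asks]


def guard(state: dict) -> dict:
    """Single node: applies the whole chain and returns one state update."""
    decision = state["decision"]
    for correct in CORRECTORS:
        decision = correct(state, decision)
    return {"decision": decision}
```

Each corrector remains a pure function that can be unit-tested in isolation, while the graph sees one node and one checkpoint write for the whole chain.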


Summary Table

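| Proposal | Nodes | Granularity | Testability | Migration Risk | Checkpoint Overhead | Key Strength | Key Weakness |
|---|---|---|---|---|---|---|---|
| fable-5 | 5 | ★★★★★ | ★★★★★ | ★★★★☆ | ★★★☆☆ | decision_origin tracking | Slightly more writes |
| deepseek-4 | 4 | ★★★★☆ | ★★★★☆ | ★★★★★ | ★★★☆☆ | recipes caching | plan still loops to itself |
| gpt-5.4 | 6 | ★★★★☆ | ★★★★☆ | ★★★★★ | ★★★☆☆ | Conservative migration | decision_policy is still heavy |
| gemini-3.1 | 3 | ★★★☆☆ | ★★★☆☆ | ★★★★★ | ★★☆☆☆ | Simplicity | evaluate_rules is still big |
| glm-5.1 | 2 | ★★☆☆☆ | ★★☆☆☆ | ★★★★★ | ★★☆☆☆ | Lowest risk | pre_check is still ~150 lines |
| mimo-2.5 | 3 | ★★★☆☆ | ★★★☆☆ | ★★★★☆ | ★★☆☆☆ | Intuitive naming | decide is too heavy |
| gpt-5.5 | 7+ | ★★★★★ | ★★★★★ | ★★★☆☆ | ★★★★☆ | Very granular | Visual complexity |
| kimi-2.6 | 8 | ★★★★☆ | ★★★★☆ | ★★★☆☆ | ★★★★☆ | Honest diagram | ask_region/ask_currency overkill |
| qwen-3.6 | 6 | ★★★☆☆ | ★★★☆☆ | ★★★☆☆ | ★★★☆☆ | Good preflight | Complex edge functions |
| qwen-3.7 | 10+ | ★★★★☆ | ★★★★☆ | ★★☆☆☆ | ★★★★★ | Detailed code | Too granular, routing bug |
| opus | ~16 | ★★★★★ | ★★★★★ | ★★☆☆☆ | ★★★★★ | Honest diagram | Nanolith problem |

Reading the stars: for Testability and Migration Risk, more stars is better (★★★★★ in Migration Risk means lowest risk). For Checkpoint Overhead, more stars means more checkpointer writes per iteration, so fewer is better.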

Primary Recommendation: Adopt the fable-5-proposal with deepseek-4's recipes caching

The fable-5 proposal is the best overall solution because it achieves the "Goldilocks" balance:

  • Not too coarse (like glm-5.1's 2-node split, which leaves ~150-line nodes)
  • Not too fine (like opus's 16-node chain, which is harder to maintain than the original)
  • The decision_origin concept is the key architectural insight that makes the 5-node split work cleanly. It lets the guard node apply the right corrections without knowing the full provenance of every decision.
  • Checkpoint resumability is explicitly designed: each LLM call is a separate node, so crashes mid-tick don't repeat LLM work.
  • Thread namespace bump (:planner:v2) is explicitly recommended for checkpoint compatibility.
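The namespace bump can be centralized in one helper so that every caller picks up the new version at once. The sketch assumes the f"{conversation_id}:planner:v2" string is used as the checkpointer thread_id; the helper names and the version constant are hypothetical.

```python
PLANNER_NAMESPACE_VERSION = 2  # bump when the graph topology changes incompatibly


def planner_thread_id(conversation_id: str, version: int = PLANNER_NAMESPACE_VERSION) -> str:
    """Build the versioned thread id for the planner graph."""
    return f"{conversation_id}:planner:v{version}"


def planner_config(conversation_id: str) -> dict:
    """Run config; LangGraph checkpointers key saved state by configurable.thread_id."""
    return {"configurable": {"thread_id": planner_thread_id(conversation_id)}}
```

Conversations started under the old namespace keep their old checkpoints untouched, and new runs start from a clean thread rather than trying to resume a checkpoint written by the old single-node graph.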

Secondary Recommendation: Augment with deepseek-4's recipes state caching

The deepseek-4 proposal adds recipes to State to avoid redundant recomputation. This is a small, practical optimization that improves performance without adding complexity. It is orthogonal to the fable-5 architecture and can be added as a follow-up.

Implementation Order

  1. Phase 1 (minimal risk): Extract helper functions from plan_node without changing graph topology. The fable-5 proposal's tick, prepare, select, decide, guard bodies become standalone functions called sequentially from plan_node. Run tests.
  2. Phase 2 (medium risk): Rewire the graph. Register the 5 nodes, add conditional edges, point loop-backs to tick. Bump thread namespace to :planner:v2. Run tests.
  3. Phase 3 (low risk): Add recipes to State and cache it in prepare. Run tests.
  4. Phase 4 (cleanup): Remove old plan_node and route_after_plan. Update docs.
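Phase 1 can be pictured as plan_node shrinking into a driver that calls the extracted helpers in order, while the graph still sees a single node. The helper bodies below are toy placeholders and the pending_task field is hypothetical; only the five helper names come from the fable-5 proposal.

```python
def tick(state: dict) -> dict:
    # Iteration bookkeeping; also clears the previous iteration's decision.
    return {"iteration": state.get("iteration", 0) + 1, "decision": None}


def prepare(state: dict) -> dict:
    return {}  # placeholder for state preparation


def select(state: dict) -> dict:
    # Deterministic routing: a pending task yields a decision without the LLM.
    task = state.get("pending_task")
    if task is None:
        return {}
    return {"decision": {"action": task}, "decision_origin": "deterministic"}


def decide(state: dict) -> dict:
    # Placeholder for the planner LLM call.
    return {"decision": {"action": "finish"}, "decision_origin": "llm"}


def guard(state: dict) -> dict:
    return {}  # placeholder for policy enforcement


def plan_node(state: dict) -> dict:
    """Phase 1: still one graph node, but the body is a sequence of helpers."""
    working = dict(state)
    update: dict = {}
    for step in (tick, prepare, select, decide, guard):
        if step is decide and working.get("decision") is not None:
            continue  # a deterministic decision already exists; skip the LLM
        delta = step(working)
        working.update(delta)
        update.update(delta)
    return update
```

Because each helper already takes state and returns a partial update, Phase 2 amounts to registering the same functions as nodes and replacing the loop with edges.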

What to Avoid

  • Do NOT adopt the opus or qwen-3.7 proposals. The granularity is excessive and the checkpoint overhead is real. The nanolith problem is worse than the monolith problem.
  • Do NOT move complex policy into edge functions (as qwen-3.6 suggests). Edge functions should be 3-8 line predicates. Policy belongs in nodes.
  • Do NOT split ask_region / ask_currency into separate nodes (as kimi-2.6 and qwen-3.6 suggest; gpt-5.5's require_region / require_currency is the same kind of split). The existing ask_user node already handles these via _handle_region_answer / _handle_currency_answer. Separate nodes add graph noise without improving clarity.
  • Do NOT add a dispatch or route_decision node that only routes. Routing should be a conditional edge function, not a node. A node with no state mutation is pure overhead.

Final Word

The best refactor is the one that the team will actually maintain. The fable-5 proposal's 5-node graph is:

  • Small enough to fit on one screen
  • Large enough to make the control flow honest
  • Testable enough to give confidence in each node
  • Safe to migrate in 4 phases

This is the recommendation.