Planner Graph Refactor: Ranked Evaluation of Proposals

Reviewer: Sisyphus (Claude Opus 4.7)
Date: 2026-06-11
Inputs: current-graph.md, the 11 proposals, src/venturescope/planner/agent.py, planner AGENTS.md.

0. Common Ground (Shared Diagnosis)

Every proposal agrees on the same diagnosis, which I confirm by reading the code:

plan_node (≈350 lines) bundles 10–17 distinct concerns: iteration tick, abort/max-iters caps, region/currency bootstrap, proactive decomposition (LLM), schema composition, calculator-cap guard, calculator-success guard, blocked-calculator acquisition (LLM-fallback decomposition), auto-finish, planner LLM, post-LLM correctors (derived-direct redirect, web-preferred redirect, search cap, ask cap, calc-adjust), event emission.
The Mermaid diagram is misleading: route_after_plan is trivial; ALL real routing intelligence is hidden inside plan_node early-returns.
A single planner tick can perform up to 3 LLM calls (proactive decomposition, planner decision, on-demand decomposition for composite targets); a crash mid-tick re-runs all of them on resume because LangGraph can only replay from node boundaries.

So the question is not "should we decompose?" but "how granular, how to migrate, and how to keep checkpoints/observability working?".
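The re-billing cost in the third point can be illustrated with a small simulation. This is a minimal sketch in plain Python, not LangGraph code: the step names and the `run_with_checkpoints` helper are hypothetical, and it assumes a call that crashes mid-flight has already been billed.

```python
# Hypothetical stand-ins for the three LLM calls named in the diagnosis.
LLM_STEPS = ["proactive_decomposition", "planner_decision", "on_demand_decomposition"]


def run_with_checkpoints(nodes: list[list[str]], crash_in: str) -> list[str]:
    """Simulate one tick that crashes once, then resumes from the last node boundary.

    `nodes` groups LLM steps into graph nodes. A checkpoint is written only
    after a whole node finishes. Returns every LLM call that was billed.
    """
    billed: list[str] = []
    completed = 0      # number of nodes whose results are checkpointed
    crashed = False
    while completed < len(nodes):
        try:
            for step in nodes[completed]:
                billed.append(step)  # assumption: the call is billed before the crash
                if step == crash_in and not crashed:
                    crashed = True
                    raise RuntimeError(f"crash during {step}")
            completed += 1           # node boundary: checkpoint written
        except RuntimeError:
            continue                 # resume: replay the unfinished node from its start
    return billed


# One monolithic node vs. one node per LLM call, same crash point.
monolith_calls = run_with_checkpoints([LLM_STEPS], crash_in="on_demand_decomposition")
split_calls = run_with_checkpoints([[s] for s in LLM_STEPS], crash_in="on_demand_decomposition")
print(len(monolith_calls), len(split_calls))  # 6 4
```

Under these assumptions the monolith pays for all three calls twice, while the split graph repeats only the call that crashed.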

1. Evaluation Criteria

I scored each proposal across seven axes that map to risk and value for THIS codebase:

| # | Axis | Why it matters |
|---|---|---|
| C1 | Granularity fit | Too coarse = doesn't solve the monolith problem. Too fine = checkpoint-write storm, hard to read. |
| C2 | Migration safety | A staged "extract helpers first → rewire graph second" plan keeps every intermediate commit green. |
| C3 | Checkpoint compatibility | Renaming/removing the `plan` node breaks resume for in-flight threads whose pending tasks reference node names. |
| C4 | State surface discipline | New fields must serialize cleanly through `PlannerState`; transient fields polluting checkpoints are a code smell. |
| C5 | LLM-call isolation / resumability | Decomposition LLM calls should be in their own node so partial failure does not re-bill the planner LLM. |
| C6 | Edge-function purity | Conditional edges must stay side-effect-free; mutation belongs in nodes. |
| C7 | Test impact realism | Most planner tests in `tests/planner/test_planner_agent.py` import `plan_node` directly; an honest proposal acknowledges the churn. |

2. Ranked List

| Rank | Proposal | Score | One-line verdict |
|---|---|---|---|
| 1 | `fable-5-proposal.md` | A | Best balance: 5 nodes, minimal state additions, only proposal that addresses checkpoint compat explicitly, mechanical migration. |
| 2 | `gpt-5.4-proposal.md` | A | Strong principles ("no mutation in edges", "one decision_policy, not many"), clean 6-stage migration, explicit anti-patterns. |
| 3 | `gpt-5.5-proposal.md` | A− | Same shape as gpt-5.4 with more nodes (8) and a tests-to-update map; slightly more graph noise. |
| 4 | `opus-proposal.md` | B+ | Most principled invariant ("every gate deposits a `PlannerDecision`; `dispatch` reads `decision.action`"); risks chatty graph (up to 10 hops/iter). |
| 5 | `kimi-2.6-proposal.md` | B+ | 8-node pipeline with dedicated `ask_region`/`ask_currency`; honest decision matrix; minor: leaves a big `enforce_policy` node. |
| 6 | `mimo-2.5-pro-proposal.md` | B | Clean 3-node split (guards / acquire / decide); folds all redirects back into `decide`, partly recreating the monolith. |
| 7 | `gemini-3.1-pro-proposal.md` | B | Shortest, most pragmatic (3 nodes); too coarse: `evaluate_rules` still groups 6 different gates. |
| 8 | `deepseek-4-pro-proposal.md` | B− | 4 nodes + `plan → plan` self-loop for on-demand decomposition; adds `recipes` to state (serialization concern). |
| 9 | `glm-5.1-proposal.md` | C+ | Only `pre_check` + `acquire_or_plan`; honest about leaving post-LLM rewriting in. Minimum-viable change but doesn't finish the job. |
| 10 | `qwen-3.6-plus-proposal.md` | C | Splits region/currency into separate `observe_region`/`observe_currency`; duplicates `observe_user` logic; `route_finish_check` introduces redundant validation. |
| 11 | `qwen-3.7-max-proposal.md` | C− | Most exhaustive code-level proposal, but adds transient `_route` field to checkpointer state; `observe_user → check_region_currency` edge bypasses the iteration tick, a bug magnet. |

3. Per-Proposal Analysis

3.1 `fable-5-proposal.md`: Rank 1 (A)

Shape: tick → prepare → select → decide → guard → dispatch action. Loop-backs all return to tick.

Strengths

- C1 ✅: 5 nodes hit the sweet spot. Each maps to a clear lifecycle phase ("loop entry / state prep / deterministic decision / LLM / decision rewrite").
- C2 ✅: Migration is explicit: Phase 1 extract pure helpers inside the existing plan_node; Phase 2 rewire graph; Phase 3 docs. Stays green between phases.
- C3 ✅: Only proposal that addresses checkpoint compat: it proposes bumping the planner thread namespace (f"{conversation_id}:planner:v2") to invalidate in-flight pending tasks that still reference the old plan node. The runner's prior_schema + prior_dynamic_decompositions already handle bootstrap re-entry, so an in-flight conversation degrades to a clean re-bootstrap.
- C4 ✅: Only 2 new state fields (decision_origin: Literal["deterministic","llm"], llm_failed: bool). Both plain serializable.
- C5 ⚠️: Decomposition LLM call is isolated to prepare. The remaining in-stage LLM calls (select step 3 for blocked-field decomposition, guard for on-demand decomposition) are explicitly flagged as a follow-up decompose node.
- C6 ✅: Routing functions are explicitly pure label-mappers.
- C7 ✅: Acknowledges that tests/planner/test_planner_agent.py will churn and frames it as a net win.

Weaknesses

- The decision_origin flag is a stand-in for "did the LLM run?". It works, but it leaks decision provenance into every downstream consumer. An alternative is Command(goto=...) from the deterministic gates so they never reach guard; fable-5 mentions this as deferred.
- Up to 4 checkpoint writes/tick vs 1 today; acceptable for a chat-paced agent. fable-5 explicitly notes that tick and prepare can be fused if profiling warrants it.

Verdict: This is the proposal I would adopt as the default plan.
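The namespace bump credited under C3 can be sketched as follows. This is an illustration only: `planner_thread_id` is a hypothetical helper, the unsuffixed version-1 form is an assumption (the source gives only the v2 string), and a plain dict stands in for the checkpointer.

```python
def planner_thread_id(conversation_id: str, version: int = 2) -> str:
    """Build the checkpointer thread id for the planner graph.

    Assumption: version 1 used the unsuffixed form "<id>:planner".
    """
    suffix = "" if version == 1 else f":v{version}"
    return f"{conversation_id}:planner{suffix}"


# A dict stands in for the checkpointer: the old thread still has a pending
# task that references the removed `plan` node.
checkpoints = {planner_thread_id("c42", version=1): {"next": ("plan",)}}

# After the bump, lookups use the v2 id, so the stale checkpoint is never loaded
# and the conversation starts from a clean bootstrap instead.
resumed = checkpoints.get(planner_thread_id("c42"))
print(planner_thread_id("c42"), resumed)  # c42:planner:v2 None
```

The old checkpoint is left in place rather than migrated, which matches the "degrades to a clean re-bootstrap" behavior described above.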

3.2 `gpt-5.4-proposal.md`: Rank 2 (A)

Shape: tick → bootstrap_gate → prepare_context → acquisition_gate → llm_plan → decision_policy → action.

Strengths

- C1 ✅: 6 stages, each named after its job. Bigger than fable-5, but the extra split (bootstrap_gate separate from tick) makes interrupt boundaries explicit.
- C2 ✅: Migration is staged in 6 individually shippable steps, each preserving behavior. Lowest-risk plan in the field.
- C3 ❌: Does not address checkpoint compat; this is the only material gap.
- C4 ✅: No new state fields proposed.
- C5 ⚠️: Same as fable-5: decomposition LLM call sits inside prepare_context and inside decision_policy; not isolated.
- C6 ✅: Has a dedicated "Risks and anti-patterns" section that explicitly forbids state mutation in edge functions and warns "do not turn every policy rule into a node". This kind of guidance is the most valuable part of the proposal.
- C7 ⚠️: Doesn't enumerate test-by-test churn; mentions tests at a high level.

Weaknesses

- Keeps _adjust_calculation_decision, _redirect_*, and cap enforcement in a single decision_policy node. That's pragmatic but means decision_policy will be the next-biggest node after the refactor. fable-5 and opus split this further.
- Does not propose any state additions, which means it inherits today's implicit llm_failed local variable. Either the migration must reproduce that, or a flag must be added.

Verdict: Equally adoptable. The principles ("graph shows phases, helpers own domain logic, LLM stays small") are the cleanest articulation in the corpus.

3.3 `gpt-5.5-proposal.md`: Rank 3 (A−)

Shape: 8 nodes — enter_iteration → prepare_schema → [require_region | require_currency] → acquisition_gate → plan → normalize_decision → retry_gate → maybe_calculate → action.

Strengths

- C1 ⚠️: More nodes than gpt-5.4 with no functional difference (same concerns split further).
- C2 ✅: 4-step migration with explicit pure-helper extraction first; this is the same shape as fable-5/gpt-5.4 and is the right pattern.
- C3 ⚠️: Mentions "potential checkpoint compatibility concerns" but only for adding optional state fields. Does not address renaming plan away.
- C4 ✅: Optional decision_origin / llm_failed; explicitly rejects prepared_recipes in state because FieldAcquisition serialization isn't proven.
- C5 ⚠️: Same as gpt-5.4.
- C6 ✅: Implicit (no edge mutation discussed because all logic is in nodes).
- C7 ✅: Best test mapping: each new node has a corresponding test target listed (e.g., "calculator-before-finish routes through maybe_calculate"). Saves the implementer time.

Weaknesses

- The graph is visually busier than gpt-5.4's with no behavior gain.
- normalize_decision and retry_gate could be one node (matching gpt-5.4's decision_policy). The split is defensible but adds two more checkpoint writes per tick.

Verdict: Effectively a more verbose gpt-5.4. Adopt only if you want one-rule-per-node enforcement.

3.4 `opus-proposal.md`: Rank 4 (B+)

Shape: Most granular — g_iter_cap → g_region → g_currency → enrich → g_calc_caps → acquire → plan_llm → c_target_decompose → c_redirect_derived → c_redirect_web → c_search_cap → c_ask_cap → c_calc_adjust → dispatch → action.

Strengths

- C1 ⚠️: 13+ nodes. Risks being too chatty. The c_* chain is fully sequential with no branching, so it could use add_edge (not add_conditional_edges), and that is what the proposal recommends.
- C2 ✅: Explicit 3-step plan: extract helpers, hoist gates, split correctors. Each step is independently green.
- C3 ❌: Does not address checkpoint compat.
- C4 ✅: Only adds llm_failed: bool.
- C5 ⚠️: Same as the other proposals, except that it lists c_target_decompose as a separate step, which comes closer to isolating the post-LLM decomposition LLM call.
- C6 ✅: Establishes a strong invariant: every gate/corrector either pre-fills state["decision"] then jumps to dispatch, or hands off to the next node. This is the cleanest mental model in the corpus.
- C7 ⚠️: Acknowledges existing tests stay green at step 1, treats step 2/3 test churn lightly.

Weaknesses

- 13+ nodes means 13+ checkpoint writes per tick (vs 1 today). Even with add_edge batching where possible, this is a real overhead for a Postgres-backed checkpointer in a chat session.
- Risks "fragmentation cost": adding any new correction becomes another node + edge + name in three places. After the refactor, the next engineer is doing surgery on a 13-node graph instead of a 13-section function. Net cognitive load is lower per touchpoint but higher per change.
- The "every gate writes decision then jumps to dispatch" pattern is elegant but it means gates have to short-circuit by setting decision, which is structurally identical to today's early returns, just spread across more files.

Verdict: Aspirational. Use the invariant and the g_* / c_* naming; do not adopt the full granularity.

3.5 `kimi-2.6-proposal.md`: Rank 5 (B+)

Shape: 8 nodes — tick → [ask_region | ask_currency] → prepare → calc_gate → acquire → [route_direct | decide] → enforce → action. Has the appendix decision matrix.

Strengths

- C1 ✅: Sensible granularity.
- C2 ✅: 3-phase migration.
- C3 ❌: Does not address checkpoint compat.
- C4 ✅: No new state fields.
- C5 ⚠️: Inline decomposition stays in acquire/enforce.
- C6 ⚠️: Routing details are sparse on edge purity.
- C7 ✅: Decision matrix appendix is a useful reviewer artifact.

Weaknesses

- Dedicated ask_region/ask_currency nodes that route through the existing observe_user are a nice touch (they save duplicating answer parsing), but they add nodes whose only job is to construct a PlannerDecision. Could be edge-level instead.
- route_direct is a tiny adapter (acquisition task → decision). Borderline necessary.
- enforce_policy keeps the entire post-LLM correction stack (target decompose + 2 redirects + 2 caps + calc-adjust) as one node, the same trade-off as gpt-5.4's decision_policy.

Verdict: A reasonable alternative to fable-5; nothing decisive.

3.6 `mimo-2.5-pro-proposal.md`: Rank 6 (B)

Shape: 3 nodes — guards → acquire → decide → action. Decide folds in all redirects.

Strengths

- C1 ⚠️: Coarsest non-glm proposal. Solves visibility for the front half but decide is still doing the LLM call + 3 redirectors + per-field cap detection.
- C2 ✅: 3-phase migration is explicit.
- C3 ❌: Not addressed.
- C4 ✅: No new state fields.
- C5 ❌: Decomposition + LLM + redirects all in decide. Same monolith problem, slightly smaller.
- C6 ⚠️: Open question "is plan_router a node or routing function?" indicates the design is incomplete.
- C7 ⚠️: Test impact discussed at high level only.

Weaknesses

- The "open questions" section reveals the design isn't fully thought through.
- Folding redirects back into decide partly recreates the original problem.

Verdict: Adopt the front-half split (guards + acquire) and pair it with a richer post-LLM split (gpt-5.4 or opus).

3.7 `gemini-3.1-pro-proposal.md`: Rank 7 (B)

Shape: 3 nodes — prepare_state → evaluate_rules → llm_plan → action.

Strengths

- C1 ❌: evaluate_rules covers SIX gates (region/currency bootstrap, max_iters, calculator cap, calculator success, blocked-task, auto-finish). One node, six concerns; partially recreates the original.
- C2 ⚠️: High-level migration only; no commit-by-commit plan.
- C3 ❌: Not addressed.
- C4 ✅: No new state fields.
- C5 ✅: Pure LLM in llm_plan. Cleanest LLM isolation in the corpus (no post-LLM corrections; they would need a separate node not included).
- C6 ✅: Conceptually clean.
- C7 ✅: Test seams explicit ("test schema composition and business rules natively as pure python functions without mocking out _llm()").

Weaknesses

- Missing post-LLM correction stage entirely. The proposal essentially says "test the LLM separately" but doesn't show where derived-direct redirect, web-preferred redirect, search caps, and ask caps live. This is a load-bearing piece of behavior; omitting it is a design gap.

Verdict: Good for the prep + LLM-isolation idea; incomplete on the corrector side.

3.8 `deepseek-4-pro-proposal.md`: Rank 8 (B−)

Shape: 4 nodes — guard → prepare → plan → adjust → action. plan → plan self-loop for on-demand decomposition.

Strengths

- C1 ✅: Reasonable.
- C2 ✅: 4-phase migration (extract, rewire, remove, verify).
- C3 ❌: Not addressed.
- C4 ❌: Adds recipes: dict[str, FieldAcquisition] to State. FieldAcquisition serialization is unproven (gpt-5.5 explicitly rejects exactly this); the planner state is checkpointer-owned (PlannerState mirror).
- C5 ⚠️: plan → plan self-loop for decomposition is novel but creates a re-prompt loop; debugging "why did plan run twice" gets harder.
- C6 ✅: Routing functions described as small.
- C7 ⚠️: Test impact discussed lightly.

Weaknesses

- recipes in state is a real risk. Today build_dynamic_recipes is called wherever needed; caching it requires serializing FieldAcquisition and keeping it in sync with dynamic_decompositions. The proposal admits "must stay synchronized" without specifying how.
- The plan → plan self-loop is more elegant in theory than in practice; LangGraph re-entrancy through the same node is harder to checkpoint-debug than two distinct nodes.

Verdict: Avoid the recipes state addition and the self-loop. The rest is fine.
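One way to avoid storing recipes in state is to derive them from dynamic_decompositions on demand and memoize outside the checkpoint. The sketch below is hypothetical throughout: the `FieldAcquisition` dataclass is a stand-in for the real class, the shape of `dynamic_decompositions` (field name to list of input names) is assumed, and `build_recipes` is not the codebase's `build_dynamic_recipes`.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class FieldAcquisition:
    """Hypothetical stand-in; the real class lives in the planner codebase."""
    field: str
    inputs: tuple[str, ...]


@lru_cache(maxsize=32)
def _recipes_for(key: tuple[tuple[str, tuple[str, ...]], ...]) -> dict[str, FieldAcquisition]:
    # Derived purely from the key, so the cache cannot drift from its source.
    return {name: FieldAcquisition(name, inputs) for name, inputs in key}


def build_recipes(state: dict) -> dict[str, FieldAcquisition]:
    """Derive recipes from state; nothing is written back into state."""
    decomps = state.get("dynamic_decompositions", {})
    key = tuple(sorted((name, tuple(inputs)) for name, inputs in decomps.items()))
    return _recipes_for(key)


state = {"dynamic_decompositions": {"margin": ["revenue", "cost"]}}
first = build_recipes(state)
second = build_recipes(state)           # cache hit: same object, no rebuild
state["dynamic_decompositions"]["cac"] = ["spend", "customers"]
third = build_recipes(state)            # new key: rebuilt automatically
print(first is second, sorted(third))   # True ['cac', 'margin']
```

Callers should treat the returned dict as read-only, since cache hits share one object. The process-local cache is an optimization only; only dynamic_decompositions needs to survive a checkpoint.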

3.9 `glm-5.1-proposal.md`: Rank 9 (C+)

Shape: 2 nodes — pre_check → acquire_or_plan → action. Post-LLM rewriting stays in acquire_or_plan.

Strengths

- C1 ❌: acquire_or_plan is still ~200 LOC after the split. Half the problem moved, half stayed.
- C2 ✅: 4-phase migration; Phase 4 ("optional: extract validate_decision later") is honest about the deferred work.
- C3 ❌: Not addressed.
- C4 ⚠️: Proposes a transient _pre_check_route field on State, then offers an alternative of extending the Action Literal, and recommends the first. A transient string field on checkpointer state is a smell.
- C5 ❌: Decomposition + LLM + redirects all in acquire_or_plan.
- C6 ✅: route_after_pre_check reads the routing tag.
- C7 ✅: Notes that loop-back edge change does not affect checkpointer keying.

Weaknesses

- "Minimum-viable refactor" is honest framing but does not finish the job. The next refactor (Phase 4) would essentially be a separate proposal.

Verdict: Acceptable as Phase 1 of a larger plan, but on its own it leaves the corrector monolith intact.

3.10 `qwen-3.6-plus-proposal.md`: Rank 10 (C)

Shape: 6+ nodes including dedicated ask_region/ask_currency/observe_region/observe_currency/decompose/route_finish_check.

Strengths

- C2 ✅: 5-phase migration.
- C7 ✅: Explicit benefits table.

Weaknesses

- C1 ❌: Splits observe_user into per-field observers (observe_region, observe_currency). This duplicates the field-routing logic that already exists in observe_user_node (_handle_region_answer / _handle_currency_answer). Kimi-2.6 correctly keeps a single observe_user.
- C3 ❌: Not addressed.
- C5 ⚠️: decompose node is good but the proposal also embeds decomposition lookups in route_after_plan redirect logic, defeating the purpose.
- C6 ❌: route_after_plan is given heavy responsibilities (6 redirect rules including derived-direct, web-preferred, finish→calculate, finish→reflect, search cap, ask cap). This violates the "edges must be pure label functions" principle that gpt-5.4 explicitly calls out.
- route_finish_check is a node whose sole job is to validate that finish is appropriate; today's finish_node already does this via iter_schema_leaves + missing_leaves. Adds a hop with no clear gain.

Verdict: Anti-pattern in C6 (edge function with side-effect-laden routing). Avoid.

3.11 `qwen-3.7-max-proposal.md`: Rank 11 (C−)

Shape: 10 named nodes with full code samples (700+ lines).

Strengths

- C7 ✅: Has the most extensive code-level detail; whoever implements would have ready-to-paste skeletons.

Weaknesses

- C4 ❌: Adds _route: NotRequired[str] and _llm_failed: NotRequired[bool] as transient routing fields on the checkpointer state. The proposal says "exclude from checkpointing and clear after use" but does not show how; LangGraph reducers default to merging, not excluding. This is exactly the smell gpt-5.4 warns against.
- C3 ❌: Not addressed.
- C6 ❌: route_decision is a routing function that calls _emit_decision_event(decision), i.e., a side effect inside an edge function. This is the explicit anti-pattern in gpt-5.4.
- C1 / C5 ⚠️: Calls build_dynamic_recipes in compose_schema, acquisition_routing, check_completion, llm_decide, post_process, enforce_caps: six times per tick instead of once. The current code already rebuilds it in 2–3 places; this refactor doubles the cost rather than reducing it.
- The edge observe_user → check_region_currency bypasses check_termination, which means an interrupt-resume cycle does not re-check max-iters or abort status. Subtle bug.

Verdict: Most ambitious, lowest quality. Several material design errors. Do not adopt as-is.

4. Recommendation

Adopt fable-5-proposal.md as the baseline plan, with two amendments lifted from gpt-5.4-proposal.md and opus-proposal.md.

4.1 Baseline: fable-5

Reasons:

It is the only proposal that confronts checkpoint compatibility (thread namespace bump). Every other proposal silently breaks in-flight planner threads.
The 5-stage shape (tick → prepare → select → decide → guard) is the smallest split that fully separates the five real concerns: lifecycle/caps (tick), state prep (prepare), deterministic acquisition (select), LLM (decide), policy correction (guard).
Migration is mechanical: Phase 1 extracts module-level helper functions inside the existing plan_node; Phase 2 rewires the graph. Both phases pass existing tests if done right.
Minimal state surface change: decision_origin + llm_failed, both plain primitives.
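The two state additions can be sketched as a typed fragment. This is a minimal illustration, assuming the fields are added as optional keys; `StageFlags` and `mark_llm_decision` are hypothetical names, and the JSON round trip only approximates whatever serializer the real checkpointer uses.

```python
import json
from typing import Literal, TypedDict


class StageFlags(TypedDict, total=False):
    """Hypothetical fragment holding only the two proposed fields."""
    decision_origin: Literal["deterministic", "llm"]
    llm_failed: bool


def mark_llm_decision(failed: bool) -> StageFlags:
    """State update a `decide` stage might return after calling the LLM."""
    return {"decision_origin": "llm", "llm_failed": failed}


flags = mark_llm_decision(failed=False)

# Both values are plain primitives, so a JSON round trip is lossless.
restored = json.loads(json.dumps(flags))
print(restored == flags)  # True
```

Because neither field holds a custom object, no serializer hook is needed, which is the property C4 asks for.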

4.2 Amendment A: adopt the anti-patterns guardrail from gpt-5.4

Before writing any code, codify these as ADR/AGENTS.md rules (gpt-5.4 phrases them well):

Conditional edges are pure label functions. No state mutation, no event emission. Today's _emit_decision_event call belongs in a node, not in route_*. (qwen-3.7-max and qwen-3.6-plus both violate this.)
Only tick increments iterations. Otherwise attempt logs and turn-scoped search extraction in run_planner_step() drift.
ask_user remains the only interrupting node. Gates create decisions; they don't pause.
One decision_policy / guard node, not one node per rule. Resist the opus-style fragmentation unless profiling demands it.
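The first rule above separates what a node may do from what an edge function may do. The following is a minimal sketch in plain Python with simplified, hypothetical types; `guard_node` and `route_after_guard` are illustrative names, and the event list stands in for the real event emission.

```python
import copy
from typing import Literal, TypedDict

Action = Literal["search", "ask_user", "calculate", "reflect", "finish"]


class Decision(TypedDict):
    action: Action


class State(TypedDict, total=False):
    decision: Decision
    events: list[str]


def guard_node(state: State) -> State:
    """Node: may return state updates, including the emitted decision event."""
    action = state["decision"]["action"]
    return {"events": state.get("events", []) + [f"decision:{action}"]}


def route_after_guard(state: State) -> Action:
    """Edge: pure label function. Reads state, returns a label, changes nothing."""
    return state["decision"]["action"]


state: State = {"decision": {"action": "calculate"}, "events": []}
before = copy.deepcopy(state)
label = route_after_guard(state)
update = guard_node(state)
print(label, update["events"], state == before)  # calculate ['decision:calculate'] True
```

Purity is cheap to enforce in tests: snapshot the state, call the edge function, and assert the snapshot still matches.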

4.3 Amendment B: adopt the invariant from opus-proposal

Every gate either deposits a PlannerDecision into state["decision"] and short-circuits to the dispatcher, OR hands off to the next stage without setting decision. The dispatcher reads decision.action and routes — same contract as today's route_after_plan.

This invariant makes the migration provably equivalent: at every node boundary, you can write a test that asserts "if decision is set, action ∈ {search, ask_user, calculate, reflect, finish}; if decision is unset, next node was reached". It also means route_after_* functions stay trivial (Amendment A satisfied for free).
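The boundary assertion described above can be sketched as a small harness. Everything here is illustrative: the two gates, their rules, and the `run_gates` helper are hypothetical, and decisions are plain dicts rather than `PlannerDecision` objects.

```python
from typing import Callable

ALLOWED_ACTIONS = {"search", "ask_user", "calculate", "reflect", "finish"}
Gate = Callable[[dict], dict]


def iter_cap_gate(state: dict) -> dict:
    # Hypothetical rule: finish once the iteration budget is spent.
    if state["iterations"] >= state["max_iters"]:
        return {"decision": {"action": "finish"}}
    return {}


def region_gate(state: dict) -> dict:
    # Hypothetical rule: ask the user when the region is still unknown.
    if state.get("region") is None:
        return {"decision": {"action": "ask_user"}}
    return {}


def run_gates(state: dict, gates: list[Gate]) -> tuple[dict, str]:
    """Run gates in order and check the invariant at every boundary."""
    for gate in gates:
        state = {**state, **gate(state)}
        decision = state.get("decision")
        if decision is not None:
            if decision["action"] not in ALLOWED_ACTIONS:
                raise ValueError(f"invalid action: {decision['action']}")
            return state, "dispatch"      # gate deposited a decision: short-circuit
    return state, "next_stage"            # no decision set: hand off


gates = [iter_cap_gate, region_gate]
asked, hop_a = run_gates({"iterations": 1, "max_iters": 5, "region": None}, gates)
passed, hop_b = run_gates({"iterations": 1, "max_iters": 5, "region": "EU"}, gates)
print(hop_a, asked["decision"]["action"], hop_b)  # dispatch ask_user next_stage
```

The same harness can run against the extracted helpers before and after the graph is rewired, which is what makes the equivalence claim testable.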

4.4 Deferred / explicit non-goals

Per-corrector node splitting (opus, qwen-3.7-max). Defer until profiling shows the single guard node is hot or a specific corrector needs its own observability hook.
decompose as a dedicated loop node (fable-5 follow-up, deepseek-4 self-loop). Defer until the planner LLM cost ratio justifies isolating the 2nd/3rd LLM calls.
Splitting observe_user into per-field observers (qwen-3.6-plus). Reject — duplicates existing routing.
Adding recipes to state (deepseek-4-pro). Reject until FieldAcquisition serialization is proven safe.

4.5 Concrete migration order (combining the above)

Step 0 — Guardrails. Add to src/venturescope/planner/AGENTS.md: edge functions are pure label-mappers; tick owns iteration; ask_user is the only interrupt; one policy node not many; the invariant from §4.3. This is the cheapest, highest-leverage change and unblocks reviewer disagreement.
Step 1 — Extract helpers (no graph change). Pull tick, prepare, select, decide, guard bodies out of plan_node as module-level functions called in sequence. Add decision_origin and llm_failed to State + PlannerState. Existing tests stay green. Ship as one commit.
Step 2 — Rewire graph. Register the 5 nodes, add conditional edges, point all action-node returns at tick, delete plan_node and route_after_plan. Bump the planner thread namespace to {conversation_id}:planner:v2 so in-flight threads re-bootstrap cleanly via prior_schema + prior_dynamic_decompositions. Update tests/planner/test_planner_agent.py to target stage nodes.
Step 3 — Docs. Regenerate docs/planner-graph-ref/current-graph.md from the new topology. The "Routing details" prose section largely disappears because the graph now expresses it.
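Step 1 can be pictured as plan_node shrinking to a thin sequencer over the five stage helpers. The sketch below is a simplified assumption, not the real code: the state is a plain dict, the `llm` parameter and the search-cap rule are hypothetical, and the real plan_node signature may differ.

```python
from typing import Callable

State = dict  # simplified stand-in for the planner State TypedDict


def tick(state: State) -> State:
    # Only tick increments iterations; it also clears the previous decision.
    return {"iterations": state.get("iterations", 0) + 1, "decision": None}


def prepare(state: State) -> State:
    # Placeholder for region/currency bootstrap and schema composition.
    return {}


def select(state: State) -> State:
    # Deterministic acquisition: hypothetical rule for a blocked field.
    if state.get("blocked_field"):
        return {"decision": {"action": "search"}, "decision_origin": "deterministic"}
    return {}


def decide(state: State, llm: Callable[[State], str]) -> State:
    if state.get("decision") is not None:
        return {}  # a deterministic stage already decided, so skip the LLM
    return {"decision": {"action": llm(state)}, "decision_origin": "llm", "llm_failed": False}


def guard(state: State) -> State:
    # Policy correction applies only to LLM decisions (hypothetical search cap).
    llm_search = state.get("decision_origin") == "llm" and state["decision"]["action"] == "search"
    if llm_search and state.get("search_count", 0) >= state.get("search_cap", 3):
        return {"decision": {"action": "ask_user"}}
    return {}


def plan_node(state: State, llm: Callable[[State], str]) -> State:
    """Step 1 shape: still one graph node, stage helpers called in sequence."""
    for stage in (tick, prepare, select, lambda s: decide(s, llm), guard):
        state = {**state, **stage(state)}
    return state


capped = plan_node({"search_count": 3}, llm=lambda s: "search")
blocked = plan_node({"blocked_field": "revenue"}, llm=lambda s: "finish")
print(capped["decision"], blocked["decision_origin"])
```

Since plan_node keeps its name and stays a single node in this step, tests that import it directly keep working, and Step 2 only has to register the same helpers as separate nodes.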

5. Summary Table

| Proposal | Nodes | Migration plan | Addresses checkpoint compat | New state | Verdict |
|---|---|---|---|---|---|
| fable-5 | 5 | Helpers → rewire → docs | ✅ (namespace bump) | `decision_origin`, `llm_failed` | Adopt as baseline |
| gpt-5.4 | 6 | 6 staged seams | ❌ | none | Adopt principles |
| gpt-5.5 | 8 | 4-step + test map | ❌ | optional `llm_failed` | Verbose gpt-5.4 |
| opus | 13+ | 3 steps | ❌ | `llm_failed` | Adopt invariant + naming |
| kimi-2.6 | 8 | 3-phase | ❌ | none | Alternative to fable-5 |
| mimo-2.5-pro | 3 | 3-phase | ❌ | none | Half-finished |
| gemini-3.1-pro | 3 | High-level | ❌ | none | Missing corrector stage |
| deepseek-4-pro | 4 | 4-phase | ❌ | `recipes` (risky) | Avoid `recipes` |
| glm-5.1 | 2 | 4-phase | ❌ | transient `_pre_check_route` | Phase-1-only |
| qwen-3.6-plus | 6+ | 5-phase | ❌ | none | Violates edge purity |
| qwen-3.7-max | 10 | 5-phase | ❌ | transient `_route`, `_llm_failed` | Multiple anti-patterns |

Bottom line: ship fable-5 with the gpt-5.4 anti-patterns guardrail and the opus invariant, in three commits, with a thread-namespace bump to protect in-flight conversations.
