Planner graph refactor proposals: ranked analysis

Scope and evaluation lens

I ranked the proposals against the current implementation in src/venturescope/planner/agent.py, src/venturescope/planner/__init__.py, src/venturescope/planner/schema.py, and the real test surface in tests/planner/test_planner_decisions.py and tests/planner/test_planner_runner.py.

The main criteria were:

  1. Behavior-order preservation: region/currency first, schema/decomposition before planning, calculator-before-finish, retry caps before repeated loops.
  2. Checkpoint safety: serializable state only, additive PlannerState changes only, no hidden edge mutations, clear story for in-flight threads.
  3. Interrupt safety: keep ask_user -> observe_user and the current interrupt payload shape stable.
  4. Graph honesty without graph spam: expose real durable phases, but do not turn every if into a node.
  5. Migration realism: can be landed incrementally without breaking run_planner_step() or direct plan_node() tests.
  6. Proposal accuracy: does it match the current code and test surface, or does it assume files/contracts that are not there?

High-level findings

  • The core diagnosis is correct across almost all proposals: plan_node() is doing too much.
  • The best proposals move durable orchestration phases into the graph while keeping domain rules in helpers.
  • The weakest proposals either:
      • push policy mutations into routing functions,
      • add too many checkpoint hops,
      • or invent transient state / file structure that does not match the current codebase.
  • One practical reality matters a lot here: the planner is checkpoint-owned, and run_planner_step() depends on stable iteration semantics and stable interrupt/resume behavior.

Tiered ranking

Rank  Proposal                     Tier  Verdict
1     gpt-5.4-proposal.md          A     Best base architecture
2     fable-5-proposal.md          A     Best operational supplement
3     gpt-5.5-proposal.md          A     Most complete engineering writeup
4     opus-proposal.md             A-    Strong but over-split
5     deepseek-4-pro-proposal.md   B+    Solid 4-stage shape, risky state choices
6     glm-5.1-proposal.md          B     Good transition plan, weaker end state
7     kimi-2.6-proposal.md         B-    Reasonable structure, too much ceremony
8     gemini-3.1-pro-proposal.md   C+    Directionally right, too shallow
9     mimo-2.5-pro-proposal.md     C     Internal inconsistencies hurt it
10    qwen-3.6-plus-proposal.md    D     Wrong place for policy logic
11    qwen-3.7-max-proposal.md     D-    Over-engineered and implementation-shaky

Detailed ranking

1. gpt-5.4-proposal.md

Why it ranks first

It finds the best balance between graph visibility and graph size. The proposed split - tick -> bootstrap_gate -> prepare_context -> acquisition_gate -> llm_plan -> decision_policy - maps well to the actual planner lifecycle without exploding the graph into a checkpoint-heavy policy chain.

Good sides

  • Keeps one loop-entry owner for iterations, which matches how run_planner_step() derives turn-local search history.
  • Keeps ask_user as the only interrupt node, which matches the current ask_user_node() / observe_user_node() contract.
  • Explicitly says conditional edges must stay side-effect free, which is the right LangGraph discipline.
  • Uses one decision_policy node instead of splitting every redirect/cap rule into separate nodes.
  • Preserves outer contracts: thread namespacing, run_planner_step() bootstrap/resume, and the action nodes.
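As a rough illustration of the side-effect-free edge discipline mentioned above, a conditional-edge router can be written as a pure function that only reads state and returns a node name. This is a minimal sketch, not code from agent.py: the next_action field, the route_after_policy name, and the node names other than ask_user are assumptions made for the example.

```python
from typing import Literal, TypedDict


class PlannerState(TypedDict, total=False):
    # Hypothetical subset of fields; the real PlannerState lives in schema.py.
    next_action: str
    iterations: int


NextNode = Literal["ask_user", "search", "calculate", "finish"]


def route_after_policy(state: PlannerState) -> NextNode:
    """Read-only router: map an already-decided action to a node name.

    It never writes to state. Caps, redirects, and other policy rewrites
    are assumed to have happened earlier, inside a real node.
    """
    action = state.get("next_action", "finish")
    if action == "ask_user":
        return "ask_user"
    if action == "search":
        return "search"
    if action == "calculate":
        return "calculate"
    # Unknown or missing actions fall through to finish (an assumption).
    return "finish"
```

Because the router has no side effects, replaying it after a resume gives the same answer and leaves the checkpointed state untouched.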

Bad sides

  • It does not fully solve the current llm_failed handling by itself; you still need a durable way to preserve the “do not redirect failed LLM finish into reflect” rule.
  • It does not say enough about in-flight checkpoint migration when node names change.
  • It leaves the team to notice on its own that the direct plan_node tests live in tests/planner/test_planner_decisions.py, not in the files some proposals cite.

2. fable-5-proposal.md

Why it ranks second

This is the strongest proposal on checkpoint and state realism. Its tick / prepare / select / decide / guard shape is still compact, and its decision_origin + llm_failed additions solve a real current behavior split cleanly.

Good sides

  • Correctly notices that the current plan_node() can perform multiple LLM calls in one checkpoint step, which is a resume/replay problem.
  • Adds decision_origin and llm_failed in a way that cleanly separates deterministic decisions from LLM-produced ones.
  • Keeps ask_user as the shared interrupt surface and does not invent separate region/currency observe nodes.
  • Explicitly calls out mid-graph checkpoint compatibility and proposes a credible rollout strategy.
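To make the decision_origin + llm_failed idea concrete, here is a hedged sketch of how a policy step could keep the "do not redirect a failed LLM finish into reflect" rule once both values are plain state fields. The guard_finish name, the wants_reflect flag, and the "llm" origin label are hypothetical; only the two field names and the rule itself come from the proposals and the current behavior.

```python
from typing import Optional, TypedDict


class Decision(TypedDict):
    action: str
    reason: str


def guard_finish(
    decision: Decision,
    *,
    decision_origin: Optional[str],
    llm_failed: bool,
    wants_reflect: bool,
) -> Decision:
    """Return the decision to execute; the input dict is never mutated."""
    if decision["action"] != "finish":
        return decision
    # A finish that came from a failed LLM call must not be redirected
    # into reflect; it passes through unchanged.
    if decision_origin == "llm" and llm_failed:
        return decision
    # wants_reflect stands in for whatever condition currently triggers
    # the finish -> reflect redirect (an assumption for this sketch).
    if wants_reflect:
        return {"action": "reflect", "reason": "finish redirected by policy"}
    return decision
```

Since both inputs are a plain str and bool, the rule survives a checkpoint boundary between the LLM node and the policy node without any hidden in-memory flag.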

Bad sides

  • Slightly less clean than gpt-5.4 conceptually because select and guard together still carry quite a lot of policy weight.
  • The thread-namespace bump suggestion is operationally valid, but it is also a real migration cost.
  • It still leaves a couple of decomposition-triggered LLM calls inside non-decide stages.

3. gpt-5.5-proposal.md

Why it ranks third

This is the most complete engineering proposal overall. It is careful about PlannerState, retry behavior, and not serializing unsafe helper objects. I rank it just below the top two because it is a bit more granular than needed for the first implementation.

Good sides

  • Excellent state discipline: explicitly warns against caching prepared_recipes / FieldAcquisition objects in planner state.
  • Splits decision handling into meaningful buckets: normalize_decision, retry_gate, maybe_calculate.
  • Preserves the outer conversation graph and current public planner surface.
  • Migration plan is practical: extract helpers first, then rewire graph phases.
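The helper buckets named above could start life as plain functions called in sequence, before any of them becomes a graph node. The sketch below is illustrative only: the helper names come from the proposal, but their bodies, the allowed action set, the cap of 3, and the fallback choices are assumptions, not the current plan_node() logic.

```python
from typing import TypedDict

# Assumed action vocabulary for the example.
ALLOWED_ACTIONS = {"ask_user", "search", "calculate", "finish"}


class Decision(TypedDict):
    action: str


def normalize_decision(raw: dict) -> Decision:
    """Clean up a raw decision; unknown actions fall back to finish."""
    action = str(raw.get("action", "")).strip().lower()
    return {"action": action if action in ALLOWED_ACTIONS else "finish"}


def retry_gate(decision: Decision, search_attempts: int, max_attempts: int = 3) -> Decision:
    """Apply the retry cap before another search loop is allowed."""
    if decision["action"] == "search" and search_attempts >= max_attempts:
        return {"action": "ask_user"}
    return decision


def maybe_calculate(decision: Decision, calculation_pending: bool) -> Decision:
    """Enforce calculator-before-finish."""
    if decision["action"] == "finish" and calculation_pending:
        return {"action": "calculate"}
    return decision


def decide(raw: dict, *, search_attempts: int, calculation_pending: bool) -> Decision:
    """Run the helper seams in a fixed order, as one node would."""
    decision = normalize_decision(raw)
    decision = retry_gate(decision, search_attempts)
    return maybe_calculate(decision, calculation_pending)
```

Each helper can be unit-tested directly, which keeps the option open to promote one of them to a node later without rewriting its logic.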

Bad sides

  • The proposed node count is a bit high for a planner that already writes through a Postgres-backed checkpointer.
  • Separate require_region and require_currency nodes are not wrong, but they do not add much compared with a single bootstrap gate.
  • Like gpt-5.4, it does not go deep enough on live checkpoint migration when topology changes.

4. opus-proposal.md

Why it ranks fourth

It is a very strong analysis and maps the current code carefully, but it over-optimizes for graph explicitness. The g_* + c_* chain tells the truth, but it adds more checkpoint boundaries than this planner needs.

Good sides

  • Excellent mapping from current line-level responsibilities to future nodes.
  • Strong insistence that plan_llm should become a single-purpose LLM node.
  • Clear migration staging and good awareness of the current llm_failed guard.
  • Central dispatch idea is clean.

Bad sides

  • Too many corrector nodes for this use case; each extra step becomes another checkpoint write.
  • The graph becomes more verbose than the human mental model needs.
  • Good for explanation, slightly worse for production pragmatism.

5. deepseek-4-pro-proposal.md

Why it ranks fifth

The guard -> prepare -> plan -> adjust pipeline is a solid medium-granularity design. It is easy to follow and keeps post-LLM normalization in a dedicated node.

Good sides

  • Clean 4-stage shape.
  • Correctly sends deterministic acquisition and auto-finish decisions through the same adjust/policy stage.
  • Good migration staging from helper extraction to graph rewiring.

Bad sides

  • Adds recipes to state, which is risky in this codebase because planner state is serializer-visible and recipe objects are not part of the current state contract.
  • Slightly optimistic about the cost of synchronizing cached recipes with dynamic decompositions.
  • Less explicit than the top proposals about interrupt/checkpoint invariants.

6. glm-5.1-proposal.md

Why it ranks sixth

This is a sensible transitional proposal, but not the best final architecture. pre_check is useful; acquire_or_plan remains too large.

Good sides

  • Very pragmatic first move: split out pre_check and get guards/bootstrap/calculator checks out of plan_node.
  • Realistic about leaving post-LLM validation inline for a first pass.
  • Good helper-extraction idea for output assembly.

Bad sides

  • Final end state is still too monolithic.
  • Relies on route-tag style state additions for control flow.
  • The claim that node-name changes are mostly harmless for checkpointing is too casual for this planner.

7. kimi-2.6-proposal.md

Why it ranks seventh

There is a lot to like here, but it adds ceremony that does not buy enough. Dedicated ask_region / ask_currency nodes and a route_direct adapter feel heavier than the current planner needs.

Good sides

  • Good distinction between deterministic acquisition and LLM decision making.
  • Sensible tick ownership of iteration counting.
  • enforce_policy as a single policy node is directionally strong.

Bad sides

  • Extra bootstrap nodes add graph surface even though observe_user_node() already handles region/currency specially.
  • route_direct is an adapter node that mostly exists because of the graph shape, not because the domain needs it.
  • Less careful than the top tier on state-contract and checkpoint details.

8. gemini-3.1-pro-proposal.md

Why it ranks eighth

It captures the main problem correctly, but it is too shallow for the real planner. The current code has harder edges around retry caps, calculator adjustment, blocked-field decomposition, and LLM-failure handling than this proposal accounts for.

Good sides

  • Simple and readable three-stage split.
  • Good instinct to isolate the LLM node.
  • Easy to explain and easy to start from.

Bad sides

  • Under-specifies post-LLM normalization.
  • Does not address current llm_failed behavior.
  • Does not really grapple with checkpoint migration or current direct tests.

9. mimo-2.5-pro-proposal.md

Why it ranks ninth

It has some good instincts, but it contains enough internal inconsistency that I would not use it as a base.

Good sides

  • Sensible wish to split guards, acquisition, and decision-making.
  • Reasonable staged migration idea.

Bad sides

  • The proposed graph is internally inconsistent: the diagram and text do not line up cleanly around acquisition vs LLM decision flow.
  • It under-specifies auto-finish and blocked-calculator details.
  • The plan_router concept feels like plumbing added to compensate for an unclear node split.

10. qwen-3.6-plus-proposal.md

Why it ranks tenth

This is one of the weakest architectural directions, even though it has some good observations. Its biggest mistake is moving too much real policy into routing functions.

Good sides

  • Correctly sees that bootstrap and decomposition are different concerns.
  • Tries to make the graph more explicit.

Bad sides

  • Pushes redirects, caps, and finish recalculation into route_after_plan, which is the wrong place for stateful policy.
  • Adds separate ask_region / ask_currency / observe_region / observe_currency nodes even though the current generic user-answer path already supports those cases.
  • Introduces route_finish_check, which overlaps awkwardly with the current finish_node() contract.
  • Says no new state fields are needed, but current LLM-failure semantics strongly suggest otherwise.

11. qwen-3.7-max-proposal.md

Why it ranks last

It is extremely detailed, but the detail hides a weaker fit to this codebase. It over-splits the graph, leans on transient routing state, and has implementation-level gaps.

Good sides

  • Very thorough mapping of planner concerns.
  • Good instinct that guard stages should be visible.
  • Strong attention to testability in principle.

Bad sides

  • Too many phases for a planner that already relies on checkpoint persistence and chat-paced loops.
  • Uses _route / _llm_failed transients and suggests excluding them from checkpointing, but the current planner state is serializer-driven; that is not a free capability.
  • Misses practical implementation details - for example, route_decision is treated like a graph step, but the proposal does not give it the same clean implementation story as the real nodes around it.
  • Feels closer to a framework exercise than a good fit for the current planner.

Cross-proposal observations that changed the ranking

1. Current test reality matters

Several proposals reference tests/planner/test_planner_agent.py, but the real direct plan_node() coverage is in tests/planner/test_planner_decisions.py. That matters because rename-heavy or topology-heavy proposals must account for those imports and fixtures.

2. observe_user_node() already owns bootstrap answer parsing

The current implementation already special-cases core.region and core.currency inside observe_user_node(). Proposals that create dedicated region/currency observe nodes are adding structure that the codebase does not need.

3. New planner state must be cheap and serializer-safe

Adding a plain bool/str field like llm_failed or decision_origin is fine. Adding cached recipe objects or pretending some returned fields are “transient only” is much weaker unless the serializer mirror is updated explicitly.
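One cheap way to test this distinction is a round-trip check on any proposed state update. The sketch below uses JSON as a stand-in for the planner's real serializer, which is an assumption; FakeRecipe is a hypothetical placeholder for a rich helper object, not the project's recipe type.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class FakeRecipe:
    """Stand-in for a rich helper object; not the project's real recipe type."""

    field_id: str


def is_plain_state(update: Dict[str, Any]) -> bool:
    """Return True if every value survives a JSON round trip unchanged."""
    try:
        return json.loads(json.dumps(update)) == update
    except (TypeError, ValueError):
        # json.dumps raises TypeError for objects it cannot encode.
        return False


ok_update = {"llm_failed": False, "decision_origin": "llm"}
risky_update = {"recipes": [FakeRecipe("core.region")]}
```

Here is_plain_state(ok_update) is True, while is_plain_state(risky_update) is False because the dataclass instance cannot be encoded without extra serializer support.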

4. Search/noop behavior is part of the design problem

The planner can still reach search when the configured backend degrades to NoopSearch. Good proposals preserve deterministic fallback/cap behavior around that; weak proposals treat search as if it were always a productive live tool.
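A deterministic cap around empty searches might look like the following sketch. The NoopSearch class here is a simplified stand-in written for the example, and the max_empty value, the "plan" label for returning to the planner, and the ask_user fallback are all assumptions rather than the planner's actual settings.

```python
from typing import List, Tuple


class NoopSearch:
    """Simplified stand-in for a degraded backend that never returns results."""

    def search(self, query: str) -> List[str]:
        return []


def action_after_search(
    results: List[str], empty_streak: int, max_empty: int = 2
) -> Tuple[str, int]:
    """Return (next_action, new_empty_streak) with no randomness or I/O."""
    if results:
        # A productive search resets the streak and returns to planning.
        return "plan", 0
    streak = empty_streak + 1
    if streak >= max_empty:
        # Cap reached: stop looping on search and ask the user instead.
        return "ask_user", streak
    return "plan", streak
```

With a NoopSearch backend the first empty result returns to planning with a streak of 1, and the second one hits the cap and routes to ask_user, so the loop cannot spin on an unproductive tool.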

Best recommendation

Use gpt-5.4-proposal.md as the base architecture.

That is the best default shape for this planner:

  • tick
  • bootstrap_gate
  • prepare_context
  • acquisition_gate
  • llm_plan
  • decision_policy
  • existing action nodes unchanged

Do not adopt any single proposal literally. The best solution is a hybrid:

  1. Topology from gpt-5.4
       • best node granularity
       • best graph honesty / noise ratio
       • best LangGraph edge discipline

  2. State discipline from fable-5
       • add decision_origin
       • add llm_failed
       • reset them at loop entry
       • keep them additive in PlannerState

  3. Helper partitioning from gpt-5.5
       • keep normalize_decision, retry handling, and calculator adjustment as internal helper seams
       • do not necessarily make each one a separate graph node in the first refactor

  4. Migration sequencing from opus / gpt-5.5
       • extract helpers inside current plan_node() first
       • rewire graph only after helper seams are tested
       • then update docs and tests
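The state-discipline part of this hybrid (add decision_origin and llm_failed, reset them at loop entry, keep them additive) can be sketched as a loop-entry node that returns a partial update. This is a sketch under assumptions: the iterations field name and the exact shape of PlannerState are guesses for illustration, and the real definition lives in schema.py.

```python
from typing import Optional, TypedDict


class PlannerState(TypedDict, total=False):
    # total=False keeps the new fields optional, so checkpoints written
    # before they existed still satisfy the type.
    iterations: int
    decision_origin: Optional[str]
    llm_failed: bool


def tick(state: PlannerState) -> PlannerState:
    """Loop-entry node: bump the counter and clear per-iteration flags."""
    return {
        # .get() with a default tolerates older checkpoints that lack a field.
        "iterations": state.get("iterations", 0) + 1,
        "decision_origin": None,
        "llm_failed": False,
    }
```

Resetting both flags in one place means a stale llm_failed from a previous iteration cannot leak into the next policy decision.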

What I would explicitly avoid

  • Caching recipes in planner state.
  • Moving policy rewrites into route_after_* functions.
  • Splitting region/currency into dedicated observe-node subloops.
  • Turning every corrector into its own checkpointed node.
  • Assuming topology changes are free for in-flight planner threads.

Final advice

If the goal is the best long-term planner graph, choose the gpt-5.4 shape.

If the goal is the safest real rollout, implement a gpt-5.4 + fable-5 hybrid, with gpt-5.5-style helper extraction as the migration path.

That combination gives the best balance of:

  • explicit graph phases,
  • preserved interrupt/resume behavior,
  • minimal state-risk,
  • testable decision seams,
  • and a rollout story that respects LangGraph checkpoints instead of hand-waving them away.