Planner graph refactor proposals: ranked analysis
Scope and evaluation lens
I ranked the proposals against the current implementation in src/venturescope/planner/agent.py, src/venturescope/planner/__init__.py, src/venturescope/planner/schema.py, and the real test surface in tests/planner/test_planner_decisions.py and tests/planner/test_planner_runner.py.
The main criteria were:
- Behavior-order preservation: region/currency first, schema/decomposition before planning, calculator-before-finish, retry caps before repeated loops.
- Checkpoint safety: serializable state only, additive `PlannerState` changes only, no hidden edge mutations, clear story for in-flight threads.
- Interrupt safety: keep `ask_user -> observe_user` and the current interrupt payload shape stable.
- Graph honesty without graph spam: expose real durable phases, but do not turn every `if` into a node.
- Migration realism: can be landed incrementally without breaking `run_planner_step()` or direct `plan_node()` tests.
- Proposal accuracy: does it match the current code and test surface, or does it assume files/contracts that are not there?
High-level findings
- The core diagnosis is correct across almost all proposals: `plan_node()` is doing too much.
- The best proposals move durable orchestration phases into the graph while keeping domain rules in helpers.
- The weakest proposals either:
  - push policy mutations into routing functions,
  - add too many checkpoint hops,
  - or invent transient state / file structure that does not match the current codebase.
- One practical reality matters a lot here: the planner is checkpoint-owned, and `run_planner_step()` depends on stable iteration semantics and stable interrupt/resume behavior.
Tiered ranking
| Rank | Proposal | Tier | Verdict |
|---|---|---|---|
| 1 | `gpt-5.4-proposal.md` | A | Best base architecture |
| 2 | `fable-5-proposal.md` | A | Best operational supplement |
| 3 | `gpt-5.5-proposal.md` | A | Most complete engineering writeup |
| 4 | `opus-proposal.md` | A- | Strong but over-split |
| 5 | `deepseek-4-pro-proposal.md` | B+ | Solid 4-stage shape, risky state choices |
| 6 | `glm-5.1-proposal.md` | B | Good transition plan, weaker end state |
| 7 | `kimi-2.6-proposal.md` | B- | Reasonable structure, too much ceremony |
| 8 | `gemini-3.1-pro-proposal.md` | C+ | Directionally right, too shallow |
| 9 | `mimo-2.5-pro-proposal.md` | C | Internal inconsistencies hurt it |
| 10 | `qwen-3.6-plus-proposal.md` | D | Wrong place for policy logic |
| 11 | `qwen-3.7-max-proposal.md` | D- | Over-engineered and implementation-shaky |
Detailed ranking
1. gpt-5.4-proposal.md
Why it ranks first
It finds the best balance between graph visibility and graph size. The proposed split - tick -> bootstrap_gate -> prepare_context -> acquisition_gate -> llm_plan -> decision_policy - maps well to the actual planner lifecycle without exploding the graph into a checkpoint-heavy policy chain.
Good sides
- Keeps one loop-entry owner for `iterations`, which matches how `run_planner_step()` derives turn-local search history.
- Keeps `ask_user` as the only interrupt node, which matches the current `ask_user_node()`/`observe_user_node()` contract.
- Explicitly says conditional edges must stay side-effect free, which is the right LangGraph discipline.
- Uses one `decision_policy` node instead of splitting every redirect/cap rule into separate nodes.
- Preserves outer contracts: thread namespacing, `run_planner_step()` bootstrap/resume, and the action nodes.
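The edge-discipline point can be illustrated without any LangGraph imports. This is a minimal sketch under stated assumptions: `SketchState`, its `decision` and `pending_calculation` fields, and both function bodies are hypothetical stand-ins, not the planner's real state or the proposal's code. The policy node returns a state update; the router only reads state and returns a node name.

```python
from typing import TypedDict


class SketchState(TypedDict, total=False):
    # Hypothetical fields for illustration; not the real PlannerState.
    decision: str
    pending_calculation: bool


def decision_policy(state: SketchState) -> dict:
    """Node: applies policy and returns a state update. Rewrites happen here."""
    if state.get("decision") == "finish" and state.get("pending_calculation", False):
        # Calculator-before-finish: the decision is rewritten inside a node.
        return {"decision": "calculate"}
    return {}


def route_after_policy(state: SketchState) -> str:
    """Conditional edge: reads state and returns a node name; it changes nothing."""
    return state.get("decision", "finish")


state: SketchState = {"decision": "finish", "pending_calculation": True}
before = dict(state)
next_without_policy = route_after_policy(state)  # the router alone never rewrites
unchanged_by_router = dict(state) == before
state.update(decision_policy(state))  # the node's update is merged into state
next_node = route_after_policy(state)
```

Because the router is a pure function of state, replaying it after a resume yields the same edge, while every rewrite is recorded as a node update that a checkpointer can persist.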
Bad sides
- It does not fully solve the current `llm_failed` handling by itself; you still need a durable way to preserve the “do not redirect failed LLM finish into reflect” rule.
- It does not say enough about in-flight checkpoint migration when node names change.
- It assumes the team will notice that direct `plan_node` tests live in `tests/planner/test_planner_decisions.py`, not the file names some proposals cite.
2. fable-5-proposal.md
Why it ranks second
This is the strongest proposal on checkpoint and state realism. Its tick / prepare / select / decide / guard shape is still compact, and its decision_origin + llm_failed additions solve a real current behavior split cleanly.
Good sides
- Correctly notices that the current `plan_node()` can perform multiple LLM calls in one checkpoint step, which is a resume/replay problem.
- Adds `decision_origin` and `llm_failed` in a way that cleanly separates deterministic decisions from LLM-produced ones.
- Keeps `ask_user` as the shared interrupt surface and does not invent separate region/currency observe nodes.
- Explicitly calls out mid-graph checkpoint compatibility and proposes a credible rollout strategy.
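A rough sketch of how the two additive fields could behave, assuming plain `str`/`bool` values. `PlannerStateSketch`, the `"deterministic"`/`"llm"` literals, and the simplified redirect condition in `guard` are illustrative assumptions; the real `PlannerState` and the proposal's exact rules may differ.

```python
from typing import Literal, TypedDict


class PlannerStateSketch(TypedDict, total=False):
    # Existing planner fields are elided; only the additive ones are shown.
    decision: str
    decision_origin: Literal["deterministic", "llm"]
    llm_failed: bool


def tick(state: PlannerStateSketch) -> dict:
    """Loop entry: reset per-iteration markers so stale values never leak forward."""
    return {"decision_origin": "deterministic", "llm_failed": False}


def guard(state: PlannerStateSketch) -> dict:
    """Leave a finish produced by a failed LLM call alone; never send it to reflect."""
    if state.get("decision") == "finish" and not state.get("llm_failed", False):
        # Simplified: the healthy path redirects unconditionally in this sketch.
        return {"decision": "reflect"}
    return {}


failed = guard({"decision": "finish", "decision_origin": "llm", "llm_failed": True})
healthy = guard({"decision": "finish", "decision_origin": "llm", "llm_failed": False})
```

Both fields are optional (`total=False`), so checkpoints written before the fields existed still load, and `guard` treats a missing `llm_failed` as `False`.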
Bad sides
- Slightly less clean than `gpt-5.4` conceptually because `select` and `guard` together still carry quite a lot of policy weight.
- The thread-namespace bump suggestion is operationally valid, but it is also a real migration cost.
- It still leaves a couple of decomposition-triggered LLM calls inside non-`decide` stages.
3. gpt-5.5-proposal.md
Why it ranks third
This is the most complete engineering proposal overall. It is careful about PlannerState, retry behavior, and not serializing unsafe helper objects. I rank it just below the top two because it is a bit more granular than needed for the first implementation.
Good sides
- Excellent state discipline: explicitly warns against caching `prepared_recipes`/`FieldAcquisition` objects in planner state.
- Splits decision handling into meaningful buckets: `normalize_decision`, `retry_gate`, `maybe_calculate`.
- Preserves the outer conversation graph and current public planner surface.
- Migration plan is practical: extract helpers first, then rewire graph phases.
Bad sides
- The proposed node count is a bit high for a planner that already writes through a Postgres-backed checkpointer.
- Separate `require_region` and `require_currency` nodes are not wrong, but they do not add much compared with a single bootstrap gate.
- Like `gpt-5.4`, it does not go deep enough on live checkpoint migration when topology changes.
4. opus-proposal.md
Why it ranks fourth
It is a very strong analysis and maps the current code carefully, but it over-optimizes for graph explicitness. The g_* + c_* chain tells the truth, but it adds more checkpoint boundaries than this planner needs.
Good sides
- Excellent mapping from current line-level responsibilities to future nodes.
- Strong insistence that `plan_llm` should become a single-purpose LLM node.
- Clear migration staging and good awareness of the current `llm_failed` guard.
- Central `dispatch` idea is clean.
Bad sides
- Too many corrector nodes for this use case; each extra step becomes another checkpoint write.
- The graph becomes more verbose than the human mental model needs.
- Good for explanation, slightly worse for production pragmatism.
5. deepseek-4-pro-proposal.md
Why it ranks fifth
The guard -> prepare -> plan -> adjust pipeline is a solid medium-granularity design. It is easy to follow and keeps post-LLM normalization in a dedicated node.
Good sides
- Clean 4-stage shape.
- Correctly sends deterministic acquisition and auto-finish decisions through the same adjust/policy stage.
- Good migration staging from helper extraction to graph rewiring.
Bad sides
- Adds `recipes` to state, which is risky in this codebase because planner state is serializer-visible and recipe objects are not part of the current state contract.
- Slightly optimistic about the cost of synchronizing cached recipes with dynamic decompositions.
- Less explicit than the top proposals about interrupt/checkpoint invariants.
6. glm-5.1-proposal.md
Why it ranks sixth
This is a sensible transitional proposal, but not the best final architecture. pre_check is useful; acquire_or_plan remains too large.
Good sides
- Very pragmatic first move: split out `pre_check` and get guards/bootstrap/calculator checks out of `plan_node`.
- Realistic about leaving post-LLM validation inline for a first pass.
- Good helper-extraction idea for output assembly.
Bad sides
- Final end state is still too monolithic.
- Relies on route-tag style state additions for control flow.
- The claim that node-name changes are mostly harmless for checkpointing is too casual for this planner.
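To make the node-name concern concrete, here is a hedged sketch of a pre-resume check. Everything in it is an assumption for illustration: the node sets, the idea that the caller can read a thread's pending node names from its checkpoint, and the `namespace:thread_id` format. It only shows the decision that a topology change forces, not how the real planner or LangGraph implements it.

```python
from __future__ import annotations

# Hypothetical node sets; the real graph's names may differ.
OLD_NODES = {"plan", "ask_user", "observe_user", "search", "finish"}
NEW_NODES = {
    "tick", "bootstrap_gate", "prepare_context", "acquisition_gate",
    "llm_plan", "decision_policy", "ask_user", "observe_user",
    "search", "finish",
}

# Names that in-flight checkpoints may still reference after the refactor.
removed = sorted(OLD_NODES - NEW_NODES)


def stale_nodes(pending: list[str], known: set[str]) -> list[str]:
    """Return pending node names that the current graph no longer defines."""
    return [name for name in pending if name not in known]


def resume_thread_id(thread_id: str, pending: list[str], namespace: str) -> str:
    """Keep the thread if it can resume; otherwise move it to a fresh namespace."""
    if stale_nodes(pending, NEW_NODES):
        return f"{namespace}:{thread_id}"
    return thread_id
```

A thread paused at `ask_user` keeps its id because that node survives the refactor, whereas a thread whose checkpoint points at the old `plan` node would be restarted under the bumped namespace.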
7. kimi-2.6-proposal.md
Why it ranks seventh
There is a lot to like here, but it adds ceremony that does not buy enough. Dedicated ask_region / ask_currency nodes and a route_direct adapter feel heavier than the current planner needs.
Good sides
- Good distinction between deterministic acquisition and LLM decision making.
- Sensible `tick` ownership of iteration counting.
- `enforce_policy` as a single policy node is directionally strong.
Bad sides
- Extra bootstrap nodes add graph surface even though `observe_user_node()` already handles region/currency specially.
- `route_direct` is an adapter node that mostly exists because of the graph shape, not because the domain needs it.
- Less careful than the top tier on state-contract and checkpoint details.
8. gemini-3.1-pro-proposal.md
Why it ranks eighth
It captures the main problem correctly, but it is too shallow for the real planner. The current code has harder edges around retry caps, calculator adjustment, blocked-field decomposition, and LLM-failure handling than this proposal accounts for.
Good sides
- Simple and readable three-stage split.
- Good instinct to isolate the LLM node.
- Easy to explain and easy to start from.
Bad sides
- Under-specifies post-LLM normalization.
- Does not address current `llm_failed` behavior.
- Does not really grapple with checkpoint migration or current direct tests.
9. mimo-2.5-pro-proposal.md
Why it ranks ninth
It has some good instincts, but it contains enough internal inconsistency that I would not use it as a base.
Good sides
- Sensible wish to split guards, acquisition, and decision-making.
- Reasonable staged migration idea.
Bad sides
- The proposed graph is internally inconsistent: the diagram and text do not line up cleanly around acquisition vs LLM decision flow.
- It under-specifies auto-finish and blocked-calculator details.
- The `plan_router` concept feels like plumbing added to compensate for an unclear node split.
10. qwen-3.6-plus-proposal.md
Why it ranks tenth
This is one of the weakest architectural directions, even though it has some good observations. Its biggest mistake is moving too much real policy into routing functions.
Good sides
- Correctly sees that bootstrap and decomposition are different concerns.
- Tries to make the graph more explicit.
Bad sides
- Pushes redirects, caps, and finish recalculation into `route_after_plan`, which is the wrong place for stateful policy.
- Adds separate `ask_region`/`ask_currency`/`observe_region`/`observe_currency` nodes even though the current generic user-answer path already supports those cases.
- Introduces `route_finish_check`, which overlaps awkwardly with the current `finish_node()` contract.
- Says no new state fields are needed, but current LLM-failure semantics strongly suggest otherwise.
11. qwen-3.7-max-proposal.md
Why it ranks last
It is extremely detailed, but the detail hides a weaker fit to this codebase. It over-splits the graph, leans on transient routing state, and has implementation-level gaps.
Good sides
- Very thorough mapping of planner concerns.
- Good instinct that guard stages should be visible.
- Strong attention to testability in principle.
Bad sides
- Too many phases for a planner that already relies on checkpoint persistence and chat-paced loops.
- Uses `_route`/`_llm_failed` transients and suggests excluding them from checkpointing, but the current planner state is serializer-driven; that is not a free capability.
- Misses practical implementation details - for example, `route_decision` is treated like a graph step, but the proposal does not give it the same clean implementation story as the real nodes around it.
- Feels closer to a framework exercise than a good fit for the current planner.
Cross-proposal observations that changed the ranking
1. Current test reality matters
Several proposals reference tests/planner/test_planner_agent.py, but the real direct plan_node() coverage is in tests/planner/test_planner_decisions.py. That matters because rename-heavy or topology-heavy proposals must account for those imports and fixtures.
2. observe_user_node() already owns bootstrap answer parsing
The current implementation already special-cases core.region and core.currency inside observe_user_node(). Proposals that create dedicated region/currency observe nodes are adding structure that the codebase does not need.
3. New planner state must be cheap and serializer-safe
Adding a plain bool/str field like llm_failed or decision_origin is fine. Adding cached recipe objects or pretending some returned fields are “transient only” is much weaker unless the serializer mirror is updated explicitly.
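One cheap way to enforce this in tests is a serializability check on node updates. The sketch below uses JSON as a deliberately conservative stand-in; the planner's actual serializer is not shown in the proposals and may accept more types. `is_checkpoint_safe` and `FakeRecipe` are hypothetical names.

```python
import json


class FakeRecipe:
    """Hypothetical stand-in for a rich helper object such as a cached recipe."""


def is_checkpoint_safe(update: dict) -> bool:
    """Return True if the update contains only JSON-encodable values."""
    try:
        json.dumps(update)
    except (TypeError, ValueError):
        return False
    return True


plain_fields_ok = is_checkpoint_safe({"llm_failed": False, "decision_origin": "llm"})
cached_objects_ok = is_checkpoint_safe({"recipes": [FakeRecipe()]})
```

A check like this could run against every node's return value in the direct node tests, so an unsafe field fails fast instead of surfacing as a checkpoint write error.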
4. Search/noop behavior is part of the design problem
The planner can still reach search when the configured backend degrades to NoopSearch. Good proposals preserve deterministic fallback/cap behavior around that; weak proposals treat search as if it were always a productive live tool.
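A minimal sketch of a deterministic cap around unproductive searches follows. `NoopSearchStub`, its `search` method, the `max_empty` default, and the `"plan"`/`"ask_user"` action names are assumptions for illustration; the real `NoopSearch` interface and cap values are not given in the proposals.

```python
from typing import List, Tuple


class NoopSearchStub:
    """Stand-in for a degraded backend; the `search` method name is assumed."""

    def search(self, query: str) -> List[str]:
        return []


def after_search(results: List[str], empty_streak: int, max_empty: int = 2) -> Tuple[str, int]:
    """Pick the next action deterministically and track consecutive empty searches."""
    if results:
        return "plan", 0  # a productive search resets the streak
    streak = empty_streak + 1
    if streak >= max_empty:
        return "ask_user", streak  # cap reached: stop looping on a dead backend
    return "plan", streak


backend = NoopSearchStub()
first = after_search(backend.search("q1"), empty_streak=0)
second = after_search(backend.search("q2"), empty_streak=first[1])
```

The streak is a plain integer, so it can live in planner state and survive a checkpoint without special handling.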
Best recommendation
Recommended base
Use gpt-5.4-proposal.md as the base architecture.
That is the best default shape for this planner:
- `tick`
- `bootstrap_gate`
- `prepare_context`
- `acquisition_gate`
- `llm_plan`
- `decision_policy`
- existing action nodes unchanged
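The phase order above can be sketched as plain data with a toy runner. This is not LangGraph code and it omits all branching (gates that short-circuit to `ask_user`, action nodes, and so on); the `passthrough` placeholder and the `iterations` handling are assumptions used only to show a single loop-entry owner for the counter.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]


def tick(state: State) -> State:
    """Loop entry: the only phase that increments `iterations`."""
    return {"iterations": int(state.get("iterations", 0)) + 1}


def passthrough(state: State) -> State:
    """Placeholder for a phase whose real logic is out of scope here."""
    return {}


PHASES: List[Tuple[str, Callable[[State], State]]] = [
    ("tick", tick),
    ("bootstrap_gate", passthrough),
    ("prepare_context", passthrough),
    ("acquisition_gate", passthrough),
    ("llm_plan", passthrough),
    ("decision_policy", passthrough),
]


def run_one_pass(state: State) -> Tuple[State, List[str]]:
    """Run each phase in order, merging its update the way a graph runtime would."""
    visited: List[str] = []
    for name, node in PHASES:
        state = {**state, **node(state)}
        visited.append(name)
    return state, visited


final_state, visited = run_one_pass({"iterations": 3})
```

Six phases per pass also means up to six checkpoint writes per planner iteration if every phase is a graph node, which is the cost the more finely split proposals multiply.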
Recommended combination
Do not adopt any single proposal literally. The best solution is a hybrid:
- Topology from `gpt-5.4`
  - best node granularity
  - best graph honesty / noise ratio
  - best LangGraph edge discipline
- State discipline from `fable-5`
  - add `decision_origin`
  - add `llm_failed`
  - reset them at loop entry
  - keep them additive in `PlannerState`
- Helper partitioning from `gpt-5.5`
  - keep `normalize_decision`, retry handling, and calculator adjustment as internal helper seams
  - do not necessarily make each one a separate graph node in the first refactor
- Migration sequencing from `opus`/`gpt-5.5`
  - extract helpers inside current `plan_node()` first
  - rewire graph only after helper seams are tested
  - then update docs and tests
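The helper-first step might look roughly like this. The function names follow the buckets attributed to `gpt-5.5` (`normalize_decision`, `retry_gate`, `maybe_calculate`), but every signature, body, and default here is a hypothetical simplification, and `apply_policy` is an invented name for the composition that would initially stay inside `plan_node()`.

```python
from typing import Dict

Decision = Dict[str, str]


def normalize_decision(raw: Dict[str, object]) -> Decision:
    """Coerce a raw decision into a canonical action, defaulting to finish."""
    action = str(raw.get("action", "")).strip().lower()
    return {"action": action or "finish"}


def retry_gate(decision: Decision, retries: int, cap: int = 2) -> Decision:
    """Stop repeated search loops once the (assumed) retry cap is reached."""
    if decision["action"] == "search" and retries >= cap:
        return {"action": "ask_user"}
    return decision


def maybe_calculate(decision: Decision, has_pending_calc: bool) -> Decision:
    """Calculator-before-finish: run pending calculations before finishing."""
    if decision["action"] == "finish" and has_pending_calc:
        return {"action": "calculate"}
    return decision


def apply_policy(raw: Dict[str, object], retries: int, has_pending_calc: bool) -> Decision:
    """Compose the seams in a fixed order; this stays inline until they are tested."""
    decision = normalize_decision(raw)
    decision = retry_gate(decision, retries)
    return maybe_calculate(decision, has_pending_calc)
```

Each seam is a pure function, so it can be unit-tested before any graph rewiring, and moving it into a `decision_policy` node later does not change its behavior.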
What I would explicitly avoid
- Caching `recipes` in planner state.
- Moving policy rewrites into `route_after_*` functions.
- Splitting region/currency into dedicated observe-node subloops.
- Turning every corrector into its own checkpointed node.
- Assuming topology changes are free for in-flight planner threads.
Final advice
If the goal is the best long-term planner graph, choose the gpt-5.4 shape.
If the goal is the safest real rollout, implement a gpt-5.4 + fable-5 hybrid, with gpt-5.5-style helper extraction as the migration path.
That combination gives the best balance of:
- explicit graph phases,
- preserved interrupt/resume behavior,
- minimal state-risk,
- testable decision seams,
- and a rollout story that respects LangGraph checkpoints instead of hand-waving them away.