GPT-5.5 best-analyst range for planner graph refactor reports

This document ranks the analyst reports in docs/planner-graph-ref/analyse/.

This is not a ranking of the underlying refactor proposals. It answers a narrower question: if an implementer could read only one analyst report before choosing a planner-graph refactor proposal, which report would give enough accurate context to make a well-considered implementation decision?

Evaluation lens

I weighted each report by:

Current-code groundedness: checks claims against docs/planner-graph-ref/current-graph.md, src/venturescope/planner/agent.py, planner state/checkpoint constraints, and the real interrupt/resume surface.
Decision-critical risk coverage: identifies the issues that can invalidate a proposal, especially checkpoint compatibility, state serialization, edge-function purity, single interrupt handling, iteration semantics, LLM-failure behavior, and deterministic-vs-LLM decision rewrite scope.
Proposal coverage: evaluates all 11 proposals individually rather than grouping them too coarsely.
Implementation usefulness: gives a clear recommendation, migration path, and explicit ideas to adopt or reject.
Reliability: avoids material factual mistakes, unsupported optimizations, or self-favoring rankings that would mislead a reader.
Clarity: readable enough to serve as the single briefing document.

Ranked summary

| Rank | Analyst report | Grade | Best use | Main limitation |
| --- | --- | --- | --- | --- |
| 1 | `fable-5-range.md` | A+ | Best single report to read | Strongly favors its sibling proposal; line-number-heavy |
| 2 | `gpt-5.4-range.md` | A | Best practical architecture-and-rollout lens | Under-specifies mechanics compared with top report |
| 3 | `gpt-5.5-range.md` | A- | Concise, balanced implementation synthesis | Less forensic than top two |
| 4 | `opus-4.7-range.md` | A- | Best taxonomy and guardrail framing | Slightly weaker as a one-report guide due to verbosity and a few practical slips |
| 5 | `qwen-3.7-max-range.md` | B+ | Good compact independent confirmation | Too forgiving of risky ideas such as recipe caching |
| 6 | `qwen-3.6-plus-range.md` | B | Good line-level concern map and synthesis | Some overconfidence in over-split designs |
| 7 | `glm-5.1-range.md` | B- | Good structured scorecard | Misses or softens several implementation hazards |
| 8 | `kimi-2.6-range.md` | C+ | Useful broad comparison, catches some bugs | Several rankings would mislead a lone reader |
| 9 | `mimo-2.5-pro-range.md` | C | Systematic scoring and convergence appendix | Scorecard is less critical than needed |
| 10 | `deepseek-4-pro-range.md` | C- | Long, detailed alternative view | Best-proposal ranking is materially unreliable |
| 11 | `gemini-3.1-pro-range.md` | D | Quick paradigm overview only | Too shallow and contains material classification errors |

Detailed ranking

1. `fable-5-range.md` — best single decision briefing

Good sides

Establishes concrete ground-truth facts before ranking: current plan_node() scope, Postgres checkpointing, single ask_user interrupt site, multiple LLM calls inside the current planner tick, deterministic-vs-LLM rewrite differences, and the iteration-floor dependency in run_planner_step().
Evaluates every proposal against those facts, not just against subjective graph aesthetics.
Identifies the two highest-risk production issues most reports underplay: in-flight checkpoint compatibility after topology changes, and preserving the current rewrite-scope split between deterministic and LLM-originated decisions.
Gives specific failure modes, not just labels: edge functions cannot persist rewritten decisions; loop paths that bypass the tick break max-iteration and turn attribution; cached complex acquisition objects in checkpoint state create serialization/drift risk.
Ends with a usable synthesis: fable-5 topology as backbone, gpt-5.4 guardrails, gpt-5.5 specification/test assets, opus taxonomy, and explicit rejected ideas.
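
The edge-purity failure mode listed above can be illustrated with a toy runner. This is a minimal sketch, not LangGraph itself: `run_step`, `route_with_rewrite`, `rewrite_node`, and the `decision` values are hypothetical names, and the runner only models the assumption that node return values are merged into persisted state while a routing function's result is used solely to pick the next node.

```python
import copy
from typing import Callable, Dict, Tuple

State = Dict[str, object]


def run_step(state: State, node: Callable[[State], State],
             edge: Callable[[State], str]) -> Tuple[State, str]:
    """Toy model of one graph step: merge the node's update, then route.

    Assumption: the edge function receives a copy of state, so anything it
    mutates is discarded; only the node's returned update is persisted.
    """
    persisted = {**state, **node(copy.deepcopy(state))}
    next_node = edge(copy.deepcopy(persisted))
    return persisted, next_node


def passthrough_node(state: State) -> State:
    return {}  # no state update


def route_with_rewrite(state: State) -> str:
    # Anti-pattern: the edge "rewrites" the decision while routing.
    state["decision"] = "search"
    return "search"


def rewrite_node(state: State) -> State:
    # Preferred: the rewrite happens in a node and is returned as an update.
    return {"decision": "search"}


def pure_route(state: State) -> str:
    # Pure edge: reads state, returns a route name, writes nothing.
    return str(state["decision"])


start: State = {"decision": "noop"}
bad_state, bad_next = run_step(start, passthrough_node, route_with_rewrite)
good_state, good_next = run_step(start, rewrite_node, pure_route)
```

In the anti-pattern run the graph routes to `search` while the persisted decision still says `noop`, so a resume from that state would see a decision that disagrees with the path taken. In the second run the rewrite is part of the node's update, so state and route agree.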

Bad sides

It evaluates a proposal from the same model family and ranks it first. The report discloses this and supports the ranking well, but a reader should still have the top choice independently reviewed.
It leans heavily on line numbers and exact current source layout, so it will age faster than more conceptual reports.
It is dense. As the best one-report briefing, that density is useful; as a quick executive summary, it is heavy.

Why it ranks here

It is the only report that is both comprehensive and hard-nosed about the hidden operational risks. A reader could choose an implementation plan from this report alone without missing the major traps.

2. `gpt-5.4-range.md` — best practical architecture lens

Good sides

Uses the right criteria: behavior-order preservation, checkpoint safety, interrupt safety, graph honesty without graph spam, migration realism, and proposal accuracy.
Strongly emphasizes current project realities: planner checkpoint ownership, stable iteration semantics, ask_user -> observe_user interrupt contract, real test surface, and search/noop fallback behavior.
Its recommendation is pragmatic: use the gpt-5.4 shape for topology, borrow fable-5 state discipline, use gpt-5.5 helper/test partitioning, and avoid over-node or edge-mutation designs.
The report is clear and implementation-oriented without becoming too long.

Bad sides

Ranks its matching proposal first and fable-5 second, even though fable-5 handles checkpoint namespace migration more concretely.
It says less about exactly how to preserve deterministic-vs-LLM rewrite scope than fable-5-range.md.
It under-specifies the state additions needed for llm_failed and decision_origin compared with the strongest reports.

Why it ranks here

It is the best report for a team that wants a maintainable architecture and rollout strategy. It is slightly less self-contained than fable-5-range.md because it needs borrowed mechanics.

3. `gpt-5.5-range.md` — best concise synthesis

Good sides

Sets out the right constraints up front: behavior preservation, LangGraph fit, right-sized graph, state-surface discipline, and migration risk.
Correctly identifies checkpointed graph state and route-hint persistence as design constraints, not style concerns.
Gives a balanced ranking: fable-5 first, gpt-5.4 second, gpt-5.5 third, with lower ranks for edge-mutation and over-split proposals.
Provides a concrete hybrid target graph and a migration order that an implementer could follow.
Stays focused; it is easier to read end-to-end than the longer forensic reports.

Bad sides

It is less forensic than fable-5-range.md: fewer hard code facts, fewer exact failure traces, and less detailed critique of each proposal.
It notes checkpoint namespace decisions but does not develop the parked-thread migration story as strongly as the top report.
Some proposal weaknesses are summarized rather than proven.

Why it ranks here

It is a strong one-report candidate if the reader wants a concise decision aid. It ranks below the top two because a well-considered implementation decision benefits from more evidence than it provides.

4. `opus-4.7-range.md` — best taxonomy and guardrails

Good sides

Defines excellent evaluation axes: granularity fit, migration safety, checkpoint compatibility, state discipline, LLM-call isolation, edge purity, and test impact.
Provides a clear gate/corrector mental model and recommends adopting that invariant even while rejecting a too-granular graph.
Correctly places fable-5 as the baseline and gpt-5.4 as the anti-pattern guardrail source.
The final migration order is practical: add guardrails, extract helpers without graph change, rewire with a planner thread namespace bump, update docs.
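
The namespace bump in that migration order could look roughly like the following. This is a sketch under assumptions: `PLANNER_GRAPH_VERSION`, `planner_thread_id`, the `planner:v{n}:` key format, and the dictionary-backed store are all hypothetical; only the `{"configurable": {"thread_id": ...}}` config shape is taken from LangGraph's checkpointing convention.

```python
from typing import Dict, Optional

# Hypothetical constant, bumped whenever the planner node topology changes.
PLANNER_GRAPH_VERSION = 2


def planner_thread_id(session_id: str, version: int = PLANNER_GRAPH_VERSION) -> str:
    """Build a versioned thread id so old and new topologies never share one."""
    return f"planner:v{version}:{session_id}"


def run_config(session_id: str) -> Dict[str, Dict[str, str]]:
    """Config dict in the shape a checkpointer-backed graph is invoked with."""
    return {"configurable": {"thread_id": planner_thread_id(session_id)}}


# Toy checkpoint store keyed by thread id; stands in for the real checkpointer.
checkpoints: Dict[str, dict] = {
    # A thread parked under the old topology (version 1).
    planner_thread_id("abc", version=1): {"next": "plan"},
}


def load_checkpoint(session_id: str) -> Optional[dict]:
    """Look up the checkpoint for the current graph version only."""
    thread_id = run_config(session_id)["configurable"]["thread_id"]
    return checkpoints.get(thread_id)


resumed = load_checkpoint("abc")
```

With the bump in place, the new graph does not pick up the checkpoint written by the old topology, which avoids resuming at a node that no longer exists. What should happen to threads parked under the old namespace remains a separate decision that the sketch does not answer.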

Bad sides

The report includes local URI-style links, which are less portable and less suitable for repo markdown than plain relative paths.
It references the direct plan_node test surface somewhat imprecisely compared with the gpt-5.4 report.
It is verbose and taxonomy-heavy; excellent as a companion, slightly less efficient as the only report.

Why it ranks here

It is architecturally strong and close to the top tier. It loses to gpt-5.5-range.md only because the latter is a cleaner single-read synthesis.

5. `qwen-3.7-max-range.md` — good compact confirmation

Good sides

Clean criteria and a readable ranked list.
Correctly ranks fable-5 first and explains why decision_origin, checkpoint compatibility, and llm_failed matter.
Has a useful recommendation section with targeted borrowings from deepseek, gpt-5.4, opus, and gpt-5.5.
Clearly explains why 10-16 node designs become counterproductive and why 2-3 node designs leave the monolith mostly intact.

Bad sides

Too accepting of recipes in state as a possible optimization. The safer reports treat this as a serialization and drift hazard until proven otherwise.
Describes qwen-3.7's own code examples as useful implementation reference while underplaying the wiring and transient-state problems other reports found.
Less grounded in the current source than the top four.
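
The serialization side of the recipes-in-state concern can be shown with a small stand-in. The sketch below assumes a hypothetical `Recipe` class and `RECIPES` registry, and uses a `json` round trip as a rough proxy for a checkpoint serializer; the real serializer and the real recipe objects may behave differently.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Recipe:
    """Hypothetical acquisition recipe that carries a callable."""
    name: str
    fetch: Callable[[], str]


# Hypothetical registry that lives in code, outside checkpoint state.
RECIPES: Dict[str, Recipe] = {
    "market_size": Recipe(name="market_size", fetch=lambda: "data"),
}


def can_checkpoint(state: dict) -> bool:
    """Return True if state survives a JSON round trip (serializer stand-in)."""
    try:
        return json.loads(json.dumps(state)) == state
    except TypeError:
        return False


# Caching the object itself in state versus storing only a lookup key.
risky_state = {"iteration": 1, "recipes": {"market_size": RECIPES["market_size"]}}
safe_state = {"iteration": 1, "recipe_keys": ["market_size"]}

risky_ok = can_checkpoint(risky_state)
safe_ok = can_checkpoint(safe_state)
```

Storing keys also addresses drift in this toy model: the recipe is resolved from the registry when it is used, so a resumed thread gets the current definition instead of whatever object was frozen into an older checkpoint.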

Why it ranks here

It is a solid secondary analyst report. It is good enough to confirm the direction, but not strong enough to be the only basis for implementation.

6. `qwen-3.6-plus-range.md` — good concern map with some overconfidence

Good sides

Starts with a concise table of the current plan_node() concerns and whether they involve LLM calls.
Correctly elevates fable-5, gpt-5.5, and gpt-5.4 as the top proposal cluster.
Identifies important hazards in lower-ranked proposals, including loop-backs to the wrong entry node, routing-function mutation, duplicated region/currency observers, missing decision_origin, and incomplete post-LLM handling.
Gives a useful synthesis: 5-6 nodes, decision_origin, sole tick owner, pure edges, namespace bump, no _route state.

Bad sides

Overvalues gpt-5.5's more granular post-LLM node split and migration safety relative to checkpoint overhead.
Some judgments are too confident about over-split bootstrap nodes and explicit node boundaries.
The report is less rigorous about checkpoint and serializer details than the top reports.

Why it ranks here

It is useful and mostly directionally right, but a lone reader would need a stronger report to avoid over-implementing graph detail.

7. `glm-5.1-range.md` — strong scorecard, weaker hazard detection

Good sides

Has a clear criteria table and per-proposal assessment format.
Correctly identifies fable-5's decision_origin as the key insight and recommends the five-node tick → prepare → select → decide → guard shape.
The summary table is fast to scan and the adoption strategy is concrete.
Calls out several important anti-patterns: _route state, separate corrector nodes, and route-finish-check as a separate node.

Bad sides

Gives several proposals too much correctness credit. For example, deepseek's recipes state and self-loop risks are softened, and gpt-5.4's one-node-at-a-time migration is treated as safer than it may be.
Some statements are imprecise, such as treating new fields as “runtime-only” even though planner state is checkpoint-owned.
It is more of a structured scorecard than a forensic decision memo.

Why it ranks here

It is readable and often right, but it does not catch enough load-bearing implementation hazards to be the only report.

8. `kimi-2.6-range.md` — broad and useful, but misleading in places

Good sides

Covers every proposal with understandable prose and a useful summary table.
Correctly highlights fable-5's granularity, decision_origin, action-node loopback, and thread namespace bump.
Catches real problems in qwen-3.7-max and qwen-3.6-plus, including tick bypass and routing-function misuse.
Has a practical final recommendation and “what to avoid” list.

Bad sides

Ranks gemini-3.1 too high and deepseek-4 too high for a single decision briefing.
Treats deepseek's recipes caching as a good practical touch instead of a serializer-risky checkpoint-state expansion.
Says applying all corrections to deterministic decisions is “slightly wasteful but not wrong”; in this planner, broadening that rewrite scope can change behavior.
Several rank choices would push a reader toward less safe proposal details.
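
Why broadening the rewrite scope is a behavior change and not merely wasted work can be seen in a small example. The correction rule below is invented for illustration (an LLM `finish` is turned into `ask_user` while a question is open); the planner's actual corrections are not reproduced here, and `rewrite`, `rewrite_everything`, and the `Decision` shape are hypothetical.

```python
from typing import Literal, TypedDict


class Decision(TypedDict):
    action: str
    decision_origin: Literal["deterministic", "llm"]


def rewrite(decision: Decision, has_open_question: bool) -> Decision:
    """Scoped rewrite: the correction applies to LLM-originated decisions only."""
    if (decision["decision_origin"] == "llm"
            and decision["action"] == "finish"
            and has_open_question):
        return {**decision, "action": "ask_user"}
    return decision


def rewrite_everything(decision: Decision, has_open_question: bool) -> Decision:
    """Broadened rewrite: the same correction applied regardless of origin."""
    if decision["action"] == "finish" and has_open_question:
        return {**decision, "action": "ask_user"}
    return decision


deterministic_finish: Decision = {
    "action": "finish",
    "decision_origin": "deterministic",
}
scoped = rewrite(deterministic_finish, has_open_question=True)
broadened = rewrite_everything(deterministic_finish, has_open_question=True)
```

Under the scoped rule the deterministic `finish` passes through unchanged; under the broadened rule the same input becomes `ask_user`, so the planner would interrupt where it previously finished.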

Why it ranks here

It contains useful observations, but the report is not reliable enough as the only guide for choosing the implementation proposal.

9. `mimo-2.5-pro-range.md` — systematic but insufficiently critical

Good sides

Uses explicit weighted criteria and gives each proposal a scorecard.
Correctly ranks fable-5 first and highlights checkpoint compatibility and decision_origin.
The convergence appendix is useful for seeing where proposals agree: isolate the LLM call, extract iteration guards, loop back to the guard entry, and split post-LLM correction somehow.
The final recommendation combines fable-5, gpt-5.4 guardrails, and optional gpt-5.5 post-LLM granularity.

Bad sides

Scores can give false precision. Several proposal scores are too generous relative to their behavioral or checkpoint risks.
Treats dedicated region/currency nodes and recipes state too positively.
Does not catch or emphasize enough of the fatal implementation issues found by stronger reports.
The ranking is less useful than the top reports for deciding what to reject.

Why it ranks here

It is a good cross-check, not a good single source of truth. The report is orderly but too tolerant of risky designs.

10. `deepseek-4-pro-range.md` — detailed but materially unreliable

Good sides

Very thorough and easy to read.
Has useful cross-cutting observations and an alternatives section.
Recognizes broad consensus around extracting a loop-entry guard, isolating the LLM call, keeping action nodes unchanged, and extracting helpers before rewiring.
Provides a concrete synthesized architecture and migration plan.

Bad sides

The ranking is highly misleading: it puts kimi-2.6 first, gpt-5.4 sixth, and opus last, despite stronger reports showing that kimi misses key state/checkpoint issues and opus is an excellent reference even if too granular.
Endorses recipes in checkpoint state as a practical optimization without resolving serialization and drift risks.
Underplays or accepts dedicated bootstrap nodes and adapter nodes that widen graph surface without enough benefit.
Treats several behavior-changing details as minor implementation adjustments.

Why it ranks here

The report has a lot of content, but if it were the only report read, it could lead the team toward the wrong base proposal and unsafe state choices.

11. `gemini-3.1-pro-range.md` — too shallow for the decision

Good sides

Gives a quick taxonomy of broad design paradigms: phase-based pipeline, minimalist split, micro-node explosion, and edge-heavy routing.
Correctly recommends a phase-based pipeline over graph micro-node explosion.
Easy to read in a few minutes.

Bad sides

Does not evaluate all 11 proposals individually; it groups them too coarsely for an implementation decision.
Contains a material classification error: it calls opus the edge-heavy approach, while the more relevant edge-mutation problem is qwen-3.6-plus-style routing-function policy.
Omits key planner-specific hazards: checkpoint namespace migration, serializer-safe state, single interrupt ownership, deterministic-vs-LLM rewrite scope, llm_failed, and iteration-floor semantics.
The recommended four-node architecture is plausible at a high level but not substantiated enough for this codebase.

Why it ranks here

It is a useful first-pass framing document, not an aid for a well-considered decision. A reader relying only on it would miss the most important implementation risks.

Final recommendation for readers

If you can read only one report, read fable-5-range.md.

If you can read two, read fable-5-range.md and gpt-5.4-range.md: together they give the best mix of verified operational risk and maintainable architecture principles.

The strongest implementation consensus across the best reports is: use a 5-6 stage planner pipeline, keep edge functions pure, keep ask_user as the only interrupt node, let only the loop-entry tick increment iterations, add serializer-safe decision provenance such as decision_origin/llm_failed, avoid caching complex recipe objects in checkpoint state, and explicitly decide on a planner thread namespace bump before changing node topology.
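
The "only the loop-entry tick increments iterations" part of that consensus can be sketched as a plain Python loop. `MAX_ITERATIONS`, `tick`, `guard`, and `run` are hypothetical names and the cap of 3 is arbitrary; the sketch only contrasts an action that loops back through the tick with one that loops back past it.

```python
from typing import Dict

MAX_ITERATIONS = 3  # hypothetical cap


def tick(state: Dict[str, int]) -> Dict[str, int]:
    """Loop-entry node: the only place that increments the iteration counter."""
    return {**state, "iteration": state["iteration"] + 1}


def guard(state: Dict[str, int]) -> str:
    """Pure edge: reads state and returns a route name, never writes."""
    return "finish" if state["iteration"] >= MAX_ITERATIONS else "decide"


def run(loop_through_tick: bool, step_budget: int = 10) -> Dict[str, int]:
    """Run the toy loop; step_budget stops a runaway loop in the sketch."""
    state = {"iteration": 0, "actions": 0}
    state = tick(state)
    for _ in range(step_budget):
        if guard(state) == "finish":
            break
        # Stand-in for an action node doing one unit of work.
        state = {**state, "actions": state["actions"] + 1}
        if loop_through_tick:
            state = tick(state)  # action loops back to the tick
        # Otherwise the action loops back past the tick and the counter stalls.
    return state


via_tick = run(loop_through_tick=True)
bypassing_tick = run(loop_through_tick=False)
```

When every loop passes through the tick, the guard stops the run at the cap. When the loop bypasses the tick, the counter stays at 1 and only the sketch's step budget ends the run, which is the max-iteration failure the stronger reports warn about.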
