Best-Analyst Ranking — Planner Graph Refactor Reports
Evaluator: Fable 5 — the same Claude-family persona that authored analyse/fable-5-range.md and proposals/fable-5-proposal.md (see §0).
Date: 2026-06-12
Inputs: the 11 analyst reports in docs/planner-graph-ref/analyse/*-range.md, checked against docs/planner-graph-ref/current-graph.md, the 11 proposals in docs/planner-graph-ref/proposals/, and the live source tree (src/venturescope/planner/, src/venturescope/conversation/, tests/planner/).
Question answered: if you could read only ONE of these reports before deciding which refactor proposal to implement, which should it be?
0. Conflict-of-interest disclosure
This evaluation is written by the same persona that authored one of the eleven reports under review (fable-5-range.md) and the proposal that report ranks first (fable-5-proposal.md). That report is ranked #1 below, and a sibling Claude-family report (opus-4.7-range.md) is #2. Treat that stack of coincidences with suspicion. What was done to keep the result honest:
- The principle is mechanical. Reports are scored on only-report decision sufficiency under the five-dimension rubric in §2, applied uniformly.
- Nothing from fable-5-range was taken on faith. All six of its "ground-truth facts" were re-verified against the live source during this evaluation (§3.1) and all six hold. Its two bug-finds were re-traced independently: the kimi double-adjust directly against the proposal text (§3.3), the qwen-3.7 iteration-skip directly against the proposal's graph wiring (§3.4).
- Its headline is cross-family consensus, not family taste. 8 of the 10 reports that rank proposals individually put the fable-5 proposal first; the other two put it second. GPT-, GLM-, Qwen-, Kimi-, and MiMo-family evaluators are among the eight. Ranking first the report that also says so follows the evidence.
- Claude-family errors are named, not hidden. opus-4.7-range is explicitly docked for citing a non-existent test file twice and for praising a bugged wiring (see its entry).
- The closest call cuts my way, so it is argued in the open. opus-4.7-range (#2) vs gpt-5.5-range (#3) is the one decision where family bias could matter; both entries lay out the trade, and a reader who weights factual cleanliness over decision depth should swap them. Nothing else in the list moves either way.
1. Ranking principle: the best only report
A report wins not by being the most pleasant read but by being sufficient: a decision-maker who reads nothing else must come away with (a) a true picture of the current graph, (b) visibility into all eleven proposals, (c) a concrete, defensible recommendation with an execution order, and (d) awareness of the risks that would burn an implementer.
Errors are therefore weighted by position, not just by count: a wrong fact in a load-bearing place — a crowned winner with an unmentioned bug, a proposal rejected for a sin it doesn't commit — costs far more than a cosmetic slip, because the only-reader has no second source to catch it.
2. Rubric
Five dimensions, applied to every report:
- Fact integrity — are its claims about the current code and the proposals true? Checked against source, not against other reports.
- Option-space coverage — are all 11 proposals individually assessed, or does the reader lose sight of options?
- Decision power — does it end in an actionable adopt/combine/reject recommendation with a migration order, or stop at commentary?
- Risk & constraint coverage — the facts that decide this refactor: every inner node boundary is a Postgres checkpoint write (the planner subgraph compiles with the outer AsyncPostgresSaver, conversation/graph.py:58); there is exactly one interrupt() surface (planner/agent.py:2051) with live threads parked at it; new state types must pass MSGPACK_ALLOWLIST (checkpoint_serde.py); and the test suite contains what it contains (§3.2).
- Calibration & self-consistency — an only-report whose bottom line is a silent outlier against the field misleads precisely when it matters most. Divergence is fine; unflagged divergence that traces back to a factual miss is heavily penalized.
For calibration reference, the consensus mean rank per proposal across the eleven reports (gemini's tiers mapped to tied ranks):
| Proposal | fable-5 | gpt-5.4 | deepseek | gpt-5.5 | kimi | opus | glm | gemini | mimo | qwen-3.7 | qwen-3.6 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mean rank | 1.30 | 3.00 | 4.10 | 4.15 | 5.75 | 6.30 | 6.60 | 8.50 | 8.50 | 9.45 | 9.55 |
Centrality is not correctness — this table is used only to detect unflagged outlier headlines, and every outlier penalized below was first checked against code (§3) and failed on facts, not on nonconformity.
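As a rough illustration of how a tiered report can be folded into a mean-rank table, the sketch below assumes the common mid-rank convention: every member of a tier receives the average of the positions that tier spans. The proposal names and orderings are toy data, not the corpus figures, and the helper names are hypothetical.

```python
from statistics import mean


def tier_to_tied_ranks(tiers):
    """Map ordered tiers to tied ranks (mid-rank convention, an assumption)."""
    ranks = {}
    position = 1
    for tier in tiers:
        # Every member of a tier shares the mean of the positions the tier spans.
        tied = mean(range(position, position + len(tier)))
        for name in tier:
            ranks[name] = tied
        position += len(tier)
    return ranks


def mean_ranks(orderings):
    """Average each proposal's rank across reports (list of name -> rank dicts)."""
    names = orderings[0].keys()
    return {name: mean(o[name] for o in orderings) for name in names}


# Toy data: two strict orderings plus one tiered report.
strict_a = {"p1": 1, "p2": 2, "p3": 3}
strict_b = {"p1": 2, "p2": 1, "p3": 3}
tiered = tier_to_tied_ranks([["p1", "p2"], ["p3"]])  # p1 and p2 tie at 1.5

consensus = mean_ranks([strict_a, strict_b, tiered])
print(consensus)
```

A different tie convention (for example, giving every tier member the tier's best position) would shift the means slightly, so the convention should be stated next to any such table.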
3. Referee facts re-verified for this evaluation
Everything in this section was checked directly against the repository during this evaluation. Line numbers are valid as of today and will rot; re-anchor by symbol before relying on them.
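Re-anchoring by symbol can be done mechanically. The following is a minimal sketch using Python's standard ast module; the sample source is a hypothetical stand-in, and only the symbol names are borrowed from this document.

```python
import ast


def find_symbol_lines(source: str, names: set[str]) -> dict[str, tuple[int, int]]:
    """Return {name: (first_line, last_line)} for named functions or classes."""
    found = {}
    for node in ast.walk(ast.parse(source)):
        is_def = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        if is_def and node.name in names:
            found[node.name] = (node.lineno, node.end_lineno)
    return found


# Hypothetical sample; in practice, read the real module text from disk.
sample = '''
def plan_node(state):
    return state

def route_after_plan(state):
    return "end"
'''

anchors = find_symbol_lines(sample, {"plan_node", "route_after_plan"})
print(anchors)
```

Running this against the current file before relying on a citation gives the present line span for each symbol, regardless of how far the cited numbers have drifted.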
3.1 Current-graph facts (all six of fable-5-range's claims hold)
| # | Fact | Evidence |
|---|---|---|
| 1 | plan_node is the god-node | src/venturescope/planner/agent.py:846, body running to ~1192 |
| 2 | Exactly one interrupt() in the planner, inside ask_user | agent.py:2051 |
| 3 | route_after_plan / _build_state_graph locations | agent.py:2070 / agent.py:2082 |
| 4 | The planner subgraph compiles with the outer Postgres saver — every inner-node superstep is a checkpoint write; the node-count debate is about real I/O, not style | conversation/graph.py:58 |
| 5 | _turn_searches is reset per turn, outside graph state | planner/__init__.py:212,228 |
| 6 | Deterministic bypasses: the acquisition fast path (agent.py:1018–1024) and auto-finish (agent.py:1026–1035) return early applying only _adjust_calculation_decision — today's deterministic paths skip the rest of the policy stack. Reports arguing "policy must be a shared exit gate" (fable-5, gpt-5.5) are grounded in this | agent.py:1014–1041 |
3.2 Test-suite reality
tests/planner/ contains test_planner_decisions.py and test_planner_runner.py, and nothing else. gpt-5.4-range's check is right. opus-4.7-range cites tests/planner/test_planner_agent.py twice, but that file does not exist.
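A check of this kind is cheap to automate. The sketch below builds a throwaway directory that mirrors the two files named above and reports any cited path that is missing; the helper name is hypothetical, and a real check would point root at the repository instead.

```python
import tempfile
from pathlib import Path


def missing_citations(root: Path, cited: list[str]) -> list[str]:
    """Return the cited relative paths that are not files under root."""
    return [path for path in cited if not (root / path).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    planner_tests = root / "tests" / "planner"
    planner_tests.mkdir(parents=True)
    # Mirror the two files the section says exist.
    for name in ("test_planner_decisions.py", "test_planner_runner.py"):
        (planner_tests / name).touch()

    cited = [
        "tests/planner/test_planner_decisions.py",
        "tests/planner/test_planner_runner.py",
        "tests/planner/test_planner_agent.py",  # the path cited by opus-4.7-range
    ]
    phantom = missing_citations(root, cited)

print(phantom)
```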
3.3 kimi-2.6-proposal double-adjust (confirmed)
In proposals/kimi-2.6-proposal.md, route_direct applies _adjust_calculation_decision (line 132) and then routes unconditionally to enforce_policy (line 135), whose spec includes _adjust_calculation_decision again (line 157, mapping to current agent.py:1177–1178). The adjustment runs twice per deterministic decision; the proposal never establishes idempotency, so this is at minimum an unanalyzed behavioral risk. Flagged by fable-5-range and glm-5.1-range; praised as clean wiring by deepseek-4-pro-range and opus-4.7-range.
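Whether a double application matters depends on whether the adjustment is idempotent, which can be probed with a property-style check. The sketch below is illustrative only: both adjust functions are hypothetical stand-ins, and nothing here describes what the real _adjust_calculation_decision does.

```python
def is_idempotent(adjust, samples):
    """True if applying adjust twice equals applying it once for every sample."""
    return all(adjust(adjust(dict(s))) == adjust(dict(s)) for s in samples)


def adjust_idempotent(decision):
    """Hypothetical adjustment: rewrites a premature calculate into a search."""
    out = dict(decision)
    if out.get("action") == "calculate" and not out.get("inputs_ready"):
        out["action"] = "search"
    return out


def adjust_non_idempotent(decision):
    """Hypothetical adjustment with a side counter; a second pass changes it again."""
    out = dict(decision)
    out["retries"] = out.get("retries", 0) + 1
    return out


samples = [
    {"action": "calculate", "inputs_ready": False},
    {"action": "finish"},
]

safe = is_idempotent(adjust_idempotent, samples)
unsafe = is_idempotent(adjust_non_idempotent, samples)
print(safe, unsafe)
```

Feeding the real function and a set of recorded decisions through a check like this would turn the "unanalyzed behavioral risk" into a yes-or-no answer.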
3.4 qwen-3.7-max-proposal iteration-skip (confirmed at the wiring level)
In proposals/qwen-3.7-max-proposal.md, check_termination is the only node that increments iterations (line 89: iters = state["iterations"] + 1). The internal loops return there (search/observe/calculate/reflect → check_termination), but the post-answer path does not: ask_user → observe_user → check_region_currency → … → llm_decide re-enters the loop bypassing check_termination entirely (mermaid lines 36–38, 75). Ask-driven cycles therefore never count against max_iters; only the separate ask cap contains them. Found by fable-5-range, independently by kimi-2.6-range; re-traced here.
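The effect of that wiring can be shown with a toy walk. The sketch below is a simplification: node names follow the description above, the intermediate nodes elided there are left out, and the walks are scripted by hand instead of being derived from the proposal's graph.

```python
INCREMENTING_NODE = "check_termination"  # the only node that bumps the counter


def count_iterations(cycles):
    """Count counter increments across a sequence of walked cycles."""
    return sum(
        1
        for cycle in cycles
        for node in cycle
        if node == INCREMENTING_NODE
    )


# A tool-driven cycle passes through check_termination on its way back.
search_cycle = ["llm_decide", "search", "check_termination", "llm_decide"]

# An ask-driven cycle re-enters llm_decide without touching check_termination.
ask_cycle = ["llm_decide", "ask_user", "observe_user", "check_region_currency", "llm_decide"]

search_iterations = count_iterations([search_cycle] * 3)
ask_iterations = count_iterations([ask_cycle] * 3)
print(search_iterations, ask_iterations)
```

Three search cycles advance the counter three times, while three ask cycles leave it untouched, which is why only the separate ask cap bounds that path.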
3.5 gemini-3.1-pro-range's tiering misattribution (confirmed wrong)
gemini-3.1-pro-range places the opus proposal in its "Tier 4: edge-heavy anti-pattern". The opus proposal is a plain node-chain — trivial routing functions, c_* corrector nodes wired with add_edge. The design that actually locates redirect/cap logic in a conditional-edge function is qwen-3.6-plus's route_after_plan. The tier label fits a different proposal than the one it condemns.
3.6 The recipes serializer landmine
deepseek-4-pro-range's recommended synthesis state and kimi-2.6-range's secondary recommendation both embed recipes: dict[str, FieldAcquisition] in graph state. Anything checkpointed must pass MSGPACK_ALLOWLIST (src/venturescope/checkpoint_serde.py) — an unregistered type warns (or refuses in strict mode) on every conversation. Only qwen-3.7-max-range caveats this correctly ("only if serialization proven safe; otherwise rebuild per-node — cheap").
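One way to catch this class of problem before it reaches a live conversation is to walk a candidate state and list every type that is not explicitly allowed. The sketch below is a generic illustration: ALLOWED_TYPES and the FieldAcquisition dataclass are hypothetical stand-ins, and the project's actual MSGPACK_ALLOWLIST mechanism may work differently.

```python
from dataclasses import dataclass


@dataclass
class FieldAcquisition:
    """Hypothetical stand-in for the project's type of the same name."""
    source: str


# Hypothetical allowlist of plain, serializer-safe types.
ALLOWED_TYPES = (str, int, float, bool, type(None), list, dict)


def unregistered_types(value, allowed=ALLOWED_TYPES):
    """Recursively collect names of types in value that are not allowed."""
    if not isinstance(value, allowed):
        return {type(value).__name__}
    found = set()
    if isinstance(value, dict):
        for key, item in value.items():
            found |= unregistered_types(key, allowed)
            found |= unregistered_types(item, allowed)
    elif isinstance(value, list):
        for item in value:
            found |= unregistered_types(item, allowed)
    return found


state = {
    "iterations": 2,
    "recipes": {"revenue": FieldAcquisition(source="search")},
}
offenders = unregistered_types(state)
print(offenders)
```

A test along these lines, run against each proposed state shape, would flag the recipes field before any checkpoint is written.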
4. The ranking
1. fable-5-range.md (~29 KB)
Their order: fable-5 → gpt-5.5 → gpt-5.4 → opus → deepseek → kimi → glm → qwen-3.7 → mimo → qwen-3.6 → gemini.
Good. The only report with a code-verified referee layer: six ground-truth facts with file:line citations, and all six survive independent re-verification (§3.1) — in this corpus, that is unique. It found both wiring bugs that actually matter to the decision: the kimi double-adjust (§3.3) and the qwen-3.7 iteration-skip (§3.4); no other report finds both, most find neither, and each find was later independently confirmed by a different evaluator family (glm-5.1 the first, kimi-2.6 the second) — the strongest external validation any claim in the corpus receives. Coverage is complete: all eleven proposals under stable criteria A–F, an 8-column comparison matrix, and an explicit rejected-ideas list with named sources, so the reader sees what was discarded and why, not just what won. It ends in decision power: a composition recommendation plus a 3-step execution order. Its headline matches the field consensus, and its conflict of interest (own-family proposal first) is disclosed inside the report itself.
Bad. Heavy line-number anchoring will rot as agent.py changes — an implementer must re-anchor by symbol. It is dense, and it refuses numeric scoring, so two readers can reconstruct slightly different mid-table orderings from the same prose. And it ranks its own family's proposal first; disclosed and corroborated by 8 of 10 peers, but an only-reader still has to take the disclosure at face value.
Only-report verdict: sufficient. True facts, full coverage, both decisive bug-finds, a concrete execution order, and a calibrated headline. This is the report to read if you read one.
2. opus-4.7-range.md (~29 KB, self-identifies as "Sisyphus, Claude Opus 4.7")
Their order: fable-5 (A) → gpt-5.4 (A) → gpt-5.5 (A−) → opus (B+) → kimi (B+) → mimo (B) → gemini (B) → deepseek (B−) → glm (C+) → qwen-3.6 (C) → qwen-3.7 (C−).
Good. The deepest decision package of the eleven: seven explicit criteria (C1–C7), letter grades, thorough per-proposal verdicts, two concrete amendments (A: adopt gpt-5.4's guardrail framing; B: an explicit graph invariant lifted from the opus proposal), and a migration plan whose Step 0 — write the guardrails into planner/AGENTS.md before touching code — is the only place in the corpus that operationalizes keeping the refactor clean after it lands. Its top tier matches the field, and its per-proposal engagement is consistently substantive.
Bad. Two real factual misses. It cites tests/planner/test_planner_agent.py twice; the file does not exist (§3.2), so its test-impact reasoning points at a phantom. And it grades kimi-2.6 B+ at #5 while calling the route_direct → enforce_policy wiring a "nice touch" — that exact wiring is the double-adjust (§3.3); an only-reader could shortlist kimi unaware. It also places the gemini proposal at #7 (B), well above the field's ~8.5 mean, without flagging the divergence.
Only-report verdict: sufficient with corrections. The richest "what to do next" in the corpus. Read alone, apply two patches: strike the phantom test file, and reread kimi's "nice touch" as a bug. The rest of the package survives intact — which is why it stays at #2 despite the cleaner #3.
3. gpt-5.5-range.md (~17 KB)
Their order: fable-5 (A) → gpt-5.4 (A−) → gpt-5.5 (B+) → deepseek → opus → kimi → glm → qwen-3.7 → gemini → mimo → qwen-3.6 (D).
Good. The cleanest factual record in the corpus — verification found no false claims. It has the best articulation of the constraint model that should drive the whole decision: node boundaries are replay/checkpoint boundaries (and with the planner on the outer Postgres saver, every added node is a per-turn write); conditional-edge functions are not "free" transient code; and a checkpoint-namespace bump is a product decision because live threads are parked at the ask_user interrupt. This is the one report whose reader could correctly re-derive judgments about node-count proposals on their own. It is also uniquely self-critical: it flags its own family proposal's ordering flaw (prepare_schema placed before region/currency normalization) — candor that raises trust in everything else. Ends with a concrete hybrid recommendation, a full target mermaid, and a 7-step migration.
Bad. Per-proposal sections are brief and there is no comparison matrix, so some options get verdicts thinner than the decision deserves. It finds neither wiring bug (§3.3, §3.4).
Only-report verdict: sufficient, and the safest. Nothing in it will mislead you; you will simply know less per option than fable-5 or opus-4.7 would have told you. If you weight factual cleanliness over decision depth, swap this with #2.
4. gpt-5.4-range.md (~16 KB)
Their order: gpt-5.4 → fable-5 → gpt-5.5 → opus → deepseek → glm → kimi → gemini → mimo → qwen-3.6 (D) → qwen-3.7 (D−).
Good. Contributes the corpus's one unique verified fact about the repo: what tests/planner/ actually contains (§3.2) — which both grounds migration-test planning and exposes opus-4.7-range's phantom citation. Sharp cross-proposal observations others lack: which designs let observe_user own bootstrap parsing, which keep state serializer-safe, the NoopSearch trick. Its bottom line is goal-conditional (different winners for minimal-diff vs. maximal-structure goals) — honest decision framing rather than a single flattering crown.
Bad. Ranks its own family's proposal #1 with no conflict-of-interest disclosure (the field puts gpt-5.4 at mean 3.00 — defensible, but the silence contrasts with fable-5-range's explicit disclosure). Per-proposal detail is thin, with no line-level evidence, and it misses both wiring bugs.
Only-report verdict: sufficient-minus. Accurate and decision-shaped; the only-reader gets less per-option depth and an undisclosed self-first headline.
5. glm-5.1-range.md (~25 KB)
Their order: fable-5 → deepseek → gpt-5.4 → opus → glm → gpt-5.5 → kimi → qwen-3.6 → qwen-3.7 → gemini → mimo.
Good. Systematic star-rated aspect tables for every proposal with a consistent structure; full coverage. One of only two reports to catch the kimi double-adjust ("may be a subtle behavioral change" — under-sold, but present and correct, §3.3). A genuinely useful cross-proposal naming-mapping table (what each design calls the same node) and a Phase 0–5 migration plan. Calibrated headline, modest self-rank (#5).
Bad. Conflates the gpt-5.5 proposal with observe_region/observe_currency micro-nodes that actually belong to qwen-3.6-plus — a reader comparing those two designs would mis-attribute the architecture. Star ratings imply more precision than the prose supports, and gpt-5.5 at #6 sits below field norm without argued justification.
Only-report verdict: adequate. Correct headline, systematic coverage, one real catch; one cross-attribution error and some rating noise keep it out of the sufficient tier.
6. qwen-3.7-max-range.md (~21 KB)
Their order: fable-5 → deepseek → gpt-5.4 → opus → gpt-5.5 → kimi → glm → qwen-3.7 → qwen-3.6 → mimo → gemini.
Good. The best didactics in the corpus: per-proposal "why not more nodes / why not fewer" sections teach the reader the design space, not just the verdict. Eight explicit weighted axes, full coverage, and an honest self-rank (#8 for its own proposal). Uniquely, it is the only report that correctly caveats the recipes-in-state idea (§3.6) — exactly the right treatment of a landmine two other reports recommend uncaveated.
Bad. It praises the deepseek proposal's plan → plan self-loop as "elegant … keeping the LLM re-prompting behavior explicit" — the field correctly treats that loop as a behavior change versus the current in-node retry (each re-prompt becomes a separate superstep and checkpoint write), and deepseek's #2 rests partly on that praise. No code-verification layer of its own.
Only-report verdict: adequate. You would learn the most about how to think about the options; you would also carry one mis-weighted #2.
7. deepseek-4-pro-range.md (~37 KB)
Their order: kimi → fable-5 → gpt-5.5 → deepseek → glm → gpt-5.4 → gemini → mimo → qwen-3.7 → qwen-3.6 → opus last.
Good. The densest information artifact of the eleven: per-proposal agree/disagree tables, a unique §2.2 disagreement-resolution table that adjudicates cross-report conflicts, a full synthesis design (tick / prepare / calc_gate / acquire / decide / enforce_policy with decision_origin and llm_failed flags) with mermaid, LOC accounting, a 3-phase migration, and an appendix star matrix. As a second report, arguably the most valuable in the corpus.
Bad — and load-bearing. It crowns kimi-2.6 #1 without flagging that this is a hard outlier (field mean ≈ 5.75) and calls the route_direct → enforce_policy wiring "a clean pattern" — the very wiring where _adjust_calculation_decision runs twice (§3.3). A crowned winner with an undisclosed bug is the single most misleading headline an only-report can carry. Its recommended synthesis state embeds recipes: dict[str, FieldAcquisition] — the serializer landmine (§3.6) — without caveat. And it ranks the opus proposal #11, far below field norm, again unflagged.
Only-report verdict: insufficient alone; excellent second. The depth is real, but the bottom line is what a decision-maker acts on — and it points at a bugged winner with a risky state design.
8. qwen-3.6-plus-range.md (~22 KB)
Their order: fable-5 → gpt-5.5 → gpt-5.4 → opus → kimi → glm → deepseek → qwen-3.6 → qwen-3.7 → gemini → mimo.
Good. Its line-range concern table decomposing plan_node (lines 848–1192, with an "LLM needed?" column) is the single best artifact in the corpus for understanding why the god-node should split and along which seams — and it is consistent with the source. Brutally honest self-treatment: ranks its own proposal #8 and explains its own loop-back-to-plan behavioral regression. Good synthesis tables (concept → source → why adopt). Calibrated top-3.
Bad. Mid-table judgments are noisy: deepseek at #7 under an "over-engineered" heading that mismatches that proposal's actual 4-node content; kimi at #5 with no awareness of the double-adjust; a confused critique of qwen-3.7's _route default branch. No verification layer.
Only-report verdict: adequate-minus. Excellent inputs for understanding the problem; mid-table judgments too noisy to act on alone.
9. mimo-2.5-pro-range.md (~23 KB)
Their order: fable-5 (9.2) → gpt-5.5 (9.0) → gpt-5.4 (8.8) → deepseek (8.5) → opus (8.3) → kimi (8.1) → mimo (7.8) → gemini (7.5) → glm (7.3) → qwen-3.6 (7.0) → qwen-3.7 (6.5).
Good. Its convergence-analysis appendix — tallying where the eleven proposals independently agree (split search out: 10/11; keep a single interrupt: 11/11; …) — is the closest thing to a consensus map any report offers and is genuinely valuable to a decision-maker. Full coverage, calibrated headline, sane recommendation (fable-5 base + gpt-5.4's anti-pattern list + optional gpt-5.5 splits).
Bad. Precision-washing: 0.1–0.3 score gaps imply measurement the brief per-proposal justifications cannot support, and mid-table scores drift from the accompanying prose. Contrarian praise of qwen-3.6's explicit bootstrap never engages the checkpoint-cost the field charges it. Finds no bugs; no verification layer.
Only-report verdict: adequate-minus. The appendix and the headline are right; the middle of the table is numerically confident and evidentially thin.
10. kimi-2.6-range.md (~27 KB)
Their order: fable-5 → deepseek → gpt-5.4 → gemini (#4, "best bang for buck") → glm → mimo → gpt-5.5 → kimi → qwen-3.6 → qwen-3.7 → opus last.
Good. Independently catches the qwen-3.7 iteration-skip bug (§3.4) — one of only two reports to do so. High-integrity self-treatment: ranks its own proposal #8 with genuine criticism, and its "what to avoid" list openly contradicts its own design. Detailed, full coverage.
Bad — two only-report landmines. It ranks the gemini proposal #4 on a confused reading of that design's loop-back behavior — a hard outlier (field mean ≈ 8.5) presented as a bargain. And its secondary recommendation is deepseek's recipes caching, the serializer landmine (§3.6), uncaveated. It also joins deepseek-range in putting the opus proposal #11, unflagged.
Only-report verdict: insufficient alone. Real catches and honest self-assessment, but an only-reader could walk away shortlisting gemini's 4-node design plus a state shape that fights MSGPACK_ALLOWLIST.
11. gemini-3.1-pro-range.md (~6 KB)
Their order: four paradigm tiers, no per-proposal ranks (Tier 1 phase-based: gpt-5.4, fable-5, deepseek; Tier 2 minimalist: mimo, glm, gemini; Tier 3 micro-node, not recommended: qwen-3.7, gpt-5.5, kimi, qwen-3.6; Tier 4 edge-heavy anti-pattern: opus).
Good. The tier framing (phase-based vs. minimalist vs. micro-node) is a useful 30-second mental model, and its own 4-node recommendation (prepare / acquisition_gate / llm_plan / policy_guard) is coherent.
Bad — disqualifying for this exercise. There is no per-proposal evaluation at all: eleven proposals are reduced to four buckets, so an only-reader cannot weigh any individual option. Its one strong per-proposal claim is verifiably wrong: the opus proposal is condemned as "edge-heavy anti-pattern" when it is a plain node-chain (§3.5), and gpt-5.5 is tiered "micro-node" alongside designs it does not resemble. At ~6 KB it simply does not carry the information this decision needs.
Only-report verdict: insufficient. Useful as a one-page taxonomy after reading a real report; as the only report it would have you reject a proposal for a sin it doesn't commit.
5. Comparison table
| # | Report | Fact integrity | Coverage | Decision power | Risk coverage | Calibration | Only-report verdict |
|---|---|---|---|---|---|---|---|
| 1 | fable-5 | High — 6/6 facts re-verified; both bugs real | Full + matrix + rejected-ideas list | High — composition + 3-step order | High | On-consensus; COI disclosed | Sufficient |
| 2 | opus-4.7 | Med — phantom test file ×2 | Full, deepest per-proposal prose | Highest — amendments + Step-0 guardrails | High | Mostly on-consensus; soft on kimi/gemini | Sufficient w/ corrections |
| 3 | gpt-5.5 | Highest — no errors found | Full but brief; no matrix | High — hybrid + mermaid + 7 steps | Highest — best constraints primer | On-consensus; self-critical | Sufficient, safest |
| 4 | gpt-5.4 | High — unique test-reality check | Full, thin detail | Med-high — goal-conditional advice | Med | Self-first, undisclosed | Sufficient-minus |
| 5 | glm-5.1 | Med-high — one design conflation | Full + aspect tables | Med-high — Phase 0–5 plan | Med | On-consensus | Adequate |
| 6 | qwen-3.7-max | Med-high — self-loop praise miss | Full + best didactics | Med-high — 5 borrowings | Med-high — recipes caveat | Mostly on-consensus | Adequate |
| 7 | deepseek-4-pro | Med — bug in its crowned #1 | Full, deepest tables | High but compromised headline | Med — recipes landmine | Outlier #1 and #11, unflagged | Insufficient alone; best second |
| 8 | qwen-3.6-plus | Med | Full + line-range table | Med | Med | Top calibrated; mid noisy | Adequate-minus |
| 9 | mimo-2.5-pro | Med | Full + convergence appendix | Med | Low-med | Headline fine; precision-washed | Adequate-minus |
| 10 | kimi-2.6 | Med — gemini misread | Full | Med — landmine secondary rec | Low-med | Two hard outliers, unflagged | Insufficient alone |
| 11 | gemini-3.1-pro | Low — opus misattribution | None per-proposal | Low-med | Low | Tier outliers | Insufficient |
6. What even the best report misses
- Nobody measures. The entire node-count debate rides on "each node boundary is a Postgres checkpoint write" — true (conversation/graph.py:58) — yet no report instruments a single turn to quantify the cost per superstep. One instrumented run would settle the 4-node vs. 8-node argument with a number.
- Line-anchored evidence rots. fable-5-range's referee facts and qwen-3.6-plus's line table are correct today; whoever implements should re-anchor every citation by symbol first.
- No report runs anything. None validates its winner by adapting tests/planner/test_planner_decisions.py or test_planner_runner.py to a prototype graph — the cheapest possible falsification, left on the table by all eleven.
- The parked-thread migration is named, not solved. gpt-5.5-range correctly flags that a checkpoint-namespace bump is a product decision because live threads sit at the ask_user interrupt; no report designs the drain/upgrade/rollback path for those threads.
- No rollout mechanics. No feature flag, dual-graph, or shadow-run strategy appears anywhere in the corpus.
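The missing measurement from the first point need not be elaborate. As a rough sketch, a thin wrapper around whatever callable performs the checkpoint write can count calls and accumulate elapsed time per node. Everything here is hypothetical: fake_write stands in for the real saver, the node names are illustrative, and how such a wrapper would attach to the actual Postgres saver is not shown.

```python
import time
from collections import Counter


class WriteMeter:
    """Wrap a write callable and record call counts and elapsed seconds per node."""

    def __init__(self, write):
        self._write = write
        self.calls = Counter()
        self.seconds = Counter()

    def __call__(self, node, payload):
        start = time.perf_counter()
        try:
            return self._write(node, payload)
        finally:
            # Record the attempt even if the underlying write raises.
            self.calls[node] += 1
            self.seconds[node] += time.perf_counter() - start


def fake_write(node, payload):
    """Hypothetical stand-in for a real checkpoint write."""
    return len(payload)


meter = WriteMeter(fake_write)
for node in ["prepare", "decide", "search", "decide", "enforce_policy"]:
    meter(node, {"node": node})

total_writes = sum(meter.calls.values())
print(total_writes, dict(meter.calls))
```

Comparing the totals for one representative turn under a 4-node and an 8-node prototype would give the number the debate currently lacks.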
7. Bottom line
- Read only one: fable-5-range.md — the only report whose factual layer survives end-to-end re-verification, with both decisive bug-finds, full coverage, and an execution order.
- Allowed two: add opus-4.7-range.md for its amendments and Step-0 guardrail discipline — with the §3.2/§3.3 corrections applied.
- If you discount Claude-family self-ranking entirely: gpt-5.5-range.md is the best single read from outside the family, and its headline independently agrees.
- Never as sole source: gemini-3.1-pro-range.md. Treat deepseek-4-pro-range.md and kimi-2.6-range.md as rich seconds whose headlines need this document's §3 before use.