Planner Graph Refactor: Consolidated Rankings Matrix
Date: 2026-06-11 (remade analyses, including the new fable-5 proposal)
Source: docs/planner-graph-ref/analyse/*-range.md (11 model rankings)
Subject: 11 architectural proposals for decomposing plan_node in src/venturescope/planner/agent.py
This document collapses all 11 model rankings onto a single matrix so the agreement (and disagreement) between evaluators is visible at a glance.
Scores follow the intuitive convention: 11 = best, 1 = worst. Each
evaluator's original 1–11 ranking has been inverted with score = 12 − rank,
so higher values mean a better proposal everywhere in this document.
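The inversion described above can be sketched in a few lines of Python. The function name `rank_to_score` and its `n_proposals` parameter are illustrative assumptions, not names taken from the analysis scripts.

```python
def rank_to_score(rank: float, n_proposals: int = 11) -> float:
    """Invert a rank so that higher is better: score = (n + 1) - rank."""
    if not 1 <= rank <= n_proposals:
        raise ValueError(f"rank must be within 1..{n_proposals}, got {rank}")
    return (n_proposals + 1) - rank


# Best rank, second-best rank, and worst rank for an 11-proposal field.
example_scores = [rank_to_score(r) for r in (1, 2, 11)]
print(example_scores)  # [11, 10, 1]
```

Fractional ranks (such as the tier averages used for one evaluator) pass through unchanged, so a rank of 8.5 becomes a score of 3.5.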
For the method-based evaluator analysis (which evaluator sits closest to the
consensus, and which is the medoid — with self-rankings excluded), see
evaluator-consensus-ranking.md.
How to read
- Rows = proposal being judged (the architectural design under evaluation).
- Columns = evaluator model (the analyst that produced the ranking).
- Cell value = the score that evaluator gave to that proposal (11 = best, 1 = worst).
- gemini-3.1-pro again produced ordered tiers instead of a strict 1–11 ranking, but this time the tiers cover all 11 proposals, so its column holds tier-imputed slot-average scores (10 / 7 / 3.5 / 1) rather than gaps.
- Rows are sorted by mean score (highest at top); means include self-scores.
- The diagonal (highlighted with *) shows each model's self-score.
Full rankings matrix
| Proposal | DS | FAB | GEM | GLM | G54 | G55 | KIM | MIM | OPS | Q36 | Q37 | Mean | Median |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fable-5 | 10 | 11 | 10 | 11 | 10 | 11 | 11 | 11 | 11 | 11 | 11 | 10.7 | 11 |
| gpt-5.4 | 6 | 9 | 10 | 9 | 11 | 10 | 9 | 9 | 10 | 9 | 9 | 9.2 | 9 |
| gpt-5.5 | 9 | 10 | 3.5 | 6 | 9 | 9 | 5 | 10 | 9 | 10 | 7 | 8.0 | 9 |
| deepseek-4-pro | 8 | 7 | 10 | 10 | 7 | 8 | 10 | 8 | 4 | 5 | 10 | 7.9 | 8 |
| kimi-2.6 | 11 | 6 | 3.5 | 5 | 5 | 6 | 4 | 6 | 7 | 7 | 6 | 6.0 | 6 |
| opus | 1 | 8 | 1 | 8 | 8 | 7 | 1 | 7 | 8 | 8 | 8 | 5.9 | 8 |
| glm-5.1 | 7 | 5 | 7 | 7 | 6 | 5 | 7 | 3 | 3 | 6 | 5 | 5.5 | 6 |
| gemini-3.1-pro | 5 | 1 | 7 | 2 | 4 | 3 | 8 | 4 | 5 | 2 | 1 | 3.8 | 4 |
| mimo-2.5-pro | 4 | 3 | 7 | 1 | 3 | 2 | 6 | 5 | 6 | 1 | 2 | 3.6 | 3 |
| qwen-3.7-max | 3 | 4 | 3.5 | 3 | 1 | 4 | 2 | 1 | 1 | 3 | 4 | 2.7 | 3 |
| qwen-3.6-plus | 2 | 2 | 3.5 | 4 | 2 | 1 | 3 | 2 | 2 | 4 | 3 | 2.6 | 2 |
Column legend (evaluators): DS = deepseek-4-pro · FAB = fable-5 · GEM = gemini-3.1-pro · GLM = glm-5.1 · G54 = gpt-5.4 · G55 = gpt-5.5 · KIM = kimi-2.6 · MIM = mimo-2.5-pro · OPS = opus-4.7 · Q36 = qwen-3.6-plus · Q37 = qwen-3.7-max
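As a check on the Mean and Median columns, the sketch below recomputes them for two rows copied from the matrix. It assumes means are rounded to one decimal place, which matches the values shown; the helper name `summarise` is hypothetical.

```python
from statistics import mean, median

# Scores copied from two matrix rows (evaluator order: DS, FAB, GEM, GLM,
# G54, G55, KIM, MIM, OPS, Q36, Q37).
rows = {
    "fable-5": [10, 11, 10, 11, 10, 11, 11, 11, 11, 11, 11],
    "gpt-5.4": [6, 9, 10, 9, 11, 10, 9, 9, 10, 9, 9],
}


def summarise(scores):
    """Return (mean rounded to one decimal, median) for one proposal row."""
    return round(mean(scores), 1), median(scores)


summary = {name: summarise(scores) for name, scores in rows.items()}
for name, (row_mean, row_median) in summary.items():
    print(f"{name:<10} mean={row_mean} median={row_median}")
```

Because self-scores are included, each row is averaged over all eleven cells.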
Aggregate ranking (single visual)
Bars use mean score (higher = better). Each █ ≈ 0.5 score points.
fable-5 10.7 █████████████████████▍
gpt-5.4 9.2 ██████████████████▍
gpt-5.5 8.0 ████████████████
deepseek-4-pro 7.9 ███████████████▊
kimi-2.6 6.0 ████████████
opus 5.9 ███████████▊
glm-5.1 5.5 ███████████
gemini-3.1-pro 3.8 ███████▋
mimo-2.5-pro 3.6 ███████▎
qwen-3.7-max 2.7 █████▍
qwen-3.6-plus 2.6 █████▎
└──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
1 2 3 4 5 6 7 8 9 10 11 score
(worst) (best)
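A minimal sketch of how bars like these can be rendered: two full blocks per score point, with the remainder rounded to the nearest eighth-width block character. This is an assumption about the drawing rule, although it reproduces the bars shown above; `render_bar` and `PARTIAL_BLOCKS` are hypothetical names.

```python
# Unicode left-aligned partial blocks for 0/8 through 7/8 of a cell.
PARTIAL_BLOCKS = ["", "▏", "▎", "▍", "▌", "▋", "▊", "▉"]


def render_bar(score: float, blocks_per_point: int = 2) -> str:
    """Render a score as full blocks plus one partial block (nearest eighth)."""
    eighths = round(score * blocks_per_point * 8)
    full, remainder = divmod(eighths, 8)
    return "█" * full + PARTIAL_BLOCKS[remainder]


for name, score in [("fable-5", 10.7), ("gpt-5.5", 8.0), ("gemini-3.1-pro", 3.8)]:
    print(f"{name:<15} {score:>4} {render_bar(score)}")
```

Rounding to whole eighths keeps bar lengths stable when two means differ only in the second decimal.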
Full matrix line chart
Each line is one proposal, labeled in its own colour at the right-hand endpoint of the line. A dotted horizontal line in the same colour marks that proposal's aggregate mean score from the ranking table. The x-axis lists evaluators; the y-axis is the score that evaluator assigned (11 = best, 1 = worst — so higher lines = better-scored proposals, a flat line = consensus rating, a zig-zag line = disagreement among evaluators).
| # | Proposal | Mean score | Visual signature |
|---|---|---|---|
| 1 | fable-5 | 10.7 | pinned to the top — 11 from eight evaluators, never below 10 |
| 2 | gpt-5.4 | 9.2 | high and stable 9–11, one dip to 6 (deepseek-4-pro) |
| 3 | gpt-5.5 | 8.0 | mostly 9–10, dips to 3.5/5/6 (gemini tier, kimi, glm) |
| 4 | deepseek-4-pro | 7.9 | high band with a late slide to 4–5 (opus-4.7, qwen-3.6-plus) |
| 5 | kimi-2.6 | 6.0 | flat around 5–7 with one spike to 11 (deepseek-4-pro) |
| 6 | opus | 5.9 | bimodal — 7–8 from eight evaluators, three crashes to 1 |
| 7 | glm-5.1 | 5.5 | mid band 5–7, two dips to 3 (mimo-2.5-pro, opus-4.7) |
| 8 | gemini-3.1-pro | 3.8 | low 1–5 with one spike to 8 (kimi-2.6) |
| 9 | mimo-2.5-pro | 3.6 | low 1–7, peak from gemini's mid tier |
| 10 | qwen-3.7-max | 2.7 | hugs the bottom 1–4 |
| 11 | qwen-3.6-plus | 2.6 | hugs the bottom 1–4 |
X-axis evaluators: DS = deepseek-4-pro · FAB = fable-5 · GEM = gemini-3.1-pro · GLM = glm-5.1 · G54 = gpt-5.4 · G55 = gpt-5.5 · KIM = kimi-2.6 · MIM = mimo-2.5-pro · OPS = opus-4.7 · Q36 = qwen-3.6-plus · Q37 = qwen-3.7-max.
Unlike the 2026-06-08 edition, the GEM (gemini-3.1-pro) column is included
in the chart: its remade tier output covers all 11 proposals, so its
tier-imputed scores (10 / 7 / 3.5 / 1) plot as ordinary points.
Cross-cutting observations
Consensus winner
fable-5 — the proposal that did not exist in the 2026-06-08 edition — swept
the field: eight of eleven evaluators ranked it #1 (score 11, including its
own disclosed self-vote) and the other three ranked it #2 (score 10). It is
never scored below 10 (mean 10.7, median 11). The June-8 winner gpt-5.4
(then 9.4) is displaced to second.
Stable second tier
gpt-5.4 keeps its remarkable consistency: ten of eleven evaluators scored it
9 or higher; the only dissent is deepseek-4-pro (6). gpt-5.5 (8.0) and
deepseek-4-pro (7.9) are effectively tied for third — their unrounded
means differ by 0.05.
Consensus bottom
The qwen pair sits at the bottom again, and they are also effectively tied:
qwen-3.6-plus (2.6) and qwen-3.7-max (2.7) are never scored above 4 by any
evaluator. mimo-2.5-pro (3.6) and gemini-3.1-pro (3.8) cluster just above.
High-variance proposals
- opus is the most polarising proposal in the set: three evaluators (deepseek-4-pro, gemini-3.1-pro, kimi-2.6) score it 1 while the remaining eight give 7–8, hence median 8 against mean 5.9. The split tracks whether the evaluator punishes its per-corrector node explosion or rewards its analysis depth.
- kimi-2.6 has the single biggest outlier cell: deepseek-4-pro crowns it #1 (score 11) while no other evaluator goes above 7.
- gpt-5.5 spans 3.5 → 10: gemini's "micro-node" tier (3.5) and kimi-2.6 (5) punish its granularity; seven evaluators score it 9–10.
- deepseek-4-pro spans 4 → 10: four evaluators give it 10, but opus-4.7 (4) and qwen-3.6-plus (5) flag its recipes-in-state and self-loop risks.
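Two simple disagreement signals for these four proposals are the score range (best minus worst) and the gap between median and mean. The sketch below computes both from the matrix rows; the choice of these two measures and the helper name `spread` are illustrative assumptions, not the method used in the original analysis.

```python
from statistics import mean, median

# Rows copied from the full rankings matrix (evaluator order: DS, FAB, GEM,
# GLM, G54, G55, KIM, MIM, OPS, Q36, Q37).
rows = {
    "opus": [1, 8, 1, 8, 8, 7, 1, 7, 8, 8, 8],
    "kimi-2.6": [11, 6, 3.5, 5, 5, 6, 4, 6, 7, 7, 6],
    "gpt-5.5": [9, 10, 3.5, 6, 9, 9, 5, 10, 9, 10, 7],
    "deepseek-4-pro": [8, 7, 10, 10, 7, 8, 10, 8, 4, 5, 10],
}


def spread(scores):
    """Return (range, median minus mean) for one proposal row."""
    return max(scores) - min(scores), median(scores) - mean(scores)


for name, scores in rows.items():
    score_range, gap = spread(scores)
    print(f"{name:<15} range={score_range:<4} median-mean={gap:+.2f}")
```

A large positive gap, as for opus, indicates a few very low scores pulling the mean below an otherwise favourable majority.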
Self-ranking bias (diagonal)
Group mean rank = each proposal's average rank over the other ten
evaluators (from evaluator-consensus-ranking.md §2.1; lower = better).
| Evaluator | Self-score | Self-rank | Group mean rank | Reads as |
|---|---|---|---|---|
| fable-5 | 11 | 1 | 1.30 | self-favouring — and the group agrees |
| gpt-5.4 | 11 | 1 | 3.00 | self-favouring (group says #2–3) |
| gpt-5.5 | 9 | 3 | 4.15 | mildly self-favouring |
| deepseek-4-pro | 8 | 4 | 4.10 | honest — matches the group |
| opus-4.7 | 8 | 4 | 6.30 | self-favouring (group says #6–7) |
| gemini-3.1-pro | 7 | 5 (mid tier) | 8.50 | self-favouring (group says #8–9) |
| glm-5.1 | 7 | 5 | 6.60 | mildly self-favouring |
| mimo-2.5-pro | 5 | 7 | 8.50 | self-critical — the group is harsher still |
| kimi-2.6 | 4 | 8 | 5.75 | overly self-critical (group says #5–6) |
| qwen-3.6-plus | 4 | 8 | 9.55 | self-critical — the group is harsher still |
| qwen-3.7-max | 4 | 8 | 9.45 | self-critical — the group is harsher still |
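The group mean rank can be reproduced from the matrix by dropping the self-score, averaging the remaining ten scores, and converting back to a rank with rank = 12 − score. This is a sketch under the assumption that evaluator-consensus-ranking.md derives the figure this way (the results agree with the table); `group_mean_rank` is a hypothetical helper name.

```python
def group_mean_rank(row, self_index, n_proposals=11):
    """Average rank given by all evaluators except the proposal's own model."""
    others = row[:self_index] + row[self_index + 1:]
    mean_score = sum(others) / len(others)
    return (n_proposals + 1) - mean_score


# Rows copied from the matrix; column order is DS, FAB, GEM, GLM, G54, G55,
# KIM, MIM, OPS, Q36, Q37, so FAB is index 1 and KIM is index 6.
fable_row = [10, 11, 10, 11, 10, 11, 11, 11, 11, 11, 11]
kimi_row = [11, 6, 3.5, 5, 5, 6, 4, 6, 7, 7, 6]

print(round(group_mean_rank(fable_row, self_index=1), 2))  # 1.3
print(round(group_mean_rank(kimi_row, self_index=6), 2))   # 5.75
```

Comparing this value with the self-rank gives the direction of the bias: a self-rank well below the group mean rank reads as self-favouring, and one well above it as self-critical.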
Tier-grouped ranking
gemini-3.1-pro again emitted tiers instead of a strict order, but its remade
output covers all 11 proposals in 4 ordered tiers, so no cells are missing
this time. Tier members receive the average of the slots the tier occupies:
| Tier | Members | Imputed rank → score |
|---|---|---|
| 1 "Phase-Based Pipeline" | gpt-5.4, fable-5, deepseek-4-pro | 2 → 10 |
| 2 "Minimalist Stage" | mimo-2.5-pro, glm-5.1, gemini-3.1-pro | 5 → 7 |
| 3 "Micro-node Explosion" | qwen-3.7-max, gpt-5.5, kimi-2.6, qwen-3.6-plus | 8.5 → 3.5 |
| 4 "Edge-Heavy Anti-Pattern" | opus | 11 → 1 |
The column still sums to 66 (= Σ1..11 in score form), so it is rank-consistent and included in all means.
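The slot-averaging rule can be sketched as follows: each tier occupies consecutive rank slots, every member receives the average of those slots, and the rank is then inverted into a score. The function name `impute_tier_scores` is hypothetical; the tier membership is copied from the table above.

```python
def impute_tier_scores(tiers):
    """Map each proposal to a score from ordered tiers (best tier first)."""
    n_proposals = sum(len(members) for members in tiers)
    scores = {}
    next_slot = 1
    for members in tiers:
        # Slots this tier occupies, e.g. 1..3 for a three-member first tier.
        slots = range(next_slot, next_slot + len(members))
        average_rank = sum(slots) / len(members)
        for proposal in members:
            scores[proposal] = (n_proposals + 1) - average_rank
        next_slot += len(members)
    return scores


tiers = [
    ["gpt-5.4", "fable-5", "deepseek-4-pro"],
    ["mimo-2.5-pro", "glm-5.1", "gemini-3.1-pro"],
    ["qwen-3.7-max", "gpt-5.5", "kimi-2.6", "qwen-3.6-plus"],
    ["opus"],
]
gem_column = impute_tier_scores(tiers)
print(gem_column["gpt-5.4"], gem_column["gpt-5.5"], sum(gem_column.values()))
```

Averaging slots preserves the column total of 66, which is why the imputed column can be mixed with strict rankings when computing means.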
Final ranking (mean-sorted)
| Final rank | Proposal | Mean | Median | Best | Worst |
|---|---|---|---|---|---|
| 1 | fable-5 | 10.7 | 11 | 11 | 10 |
| 2 | gpt-5.4 | 9.2 | 9 | 11 | 6 |
| 3 | gpt-5.5 | 8.0 | 9 | 10 | 3.5 |
| 4 | deepseek-4-pro | 7.9 | 8 | 10 | 4 |
| 5 | kimi-2.6 | 6.0 | 6 | 11 | 3.5 |
| 6 | opus | 5.9 | 8 | 8 | 1 |
| 7 | glm-5.1 | 5.5 | 6 | 7 | 3 |
| 8 | gemini-3.1-pro | 3.8 | 4 | 8 | 1 |
| 9 | mimo-2.5-pro | 3.6 | 3 | 7 | 1 |
| 10 | qwen-3.7-max | 2.7 | 3 | 4 | 1 |
| 11 | qwen-3.6-plus | 2.6 | 2 | 4 | 1 |
Ranks 3–4 (gpt-5.5 vs deepseek-4-pro, Δmean = 0.05) and ranks 10–11
(qwen-3.7-max vs qwen-3.6-plus, Δmean = 0.09) are effectively ties.