Planner Graph Refactor — Consolidated Rankings Matrix

Date: 2026-06-11 (remade analyses, including the new fable-5 proposal)
Source: docs/planner-graph-ref/analyse/*-range.md (11 model rankings)
Subject: 11 architectural proposals for decomposing plan_node in src/venturescope/planner/agent.py

This document collapses all 11 model rankings into a single matrix so that agreement (and disagreement) between evaluators is visible at a glance.

Scores follow the intuitive convention: 11 = best, 1 = worst. Each evaluator's original 1–11 ranking has been inverted with score = 12 − rank, so higher values mean a better proposal everywhere in this document.
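The inversion can be sketched in a few lines of Python. This is only an illustration: the function name `rank_to_score` and the sample ranking are hypothetical and do not come from the analysis files.

```python
N_PROPOSALS = 11  # number of proposals ranked by each evaluator


def rank_to_score(rank: float, n: int = N_PROPOSALS) -> float:
    """Invert a 1..n rank (1 = best) into a score (n = best)."""
    return (n + 1) - rank


# Hypothetical ranking from one evaluator: proposal -> rank (1 = best).
sample_ranking = {"proposal-a": 1, "proposal-b": 2, "proposal-c": 11}
sample_scores = {name: rank_to_score(r) for name, r in sample_ranking.items()}
print(sample_scores)
```

Fractional ranks work the same way, which is what allows a tier-averaged rank such as 8.5 to become a score of 3.5.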

For the method-based evaluator analysis (which evaluator sits closest to the consensus, and which is the medoid — with self-rankings excluded), see evaluator-consensus-ranking.md.


How to read

  • Rows = proposal being judged (the architectural design under evaluation).
  • Columns = evaluator model (the analyst that produced the ranking).
  • Cell value = the score that evaluator gave to that proposal (11 = best, 1 = worst).
  • gemini-3.1-pro again produced ordered tiers instead of a strict 1–11 ranking — but this time covering all 11 proposals, so its column holds tier-imputed slot-average scores (10 / 7 / 3.5 / 1) rather than gaps.
  • Rows are sorted by mean score (highest at top); means include self-scores.
  • The self-score cells (the "diagonal", highlighted with *) show each model's score for its own proposal.

Full rankings matrix

Proposal        DS    FAB   GEM   GLM   G54   G55   KIM   MIM   OPS   Q36   Q37   Mean  Median
fable-5         10    11*   10    11    10    11    11    11    11    11    11    10.7  11
gpt-5.4         6     9     10    9     11*   10    9     9     10    9     9     9.2   9
gpt-5.5         9     10    3.5   6     9     9*    5     10    9     10    7     8.0   9
deepseek-4-pro  8*    7     10    10    7     8     10    8     4     5     10    7.9   8
kimi-2.6        11    6     3.5   5     5     6     4*    6     7     7     6     6.0   6
opus            1     8     1     8     8     7     1     7     8*    8     8     5.9   8
glm-5.1         7     5     7     7*    6     5     7     3     3     6     5     5.5   6
gemini-3.1-pro  5     1     7*    2     4     3     8     4     5     2     1     3.8   4
mimo-2.5-pro    4     3     7     1     3     2     6     5*    6     1     2     3.6   3
qwen-3.7-max    3     4     3.5   3     1     4     2     1     1     3     4*    2.7   3
qwen-3.6-plus   2     2     3.5   4     2     1     3     2     2     4*    3     2.6   2

Column legend (evaluators): DS = deepseek-4-pro · FAB = fable-5 · GEM = gemini-3.1-pro · GLM = glm-5.1 · G54 = gpt-5.4 · G55 = gpt-5.5 · KIM = kimi-2.6 · MIM = mimo-2.5-pro · OPS = opus-4.7 · Q36 = qwen-3.6-plus · Q37 = qwen-3.7-max
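A minimal sketch, assuming the matrix is held as a plain dictionary, of how the Mean and Median columns and the row order can be recomputed. The names `MATRIX` and `summarise` are illustrative; the scores are copied from the matrix above.

```python
from statistics import mean, median

# Scores copied from the matrix, in column order:
# DS, FAB, GEM, GLM, G54, G55, KIM, MIM, OPS, Q36, Q37
MATRIX = {
    "fable-5":        [10, 11, 10, 11, 10, 11, 11, 11, 11, 11, 11],
    "gpt-5.4":        [6, 9, 10, 9, 11, 10, 9, 9, 10, 9, 9],
    "gpt-5.5":        [9, 10, 3.5, 6, 9, 9, 5, 10, 9, 10, 7],
    "deepseek-4-pro": [8, 7, 10, 10, 7, 8, 10, 8, 4, 5, 10],
    "kimi-2.6":       [11, 6, 3.5, 5, 5, 6, 4, 6, 7, 7, 6],
    "opus":           [1, 8, 1, 8, 8, 7, 1, 7, 8, 8, 8],
    "glm-5.1":        [7, 5, 7, 7, 6, 5, 7, 3, 3, 6, 5],
    "gemini-3.1-pro": [5, 1, 7, 2, 4, 3, 8, 4, 5, 2, 1],
    "mimo-2.5-pro":   [4, 3, 7, 1, 3, 2, 6, 5, 6, 1, 2],
    "qwen-3.7-max":   [3, 4, 3.5, 3, 1, 4, 2, 1, 1, 3, 4],
    "qwen-3.6-plus":  [2, 2, 3.5, 4, 2, 1, 3, 2, 2, 4, 3],
}


def summarise(scores):
    """Return (mean rounded to one decimal, median) for one proposal row."""
    return round(mean(scores), 1), median(scores)


# Sort rows by unrounded mean, highest first, as the matrix does.
ranked = sorted(MATRIX, key=lambda name: mean(MATRIX[name]), reverse=True)
for name in ranked:
    row_mean, row_median = summarise(MATRIX[name])
    print(f"{name:<16}{row_mean:>5}  {row_median}")
```

Sorting on the unrounded mean matters for the near-ties: gpt-5.5 and deepseek-4-pro differ only in the second decimal place.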


Aggregate ranking (single visual)

Bars use mean score (higher = better). Each █ ≈ 0.5 score points.

fable-5         10.7 █████████████████████▍
gpt-5.4          9.2 ██████████████████▍
gpt-5.5          8.0 ████████████████
deepseek-4-pro   7.9 ███████████████▊
kimi-2.6         6.0 ████████████
opus             5.9 ███████████▊
glm-5.1          5.5 ███████████
gemini-3.1-pro   3.8 ███████▋
mimo-2.5-pro     3.6 ███████▎
qwen-3.7-max     2.7 █████▍
qwen-3.6-plus    2.6 █████▎
                   └──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
                   1  2  3  4  5  6  7  8  9  10 11  score
                  (worst)                         (best)
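The bars appear to follow a simple rule: two character cells per score point, with the fractional remainder drawn using Unicode eighth-block characters. The sketch below is an assumption about how such a chart could be regenerated, not the script that produced it; `render_bar` and `CELLS_PER_POINT` are illustrative names.

```python
# Index i holds the glyph that fills i/8 of a character cell.
PARTIAL_BLOCKS = ["", "▏", "▎", "▍", "▌", "▋", "▊", "▉"]
CELLS_PER_POINT = 2  # each full block is about 0.5 score points


def render_bar(score: float) -> str:
    """Render a score as full blocks plus one partial block, to the nearest 1/8 cell."""
    eighths = round(score * CELLS_PER_POINT * 8)
    full, remainder = divmod(eighths, 8)
    return "█" * full + PARTIAL_BLOCKS[remainder]


for name, score in [("fable-5", 10.7), ("gpt-5.5", 8.0), ("qwen-3.6-plus", 2.6)]:
    print(f"{name:<16}{score:>5} {render_bar(score)}")
```

Because the input is the mean already rounded to one decimal, two proposals with the same printed mean always get the same bar.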

Full matrix line chart

Each line is one proposal, labeled in its own colour at the right-hand endpoint of the line. A dotted horizontal line in the same colour marks that proposal's aggregate mean score from the ranking table. The x-axis lists evaluators; the y-axis is the score that evaluator assigned (11 = best, 1 = worst — so higher lines = better-scored proposals, a flat line = consensus rating, a zig-zag line = disagreement among evaluators).

[Figure: score per evaluator, one line per proposal, labeled at endpoint]

#   Proposal        Mean score  Visual signature
1   fable-5         10.7        pinned to the top: 11 from eight evaluators, never below 10
2   gpt-5.4         9.2         high and stable 9–11, one dip to 6 (deepseek-4-pro)
3   gpt-5.5         8.0         mostly 9–10, dips to 3.5/5/6 (gemini tier, kimi, glm)
4   deepseek-4-pro  7.9         high band with a late slide to 4–5 (opus-4.7, qwen-3.6-plus)
5   kimi-2.6        6.0         mostly 4–7, with one spike to 11 (deepseek-4-pro)
6   opus            5.9         bimodal: 7–8 from eight evaluators, three crashes to 1
7   glm-5.1         5.5         mid band 5–7, two dips to 3 (mimo-2.5-pro, opus-4.7)
8   gemini-3.1-pro  3.8         low 1–5, with a self-scored 7 and one spike to 8 (kimi-2.6)
9   mimo-2.5-pro    3.6         low 1–7, peak from gemini's mid tier
10  qwen-3.7-max    2.7         hugs the bottom 1–4
11  qwen-3.6-plus   2.6         hugs the bottom 1–4

X-axis evaluators: DS = deepseek-4-pro · FAB = fable-5 · GEM = gemini-3.1-pro · GLM = glm-5.1 · G54 = gpt-5.4 · G55 = gpt-5.5 · KIM = kimi-2.6 · MIM = mimo-2.5-pro · OPS = opus-4.7 · Q36 = qwen-3.6-plus · Q37 = qwen-3.7-max.

Unlike the 2026-06-08 edition, the GEM (gemini-3.1-pro) column is included in the chart: its remade tier output covers all 11 proposals, so its tier-imputed scores (10 / 7 / 3.5 / 1) plot as ordinary points.


Cross-cutting observations

Consensus winner

fable-5 — the proposal that did not exist in the 2026-06-08 edition — swept the field: eight of eleven evaluators ranked it #1 (score 11, including its own disclosed self-vote) and the other three ranked it #2 (score 10). It is never scored below 10 (mean 10.7, median 11). The June-8 winner gpt-5.4 (then 9.4) is displaced to second.

Stable second tier

gpt-5.4 keeps its remarkable consistency: ten of eleven evaluators scored it 9 or higher; the only dissent is deepseek-4-pro (6). gpt-5.5 (8.0) and deepseek-4-pro (7.9) are effectively tied for third — their unrounded means differ by 0.05.

Consensus bottom

The qwen pair sits at the bottom again and is effectively tied: qwen-3.6-plus (2.6) and qwen-3.7-max (2.7) differ by only 0.09 in unrounded mean, and neither is scored above 4 by any evaluator. mimo-2.5-pro (3.6) and gemini-3.1-pro (3.8) cluster just above.

High-variance proposals

  • opus is the most polarising proposal in the set: three evaluators (deepseek-4-pro, gemini-3.1-pro, kimi-2.6) score it 1 while the remaining eight give 7–8 — hence median 8 against mean 5.9. The split tracks whether the evaluator punishes its per-corrector node explosion or rewards its analysis depth.
  • kimi-2.6 has the single biggest outlier cell: deepseek-4-pro crowns it #1 (score 11) while no other evaluator goes above 7.
  • gpt-5.5 spans 3.5 → 10: gemini's "Micro-node Explosion" tier (3.5) and kimi-2.6 (5) punish its granularity, while seven evaluators score it 9–10.
  • deepseek-4-pro spans 4 → 10: four evaluators give it 10, but opus-4.7 (4) and qwen-3.6-plus (5) flag its recipes-in-state and self-loop risks.

Self-ranking bias (diagonal)

Group mean rank = each proposal's average rank over the other ten evaluators (from evaluator-consensus-ranking.md §2.1; lower = better).

Evaluator       Self-score  Self-rank     Group mean rank  Reads as
fable-5         11          1             1.30             self-favouring, and the group agrees
gpt-5.4         11          1             3.00             self-favouring (group says #2–3)
gpt-5.5         9           3             4.15             mildly self-favouring
deepseek-4-pro  8           4             4.10             honest; matches the group
opus-4.7        8           4             6.30             self-favouring (group says #6–7)
gemini-3.1-pro  7           5 (mid tier)  8.50             self-favouring (group says #8–9)
glm-5.1         7           5             6.60             mildly self-favouring
mimo-2.5-pro    5           7             8.50             self-critical, but the group is harsher still
kimi-2.6        4           8             5.75             overly self-critical (group says #5–6)
qwen-3.6-plus   4           8             9.55             self-critical, but the group is harsher still
qwen-3.7-max    4           8             9.45             self-critical, but the group is harsher still
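The group mean rank values can be reproduced from the full rankings matrix by dropping each proposal's self-score and averaging 12 − score over the remaining ten columns. The sketch below assumes exactly that; it matches the figures in the table, but it is not confirmed to be the procedure used in evaluator-consensus-ranking.md, and `group_mean_rank` is an illustrative name.

```python
EVALUATORS = ["DS", "FAB", "GEM", "GLM", "G54", "G55", "KIM", "MIM", "OPS", "Q36", "Q37"]
N = len(EVALUATORS)  # 11 evaluators, each ranking 11 proposals


def group_mean_rank(scores, self_column):
    """Average rank (12 - score) over every evaluator except the proposal's author."""
    others = [s for ev, s in zip(EVALUATORS, scores) if ev != self_column]
    return sum((N + 1) - s for s in others) / len(others)


# Rows copied from the full rankings matrix.
kimi = [11, 6, 3.5, 5, 5, 6, 4, 6, 7, 7, 6]
opus = [1, 8, 1, 8, 8, 7, 1, 7, 8, 8, 8]

print(f"kimi-2.6: {group_mean_rank(kimi, 'KIM'):.2f}")
print(f"opus:     {group_mean_rank(opus, 'OPS'):.2f}")
```

Comparing this value with the self-rank gives the direction of the bias: a self-rank lower than the group mean rank reads as self-favouring, a higher one as self-critical.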

Tier-grouped ranking

gemini-3.1-pro again emitted tiers instead of a strict order, but its remade output covers all 11 proposals in 4 ordered tiers, so no cells are missing this time. Tier members receive the average of the slots the tier occupies:

Tier                         Members                                         Imputed rank → score
1 "Phase-Based Pipeline"     gpt-5.4, fable-5, deepseek-4-pro                2 → 10
2 "Minimalist Stage"         mimo-2.5-pro, glm-5.1, gemini-3.1-pro           5 → 7
3 "Micro-node Explosion"     qwen-3.7-max, gpt-5.5, kimi-2.6, qwen-3.6-plus  8.5 → 3.5
4 "Edge-Heavy Anti-Pattern"  opus                                            11 → 1

The column still sums to 66 (= Σ1..11 in score form), so it is rank-consistent and included in all means.
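The slot-averaging rule can be sketched as follows. The function name `impute_tier_scores` is hypothetical; the tier membership is copied from the table above.

```python
def impute_tier_scores(tiers):
    """Give every member of a tier the average of the rank slots that tier occupies."""
    n = sum(len(members) for members in tiers)
    scores = {}
    first_slot = 1
    for members in tiers:
        slots = range(first_slot, first_slot + len(members))
        average_rank = sum(slots) / len(members)
        for name in members:
            scores[name] = (n + 1) - average_rank  # same inversion as score = 12 - rank
        first_slot += len(members)
    return scores


# Ordered tiers, best first, as emitted by gemini-3.1-pro.
TIERS = [
    ["gpt-5.4", "fable-5", "deepseek-4-pro"],
    ["mimo-2.5-pro", "glm-5.1", "gemini-3.1-pro"],
    ["qwen-3.7-max", "gpt-5.5", "kimi-2.6", "qwen-3.6-plus"],
    ["opus"],
]
gem_column = impute_tier_scores(TIERS)
print(gem_column["fable-5"], gem_column["glm-5.1"], gem_column["gpt-5.5"], gem_column["opus"])
print(sum(gem_column.values()))
```

Averaging the occupied slots preserves the column total, which is why a tiered column can be mixed with strict rankings without skewing the means.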


Final ranking (mean-sorted)

Final rank  Proposal        Mean  Median  Best  Worst
1           fable-5         10.7  11      11    10
2           gpt-5.4         9.2   9       11    6
3           gpt-5.5         8.0   9       10    3.5
4           deepseek-4-pro  7.9   8       10    4
5           kimi-2.6        6.0   6       11    3.5
6           opus            5.9   8       8     1
7           glm-5.1         5.5   6       7     3
8           gemini-3.1-pro  3.8   4       8     1
9           mimo-2.5-pro    3.6   3       7     1
10          qwen-3.7-max    2.7   3       4     1
11          qwen-3.6-plus   2.6   2       4     1

Ranks 3–4 (gpt-5.5 vs deepseek-4-pro, Δmean = 0.05) and ranks 10–11 (qwen-3.7-max vs qwen-3.6-plus, Δmean = 0.09) are effectively ties.
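The Δmean figures come from the unrounded means. A short sketch of the check, using a subset of rows copied from the full rankings matrix; the 0.1 cut-off for "effectively tied" is an assumption made for illustration and is not stated in the analysis, and `near_ties` is a hypothetical helper.

```python
from statistics import mean

TIE_THRESHOLD = 0.1  # assumed cut-off for "effectively tied"; not from the source

# Subset of rows copied from the full rankings matrix.
ROWS = {
    "gpt-5.5":        [9, 10, 3.5, 6, 9, 9, 5, 10, 9, 10, 7],
    "deepseek-4-pro": [8, 7, 10, 10, 7, 8, 10, 8, 4, 5, 10],
    "kimi-2.6":       [11, 6, 3.5, 5, 5, 6, 4, 6, 7, 7, 6],
    "opus":           [1, 8, 1, 8, 8, 7, 1, 7, 8, 8, 8],
    "qwen-3.7-max":   [3, 4, 3.5, 3, 1, 4, 2, 1, 1, 3, 4],
    "qwen-3.6-plus":  [2, 2, 3.5, 4, 2, 1, 3, 2, 2, 4, 3],
}


def near_ties(rows, threshold=TIE_THRESHOLD):
    """Return (higher, lower, gap) for adjacent mean-sorted rows closer than threshold."""
    ordered = sorted(rows, key=lambda name: mean(rows[name]), reverse=True)
    ties = []
    for upper, lower in zip(ordered, ordered[1:]):
        gap = mean(rows[upper]) - mean(rows[lower])
        if gap < threshold:
            ties.append((upper, lower, round(gap, 2)))
    return ties


print(near_ties(ROWS))
```

Under this cut-off kimi-2.6 and opus, whose unrounded means differ by about 0.14, are not counted as tied even though their rounded means are 6.0 and 5.9.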