Planner Graph Refactor: Consolidated Rankings Matrix

Date: 2026-06-11 (remade analyses, including the new fable-5 proposal)
Source: docs/planner-graph-ref/analyse/*-range.md (11 model rankings)
Subject: 11 architectural proposals for decomposing plan_node in src/venturescope/planner/agent.py

This document collapses all 11 model rankings onto a single matrix so that agreement and disagreement between evaluators are visible at a glance.

Scores follow the intuitive convention: 11 = best, 1 = worst. Each evaluator's original 1–11 ranking has been inverted with score = 12 − rank, so higher values mean a better proposal everywhere in this document.

For the method-based evaluator analysis (which evaluator sits closest to the consensus, and which is the medoid — with self-rankings excluded), see evaluator-consensus-ranking.md.

How to read

Rows = proposal being judged (the architectural design under evaluation).
Columns = evaluator model (the analyst that produced the ranking).
Cell value = the score that evaluator gave to that proposal (11 = best, 1 = worst).
gemini-3.1-pro again produced ordered tiers instead of a strict 1–11 ranking, but this time the tiers cover all 11 proposals, so its column holds tier-imputed slot-average scores (10 / 7 / 3.5 / 1) instead of empty cells.
Rows are sorted by mean score (highest at top); means include self-scores.
The diagonal (highlighted with *) shows each model's self-score.

Full rankings matrix

	DS	FAB	GEM	GLM	G54	G55	KIM	MIM	OPS	Q36	Q37	Mean	Median
fable-5	10	11*	10	11	10	11	11	11	11	11	11	10.7	11
gpt-5.4	6	9	10	9	11*	10	9	9	10	9	9	9.2	9
gpt-5.5	9	10	3.5	6	9	9*	5	10	9	10	7	8.0	9
deepseek-4-pro	8*	7	10	10	7	8	10	8	4	5	10	7.9	8
kimi-2.6	11	6	3.5	5	5	6	4*	6	7	7	6	6.0	6
opus	1	8	1	8	8	7	1	7	8*	8	8	5.9	8
glm-5.1	7	5	7	7*	6	5	7	3	3	6	5	5.5	6
gemini-3.1-pro	5	1	7*	2	4	3	8	4	5	2	1	3.8	4
mimo-2.5-pro	4	3	7	1	3	2	6	5*	6	1	2	3.6	3
qwen-3.7-max	3	4	3.5	3	1	4	2	1	1	3	4*	2.7	3
qwen-3.6-plus	2	2	3.5	4	2	1	3	2	2	4*	3	2.6	2

Column legend (evaluators): DS = deepseek-4-pro · FAB = fable-5 · GEM = gemini-3.1-pro · GLM = glm-5.1 · G54 = gpt-5.4 · G55 = gpt-5.5 · KIM = kimi-2.6 · MIM = mimo-2.5-pro · OPS = opus-4.7 · Q36 = qwen-3.6-plus · Q37 = qwen-3.7-max
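The Mean and Median columns can be recomputed from the cell values. The following is a minimal sketch, assuming the scores are transcribed exactly as shown in the matrix; the names `MATRIX`, `summarize`, and `column_sums` are illustrative and do not come from the project's tooling.

```python
from statistics import mean, median

# Column order follows the legend: one entry per evaluator.
EVALUATORS = ["DS", "FAB", "GEM", "GLM", "G54", "G55", "KIM", "MIM", "OPS", "Q36", "Q37"]

# Scores transcribed from the matrix (rows = proposals, values in EVALUATORS order).
MATRIX = {
    "fable-5":        [10, 11, 10, 11, 10, 11, 11, 11, 11, 11, 11],
    "gpt-5.4":        [6, 9, 10, 9, 11, 10, 9, 9, 10, 9, 9],
    "gpt-5.5":        [9, 10, 3.5, 6, 9, 9, 5, 10, 9, 10, 7],
    "deepseek-4-pro": [8, 7, 10, 10, 7, 8, 10, 8, 4, 5, 10],
    "kimi-2.6":       [11, 6, 3.5, 5, 5, 6, 4, 6, 7, 7, 6],
    "opus":           [1, 8, 1, 8, 8, 7, 1, 7, 8, 8, 8],
    "glm-5.1":        [7, 5, 7, 7, 6, 5, 7, 3, 3, 6, 5],
    "gemini-3.1-pro": [5, 1, 7, 2, 4, 3, 8, 4, 5, 2, 1],
    "mimo-2.5-pro":   [4, 3, 7, 1, 3, 2, 6, 5, 6, 1, 2],
    "qwen-3.7-max":   [3, 4, 3.5, 3, 1, 4, 2, 1, 1, 3, 4],
    "qwen-3.6-plus":  [2, 2, 3.5, 4, 2, 1, 3, 2, 2, 4, 3],
}


def summarize(matrix):
    """Return (proposal, mean, median) tuples sorted by mean, best first."""
    rows = [(name, mean(scores), median(scores)) for name, scores in matrix.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def column_sums(matrix):
    """Sum each evaluator column; a complete 1..11 ranking sums to 66."""
    return [sum(column) for column in zip(*matrix.values())]


SUMMARY = summarize(MATRIX)
for name, avg, med in SUMMARY:
    print(f"{name:<16}{avg:5.1f}  median {med:g}")
```

Checking that every entry of `column_sums(MATRIX)` equals 66 is a cheap guard against transcription errors, since each evaluator distributes the scores 1 to 11 exactly once (or their tier averages, in the GEM column).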

Aggregate ranking (single visual)

Bars use mean score (higher = better). Each █ ≈ 0.5 score points.

fable-5         10.7 █████████████████████▍
gpt-5.4          9.2 ██████████████████▍
gpt-5.5          8.0 ████████████████
deepseek-4-pro   7.9 ███████████████▊
kimi-2.6         6.0 ████████████
opus             5.9 ███████████▊
glm-5.1          5.5 ███████████
gemini-3.1-pro   3.8 ███████▋
mimo-2.5-pro     3.6 ███████▎
qwen-3.7-max     2.7 █████▍
qwen-3.6-plus    2.6 █████▎
                   └──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
                   1  2  3  4  5  6  7  8  9  10 11  score
                  (worst)                         (best)
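Bars of this kind can be generated from the mean scores with Unicode block elements. The sketch below assumes one full block per 0.5 score points and rounds the remainder to the nearest eighth of a block; that rounding rule is inferred from the partial glyphs in the chart rather than stated by the source, and the function name `bar` is illustrative.

```python
# Partial-block glyphs for 1/8 .. 7/8 of a cell; index 0 means no partial block.
PARTIALS = ["", "▏", "▎", "▍", "▌", "▋", "▊", "▉"]


def bar(score, points_per_block=0.5):
    """Render a score as full blocks plus at most one partial block."""
    eighths = round(score / points_per_block * 8)  # total length in 1/8 cells
    full, rest = divmod(eighths, 8)
    return "█" * full + PARTIALS[rest]


# Mean scores taken from the aggregate ranking.
MEANS = {"fable-5": 10.7, "gpt-5.5": 8.0, "gemini-3.1-pro": 3.8}
for name, score in MEANS.items():
    print(f"{name:<16}{score:5.1f} {bar(score)}")
```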

Full matrix line chart

Each line is one proposal, labeled in its own colour at the right-hand endpoint of the line. A dotted horizontal line in the same colour marks that proposal's aggregate mean score from the ranking table. The x-axis lists evaluators; the y-axis is the score that evaluator assigned (11 = best, 1 = worst). Higher lines are better-scored proposals, a flat line indicates consensus, and a zig-zag line indicates disagreement among evaluators.

[Figure: Score per evaluator, one line per proposal, labeled at endpoint]

#	Proposal	Mean score	Visual signature
1	fable-5	10.7	pinned to the top — 11 from eight evaluators, never below 10
2	gpt-5.4	9.2	high and stable 9–11, one dip to 6 (deepseek-4-pro)
3	gpt-5.5	8.0	mostly 9–10, dips to 3.5/5/6 (gemini tier, kimi, glm)
4	deepseek-4-pro	7.9	high band with a late slide to 4–5 (opus-4.7, qwen-3.6-plus)
5	kimi-2.6	6.0	mostly 4–7, with a dip to 3.5 (gemini tier) and one spike to 11 (deepseek-4-pro)
6	opus	5.9	bimodal — 7–8 from eight evaluators, three crashes to 1
7	glm-5.1	5.5	mid band 5–7, two dips to 3 (mimo-2.5-pro, opus-4.7)
8	gemini-3.1-pro	3.8	low 1–5, with spikes to 7 (its own mid tier) and 8 (kimi-2.6)
9	mimo-2.5-pro	3.6	low 1–7, peak from gemini's mid tier
10	qwen-3.7-max	2.7	hugs the bottom 1–4
11	qwen-3.6-plus	2.6	hugs the bottom 1–4

X-axis evaluators: DS = deepseek-4-pro · FAB = fable-5 · GEM = gemini-3.1-pro · GLM = glm-5.1 · G54 = gpt-5.4 · G55 = gpt-5.5 · KIM = kimi-2.6 · MIM = mimo-2.5-pro · OPS = opus-4.7 · Q36 = qwen-3.6-plus · Q37 = qwen-3.7-max.

Unlike the 2026-06-08 edition, the GEM (gemini-3.1-pro) column is included in the chart: its remade tier output covers all 11 proposals, so its tier-imputed scores (10 / 7 / 3.5 / 1) plot as ordinary points.

Cross-cutting observations

Consensus winner

fable-5, the proposal that did not exist in the 2026-06-08 edition, swept the field: eight of eleven evaluators ranked it #1 (score 11, including its own disclosed self-vote) and the other three scored it 10 (strict #2 rankings from deepseek-4-pro and gpt-5.4, plus a top-tier placement from gemini-3.1-pro). It is never scored below 10 (mean 10.7, median 11). The June-8 winner gpt-5.4 (then 9.4) is displaced to second.

Stable second tier

gpt-5.4 keeps its remarkable consistency: ten of eleven evaluators scored it 9 or higher; the only dissent is deepseek-4-pro (6). gpt-5.5 (8.0) and deepseek-4-pro (7.9) are effectively tied for third — their unrounded means differ by 0.05.

Consensus bottom

The qwen pair sits at the bottom again, and they are also effectively tied: qwen-3.6-plus (2.6) and qwen-3.7-max (2.7) are never scored above 4 by any evaluator. mimo-2.5-pro (3.6) and gemini-3.1-pro (3.8) cluster just above.

High-variance proposals

opus is the most polarising proposal in the set: three evaluators (deepseek-4-pro, gemini-3.1-pro, kimi-2.6) score it 1 while the remaining eight give 7–8 — hence median 8 against mean 5.9. The split tracks whether the evaluator punishes its per-corrector node explosion or rewards its analysis depth.
kimi-2.6 has the single biggest outlier cell: deepseek-4-pro crowns it #1 (score 11) while no other evaluator goes above 7.
gpt-5.5 spans 3.5 → 10: gemini's "micro-node" tier and kimi-2.6 (5) punish its granularity; seven evaluators score it 9–10.
deepseek-4-pro spans 4 → 10: four evaluators give it 10, but opus-4.7 (4) and qwen-3.6-plus (5) flag its recipes-in-state and self-loop risks.
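The spread behind these observations can be quantified directly from the matrix rows. The sketch below is one possible way to do it, using the score range and the gap between mean and median as rough polarisation indicators; these two measures and the helper name `spread` are illustrative choices, not the method used by the source analyses.

```python
from statistics import mean, median

# Rows transcribed from the full rankings matrix (evaluator order: DS .. Q37).
HIGH_VARIANCE = {
    "opus":           [1, 8, 1, 8, 8, 7, 1, 7, 8, 8, 8],
    "kimi-2.6":       [11, 6, 3.5, 5, 5, 6, 4, 6, 7, 7, 6],
    "gpt-5.5":        [9, 10, 3.5, 6, 9, 9, 5, 10, 9, 10, 7],
    "deepseek-4-pro": [8, 7, 10, 10, 7, 8, 10, 8, 4, 5, 10],
}


def spread(scores):
    """Return the score range and the mean-minus-median gap for one proposal."""
    return {
        "min": min(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),
        # A strongly negative gap means a few very low scores drag the mean down.
        "mean_minus_median": mean(scores) - median(scores),
    }


for name, scores in HIGH_VARIANCE.items():
    s = spread(scores)
    print(f"{name:<16}range {s['range']:g}  mean-median {s['mean_minus_median']:+.2f}")
```

On these rows, opus shows the largest negative gap, which matches its description as a mostly 7–8 proposal pulled down by three scores of 1.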

Self-ranking bias (diagonal)

Group mean rank = each proposal's average rank over the other ten evaluators (from evaluator-consensus-ranking.md §2.1; lower = better).

Evaluator	Self-score	Self-rank	Group mean rank	Reads as
fable-5	11	1	1.30	self-favouring — and the group agrees
gpt-5.4	11	1	3.00	self-favouring (group says #2–3)
gpt-5.5	9	3	4.15	mildly self-favouring
deepseek-4-pro	8	4	4.10	honest — matches the group
opus-4.7	8	4	6.30	self-favouring (group says #6–7)
gemini-3.1-pro	7	5 (mid tier)	8.50	self-favouring (group says #8–9)
glm-5.1	7	5	6.60	mildly self-favouring
mimo-2.5-pro	5	7	8.50	self-critical — the group is harsher still
kimi-2.6	4	8	5.75	overly self-critical (group says #5–6)
qwen-3.6-plus	4	8	9.55	self-critical — the group is harsher still
qwen-3.7-max	4	8	9.45	self-critical — the group is harsher still
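The group mean rank values are consistent with dropping the self-score, converting the remaining ten scores back to ranks with rank = 12 − score, and averaging. The sketch below illustrates that reading for three proposals; the authoritative definition lives in evaluator-consensus-ranking.md, so treat the function `group_mean_rank` and its parameters as assumptions.

```python
def group_mean_rank(scores, self_index, top=11):
    """Average rank over all evaluators except the proposal's own author.

    scores     -- one row of the matrix, in evaluator column order
    self_index -- column position of the proposal's own evaluator
    top        -- best possible score (rank = top + 1 - score)
    """
    others = [s for i, s in enumerate(scores) if i != self_index]
    ranks = [top + 1 - s for s in others]
    return sum(ranks) / len(ranks)


# (row, index of the self-score column) for three proposals from the matrix.
EXAMPLES = {
    "fable-5":  ([10, 11, 10, 11, 10, 11, 11, 11, 11, 11, 11], 1),  # FAB column
    "kimi-2.6": ([11, 6, 3.5, 5, 5, 6, 4, 6, 7, 7, 6], 6),          # KIM column
    "opus":     ([1, 8, 1, 8, 8, 7, 1, 7, 8, 8, 8], 8),             # OPS column
}

for name, (row, idx) in EXAMPLES.items():
    print(f"{name:<10}self-rank {12 - row[idx]:g}  group mean rank {group_mean_rank(row, idx):.2f}")
```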

Tier-grouped ranking

gemini-3.1-pro again emitted tiers instead of a strict order, but its remade output covers all 11 proposals in 4 ordered tiers, so no cells are missing this time. Tier members receive the average of the slots the tier occupies:

Tier	Members	Imputed rank → score
1 "Phase-Based Pipeline"	gpt-5.4, fable-5, deepseek-4-pro	2 → 10
2 "Minimalist Stage"	mimo-2.5-pro, glm-5.1, gemini-3.1-pro	5 → 7
3 "Micro-node Explosion"	qwen-3.7-max, gpt-5.5, kimi-2.6, qwen-3.6-plus	8.5 → 3.5
4 "Edge-Heavy Anti-Pattern"	opus	11 → 1

The column still sums to 66 (= Σ1..11 in score form), so it is rank-consistent and included in all means.
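The slot-averaging rule can be sketched as follows: each tier occupies a contiguous block of rank slots, every member receives the mean of those slots, and the rank is then inverted with score = 12 − rank. The function name `impute_tier_scores` is illustrative; the tier membership is copied from the table above.

```python
def impute_tier_scores(tiers):
    """Map each proposal to a score from ordered tiers (best tier first)."""
    total = sum(len(members) for members in tiers)
    scores = {}
    next_slot = 1
    for members in tiers:
        # Rank slots this tier occupies, e.g. 1..3 for a three-member top tier.
        slots = range(next_slot, next_slot + len(members))
        avg_rank = sum(slots) / len(members)
        for name in members:
            scores[name] = (total + 1) - avg_rank  # 12 - rank when total == 11
        next_slot += len(members)
    return scores


GEMINI_TIERS = [
    ["gpt-5.4", "fable-5", "deepseek-4-pro"],                  # Phase-Based Pipeline
    ["mimo-2.5-pro", "glm-5.1", "gemini-3.1-pro"],             # Minimalist Stage
    ["qwen-3.7-max", "gpt-5.5", "kimi-2.6", "qwen-3.6-plus"],  # Micro-node Explosion
    ["opus"],                                                  # Edge-Heavy Anti-Pattern
]

GEM_SCORES = impute_tier_scores(GEMINI_TIERS)
print(GEM_SCORES)
print("column sum:", sum(GEM_SCORES.values()))
```

Because slot averages preserve the total of the slots they replace, the imputed column has the same sum as a strict ranking, which is why it can be averaged alongside the other columns.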

Final ranking (mean-sorted)

Final rank	Proposal	Mean	Median	Best	Worst
1	fable-5	10.7	11	11	10
2	gpt-5.4	9.2	9	11	6
3	gpt-5.5	8.0	9	10	3.5
4	deepseek-4-pro	7.9	8	10	4
5	kimi-2.6	6.0	6	11	3.5
6	opus	5.9	8	8	1
7	glm-5.1	5.5	6	7	3
8	gemini-3.1-pro	3.8	4	8	1
9	mimo-2.5-pro	3.6	3	7	1
10	qwen-3.7-max	2.7	3	4	1
11	qwen-3.6-plus	2.6	2	4	1

Ranks 3–4 (gpt-5.5 vs deepseek-4-pro, Δmean = 0.05) and ranks 10–11 (qwen-3.7-max vs qwen-3.6-plus, Δmean = 0.09) are effectively ties.