Evaluator Consensus & Medoid Ranking: 11 Evaluators × 11 Proposals

Run date: 2026-06-11 · Dataset: the 11 remade *-range.md analyses (2026-06-11, including the new fable-5 proposal/evaluator) · Weights: equal, wᵢ = 1/11 · Generated by: consensus_ranking.py (re-run it to regenerate every number below).

What this answers

Each of 11 LLM evaluators ranked the same 11 architectural proposals (docs/planner-graph-ref/proposals/). This report ranks the evaluators by how representative their opinion is of the whole group, two ways, on two tracks:

Closest to overall — smallest distance to the group consensus (method §5, argmin dᵢ).
Medoid — smallest total weighted distance to all other evaluators (method §6, argmin Σⱼ wⱼ·d(Tᵢ,Tⱼ)).

Tracks: a rank track (proposals as aspects, ranks as scores) and a semantic track (Gemini text embeddings of each evaluator's full report).

Method & parameters

Mapped to analyse-evaluation-method.md (formulas restated below; the variance/agreement definition sits in its §4 text-generation step):

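Scores sᵢⱼ: raw ranks (1 = best … 11 = worst). The method's [−1,1] sentiment scale is an affine image of ranks; applied uniformly it maps every consensus value through the same affine function and scales every distance by one constant factor, so it changes no evaluator ordering, argmin, medoid, or ρ. Raw ranks are used for auditability.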
Confidence cᵢⱼ ∈ {0,1} (§2): cᵢⱼ = 0 on each evaluator's own proposal (self-ranking excluded — 11 cells); 1 elsewhere.
Weighted center (§2): s̄ⱼ = (Σᵢ wᵢ cᵢⱼ sᵢⱼ) / (Σᵢ wᵢ cᵢⱼ) — with equal wᵢ this is the mean rank over the 10 non-author evaluators.
Per-aspect agreement (§4): σ²ⱼ = (Σᵢ wᵢ cᵢⱼ (sᵢⱼ − s̄ⱼ)²) / (Σᵢ wᵢ cᵢⱼ).
Closest to overall (§5): normalized (confidence-weighted) Euclidean dᵢ = √[ Σⱼ cᵢⱼ αⱼ (sᵢⱼ − s̄ⱼ)² / Σⱼ cᵢⱼ αⱼ ] with αⱼ = 1 — an RMS over the 10 proposals evaluator i ranked, so evaluators omitting different cells stay comparable. Cross-check: Spearman 1 − ρ on the same 10 cells.
Medoid (§6): pairwise d(Tᵢ,Tⱼ) = RMS Euclidean over the 9 proposals ranked by both (own(i) and own(j) excluded); totals Sᵢ = Σⱼ wⱼ d(Tᵢ,Tⱼ).
Near-ties: orderings are flagged when adjacent distances differ by less than ε = 0.05 (rank track) / ε = 0.002 (cosine). n = 11 is small; do not over-read hairline orderings.
Semantic track: paragraph-chunked documents (blank-line blocks; fenced code/Mermaid/tables kept intact; blocks < 5 tokens dropped), embedded with gemini-embedding-001 (3072-dim, task_type SEMANTIC_SIMILARITY), pooled per document by length-weighted mean (weight = chunk token count). Centroid = equal-weighted mean of the 11 document vectors; cosine distance primary, Euclidean-to-centroid secondary. Self-exclusion does not apply (an evaluator's self-assessment cannot be excised from its text).
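The chunking rule in the semantic-track bullet can be sketched as follows. This is a minimal illustration, not the code in consensus_ranking.py: the function name `chunk_paragraphs` is hypothetical, and whitespace splitting stands in for the real token counter. Tables and Mermaid diagrams stay intact here only because they contain no blank lines or sit inside fences.

```python
FENCE = "`" * 3  # a Markdown code fence marker, built this way to keep this listing fence-safe


def chunk_paragraphs(text, min_tokens=5, count_tokens=lambda s: len(s.split())):
    """Split text into blank-line blocks, keep fenced blocks whole, drop tiny blocks.

    count_tokens is a stand-in (whitespace split); the report counts tokens differently.
    """
    chunks, block, in_fence = [], [], False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a fenced block
        if line.strip() == "" and not in_fence:
            if block:  # a blank line outside a fence closes the current block
                chunks.append("\n".join(block))
                block = []
        else:
            block.append(line)  # blank lines inside a fence are kept
    if block:
        chunks.append("\n".join(block))
    return [c for c in chunks if count_tokens(c) >= min_tokens]


sample = (
    "Short one.\n\n"
    "This paragraph has more than five tokens in it.\n\n"
    + FENCE + "python\nx = 1\n\ny = 2  # the blank line above stays inside the fence\n" + FENCE + "\n"
)
chunks = chunk_paragraphs(sample)
print(len(chunks))  # the two-token block "Short one." is dropped
```

The sample has three blank-line blocks; the first falls under the 5-token floor, and the fenced block survives as one chunk despite its internal blank line.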

1. Rank matrix R

Rows = evaluators, columns = proposals, 1 = best. — marks the excluded self-ranking cell (cᵢⱼ = 0). The gemini-3.1-pro row is tier-imputed (table below).

Evaluator ↓ \ Proposal →	deepseek4	fable5	gemini3.1	glm5.1	gpt5.4	gpt5.5	kimi2.6	mimo2.5	opus	qwen3.6+	qwen3.7
deepseek-4-pro	—	2	7	5	6	3	1	8	11	10	9
fable-5	5	—	11	7	3	2	6	9	4	10	8
gemini-3.1-pro (imputed)	2	2	—	5	2	8.5	8.5	5	11	8.5	8.5
glm-5.1	2	1	10	—	3	6	7	11	4	8	9
gpt-5.4	5	2	8	6	—	3	7	9	4	10	11
gpt-5.5	4	1	9	7	2	—	6	10	5	11	8
kimi-2.6	2	1	4	5	3	7	—	6	11	9	10
mimo-2.5-pro	4	1	8	9	3	2	6	—	5	10	11
opus-4.7	8	1	7	9	2	3	5	6	—	10	11
qwen-3.6-plus	7	1	10	6	3	2	5	11	4	—	9
qwen-3.7-max	2	1	11	7	3	5	6	10	4	9	—

gemini-3.1-pro imputation. Its remade analysis groups all 11 proposals into 4 ranked tiers with no strict intra-tier order; each tier member receives the average of the slot positions the tier occupies (rank-consistent, row sums to 66 = Σ1..11):

Tier	Members	Slots	Imputed rank
1 — "Phase-Based Pipeline"	gpt-5.4, fable-5, deepseek-4-pro	1–3	2
2 — "Minimalist Stage"	mimo-2.5-pro, glm-5.1, gemini-3.1-pro	4–6	5
3 — "Micro-node Explosion"	qwen-3.7-max, gpt-5.5, kimi-2.6, qwen-3.6-plus	7–10	8.5
4 — "Edge-Heavy Anti-Pattern"	opus	11	11
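The tier imputation above amounts to averaging the slot positions each tier occupies. A minimal sketch, with the hypothetical helper name `impute_tier_ranks` and the tier membership copied from the table:

```python
def impute_tier_ranks(tiers):
    """Give every member of a tier the mean of the slot positions the tier occupies.

    tiers: list of member lists, best tier first.
    """
    ranks = {}
    next_slot = 1
    for members in tiers:
        slots = range(next_slot, next_slot + len(members))
        mean_slot = sum(slots) / len(members)
        for name in members:
            ranks[name] = mean_slot
        next_slot += len(members)
    return ranks


tiers = [
    ["gpt-5.4", "fable-5", "deepseek-4-pro"],                   # slots 1-3
    ["mimo-2.5-pro", "glm-5.1", "gemini-3.1-pro"],              # slots 4-6
    ["qwen-3.7-max", "gpt-5.5", "kimi-2.6", "qwen-3.6-plus"],   # slots 7-10
    ["opus"],                                                   # slot 11
]
ranks = impute_tier_ranks(tiers)
print(ranks["gpt-5.4"], ranks["glm-5.1"], ranks["kimi-2.6"], ranks["opus"])  # 2.0 5.0 8.5 11.0
print(sum(ranks.values()))  # 66.0, the same total as a strict 1..11 ranking
```

Averaging slots keeps the row sum at 66, so the imputed row carries the same total weight as a strict ranking.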

Self-ranking exclusion. The 11 diagonal (author) cells are dropped from the consensus and every rank-track distance. The excluded values (for the record): deepseek-4-pro→4, fable-5→1, gemini-3.1-pro→5, glm-5.1→5, gpt-5.4→1, gpt-5.5→3, kimi-2.6→8, mimo-2.5-pro→7, opus-4.7→4, qwen-3.6-plus→8, qwen-3.7-max→8. Self-bias is visible — 7 of 11 evaluators put their own proposal in their top 5 (fable-5 and gpt-5.4 self-rank #1).

2. Rank track

2.1 Consensus (weighted center) and per-proposal agreement

s̄ⱼ = mean rank over each proposal's 10 non-author rankers; σ²ⱼ = confidence-weighted variance (lower = stronger agreement).

Consensus #	Proposal	s̄ⱼ (mean rank)	σ²ⱼ	σⱼ
1	fable-5	1.30	0.21	0.46
2	gpt-5.4	3.00	1.20	1.10
3	deepseek-4-pro	4.10	4.29	2.07
4	gpt-5.5	4.15	4.90	2.21
5	kimi-2.6	5.75	3.46	1.86
6	opus	6.30	9.61	3.10
7	glm-5.1	6.60	2.04	1.43
8	gemini-3.1-pro	8.50	4.25	2.06
9	mimo-2.5-pro	8.50	4.25	2.06
10	qwen-3.7-max	9.45	1.32	1.15
11	qwen-3.6-plus	9.55	0.72	0.85

(Derived by-product: the proposals' own consensus order. The deliverable remains the evaluator rankings below.)
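As a worked check of the s̄ⱼ and σ²ⱼ formulas, the sketch below recomputes the fable-5 row of the consensus table from the fable5 column of the rank matrix. The function name is hypothetical; the author's own cell carries a placeholder score with confidence 0, so it drops out of both sums.

```python
def weighted_center_and_variance(scores, weights, confidence):
    """Confidence-weighted mean and variance of one proposal's scores."""
    denom = sum(w * c for w, c in zip(weights, confidence))
    center = sum(w * c * s for s, w, c in zip(scores, weights, confidence)) / denom
    variance = sum(
        w * c * (s - center) ** 2 for s, w, c in zip(scores, weights, confidence)
    ) / denom
    return center, variance


# fable5 column in evaluator row order (deepseek-4-pro ... qwen-3.7-max).
# Index 1 is fable-5's own cell: placeholder 0, confidence 0.
fable5_column = [2, 0, 2, 1, 2, 1, 1, 1, 1, 1, 1]
confidence = [1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
weights = [1 / 11] * 11

center, variance = weighted_center_and_variance(fable5_column, weights, confidence)
print(round(center, 2), round(variance, 2))  # 1.3 0.21
```

With equal weights the wᵢ cancel, which is why s̄ⱼ reduces to the plain mean over the 10 non-author rankers.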

2.2 Closest to overall: normalized Euclidean dᵢ (primary)

dᵢ = √[ Σⱼ cᵢⱼ (sᵢⱼ − s̄ⱼ)² / Σⱼ cᵢⱼ ] — RMS over evaluator i's 10 ranked proposals.

#	Evaluator	dᵢ (RMS rank units)
1	gpt-5.5	0.9858
2	gpt-5.4	1.1375
3	mimo-2.5-pro	1.2284
4	qwen-3.7-max	1.3978
5	fable-5	1.4053
6	glm-5.1	1.6087
7	qwen-3.6-plus	1.6744
8	opus-4.7	1.8722
9	deepseek-4-pro	2.4684
10	kimi-2.6	2.5373
11	gemini-3.1-pro	2.6700

Near-ties (Δ < ε = 0.05): {qwen-3.7-max, fable-5} — treat as effectively tied.
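The dᵢ values can be re-derived by hand from the rank matrix and the consensus column of §2.1. The sketch below does so for the first and last evaluators in the table; the function name is hypothetical, and `None` marks the excluded self-ranking cell (cᵢⱼ = 0).

```python
import math

# Consensus means in rank-matrix column order:
# deepseek4, fable5, gemini3.1, glm5.1, gpt5.4, gpt5.5, kimi2.6, mimo2.5, opus, qwen3.6+, qwen3.7
CONSENSUS = [4.10, 1.30, 8.50, 6.60, 3.00, 4.15, 5.75, 8.50, 6.30, 9.55, 9.45]


def distance_to_consensus(row, consensus, alpha=None):
    """RMS gap between one evaluator's ranks and the consensus, skipping None cells."""
    alpha = alpha or [1.0] * len(row)  # aspect weights, all 1 in this report
    num = den = 0.0
    for score, center, a in zip(row, consensus, alpha):
        if score is None:  # confidence 0: the evaluator's own proposal
            continue
        num += a * (score - center) ** 2
        den += a
    return math.sqrt(num / den)


gpt55_row = [4, 1, 9, 7, 2, None, 6, 10, 5, 11, 8]
gemini_row = [2, 2, None, 5, 2, 8.5, 8.5, 5, 11, 8.5, 8.5]

d_gpt55 = distance_to_consensus(gpt55_row, CONSENSUS)
d_gemini = distance_to_consensus(gemini_row, CONSENSUS)
print(round(d_gpt55, 4), round(d_gemini, 4))  # 0.9858 2.67
```

The consensus vector here is the rounded column from the table, so agreement with the published dᵢ is to table precision only.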

2.3 Closest to overall: Spearman cross-check

1 − ρ between evaluator i's 10 ranks and the consensus restricted to those proposals.

#	Evaluator	ρ	1 − ρ
1	gpt-5.5	0.9483	0.0517
2	mimo-2.5-pro	0.9273	0.0727
3	qwen-3.7-max	0.9119	0.0881
4	gpt-5.4	0.8997	0.1003
5	fable-5	0.8450	0.1550
6	glm-5.1	0.8389	0.1611
7	opus-4.7	0.8024	0.1976
8	qwen-3.6-plus	0.8024	0.1976
9	deepseek-4-pro	0.6809	0.3191
10	kimi-2.6	0.6626	0.3374
11	gemini-3.1-pro	0.5975	0.4025

Near-ties (Δ < ε = 0.05): {gpt-5.5, mimo-2.5-pro, qwen-3.7-max, gpt-5.4}; {fable-5, glm-5.1, opus-4.7, qwen-3.6-plus}; {deepseek-4-pro, kimi-2.6} — treat as effectively tied.
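The cross-check can be reproduced without external libraries. The sketch below assumes ties are handled with average ranks (the consensus has one exact tie, gemini-3.1-pro = mimo-2.5-pro at 8.50) and that ρ is the Pearson correlation of the two rank vectors; under those assumptions it recovers the gpt-5.4 row of the table. Helper names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank ascending (1 = smallest); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# gpt-5.4's ranks with its own column dropped, in the order
# deepseek4, fable5, gemini3.1, glm5.1, gpt5.5, kimi2.6, mimo2.5, opus, qwen3.6+, qwen3.7
gpt54_ranks = [5, 2, 8, 6, 3, 7, 9, 4, 10, 11]
consensus = [4.10, 1.30, 8.50, 6.60, 4.15, 5.75, 8.50, 6.30, 9.55, 9.45]

rho = spearman_rho(gpt54_ranks, consensus)
print(round(rho, 4), round(1 - rho, 4))  # 0.8997 0.1003
```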

2.4 Pairwise distance matrix (RMS Euclidean, 9 common proposals)

↓ \ →	deepseek4	fable5	gemini3.1	glm5.1	gpt5.4	gpt5.5	kimi2.6	mimo2.5	opus4.7	qwen3.6+	qwen3.7
deepseek4	0.0000	3.4319	3.5590	3.7417	3.1972	3.1972	2.1344	3.2146	2.5166	3.2489	3.5434
fable5	3.4319	0.0000	3.8006	2.0000	1.5275	1.0000	4.0825	1.6330	2.3570	1.1547	1.4907
gemini3.1	3.5590	3.8006	0.0000	3.2660	3.5901	3.0322	0.9280	3.5746	3.4157	4.3173	3.3124
glm5.1	3.7417	2.0000	3.2660	0.0000	1.9720	1.4530	3.5434	1.9437	3.1972	2.2361	0.7454
gpt5.4	3.1972	1.5275	3.5901	1.9720	0.0000	1.3744	3.3830	1.2472	1.9149	1.5635	1.7321
gpt5.5	3.1972	1.0000	3.0322	1.4530	1.3744	0.0000	3.2318	1.3333	2.3805	1.3333	1.2472
kimi2.6	2.1344	4.0825	0.9280	3.5434	3.3830	3.2318	0.0000	3.3166	2.9814	4.2426	3.6818
mimo2.5	3.2146	1.6330	3.5746	1.9437	1.2472	1.3333	3.3166	0.0000	1.4907	1.7638	1.7638
opus4.7	2.5166	2.3570	3.4157	3.1972	1.9149	2.3805	2.9814	1.4907	0.0000	2.3570	2.9627
qwen3.6+	3.2489	1.1547	4.3173	2.2361	1.5635	1.3333	4.2426	1.7638	2.3570	0.0000	2.0548
qwen3.7	3.5434	1.4907	3.3124	0.7454	1.7321	1.2472	3.6818	1.7638	2.9627	2.0548	0.0000

2.5 Medoid (rank track)

Sᵢ = Σⱼ wⱼ·d(Tᵢ,Tⱼ) with wⱼ = 1/11 (self-distance 0). Smallest total = medoid.

#	Evaluator	Sᵢ (weighted total)
1	gpt-5.5	1.7803
2	mimo-2.5-pro	1.9347
3	gpt-5.4	1.9547
4	fable-5	2.0434
5	qwen-3.7-max	2.0486
6	glm-5.1	2.1908
7	qwen-3.6-plus	2.2066
8	opus-4.7	2.3249
9	kimi-2.6	2.8659
10	deepseek-4-pro	2.8895
11	gemini-3.1-pro	2.9814

Near-ties (Δ < ε = 0.05): {mimo-2.5-pro, gpt-5.4}; {fable-5, qwen-3.7-max}; {glm-5.1, qwen-3.6-plus}; {kimi-2.6, deepseek-4-pro} — treat as effectively tied.
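The pairwise distances of §2.4, the medoid totals, and the near-tie chains can all be rebuilt from the rank matrix of §1. The following is a sketch under the definitions stated in this report (RMS over cells ranked by both evaluators, wⱼ = 1/11, chains linking adjacent gaps below ε); the function names are hypothetical and this is not the generator script itself.

```python
import math

NAMES = ["deepseek-4-pro", "fable-5", "gemini-3.1-pro", "glm-5.1", "gpt-5.4", "gpt-5.5",
         "kimi-2.6", "mimo-2.5-pro", "opus-4.7", "qwen-3.6-plus", "qwen-3.7-max"]
X = None  # excluded self-ranking cell
R = [
    [X, 2, 7, 5, 6, 3, 1, 8, 11, 10, 9],
    [5, X, 11, 7, 3, 2, 6, 9, 4, 10, 8],
    [2, 2, X, 5, 2, 8.5, 8.5, 5, 11, 8.5, 8.5],
    [2, 1, 10, X, 3, 6, 7, 11, 4, 8, 9],
    [5, 2, 8, 6, X, 3, 7, 9, 4, 10, 11],
    [4, 1, 9, 7, 2, X, 6, 10, 5, 11, 8],
    [2, 1, 4, 5, 3, 7, X, 6, 11, 9, 10],
    [4, 1, 8, 9, 3, 2, 6, X, 5, 10, 11],
    [8, 1, 7, 9, 2, 3, 5, 6, X, 10, 11],
    [7, 1, 10, 6, 3, 2, 5, 11, 4, X, 9],
    [2, 1, 11, 7, 3, 5, 6, 10, 4, 9, X],
]


def pairwise_rms(a, b):
    """RMS rank gap over the proposals both evaluators ranked."""
    sq = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
    return math.sqrt(sum(sq) / len(sq))


def medoid_totals(rows, weights):
    """S_i = sum over j != i of w_j * d(T_i, T_j); the self-distance is 0."""
    return [
        sum(w * pairwise_rms(rows[i], rows[j]) for j, w in enumerate(weights) if j != i)
        for i in range(len(rows))
    ]


def near_tie_chains(ranked, eps):
    """Group adjacent entries of an ascending (name, value) list whose gap is below eps."""
    chains, current = [], [ranked[0][0]]
    for (_, prev), (name, value) in zip(ranked, ranked[1:]):
        if value - prev < eps:
            current.append(name)
        else:
            if len(current) > 1:
                chains.append(current)
            current = [name]
    if len(current) > 1:
        chains.append(current)
    return chains


totals = medoid_totals(R, [1 / 11] * 11)
ranked = sorted(zip(NAMES, totals), key=lambda pair: pair[1])
medoid = ranked[0][0]
chains = near_tie_chains(ranked, eps=0.05)
print(medoid, round(ranked[0][1], 4))  # gpt-5.5 1.7803
```

Because chains link neighbours transitively, a long run of small gaps can join entries whose end-to-end difference exceeds ε; that is how the cosine tables end up with chains spanning most of the list.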

2.6 Why the two closest-to-overall rankings differ (Euclidean vs Spearman)

The two metrics answer subtly different questions:

Normalized Euclidean (§2.2) measures magnitude: it squares each gap (sᵢⱼ − s̄ⱼ) between an evaluator's rank and the consensus mean. One big miss costs more than several small ones, and swapping two proposals whose consensus values are nearly tied costs almost nothing.
Spearman 1−ρ (§2.3) measures order: it compares rank positions only, so a swap is charged by how many positions it moves, no matter how small the consensus gap it crosses. An adjacent swap across a 0.05-wide consensus gap costs as much as an adjacent swap across a 3-point gap.

In this 11×11 run the two rankings agree unusually well: same #1 (gpt-5.5: dᵢ = 0.986 and 1−ρ = 0.052) and the same bottom three in the same order (deepseek-4-pro, kimi-2.6, gemini-3.1-pro), because the consensus is sharply stratified at the extremes (fable-5 at 1.30; the qwen pair near 9.5). The divergence is concentrated in the #2–#4 band (the only other change is a #7/#8 swap between qwen-3.6-plus and opus-4.7, which tie exactly at ρ = 0.8024), and it is driven by the consensus's near-tied pairs (§2.1):

Near-tied consensus pair	s̄ⱼ values	Gap
deepseek-4-pro ≈ gpt-5.5	4.10 vs 4.15	0.05
gemini-3.1-pro = mimo-2.5-pro	8.50 vs 8.50	0.00
qwen-3.7-max ≈ qwen-3.6-plus	9.45 vs 9.55	0.10
kimi-2.6 / opus / glm-5.1 cluster	5.75 / 6.30 / 6.60	≤ 0.85

Poster cases:

gpt-5.4: Euclidean #2 (dᵢ = 1.138) but Spearman #4 (1−ρ = 0.100). Numerically a tight fit: its only miss above 2 rank units is opus at rank 4 vs consensus 6.30. But it inverts the near-tied pairs: gpt-5.5 above deepseek-4-pro (consensus gap 0.05) and qwen-3.6-plus above qwen-3.7-max (gap 0.10). Promoting opus to 4 also lifts it above deepseek-4-pro and kimi-2.6, and glm-5.1 lands above kimi-2.6, adding three more inversions in the mid-table cluster. (opus above glm-5.1 is not an inversion: the consensus agrees, 6.30 vs 6.60.) Most of these inversions are cheap in squared-gap terms, and all of them count in rank-correlation terms.
mimo-2.5-pro: Spearman #2 (1−ρ = 0.073) vs Euclidean #3; qwen-3.7-max: Spearman #3 (0.088) vs Euclidean #4. The mirror image: consensus-consistent orderings carrying a few large numeric misses that Euclidean squares. mimo-2.5-pro puts glm-5.1 at 9 vs consensus 6.60 (|gap| = 2.40) and gpt-5.5 at 2 vs 4.15 (2.15); qwen-3.7-max puts gemini-3.1-pro at 11 vs 8.50 (2.50), opus at 4 vs 6.30 (2.30) and deepseek-4-pro at 2 vs 4.10 (2.10).
gpt-5.5: #1 on both, with the right magnitudes and the right order; an evaluator that gets both leaves the two metrics nothing to disagree about.

Reading guide: Euclidean is the primary metric (it is the method's dᵢ with the confidence-weighted normalization); Spearman is the cross-check. When they disagree about an evaluator, the disagreement itself is informative: Euclidean-better means "right numbers, shuffled near-ties"; Spearman-better means "right story, one or two big numeric misses".

3. Semantic track (Gemini embeddings)

Model gemini-embedding-001 (3072-dim, task_type SEMANTIC_SIMILARITY), batched requests ≤ 50 inputs, throttle 60s per 50 requests + 3s between batches. Self-exclusion not applicable (full text embedded).

3.1 Chunking statistics

Document	Raw blocks	Kept chunks (≥5 tok)	Tokens	Max chunk
deepseek-4-pro	109	93	8827	569
fable-5	76	68	6853	551
gemini-3.1-pro	29	26	1353	164
glm-5.1	79	60	6004	440
gpt-5.4	114	85	3399	288
gpt-5.5	74	60	3654	355
kimi-2.6	100	81	6389	514
mimo-2.5-pro	89	69	5797	494
opus-4.7	90	84	7550	1089
qwen-3.6-plus	84	71	5141	255
qwen-3.7-max	101	82	4673	290
total	945	779	59640	1089

Token counts via tiktoken cl100k_base (a proxy for the embedding model's tokenizer); per-input limit 2048 tokens. The largest chunk is 1089 tokens (opus-4.7), below the limit, so by this proxy no chunk required truncation.
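The token counts above double as pooling weights. The sketch below illustrates the length-weighted mean, the equal-weighted centroid, and the cosine distance on toy 2-dimensional vectors; the real vectors are 3072-dimensional embeddings that are not reproduced here, and all function names are hypothetical.

```python
import math


def pool_document(chunk_vectors, chunk_tokens):
    """Length-weighted mean of chunk vectors (weight = chunk token count)."""
    total = sum(chunk_tokens)
    dim = len(chunk_vectors[0])
    return [
        sum(t * v[k] for v, t in zip(chunk_vectors, chunk_tokens)) / total
        for k in range(dim)
    ]


def centroid(doc_vectors):
    """Equal-weighted mean of the document vectors."""
    n = len(doc_vectors)
    return [sum(v[k] for v in doc_vectors) / n for k in range(len(doc_vectors[0]))]


def cosine_distance(a, b):
    """1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy data: document A has a 3-token and a 1-token chunk, document B a single chunk.
doc_a = pool_document([[1.0, 0.0], [0.0, 1.0]], [3, 1])
doc_b = pool_document([[0.0, 1.0]], [10])
center = centroid([doc_a, doc_b])
print(doc_a)  # [0.75, 0.25]
print(cosine_distance(doc_a, center) < cosine_distance(doc_a, doc_b))  # True
```

Weighting by token count keeps a document's long paragraphs from being outvoted by many short ones, which matters here because chunk sizes range from 5 tokens to over 1000.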

3.2 Closest to centroid: cosine (primary), Euclidean (secondary)

#	Evaluator	cosine distance to centroid	Euclidean to centroid
1	qwen-3.7-max	0.0029	0.0694
2	qwen-3.6-plus	0.0037	0.0782
3	deepseek-4-pro	0.0040	0.0819
4	kimi-2.6	0.0054	0.0959
5	opus-4.7	0.0063	0.1034
6	gpt-5.5	0.0067	0.1060
7	gpt-5.4	0.0076	0.1125
8	fable-5	0.0077	0.1139
9	mimo-2.5-pro	0.0097	0.1287
10	glm-5.1	0.0112	0.1396
11	gemini-3.1-pro	0.0120	0.1412

Near-ties (Δ < ε = 0.002): {qwen-3.7-max, qwen-3.6-plus, deepseek-4-pro, kimi-2.6, opus-4.7, gpt-5.5, gpt-5.4, fable-5, mimo-2.5-pro, glm-5.1, gemini-3.1-pro} — treat as effectively tied.

3.3 Pairwise cosine distance matrix

↓ \ →	deepseek4	fable5	gemini3.1	glm5.1	gpt5.4	gpt5.5	kimi2.6	mimo2.5	opus4.7	qwen3.6+	qwen3.7
deepseek4	0.0000	0.0111	0.0180	0.0158	0.0160	0.0126	0.0084	0.0128	0.0113	0.0069	0.0080
fable5	0.0111	0.0000	0.0211	0.0248	0.0152	0.0151	0.0159	0.0211	0.0134	0.0115	0.0127
gemini3.1	0.0180	0.0211	0.0000	0.0287	0.0195	0.0192	0.0177	0.0270	0.0227	0.0179	0.0166
glm5.1	0.0158	0.0248	0.0287	0.0000	0.0256	0.0242	0.0166	0.0165	0.0175	0.0168	0.0142
gpt5.4	0.0160	0.0152	0.0195	0.0256	0.0000	0.0070	0.0159	0.0230	0.0150	0.0134	0.0100
gpt5.5	0.0126	0.0151	0.0192	0.0242	0.0070	0.0000	0.0131	0.0216	0.0141	0.0123	0.0111
kimi2.6	0.0084	0.0159	0.0177	0.0166	0.0159	0.0131	0.0000	0.0166	0.0147	0.0088	0.0093
mimo2.5	0.0128	0.0211	0.0270	0.0165	0.0230	0.0216	0.0166	0.0000	0.0173	0.0151	0.0131
opus4.7	0.0113	0.0134	0.0227	0.0175	0.0150	0.0141	0.0147	0.0173	0.0000	0.0106	0.0101
qwen3.6+	0.0069	0.0115	0.0179	0.0168	0.0134	0.0123	0.0088	0.0151	0.0106	0.0000	0.0040
qwen3.7	0.0080	0.0127	0.0166	0.0142	0.0100	0.0111	0.0093	0.0131	0.0101	0.0040	0.0000

3.4 Cosine medoid

#	Evaluator	Sᵢ (weighted cosine total)
1	qwen-3.7-max	0.0099
2	qwen-3.6-plus	0.0107
3	deepseek-4-pro	0.0110
4	kimi-2.6	0.0124
5	opus-4.7	0.0133
6	gpt-5.5	0.0137
7	gpt-5.4	0.0146
8	fable-5	0.0147
9	mimo-2.5-pro	0.0167
10	glm-5.1	0.0182
11	gemini-3.1-pro	0.0190

Near-ties (Δ < ε = 0.002): {qwen-3.7-max, qwen-3.6-plus, deepseek-4-pro, kimi-2.6, opus-4.7, gpt-5.5, gpt-5.4, fable-5}; {mimo-2.5-pro, glm-5.1, gemini-3.1-pro} — treat as effectively tied.

4. Per-metric comparison (no combined winner)

Evaluator	Closest: rank-Euclidean	Closest: Spearman	Closest: cosine	Medoid: rank-Euclidean	Medoid: cosine
deepseek-4-pro	9	9	3	10	3
fable-5	5	5	8	4	8
gemini-3.1-pro	11	11	11	11	11
glm-5.1	6	6	10	6	10
gpt-5.4	2	4	7	3	7
gpt-5.5	1	1	6	1	6
kimi-2.6	10	10	4	9	4
mimo-2.5-pro	3	2	9	2	9
opus-4.7	8	7	5	8	5
qwen-3.6-plus	7	8	2	7	2
qwen-3.7-max	4	3	1	5	1

Agreement across metrics. The five orderings agree at the extremes of each track, but the two tracks measure different things:

gpt-5.5 owns the rank track: #1 on all three rank-based metrics (Euclidean dᵢ = 0.986, Spearman 1−ρ = 0.052, medoid Sᵢ = 1.780). By the method's primary definitions it is both the closest to the overall opinion and the most central evaluator.
qwen-3.7-max owns the semantic track: #1 cosine-to-centroid (0.0029) and #1 cosine medoid (0.0099) — and it is also top-5 on every rank-track metric (#4 Euclidean / #3 Spearman / #5 medoid), making it the only evaluator in the top five of all five columns. If one evaluator had to stand in for the group across both tracks, it is qwen-3.7-max.
gemini-3.1-pro is last on all five metrics — the unambiguous outlier. Its coarse 4-tier opinion distorts the rank vector (opus last vs consensus 6.30; the whole gpt-5.5/kimi-2.6 tier parked at 8.5), and its document is also the shortest and least typical text.
The tracks genuinely disagree in the middle. deepseek-4-pro and kimi-2.6, the rank track's bottom pair after gemini (#9/#10 on closeness, #10/#9 on medoid), are semantically central (#3 and #4): their texts follow the shared template and vocabulary, but their orderings deviate (deepseek-4-pro crowns kimi-2.6 #1 and buries opus at #11; kimi-2.6 lifts gemini-3.1-pro to #4 and likewise buries opus). The mirror image: mimo-2.5-pro is strong on the rank track (#3 Euclidean, #2 Spearman, #2 medoid) and glm-5.1 sits mid-table (#6 on all three), yet both are semantically peripheral (#9 and #10). Rank distance measures what an evaluator concluded; cosine distance measures how it wrote, and here the two are nearly uncorrelated.
The semantic track is flat and should be read as weak evidence. Cosine distances to the centroid span only 0.0029–0.0120 and most adjacent gaps fall under ε = 0.002 (see the near-tie chains flagged in §3.2/§3.4): all eleven documents share format, headings, and subject matter. Only the contrast between the two ends rises above the noise: qwen-3.7-max (0.0029) is closest, glm-5.1 (0.0112) and gemini-3.1-pro (0.0120) are farthest, a spread of more than 4ε.
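How weakly the two tracks are related can be quantified from the comparison table itself. The sketch below computes a Spearman ρ between the ordinal positions in the rank-track columns and the cosine columns; since each column is a strict 1..11 ordering, the no-ties shortcut formula applies. With n = 11 this is a rough indicator only, and the helper name is hypothetical.

```python
def spearman_no_ties(a, b):
    """Spearman rho for two strict rankings of the same items (no tied positions)."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Ordinal positions from the per-metric table, in its row order
# (deepseek-4-pro, fable-5, gemini-3.1-pro, glm-5.1, gpt-5.4, gpt-5.5,
#  kimi-2.6, mimo-2.5-pro, opus-4.7, qwen-3.6-plus, qwen-3.7-max).
closest_rank_euclid = [9, 5, 11, 6, 2, 1, 10, 3, 8, 7, 4]
closest_cosine = [3, 8, 11, 10, 7, 6, 4, 9, 5, 2, 1]
medoid_rank_euclid = [10, 4, 11, 6, 3, 1, 9, 2, 8, 7, 5]
medoid_cosine = [3, 8, 11, 10, 7, 6, 4, 9, 5, 2, 1]

rho_closest = spearman_no_ties(closest_rank_euclid, closest_cosine)
rho_medoid = spearman_no_ties(medoid_rank_euclid, medoid_cosine)
print(round(rho_closest, 3), round(rho_medoid, 3))  # -0.027 -0.118
```

Both values sit near zero even though gemini-3.1-pro is last in every column, which by itself pulls ρ upward.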

Per the spec, no combined ranking is computed: each metric stands on its own, and the table above is the complete per-metric picture.

5. Notes & caveats

Two missing-data policies by design: gemini-3.1-pro's tier structure is imputed (completing a coarse-but-total opinion); self-rankings are excluded (cᵢⱼ = 0, removing self-bias). Both flow through the method's confidence term.
n = 11 with one imputed (tiered) row: orderings near the flagged ties are not robust. The firmest results are the rank-track endpoints: gpt-5.5 first and gemini-3.1-pro last on all three rank metrics.
Embedding vectors are not hand-verifiable; the derived distance matrices above are the auditable artifact. Re-running reproduces the rank track exactly; for the semantic track it re-embeds the documents (or reuses a content-keyed cache).
Self-exclusion applies to the rank track only; the semantic track embeds each report's full text including its self-assessment.
Reproduce: poetry run python3 docs/planner-graph-ref/analyse/consensus_ranking.py (project venv; GOOGLE_API_KEY required for the semantic track).