
Evaluator Consensus & Medoid Ranking — 11 Evaluators × 11 Proposals

Run date: 2026-06-11 · Dataset: the 11 remade *-range.md analyses (2026-06-11, including the new fable-5 proposal/evaluator) · Weights: equal, wᵢ = 1/11 · Generated by: consensus_ranking.py (re-run it to regenerate every number below).

What this answers

Each of 11 LLM evaluators ranked the same 11 architectural proposals (docs/planner-graph-ref/proposals/). This report ranks the evaluators by how representative their opinion is of the whole group, two ways, on two tracks:

  1. Closest to overall — smallest distance to the group consensus (method §5, argmin dᵢ).
  2. Medoid — smallest total weighted distance to all other evaluators (method §6, argmin Σⱼ wⱼ·d(Tᵢ,Tⱼ)).

Tracks: a rank track (proposals as aspects, ranks as scores) and a semantic track (Gemini text embeddings of each evaluator's full report).

Method & parameters

Mapped to analyse-evaluation-method.md (formulas restated below; the variance/agreement definition sits in its §4 text-generation step):

  • Scores sᵢⱼ: raw ranks (1 = best … 11 = worst). The method's [−1,1] sentiment scale is an affine image of ranks; applied uniformly it rescales consensus values and distances by a constant factor but changes no ordering, medoid, or ρ, so raw ranks are used for auditability.
  • Confidence cᵢⱼ ∈ {0,1} (§2): cᵢⱼ = 0 on each evaluator's own proposal (self-ranking excluded — 11 cells); 1 elsewhere.
  • Weighted center (§2): s̄ⱼ = (Σᵢ wᵢ cᵢⱼ sᵢⱼ) / (Σᵢ wᵢ cᵢⱼ) — with equal wᵢ this is the mean rank over the 10 non-author evaluators.
  • Per-aspect agreement (§4): σ²ⱼ = (Σᵢ wᵢ cᵢⱼ (sᵢⱼ − s̄ⱼ)²) / (Σᵢ wᵢ cᵢⱼ).
  • Closest to overall (§5): normalized (confidence-weighted) Euclidean dᵢ = √[ Σⱼ cᵢⱼ αⱼ (sᵢⱼ − s̄ⱼ)² / Σⱼ cᵢⱼ αⱼ ] with αⱼ = 1 — an RMS over the 10 proposals evaluator i ranked, so evaluators omitting different cells stay comparable. Cross-check: Spearman 1 − ρ on the same 10 cells.
  • Medoid (§6): pairwise d(Tᵢ,Tⱼ) = RMS Euclidean over the 9 proposals ranked by both (own(i) and own(j) excluded); totals Sᵢ = Σⱼ wⱼ d(Tᵢ,Tⱼ).
  • Near-ties: orderings are flagged when adjacent distances differ by less than ε = 0.05 (rank track) / ε = 0.002 (cosine). n = 11 is small; do not over-read hairline orderings.
  • Semantic track: paragraph-chunked documents (blank-line blocks; fenced code/Mermaid/tables kept intact; blocks < 5 tokens dropped), embedded with gemini-embedding-001 (3072-dim, task_type SEMANTIC_SIMILARITY), pooled per document by length-weighted mean (weight = chunk token count). Centroid = equal-weighted mean of the 11 document vectors; cosine distance primary, Euclidean-to-centroid secondary. Self-exclusion does not apply (an evaluator's self-assessment cannot be excised from its text).

1. Rank matrix R

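Rows = evaluators, columns = proposals, 1 = best. × marks the excluded self-ranking cell (cᵢⱼ = 0). The gemini-3.1-pro row is tier-imputed (table below).

| Evaluator ↓ \ Proposal → | deepseek4 | fable5 | gemini3.1 | glm5.1 | gpt5.4 | gpt5.5 | kimi2.6 | mimo2.5 | opus | qwen3.6+ | qwen3.7 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| deepseek-4-pro | × | 2 | 7 | 5 | 6 | 3 | 1 | 8 | 11 | 10 | 9 |
| fable-5 | 5 | × | 11 | 7 | 3 | 2 | 6 | 9 | 4 | 10 | 8 |
| gemini-3.1-pro (imputed) | 2 | 2 | × | 5 | 2 | 8.5 | 8.5 | 5 | 11 | 8.5 | 8.5 |
| glm-5.1 | 2 | 1 | 10 | × | 3 | 6 | 7 | 11 | 4 | 8 | 9 |
| gpt-5.4 | 5 | 2 | 8 | 6 | × | 3 | 7 | 9 | 4 | 10 | 11 |
| gpt-5.5 | 4 | 1 | 9 | 7 | 2 | × | 6 | 10 | 5 | 11 | 8 |
| kimi-2.6 | 2 | 1 | 4 | 5 | 3 | 7 | × | 6 | 11 | 9 | 10 |
| mimo-2.5-pro | 4 | 1 | 8 | 9 | 3 | 2 | 6 | × | 5 | 10 | 11 |
| opus-4.7 | 8 | 1 | 7 | 9 | 2 | 3 | 5 | 6 | × | 10 | 11 |
| qwen-3.6-plus | 7 | 1 | 10 | 6 | 3 | 2 | 5 | 11 | 4 | × | 9 |
| qwen-3.7-max | 2 | 1 | 11 | 7 | 3 | 5 | 6 | 10 | 4 | 9 | × |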

gemini-3.1-pro imputation. Its remade analysis groups all 11 proposals into 4 ranked tiers with no strict intra-tier order; each tier member receives the average of the slot positions the tier occupies (rank-consistent, row sums to 66 = Σ1..11):

| Tier | Members | Slots | Imputed rank |
|---|---|---|---|
| 1: "Phase-Based Pipeline" | gpt-5.4, fable-5, deepseek-4-pro | 1–3 | 2 |
| 2: "Minimalist Stage" | mimo-2.5-pro, glm-5.1, gemini-3.1-pro | 4–6 | 5 |
| 3: "Micro-node Explosion" | qwen-3.7-max, gpt-5.5, kimi-2.6, qwen-3.6-plus | 7–10 | 8.5 |
| 4: "Edge-Heavy Anti-Pattern" | opus | 11 | 11 |
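The tier rule above can be sketched in a few lines. This is an illustrative reconstruction, not the code in consensus_ranking.py; the function name `impute_tier_ranks` is hypothetical, and the tier memberships are copied from the table.

```python
def impute_tier_ranks(tiers):
    """Give every member of a tier the mean of the rank slots that tier occupies.

    `tiers` is a list of member lists, best tier first.
    """
    ranks = {}
    next_slot = 1  # first rank position not yet assigned to a tier
    for members in tiers:
        slots = range(next_slot, next_slot + len(members))
        mean_slot = sum(slots) / len(members)
        for name in members:
            ranks[name] = mean_slot
        next_slot += len(members)
    return ranks


# Tier memberships as listed in the imputation table.
tiers = [
    ["gpt-5.4", "fable-5", "deepseek-4-pro"],
    ["mimo-2.5-pro", "glm-5.1", "gemini-3.1-pro"],
    ["qwen-3.7-max", "gpt-5.5", "kimi-2.6", "qwen-3.6-plus"],
    ["opus"],
]
imputed = impute_tier_ranks(tiers)
print(imputed["fable-5"], imputed["kimi-2.6"], sum(imputed.values()))
```

Because each tier receives the mean of its own slots, the row total stays at 66 whatever the tier sizes are, which is the rank-consistency property the paragraph above relies on.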

Self-ranking exclusion. The 11 diagonal (author) cells are dropped from the consensus and every rank-track distance. The excluded values (for the record): deepseek-4-pro→4, fable-5→1, gemini-3.1-pro→5, glm-5.1→5, gpt-5.4→1, gpt-5.5→3, kimi-2.6→8, mimo-2.5-pro→7, opus-4.7→4, qwen-3.6-plus→8, qwen-3.7-max→8. Self-bias is visible — 7 of 11 evaluators put their own proposal in their top 5 (fable-5 and gpt-5.4 self-rank #1).

2. Rank track

2.1 Consensus (weighted center) and per-proposal agreement

s̄ⱼ = mean rank over each proposal's 10 non-author rankers; σ²ⱼ = confidence-weighted variance (lower = stronger agreement).

Consensus # Proposal s̄ⱼ (mean rank) σ²ⱼ σⱼ
1 fable-5 1.30 0.21 0.46
2 gpt-5.4 3.00 1.20 1.10
3 deepseek-4-pro 4.10 4.29 2.07
4 gpt-5.5 4.15 4.90 2.21
5 kimi-2.6 5.75 3.46 1.86
6 opus 6.30 9.61 3.10
7 glm-5.1 6.60 2.04 1.43
8 gemini-3.1-pro 8.50 4.25 2.06
9 mimo-2.5-pro 8.50 4.25 2.06
10 qwen-3.7-max 9.45 1.32 1.15
11 qwen-3.6-plus 9.55 0.72 0.85

(Derived by-product: the proposals' own consensus order. The deliverable remains the evaluator rankings below.)
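A minimal sketch of how the consensus columns above can be recomputed from the rank matrix in §1, assuming equal weights and `None` for the excluded self-ranking cells. The helper name `consensus` and the short column labels are illustrative, not taken from consensus_ranking.py.

```python
PROPOSALS = ["deepseek4", "fable5", "gemini3.1", "glm5.1", "gpt5.4", "gpt5.5",
             "kimi2.6", "mimo2.5", "opus", "qwen3.6+", "qwen3.7"]
N = None  # excluded self-ranking cell (confidence c_ij = 0)
R = [
    [N, 2, 7, 5, 6, 3, 1, 8, 11, 10, 9],         # deepseek-4-pro
    [5, N, 11, 7, 3, 2, 6, 9, 4, 10, 8],         # fable-5
    [2, 2, N, 5, 2, 8.5, 8.5, 5, 11, 8.5, 8.5],  # gemini-3.1-pro (imputed)
    [2, 1, 10, N, 3, 6, 7, 11, 4, 8, 9],         # glm-5.1
    [5, 2, 8, 6, N, 3, 7, 9, 4, 10, 11],         # gpt-5.4
    [4, 1, 9, 7, 2, N, 6, 10, 5, 11, 8],         # gpt-5.5
    [2, 1, 4, 5, 3, 7, N, 6, 11, 9, 10],         # kimi-2.6
    [4, 1, 8, 9, 3, 2, 6, N, 5, 10, 11],         # mimo-2.5-pro
    [8, 1, 7, 9, 2, 3, 5, 6, N, 10, 11],         # opus-4.7
    [7, 1, 10, 6, 3, 2, 5, 11, 4, N, 9],         # qwen-3.6-plus
    [2, 1, 11, 7, 3, 5, 6, 10, 4, 9, N],         # qwen-3.7-max
]


def consensus(R, weights):
    """Weighted mean and variance per proposal, skipping cells with c_ij = 0."""
    means, variances = [], []
    for j in range(len(R[0])):
        cells = [(w, row[j]) for w, row in zip(weights, R) if row[j] is not None]
        total_w = sum(w for w, _ in cells)
        mean = sum(w * s for w, s in cells) / total_w
        var = sum(w * (s - mean) ** 2 for w, s in cells) / total_w
        means.append(mean)
        variances.append(var)
    return means, variances


weights = [1 / len(R)] * len(R)  # equal weights, w_i = 1/11
means, variances = consensus(R, weights)
for name, m, v in sorted(zip(PROPOSALS, means, variances), key=lambda t: t[1]):
    print(f"{name:10s} mean={m:.2f} var={v:.2f}")
```

With equal weights the weight terms cancel, so each mean is simply the average over the 10 non-author rankers, as stated in the method section.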

2.2 Closest to overall — normalized Euclidean dᵢ (primary)

dᵢ = √[ Σⱼ cᵢⱼ (sᵢⱼ − s̄ⱼ)² / Σⱼ cᵢⱼ ] — RMS over evaluator i's 10 ranked proposals.

# Evaluator dᵢ (RMS rank units)
1 gpt-5.5 0.9858
2 gpt-5.4 1.1375
3 mimo-2.5-pro 1.2284
4 qwen-3.7-max 1.3978
5 fable-5 1.4053
6 glm-5.1 1.6087
7 qwen-3.6-plus 1.6744
8 opus-4.7 1.8722
9 deepseek-4-pro 2.4684
10 kimi-2.6 2.5373
11 gemini-3.1-pro 2.6700

Near-ties (Δ < ε = 0.05): {qwen-3.7-max, fable-5} — treat as effectively tied.
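As a worked check of the dᵢ formula, the sketch below recomputes the top entry of the table from gpt-5.5's ten ranks and the consensus means in §2.1. The dictionary layout and the name `rms_distance` are illustrative assumptions; with αⱼ = 1 and cᵢⱼ ∈ {0,1} the formula reduces to a plain RMS over the ranked cells.

```python
import math

# Consensus means copied from the §2.1 table.
consensus_mean = {
    "fable-5": 1.30, "gpt-5.4": 3.00, "deepseek-4-pro": 4.10, "gpt-5.5": 4.15,
    "kimi-2.6": 5.75, "opus": 6.30, "glm-5.1": 6.60, "gemini-3.1-pro": 8.50,
    "mimo-2.5-pro": 8.50, "qwen-3.7-max": 9.45, "qwen-3.6-plus": 9.55,
}

# gpt-5.5's row of the rank matrix; its own proposal is omitted (c_ij = 0).
gpt55_ranks = {
    "deepseek-4-pro": 4, "fable-5": 1, "gemini-3.1-pro": 9, "glm-5.1": 7,
    "gpt-5.4": 2, "kimi-2.6": 6, "mimo-2.5-pro": 10, "opus": 5,
    "qwen-3.6-plus": 11, "qwen-3.7-max": 8,
}


def rms_distance(ranks, center):
    """RMS gap between an evaluator's ranks and the consensus, over ranked cells only."""
    squared_gaps = [(rank - center[name]) ** 2 for name, rank in ranks.items()]
    return math.sqrt(sum(squared_gaps) / len(squared_gaps))


d_gpt55 = rms_distance(gpt55_ranks, consensus_mean)
print(round(d_gpt55, 4))  # 0.9858, matching the first row of the table
```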

2.3 Closest to overall — Spearman cross-check

1 − ρ between evaluator i's 10 ranks and the consensus restricted to those proposals.

# Evaluator ρ 1 − ρ
1 gpt-5.5 0.9483 0.0517
2 mimo-2.5-pro 0.9273 0.0727
3 qwen-3.7-max 0.9119 0.0881
4 gpt-5.4 0.8997 0.1003
5 fable-5 0.8450 0.1550
6 glm-5.1 0.8389 0.1611
7 opus-4.7 0.8024 0.1976
8 qwen-3.6-plus 0.8024 0.1976
9 deepseek-4-pro 0.6809 0.3191
10 kimi-2.6 0.6626 0.3374
11 gemini-3.1-pro 0.5975 0.4025

Near-ties (Δ < ε = 0.05): {gpt-5.5, mimo-2.5-pro, qwen-3.7-max, gpt-5.4}; {fable-5, glm-5.1, opus-4.7, qwen-3.6-plus}; {deepseek-4-pro, kimi-2.6} — treat as effectively tied.
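The cross-check can be reproduced with a tie-aware Spearman computed as the Pearson correlation of average ranks. The sketch below is one standard way to do it and reproduces the gpt-5.5 entry; whether consensus_ranking.py handles the tied consensus pair (gemini-3.1-pro = mimo-2.5-pro) the same way is an assumption. Function names are illustrative.

```python
import math


def average_ranks(values):
    """Rank ascending (1 = smallest); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_position = (i + j) / 2 + 1  # positions are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_position
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho as the Pearson correlation of the two average-rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Same proposal order in both lists: deepseek-4-pro, fable-5, gemini-3.1-pro,
# glm-5.1, gpt-5.4, kimi-2.6, mimo-2.5-pro, opus, qwen-3.6-plus, qwen-3.7-max.
gpt55_ranks = [4, 1, 9, 7, 2, 6, 10, 5, 11, 8]
consensus_mean = [4.10, 1.30, 8.50, 6.60, 3.00, 5.75, 8.50, 6.30, 9.55, 9.45]

rho = spearman(gpt55_ranks, consensus_mean)
print(round(rho, 4), round(1 - rho, 4))  # 0.9483 0.0517
```

Re-ranking both vectors inside `spearman` matters here: the evaluator's ten ranks are drawn from 1..11 with one value missing, so they have to be compressed to 1..10 before correlating.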

2.4 Pairwise distance matrix (RMS Euclidean, 9 common proposals)

↓ \ → deepseek4 fable5 gemini3.1 glm5.1 gpt5.4 gpt5.5 kimi2.6 mimo2.5 opus4.7 qwen3.6+ qwen3.7
deepseek4 0.0000 3.4319 3.5590 3.7417 3.1972 3.1972 2.1344 3.2146 2.5166 3.2489 3.5434
fable5 3.4319 0.0000 3.8006 2.0000 1.5275 1.0000 4.0825 1.6330 2.3570 1.1547 1.4907
gemini3.1 3.5590 3.8006 0.0000 3.2660 3.5901 3.0322 0.9280 3.5746 3.4157 4.3173 3.3124
glm5.1 3.7417 2.0000 3.2660 0.0000 1.9720 1.4530 3.5434 1.9437 3.1972 2.2361 0.7454
gpt5.4 3.1972 1.5275 3.5901 1.9720 0.0000 1.3744 3.3830 1.2472 1.9149 1.5635 1.7321
gpt5.5 3.1972 1.0000 3.0322 1.4530 1.3744 0.0000 3.2318 1.3333 2.3805 1.3333 1.2472
kimi2.6 2.1344 4.0825 0.9280 3.5434 3.3830 3.2318 0.0000 3.3166 2.9814 4.2426 3.6818
mimo2.5 3.2146 1.6330 3.5746 1.9437 1.2472 1.3333 3.3166 0.0000 1.4907 1.7638 1.7638
opus4.7 2.5166 2.3570 3.4157 3.1972 1.9149 2.3805 2.9814 1.4907 0.0000 2.3570 2.9627
qwen3.6+ 3.2489 1.1547 4.3173 2.2361 1.5635 1.3333 4.2426 1.7638 2.3570 0.0000 2.0548
qwen3.7 3.5434 1.4907 3.3124 0.7454 1.7321 1.2472 3.6818 1.7638 2.9627 2.0548 0.0000
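Each matrix entry is an RMS over the nine proposals that neither evaluator authored. The sketch below recomputes two entries from full rank rows (self-ranks restored from the values listed in §1); the helper name `pairwise_rms` and the index convention are illustrative.

```python
import math

# Column order: deepseek4, fable5, gemini3.1, glm5.1, gpt5.4, gpt5.5,
#               kimi2.6, mimo2.5, opus, qwen3.6+, qwen3.7
fable5 = [5, 1, 11, 7, 3, 2, 6, 9, 4, 10, 8]   # own proposal at index 1
gpt55 = [4, 1, 9, 7, 2, 3, 6, 10, 5, 11, 8]    # own proposal at index 5
glm51 = [2, 1, 10, 5, 3, 6, 7, 11, 4, 8, 9]    # own proposal at index 3
qwen37 = [2, 1, 11, 7, 3, 5, 6, 10, 4, 9, 8]   # own proposal at index 10


def pairwise_rms(row_i, row_j, own_i, own_j):
    """RMS rank difference over proposals ranked by both (both own columns dropped)."""
    common = [k for k in range(len(row_i)) if k not in (own_i, own_j)]
    squared = sum((row_i[k] - row_j[k]) ** 2 for k in common)
    return math.sqrt(squared / len(common))


d_fable_gpt55 = pairwise_rms(fable5, gpt55, own_i=1, own_j=5)
d_glm_qwen37 = pairwise_rms(glm51, qwen37, own_i=3, own_j=10)
print(round(d_fable_gpt55, 4), round(d_glm_qwen37, 4))  # 1.0 0.7454
```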

2.5 Medoid (rank track)

Sᵢ = Σⱼ wⱼ·d(Tᵢ,Tⱼ) with wⱼ = 1/11 (self-distance 0). Smallest total = medoid.

# Evaluator Sᵢ (weighted total)
1 gpt-5.5 1.7803
2 mimo-2.5-pro 1.9347
3 gpt-5.4 1.9547
4 fable-5 2.0434
5 qwen-3.7-max 2.0486
6 glm-5.1 2.1908
7 qwen-3.6-plus 2.2066
8 opus-4.7 2.3249
9 kimi-2.6 2.8659
10 deepseek-4-pro 2.8895
11 gemini-3.1-pro 2.9814

Near-ties (Δ < ε = 0.05): {mimo-2.5-pro, gpt-5.4}; {fable-5, qwen-3.7-max}; {glm-5.1, qwen-3.6-plus}; {kimi-2.6, deepseek-4-pro} — treat as effectively tied.
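The medoid totals and the near-tie chains can both be re-derived from the published §2.4 matrix. The sketch below is a reconstruction under two assumptions: near-ties are chained over adjacent entries whose gap is below ε, and the 4-decimal matrix is close enough to the unrounded one (totals may differ in the last digit). Function names are illustrative.

```python
NAMES = ["deepseek-4-pro", "fable-5", "gemini-3.1-pro", "glm-5.1", "gpt-5.4", "gpt-5.5",
         "kimi-2.6", "mimo-2.5-pro", "opus-4.7", "qwen-3.6-plus", "qwen-3.7-max"]
D = [  # pairwise RMS distances, copied from §2.4
    [0.0000, 3.4319, 3.5590, 3.7417, 3.1972, 3.1972, 2.1344, 3.2146, 2.5166, 3.2489, 3.5434],
    [3.4319, 0.0000, 3.8006, 2.0000, 1.5275, 1.0000, 4.0825, 1.6330, 2.3570, 1.1547, 1.4907],
    [3.5590, 3.8006, 0.0000, 3.2660, 3.5901, 3.0322, 0.9280, 3.5746, 3.4157, 4.3173, 3.3124],
    [3.7417, 2.0000, 3.2660, 0.0000, 1.9720, 1.4530, 3.5434, 1.9437, 3.1972, 2.2361, 0.7454],
    [3.1972, 1.5275, 3.5901, 1.9720, 0.0000, 1.3744, 3.3830, 1.2472, 1.9149, 1.5635, 1.7321],
    [3.1972, 1.0000, 3.0322, 1.4530, 1.3744, 0.0000, 3.2318, 1.3333, 2.3805, 1.3333, 1.2472],
    [2.1344, 4.0825, 0.9280, 3.5434, 3.3830, 3.2318, 0.0000, 3.3166, 2.9814, 4.2426, 3.6818],
    [3.2146, 1.6330, 3.5746, 1.9437, 1.2472, 1.3333, 3.3166, 0.0000, 1.4907, 1.7638, 1.7638],
    [2.5166, 2.3570, 3.4157, 3.1972, 1.9149, 2.3805, 2.9814, 1.4907, 0.0000, 2.3570, 2.9627],
    [3.2489, 1.1547, 4.3173, 2.2361, 1.5635, 1.3333, 4.2426, 1.7638, 2.3570, 0.0000, 2.0548],
    [3.5434, 1.4907, 3.3124, 0.7454, 1.7321, 1.2472, 3.6818, 1.7638, 2.9627, 2.0548, 0.0000],
]


def medoid_totals(D, weights):
    """S_i = sum_j w_j * d(T_i, T_j) for every evaluator i."""
    return [sum(w * d for w, d in zip(weights, row)) for row in D]


def near_tie_groups(ordered, eps):
    """Chain neighbours of an ascending (name, value) list whose gap is below eps."""
    groups, current = [], [ordered[0][0]]
    for (_, prev), (name, value) in zip(ordered, ordered[1:]):
        if value - prev < eps:
            current.append(name)
        else:
            if len(current) > 1:
                groups.append(current)
            current = [name]
    if len(current) > 1:
        groups.append(current)
    return groups


weights = [1 / len(NAMES)] * len(NAMES)
ranking = sorted(zip(NAMES, medoid_totals(D, weights)), key=lambda item: item[1])
groups = near_tie_groups(ranking, eps=0.05)
print(ranking[0][0], round(ranking[0][1], 4))
print(groups)
```

Chaining means a group can span more than ε end to end as long as every adjacent step is small, which is why long chains appear on the flatter metrics.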

2.6 Why the two closest-to-overall rankings differ (Euclidean vs Spearman)

The two metrics answer subtly different questions:

  • Normalized Euclidean (§2.2) measures magnitude: it squares each gap (sᵢⱼ − s̄ⱼ) between an evaluator's rank and the consensus mean. One big miss costs more than several small ones, and swapping two proposals whose consensus values are nearly tied costs almost nothing.
  • Spearman 1−ρ (§2.3) measures order: it charges for displacement in rank position, no matter how small the consensus gap a swap crosses. Swapping two adjacent proposals costs the same whether their consensus values are 0.05 or 3 points apart.

In this 11×11 run the two rankings agree unusually well: same #1 (gpt-5.5: dᵢ = 0.986 and 1−ρ = 0.052) and the same bottom three in the same order (deepseek-4-pro, kimi-2.6, gemini-3.1-pro). The consensus is sharply stratified at the extremes (fable-5 at 1.30; the qwen pair near 9.5), which leaves little room for disagreement there. Apart from a swap at #7/#8, where opus-4.7 and qwen-3.6-plus tie exactly on Spearman (ρ = 0.8024), the divergence is confined to the #2–#4 band, and it is driven by the consensus's near-tied pairs (§2.1):

| Near-tied consensus pair | s̄ⱼ values | Gap |
|---|---|---|
| deepseek-4-pro ≈ gpt-5.5 | 4.10 vs 4.15 | 0.05 |
| gemini-3.1-pro = mimo-2.5-pro | 8.50 vs 8.50 | 0.00 |
| qwen-3.7-max ≈ qwen-3.6-plus | 9.45 vs 9.55 | 0.10 |
| kimi-2.6 / opus / glm-5.1 cluster | 5.75 / 6.30 / 6.60 | ≤ 0.85 |

Poster cases:

  • gpt-5.4: Euclidean #2 (dᵢ = 1.138) but Spearman #4 (1−ρ = 0.100). Numerically a tight fit: its only sizeable miss is opus at rank 4 vs consensus 6.30. But it inverts the near-tied pairs: gpt-5.5 above deepseek-4-pro (consensus gap 0.05) and qwen-3.6-plus above qwen-3.7-max (gap 0.10). Promoting opus also lifts it above deepseek-4-pro and kimi-2.6 (the consensus already places opus above glm-5.1), adding two more inversions, and glm-5.1 above kimi-2.6 is a fifth. Each inversion is almost free in squared-gap terms and full price in rank-correlation terms.
  • mimo-2.5-pro — Spearman #2 (1−ρ = 0.073) vs Euclidean #3; qwen-3.7-max — Spearman #3 (0.088) vs Euclidean #4. The mirror image: consensus-consistent stories carrying a few large numeric misses that Euclidean squares — mimo-2.5-pro puts glm-5.1 at 9 vs consensus 6.60 (|gap| = 2.40) and gpt-5.5 at 2 vs 4.15; qwen-3.7-max puts gemini-3.1-pro at 11 vs 8.50 (2.50), opus at 4 vs 6.30 (2.30) and deepseek-4-pro at 2 vs 4.10 (2.10).
  • gpt-5.5 — #1 on both: right magnitudes and right order; when an evaluator gets both, the metrics cannot disagree.
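To see which pairs an evaluator actually inverts, the ranks can be compared pair by pair against the consensus means. The sketch below does this for gpt-5.4 using the §1 matrix row and the §2.1 means; `discordant_pairs` is an illustrative helper, and pairs tied in the consensus (gemini-3.1-pro = mimo-2.5-pro) are deliberately not counted as inversions.

```python
from itertools import combinations

# Consensus means from §2.1, restricted to the 10 proposals gpt-5.4 ranked.
consensus_mean = {
    "fable-5": 1.30, "deepseek-4-pro": 4.10, "gpt-5.5": 4.15, "kimi-2.6": 5.75,
    "opus": 6.30, "glm-5.1": 6.60, "gemini-3.1-pro": 8.50, "mimo-2.5-pro": 8.50,
    "qwen-3.7-max": 9.45, "qwen-3.6-plus": 9.55,
}

# gpt-5.4's row of the rank matrix (own proposal excluded).
gpt54_ranks = {
    "deepseek-4-pro": 5, "fable-5": 2, "gemini-3.1-pro": 8, "glm-5.1": 6,
    "gpt-5.5": 3, "kimi-2.6": 7, "mimo-2.5-pro": 9, "opus": 4,
    "qwen-3.6-plus": 10, "qwen-3.7-max": 11,
}


def discordant_pairs(ranks, center):
    """Pairs (a, b): the evaluator ranks a above b, the consensus strictly prefers b."""
    pairs = []
    for a, b in combinations(ranks, 2):
        if ranks[a] > ranks[b]:
            a, b = b, a  # make `a` the proposal this evaluator ranks better
        if center[a] > center[b]:
            pairs.append((a, b))
    return pairs


inversions = discordant_pairs(gpt54_ranks, consensus_mean)
for better, worse in inversions:
    gap = consensus_mean[better] - consensus_mean[worse]
    print(f"{better} above {worse} (consensus gap {gap:.2f})")
```

Listing the gap next to each pair makes the Euclidean/Spearman contrast concrete: small-gap inversions barely move dᵢ but still count against rank correlation.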

Reading guide: Euclidean is the primary metric (it is the method's dᵢ with the confidence-weighted normalization); Spearman is the cross-check. When they disagree about an evaluator, the disagreement itself is informative: Euclidean-better means "right numbers, shuffled near-ties"; Spearman-better means "right story, one or two big numeric misses".

3. Semantic track (Gemini embeddings)

Model gemini-embedding-001 (3072-dim, task_type SEMANTIC_SIMILARITY), batched requests ≤ 50 inputs, throttle 60s per 50 requests + 3s between batches. Self-exclusion not applicable (full text embedded).

3.1 Chunking statistics

| Document | Raw blocks | Kept chunks (≥5 tok) | Tokens | Max chunk |
|---|---|---|---|---|
| deepseek-4-pro | 109 | 93 | 8827 | 569 |
| fable-5 | 76 | 68 | 6853 | 551 |
| gemini-3.1-pro | 29 | 26 | 1353 | 164 |
| glm-5.1 | 79 | 60 | 6004 | 440 |
| gpt-5.4 | 114 | 85 | 3399 | 288 |
| gpt-5.5 | 74 | 60 | 3654 | 355 |
| kimi-2.6 | 100 | 81 | 6389 | 514 |
| mimo-2.5-pro | 89 | 69 | 5797 | 494 |
| opus-4.7 | 90 | 84 | 7550 | 1089 |
| qwen-3.6-plus | 84 | 71 | 5141 | 255 |
| qwen-3.7-max | 101 | 82 | 4673 | 290 |
| total | 945 | 779 | 59640 | |

Token counts via tiktoken cl100k_base (proxy); per-input limit 2048 tokens. The largest chunk is 1089 tokens (opus-4.7), so by this proxy count no chunk required truncation.
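The chunking rule from the method section (blank-line blocks, fenced blocks kept intact, blocks under 5 tokens dropped) might look like the sketch below. It is an approximation under stated assumptions: whitespace-separated words stand in for the tiktoken count the report uses, and `chunk_markdown` is a hypothetical name, not the function in consensus_ranking.py.

```python
FENCE = "`" * 3  # Markdown code-fence marker, built indirectly to keep this block readable


def chunk_markdown(text, min_tokens=5):
    """Split on blank lines, never inside a fenced block; drop very short blocks."""
    chunks, current, in_fence = [], [], False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a fenced block
        if line.strip() == "" and not in_fence:
            if current:  # a blank line outside a fence closes the current block
                chunks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        chunks.append("\n".join(current))
    # Whitespace word count is a stand-in for the real tokenizer (assumption).
    return [c for c in chunks if len(c.split()) >= min_tokens]


doc = (
    "# Title\n\n"
    "This paragraph has more than five tokens.\n\n"
    + FENCE + "python\nx = 1\n\ny = 2  # still the same block\n" + FENCE + "\n\n"
    "ok\n"
)
chunks = chunk_markdown(doc)
print(len(chunks))  # 2: the paragraph and the fenced block; the two short blocks are dropped
```

Markdown tables contain no blank lines, so the blank-line split keeps them intact without special handling.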

3.2 Closest to centroid — cosine (primary), Euclidean (secondary)

# Evaluator cosine distance to centroid Euclidean to centroid
1 qwen-3.7-max 0.0029 0.0694
2 qwen-3.6-plus 0.0037 0.0782
3 deepseek-4-pro 0.0040 0.0819
4 kimi-2.6 0.0054 0.0959
5 opus-4.7 0.0063 0.1034
6 gpt-5.5 0.0067 0.1060
7 gpt-5.4 0.0076 0.1125
8 fable-5 0.0077 0.1139
9 mimo-2.5-pro 0.0097 0.1287
10 glm-5.1 0.0112 0.1396
11 gemini-3.1-pro 0.0120 0.1412

Near-ties (Δ < ε = 0.002): {qwen-3.7-max, qwen-3.6-plus, deepseek-4-pro, kimi-2.6, opus-4.7, gpt-5.5, gpt-5.4, fable-5, mimo-2.5-pro, glm-5.1, gemini-3.1-pro} — treat as effectively tied.
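The pooling and centroid steps behind this table can be illustrated with toy 2-dimensional vectors in place of the 3072-dimensional embeddings. The data and helper names below are made up for illustration; no embedding API is called.

```python
import math


def pool(chunk_vectors, token_counts):
    """Length-weighted mean of a document's chunk vectors (weight = chunk token count)."""
    total = sum(token_counts)
    dim = len(chunk_vectors[0])
    return [sum(w * v[d] for v, w in zip(chunk_vectors, token_counts)) / total
            for d in range(dim)]


def centroid(vectors):
    """Equal-weighted mean of the document vectors."""
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(len(vectors[0]))]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


# Toy documents: (chunk vectors, chunk token counts). Purely illustrative values.
docs = {
    "A": ([[1.0, 0.0], [0.0, 1.0]], [3, 1]),
    "B": ([[1.0, 1.0]], [10]),
    "C": ([[0.0, 1.0]], [4]),
}
pooled = {name: pool(vectors, counts) for name, (vectors, counts) in docs.items()}
center = centroid(list(pooled.values()))
distances = {name: cosine_distance(vec, center) for name, vec in pooled.items()}
closest = min(distances, key=distances.get)
print(pooled["A"], closest)
```

Note that each document contributes equally to the centroid regardless of length; token counts only weight chunks within a document.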

3.3 Pairwise cosine distance matrix

↓ \ → deepseek4 fable5 gemini3.1 glm5.1 gpt5.4 gpt5.5 kimi2.6 mimo2.5 opus4.7 qwen3.6+ qwen3.7
deepseek4 0.0000 0.0111 0.0180 0.0158 0.0160 0.0126 0.0084 0.0128 0.0113 0.0069 0.0080
fable5 0.0111 0.0000 0.0211 0.0248 0.0152 0.0151 0.0159 0.0211 0.0134 0.0115 0.0127
gemini3.1 0.0180 0.0211 0.0000 0.0287 0.0195 0.0192 0.0177 0.0270 0.0227 0.0179 0.0166
glm5.1 0.0158 0.0248 0.0287 0.0000 0.0256 0.0242 0.0166 0.0165 0.0175 0.0168 0.0142
gpt5.4 0.0160 0.0152 0.0195 0.0256 0.0000 0.0070 0.0159 0.0230 0.0150 0.0134 0.0100
gpt5.5 0.0126 0.0151 0.0192 0.0242 0.0070 0.0000 0.0131 0.0216 0.0141 0.0123 0.0111
kimi2.6 0.0084 0.0159 0.0177 0.0166 0.0159 0.0131 0.0000 0.0166 0.0147 0.0088 0.0093
mimo2.5 0.0128 0.0211 0.0270 0.0165 0.0230 0.0216 0.0166 0.0000 0.0173 0.0151 0.0131
opus4.7 0.0113 0.0134 0.0227 0.0175 0.0150 0.0141 0.0147 0.0173 0.0000 0.0106 0.0101
qwen3.6+ 0.0069 0.0115 0.0179 0.0168 0.0134 0.0123 0.0088 0.0151 0.0106 0.0000 0.0040
qwen3.7 0.0080 0.0127 0.0166 0.0142 0.0100 0.0111 0.0093 0.0131 0.0101 0.0040 0.0000

3.4 Cosine medoid

# Evaluator Sᵢ (weighted cosine total)
1 qwen-3.7-max 0.0099
2 qwen-3.6-plus 0.0107
3 deepseek-4-pro 0.0110
4 kimi-2.6 0.0124
5 opus-4.7 0.0133
6 gpt-5.5 0.0137
7 gpt-5.4 0.0146
8 fable-5 0.0147
9 mimo-2.5-pro 0.0167
10 glm-5.1 0.0182
11 gemini-3.1-pro 0.0190

Near-ties (Δ < ε = 0.002): {qwen-3.7-max, qwen-3.6-plus, deepseek-4-pro, kimi-2.6, opus-4.7, gpt-5.5, gpt-5.4, fable-5}; {mimo-2.5-pro, glm-5.1, gemini-3.1-pro} — treat as effectively tied.

4. Per-metric comparison (no combined winner)

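| Evaluator | Closest: rank-Euclidean | Closest: Spearman | Closest: cosine | Medoid: rank-Euclidean | Medoid: cosine |
|---|---|---|---|---|---|
| deepseek-4-pro | 9 | 9 | 3 | 10 | 3 |
| fable-5 | 5 | 5 | 8 | 4 | 8 |
| gemini-3.1-pro | 11 | 11 | 11 | 11 | 11 |
| glm-5.1 | 6 | 6 | 10 | 6 | 10 |
| gpt-5.4 | 2 | 4 | 7 | 3 | 7 |
| gpt-5.5 | 1 | 1 | 6 | 1 | 6 |
| kimi-2.6 | 10 | 10 | 4 | 9 | 4 |
| mimo-2.5-pro | 3 | 2 | 9 | 2 | 9 |
| opus-4.7 | 8 | 7 | 5 | 8 | 5 |
| qwen-3.6-plus | 7 | 8 | 2 | 7 | 2 |
| qwen-3.7-max | 4 | 3 | 1 | 5 | 1 |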

Agreement across metrics. The five orderings agree at the extremes of each track, but the two tracks measure different things:

  • gpt-5.5 owns the rank track: #1 on all three rank-based metrics (Euclidean dᵢ = 0.986, Spearman 1−ρ = 0.052, medoid Sᵢ = 1.780). By the method's primary definitions it is both the closest to the overall opinion and the most central evaluator.
  • qwen-3.7-max owns the semantic track: #1 cosine-to-centroid (0.0029) and #1 cosine medoid (0.0099) — and it is also top-5 on every rank-track metric (#4 Euclidean / #3 Spearman / #5 medoid), making it the only evaluator in the top five of all five columns. If one evaluator had to stand in for the group across both tracks, it is qwen-3.7-max.
  • gemini-3.1-pro is last on all five metrics — the unambiguous outlier. Its coarse 4-tier opinion distorts the rank vector (opus last vs consensus 6.30; the whole gpt-5.5/kimi-2.6 tier parked at 8.5), and its document is also the shortest and least typical text.
  • The tracks genuinely disagree in the middle. deepseek-4-pro and kimi-2.6, the rank track's bottom pair after gemini (#9/#10 on closeness, #10/#9 on medoid), are semantically central (#3 and #4): their texts follow the shared template and vocabulary, but their orderings deviate (deepseek-4-pro crowns kimi-2.6 #1 and buries opus at #11; kimi-2.6 lifts gemini-3.1-pro to #4 and likewise buries opus). The mirror image: mimo-2.5-pro is strong and glm-5.1 mid-pack on the rank track (closeness #3/#2 and #6/#6), but both are semantically peripheral (#9 and #10). Rank distance measures what an evaluator concluded; cosine distance measures how it wrote. Here the two are nearly uncorrelated: Spearman ρ between the rank-Euclidean and cosine closeness columns of the table above is about −0.03.
  • The semantic track is flat and should be read as weak evidence. Cosine distances to the centroid span only 0.0029–0.0120 and nearly all adjacent gaps fall under ε = 0.002 (see the near-tie chains flagged in §3.2/§3.4): all eleven documents share format, headings, and subject matter. Only the contrast between the two ends of the ordering (qwen-3.7-max closest; glm-5.1 and gemini-3.1-pro farthest) rises above the noise; no adjacent pair in §3.2 does.

Per the spec, no combined ranking is computed: each metric stands on its own, and the table above is the complete per-metric picture.

5. Notes & caveats

  • Two missing-data policies by design: gemini-3.1-pro's tier structure is imputed (completing a coarse-but-total opinion); self-rankings are excluded (cᵢⱼ = 0, removing self-bias). Both flow through the method's confidence term.
  • n = 11 with one imputed (tiered) row: orderings near the flagged ties are not robust. The rank-track extremes (gpt-5.5 first, gemini-3.1-pro last) are; on the semantic track even the top positions sit inside a near-tie chain (§3.2).
  • Embedding vectors are not hand-verifiable; the derived distance matrices above are the auditable artifact. Re-running re-embeds (or reuses a content-keyed cache) and reproduces the rank track exactly.
  • Self-exclusion applies to the rank track only; the semantic track embeds each report's full text including its self-assessment.
  • Reproduce: poetry run python3 docs/planner-graph-ref/analyse/consensus_ranking.py (project venv; GOOGLE_API_KEY required for the semantic track).