Evaluator Consensus & Medoid Ranking: 11 Evaluators × 11 Proposals
Run date: 2026-06-11 · Dataset: the 11 remade *-range.md analyses (2026-06-11, including the new fable-5 proposal/evaluator) · Weights: equal, wᵢ = 1/11 · Generated by: consensus_ranking.py (re-run it to regenerate every number below).
What this answers
Each of 11 LLM evaluators ranked the same 11 architectural proposals (docs/planner-graph-ref/proposals/). This report ranks the evaluators by how representative their opinion is of the whole group, two ways, on two tracks:
- Closest to overall: smallest distance to the group consensus (method §5, argmin dᵢ).
- Medoid: smallest total weighted distance to all other evaluators (method §6, argmin Σⱼ wⱼ·d(Tᵢ,Tⱼ)).
Tracks: a rank track (proposals as aspects, ranks as scores) and a semantic track (Gemini text embeddings of each evaluator's full report).
Method & parameters
Mapped to analyse-evaluation-method.md (formulas restated below; the variance/agreement definition sits in its §4 text-generation step):
- Scores sᵢⱼ: raw ranks (1 = best … 11 = worst). The method's [−1,1] sentiment scale is an affine image of ranks; applied uniformly it shifts and rescales the consensus values and rescales every distance by the same constant factor, but changes no ordering, medoid, or ρ, so raw ranks are used for auditability.
- Confidence cᵢⱼ ∈ {0,1} (§2): cᵢⱼ = 0 on each evaluator's own proposal (self-ranking excluded; 11 cells); 1 elsewhere.
- Weighted center (§2): s̄ⱼ = (Σᵢ wᵢ cᵢⱼ sᵢⱼ) / (Σᵢ wᵢ cᵢⱼ). With equal wᵢ this is the mean rank over the 10 non-author evaluators.
- Per-aspect agreement (§4): σ²ⱼ = (Σᵢ wᵢ cᵢⱼ (sᵢⱼ − s̄ⱼ)²) / (Σᵢ wᵢ cᵢⱼ).
- Closest to overall (§5): normalized (confidence-weighted) Euclidean dᵢ = √[ Σⱼ cᵢⱼ αⱼ (sᵢⱼ − s̄ⱼ)² / Σⱼ cᵢⱼ αⱼ ] with αⱼ = 1. This is an RMS over the 10 proposals evaluator i ranked, so evaluators omitting different cells stay comparable. Cross-check: Spearman 1 − ρ on the same 10 cells.
- Medoid (§6): pairwise d(Tᵢ,Tⱼ) = RMS Euclidean over the 9 proposals ranked by both (own(i) and own(j) excluded); totals Sᵢ = Σⱼ wⱼ d(Tᵢ,Tⱼ).
- Near-ties: orderings are flagged when adjacent distances differ by less than ε = 0.05 (rank track) / ε = 0.002 (cosine). n = 11 is small; do not over-read hairline orderings.
- Semantic track: paragraph-chunked documents (blank-line blocks; fenced code/Mermaid/tables kept intact; blocks < 5 tokens dropped), embedded with gemini-embedding-001 (3072-dim, task_type SEMANTIC_SIMILARITY), pooled per document by length-weighted mean (weight = chunk token count). Centroid = equal-weighted mean of the 11 document vectors; cosine distance primary, Euclidean-to-centroid secondary. Self-exclusion does not apply (an evaluator's self-assessment cannot be excised from its text).
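The pooling and centroid steps of the semantic track can be sketched without calling the embedding API. This is a minimal illustration, not the code in consensus_ranking.py: the function names are hypothetical, the 2-dimensional vectors are invented stand-ins for the 3072-dim embeddings, and it assumes chunk vectors are pooled as returned (no per-chunk normalization).

```python
import math

def pool_document(chunk_vectors, chunk_tokens):
    """Length-weighted mean of chunk embeddings (weight = chunk token count)."""
    total = sum(chunk_tokens)
    dim = len(chunk_vectors[0])
    return [
        sum(vec[k] * tok for vec, tok in zip(chunk_vectors, chunk_tokens)) / total
        for k in range(dim)
    ]

def centroid(doc_vectors):
    """Equal-weighted mean of the pooled document vectors."""
    n = len(doc_vectors)
    dim = len(doc_vectors[0])
    return [sum(vec[k] for vec in doc_vectors) / n for k in range(dim)]

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy data (invented): two documents, each with a 30-token and a 10-token chunk.
doc_a = pool_document([[1.0, 0.0], [0.0, 1.0]], [30, 10])
doc_b = pool_document([[1.0, 0.0], [0.0, 1.0]], [10, 30])
center = centroid([doc_a, doc_b])

print(doc_a, doc_b, center)
print(cosine_distance(doc_a, center), cosine_distance(doc_b, center))
```

Because the weight is the token count, a long chunk pulls the document vector further than a short one; the two toy documents end up mirror images and therefore equidistant from their centroid.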
1. Rank matrix R
Rows = evaluators, columns = proposals, 1 = best. — marks the excluded self-ranking cell (cᵢⱼ = 0). The gemini-3.1-pro row is tier-imputed (table below).
| Evaluator ↓ \ Proposal → | deepseek4 | fable5 | gemini3.1 | glm5.1 | gpt5.4 | gpt5.5 | kimi2.6 | mimo2.5 | opus | qwen3.6+ | qwen3.7 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| deepseek-4-pro | — | 2 | 7 | 5 | 6 | 3 | 1 | 8 | 11 | 10 | 9 |
| fable-5 | 5 | — | 11 | 7 | 3 | 2 | 6 | 9 | 4 | 10 | 8 |
| gemini-3.1-pro (imputed) | 2 | 2 | — | 5 | 2 | 8.5 | 8.5 | 5 | 11 | 8.5 | 8.5 |
| glm-5.1 | 2 | 1 | 10 | — | 3 | 6 | 7 | 11 | 4 | 8 | 9 |
| gpt-5.4 | 5 | 2 | 8 | 6 | — | 3 | 7 | 9 | 4 | 10 | 11 |
| gpt-5.5 | 4 | 1 | 9 | 7 | 2 | — | 6 | 10 | 5 | 11 | 8 |
| kimi-2.6 | 2 | 1 | 4 | 5 | 3 | 7 | — | 6 | 11 | 9 | 10 |
| mimo-2.5-pro | 4 | 1 | 8 | 9 | 3 | 2 | 6 | — | 5 | 10 | 11 |
| opus-4.7 | 8 | 1 | 7 | 9 | 2 | 3 | 5 | 6 | — | 10 | 11 |
| qwen-3.6-plus | 7 | 1 | 10 | 6 | 3 | 2 | 5 | 11 | 4 | — | 9 |
| qwen-3.7-max | 2 | 1 | 11 | 7 | 3 | 5 | 6 | 10 | 4 | 9 | — |
gemini-3.1-pro imputation. Its remade analysis groups all 11 proposals into 4 ranked tiers with no strict intra-tier order; each tier member receives the average of the slot positions the tier occupies (rank-consistent, row sums to 66 = Σ1..11):
| Tier | Members | Slots | Imputed rank |
|---|---|---|---|
| 1 — "Phase-Based Pipeline" | gpt-5.4, fable-5, deepseek-4-pro | 1–3 | 2 |
| 2 — "Minimalist Stage" | mimo-2.5-pro, glm-5.1, gemini-3.1-pro | 4–6 | 5 |
| 3 — "Micro-node Explosion" | qwen-3.7-max, gpt-5.5, kimi-2.6, qwen-3.6-plus | 7–10 | 8.5 |
| 4 — "Edge-Heavy Anti-Pattern" | opus | 11 | 11 |
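The tier-to-rank rule can be checked with a few lines of Python. This is a sketch of the averaging rule as described above, with a hypothetical function name; it is not taken from consensus_ranking.py.

```python
def impute_tier_ranks(tiers):
    """Give every member of a tier the mean of the rank slots the tier occupies."""
    ranks = {}
    next_slot = 1
    for members in tiers:
        slots = range(next_slot, next_slot + len(members))
        mean_slot = sum(slots) / len(members)
        for name in members:
            ranks[name] = mean_slot
        next_slot += len(members)
    return ranks

# Tiers in ranked order, members as listed in the table above.
tiers = [
    ["gpt-5.4", "fable-5", "deepseek-4-pro"],
    ["mimo-2.5-pro", "glm-5.1", "gemini-3.1-pro"],
    ["qwen-3.7-max", "gpt-5.5", "kimi-2.6", "qwen-3.6-plus"],
    ["opus"],
]
imputed = impute_tier_ranks(tiers)

for name, rank in imputed.items():
    print(f"{name:15s} {rank}")
print("row sum:", sum(imputed.values()))
```

Averaging the occupied slots keeps the row sum at 66, the same as a strict 1..11 ranking, so the tiered row stays on the same scale as the other ten.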
Self-ranking exclusion. The 11 diagonal (author) cells are dropped from the consensus and every rank-track distance. The excluded values (for the record): deepseek-4-pro→4, fable-5→1, gemini-3.1-pro→5, glm-5.1→5, gpt-5.4→1, gpt-5.5→3, kimi-2.6→8, mimo-2.5-pro→7, opus-4.7→4, qwen-3.6-plus→8, qwen-3.7-max→8. Self-bias is visible: 7 of 11 evaluators put their own proposal in their top 5 (fable-5 and gpt-5.4 self-rank #1), and 10 of 11 rank it better than the mean rank the other evaluators give it (§2.1); only kimi-2.6 (self-rank 8 vs consensus 5.75) does not.
2. Rank track
2.1 Consensus (weighted center) and per-proposal agreement
s̄ⱼ = mean rank over each proposal's 10 non-author rankers; σ²ⱼ = confidence-weighted variance (lower = stronger agreement).
| Consensus # | Proposal | s̄ⱼ (mean rank) | σ²ⱼ | σⱼ |
|---|---|---|---|---|
| 1 | fable-5 | 1.30 | 0.21 | 0.46 |
| 2 | gpt-5.4 | 3.00 | 1.20 | 1.10 |
| 3 | deepseek-4-pro | 4.10 | 4.29 | 2.07 |
| 4 | gpt-5.5 | 4.15 | 4.90 | 2.21 |
| 5 | kimi-2.6 | 5.75 | 3.46 | 1.86 |
| 6 | opus | 6.30 | 9.61 | 3.10 |
| 7 | glm-5.1 | 6.60 | 2.04 | 1.43 |
| 8 | gemini-3.1-pro | 8.50 | 4.25 | 2.06 |
| 9 | mimo-2.5-pro | 8.50 | 4.25 | 2.06 |
| 10 | qwen-3.7-max | 9.45 | 1.32 | 1.15 |
| 11 | qwen-3.6-plus | 9.55 | 0.72 | 0.85 |
(Derived by-product: the proposals' own consensus order. The deliverable remains the evaluator rankings below.)
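The consensus column and the agreement column can be recomputed from the rank matrix in §1. The sketch below is an illustration under stated assumptions, not the project's script: `consensus` is a hypothetical helper, and the excluded self-ranking cells (cᵢⱼ = 0) are represented as `None`.

```python
# Rank matrix from §1; None marks the excluded self-ranking cell (c_ij = 0).
PROPOSALS = ["deepseek4", "fable5", "gemini3.1", "glm5.1", "gpt5.4", "gpt5.5",
             "kimi2.6", "mimo2.5", "opus", "qwen3.6+", "qwen3.7"]
R = [
    [None, 2, 7, 5, 6, 3, 1, 8, 11, 10, 9],        # deepseek-4-pro
    [5, None, 11, 7, 3, 2, 6, 9, 4, 10, 8],        # fable-5
    [2, 2, None, 5, 2, 8.5, 8.5, 5, 11, 8.5, 8.5], # gemini-3.1-pro (imputed)
    [2, 1, 10, None, 3, 6, 7, 11, 4, 8, 9],        # glm-5.1
    [5, 2, 8, 6, None, 3, 7, 9, 4, 10, 11],        # gpt-5.4
    [4, 1, 9, 7, 2, None, 6, 10, 5, 11, 8],        # gpt-5.5
    [2, 1, 4, 5, 3, 7, None, 6, 11, 9, 10],        # kimi-2.6
    [4, 1, 8, 9, 3, 2, 6, None, 5, 10, 11],        # mimo-2.5-pro
    [8, 1, 7, 9, 2, 3, 5, 6, None, 10, 11],        # opus-4.7
    [7, 1, 10, 6, 3, 2, 5, 11, 4, None, 9],        # qwen-3.6-plus
    [2, 1, 11, 7, 3, 5, 6, 10, 4, 9, None],        # qwen-3.7-max
]

def consensus(R, w=None):
    """Weighted mean and weighted variance per proposal over the non-excluded cells."""
    n = len(R)
    w = w or [1.0 / n] * n  # equal evaluator weights by default
    centers, variances = [], []
    for j in range(len(R[0])):
        cells = [(w[i], R[i][j]) for i in range(n) if R[i][j] is not None]
        total_w = sum(wi for wi, _ in cells)
        mean = sum(wi * s for wi, s in cells) / total_w
        var = sum(wi * (s - mean) ** 2 for wi, s in cells) / total_w
        centers.append(mean)
        variances.append(var)
    return centers, variances

centers, variances = consensus(R)
# Print proposals in consensus order (best mean rank first).
for name, m, v in sorted(zip(PROPOSALS, centers, variances), key=lambda t: t[1]):
    print(f"{name:10s} mean={m:.2f} var={v:.2f}")
```

With equal weights the normalization by Σᵢ wᵢ cᵢⱼ reduces to dividing by 10, which is why every mean in the table is a multiple of 0.05.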
2.2 Closest to overall: normalized Euclidean dᵢ (primary)
dᵢ = √[ Σⱼ cᵢⱼ (sᵢⱼ − s̄ⱼ)² / Σⱼ cᵢⱼ ] — RMS over evaluator i's 10 ranked proposals.
| # | Evaluator | dᵢ (RMS rank units) |
|---|---|---|
| 1 | gpt-5.5 | 0.9858 |
| 2 | gpt-5.4 | 1.1375 |
| 3 | mimo-2.5-pro | 1.2284 |
| 4 | qwen-3.7-max | 1.3978 |
| 5 | fable-5 | 1.4053 |
| 6 | glm-5.1 | 1.6087 |
| 7 | qwen-3.6-plus | 1.6744 |
| 8 | opus-4.7 | 1.8722 |
| 9 | deepseek-4-pro | 2.4684 |
| 10 | kimi-2.6 | 2.5373 |
| 11 | gemini-3.1-pro | 2.6700 |
Near-ties (Δ < ε = 0.05): {qwen-3.7-max, fable-5} — treat as effectively tied.
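As a spot check, dᵢ can be recomputed for the first and last rows of the table from the consensus means in §2.1 and the corresponding rank-matrix rows in §1. The helper name and the dictionary layout are illustrative choices, not the script's own.

```python
import math

# Consensus means from §2.1, keyed by the proposal column names of §1.
CONSENSUS = {
    "deepseek4": 4.10, "fable5": 1.30, "gemini3.1": 8.50, "glm5.1": 6.60,
    "gpt5.4": 3.00, "gpt5.5": 4.15, "kimi2.6": 5.75, "mimo2.5": 8.50,
    "opus": 6.30, "qwen3.6+": 9.55, "qwen3.7": 9.45,
}

# Two rows of the rank matrix; the evaluator's own proposal is absent (c_ij = 0).
ROWS = {
    "gpt-5.5": {"deepseek4": 4, "fable5": 1, "gemini3.1": 9, "glm5.1": 7,
                "gpt5.4": 2, "kimi2.6": 6, "mimo2.5": 10, "opus": 5,
                "qwen3.6+": 11, "qwen3.7": 8},
    "gemini-3.1-pro": {"deepseek4": 2, "fable5": 2, "glm5.1": 5, "gpt5.4": 2,
                       "gpt5.5": 8.5, "kimi2.6": 8.5, "mimo2.5": 5, "opus": 11,
                       "qwen3.6+": 8.5, "qwen3.7": 8.5},
}

def distance_to_consensus(row, consensus, alpha=None):
    """RMS gap between an evaluator's ranks and the consensus over the cells it ranked."""
    num = den = 0.0
    for proposal, rank in row.items():
        a = 1.0 if alpha is None else alpha[proposal]  # aspect weight, 1 in this run
        num += a * (rank - consensus[proposal]) ** 2
        den += a
    return math.sqrt(num / den)

d = {name: distance_to_consensus(row, CONSENSUS) for name, row in ROWS.items()}
for name, value in d.items():
    print(f"{name:15s} d = {value:.4f}")
```

Dividing by the number of ranked cells rather than by 11 is what keeps evaluators comparable even though each one omits a different proposal.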
2.3 Closest to overall: Spearman cross-check
1 − ρ between evaluator i's 10 ranks and the consensus restricted to those proposals.
| # | Evaluator | ρ | 1 − ρ |
|---|---|---|---|
| 1 | gpt-5.5 | 0.9483 | 0.0517 |
| 2 | mimo-2.5-pro | 0.9273 | 0.0727 |
| 3 | qwen-3.7-max | 0.9119 | 0.0881 |
| 4 | gpt-5.4 | 0.8997 | 0.1003 |
| 5 | fable-5 | 0.8450 | 0.1550 |
| 6 | glm-5.1 | 0.8389 | 0.1611 |
| 7 | opus-4.7 | 0.8024 | 0.1976 |
| 8 | qwen-3.6-plus | 0.8024 | 0.1976 |
| 9 | deepseek-4-pro | 0.6809 | 0.3191 |
| 10 | kimi-2.6 | 0.6626 | 0.3374 |
| 11 | gemini-3.1-pro | 0.5975 | 0.4025 |
Near-ties (Δ < ε = 0.05): {gpt-5.5, mimo-2.5-pro, qwen-3.7-max, gpt-5.4}; {fable-5, glm-5.1, opus-4.7, qwen-3.6-plus}; {deepseek-4-pro, kimi-2.6} — treat as effectively tied.
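The cross-check can be reproduced for a single evaluator with a small Spearman routine. The sketch assumes the usual definition (Pearson correlation of average ranks, with ties sharing the mean of their positions); that tie handling is an assumption about the script, although it does reproduce the published ρ for gpt-5.5.

```python
import math

def average_ranks(values):
    """Rank ascending (1 = smallest); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation computed on the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# gpt-5.5's ranks for the 10 proposals it did not author, in the column order
# deepseek4, fable5, gemini3.1, glm5.1, gpt5.4, kimi2.6, mimo2.5, opus, qwen3.6+, qwen3.7
gpt55_ranks = [4, 1, 9, 7, 2, 6, 10, 5, 11, 8]
# Consensus means (§2.1) restricted to the same 10 proposals.
consensus_10 = [4.10, 1.30, 8.50, 6.60, 3.00, 5.75, 8.50, 6.30, 9.55, 9.45]

rho = spearman_rho(gpt55_ranks, consensus_10)
print(f"rho = {rho:.4f}, 1 - rho = {1 - rho:.4f}")
```

The only tie in this restricted consensus is gemini-3.1-pro = mimo-2.5-pro at 8.50; both receive rank 7.5.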
2.4 Pairwise distance matrix (RMS Euclidean, 9 common proposals)
| ↓ \ → | deepseek4 | fable5 | gemini3.1 | glm5.1 | gpt5.4 | gpt5.5 | kimi2.6 | mimo2.5 | opus4.7 | qwen3.6+ | qwen3.7 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| deepseek4 | 0.0000 | 3.4319 | 3.5590 | 3.7417 | 3.1972 | 3.1972 | 2.1344 | 3.2146 | 2.5166 | 3.2489 | 3.5434 |
| fable5 | 3.4319 | 0.0000 | 3.8006 | 2.0000 | 1.5275 | 1.0000 | 4.0825 | 1.6330 | 2.3570 | 1.1547 | 1.4907 |
| gemini3.1 | 3.5590 | 3.8006 | 0.0000 | 3.2660 | 3.5901 | 3.0322 | 0.9280 | 3.5746 | 3.4157 | 4.3173 | 3.3124 |
| glm5.1 | 3.7417 | 2.0000 | 3.2660 | 0.0000 | 1.9720 | 1.4530 | 3.5434 | 1.9437 | 3.1972 | 2.2361 | 0.7454 |
| gpt5.4 | 3.1972 | 1.5275 | 3.5901 | 1.9720 | 0.0000 | 1.3744 | 3.3830 | 1.2472 | 1.9149 | 1.5635 | 1.7321 |
| gpt5.5 | 3.1972 | 1.0000 | 3.0322 | 1.4530 | 1.3744 | 0.0000 | 3.2318 | 1.3333 | 2.3805 | 1.3333 | 1.2472 |
| kimi2.6 | 2.1344 | 4.0825 | 0.9280 | 3.5434 | 3.3830 | 3.2318 | 0.0000 | 3.3166 | 2.9814 | 4.2426 | 3.6818 |
| mimo2.5 | 3.2146 | 1.6330 | 3.5746 | 1.9437 | 1.2472 | 1.3333 | 3.3166 | 0.0000 | 1.4907 | 1.7638 | 1.7638 |
| opus4.7 | 2.5166 | 2.3570 | 3.4157 | 3.1972 | 1.9149 | 2.3805 | 2.9814 | 1.4907 | 0.0000 | 2.3570 | 2.9627 |
| qwen3.6+ | 3.2489 | 1.1547 | 4.3173 | 2.2361 | 1.5635 | 1.3333 | 4.2426 | 1.7638 | 2.3570 | 0.0000 | 2.0548 |
| qwen3.7 | 3.5434 | 1.4907 | 3.3124 | 0.7454 | 1.7321 | 1.2472 | 3.6818 | 1.7638 | 2.9627 | 2.0548 | 0.0000 |
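Each entry of the matrix above, and the weighted row totals that decide the medoid in §2.5, can be recomputed from the rank matrix in §1. This is an illustrative sketch with hypothetical names; `None` again stands for an excluded self-ranking cell.

```python
import math

EVALUATORS = ["deepseek-4-pro", "fable-5", "gemini-3.1-pro", "glm-5.1", "gpt-5.4",
              "gpt-5.5", "kimi-2.6", "mimo-2.5-pro", "opus-4.7", "qwen-3.6-plus",
              "qwen-3.7-max"]
# Rank matrix from §1 (rows follow EVALUATORS, columns follow the same model order).
R = [
    [None, 2, 7, 5, 6, 3, 1, 8, 11, 10, 9],
    [5, None, 11, 7, 3, 2, 6, 9, 4, 10, 8],
    [2, 2, None, 5, 2, 8.5, 8.5, 5, 11, 8.5, 8.5],
    [2, 1, 10, None, 3, 6, 7, 11, 4, 8, 9],
    [5, 2, 8, 6, None, 3, 7, 9, 4, 10, 11],
    [4, 1, 9, 7, 2, None, 6, 10, 5, 11, 8],
    [2, 1, 4, 5, 3, 7, None, 6, 11, 9, 10],
    [4, 1, 8, 9, 3, 2, 6, None, 5, 10, 11],
    [8, 1, 7, 9, 2, 3, 5, 6, None, 10, 11],
    [7, 1, 10, 6, 3, 2, 5, 11, 4, None, 9],
    [2, 1, 11, 7, 3, 5, 6, 10, 4, 9, None],
]

def pairwise_rms(row_i, row_j):
    """RMS rank difference over the proposals both evaluators ranked."""
    common = [(a, b) for a, b in zip(row_i, row_j) if a is not None and b is not None]
    return math.sqrt(sum((a - b) ** 2 for a, b in common) / len(common))

n = len(R)
D = [[pairwise_rms(R[i], R[j]) for j in range(n)] for i in range(n)]

# Weighted totals S_i with equal weights w_j = 1/n; the self-distance is 0.
w = [1.0 / n] * n
S = [sum(w[j] * D[i][j] for j in range(n)) for i in range(n)]
medoid = EVALUATORS[min(range(n), key=lambda i: S[i])]

print(f"d(fable-5, gpt-5.5) = {D[1][5]:.4f}")
print(f"d(gemini-3.1-pro, kimi-2.6) = {D[2][6]:.4f}")
print(f"medoid = {medoid}, S = {min(S):.4f}")
```

For two different evaluators the filter drops exactly two columns (each one's own proposal), which is where the "9 common proposals" in the heading comes from.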
2.5 Medoid (rank track)
Sᵢ = Σⱼ wⱼ·d(Tᵢ,Tⱼ) with wⱼ = 1/11 (self-distance 0). Smallest total = medoid.
| # | Evaluator | Sᵢ (weighted total) |
|---|---|---|
| 1 | gpt-5.5 | 1.7803 |
| 2 | mimo-2.5-pro | 1.9347 |
| 3 | gpt-5.4 | 1.9547 |
| 4 | fable-5 | 2.0434 |
| 5 | qwen-3.7-max | 2.0486 |
| 6 | glm-5.1 | 2.1908 |
| 7 | qwen-3.6-plus | 2.2066 |
| 8 | opus-4.7 | 2.3249 |
| 9 | kimi-2.6 | 2.8659 |
| 10 | deepseek-4-pro | 2.8895 |
| 11 | gemini-3.1-pro | 2.9814 |
Near-ties (Δ < ε = 0.05): {mimo-2.5-pro, gpt-5.4}; {fable-5, qwen-3.7-max}; {glm-5.1, qwen-3.6-plus}; {kimi-2.6, deepseek-4-pro} — treat as effectively tied.
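The near-tie flag can be read as chaining: sort by distance and link neighbours whose gap is below ε. The sketch below applies that reading to the Sᵢ values in the table above; the function name is hypothetical, and the chaining rule is inferred from how the flagged groups are reported.

```python
def near_tie_chains(scores, eps):
    """Group a sorted ranking into chains whose adjacent gaps are all below eps."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1])
    chains, current = [], [ordered[0][0]]
    for (_, prev), (name, value) in zip(ordered, ordered[1:]):
        if value - prev < eps:
            current.append(name)          # still within eps of the previous entry
        else:
            if len(current) > 1:
                chains.append(current)    # close a chain of two or more
            current = [name]
    if len(current) > 1:
        chains.append(current)
    return chains

# Weighted totals S_i from the medoid table above.
S = {
    "gpt-5.5": 1.7803, "mimo-2.5-pro": 1.9347, "gpt-5.4": 1.9547,
    "fable-5": 2.0434, "qwen-3.7-max": 2.0486, "glm-5.1": 2.1908,
    "qwen-3.6-plus": 2.2066, "opus-4.7": 2.3249, "kimi-2.6": 2.8659,
    "deepseek-4-pro": 2.8895, "gemini-3.1-pro": 2.9814,
}

for chain in near_tie_chains(S, eps=0.05):
    print(chain)
```

Because only adjacent gaps are tested, a long chain can link entries whose first and last members differ by more than ε; a chain says that no single step in it is reliable, not that all its members are interchangeable.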
2.6 Why the two closest-to-overall rankings differ (Euclidean vs Spearman)
The two metrics answer subtly different questions:
- Normalized Euclidean (§2.2) measures magnitude: it squares each gap (sᵢⱼ − s̄ⱼ) between an evaluator's rank and the consensus mean. One big miss costs more than several small ones, and swapping two proposals whose consensus values are nearly tied costs almost nothing.
- Spearman 1−ρ (§2.3) measures order: it converts both vectors to ranks before comparing them, so a swap is charged by the rank positions it moves, no matter how small the consensus gap it crosses. A swap of adjacent proposals across a 0.05-wide consensus gap costs as much as one across a 3-point gap.
In this 11×11 run the two rankings agree unusually well: same #1 (gpt-5.5: dᵢ = 0.986 and 1−ρ = 0.052) and the same bottom three in the same order (deepseek-4-pro, kimi-2.6, gemini-3.1-pro), because the consensus is sharply stratified at the extremes (fable-5 at 1.30; the qwen pair near 9.5). The divergence is confined to the #2–#4 band, plus a #7/#8 swap between qwen-3.6-plus and opus-4.7, which tie exactly on ρ (0.8024). It is driven by the consensus's near-tied pairs (§2.1):
| Near-tied consensus pair | s̄ⱼ values | Gap |
|---|---|---|
| deepseek-4-pro ≈ gpt-5.5 | 4.10 vs 4.15 | 0.05 |
| gemini-3.1-pro = mimo-2.5-pro | 8.50 vs 8.50 | 0.00 |
| qwen-3.7-max ≈ qwen-3.6-plus | 9.45 vs 9.55 | 0.10 |
| kimi-2.6 / opus / glm-5.1 cluster | 5.75 / 6.30 / 6.60 | ≤ 0.85 |
Poster cases:
- gpt-5.4: Euclidean #2 (dᵢ = 1.138) but Spearman #4 (1−ρ = 0.100). Numerically a tight fit: its only sizeable miss is opus at rank 4 vs consensus 6.30. But it inverts the near-tied pairs: gpt-5.5 above deepseek-4-pro (consensus gap 0.05) and qwen-3.6-plus above qwen-3.7-max (gap 0.10). Promoting opus also lifts it above deepseek-4-pro and kimi-2.6, adding two more inversions (glm-5.1 already sits below opus in the consensus, so that pair is not inverted), and glm-5.1 above kimi-2.6 (gap 0.85) is a fifth. The near-tie inversions are almost free in squared-gap terms and full price in rank-correlation terms.
- mimo-2.5-pro — Spearman #2 (1−ρ = 0.073) vs Euclidean #3; qwen-3.7-max — Spearman #3 (0.088) vs Euclidean #4. The mirror image: consensus-consistent stories carrying a few large numeric misses that Euclidean squares — mimo-2.5-pro puts glm-5.1 at 9 vs consensus 6.60 (|gap| = 2.40) and gpt-5.5 at 2 vs 4.15; qwen-3.7-max puts gemini-3.1-pro at 11 vs 8.50 (2.50), opus at 4 vs 6.30 (2.30) and deepseek-4-pro at 2 vs 4.10 (2.10).
- gpt-5.5 — #1 on both: right magnitudes and right order; when an evaluator gets both, the metrics cannot disagree.
Reading guide: Euclidean is the primary metric (it is the method's dᵢ with the confidence-weighted normalization); Spearman is the cross-check. When they disagree about an evaluator, the disagreement itself is informative: Euclidean-better means "right numbers, shuffled near-ties"; Spearman-better means "right story, one or two big numeric misses".
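To see which order inversions an evaluator carries, the pairs it ranks in the opposite order to the consensus can be listed directly. Counting discordant pairs is the Kendall-style view and is used here only as an illustration (Spearman's ρ is not computed this way); the helper name is hypothetical, and the data are gpt-5.4's row from §1 and the consensus means from §2.1.

```python
from itertools import combinations

# Consensus means (§2.1) for the 10 proposals gpt-5.4 ranked.
CONSENSUS = {
    "deepseek4": 4.10, "fable5": 1.30, "gemini3.1": 8.50, "glm5.1": 6.60,
    "gpt5.5": 4.15, "kimi2.6": 5.75, "mimo2.5": 8.50, "opus": 6.30,
    "qwen3.6+": 9.55, "qwen3.7": 9.45,
}
# gpt-5.4's row of the rank matrix (§1), own proposal omitted.
ROW_GPT54 = {
    "deepseek4": 5, "fable5": 2, "gemini3.1": 8, "glm5.1": 6, "gpt5.5": 3,
    "kimi2.6": 7, "mimo2.5": 9, "opus": 4, "qwen3.6+": 10, "qwen3.7": 11,
}

def discordant_pairs(row, consensus):
    """Pairs the evaluator orders opposite to the consensus; consensus ties are skipped."""
    pairs = []
    for a, b in combinations(row, 2):
        rank_gap = row[a] - row[b]
        mean_gap = consensus[a] - consensus[b]
        if rank_gap * mean_gap < 0:
            # 'preferred' is the proposal the evaluator ranks better (lower number).
            preferred, other = (a, b) if row[a] < row[b] else (b, a)
            pairs.append((preferred, other, abs(mean_gap)))
    return sorted(pairs, key=lambda p: p[2])

pairs = discordant_pairs(ROW_GPT54, CONSENSUS)
for preferred, other, gap in pairs:
    print(f"{preferred} above {other}: consensus gap {gap:.2f}")
```

Sorting by consensus gap separates the cheap inversions (near-tied pairs) from the expensive ones that also show up in the Euclidean distance.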
3. Semantic track (Gemini embeddings)
Model gemini-embedding-001 (3072-dim, task_type SEMANTIC_SIMILARITY), batched requests ≤ 50 inputs, throttle 60s per 50 requests + 3s between batches. Self-exclusion not applicable (full text embedded).
3.1 Chunking statistics
| Document | Raw blocks | Kept chunks (≥5 tok) | Tokens | Max chunk |
|---|---|---|---|---|
| deepseek-4-pro | 109 | 93 | 8827 | 569 |
| fable-5 | 76 | 68 | 6853 | 551 |
| gemini-3.1-pro | 29 | 26 | 1353 | 164 |
| glm-5.1 | 79 | 60 | 6004 | 440 |
| gpt-5.4 | 114 | 85 | 3399 | 288 |
| gpt-5.5 | 74 | 60 | 3654 | 355 |
| kimi-2.6 | 100 | 81 | 6389 | 514 |
| mimo-2.5-pro | 89 | 69 | 5797 | 494 |
| opus-4.7 | 90 | 84 | 7550 | 1089 |
| qwen-3.6-plus | 84 | 71 | 5141 | 255 |
| qwen-3.7-max | 101 | 82 | 4673 | 290 |
| total | 945 | 779 | 59640 | 1089 (max) |
Token counts via tiktoken cl100k_base (a proxy for the embedding model's tokenizer); per-input limit 2048 tokens. The largest chunk is 1089 proxy tokens, below the limit, so no chunk should have required truncation.
3.2 Closest to centroid: cosine (primary), Euclidean (secondary)
| # | Evaluator | cosine distance to centroid | Euclidean to centroid |
|---|---|---|---|
| 1 | qwen-3.7-max | 0.0029 | 0.0694 |
| 2 | qwen-3.6-plus | 0.0037 | 0.0782 |
| 3 | deepseek-4-pro | 0.0040 | 0.0819 |
| 4 | kimi-2.6 | 0.0054 | 0.0959 |
| 5 | opus-4.7 | 0.0063 | 0.1034 |
| 6 | gpt-5.5 | 0.0067 | 0.1060 |
| 7 | gpt-5.4 | 0.0076 | 0.1125 |
| 8 | fable-5 | 0.0077 | 0.1139 |
| 9 | mimo-2.5-pro | 0.0097 | 0.1287 |
| 10 | glm-5.1 | 0.0112 | 0.1396 |
| 11 | gemini-3.1-pro | 0.0120 | 0.1412 |
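Near-ties (Δ < ε = 0.002): {qwen-3.7-max, qwen-3.6-plus, deepseek-4-pro, kimi-2.6, opus-4.7, gpt-5.5, gpt-5.4, fable-5, mimo-2.5-pro, glm-5.1, gemini-3.1-pro}. Every adjacent gap is below ε, so the chain spans the whole table: treat adjacent positions as effectively tied, while noting that the two ends of the chain (0.0029 vs 0.0120) differ by far more than ε.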
3.3 Pairwise cosine distance matrix
| ↓ \ → | deepseek4 | fable5 | gemini3.1 | glm5.1 | gpt5.4 | gpt5.5 | kimi2.6 | mimo2.5 | opus4.7 | qwen3.6+ | qwen3.7 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| deepseek4 | 0.0000 | 0.0111 | 0.0180 | 0.0158 | 0.0160 | 0.0126 | 0.0084 | 0.0128 | 0.0113 | 0.0069 | 0.0080 |
| fable5 | 0.0111 | 0.0000 | 0.0211 | 0.0248 | 0.0152 | 0.0151 | 0.0159 | 0.0211 | 0.0134 | 0.0115 | 0.0127 |
| gemini3.1 | 0.0180 | 0.0211 | 0.0000 | 0.0287 | 0.0195 | 0.0192 | 0.0177 | 0.0270 | 0.0227 | 0.0179 | 0.0166 |
| glm5.1 | 0.0158 | 0.0248 | 0.0287 | 0.0000 | 0.0256 | 0.0242 | 0.0166 | 0.0165 | 0.0175 | 0.0168 | 0.0142 |
| gpt5.4 | 0.0160 | 0.0152 | 0.0195 | 0.0256 | 0.0000 | 0.0070 | 0.0159 | 0.0230 | 0.0150 | 0.0134 | 0.0100 |
| gpt5.5 | 0.0126 | 0.0151 | 0.0192 | 0.0242 | 0.0070 | 0.0000 | 0.0131 | 0.0216 | 0.0141 | 0.0123 | 0.0111 |
| kimi2.6 | 0.0084 | 0.0159 | 0.0177 | 0.0166 | 0.0159 | 0.0131 | 0.0000 | 0.0166 | 0.0147 | 0.0088 | 0.0093 |
| mimo2.5 | 0.0128 | 0.0211 | 0.0270 | 0.0165 | 0.0230 | 0.0216 | 0.0166 | 0.0000 | 0.0173 | 0.0151 | 0.0131 |
| opus4.7 | 0.0113 | 0.0134 | 0.0227 | 0.0175 | 0.0150 | 0.0141 | 0.0147 | 0.0173 | 0.0000 | 0.0106 | 0.0101 |
| qwen3.6+ | 0.0069 | 0.0115 | 0.0179 | 0.0168 | 0.0134 | 0.0123 | 0.0088 | 0.0151 | 0.0106 | 0.0000 | 0.0040 |
| qwen3.7 | 0.0080 | 0.0127 | 0.0166 | 0.0142 | 0.0100 | 0.0111 | 0.0093 | 0.0131 | 0.0101 | 0.0040 | 0.0000 |
3.4 Cosine medoid
| # | Evaluator | Sᵢ (weighted cosine total) |
|---|---|---|
| 1 | qwen-3.7-max | 0.0099 |
| 2 | qwen-3.6-plus | 0.0107 |
| 3 | deepseek-4-pro | 0.0110 |
| 4 | kimi-2.6 | 0.0124 |
| 5 | opus-4.7 | 0.0133 |
| 6 | gpt-5.5 | 0.0137 |
| 7 | gpt-5.4 | 0.0146 |
| 8 | fable-5 | 0.0147 |
| 9 | mimo-2.5-pro | 0.0167 |
| 10 | glm-5.1 | 0.0182 |
| 11 | gemini-3.1-pro | 0.0190 |
Near-ties (Δ < ε = 0.002): {qwen-3.7-max, qwen-3.6-plus, deepseek-4-pro, kimi-2.6, opus-4.7, gpt-5.5, gpt-5.4, fable-5}; {mimo-2.5-pro, glm-5.1, gemini-3.1-pro} — treat as effectively tied.
4. Per-metric comparison (no combined winner)
| Evaluator | Closest: rank-Euclidean | Closest: Spearman | Closest: cosine | Medoid: rank-Euclidean | Medoid: cosine |
|---|---|---|---|---|---|
| deepseek-4-pro | 9 | 9 | 3 | 10 | 3 |
| fable-5 | 5 | 5 | 8 | 4 | 8 |
| gemini-3.1-pro | 11 | 11 | 11 | 11 | 11 |
| glm-5.1 | 6 | 6 | 10 | 6 | 10 |
| gpt-5.4 | 2 | 4 | 7 | 3 | 7 |
| gpt-5.5 | 1 | 1 | 6 | 1 | 6 |
| kimi-2.6 | 10 | 10 | 4 | 9 | 4 |
| mimo-2.5-pro | 3 | 2 | 9 | 2 | 9 |
| opus-4.7 | 8 | 7 | 5 | 8 | 5 |
| qwen-3.6-plus | 7 | 8 | 2 | 7 | 2 |
| qwen-3.7-max | 4 | 3 | 1 | 5 | 1 |
Agreement across metrics. The five orderings agree at the extremes of each track, but the two tracks measure different things:
- gpt-5.5 owns the rank track: #1 on all three rank-based metrics (Euclidean dᵢ = 0.986, Spearman 1−ρ = 0.052, medoid Sᵢ = 1.780). By the method's primary definitions it is both the closest to the overall opinion and the most central evaluator.
- qwen-3.7-max owns the semantic track: #1 cosine-to-centroid (0.0029) and #1 cosine medoid (0.0099) — and it is also top-5 on every rank-track metric (#4 Euclidean / #3 Spearman / #5 medoid), making it the only evaluator in the top five of all five columns. If one evaluator had to stand in for the group across both tracks, it is qwen-3.7-max.
- gemini-3.1-pro is last on all five metrics — the unambiguous outlier. Its coarse 4-tier opinion distorts the rank vector (opus last vs consensus 6.30; the whole gpt-5.5/kimi-2.6 tier parked at 8.5), and its document is also the shortest and least typical text.
- The tracks genuinely disagree on most evaluators. deepseek-4-pro and kimi-2.6, the rank track's bottom pair after gemini (#9/#10 on closeness, #10/#9 on medoid), are semantically central (#3 and #4): their texts follow the shared template and vocabulary, but their orderings deviate (deepseek-4-pro crowns kimi-2.6 #1 and buries opus at #11; kimi-2.6 lifts gemini-3.1-pro to #4 and likewise buries opus). The mirror image: mimo-2.5-pro is near the top of the rank track (closeness #3 Euclidean / #2 Spearman) and glm-5.1 sits mid-table (#6/#6), yet both are semantically peripheral (#9 and #10). Rank distance measures what an evaluator concluded; cosine distance measures how it wrote, and here the two are nearly uncorrelated.
- The semantic track is flat and should be read as weak evidence. Cosine distances to the centroid span only 0.0029–0.0120 and most adjacent gaps fall under ε = 0.002 (see the near-tie chains flagged in §3.2/§3.4): all eleven documents share format, headings, and subject matter. Only the endpoints — qwen-3.7-max clearly closest; glm-5.1 and gemini-3.1-pro clearly farthest — rise above the noise.
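How weakly the two tracks are related can be quantified from the table above by correlating the "Closest: rank-Euclidean" and "Closest: cosine" columns. This is an added spot check rather than an output of consensus_ranking.py; neither column contains ties, so the closed-form Spearman formula applies.

```python
# (rank-Euclidean position, cosine position) per evaluator, from the §4 table.
POSITIONS = {
    "deepseek-4-pro": (9, 3), "fable-5": (5, 8), "gemini-3.1-pro": (11, 11),
    "glm-5.1": (6, 10), "gpt-5.4": (2, 7), "gpt-5.5": (1, 6),
    "kimi-2.6": (10, 4), "mimo-2.5-pro": (3, 9), "opus-4.7": (8, 5),
    "qwen-3.6-plus": (7, 2), "qwen-3.7-max": (4, 1),
}

def spearman_no_ties(pairs):
    """Spearman rho via 1 - 6*sum(d^2) / (n*(n^2-1)); valid only without ties."""
    n = len(pairs)
    d2 = sum((a - b) ** 2 for a, b in pairs)
    return 1 - 6 * d2 / (n * (n * n - 1))

rho = spearman_no_ties(list(POSITIONS.values()))
print(f"rho between rank-track and semantic-track positions: {rho:.3f}")
```

A value this close to zero means that knowing an evaluator's position on one track says almost nothing about its position on the other, apart from gemini-3.1-pro being last on both.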
Per the spec, no combined ranking is computed: each metric stands on its own, and the table above is the complete per-metric picture.
5. Notes & caveats
- Two missing-data policies by design: gemini-3.1-pro's tier structure is imputed (completing a coarse-but-total opinion); self-rankings are excluded (cᵢⱼ = 0, removing self-bias). Both flow through the method's confidence term.
- n = 11 with one imputed (tiered) row: orderings near the flagged ties are not robust; the extremes (top and bottom two on each metric) are.
- Embedding vectors are not hand-verifiable; the derived distance matrices above are the auditable artifact. Re-running re-embeds the documents (or reuses a content-keyed cache); the rank track does not depend on the embeddings and is reproduced exactly.
- Self-exclusion applies to the rank track only; the semantic track embeds each report's full text including its self-assessment.
- Reproduce: `poetry run python3 docs/planner-graph-ref/analyse/consensus_ranking.py` (project venv; `GOOGLE_API_KEY` required for the semantic track).