Evaluator Ranking by Common-Thesis Score
This file ranks evaluators by a commonness-weighted count of the theses they agree with, so widely shared theses count for more.
For a broader meta-ranking of evaluator preferability for deliberated and comprehensive analysis, see evaluator-preferability-ranking.md.
Method
- Source of truth: `thesis-matrix.md`
- Covered theses: `1-40`
- Evaluator stance counted: only `A` (agree)
- Thesis commonness = number of evaluators marked `A` for that thesis
- Evaluator score = sum of thesis commonness values across all theses that evaluator marks `A`
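In formula form, writing $T_A(e)$ for the set of theses in `1-40` that evaluator $e$ marks `A`:

$$
\mathrm{commonness}(t) = \bigl|\{\, e : t \in T_A(e) \,\}\bigr|,
\qquad
\mathrm{score}(e) = \sum_{t \in T_A(e)} \mathrm{commonness}(t)
$$
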
Interpretation:
- Higher score means the evaluator contains more of the directory's most widely shared theses.
- This metric rewards both breadth and alignment with high-consensus theses.
- It is different from simple thesis-count coverage: an evaluator gets more credit for carrying a thesis agreed by 10 evaluators than one agreed by 2 evaluators.
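
The scoring rule described in the Method section can be sketched in Python. This is a minimal illustration, not the script used to produce the ranking: the evaluator names (`eval-a`, `eval-b`, `eval-c`), the thesis IDs, and the function names are all hypothetical, and the real stance data lives in `thesis-matrix.md`.

```python
from collections import Counter

# Hypothetical stance data: evaluator -> set of thesis IDs marked "A".
# Illustrative values only; they are not taken from thesis-matrix.md.
agreements = {
    "eval-a": {1, 2, 3},
    "eval-b": {1, 2},
    "eval-c": {1, 3, 4},
}


def thesis_commonness(agreements):
    """Count, for each thesis, how many evaluators mark it A."""
    counts = Counter()
    for theses in agreements.values():
        counts.update(theses)
    return counts


def evaluator_scores(agreements):
    """Sum thesis commonness over each evaluator's agreed theses."""
    commonness = thesis_commonness(agreements)
    return {
        name: sum(commonness[t] for t in theses)
        for name, theses in agreements.items()
    }


def rank(agreements):
    """Sort evaluators by score, highest first; ties broken by name."""
    scores = evaluator_scores(agreements)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


print(rank(agreements))
```

In this toy data `eval-a` and `eval-c` both agree with three theses, but `eval-a` scores higher because its theses are shared by more evaluators. That is the difference from simple thesis-count coverage.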
Ranked evaluators
| Rank | Evaluator | Weighted score | Agreed theses | Avg commonness per agreed thesis |
|---|---|---|---|---|
| 1 | deepseek-4-pro | 248 | 36 | 6.89 |
| 2 | qwen-3.7-max | 242 | 33 | 7.33 |
| 3 | opus-4.7 | 233 | 31 | 7.52 |
| 4 | gpt-5.5 | 217 | 28 | 7.75 |
| 5 | glm-5.1 | 207 | 26 | 7.96 |
| 6 | kimi-2.6 | 199 | 24 | 8.29 |
| 7 | gpt-5.4 | 197 | 22 | 8.95 |
| 8 | qwen-3.6-plus | 192 | 23 | 8.35 |
| 9 | mimo-2.5-pro | 190 | 21 | 9.05 |
| 10 | gemini-3.1-pro | 178 | 19 | 9.37 |
Reading the ranking
- `deepseek-4-pro` ranks first because it agrees with the widest spread of normalized theses, including many mid-consensus and proposal-specific ones.
- `qwen-3.7-max` and `opus-4.7` rank high for the same reason: broad agreement footprint beyond the universal core.
- `gemini-3.1-pro`, `mimo-2.5-pro`, and `gpt-5.4` have higher average commonness per agreed thesis, but they agree with fewer total theses, so their weighted totals are lower.
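
The trade-off between breadth and average commonness can be checked directly from the table, since the last column is the weighted score divided by the number of agreed theses. The sketch below copies the scores and counts from the ranking table; the variable and function names are illustrative.

```python
# (evaluator, weighted score, agreed theses), copied from the ranking table.
rows = [
    ("deepseek-4-pro", 248, 36),
    ("qwen-3.7-max", 242, 33),
    ("opus-4.7", 233, 31),
    ("gpt-5.5", 217, 28),
    ("glm-5.1", 207, 26),
    ("kimi-2.6", 199, 24),
    ("gpt-5.4", 197, 22),
    ("qwen-3.6-plus", 192, 23),
    ("mimo-2.5-pro", 190, 21),
    ("gemini-3.1-pro", 178, 19),
]


def avg_commonness(score, agreed):
    """Average commonness per agreed thesis, rounded as in the table."""
    return round(score / agreed, 2)


averages = {name: avg_commonness(score, agreed) for name, score, agreed in rows}

# Evaluator with the highest average commonness per agreed thesis.
top_by_average = max(averages, key=averages.get)

for name, score, agreed in rows:
    print(f"{name:15s} score={score} agreed={agreed} avg={averages[name]:.2f}")
```

Here `top_by_average` is `gemini-3.1-pro`, which has the lowest weighted score: a high average over few theses does not outweigh a broader agreement footprint.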
Notes
- The universal theses (`1-12`) contribute equally to every evaluator, so most rank separation comes from theses `13-40`.
- Aggregate ranking-count theses (`41-44`) are excluded because they are already derived summary facts, not independent stance rows.
- If needed, a second ranking can be added later using a different weighting rule, such as inverse chart rank instead of raw agreement count.
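
The first note can be made concrete with a short calculation. The sketch below assumes that "universal" means every evaluator marks the thesis `A`, and that the ten evaluators in the ranking table are the full evaluator set, so each universal thesis has commonness 10. Both points are assumptions, not statements from `thesis-matrix.md`. Under them, theses `1-12` add the same baseline to every score and the remainder comes from theses `13-40`.

```python
NUM_EVALUATORS = 10    # assumption: the ranking table lists every evaluator
UNIVERSAL_THESES = 12  # theses 1-12, assumed to be marked A by all evaluators

# Shared baseline contributed by the universal theses to every evaluator.
baseline = NUM_EVALUATORS * UNIVERSAL_THESES

# (weighted score, agreed theses) for the first and last table rows.
table = {
    "deepseek-4-pro": (248, 36),
    "gemini-3.1-pro": (178, 19),
}


def non_universal_part(score, agreed):
    """Score and thesis count left after removing the universal baseline."""
    return score - baseline, agreed - UNIVERSAL_THESES


for name, (score, agreed) in table.items():
    extra_score, extra_theses = non_universal_part(score, agreed)
    print(f"{name}: baseline={baseline}, "
          f"non-universal score={extra_score} over {extra_theses} theses")
```

Under these assumptions the whole 70-point gap between first and last place (248 versus 178) comes from the non-universal theses.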