Evaluator Ranking by Common-Thesis Score

This file ranks evaluators by a weighted score that rewards agreement with the most commonly shared theses.

For a broader meta-ranking of evaluator preferability for deliberated and comprehensive analysis, see evaluator-preferability-ranking.md.

Method

  • Source of truth: thesis-matrix.md
  • Covered theses: 1-40
  • Evaluator stance counted: only A (agree)
  • Thesis commonness = number of evaluators marked A for that thesis
  • Evaluator score = sum of thesis commonness values across all theses that evaluator marks A

In formula form:

score(evaluator) = Σ agreement_count(thesis)
                   for every thesis where evaluator stance = A
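
As an illustration of the formula, here is a minimal Python sketch. It assumes the stance matrix has already been loaded into a nested dict; the names `agreement_counts`, `weighted_scores`, `toy_matrix`, and the non-agree marker "D" are hypothetical and are not taken from thesis-matrix.md.

```python
from collections import Counter


def agreement_counts(matrix):
    """Return thesis commonness: how many evaluators mark each thesis 'A'."""
    counts = Counter()
    for stances in matrix.values():
        for thesis, stance in stances.items():
            if stance == "A":
                counts[thesis] += 1
    return counts


def weighted_scores(matrix):
    """Sum thesis commonness over every thesis an evaluator marks 'A'."""
    counts = agreement_counts(matrix)
    return {
        evaluator: sum(counts[t] for t, s in stances.items() if s == "A")
        for evaluator, stances in matrix.items()
    }


# Hypothetical toy data: three evaluators, three theses.
# "D" stands in for any non-agree stance (an assumption for this sketch).
toy_matrix = {
    "eval-x": {1: "A", 2: "A", 3: "A"},
    "eval-y": {1: "A", 2: "A", 3: "D"},
    "eval-z": {1: "A", 2: "D", 3: "D"},
}

toy_scores = weighted_scores(toy_matrix)
print(toy_scores)
```

In this toy case thesis 1 has commonness 3, thesis 2 has commonness 2, and thesis 3 has commonness 1, so the evaluator that agrees with all three scores 3 + 2 + 1 = 6.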

Interpretation:

  • A higher score means the evaluator agrees with more of the directory's most widely shared theses.
  • This metric rewards both breadth and alignment with high-consensus theses.
  • It differs from a simple count of agreed theses: a thesis that 10 evaluators agree with adds 10 to the score, while one that only 2 evaluators agree with adds 2.
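
The difference between a plain thesis count and the weighted score can be seen in a small Python sketch. The thesis labels, commonness values, and evaluator names below are made up for illustration and do not come from the matrix.

```python
# Hypothetical commonness values: T1 and T2 are high-consensus, T3 is not.
commonness = {"T1": 10, "T2": 10, "T3": 2}

# Two hypothetical evaluators, each agreeing with exactly two theses.
agreed = {
    "eval-p": {"T1", "T2"},
    "eval-q": {"T1", "T3"},
}

# Plain coverage: number of agreed theses.
simple_count = {name: len(theses) for name, theses in agreed.items()}

# Weighted score: sum of commonness over agreed theses.
weighted = {
    name: sum(commonness[t] for t in theses) for name, theses in agreed.items()
}

print(simple_count)
print(weighted)
```

Both evaluators tie on the simple count (2 each), but the weighted score separates them (20 versus 12) because one of them agrees with a low-consensus thesis in place of a high-consensus one.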

Ranked evaluators

| Rank | Evaluator | Weighted score | Agreed theses | Avg. commonness per agreed thesis |
|---|---|---|---|---|
| 1 | deepseek-4-pro | 248 | 36 | 6.89 |
| 2 | qwen-3.7-max | 242 | 33 | 7.33 |
| 3 | opus-4.7 | 233 | 31 | 7.52 |
| 4 | gpt-5.5 | 217 | 28 | 7.75 |
| 5 | glm-5.1 | 207 | 26 | 7.96 |
| 6 | kimi-2.6 | 199 | 24 | 8.29 |
| 7 | gpt-5.4 | 197 | 22 | 8.95 |
| 8 | qwen-3.6-plus | 192 | 23 | 8.35 |
| 9 | mimo-2.5-pro | 190 | 21 | 9.05 |
| 10 | gemini-3.1-pro | 178 | 19 | 9.37 |
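
The last column is the weighted score divided by the number of agreed theses. A short Python sketch can re-check that relationship and the sort order; the rows are copied from the table above, while the names `RANKING`, `average_matches`, and `is_sorted_by_score` are hypothetical helpers.

```python
# (evaluator, weighted score, agreed theses, listed average), copied from the table.
RANKING = [
    ("deepseek-4-pro", 248, 36, 6.89),
    ("qwen-3.7-max", 242, 33, 7.33),
    ("opus-4.7", 233, 31, 7.52),
    ("gpt-5.5", 217, 28, 7.75),
    ("glm-5.1", 207, 26, 7.96),
    ("kimi-2.6", 199, 24, 8.29),
    ("gpt-5.4", 197, 22, 8.95),
    ("qwen-3.6-plus", 192, 23, 8.35),
    ("mimo-2.5-pro", 190, 21, 9.05),
    ("gemini-3.1-pro", 178, 19, 9.37),
]


def average_matches(score, agreed, listed_avg):
    """True if score / agreed agrees with the listed value at two decimals."""
    return abs(score / agreed - listed_avg) < 0.005


def is_sorted_by_score(rows):
    """True if rows are in descending order of weighted score."""
    scores = [score for _, score, _, _ in rows]
    return scores == sorted(scores, reverse=True)


all_averages_ok = all(average_matches(s, n, avg) for _, s, n, avg in RANKING)
print(all_averages_ok, is_sorted_by_score(RANKING))
```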

Reading the ranking

  • deepseek-4-pro ranks first because it agrees with the widest spread of normalized theses (36 of 40), including many mid-consensus and proposal-specific ones.
  • qwen-3.7-max and opus-4.7 rank high for the same reason: a broad agreement footprint (33 and 31 theses) that extends beyond the universal core.
  • gemini-3.1-pro, mimo-2.5-pro, and gpt-5.4 have the highest average commonness per agreed thesis (9.37, 9.05, and 8.95), but they agree with fewer theses in total (19, 21, and 22), so their weighted totals are lower.
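
To make the contrast between total score and average commonness concrete, the following Python sketch sorts the same table rows two ways. The figures come from the ranking table; the variable names are hypothetical.

```python
# evaluator -> (weighted score, agreed theses), copied from the ranking table.
ROWS = {
    "deepseek-4-pro": (248, 36),
    "qwen-3.7-max": (242, 33),
    "opus-4.7": (233, 31),
    "gpt-5.5": (217, 28),
    "glm-5.1": (207, 26),
    "kimi-2.6": (199, 24),
    "gpt-5.4": (197, 22),
    "qwen-3.6-plus": (192, 23),
    "mimo-2.5-pro": (190, 21),
    "gemini-3.1-pro": (178, 19),
}

# Order used in this file: descending weighted score.
by_total = sorted(ROWS, key=lambda name: ROWS[name][0], reverse=True)

# Alternative order: descending average commonness per agreed thesis.
by_average = sorted(
    ROWS, key=lambda name: ROWS[name][0] / ROWS[name][1], reverse=True
)

print(by_total[:3])
print(by_average[:3])
```

Sorting by average puts gemini-3.1-pro, mimo-2.5-pro, and gpt-5.4 on top and deepseek-4-pro last, which is why the weighted total, not the average, is used as the ranking key here.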

Notes

  • The universal theses (1-12) contribute equally to every evaluator, so most rank separation comes from theses 13-40.
  • Aggregate ranking-count theses (41-44) are excluded because they are already derived summary facts, not independent stance rows.
  • If needed, a second ranking can be added later using a different weighting rule, such as inverse chart rank instead of raw agreement count.
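
The first note can be illustrated with a rough Python sketch. It rests on an assumption that is not stated in this file: that "universal" means all 10 ranked evaluators mark theses 1-12 as A, so the core contributes 12 × 10 = 120 points to every score. The constants and the helper `beyond_core` are hypothetical; the scores and thesis counts come from the ranking table.

```python
UNIVERSAL_THESES = 12    # theses 1-12, per the notes
ASSUMED_EVALUATORS = 10  # ASSUMPTION: every ranked evaluator agrees with each of them
CORE_POINTS = UNIVERSAL_THESES * ASSUMED_EVALUATORS  # 120 under that assumption


def beyond_core(score, agreed):
    """Split off the part of a score that comes from theses 13-40."""
    return score - CORE_POINTS, agreed - UNIVERSAL_THESES


# (weighted score, agreed theses) for the first and last rows of the table.
top_extra = beyond_core(248, 36)     # deepseek-4-pro
bottom_extra = beyond_core(178, 19)  # gemini-3.1-pro

print(top_extra, bottom_extra)
```

Under that assumption the entire 70-point gap between the first and last rows (248 versus 178) comes from the non-universal theses, since the core adds the same 120 points to both.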