Evaluator Ranking by Common-Thesis Score

This file ranks evaluators by a commonness-weighted count of the theses they agree with, so agreeing with a widely shared thesis counts for more than agreeing with a rare one.

For a broader meta-ranking of which evaluators are preferable for deliberate, comprehensive analysis, see evaluator-preferability-ranking.md.

Method

Source of truth: thesis-matrix.md
Covered theses: 1-40
Evaluator stance counted: only A (agree)
Thesis commonness = number of evaluators marked A for that thesis
Evaluator score = sum of thesis commonness values across all theses that evaluator marks A

In formula form:

score(evaluator) = Σ agreement_count(thesis)
                   for every thesis where evaluator stance = A
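The formula can be sketched in a few lines of Python. This is a minimal illustration, not the script used to build the table: the evaluator names, the three theses, and the non-agree stance code `D` are all hypothetical, and the real stances live in thesis-matrix.md.

```python
from collections import Counter

# Hypothetical stance matrix: evaluator -> {thesis id: stance}.
# Names, theses, and the "D" stance code are illustrative assumptions,
# not values taken from thesis-matrix.md.
stances = {
    "eval-a": {1: "A", 2: "A", 3: "A"},
    "eval-b": {1: "A", 2: "A", 3: "D"},
    "eval-c": {1: "A", 2: "D", 3: "D"},
}

def thesis_commonness(stances):
    """Count how many evaluators mark each thesis A."""
    counts = Counter()
    for marks in stances.values():
        for thesis, stance in marks.items():
            if stance == "A":
                counts[thesis] += 1
    return counts

def evaluator_scores(stances):
    """Sum thesis commonness over the theses each evaluator marks A."""
    commonness = thesis_commonness(stances)
    return {
        name: sum(commonness[t] for t, s in marks.items() if s == "A")
        for name, marks in stances.items()
    }

commonness = thesis_commonness(stances)
scores = evaluator_scores(stances)
print(scores)
```

In this toy matrix, thesis 1 has commonness 3, thesis 2 has commonness 2, and thesis 3 has commonness 1, so `eval-a` scores 3 + 2 + 1 = 6 while `eval-c` scores only 3.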

Interpretation:

A higher score means the evaluator agrees with more of the directory's most widely shared theses.
This metric rewards both breadth and alignment with high-consensus theses.
It differs from a simple count of agreed theses: a thesis shared by 10 evaluators adds 10 to the score of each evaluator that agrees with it, while a thesis shared by 2 evaluators adds only 2.

Rank	Evaluator	Weighted score	Agreed theses	Avg commonness per agreed thesis
1	`deepseek-4-pro`	248	36	6.89
2	`qwen-3.7-max`	242	33	7.33
3	`opus-4.7`	233	31	7.52
4	`gpt-5.5`	217	28	7.75
5	`glm-5.1`	207	26	7.96
6	`kimi-2.6`	199	24	8.29
7	`gpt-5.4`	197	22	8.95
8	`qwen-3.6-plus`	192	23	8.35
9	`mimo-2.5-pro`	190	21	9.05
10	`gemini-3.1-pro`	178	19	9.37

deepseek-4-pro ranks first because it agrees with the widest spread of normalized theses, including many mid-consensus and proposal-specific ones.
qwen-3.7-max and opus-4.7 rank high for the same reason: broad agreement footprint beyond the universal core.
gemini-3.1-pro, mimo-2.5-pro, and gpt-5.4 have higher average commonness per agreed thesis, but they agree with fewer total theses, so their weighted totals are lower.
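The last column of the table is the weighted score divided by the number of agreed theses. As a quick check, the sketch below recomputes that column from the table's own figures and re-sorts by it; the variable names are illustrative.

```python
# (evaluator, weighted score, agreed theses), copied from the ranking table.
rows = [
    ("deepseek-4-pro", 248, 36),
    ("qwen-3.7-max", 242, 33),
    ("opus-4.7", 233, 31),
    ("gpt-5.5", 217, 28),
    ("glm-5.1", 207, 26),
    ("kimi-2.6", 199, 24),
    ("gpt-5.4", 197, 22),
    ("qwen-3.6-plus", 192, 23),
    ("mimo-2.5-pro", 190, 21),
    ("gemini-3.1-pro", 178, 19),
]

# Average commonness per agreed thesis, rounded as in the table.
averages = {name: round(score / agreed, 2) for name, score, agreed in rows}

# Sorting by average instead of weighted total reorders the list.
by_average = sorted(averages, key=averages.get, reverse=True)
print(by_average[:3])
```

Sorted this way, the three evaluators named above lead the list and `deepseek-4-pro` comes last, which is why the average alone is not used as the ranking key.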

The universal theses (1-12) contribute equally to every evaluator, so most rank separation comes from theses 13-40.
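To see how much of each score is shared, the universal core can be subtracted out. The sketch below assumes that "universal" means all 10 listed evaluators mark the thesis A, so each of the 12 theses has commonness 10 and the core is worth 12 × 10 = 120 points to every evaluator. That reading is an inference from the text, not a figure taken from thesis-matrix.md.

```python
# Assumption: each universal thesis (1-12) is marked A by all 10 evaluators.
NUM_EVALUATORS = 10
NUM_UNIVERSAL = 12
core_points = NUM_EVALUATORS * NUM_UNIVERSAL  # 120 under this assumption

# (weighted score, agreed theses) for the top and bottom of the table.
table = {
    "deepseek-4-pro": (248, 36),
    "gemini-3.1-pro": (178, 19),
}

# Points and thesis counts that come from theses 13-40 only.
residual = {
    name: (score - core_points, agreed - NUM_UNIVERSAL)
    for name, (score, agreed) in table.items()
}
print(residual)
```

Under this assumption, `deepseek-4-pro` earns 128 points from 24 non-universal theses and `gemini-3.1-pro` earns 58 points from 7, so the entire 70-point gap between first and last place comes from theses 13-40.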
Aggregate ranking-count theses (41-44) are excluded because they are already derived summary facts, not independent stance rows.
If needed, a second ranking can be added later using a different weighting rule, such as inverse chart rank instead of raw agreement count.