Evaluator Preferability Ranking for Deliberated and Comprehensive Analysis
This file ranks evaluators for the specific meta-task:
Which evaluator is most preferable if the goal is to get the most deliberated and comprehensive analysis?
Method
This ranking is not the same as proposal quality ranking.
It is a meta-ranking of evaluator usefulness for analysis quality.
The composite score uses four signals:
- Weighted common-thesis score (40%)
    - From `evaluator-common-thesis-ranking.md`
    - Rewards agreement with theses that are widely shared across the evaluator set
- Thesis coverage count (20%)
    - Number of theses the evaluator marks `A` in `thesis-matrix.md`
    - Rewards breadth of explicit analytical positions
- Structural deliberation score (25%)
    - Presence of explicit analysis features in the evaluator doc:
        - evaluation criteria/framework
        - strengths
        - weaknesses
        - synthesis/recommendation
        - migration guidance
        - non-goals / what to avoid
        - matrix-style comparison artifact
        - cross-cutting / consensus section
        - risk discussion
- Document depth score (15%)
    - Based on evaluator document length, capped at 400 lines so verbosity does not dominate the ranking
Composite formula:
preferability_score =
0.40 * normalized_weighted_commonness
+ 0.20 * normalized_thesis_coverage
+ 0.25 * normalized_structural_deliberation
+ 0.15 * normalized_document_depth
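The formula does not say how each term is normalized or how the result is rescaled to the 0-100 range used in the ranking table. The sketch below is one reading, not necessarily the script that produced the table. It assumes the two thesis terms are divided by the field maximum, the structure term by the nine listed features, and the depth term by the 400-line cap, with the sum multiplied by 100. The `Evaluator` class and `preferability_scores` function are illustrative names; the raw numbers are copied from the ranking table in this file.

```python
from dataclasses import dataclass

# Weights taken from the composite formula in this file.
WEIGHTS = {"commonness": 0.40, "coverage": 0.20, "structure": 0.25, "depth": 0.15}
STRUCTURE_SIGNALS = 9  # number of structural features listed in Method
DEPTH_CAP = 400        # line cap stated in Method


@dataclass(frozen=True)
class Evaluator:
    name: str
    commonness: int  # weighted common-thesis score
    theses: int      # agreed theses
    signals: int     # structure signals present, out of 9
    lines: int       # evaluator document length in lines


# Raw columns copied from the ranking table.
EVALUATORS = [
    Evaluator("deepseek-4-pro", 248, 36, 8, 471),
    Evaluator("qwen-3.7-max", 242, 33, 8, 331),
    Evaluator("opus-4.7", 233, 31, 7, 333),
    Evaluator("kimi-2.6", 199, 24, 8, 394),
    Evaluator("glm-5.1", 207, 26, 7, 332),
    Evaluator("gpt-5.5", 217, 28, 7, 257),
    Evaluator("qwen-3.6-plus", 192, 23, 9, 260),
    Evaluator("mimo-2.5-pro", 190, 21, 7, 350),
    Evaluator("gpt-5.4", 197, 22, 6, 210),
    Evaluator("gemini-3.1-pro", 178, 19, 3, 85),
]


def preferability_scores(evaluators):
    """Return {name: composite score on a 0-100 scale}.

    Assumption (not stated in the source): commonness and coverage are
    normalized by the best value in the field.
    """
    max_common = max(e.commonness for e in evaluators)
    max_theses = max(e.theses for e in evaluators)
    scores = {}
    for e in evaluators:
        composite = (
            WEIGHTS["commonness"] * e.commonness / max_common
            + WEIGHTS["coverage"] * e.theses / max_theses
            + WEIGHTS["structure"] * e.signals / STRUCTURE_SIGNALS
            + WEIGHTS["depth"] * min(e.lines, DEPTH_CAP) / DEPTH_CAP
        )
        scores[e.name] = round(100 * composite, 2)
    return scores


if __name__ == "__main__":
    ranked = sorted(preferability_scores(EVALUATORS).items(), key=lambda kv: -kv[1])
    for name, score in ranked:
        print(f"{name:16s} {score:6.2f}")
```

Under these assumptions the computed values agree with the published scores to two decimals, for example 97.22 for deepseek-4-pro and 92.00 for qwen-3.7-max. The cap is visible in the first row: deepseek-4-pro's 471 lines count as 400, so its depth term is already at the maximum.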
Ranked evaluators
| Rank | Evaluator | Preferability score | Weighted common-thesis score | Agreed theses | Structure signals | Lines |
|---|---|---|---|---|---|---|
| 1 | deepseek-4-pro | 97.22 | 248 | 36 | 8 / 9 | 471 |
| 2 | qwen-3.7-max | 92.00 | 242 | 33 | 8 / 9 | 331 |
| 3 | opus-4.7 | 86.73 | 233 | 31 | 7 / 9 | 333 |
| 4 | kimi-2.6 | 82.43 | 199 | 24 | 8 / 9 | 394 |
| 5 | glm-5.1 | 79.73 | 207 | 26 | 7 / 9 | 332 |
| 6 | gpt-5.5 | 79.64 | 217 | 28 | 7 / 9 | 257 |
| 7 | qwen-3.6-plus | 78.50 | 192 | 23 | 9 / 9 | 260 |
| 8 | mimo-2.5-pro | 74.88 | 190 | 21 | 7 / 9 | 350 |
| 9 | gpt-5.4 | 68.54 | 197 | 22 | 6 / 9 | 210 |
| 10 | gemini-3.1-pro | 50.79 | 178 | 19 | 3 / 9 | 85 |
Interpretation
Top tier
- deepseek-4-pro
    - Best overall fit for this meta-task
    - Highest common-thesis score (248)
    - Highest thesis coverage (36 theses)
    - Very long document (471 lines) with strong synthesis, migration, risk, and cross-cutting analysis
- qwen-3.7-max
    - Extremely comprehensive and highly structured
    - Very broad thesis coverage (33 theses)
    - Strong if the goal is exhaustive analysis, even if some of its architectural preferences are more debatable
- opus-4.7
    - Strong breadth plus unusually rich synthesis and cross-proposal reasoning
    - Especially preferable when you want careful structure and explicit tradeoff framing
Strong middle tier
- kimi-2.6
    - Long, highly structured, and includes strong comparison artifacts
    - Less aligned with the broadest thesis set than the top three (common-thesis score 199, 24 theses), but still very comprehensive
- glm-5.1
    - Good balance of breadth, conservatism, and explicit implementation framing
    - Strong for practical migration-oriented analysis
- gpt-5.5
    - Particularly good at state hygiene, consequences, and test-migration reasoning
    - Shorter (257 lines) and less exhaustive than the top few, but still very deliberate
- qwen-3.6-plus
    - Structurally the most feature-complete evaluator doc (9 / 9 structure signals)
    - Ranked lower because its thesis coverage and weighted common-thesis alignment are weaker than the top group
Lower tier for this specific meta-task
- mimo-2.5-pro
    - Solid coverage and decent depth, but less analytically rich than the stronger evaluators
- gpt-5.4
    - Strong evaluator in quality, but its doc is shorter and structurally lighter than the more comprehensive writeups
    - Better as a clean baseline than as the single most exhaustive analyst
- gemini-3.1-pro
    - Concise and useful, but much less comprehensive than the rest of the field
Recommended use
- If you want one evaluator only, choose deepseek-4-pro.
- If you want one primary + one cross-check, use deepseek-4-pro + opus-4.7.
- If you want maximum exhaustiveness, use deepseek-4-pro + qwen-3.7-max + opus-4.7.
Notes
- This ranking is objective only within the scoring rule above.
- It answers the question "who is most preferable for comprehensive analysis output?", not "whose final architectural recommendation is definitely best?"
- Because the universal theses (1-12) are shared by everyone, most separation comes from non-universal theses plus document structure/depth.
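The last note can be made concrete with a small sketch. It assumes that the "Agreed theses" column of the ranking table includes the 12 universal theses for every evaluator; the source implies this but does not state it. The names `non_universal` and `spread_ratio` are hypothetical helpers, and the max/min ratio is only a crude stand-in for "separation".

```python
UNIVERSAL_THESES = 12  # theses 1-12; assumed to be included in every count below

# "Agreed theses" column copied from the ranking table.
AGREED_THESES = {
    "deepseek-4-pro": 36,
    "qwen-3.7-max": 33,
    "opus-4.7": 31,
    "kimi-2.6": 24,
    "glm-5.1": 26,
    "gpt-5.5": 28,
    "qwen-3.6-plus": 23,
    "mimo-2.5-pro": 21,
    "gpt-5.4": 22,
    "gemini-3.1-pro": 19,
}


def non_universal(agreed, universal=UNIVERSAL_THESES):
    """Agreed theses left after removing the ones every evaluator shares."""
    return {name: count - universal for name, count in agreed.items()}


def spread_ratio(counts):
    """Largest count divided by smallest: a crude measure of separation."""
    return max(counts.values()) / min(counts.values())


if __name__ == "__main__":
    remaining = non_universal(AGREED_THESES)
    for name, count in sorted(remaining.items(), key=lambda kv: -kv[1]):
        print(f"{name:16s} {count:2d} non-universal theses")
    print(f"spread with universal theses:    {spread_ratio(AGREED_THESES):.2f}x")
    print(f"spread without universal theses: {spread_ratio(remaining):.2f}x")
```

If the assumption holds, the shared block of 12 adds the same amount to every evaluator's count, so the gap between the broadest and narrowest evaluator looks smaller in the raw column than it is among the theses that actually differ.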