Evaluator Preferability Ranking for Deliberated and Comprehensive Analysis

This file ranks evaluators for the specific meta-task:

Which evaluator is most preferable if the goal is to get the most deliberated and comprehensive analysis?

Method

This ranking is not the same as proposal quality ranking.

It is a meta-ranking of evaluator usefulness for analysis quality.

The composite score uses four signals:

- Weighted common-thesis score (40%)
  - From `evaluator-common-thesis-ranking.md`
  - Rewards agreement with theses that are widely shared across the evaluator set
- Thesis coverage count (20%)
  - Number of theses the evaluator marks A in `thesis-matrix.md`
  - Rewards breadth of explicit analytical positions
- Structural deliberation score (25%)
  - Presence of explicit analysis features in the evaluator doc:
    - evaluation criteria/framework
    - strengths
    - weaknesses
    - synthesis/recommendation
    - migration guidance
    - non-goals / what to avoid
    - matrix-style comparison artifact
    - cross-cutting / consensus section
    - risk discussion
- Document depth score (15%)
  - Based on evaluator document length, capped at 400 lines so verbosity does not dominate the ranking
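
The two document-derived signals can be sketched as small helpers. This is a minimal illustration, not the tooling used to produce the ranking: the signal identifiers are hypothetical spellings of the nine features listed above, and scaling each signal to the range 0 to 1 (count divided by nine, capped line count divided by the cap) is an assumption that happens to be consistent with the published scores.

```python
# Hypothetical identifiers for the nine structure signals listed above.
STRUCTURE_SIGNALS = (
    "evaluation_criteria",
    "strengths",
    "weaknesses",
    "synthesis_recommendation",
    "migration_guidance",
    "non_goals",
    "comparison_matrix",
    "cross_cutting_consensus",
    "risk_discussion",
)

DEPTH_CAP_LINES = 400  # cap stated in the method description


def structure_score(present: set) -> float:
    """Fraction of the nine signals found in a doc (assumed normalization)."""
    unknown = present - set(STRUCTURE_SIGNALS)
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")
    return len(present) / len(STRUCTURE_SIGNALS)


def depth_score(line_count: int, cap: int = DEPTH_CAP_LINES) -> float:
    """Line count clipped at the cap, then scaled to [0, 1] (assumed normalization)."""
    return min(line_count, cap) / cap


# A 471-line document and a 400-line document receive the same depth score,
# which is how the cap keeps verbosity from dominating.
long_doc = depth_score(471)
capped_doc = depth_score(400)
```

Under this sketch, a document missing only one of the nine signals scores 8 / 9 on structure, matching the "8 / 9" notation used in the ranking table.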

Composite formula:

preferability_score =
  0.40 * normalized_weighted_commonness
  + 0.20 * normalized_thesis_coverage
  + 0.25 * normalized_structural_deliberation
  + 0.15 * normalized_document_depth

Ranked evaluators

| Rank | Evaluator | Preferability score | Weighted common-thesis score | Agreed theses | Structure signals | Lines |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | `deepseek-4-pro` | 97.22 | 248 | 36 | 8 / 9 | 471 |
| 2 | `qwen-3.7-max` | 92.00 | 242 | 33 | 8 / 9 | 331 |
| 3 | `opus-4.7` | 86.73 | 233 | 31 | 7 / 9 | 333 |
| 4 | `kimi-2.6` | 82.43 | 199 | 24 | 8 / 9 | 394 |
| 5 | `glm-5.1` | 79.73 | 207 | 26 | 7 / 9 | 332 |
| 6 | `gpt-5.5` | 79.64 | 217 | 28 | 7 / 9 | 257 |
| 7 | `qwen-3.6-plus` | 78.50 | 192 | 23 | 9 / 9 | 260 |
| 8 | `mimo-2.5-pro` | 74.88 | 190 | 21 | 7 / 9 | 350 |
| 9 | `gpt-5.4` | 68.54 | 197 | 22 | 6 / 9 | 210 |
| 10 | `gemini-3.1-pro` | 50.79 | 178 | 19 | 3 / 9 | 85 |
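
The method section does not spell out how each signal is normalized. The sketch below assumes the simplest scheme that matches the published numbers: divide the common-thesis score and the agreed-thesis count by the highest value in the field, divide structure signals by nine, divide the capped line count by 400, and scale the weighted sum to 0-100. The function and variable names are illustrative; the data is copied from the table.

```python
WEIGHTS = {"common": 0.40, "coverage": 0.20, "structure": 0.25, "depth": 0.15}
STRUCTURE_MAX = 9   # nine structure signals
DEPTH_CAP = 400     # line cap from the method section

# (evaluator, published score, common-thesis score, agreed theses, signals, lines)
ROWS = [
    ("deepseek-4-pro", 97.22, 248, 36, 8, 471),
    ("qwen-3.7-max", 92.00, 242, 33, 8, 331),
    ("opus-4.7", 86.73, 233, 31, 7, 333),
    ("kimi-2.6", 82.43, 199, 24, 8, 394),
    ("glm-5.1", 79.73, 207, 26, 7, 332),
    ("gpt-5.5", 79.64, 217, 28, 7, 257),
    ("qwen-3.6-plus", 78.50, 192, 23, 9, 260),
    ("mimo-2.5-pro", 74.88, 190, 21, 7, 350),
    ("gpt-5.4", 68.54, 197, 22, 6, 210),
    ("gemini-3.1-pro", 50.79, 178, 19, 3, 85),
]


def preferability(common, agreed, signals, lines, max_common, max_agreed):
    """Composite score on a 0-100 scale under the assumed normalization."""
    normalized = {
        "common": common / max_common,
        "coverage": agreed / max_agreed,
        "structure": signals / STRUCTURE_MAX,
        "depth": min(lines, DEPTH_CAP) / DEPTH_CAP,
    }
    return 100 * sum(WEIGHTS[k] * normalized[k] for k in WEIGHTS)


# Field maxima used as normalization denominators (an inference, see above).
max_common = max(row[2] for row in ROWS)
max_agreed = max(row[3] for row in ROWS)

recomputed = {
    name: preferability(c, a, s, n, max_common, max_agreed)
    for name, _published, c, a, s, n in ROWS
}
ranking = sorted(recomputed, key=recomputed.get, reverse=True)

for name, published, *_ in ROWS:
    print(f"{name:16s} published={published:6.2f} recomputed={recomputed[name]:6.2f}")
```

With these assumptions every recomputed score agrees with the published one to within rounding, and sorting by the recomputed score gives the same order as the table. Whether 36 is the field maximum or the total number of theses cannot be determined from this file; both readings give the same result here.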

Interpretation

Top tier

- `deepseek-4-pro`
  - Best overall fit for this meta-task
  - Highest common-thesis score
  - Highest thesis coverage
  - Very long document with strong synthesis, migration, risk, and cross-cutting analysis
- `qwen-3.7-max`
  - Extremely comprehensive and highly structured
  - Very broad thesis coverage
  - Strong if the goal is exhaustive analysis, even if some of its architectural preferences are more debatable
- `opus-4.7`
  - Strong breadth plus unusually rich synthesis and cross-proposal reasoning
  - Especially preferable when you want careful structure and explicit tradeoff framing

Strong middle tier

- `kimi-2.6`
  - Long, highly structured, and includes strong comparison artifacts
  - Slightly less aligned with the broadest thesis set than the top three, but still very comprehensive
- `glm-5.1`
  - Good balance of breadth, conservatism, and explicit implementation framing
  - Strong for practical migration-oriented analysis
- `gpt-5.5`
  - Particularly good at state hygiene, consequences, and test-migration reasoning
  - Slightly shorter and less exhaustive than the top few, but still very deliberate
- `qwen-3.6-plus`
  - Structurally the most feature-complete evaluator doc
  - Ranked lower because its thesis coverage and weighted common-thesis alignment are weaker than the top group
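
To see why a 9 / 9 structure score does not lift `qwen-3.6-plus` into the top tier, the score gap between two evaluators can be split into per-signal point contributions. The sketch below is illustrative only: the helper names are hypothetical, and the denominators (248, 36, 9, 400) assume the normalization that is consistent with the published scores rather than one the method section states explicitly.

```python
WEIGHTS = {"common": 0.40, "coverage": 0.20, "structure": 0.25, "depth": 0.15}
# Assumed denominators: field maxima from the ranking table (248, 36),
# the nine structure signals, and the 400-line depth cap.
SCALE = {"common": 248, "coverage": 36, "structure": 9, "depth": 400}


def weighted_points(common, agreed, signals, lines):
    """Points (out of 100) that each signal contributes to the composite score."""
    raw = {
        "common": common,
        "coverage": agreed,
        "structure": signals,
        "depth": min(lines, SCALE["depth"]),
    }
    return {k: 100 * WEIGHTS[k] * raw[k] / SCALE[k] for k in WEIGHTS}


def gap_breakdown(a, b):
    """Per-signal point difference (a minus b), rounded to two decimals."""
    points_a, points_b = weighted_points(*a), weighted_points(*b)
    return {k: round(points_a[k] - points_b[k], 2) for k in WEIGHTS}


# Raw columns from the ranking table: (common score, agreed theses, signals, lines)
deepseek_4_pro = (248, 36, 8, 471)
qwen_3_6_plus = (192, 23, 9, 260)

breakdown = gap_breakdown(deepseek_4_pro, qwen_3_6_plus)
print(breakdown)
```

Under these assumptions, `qwen-3.6-plus` gains a little under 3 points from its extra structure signal but gives up roughly 9 points on common-thesis alignment, about 7 on thesis coverage, and about 5 on depth, which accounts for the gap of nearly 19 points between the two evaluators.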

Lower tier for this specific meta-task

- `mimo-2.5-pro`
  - Solid coverage and decent depth, but less analytically rich than the stronger evaluators
- `gpt-5.4`
  - Strong evaluator in quality, but its doc is shorter and structurally lighter than the more comprehensive writeups
  - Better as a clean baseline than as the single most exhaustive analyst
- `gemini-3.1-pro`
  - Concise and useful, but much less comprehensive than the rest of the field

Recommended use

If you want one evaluator only, choose deepseek-4-pro.
If you want one primary + one cross-check, use deepseek-4-pro + opus-4.7.
If you want maximum exhaustiveness, use deepseek-4-pro + qwen-3.7-max + opus-4.7.

Notes

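This ranking is objective only in the sense that it follows mechanically from the scoring rule above; the choice of signals and weights is itself a judgment call.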
It answers the question "who is most preferable for comprehensive analysis output?", not "whose final architectural recommendation is definitely best?"
Because the universal theses (1-12) are shared by everyone, most separation comes from non-universal theses plus document structure/depth.