Evaluator Preferability Ranking for Deliberated and Comprehensive Analysis

This file ranks evaluators for the specific meta-task:

Which evaluator is most preferable if the goal is to get the most deliberated and comprehensive analysis?

Method

This ranking is not the same as proposal quality ranking.

It is a meta-ranking of evaluator usefulness for analysis quality.

The composite score uses four signals:

  1. Weighted common-thesis score (40%)

    • From evaluator-common-thesis-ranking.md
    • Rewards agreement with theses that are widely shared across the evaluator set

  2. Thesis coverage count (20%)

    • Number of theses the evaluator marks A in thesis-matrix.md
    • Rewards breadth of explicit analytical positions

  3. Structural deliberation score (25%)

    • Presence of explicit analysis features in the evaluator doc:

      • evaluation criteria/framework
      • strengths
      • weaknesses
      • synthesis/recommendation
      • migration guidance
      • non-goals / what to avoid
      • matrix-style comparison artifact
      • cross-cutting / consensus section
      • risk discussion

  4. Document depth score (15%)

    • Based on evaluator document length, capped at 400 lines so verbosity does not dominate the ranking
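
The two document-level signals above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the idea of passing in a set of already-detected features are assumptions, and the file does not say how feature presence was actually determined.

```python
# Hypothetical helpers; names are illustrative, not taken from the source files.

# The nine structure features listed under "Structural deliberation score".
STRUCTURE_FEATURES = (
    "evaluation criteria/framework",
    "strengths",
    "weaknesses",
    "synthesis/recommendation",
    "migration guidance",
    "non-goals / what to avoid",
    "matrix-style comparison artifact",
    "cross-cutting / consensus section",
    "risk discussion",
)

# Cap stated in the method: length beyond 400 lines earns no extra credit.
DEPTH_CAP_LINES = 400


def structure_signals(present_features):
    """Count how many of the nine checklist features are present."""
    return sum(1 for feature in STRUCTURE_FEATURES if feature in present_features)


def capped_depth(line_count):
    """Return the document length in lines, capped at DEPTH_CAP_LINES."""
    return min(line_count, DEPTH_CAP_LINES)
```

For example, a 471-line document is credited with 400 lines of depth, while an 85-line document is credited with 85.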

Composite formula:

preferability_score =
  0.40 * normalized_weighted_commonness
  + 0.20 * normalized_thesis_coverage
  + 0.25 * normalized_structural_deliberation
  + 0.15 * normalized_document_depth

Ranked evaluators

Rank  Evaluator       Preferability score  Weighted common-thesis score  Agreed theses  Structure signals  Lines
1     deepseek-4-pro  97.22                248                           36             8 / 9              471
2     qwen-3.7-max    92.00                242                           33             8 / 9              331
3     opus-4.7        86.73                233                           31             7 / 9              333
4     kimi-2.6        82.43                199                           24             8 / 9              394
5     glm-5.1         79.73                207                           26             7 / 9              332
6     gpt-5.5         79.64                217                           28             7 / 9              257
7     qwen-3.6-plus   78.50                192                           23             9 / 9              260
8     mimo-2.5-pro    74.88                190                           21             7 / 9              350
9     gpt-5.4         68.54                197                           22             6 / 9              210
10    gemini-3.1-pro  50.79                178                           19             3 / 9              85
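
The file does not spell out how each signal is normalized. The sketch below assumes one plausible scheme: the weighted common-thesis score and the agreed-thesis count are each divided by the highest value in the evaluator set, structure signals are divided by 9, lines are capped at 400 and divided by 400, and the weighted sum is scaled to 0-100. Under those assumptions it reproduces the preferability scores in the table to two decimals. The names `ROWS` and `preferability_scores` are illustrative.

```python
# Raw values copied from the ranking table:
# (evaluator, weighted common-thesis score, agreed theses, structure signals, lines)
ROWS = [
    ("deepseek-4-pro", 248, 36, 8, 471),
    ("qwen-3.7-max", 242, 33, 8, 331),
    ("opus-4.7", 233, 31, 7, 333),
    ("kimi-2.6", 199, 24, 8, 394),
    ("glm-5.1", 207, 26, 7, 332),
    ("gpt-5.5", 217, 28, 7, 257),
    ("qwen-3.6-plus", 192, 23, 9, 260),
    ("mimo-2.5-pro", 190, 21, 7, 350),
    ("gpt-5.4", 197, 22, 6, 210),
    ("gemini-3.1-pro", 178, 19, 3, 85),
]

MAX_STRUCTURE = 9  # number of structure features in the checklist
DEPTH_CAP = 400    # line cap stated in the method


def preferability_scores(rows):
    """Compute the composite score per evaluator under the assumed normalization."""
    # Assumption: the first two signals are normalized by the best observed value.
    max_common = max(row[1] for row in rows)
    max_theses = max(row[2] for row in rows)
    scores = {}
    for name, common, theses, structure, lines in rows:
        composite = (
            0.40 * common / max_common
            + 0.20 * theses / max_theses
            + 0.25 * structure / MAX_STRUCTURE
            + 0.15 * min(lines, DEPTH_CAP) / DEPTH_CAP
        )
        scores[name] = round(100 * composite, 2)
    return scores


if __name__ == "__main__":
    ranked = sorted(preferability_scores(ROWS).items(), key=lambda kv: kv[1], reverse=True)
    for rank, (name, score) in enumerate(ranked, start=1):
        print(f"{rank:>2}  {name:<15} {score:.2f}")
```

Because two of the four signals are normalized against the best evaluator in the set, adding or removing an evaluator can shift every score, so the numbers are only comparable within this evaluator set.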

Interpretation

Top tier

  1. deepseek-4-pro

    • Best overall fit for this meta-task
    • Highest common-thesis score
    • Highest thesis coverage
    • Very long document with strong synthesis, migration, risk, and cross-cutting analysis

  2. qwen-3.7-max

    • Extremely comprehensive and highly structured
    • Very broad thesis coverage
    • Strong if the goal is exhaustive analysis, even if some of its architectural preferences are more debatable

  3. opus-4.7

    • Strong breadth plus unusually rich synthesis and cross-proposal reasoning
    • Especially preferable when you want careful structure and explicit tradeoff framing

Strong middle tier

  1. kimi-2.6

    • Long, highly structured, and includes strong comparison artifacts
    • Slightly less aligned with the broadest thesis set than the top three, but still very comprehensive

  2. glm-5.1

    • Good balance of breadth, conservatism, and explicit implementation framing
    • Strong for practical migration-oriented analysis

  3. gpt-5.5

    • Particularly good at state hygiene, consequences, and test-migration reasoning
    • Slightly shorter and less exhaustive than the top few, but still very deliberate

  4. qwen-3.6-plus

    • Structurally the most feature-complete evaluator doc (9 / 9 structure signals)
    • Ranked lower because its thesis coverage and weighted common-thesis alignment are weaker than the top group

Lower tier for this specific meta-task

  1. mimo-2.5-pro

    • Solid coverage and decent depth, but less analytically rich than the stronger evaluators

  2. gpt-5.4

    • Strong evaluator in quality, but its doc is shorter and structurally lighter than the more comprehensive writeups
    • Better as a clean baseline than as the single most exhaustive analyst

  3. gemini-3.1-pro

    • Concise and useful, but much less comprehensive than the rest of the field

Recommendation

  • If you want one evaluator only, choose deepseek-4-pro.
  • If you want one primary + one cross-check, use deepseek-4-pro + opus-4.7.
  • If you want maximum exhaustiveness, use deepseek-4-pro + qwen-3.7-max + opus-4.7.

Notes

  • This ranking is objective only in the sense that it follows the scoring rule above; the choice of signals and weights is itself a judgment call.
  • It answers the question "who is most preferable for comprehensive analysis output?", not "whose final architectural recommendation is definitely best?"
  • Because the universal theses (1-12) are shared by every evaluator, most of the separation comes from non-universal theses plus document structure and depth.