
Evaluator Consensus Ranking Report

Scope

This report applies analyse-evaluation-method.md to the 10 evaluator reports in this directory. It ranks evaluator reports by:

  1. distance to the equal-weight overall opinion;
  2. medoid centrality under pairwise evaluator distances.

This is an evaluator-opinion synthesis only. It does not add an independent architecture judgment about the planner refactor.

Source Set

Code Evaluator report
DS deepseek-4-pro-range.md
GE gemini-3.1-pro-range.md
GL glm-5.1-pro-range.md
G54 gpt-5.4-range.md
G55 gpt-5.5-range.md
KI kimi-2.6-range.md
MI mimo-2.5-pro-range.md
OP opus-4.7-range.md
Q36 qwen-3.6-range.md
Q37 qwen-3.7-max-range.md

analyse-evaluation-method.md is the method source, not an evaluator.

Result Summary

| Ranking type | Winner | Score | Interpretation |
|---|---|---|---|
| Closest to equal-weight overall opinion | DS | 0.131789 | Lowest weighted Euclidean distance from the aggregate vector. |
| Medoid | DS | 0.276161 | Lowest equal-weight total pairwise distance to all evaluator reports. |

Both methods select deepseek-4-pro-range.md as the most central evaluator opinion.

Scoring Method

The method file requires a formal feature-space representation before aggregation. This report uses 19 fine-grained semantic aspects induced from recurring evaluator claims.

Rules:

  • Evaluator weights are equal: w_i = 0.1.
  • Aspect-distance weights are equal: alpha_j = 1 / 19.
  • Scores use [-1, 1].
  • +1 means strong support for the aspect statement.
  • 0 means neutral, balanced, or not mentioned.
  • -1 means strong rejection of the aspect statement.
  • Only clearly equivalent judgments are normalized into the same aspect.
  • If equivalence is unclear, it remains a separate aspect.
  • Unmentioned aspects are scored 0 by the user-selected override.
  • After neutral filling, every matrix cell is present and c_ij = 1.
  • Mention status is tracked separately in the evidence ledger.
  • Unresolved ambiguous cells: 0 / 190 = 0.0%, below the <= 3% requirement.
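The fill and audit rules above can be sketched in Python. This is a minimal illustration and not the method file's implementation. It assumes that scored cells arrive as a dict keyed by (evaluator, aspect), holding a score and an "explicit"/"implied" status, and that unresolved cells arrive as a set. The name `fill_matrix` and this input shape are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[str, str]  # (evaluator code, aspect code), e.g. ("DS", "A12")


def fill_matrix(
    evaluators: List[str],
    aspects: List[str],
    scored: Dict[Cell, Tuple[float, str]],
    ambiguous: Set[Cell],
    max_ambiguity: float = 0.03,
) -> Tuple[Dict[str, List[float]], Dict[Cell, str], float]:
    """Build the neutral-filled matrix, a separate mention-status ledger, and the ambiguity rate."""
    matrix: Dict[str, List[float]] = {}
    ledger: Dict[Cell, str] = {}
    for e in evaluators:
        row: List[float] = []
        for a in aspects:
            if (e, a) in scored:
                score, status = scored[(e, a)]
                if not -1.0 <= score <= 1.0:
                    raise ValueError(f"score {score} for {e}/{a} is outside [-1, 1]")
                row.append(score)
                ledger[(e, a)] = status  # "explicit" or "implied"
            else:
                # Unmentioned aspect: neutral 0 per the override; the status stays in the ledger.
                row.append(0.0)
                ledger[(e, a)] = "unmentioned"
        matrix[e] = row
    rate = len(ambiguous) / (len(evaluators) * len(aspects))
    if rate > max_ambiguity:
        raise ValueError(f"ambiguity rate {rate:.1%} exceeds the {max_ambiguity:.0%} limit")
    return matrix, ledger, rate
```

Keeping the ledger separate from the matrix means neutral filling never hides whether a 0 came from a balanced judgment or from silence.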

Rounding policy:

  • Source scores are shown to one decimal because scoring was assigned on a tenth-point rubric.
  • Aggregate and variance values are shown to four decimals.
  • Ranking metrics are shown to six decimals.
  • Pairwise matrix entries are shown to four decimals.
  • Clean ordering uses full-precision values before rounding; if a full-precision tie occurred, the tie-break would be source filename lexicographic order. No full-precision ties occurred.
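The ordering rule can be expressed as a sort key. This is a minimal sketch that assumes metrics are held as unrounded floats keyed by evaluator code. `clean_order`, `shown`, and `FILES` are hypothetical names. The filename only matters on an exact tie.

```python
from typing import Dict, List

# Source filenames from the Source Set, used only as the tie-break key.
FILES: Dict[str, str] = {
    "DS": "deepseek-4-pro-range.md", "GE": "gemini-3.1-pro-range.md",
    "GL": "glm-5.1-pro-range.md", "G54": "gpt-5.4-range.md", "G55": "gpt-5.5-range.md",
    "KI": "kimi-2.6-range.md", "MI": "mimo-2.5-pro-range.md", "OP": "opus-4.7-range.md",
    "Q36": "qwen-3.6-range.md", "Q37": "qwen-3.7-max-range.md",
}


def clean_order(metric: Dict[str, float]) -> List[str]:
    """Rank ascending by the full-precision metric; break exact ties by source filename."""
    return sorted(metric, key=lambda code: (metric[code], FILES[code]))


def shown(value: float, decimals: int = 6) -> str:
    """Round only for display, after the ordering has been fixed."""
    return f"{value:.{decimals}f}"
```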

Aspect Taxonomy

Aspect Directional statement scored positively
A01 A moderate 4-6 phase graph is the target sweet spot.
A02 DeepSeek's guard -> prepare -> plan -> adjust topology is a strong base.
A03 GPT-5.4's tick -> bootstrap_gate -> prepare_context -> acquisition_gate -> llm_plan -> decision_policy topology is a strong base.
A04 GPT-5.5's detailed state/test/non-goal hygiene should be valued or borrowed.
A05 Two- or three-node minimal splits are insufficient as the final architecture.
A06 Eight-plus-node or atomized graphs are over-decomposed and costly.
A07 A single post-LLM decision_policy/adjust node is preferred.
A08 Post-LLM correction should be split into several graph nodes.
A09 Edge functions must remain pure and must not mutate state or decisions.
A10 Action loopbacks should return to a canonical top-of-loop entry node.
A11 A dedicated iteration tick/enter_iteration phase is important.
A12 Dedicated region/currency bootstrap nodes are beneficial.
A13 Existing public contracts should stay stable, including single ask_user interrupt and outer graph boundaries.
A14 State/checkpoint discipline should avoid _route-style hints and rich cached objects unless proven safe.
A15 LLM-failure or decision-origin state is valuable for calculator/reflect safety.
A16 Helper-first staged migration with green tests is required.
A17 _build_plan_output or equivalent output-state consolidation is valuable.
A18 Opus-style state ownership, dispatch, or naming documentation is valuable as a reference.
A19 Finish validation such as route_finish_check/finish_check is valuable.

Claim Ledger

The taxonomy was induced from these recurring evaluator claims:

Aspect Recurring claim sources
A01 DS, GE, G54, G55, KI, MI, OP, Q36, and Q37 converge on a moderate graph rather than status quo or graph explosion.
A02 DS, KI, MI, and Q37 rank DeepSeek very high; OP and GL treat it as a strong middle-ground source.
A03 GE, G54, G55, OP, Q36, and Q37 rank GPT-5.4 first or use it as the structural spine.
A04 GL ranks GPT-5.5 first; G54, G55, OP, and Q37 value its state/test/non-goal detail even when not adopting all nodes.
A05 Most reports mark GLM/Gemini/Mimo-style 2-3 node splits as useful first steps but incomplete final targets.
A06 Most reports reject Opus/Qwen/GPT-5.5/Kimi-level atomization when it creates too many checkpointed hops.
A07 DS, GE, G54, G55, KI, MI, OP, Q36, and Q37 favor one policy node or equivalent merged correction stage.
A08 GL partly values splitting correction concerns; most other reports reject multi-node correction chains for a first implementation.
A09 DS, G54, G55, KI, MI, OP, Q36, and Q37 explicitly reject mutating edge functions.
A10 DS, G54, G55, KI, MI, OP, Q36, and Q37 converge on tick, guard, pre_check, or equivalent loop entry.
A11 G54, G55, OP, Q36, and Q37 strongly emphasize a dedicated iteration phase; DS and MI borrow or rename toward it.
A12 KI supports explicit bootstrap nodes; many other reports prefer a merged bootstrap gate or existing ask_user flow.
A13 G54, G55, OP, Q36, and Q37 strongly protect public planner/outer graph/interrupt contracts.
A14 G54, G55, OP, Q36, and several others reject _route fields and unproven checkpoint caches; DS/KI/Q37 are more positive on recipes in state.
A15 GL, G54, G55, OP, Q36, and Q37 value llm_failed or decision_origin.
A16 All detailed reports reward helper extraction and staged migration.
A17 GL, G54, MI, Q36, and Q37 value GLM's _build_plan_output helper; several others do not mention it.
A18 DS, GL, OP, Q36, and Q37 value Opus's state ownership, dispatch, or naming reference material.
A19 Q36 and Q37 strongly value finish validation; OP and DS note the idea; many reports leave it neutral.

Full Scoring Matrix

Evaluator A01 A02 A03 A04 A05 A06 A07 A08 A09 A10 A11 A12 A13 A14 A15 A16 A17 A18 A19
DS +1.0 +1.0 +0.8 +0.5 +0.8 +0.9 +0.9 -0.8 +1.0 +1.0 +0.8 -0.3 +0.8 +0.3 +0.6 +1.0 +0.2 +0.8 +0.4
GE +1.0 +0.8 +1.0 0.0 +0.8 +1.0 +0.8 -0.8 +0.7 +0.4 +0.4 0.0 0.0 +0.3 0.0 +0.4 0.0 0.0 0.0
GL +0.8 +0.8 +0.7 +1.0 +1.0 +0.8 +0.8 +0.3 +0.8 +1.0 +0.8 +0.2 +0.8 +0.6 +1.0 +0.9 +0.8 +0.9 0.0
G54 +1.0 +0.4 +1.0 +0.8 +0.8 +0.9 +1.0 -0.6 +1.0 +1.0 +1.0 -0.7 +1.0 +1.0 +0.8 +1.0 +0.8 +0.4 0.0
G55 +1.0 +0.4 +1.0 +0.8 +0.8 +0.9 +0.9 -0.5 +1.0 +1.0 +1.0 -0.5 +1.0 +0.9 +0.9 +1.0 0.0 +0.4 0.0
KI +1.0 +1.0 +0.8 +0.3 +0.7 +1.0 +1.0 -0.7 +1.0 +1.0 +0.5 +0.8 +0.2 +0.2 +0.4 +1.0 0.0 +0.6 0.0
MI +1.0 +1.0 +0.8 +0.3 +0.5 +1.0 +1.0 -0.7 +1.0 +1.0 +0.6 -0.5 +0.7 +0.4 +0.5 +1.0 +0.8 +0.5 0.0
OP +1.0 +0.8 +1.0 +1.0 +0.9 +0.9 +1.0 -0.6 +1.0 +1.0 +1.0 -0.5 +1.0 +1.0 +1.0 +1.0 0.0 +0.8 +0.7
Q36 +1.0 +0.5 +1.0 +0.4 +0.5 +0.8 +1.0 -0.7 +1.0 +1.0 +1.0 -0.6 +1.0 +0.8 +0.8 +1.0 +0.8 +0.8 +1.0
Q37 +1.0 +0.9 +1.0 +0.8 +0.8 +0.9 +1.0 -0.7 +1.0 +1.0 +1.0 -0.7 +0.9 +0.2 +0.8 +1.0 +1.0 +0.8 +1.0

Aggregate Vector And Agreement

Aspect Mean score Variance Agreement note
A01 0.9800 0.0036 Very strong consensus for moderate graph sizing.
A02 0.7600 0.0524 Strong support for DeepSeek as one central topology source.
A03 0.9100 0.0129 Very strong support for GPT-5.4 as topology source.
A04 0.5900 0.1029 Moderate support; disagreement comes from GPT-5.5 being valued but often seen as node-heavy.
A05 0.7600 0.0224 Strong consensus that minimal splits are not enough as final shape.
A06 0.9100 0.0049 Very strong consensus against atomized graph shapes.
A07 0.9400 0.0064 Very strong consensus for one policy/adjust node.
A08 -0.5800 0.0936 Moderate rejection of multi-node correction chains.
A09 0.9500 0.0105 Very strong consensus for pure routing edges.
A10 0.9400 0.0324 Very strong consensus for canonical loop entry.
A11 0.8100 0.0489 Strong support for dedicated iteration lifecycle.
A12 -0.2800 0.2076 Highest disagreement; Kimi supports dedicated bootstrap nodes, many others reject them.
A13 0.7400 0.1144 Strong support for preserving contracts, with some shorter reports neutral.
A14 0.5700 0.0981 Moderate support for checkpoint caution; recipe caching creates disagreement.
A15 0.6800 0.0876 Strong support for LLM failure/provenance tracking.
A16 0.9300 0.0321 Very strong consensus for staged migration.
A17 0.4400 0.1664 Moderate support but many reports do not mention the helper.
A18 0.6000 0.0700 Moderate support for Opus reference material.
A19 0.3100 0.1689 Weak-to-moderate support; only some reports emphasize finish validation.
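The mean and variance columns can be spot-checked with a small Python sketch of the per-aspect aggregate. It uses w_i = 1/n and divides the variance by n (population variance), as variance_j does. `aggregate` is a hypothetical helper name.

```python
from typing import List, Tuple


def aggregate(column: List[float]) -> Tuple[float, float]:
    """Equal-weight mean and population variance (divide by n, as in variance_j)."""
    n = len(column)
    w = 1.0 / n  # w_i = 0.1 for ten evaluators
    mean = sum(w * s for s in column)
    variance = sum(w * (s - mean) ** 2 for s in column)
    return mean, variance


# A02 column in evaluator order DS, GE, GL, G54, G55, KI, MI, OP, Q36, Q37.
a02 = [1.0, 0.8, 0.8, 0.4, 0.4, 1.0, 1.0, 0.8, 0.5, 0.9]
m, v = aggregate(a02)
print(f"{m:.4f} {v:.4f}")  # 0.7600 0.0524
```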

Overall Opinion Generated From The Aggregate

The equal-weight overall evaluator opinion is strongly in favor of a moderate phased planner-graph refactor. The center of opinion most strongly supports a 4-6 phase topology, pure routing edges, a canonical loop-entry node, a small LLM-only planning node, a single post-LLM policy/adjustment node, and a staged helper-first migration. GPT-5.4 and DeepSeek are the two strongest topology anchors in the aggregate, with GPT-5.5 contributing state, test, non-goal, and LLM-failure hygiene. The clearest aggregate rejections are over-atomized graphs, mutation inside edge functions, and treating every post-LLM correction as a separate checkpointed node. The most disputed areas are dedicated region/currency bootstrap nodes, recipe caching in checkpoint state, _build_plan_output emphasis, and finish-validation placement.

Distance To Overall Opinion

Formula:

d_i = sqrt(sum_j alpha_j * (s_ij - mean_j)^2)
alpha_j = 1 / 19

Clean forced ordering:

| Rank | Evaluator | Distance to center | Gap from previous |
|---|---|---|---|
| 1 | DS | 0.131789 | 0.000000 |
| 2 | MI | 0.184961 | 0.053172 |
| 3 | G55 | 0.210513 | 0.025552 |
| 4 | G54 | 0.222900 | 0.012387 |
| 5 | OP | 0.233734 | 0.010834 |
| 6 | Q36 | 0.244949 | 0.011215 |
| 7 | Q37 | 0.263778 | 0.018829 |
| 8 | GL | 0.306937 | 0.043159 |
| 9 | KI | 0.344429 | 0.037492 |
| 10 | GE | 0.400657 | 0.056228 |

Closest opinion to the overall opinion: deepseek-4-pro-range.md.
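The top entry can be reproduced with a short Python sketch of d_i. Every score is a multiple of 0.1 and there are ten evaluators, so the aggregate means are exact at two decimals. That lets the published mean vector be used directly. `center_distance`, `MEAN`, and `DS` are illustrative names.

```python
import math
from typing import List

N_ASPECTS = 19
ALPHA = 1.0 / N_ASPECTS  # equal aspect-distance weight alpha_j


def center_distance(row: List[float], center: List[float]) -> float:
    """Weighted Euclidean distance d_i = sqrt(sum_j alpha_j * (s_ij - mean_j)^2)."""
    if len(row) != N_ASPECTS or len(center) != N_ASPECTS:
        raise ValueError("expected one value per aspect")
    return math.sqrt(sum(ALPHA * (s - m) ** 2 for s, m in zip(row, center)))


# Aggregate means A01..A19 from the Aggregate Vector table.
MEAN = [0.98, 0.76, 0.91, 0.59, 0.76, 0.91, 0.94, -0.58, 0.95, 0.94,
        0.81, -0.28, 0.74, 0.57, 0.68, 0.93, 0.44, 0.60, 0.31]
# DS row from the Full Scoring Matrix.
DS = [1.0, 1.0, 0.8, 0.5, 0.8, 0.9, 0.9, -0.8, 1.0, 1.0,
      0.8, -0.3, 0.8, 0.3, 0.6, 1.0, 0.2, 0.8, 0.4]

print(f"{center_distance(DS, MEAN):.6f}")  # 0.131789
```

For DS, the squared deviations sum to 0.33, so d = sqrt(0.33 / 19).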

Pairwise Distance Matrix

Formula:

d(i, k) = sqrt(sum_j alpha_j * (s_ij - s_kj)^2)
alpha_j = 1 / 19
From \ To DS GE GL G54 G55 KI MI OP Q36 Q37
DS 0.0000 0.4136 0.3763 0.3236 0.2819 0.3269 0.2152 0.2606 0.2902 0.2734
GE 0.4136 0.0000 0.5813 0.5370 0.4952 0.3449 0.4007 0.5685 0.5680 0.5835
GL 0.3763 0.5813 0.0000 0.3713 0.3728 0.4507 0.3940 0.3920 0.4431 0.4249
G54 0.3236 0.5370 0.3713 0.0000 0.1947 0.5282 0.2819 0.2884 0.2800 0.3332
G55 0.2819 0.4952 0.3728 0.1947 0.0000 0.4634 0.3332 0.2176 0.3364 0.3980
KI 0.3269 0.3449 0.4507 0.5282 0.4634 0.0000 0.3763 0.4995 0.5346 0.5385
MI 0.2152 0.4007 0.3940 0.2819 0.3332 0.3763 0.0000 0.3859 0.3195 0.3162
OP 0.2606 0.5685 0.3920 0.2884 0.2176 0.4995 0.3859 0.0000 0.2763 0.3154
Q36 0.2902 0.5680 0.4431 0.2800 0.3364 0.5346 0.3195 0.2763 0.0000 0.2103
Q37 0.2734 0.5835 0.4249 0.3332 0.3980 0.5385 0.3162 0.3154 0.2103 0.0000
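A single matrix entry can be checked with a Python sketch of d(i, k). It assumes both rows score the same aspects in the same order, and it sets alpha_j = 1 / (number of aspects), which is 1/19 here. `pairwise_distance` is a hypothetical name.

```python
import math
from typing import List


def pairwise_distance(a: List[float], b: List[float]) -> float:
    """d(i, k) = sqrt(sum_j alpha_j * (s_ij - s_kj)^2) with alpha_j = 1 / len(a)."""
    if len(a) != len(b):
        raise ValueError("rows must score the same aspects")
    alpha = 1.0 / len(a)  # 1/19 for this report
    return math.sqrt(sum(alpha * (x - y) ** 2 for x, y in zip(a, b)))


# DS and MI rows from the Full Scoring Matrix (A01..A19).
DS = [1.0, 1.0, 0.8, 0.5, 0.8, 0.9, 0.9, -0.8, 1.0, 1.0,
      0.8, -0.3, 0.8, 0.3, 0.6, 1.0, 0.2, 0.8, 0.4]
MI = [1.0, 1.0, 0.8, 0.3, 0.5, 1.0, 1.0, -0.7, 1.0, 1.0,
      0.6, -0.5, 0.7, 0.4, 0.5, 1.0, 0.8, 0.5, 0.0]

print(f"{pairwise_distance(DS, MI):.4f}")  # 0.2152
```

The distance is symmetric and zero on the diagonal, which is why the matrix mirrors across its diagonal.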

Medoid Ranking

Formula:

medoid_score_i = sum_k 0.1 * d(i, k)

Clean forced ordering:

| Rank | Evaluator | Equal-weight pairwise total | Gap from previous |
|---|---|---|---|
| 1 | DS | 0.276161 | 0.000000 |
| 2 | MI | 0.302303 | 0.026142 |
| 3 | G55 | 0.309327 | 0.007024 |
| 4 | G54 | 0.313841 | 0.004514 |
| 5 | OP | 0.320413 | 0.006572 |
| 6 | Q36 | 0.325835 | 0.005422 |
| 7 | Q37 | 0.339348 | 0.013513 |
| 8 | GL | 0.380641 | 0.041293 |
| 9 | KI | 0.406289 | 0.025648 |
| 10 | GE | 0.449273 | 0.042984 |

Medoid: deepseek-4-pro-range.md.
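The medoid ordering can be recovered directly from the published pairwise matrix, as in the Python sketch below. The published entries are rounded to four decimals, so each score can be off by at most 0.1 × 9 × 0.00005 = 0.000045. That is far below the smallest gap in the table (0.004514), so the sketch reproduces the ordering, while the scores agree only to about 1e-4. `medoid_scores`, `CODES`, and `D` are illustrative names.

```python
from typing import Dict, List

CODES = ["DS", "GE", "GL", "G54", "G55", "KI", "MI", "OP", "Q36", "Q37"]
# Rows of the Pairwise Distance Matrix, as published (rounded to four decimals).
D: List[List[float]] = [
    [0.0000, 0.4136, 0.3763, 0.3236, 0.2819, 0.3269, 0.2152, 0.2606, 0.2902, 0.2734],
    [0.4136, 0.0000, 0.5813, 0.5370, 0.4952, 0.3449, 0.4007, 0.5685, 0.5680, 0.5835],
    [0.3763, 0.5813, 0.0000, 0.3713, 0.3728, 0.4507, 0.3940, 0.3920, 0.4431, 0.4249],
    [0.3236, 0.5370, 0.3713, 0.0000, 0.1947, 0.5282, 0.2819, 0.2884, 0.2800, 0.3332],
    [0.2819, 0.4952, 0.3728, 0.1947, 0.0000, 0.4634, 0.3332, 0.2176, 0.3364, 0.3980],
    [0.3269, 0.3449, 0.4507, 0.5282, 0.4634, 0.0000, 0.3763, 0.4995, 0.5346, 0.5385],
    [0.2152, 0.4007, 0.3940, 0.2819, 0.3332, 0.3763, 0.0000, 0.3859, 0.3195, 0.3162],
    [0.2606, 0.5685, 0.3920, 0.2884, 0.2176, 0.4995, 0.3859, 0.0000, 0.2763, 0.3154],
    [0.2902, 0.5680, 0.4431, 0.2800, 0.3364, 0.5346, 0.3195, 0.2763, 0.0000, 0.2103],
    [0.2734, 0.5835, 0.4249, 0.3332, 0.3980, 0.5385, 0.3162, 0.3154, 0.2103, 0.0000],
]


def medoid_scores(codes: List[str], dist: List[List[float]], weight: float = 0.1) -> Dict[str, float]:
    """medoid_score_i = sum_k weight * d(i, k); the d(i, i) = 0 term adds nothing."""
    return {c: sum(weight * d for d in row) for c, row in zip(codes, dist)}


scores = medoid_scores(CODES, D)
order = sorted(scores, key=scores.get)  # ascending: most central first
print(order[0], f"{scores[order[0]]:.4f}")  # DS 0.2762
```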

Full Scoring Evidence Ledger

Each row lists the aspect, its score, its mention status (explicit, implied, or unmentioned), and an evidence note. Unmentioned means the evaluator report did not make a clear equivalent judgment for that aspect; the score is therefore a neutral 0 under the user-selected override.

DS: deepseek-4-pro-range.md

Aspect Score Status Evidence note
A01 +1.0 explicit Calls the 4-node pipeline the "Goldilocks" balance and later recommends a 5-node synthesis.
A02 +1.0 explicit Ranks DeepSeek #1 and adopts its guard/prepare/plan/adjust topology.
A03 +0.8 explicit Ranks GPT-5.4 #2 and adopts its design principles and tick discipline.
A04 +0.5 explicit Values GPT-5.5 organization, consequences, and decision_origin, but ranks it below top three.
A05 +0.8 explicit Treats GLM, Mimo, and Gemini minimal splits as useful but incomplete.
A06 +0.9 explicit Criticizes Opus/Qwen-style graph expansion and _route patterns as over-complex.
A07 +0.9 explicit Says one decision_policy or adjust node is enough and rejects corrector explosion.
A08 -0.8 explicit Says splitting every corrector adds graph noise without enough benefit.
A09 +1.0 explicit Explicitly rejects redirects in edge functions as mutation outside nodes.
A10 +1.0 explicit Identifies canonical loop-entry as an adopted cross-cutting pattern.
A11 +0.8 explicit Adopts GPT-5.4's separate tick as an improvement over a combined guard.
A12 -0.3 implied Notes separate bootstrap nodes can improve clarity, but final synthesis keeps them inside guard/bootstrap.
A13 +0.8 explicit Adopts anti-pattern constraints such as keeping ask_user as the sole interrupt.
A14 +0.3 explicit Rejects _route; mixed because it also recommends recipes in state with serialization verification.
A15 +0.6 explicit Recommends borrowing GPT-5.5's decision_origin.
A16 +1.0 explicit Treats helper extraction and staged verification as central to the best proposal.
A17 +0.2 explicit Notes GLM's _build_plan_output helper as useful but not central.
A18 +0.8 explicit Adopts Opus's state ownership table, naming convention, and dispatch idea.
A19 +0.4 explicit Notes Qwen-3.6's route_finish_check as a good safety idea but does not center it.

GE: gemini-3.1-pro-range.md

Aspect Score Status Evidence note
A01 +1.0 explicit Recommends a 5-6 node phased pipeline as the best combination.
A02 +0.8 explicit Names DeepSeek as runner-up and uses it as an architectural nod.
A03 +1.0 explicit Names GPT-5.4 the winner and primary recommendation.
A04 0.0 unmentioned Does not make a clear specific judgment about GPT-5.5's hygiene contributions.
A05 +0.8 explicit Says conservative 2-3 node splits leave too much bundled.
A06 +1.0 explicit Rejects Qwen-3.7-style 10-node decomposition as graph noise and chatty state.
A07 +0.8 explicit Says policy should be consolidated in one decision_policy node.
A08 -0.8 explicit Rejects turning every policy rule into a node.
A09 +0.7 explicit Says mutations belong in designated nodes and edges should read state and route.
A10 +0.4 explicit Notes loop-back edge handling as a key concern.
A11 +0.4 implied Recommends tick_and_guard, implying some iteration-entry support without separating it strongly.
A12 0.0 unmentioned No clear support or rejection of dedicated bootstrap nodes.
A13 0.0 unmentioned No clear public-contract or single-interrupt claim.
A14 +0.3 implied Supports idempotent state mutations but does not discuss route fields or rich caches.
A15 0.0 unmentioned No clear LLM-failure or decision-origin claim.
A16 +0.4 explicit Mentions testability and isolation, but not a detailed staged migration.
A17 0.0 unmentioned No _build_plan_output or equivalent helper claim.
A18 0.0 unmentioned No clear Opus state-ownership or dispatch-reference claim.
A19 0.0 unmentioned No finish-validation claim.

GL: glm-5.1-pro-range.md

Aspect Score Status Evidence note
A01 +0.8 explicit Final recommendation uses a moderate graph shape after selective borrowing.
A02 +0.8 explicit Ranks DeepSeek runner-up and borrows the adjust starting point.
A03 +0.7 explicit Values GPT-5.4 anti-patterns and tick, though ranks it fourth.
A04 +1.0 explicit Ranks GPT-5.5 best overall and uses it as base.
A05 +1.0 explicit Says Gemini/Mimo/GLM-level minimal shapes are incomplete final targets.
A06 +0.8 explicit Critiques Qwen-3.7 as over-engineered and Opus as slightly over-decomposed.
A07 +0.8 explicit Recommends starting with a single correct_policy node.
A08 +0.3 explicit Also praises GPT-5.5's separation of normalize/retry/maybe-calculate concerns.
A09 +0.8 explicit Adopts anti-pattern rules forbidding mutation in edges.
A10 +1.0 explicit Final graph routes all loopbacks through enter_iteration.
A11 +0.8 explicit Values tick/enter_iteration as a distinct lifecycle node.
A12 +0.2 implied Credits dedicated region/currency nodes with visibility, but the final recommendation folds them into a gate.
A13 +0.8 explicit Preserves non-goals and public planner boundaries from top proposals.
A14 +0.6 explicit Warns about _route and checkpoint concerns, but not as strongly as G54/G55/OP.
A15 +1.0 explicit Strongly values decision_origin or llm_failed.
A16 +0.9 explicit Emphasizes staged migration and helper extraction.
A17 +0.8 explicit Treats _build_plan_output as a useful standalone helper.
A18 +0.9 explicit Borrows Opus naming and values state ownership documentation.
A19 0.0 unmentioned No clear finish-validation adoption.

G54: gpt-5.4-range.md

Aspect Score Status Evidence note
A01 +1.0 explicit Recommends GPT-5.4's moderate six-phase target graph.
A02 +0.4 explicit Treats DeepSeek as a useful mental model but ranks it fourth.
A03 +1.0 explicit Ranks GPT-5.4 first and recommends it as end-state architecture.
A04 +0.8 explicit Ranks GPT-5.5 second and uses its migration/state/test discipline.
A05 +0.8 explicit Says GLM/Gemini/Mimo are safer first slices but weak final targets.
A06 +0.9 explicit Rejects atomized Opus/Qwen shapes and checkpoint-boundary excess.
A07 +1.0 explicit Recommends a single decision_policy node.
A08 -0.6 explicit Says GPT-5.5's normalize/retry/maybe-calculate split is one split too far.
A09 +1.0 explicit Explicitly rejects policy in edge functions.
A10 +1.0 explicit Requires all action loopbacks to return to tick.
A11 +1.0 explicit Makes tick the first core target node.
A12 -0.7 explicit Rejects separate bootstrap interrupt nodes unless existing flow becomes insufficient.
A13 +1.0 explicit Strongly preserves ask_user, outer graph, thread naming, and public surfaces.
A14 +1.0 explicit Rejects _route patterns and unproven rich state caches.
A15 +0.8 explicit Borrows llm_failed/origin handling from GPT-5.5.
A16 +1.0 explicit Gives staged migration sequence and helper-first guidance.
A17 +0.8 explicit Borrows GLM's _build_plan_output helper.
A18 +0.4 explicit Keeps Opus as a later checklist but not as graph shape.
A19 0.0 unmentioned No clear finish-validation claim.

G55: gpt-5.5-range.md

Aspect Score Status Evidence note
A01 +1.0 explicit Recommends a moderate phased graph rather than detailed atomization.
A02 +0.4 explicit Treats DeepSeek as a close moderate split but below GPT-5.4/GPT-5.5.
A03 +1.0 explicit Ranks GPT-5.4 as best implementation spine.
A04 +0.8 explicit Ranks GPT-5.5 second and mines it for state/test details.
A05 +0.8 explicit Marks Gemini/Mimo as too coarse and under-specified.
A06 +0.9 explicit Rejects fully atomized Opus/Qwen shapes for first implementation.
A07 +0.9 explicit Uses one decision_policy in the suggested target graph.
A08 -0.5 explicit Says GPT-5.5's three post-LLM nodes are good long-term seams but too much first-step risk.
A09 +1.0 explicit States edge functions should inspect only and mutations belong in nodes.
A10 +1.0 explicit All loopbacks return to tick.
A11 +1.0 explicit tick is the start node in the recommended graph.
A12 -0.5 explicit Says require-region/currency may be conceptual but should still use existing ask_user.
A13 +1.0 explicit Strongly preserves single interrupt and outer graph contracts.
A14 +0.9 explicit Rejects rich cached recipes and informal transient state unless serializer-safe.
A15 +0.9 explicit Values llm_failed or decision_origin for calculator-loop safety.
A16 +1.0 explicit Provides helper-first implementation guidance.
A17 0.0 unmentioned No clear _build_plan_output claim.
A18 +0.4 explicit Treats Opus as checklist material, not target graph.
A19 0.0 unmentioned No finish-validation claim.

KI: kimi-2.6-range.md

Aspect Score Status Evidence note
A01 +1.0 explicit Recommends a 5-node hybrid as the VentureScope pipeline.
A02 +1.0 explicit Ranks DeepSeek #1 and adopts its primary architecture.
A03 +0.8 explicit Ranks GPT-5.4 #2 and uses its staged migration.
A04 +0.3 explicit Notes useful GPT-5.5 ideas but ranks it as over-engineered.
A05 +0.7 explicit Says GLM/Mimo/Gemini are safe or decent but not sufficient.
A06 +1.0 explicit Strongly rejects Opus/GPT-5.5/Qwen-3.7 over-decomposition.
A07 +1.0 explicit Uses a single adjust node for policy corrections.
A08 -0.7 explicit Says the separate correction chains should be collapsed into helpers.
A09 +1.0 explicit Rejects Qwen-3.6 redirect-in-edge approach.
A10 +1.0 explicit Final topology loops action nodes back to guard.
A11 +0.5 explicit Values tick but accepts a combined guard in the final shape.
A12 +0.8 explicit Strongly praises dedicated ask_region/ask_currency visibility.
A13 +0.2 implied Does not strongly preserve single ask_user; bootstrap explicitness pulls against this aspect.
A14 +0.2 explicit Rejects _route, but supports recipes in state.
A15 +0.4 explicit Includes llm_failed in the hybrid sources.
A16 +1.0 explicit Provides seven-stage staged migration and praises zero-risk helper extraction.
A17 0.0 unmentioned No clear _build_plan_output claim.
A18 +0.6 explicit Values Opus dispatch/naming ideas as partial borrowings.
A19 0.0 unmentioned No clear finish-validation claim.

MI: mimo-2.5-pro-range.md

Aspect Score Status Evidence note
A01 +1.0 explicit Identifies DeepSeek/GPT-5.4-sized topology as the best combination.
A02 +1.0 explicit Ranks DeepSeek #1 and recommends its graph structure.
A03 +0.8 explicit Ranks GPT-5.4 #2 and borrows migration/anti-pattern guidance.
A04 +0.3 explicit Values GPT-5.5 consequences and test guidance but sees it as over-decomposed.
A05 +0.5 explicit Minimal splits are useful but criticized for incomplete post-LLM treatment.
A06 +1.0 explicit Strongly rejects Opus/Qwen over-decomposition.
A07 +1.0 explicit Calls DeepSeek's adjust the strongest differentiator.
A08 -0.7 explicit Rejects splitting corrections into many nodes.
A09 +1.0 explicit Rejects route-function mutation as an anti-pattern.
A10 +1.0 explicit Says loopback to guard is correct.
A11 +0.6 explicit Recommends renaming DeepSeek's guard to tick.
A12 -0.5 explicit Says separate region/currency nodes add unnecessary complexity.
A13 +0.7 explicit Adds GPT-5.4 anti-pattern warnings and preserves single bootstrap concern.
A14 +0.4 explicit Criticizes _route, but accepts DeepSeek's recipe state as sensible.
A15 +0.5 explicit Notes llm_failed or decision_origin as sensible.
A16 +1.0 explicit Recommends helper extraction and staged migration.
A17 +0.8 explicit Explicitly includes GLM's _build_plan_output helper.
A18 +0.5 explicit Values Opus naming and ownership ideas but rejects the topology.
A19 0.0 unmentioned No clear finish-validation claim.

OP: opus-4.7-range.md

Aspect Score Status Evidence note
A01 +1.0 explicit Recommends GPT-5.4-level middle topology as the baseline.
A02 +0.8 explicit Ranks DeepSeek #3 and harvests its self-loop idea.
A03 +1.0 explicit Ranks GPT-5.4 #1 and uses its six-phase pipeline.
A04 +1.0 explicit Ranks GPT-5.5 #2 and strongly borrows its llm_failed and state-cache caution.
A05 +0.9 explicit Says Mimo/Gemini are under-decomposed and do not solve the problem.
A06 +0.9 explicit Calls Opus itself operationally heavy and rejects 13-node chains.
A07 +1.0 explicit Recommends one decision_policy node, not a corrector chain.
A08 -0.6 explicit Rejects splitting the corrector chain despite recognizing its theoretical clarity.
A09 +1.0 explicit Calls redirect-in-edge the biggest anti-pattern.
A10 +1.0 explicit Lists top-node loopback as universal agreement.
A11 +1.0 explicit Uses tick as first recommended phase.
A12 -0.5 explicit Rejects split bootstrap nodes as graph-decorative.
A13 +1.0 explicit Strongly preserves single ask_user interrupt and outer graph scope.
A14 +1.0 explicit Rejects recipes in state unless serializer-safe and rejects state hint pollution.
A15 +1.0 explicit Requires llm_failed or decision_origin.
A16 +1.0 explicit Provides a six-commit green migration.
A17 0.0 unmentioned No clear _build_plan_output claim.
A18 +0.8 explicit Values state ownership and write-surface discipline as reference material.
A19 +0.7 explicit Calls route_finish_check a sharp idea, though not the central recommendation.

Q36: qwen-3.6-range.md

Aspect Score Status Evidence note
A01 +1.0 explicit Recommends a 5-6 node target count.
A02 +0.5 explicit Treats DeepSeek as solid middle-ground but ranks it fifth.
A03 +1.0 explicit Ranks GPT-5.4 best overall and uses its target nodes.
A04 +0.4 explicit Praises GPT-5.5 hygiene but ranks it low for complexity.
A05 +0.5 explicit Ranks GLM highly as a first PR, but says the long-term shape still needs more decomposition.
A06 +0.8 explicit Rejects 10+ and 14-node proposals for checkpoint overhead and graph noise.
A07 +1.0 explicit Recommends one decision_policy node.
A08 -0.7 explicit Explicitly says one policy node, not five corrector nodes.
A09 +1.0 explicit States no state mutation in edge functions as a design rule.
A10 +1.0 explicit Requires loopbacks to tick.
A11 +1.0 explicit Makes tick the first target node.
A12 -0.6 explicit Merges separate region/currency nodes into one bootstrap gate.
A13 +1.0 explicit Keeps ask_user single interrupt and avoids outer graph changes.
A14 +0.8 explicit Rejects _route and checkpoint pollution, though does not dwell on recipes as much as OP/G54.
A15 +0.8 explicit Includes llm_failed from GPT-5.5.
A16 +1.0 explicit Gives combined staged migration.
A17 +0.8 explicit Includes _build_plan_output helper.
A18 +0.8 explicit Ranks Opus high for mapping/state ownership reference.
A19 +1.0 explicit Includes finish_check in the recommended graph.

Q37: qwen-3.7-max-range.md

Aspect Score Status Evidence note
A01 +1.0 explicit Recommends a 5-node pipeline plus optional finish check.
A02 +0.9 explicit Ranks DeepSeek #2 and adopts its pragmatism.
A03 +1.0 explicit Ranks GPT-5.4 #1 and uses it as target topology.
A04 +0.8 explicit Ranks GPT-5.5 #3 and borrows llm_failed/non-goal detail.
A05 +0.8 explicit Treats Gemini/Mimo minimal proposals as insufficient.
A06 +0.9 explicit Rejects Opus/Qwen-3.7-style full decomposition as too many nodes.
A07 +1.0 explicit Uses one decision_policy node.
A08 -0.7 explicit Says multi-node post-LLM correction adds unnecessary checkpoint writes.
A09 +1.0 explicit Explicitly rejects redirect-in-edge mutation.
A10 +1.0 explicit Target graph loops all action paths back to tick.
A11 +1.0 explicit tick is the first target node.
A12 -0.7 explicit Rejects separate ask_region/ask_currency nodes as overkill.
A13 +0.9 explicit Keeps ask_user and public graph contracts stable.
A14 +0.2 explicit Rejects _route but supports recipes in state, creating a mixed checkpoint-discipline position.
A15 +0.8 explicit Includes GPT-5.5's llm_failed state field.
A16 +1.0 explicit Gives eight-step green migration.
A17 +1.0 explicit Includes _build_plan_output as first migration step.
A18 +0.8 explicit Values Opus mapping and state ownership table as reference material.
A19 +1.0 explicit Includes finish_check in the resulting target graph.

Normalization And Ambiguity Audit

| Check | Result |
|---|---|
| Evaluator count | 10 |
| Aspect count | 19 |
| Total evaluator-aspect cells | 190 |
| Unresolved ambiguous cells | 0 |
| Ambiguity rate | 0 / 190 = 0.0% |
| Clear-equivalent normalization | Applied only to equivalent statements such as decision_policy, adjust, enforce_policy, and correct_policy when they meant a single post-LLM policy stage. |
| Kept separate because equivalence was not clear | Dedicated bootstrap nodes vs. bootstrap gate; recipe cache vs. checkpoint-state caution; finish validation vs. normal finish node; multi-node correction split vs. single policy node. |
| Neutral cells | All 0.0 cells are either unmentioned or balanced; unmentioned cells use the user-selected neutral override. |

Reproducibility Snippet

The numeric tables above were computed from the score matrix using these definitions:

M[evaluator][aspect] = score in [-1, 1]
w_i = 0.1
alpha_j = 1 / 19
c_ij = 1 after neutral fill
mean_j = sum_i M[i][j] / 10
variance_j = sum_i (M[i][j] - mean_j)^2 / 10
center_distance_i = sqrt(sum_j alpha_j * (M[i][j] - mean_j)^2)
pairwise_distance_i_k = sqrt(sum_j alpha_j * (M[i][j] - M[k][j])^2)
medoid_score_i = sum_k 0.1 * pairwise_distance_i_k
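The definitions above can be turned into a runnable Python script. This is a sketch under stated assumptions, not the tooling that produced the tables. The matrix is transcribed from the Full Scoring Matrix, the names (`M`, `FILES`, `distance`, `ranking`) are hypothetical, and exact ties fall back to source filename as described in the rounding policy. Running it should print both clean orderings with six-decimal scores.

```python
import math
from typing import Dict, List

# Full Scoring Matrix, aspects A01..A19 in column order.
M: Dict[str, List[float]] = {
    "DS":  [1.0, 1.0, 0.8, 0.5, 0.8, 0.9, 0.9, -0.8, 1.0, 1.0, 0.8, -0.3, 0.8, 0.3, 0.6, 1.0, 0.2, 0.8, 0.4],
    "GE":  [1.0, 0.8, 1.0, 0.0, 0.8, 1.0, 0.8, -0.8, 0.7, 0.4, 0.4, 0.0, 0.0, 0.3, 0.0, 0.4, 0.0, 0.0, 0.0],
    "GL":  [0.8, 0.8, 0.7, 1.0, 1.0, 0.8, 0.8, 0.3, 0.8, 1.0, 0.8, 0.2, 0.8, 0.6, 1.0, 0.9, 0.8, 0.9, 0.0],
    "G54": [1.0, 0.4, 1.0, 0.8, 0.8, 0.9, 1.0, -0.6, 1.0, 1.0, 1.0, -0.7, 1.0, 1.0, 0.8, 1.0, 0.8, 0.4, 0.0],
    "G55": [1.0, 0.4, 1.0, 0.8, 0.8, 0.9, 0.9, -0.5, 1.0, 1.0, 1.0, -0.5, 1.0, 0.9, 0.9, 1.0, 0.0, 0.4, 0.0],
    "KI":  [1.0, 1.0, 0.8, 0.3, 0.7, 1.0, 1.0, -0.7, 1.0, 1.0, 0.5, 0.8, 0.2, 0.2, 0.4, 1.0, 0.0, 0.6, 0.0],
    "MI":  [1.0, 1.0, 0.8, 0.3, 0.5, 1.0, 1.0, -0.7, 1.0, 1.0, 0.6, -0.5, 0.7, 0.4, 0.5, 1.0, 0.8, 0.5, 0.0],
    "OP":  [1.0, 0.8, 1.0, 1.0, 0.9, 0.9, 1.0, -0.6, 1.0, 1.0, 1.0, -0.5, 1.0, 1.0, 1.0, 1.0, 0.0, 0.8, 0.7],
    "Q36": [1.0, 0.5, 1.0, 0.4, 0.5, 0.8, 1.0, -0.7, 1.0, 1.0, 1.0, -0.6, 1.0, 0.8, 0.8, 1.0, 0.8, 0.8, 1.0],
    "Q37": [1.0, 0.9, 1.0, 0.8, 0.8, 0.9, 1.0, -0.7, 1.0, 1.0, 1.0, -0.7, 0.9, 0.2, 0.8, 1.0, 1.0, 0.8, 1.0],
}
# Source filenames, used only to break exact ties.
FILES: Dict[str, str] = {
    "DS": "deepseek-4-pro-range.md", "GE": "gemini-3.1-pro-range.md",
    "GL": "glm-5.1-pro-range.md", "G54": "gpt-5.4-range.md", "G55": "gpt-5.5-range.md",
    "KI": "kimi-2.6-range.md", "MI": "mimo-2.5-pro-range.md", "OP": "opus-4.7-range.md",
    "Q36": "qwen-3.6-range.md", "Q37": "qwen-3.7-max-range.md",
}

EVALUATORS = list(M)
N_ASPECTS = 19
W = 1.0 / len(EVALUATORS)  # w_i = 0.1
ALPHA = 1.0 / N_ASPECTS    # alpha_j = 1/19; c_ij = 1 everywhere after neutral fill

for code, row in M.items():
    if len(row) != N_ASPECTS or any(not -1.0 <= s <= 1.0 for s in row):
        raise ValueError(f"malformed row for {code}")

mean = [sum(W * M[e][j] for e in EVALUATORS) for j in range(N_ASPECTS)]
variance = [sum(W * (M[e][j] - mean[j]) ** 2 for e in EVALUATORS) for j in range(N_ASPECTS)]


def distance(a: List[float], b: List[float]) -> float:
    """Weighted Euclidean distance with equal aspect weights alpha_j."""
    return math.sqrt(sum(ALPHA * (x - y) ** 2 for x, y in zip(a, b)))


center = {e: distance(M[e], mean) for e in EVALUATORS}
pairwise = {e: {k: distance(M[e], M[k]) for k in EVALUATORS} for e in EVALUATORS}
medoid = {e: sum(W * pairwise[e][k] for k in EVALUATORS) for e in EVALUATORS}


def ranking(metric: Dict[str, float]) -> List[str]:
    """Ascending full-precision order; exact ties fall back to source filename."""
    return sorted(metric, key=lambda e: (metric[e], FILES[e]))


if __name__ == "__main__":
    for label, metric in (("center", center), ("medoid", medoid)):
        for rank, e in enumerate(ranking(metric), start=1):
            print(f"{label:<6} {rank:>2} {e:<3} {metric[e]:.6f}")
```

Ordering is decided on the unrounded floats before any display rounding, which matches the clean-ordering rule.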

Final Answer

By closest opinion to the equal-weight overall vector, the ranking is:

  1. DS
  2. MI
  3. G55
  4. G54
  5. OP
  6. Q36
  7. Q37
  8. GL
  9. KI
  10. GE

By medoid centrality, the ranking is:

  1. DS
  2. MI
  3. G55
  4. G54
  5. OP
  6. Q36
  7. Q37
  8. GL
  9. KI
  10. GE

Both calculations select deepseek-4-pro-range.md as the most representative evaluator opinion in this set.