Evaluator Consensus Ranking Report

Scope

This report applies analyse-evaluation-method.md to the 10 evaluator reports in this directory. It ranks evaluator reports by:

distance to the equal-weight overall opinion;
medoid centrality under pairwise evaluator distances.

This is an evaluator-opinion synthesis only. It does not add an independent architecture judgment about the planner refactor.

Source Set

Code	Evaluator report
DS	`deepseek-4-pro-range.md`
GE	`gemini-3.1-pro-range.md`
GL	`glm-5.1-pro-range.md`
G54	`gpt-5.4-range.md`
G55	`gpt-5.5-range.md`
KI	`kimi-2.6-range.md`
MI	`mimo-2.5-pro-range.md`
OP	`opus-4.7-range.md`
Q36	`qwen-3.6-range.md`
Q37	`qwen-3.7-max-range.md`

analyse-evaluation-method.md is the method source, not an evaluator.

Result Summary

Ranking type	Winner	Score	Interpretation
Closest to equal-weight overall opinion	DS	0.131789	Lowest weighted Euclidean distance from the aggregate vector.
Medoid	DS	0.276161	Lowest equal-weight total pairwise distance to all evaluator reports.

Both methods select deepseek-4-pro-range.md as the most central evaluator opinion.

Scoring Method

The method file requires a formal feature-space representation before aggregation. This report uses 19 fine-grained semantic aspects induced from recurring evaluator claims.

Rules:

Evaluator weights are equal: w_i = 0.1.
Aspect-distance weights are equal: alpha_j = 1 / 19.
Scores use [-1, 1].
+1 means strong support for the aspect statement.
0 means neutral, balanced, or not mentioned.
-1 means strong rejection of the aspect statement.
Only clearly equivalent judgments are normalized into the same aspect.
If equivalence is unclear, it remains a separate aspect.
Unmentioned aspects are scored 0 by the user-selected override.
After neutral filling, every matrix cell is present and c_ij = 1.
Mention status is tracked separately in the evidence ledger.
Unresolved ambiguous cells: 0 / 190 = 0.0%, which satisfies the <= 3% requirement.

Rounding policy:

Source scores are shown to one decimal because scoring was assigned on a tenth-point rubric.
Aggregate and variance values are shown to four decimals.
Ranking metrics are shown to six decimals.
Pairwise matrix entries are shown to four decimals.
Clean ordering sorts on full-precision values before rounding. A full-precision tie would be broken by lexicographic order of the source filename; no full-precision ties occurred.
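The neutral-fill and tie-break rules above can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names `neutral_fill` and `forced_order` are hypothetical and do not come from the method file, and the mention flag is a simplified stand-in for the explicit/implied/unmentioned status kept in the evidence ledger.

```python
from typing import Dict, List, Optional, Tuple

ASPECTS: List[str] = [f"A{n:02d}" for n in range(1, 20)]  # A01 .. A19


def neutral_fill(
    raw: Dict[str, Optional[float]],
) -> Tuple[Dict[str, float], Dict[str, bool]]:
    """Return (scores, mentioned) with every aspect present.

    Aspects missing from `raw` (or set to None) get the neutral score 0.0.
    Their mention flag is False, so an unmentioned 0.0 stays distinguishable
    from a deliberately balanced 0.0.
    """
    scores: Dict[str, float] = {}
    mentioned: Dict[str, bool] = {}
    for aspect in ASPECTS:
        value = raw.get(aspect)
        if value is not None and not -1.0 <= value <= 1.0:
            raise ValueError(f"{aspect}: score {value} is outside [-1, 1]")
        scores[aspect] = 0.0 if value is None else value
        mentioned[aspect] = value is not None
    return scores, mentioned


def forced_order(metric: Dict[str, float]) -> List[str]:
    """Order filenames by full-precision metric, then lexicographically."""
    return sorted(metric, key=lambda name: (metric[name], name))


# GE's twelve scored aspects, copied from the scoring matrix in this report.
ge_raw = {
    "A01": 1.0, "A02": 0.8, "A03": 1.0, "A05": 0.8, "A06": 1.0, "A07": 0.8,
    "A08": -0.8, "A09": 0.7, "A10": 0.4, "A11": 0.4, "A14": 0.3, "A16": 0.4,
}
ge_scores, ge_mentioned = neutral_fill(ge_raw)
print(len(ge_scores), sum(not flag for flag in ge_mentioned.values()))
```

Sorting on the tuple `(metric, filename)` applies the filename tie-break only when two full-precision values are exactly equal, which matches the stated policy without any special-casing.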

Aspect Taxonomy

Aspect	Directional statement scored positively
A01	A moderate 4-6 phase graph is the target sweet spot.
A02	DeepSeek's `guard -> prepare -> plan -> adjust` topology is a strong base.
A03	GPT-5.4's `tick -> bootstrap_gate -> prepare_context -> acquisition_gate -> llm_plan -> decision_policy` topology is a strong base.
A04	GPT-5.5's detailed state/test/non-goal hygiene should be valued or borrowed.
A05	Two- or three-node minimal splits are insufficient as the final architecture.
A06	Eight-plus-node or atomized graphs are over-decomposed and costly.
A07	A single post-LLM `decision_policy`/`adjust` node is preferred.
A08	Post-LLM correction should be split into several graph nodes.
A09	Edge functions must remain pure and must not mutate state or decisions.
A10	Action loopbacks should return to a canonical top-of-loop entry node.
A11	A dedicated iteration `tick`/`enter_iteration` phase is important.
A12	Dedicated region/currency bootstrap nodes are beneficial.
A13	Existing public contracts should stay stable, including single `ask_user` interrupt and outer graph boundaries.
A14	State/checkpoint discipline should avoid `_route`-style hints and rich cached objects unless proven safe.
A15	LLM-failure or decision-origin state is valuable for calculator/reflect safety.
A16	Helper-first staged migration with green tests is required.
A17	`_build_plan_output` or equivalent output-state consolidation is valuable.
A18	Opus-style state ownership, dispatch, or naming documentation is valuable as a reference.
A19	Finish validation such as `route_finish_check`/`finish_check` is valuable.

Claim Ledger

The taxonomy was induced from these recurring evaluator claims:

Aspect	Recurring claim sources
A01	DS, GE, G54, G55, KI, MI, OP, Q36, and Q37 converge on a moderate graph rather than status quo or graph explosion.
A02	DS, KI, MI, and Q37 rank DeepSeek very high; OP and GL treat it as a strong middle-ground source.
A03	GE, G54, G55, OP, Q36, and Q37 rank GPT-5.4 first or use it as the structural spine.
A04	GL ranks GPT-5.5 first; G54, G55, OP, and Q37 value its state/test/non-goal detail even when not adopting all nodes.
A05	Most reports mark GLM/Gemini/Mimo-style 2-3 node splits as useful first steps but incomplete final targets.
A06	Most reports reject Opus/Qwen/GPT-5.5/Kimi-level atomization when it creates too many checkpointed hops.
A07	DS, GE, G54, G55, KI, MI, OP, Q36, and Q37 favor one policy node or equivalent merged correction stage.
A08	GL partly values splitting correction concerns; most other reports reject multi-node correction chains for first implementation.
A09	DS, G54, G55, KI, MI, OP, Q36, and Q37 explicitly reject mutating edge functions.
A10	DS, G54, G55, KI, MI, OP, Q36, and Q37 converge on `tick`, `guard`, `pre_check`, or equivalent loop entry.
A11	G54, G55, OP, Q36, and Q37 strongly emphasize a dedicated iteration phase; DS and MI borrow or rename toward it.
A12	KI supports explicit bootstrap nodes; many other reports prefer a merged bootstrap gate or existing `ask_user` flow.
A13	G54, G55, OP, Q36, and Q37 strongly protect public planner/outer graph/interrupt contracts.
A14	G54, G55, OP, Q36, and several others reject `_route` fields and unproven checkpoint caches; DS/KI/Q37 are more positive on recipes in state.
A15	GL, G54, G55, OP, Q36, and Q37 value `llm_failed` or `decision_origin`.
A16	All detailed reports reward helper extraction and staged migration.
A17	GL, G54, MI, Q36, and Q37 value GLM's `_build_plan_output` helper; several others do not mention it.
A18	DS, GL, OP, Q36, and Q37 value Opus's state ownership, dispatch, or naming reference material.
A19	Q36 and Q37 strongly value finish validation; OP and DS note the idea; many reports leave it neutral.

Full Scoring Matrix

Evaluator	A01	A02	A03	A04	A05	A06	A07	A08	A09	A10	A11	A12	A13	A14	A15	A16	A17	A18	A19
DS	+1.0	+1.0	+0.8	+0.5	+0.8	+0.9	+0.9	-0.8	+1.0	+1.0	+0.8	-0.3	+0.8	+0.3	+0.6	+1.0	+0.2	+0.8	+0.4
GE	+1.0	+0.8	+1.0	0.0	+0.8	+1.0	+0.8	-0.8	+0.7	+0.4	+0.4	0.0	0.0	+0.3	0.0	+0.4	0.0	0.0	0.0
GL	+0.8	+0.8	+0.7	+1.0	+1.0	+0.8	+0.8	+0.3	+0.8	+1.0	+0.8	+0.2	+0.8	+0.6	+1.0	+0.9	+0.8	+0.9	0.0
G54	+1.0	+0.4	+1.0	+0.8	+0.8	+0.9	+1.0	-0.6	+1.0	+1.0	+1.0	-0.7	+1.0	+1.0	+0.8	+1.0	+0.8	+0.4	0.0
G55	+1.0	+0.4	+1.0	+0.8	+0.8	+0.9	+0.9	-0.5	+1.0	+1.0	+1.0	-0.5	+1.0	+0.9	+0.9	+1.0	0.0	+0.4	0.0
KI	+1.0	+1.0	+0.8	+0.3	+0.7	+1.0	+1.0	-0.7	+1.0	+1.0	+0.5	+0.8	+0.2	+0.2	+0.4	+1.0	0.0	+0.6	0.0
MI	+1.0	+1.0	+0.8	+0.3	+0.5	+1.0	+1.0	-0.7	+1.0	+1.0	+0.6	-0.5	+0.7	+0.4	+0.5	+1.0	+0.8	+0.5	0.0
OP	+1.0	+0.8	+1.0	+1.0	+0.9	+0.9	+1.0	-0.6	+1.0	+1.0	+1.0	-0.5	+1.0	+1.0	+1.0	+1.0	0.0	+0.8	+0.7
Q36	+1.0	+0.5	+1.0	+0.4	+0.5	+0.8	+1.0	-0.7	+1.0	+1.0	+1.0	-0.6	+1.0	+0.8	+0.8	+1.0	+0.8	+0.8	+1.0
Q37	+1.0	+0.9	+1.0	+0.8	+0.8	+0.9	+1.0	-0.7	+1.0	+1.0	+1.0	-0.7	+0.9	+0.2	+0.8	+1.0	+1.0	+0.8	+1.0

Aggregate Vector And Agreement

Aspect	Mean score	Variance	Agreement note
A01	0.9800	0.0036	Very strong consensus for moderate graph sizing.
A02	0.7600	0.0524	Strong support for DeepSeek as one central topology source.
A03	0.9100	0.0129	Very strong support for GPT-5.4 as topology source.
A04	0.5900	0.1029	Moderate support; disagreement comes from GPT-5.5 being valued but often seen as node-heavy.
A05	0.7600	0.0224	Strong consensus that minimal splits are not enough as final shape.
A06	0.9100	0.0049	Very strong consensus against atomized graph shapes.
A07	0.9400	0.0064	Very strong consensus for one policy/adjust node.
A08	-0.5800	0.0936	Moderate rejection of multi-node correction chains.
A09	0.9500	0.0105	Very strong consensus for pure routing edges.
A10	0.9400	0.0324	Very strong consensus for canonical loop entry.
A11	0.8100	0.0489	Strong support for dedicated iteration lifecycle.
A12	-0.2800	0.2076	Highest disagreement; Kimi supports dedicated bootstrap nodes, many others reject them.
A13	0.7400	0.1144	Strong support for preserving contracts, with some shorter reports neutral.
A14	0.5700	0.0981	Moderate support for checkpoint caution; recipe caching creates disagreement.
A15	0.6800	0.0876	Strong support for LLM failure/provenance tracking.
A16	0.9300	0.0321	Very strong consensus for staged migration.
A17	0.4400	0.1664	Moderate support but many reports do not mention the helper.
A18	0.6000	0.0700	Moderate support for Opus reference material.
A19	0.3100	0.1689	Weak-to-moderate support; only some reports emphasize finish validation.
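As a worked check of two rows in this table, the sketch below recomputes the mean and variance for A12 (the most disputed aspect) and A01 (the least disputed) from their matrix columns. It assumes the variance is the population variance (divide by 10, not 9), which is what the published values imply; the helper name `summarize` is illustrative.

```python
from statistics import fmean, pvariance

# Column order: DS, GE, GL, G54, G55, KI, MI, OP, Q36, Q37.
A12 = [-0.3, 0.0, 0.2, -0.7, -0.5, 0.8, -0.5, -0.5, -0.6, -0.7]
A01 = [1.0, 1.0, 0.8, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]


def summarize(column):
    """Return (mean, population variance) for one aspect column."""
    mean = fmean(column)
    # pvariance divides by N, matching the equal evaluator weights w_i = 1/N.
    return mean, pvariance(column, mu=mean)


for name, column in (("A12", A12), ("A01", A01)):
    mean, variance = summarize(column)
    print(f"{name}: mean={mean:.4f} variance={variance:.4f}")
```

KI's +0.8 is the single large positive value in the A12 column, and it accounts for more than half of that column's squared deviation, which is why A12 carries the highest variance in the table.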

Overall Opinion Generated From The Aggregate

The equal-weight overall evaluator opinion is strongly in favor of a moderate phased planner-graph refactor. The center of opinion most strongly supports a 4-6 phase topology, pure routing edges, a canonical loop-entry node, a small LLM-only planning node, a single post-LLM policy/adjustment node, and a staged helper-first migration. GPT-5.4 and DeepSeek are the two strongest topology anchors in the aggregate, with GPT-5.5 contributing state, test, non-goal, and LLM-failure hygiene. The clearest aggregate rejections are over-atomized graphs, mutation inside edge functions, and treating every post-LLM correction as a separate checkpointed node. The most disputed areas are dedicated region/currency bootstrap nodes, recipe caching in checkpoint state, `_build_plan_output` emphasis, and finish-validation placement.

Distance To Overall Opinion

Formula:

d_i = sqrt(sum_j alpha_j * (s_ij - mean_j)^2)
alpha_j = 1 / 19

Clean forced ordering:

Rank	Evaluator	Distance to center	Gap from previous
1	DS	0.131789	0.000000
2	MI	0.184961	0.053172
3	G55	0.210513	0.025552
4	G54	0.222900	0.012387
5	OP	0.233734	0.010834
6	Q36	0.244949	0.011215
7	Q37	0.263778	0.018829
8	GL	0.306937	0.043159
9	KI	0.344429	0.037492
10	GE	0.400657	0.056228

Closest opinion to the overall opinion: deepseek-4-pro-range.md.
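To make the distance formula concrete, the following sketch recomputes the DS entry from the DS row of the scoring matrix and the mean vector in the aggregate table, then lists which aspects contribute most to the distance. The function name `center_distance` is illustrative and is not taken from the method file.

```python
import math

ASPECTS = [f"A{n:02d}" for n in range(1, 20)]
ALPHA = 1 / 19  # equal aspect-distance weights

# Mean vector from the aggregate table and DS row from the scoring matrix.
MEAN = [0.98, 0.76, 0.91, 0.59, 0.76, 0.91, 0.94, -0.58, 0.95, 0.94,
        0.81, -0.28, 0.74, 0.57, 0.68, 0.93, 0.44, 0.60, 0.31]
DS = [1.0, 1.0, 0.8, 0.5, 0.8, 0.9, 0.9, -0.8, 1.0, 1.0,
      0.8, -0.3, 0.8, 0.3, 0.6, 1.0, 0.2, 0.8, 0.4]


def center_distance(row, mean, alpha=ALPHA):
    """Weighted Euclidean distance from one evaluator row to the mean vector."""
    return math.sqrt(sum(alpha * (s - m) ** 2 for s, m in zip(row, mean)))


d_ds = center_distance(DS, MEAN)
print(f"DS distance to center: {d_ds:.6f}")

# Squared deviations per aspect, largest first, to show what drives the distance.
contributions = sorted(
    ((aspect, (s - m) ** 2) for aspect, s, m in zip(ASPECTS, DS, MEAN)),
    key=lambda item: item[1],
    reverse=True,
)
for aspect, squared in contributions[:3]:
    print(f"{aspect}: squared deviation {squared:.4f}")
```

For DS the largest squared deviation comes from A14 (checkpoint discipline), followed by A02 and A17, so even the most central report departs from the aggregate mainly on recipe caching, its own topology ranking, and the `_build_plan_output` helper.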

Pairwise Distance Matrix

Formula:

d(i, k) = sqrt(sum_j alpha_j * (s_ij - s_kj)^2)
alpha_j = 1 / 19

From \ To	DS	GE	GL	G54	G55	KI	MI	OP	Q36	Q37
DS	0.0000	0.4136	0.3763	0.3236	0.2819	0.3269	0.2152	0.2606	0.2902	0.2734
GE	0.4136	0.0000	0.5813	0.5370	0.4952	0.3449	0.4007	0.5685	0.5680	0.5835
GL	0.3763	0.5813	0.0000	0.3713	0.3728	0.4507	0.3940	0.3920	0.4431	0.4249
G54	0.3236	0.5370	0.3713	0.0000	0.1947	0.5282	0.2819	0.2884	0.2800	0.3332
G55	0.2819	0.4952	0.3728	0.1947	0.0000	0.4634	0.3332	0.2176	0.3364	0.3980
KI	0.3269	0.3449	0.4507	0.5282	0.4634	0.0000	0.3763	0.4995	0.5346	0.5385
MI	0.2152	0.4007	0.3940	0.2819	0.3332	0.3763	0.0000	0.3859	0.3195	0.3162
OP	0.2606	0.5685	0.3920	0.2884	0.2176	0.4995	0.3859	0.0000	0.2763	0.3154
Q36	0.2902	0.5680	0.4431	0.2800	0.3364	0.5346	0.3195	0.2763	0.0000	0.2103
Q37	0.2734	0.5835	0.4249	0.3332	0.3980	0.5385	0.3162	0.3154	0.2103	0.0000
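The pairwise formula is a weighted Euclidean metric, so the matrix is symmetric with a zero diagonal and satisfies the triangle inequality. The sketch below checks those properties on three rows copied from the scoring matrix (DS, MI, GE) and reproduces the corresponding matrix entries; `pairwise_distance` is an illustrative name.

```python
import math
from itertools import permutations

ALPHA = 1 / 19

ROWS = {
    "DS": [1.0, 1.0, 0.8, 0.5, 0.8, 0.9, 0.9, -0.8, 1.0, 1.0,
           0.8, -0.3, 0.8, 0.3, 0.6, 1.0, 0.2, 0.8, 0.4],
    "MI": [1.0, 1.0, 0.8, 0.3, 0.5, 1.0, 1.0, -0.7, 1.0, 1.0,
           0.6, -0.5, 0.7, 0.4, 0.5, 1.0, 0.8, 0.5, 0.0],
    "GE": [1.0, 0.8, 1.0, 0.0, 0.8, 1.0, 0.8, -0.8, 0.7, 0.4,
           0.4, 0.0, 0.0, 0.3, 0.0, 0.4, 0.0, 0.0, 0.0],
}


def pairwise_distance(a, b, alpha=ALPHA):
    """Weighted Euclidean distance between two evaluator rows."""
    return math.sqrt(sum(alpha * (x - y) ** 2 for x, y in zip(a, b)))


d = {(i, k): pairwise_distance(ROWS[i], ROWS[k]) for i in ROWS for k in ROWS}

for i in ROWS:
    assert d[(i, i)] == 0.0  # zero diagonal
for i, k in permutations(ROWS, 2):
    assert d[(i, k)] == d[(k, i)]  # symmetry
for i, j, k in permutations(ROWS, 3):
    assert d[(i, k)] <= d[(i, j)] + d[(j, k)] + 1e-12  # triangle inequality

print(f"d(DS, MI) = {d[('DS', 'MI')]:.4f}")
print(f"d(DS, GE) = {d[('DS', 'GE')]:.4f}")
print(f"d(MI, GE) = {d[('MI', 'GE')]:.4f}")
```

Because of the symmetry, only the 45 entries above the diagonal carry independent information; the full 10 x 10 layout is kept for readability.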

Medoid Ranking

Formula:

medoid_score_i = sum_k 0.1 * d(i, k)

Clean forced ordering:

Rank	Evaluator	Equal-weight pairwise total	Gap from previous
1	DS	0.276161	0.000000
2	MI	0.302303	0.026142
3	G55	0.309327	0.007024
4	G54	0.313841	0.004514
5	OP	0.320413	0.006572
6	Q36	0.325835	0.005422
7	Q37	0.339348	0.013513
8	GL	0.380641	0.041293
9	KI	0.406289	0.025648
10	GE	0.449273	0.042984

Medoid: deepseek-4-pro-range.md.
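The medoid ordering can be recovered directly from the published pairwise matrix. The sketch below uses the four-decimal entries from that matrix, so its scores agree with the table only to about four decimals; the names `CODES`, `D`, and `medoid_scores` are illustrative.

```python
CODES = ["DS", "GE", "GL", "G54", "G55", "KI", "MI", "OP", "Q36", "Q37"]

# Rows and columns follow CODES; values are the rounded published distances.
D = [
    [0.0000, 0.4136, 0.3763, 0.3236, 0.2819, 0.3269, 0.2152, 0.2606, 0.2902, 0.2734],
    [0.4136, 0.0000, 0.5813, 0.5370, 0.4952, 0.3449, 0.4007, 0.5685, 0.5680, 0.5835],
    [0.3763, 0.5813, 0.0000, 0.3713, 0.3728, 0.4507, 0.3940, 0.3920, 0.4431, 0.4249],
    [0.3236, 0.5370, 0.3713, 0.0000, 0.1947, 0.5282, 0.2819, 0.2884, 0.2800, 0.3332],
    [0.2819, 0.4952, 0.3728, 0.1947, 0.0000, 0.4634, 0.3332, 0.2176, 0.3364, 0.3980],
    [0.3269, 0.3449, 0.4507, 0.5282, 0.4634, 0.0000, 0.3763, 0.4995, 0.5346, 0.5385],
    [0.2152, 0.4007, 0.3940, 0.2819, 0.3332, 0.3763, 0.0000, 0.3859, 0.3195, 0.3162],
    [0.2606, 0.5685, 0.3920, 0.2884, 0.2176, 0.4995, 0.3859, 0.0000, 0.2763, 0.3154],
    [0.2902, 0.5680, 0.4431, 0.2800, 0.3364, 0.5346, 0.3195, 0.2763, 0.0000, 0.2103],
    [0.2734, 0.5835, 0.4249, 0.3332, 0.3980, 0.5385, 0.3162, 0.3154, 0.2103, 0.0000],
]


def medoid_scores(codes, dist, weight=0.1):
    """Equal-weight total pairwise distance per evaluator (self term is 0)."""
    return {code: sum(weight * value for value in row)
            for code, row in zip(codes, dist)}


scores = medoid_scores(CODES, D)
ranking = sorted(scores, key=lambda code: scores[code])
for rank, code in enumerate(ranking, start=1):
    print(f"{rank:2d} {code:<4} {scores[code]:.4f}")
```

The smallest gap between adjacent ranks in the table is 0.004514, far larger than the rounding error a row of four-decimal entries can introduce, so the rounded matrix yields the same ordering as the full-precision calculation.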

Full Scoring Evidence Ledger

Each row lists the aspect, its score, its mention status, and an evidence note. A status of unmentioned means the evaluator report did not make a clear equivalent judgment for that aspect; the score is therefore a neutral 0 under the user-selected override.

DS: `deepseek-4-pro-range.md`

Aspect	Score	Status	Evidence note
A01	+1.0	explicit	Calls the 4-node pipeline the "Goldilocks" balance and later recommends a 5-node synthesis.
A02	+1.0	explicit	Ranks DeepSeek #1 and adopts its guard/prepare/plan/adjust topology.
A03	+0.8	explicit	Ranks GPT-5.4 #2 and adopts its design principles and tick discipline.
A04	+0.5	explicit	Values GPT-5.5 organization, consequences, and `decision_origin`, but ranks it below top three.
A05	+0.8	explicit	Treats GLM, Mimo, and Gemini minimal splits as useful but incomplete.
A06	+0.9	explicit	Criticizes Opus/Qwen-style graph expansion and `_route` patterns as over-complex.
A07	+0.9	explicit	Says one `decision_policy` or `adjust` node is enough and rejects corrector explosion.
A08	-0.8	explicit	Says splitting every corrector adds graph noise without enough benefit.
A09	+1.0	explicit	Explicitly rejects redirects in edge functions as mutation outside nodes.
A10	+1.0	explicit	Identifies canonical loop-entry as an adopted cross-cutting pattern.
A11	+0.8	explicit	Adopts GPT-5.4's separate `tick` as an improvement over a combined guard.
A12	-0.3	implied	Notes separate bootstrap nodes can improve clarity, but final synthesis keeps them inside guard/bootstrap.
A13	+0.8	explicit	Adopts anti-pattern constraints such as keeping `ask_user` as the sole interrupt.
A14	+0.3	explicit	Rejects `_route`; mixed because it also recommends `recipes` in state with serialization verification.
A15	+0.6	explicit	Recommends borrowing GPT-5.5's `decision_origin`.
A16	+1.0	explicit	Treats helper extraction and staged verification as central to the best proposal.
A17	+0.2	explicit	Notes GLM's `_build_plan_output` helper as useful but not central.
A18	+0.8	explicit	Adopts Opus's state ownership table, naming convention, and dispatch idea.
A19	+0.4	explicit	Notes Qwen-3.6's `route_finish_check` as a good safety idea but does not center it.

GE: `gemini-3.1-pro-range.md`

Aspect	Score	Status	Evidence note
A01	+1.0	explicit	Recommends a 5-6 node phased pipeline as the best combination.
A02	+0.8	explicit	Names DeepSeek as runner-up and gives it an architectural nod.
A03	+1.0	explicit	Names GPT-5.4 the winner and primary recommendation.
A04	0.0	unmentioned	Does not make a clear specific judgment about GPT-5.5's hygiene contributions.
A05	+0.8	explicit	Says conservative 2-3 node splits leave too much bundled.
A06	+1.0	explicit	Rejects Qwen-3.7-style 10-node decomposition as graph noise and chatty state.
A07	+0.8	explicit	Says policy should be consolidated in one `decision_policy` node.
A08	-0.8	explicit	Rejects turning every policy rule into a node.
A09	+0.7	explicit	Says mutations belong in designated nodes and edges should read state and route.
A10	+0.4	explicit	Notes loop-back edge handling as a key concern.
A11	+0.4	implied	Recommends `tick_and_guard`, implying some iteration-entry support without separating it strongly.
A12	0.0	unmentioned	No clear support or rejection of dedicated bootstrap nodes.
A13	0.0	unmentioned	No clear public-contract or single-interrupt claim.
A14	+0.3	implied	Supports idempotent state mutations but does not discuss route fields or rich caches.
A15	0.0	unmentioned	No clear LLM-failure or decision-origin claim.
A16	+0.4	explicit	Mentions testability and isolation, but not a detailed staged migration.
A17	0.0	unmentioned	No `_build_plan_output` or equivalent helper claim.
A18	0.0	unmentioned	No clear Opus state-ownership or dispatch-reference claim.
A19	0.0	unmentioned	No finish-validation claim.

GL: `glm-5.1-pro-range.md`

Aspect	Score	Status	Evidence note
A01	+0.8	explicit	Final recommendation uses a moderate graph shape after selective borrowing.
A02	+0.8	explicit	Ranks DeepSeek runner-up and borrows the `adjust` starting point.
A03	+0.7	explicit	Values GPT-5.4 anti-patterns and `tick`, though ranks it fourth.
A04	+1.0	explicit	Ranks GPT-5.5 best overall and uses it as base.
A05	+1.0	explicit	Says Gemini/Mimo/GLM-level minimal shapes are incomplete final targets.
A06	+0.8	explicit	Critiques Qwen-3.7 as over-engineered and Opus as slightly over-decomposed.
A07	+0.8	explicit	Recommends starting with a single `correct_policy` node.
A08	+0.3	explicit	Also praises GPT-5.5's separation of normalize/retry/maybe-calculate concerns.
A09	+0.8	explicit	Adopts anti-pattern rules forbidding mutation in edges.
A10	+1.0	explicit	Final graph routes all loopbacks through `enter_iteration`.
A11	+0.8	explicit	Values `tick`/`enter_iteration` as a distinct lifecycle node.
A12	+0.2	implied	Sees dedicated region/currency nodes as visible, but final recommendation folds them into a gate.
A13	+0.8	explicit	Preserves non-goals and public planner boundaries from top proposals.
A14	+0.6	explicit	Warns about `_route` and checkpoint concerns, but not as strongly as G54/G55/OP.
A15	+1.0	explicit	Strongly values `decision_origin` or `llm_failed`.
A16	+0.9	explicit	Emphasizes staged migration and helper extraction.
A17	+0.8	explicit	Treats `_build_plan_output` as a useful standalone helper.
A18	+0.9	explicit	Borrows Opus naming and values state ownership documentation.
A19	0.0	unmentioned	No clear finish-validation adoption.

G54: `gpt-5.4-range.md`

Aspect	Score	Status	Evidence note
A01	+1.0	explicit	Recommends GPT-5.4's moderate six-phase target graph.
A02	+0.4	explicit	Treats DeepSeek as a useful mental model but ranks it fourth.
A03	+1.0	explicit	Ranks GPT-5.4 first and recommends it as end-state architecture.
A04	+0.8	explicit	Ranks GPT-5.5 second and uses its migration/state/test discipline.
A05	+0.8	explicit	Says GLM/Gemini/Mimo are safer first slices but weak final targets.
A06	+0.9	explicit	Rejects atomized Opus/Qwen shapes and checkpoint-boundary excess.
A07	+1.0	explicit	Recommends a single `decision_policy` node.
A08	-0.6	explicit	Says GPT-5.5's normalize/retry/maybe-calculate split is one split too far.
A09	+1.0	explicit	Explicitly rejects policy in edge functions.
A10	+1.0	explicit	Requires all action loopbacks to return to `tick`.
A11	+1.0	explicit	Makes `tick` the first core target node.
A12	-0.7	explicit	Rejects separate bootstrap interrupt nodes unless existing flow becomes insufficient.
A13	+1.0	explicit	Strongly preserves `ask_user`, outer graph, thread naming, and public surfaces.
A14	+1.0	explicit	Rejects `_route` patterns and unproven rich state caches.
A15	+0.8	explicit	Borrows `llm_failed`/origin handling from GPT-5.5.
A16	+1.0	explicit	Gives staged migration sequence and helper-first guidance.
A17	+0.8	explicit	Borrows GLM's `_build_plan_output` helper.
A18	+0.4	explicit	Keeps Opus as a later checklist but not as graph shape.
A19	0.0	unmentioned	No clear finish-validation claim.

G55: `gpt-5.5-range.md`

Aspect	Score	Status	Evidence note
A01	+1.0	explicit	Recommends a moderate phased graph rather than detailed atomization.
A02	+0.4	explicit	Treats DeepSeek as a close moderate split but below GPT-5.4/GPT-5.5.
A03	+1.0	explicit	Ranks GPT-5.4 as best implementation spine.
A04	+0.8	explicit	Ranks GPT-5.5 second and mines it for state/test details.
A05	+0.8	explicit	Marks Gemini/Mimo as too coarse and under-specified.
A06	+0.9	explicit	Rejects fully atomized Opus/Qwen shapes for first implementation.
A07	+0.9	explicit	Uses one `decision_policy` in the suggested target graph.
A08	-0.5	explicit	Says GPT-5.5's three post-LLM nodes are good long-term seams but too much first-step risk.
A09	+1.0	explicit	States edge functions should inspect only and mutations belong in nodes.
A10	+1.0	explicit	All loopbacks return to `tick`.
A11	+1.0	explicit	`tick` is the start node in the recommended graph.
A12	-0.5	explicit	Says require-region/currency may be conceptual but should still use existing `ask_user`.
A13	+1.0	explicit	Strongly preserves single interrupt and outer graph contracts.
A14	+0.9	explicit	Rejects rich cached recipes and informal transient state unless serializer-safe.
A15	+0.9	explicit	Values `llm_failed` or `decision_origin` for calculator-loop safety.
A16	+1.0	explicit	Provides helper-first implementation guidance.
A17	0.0	unmentioned	No clear `_build_plan_output` claim.
A18	+0.4	explicit	Treats Opus as checklist material, not target graph.
A19	0.0	unmentioned	No finish-validation claim.

KI: `kimi-2.6-range.md`

Aspect	Score	Status	Evidence note
A01	+1.0	explicit	Recommends a 5-node hybrid as the VentureScope pipeline.
A02	+1.0	explicit	Ranks DeepSeek #1 and adopts its primary architecture.
A03	+0.8	explicit	Ranks GPT-5.4 #2 and uses its staged migration.
A04	+0.3	explicit	Notes useful GPT-5.5 ideas but ranks it as over-engineered.
A05	+0.7	explicit	Says GLM/Mimo/Gemini are safe or decent but not sufficient.
A06	+1.0	explicit	Strongly rejects Opus/GPT-5.5/Qwen-3.7 over-decomposition.
A07	+1.0	explicit	Uses a single `adjust` node for policy corrections.
A08	-0.7	explicit	Says the separate correction chains should be collapsed into helpers.
A09	+1.0	explicit	Rejects Qwen-3.6 redirect-in-edge approach.
A10	+1.0	explicit	Final topology loops action nodes back to `guard`.
A11	+0.5	explicit	Values `tick` but accepts a combined `guard` in the final shape.
A12	+0.8	explicit	Strongly praises dedicated `ask_region`/`ask_currency` visibility.
A13	+0.2	implied	Does not strongly preserve single `ask_user`; bootstrap explicitness pulls against this aspect.
A14	+0.2	explicit	Rejects `_route`, but supports `recipes` in state.
A15	+0.4	explicit	Includes `llm_failed` in the hybrid sources.
A16	+1.0	explicit	Provides seven-stage staged migration and praises zero-risk helper extraction.
A17	0.0	unmentioned	No clear `_build_plan_output` claim.
A18	+0.6	explicit	Values Opus dispatch/naming ideas as partial borrowings.
A19	0.0	unmentioned	No clear finish-validation claim.

MI: `mimo-2.5-pro-range.md`

Aspect	Score	Status	Evidence note
A01	+1.0	explicit	Identifies DeepSeek/GPT-5.4-sized topology as the best combination.
A02	+1.0	explicit	Ranks DeepSeek #1 and recommends its graph structure.
A03	+0.8	explicit	Ranks GPT-5.4 #2 and borrows migration/anti-pattern guidance.
A04	+0.3	explicit	Values GPT-5.5 consequences and test guidance but sees it as over-decomposed.
A05	+0.5	explicit	Minimal splits are useful but criticized for incomplete post-LLM treatment.
A06	+1.0	explicit	Strongly rejects Opus/Qwen over-decomposition.
A07	+1.0	explicit	Calls DeepSeek's `adjust` the strongest differentiator.
A08	-0.7	explicit	Rejects splitting corrections into many nodes.
A09	+1.0	explicit	Rejects route-function mutation as an anti-pattern.
A10	+1.0	explicit	Says loopback to `guard` is correct.
A11	+0.6	explicit	Recommends renaming DeepSeek's `guard` to `tick`.
A12	-0.5	explicit	Says separate region/currency nodes add unnecessary complexity.
A13	+0.7	explicit	Adds GPT-5.4 anti-pattern warnings and preserves single bootstrap concern.
A14	+0.4	explicit	Criticizes `_route`, but accepts DeepSeek's recipe state as sensible.
A15	+0.5	explicit	Notes `llm_failed` or `decision_origin` as sensible.
A16	+1.0	explicit	Recommends helper extraction and staged migration.
A17	+0.8	explicit	Explicitly includes GLM's `_build_plan_output` helper.
A18	+0.5	explicit	Values Opus naming and ownership ideas but rejects the topology.
A19	0.0	unmentioned	No clear finish-validation claim.

OP: `opus-4.7-range.md`

Aspect	Score	Status	Evidence note
A01	+1.0	explicit	Recommends GPT-5.4-level middle topology as the baseline.
A02	+0.8	explicit	Ranks DeepSeek #3 and harvests its self-loop idea.
A03	+1.0	explicit	Ranks GPT-5.4 #1 and uses its six-phase pipeline.
A04	+1.0	explicit	Ranks GPT-5.5 #2 and strongly borrows its `llm_failed` and state-cache caution.
A05	+0.9	explicit	Says Mimo/Gemini are under-decomposed and do not solve the problem.
A06	+0.9	explicit	Calls Opus itself operationally heavy and rejects 13-node chains.
A07	+1.0	explicit	Recommends one `decision_policy` node, not a corrector chain.
A08	-0.6	explicit	Rejects splitting the corrector chain despite recognizing its theoretical clarity.
A09	+1.0	explicit	Calls redirect-in-edge the biggest anti-pattern.
A10	+1.0	explicit	Lists top-node loopback as universal agreement.
A11	+1.0	explicit	Uses `tick` as first recommended phase.
A12	-0.5	explicit	Rejects split bootstrap nodes as graph-decorative.
A13	+1.0	explicit	Strongly preserves single `ask_user` interrupt and outer graph scope.
A14	+1.0	explicit	Rejects recipes in state unless serializer-safe and rejects state hint pollution.
A15	+1.0	explicit	Requires `llm_failed` or `decision_origin`.
A16	+1.0	explicit	Provides a six-commit green migration.
A17	0.0	unmentioned	No clear `_build_plan_output` claim.
A18	+0.8	explicit	Values state ownership and write-surface discipline as reference material.
A19	+0.7	explicit	Calls `route_finish_check` a sharp idea, though not the central recommendation.

Q36: `qwen-3.6-range.md`

Aspect	Score	Status	Evidence note
A01	+1.0	explicit	Recommends a 5-6 node target count.
A02	+0.5	explicit	Treats DeepSeek as solid middle-ground but ranks it fifth.
A03	+1.0	explicit	Ranks GPT-5.4 best overall and uses its target nodes.
A04	+0.4	explicit	Praises GPT-5.5 hygiene but ranks it low for complexity.
A05	+0.5	explicit	GLM is ranked highly as first PR, but long-term still needs more decomposition.
A06	+0.8	explicit	Rejects 10+ and 14-node proposals for checkpoint overhead and graph noise.
A07	+1.0	explicit	Recommends one `decision_policy` node.
A08	-0.7	explicit	Explicitly says one policy node, not five corrector nodes.
A09	+1.0	explicit	States no state mutation in edge functions as a design rule.
A10	+1.0	explicit	Requires loopbacks to `tick`.
A11	+1.0	explicit	Makes `tick` the first target node.
A12	-0.6	explicit	Merges separate region/currency nodes into one bootstrap gate.
A13	+1.0	explicit	Keeps `ask_user` single interrupt and avoids outer graph changes.
A14	+0.8	explicit	Rejects `_route` and checkpoint pollution, though does not dwell on recipes as much as OP/G54.
A15	+0.8	explicit	Includes `llm_failed` from GPT-5.5.
A16	+1.0	explicit	Gives combined staged migration.
A17	+0.8	explicit	Includes `_build_plan_output` helper.
A18	+0.8	explicit	Ranks Opus high for mapping/state ownership reference.
A19	+1.0	explicit	Includes `finish_check` in the recommended graph.

Q37: `qwen-3.7-max-range.md`

Aspect	Score	Status	Evidence note
A01	+1.0	explicit	Recommends a 5-node pipeline plus optional finish check.
A02	+0.9	explicit	Ranks DeepSeek #2 and adopts its pragmatism.
A03	+1.0	explicit	Ranks GPT-5.4 #1 and uses it as target topology.
A04	+0.8	explicit	Ranks GPT-5.5 #3 and borrows `llm_failed`/non-goal detail.
A05	+0.8	explicit	Treats Gemini/Mimo minimal proposals as insufficient.
A06	+0.9	explicit	Rejects Opus/Qwen-3.7-style full decomposition as too many nodes.
A07	+1.0	explicit	Uses one `decision_policy` node.
A08	-0.7	explicit	Says multi-node post-LLM correction adds unnecessary checkpoint writes.
A09	+1.0	explicit	Explicitly rejects redirect-in-edge mutation.
A10	+1.0	explicit	Target graph loops all action paths back to `tick`.
A11	+1.0	explicit	`tick` is the first target node.
A12	-0.7	explicit	Rejects separate `ask_region`/`ask_currency` nodes as overkill.
A13	+0.9	explicit	Keeps `ask_user` and public graph contracts stable.
A14	+0.2	explicit	Rejects `_route` but supports recipes in state, creating a mixed checkpoint-discipline position.
A15	+0.8	explicit	Includes GPT-5.5's `llm_failed` state field.
A16	+1.0	explicit	Gives eight-step green migration.
A17	+1.0	explicit	Includes `_build_plan_output` as first migration step.
A18	+0.8	explicit	Values Opus mapping and state ownership table as reference material.
A19	+1.0	explicit	Includes `finish_check` in the resulting target graph.

Normalization And Ambiguity Audit

Check	Result
Evaluator count	10
Aspect count	19
Total evaluator-aspect cells	190
Unresolved ambiguous cells	0
Ambiguity rate	`0 / 190 = 0.0%`
Clear-equivalent normalization	Applied only to equivalent statements such as `decision_policy`, `adjust`, `enforce_policy`, and `correct_policy` when they meant a single post-LLM policy stage.
Kept separate because equivalence was not clear	Dedicated bootstrap nodes vs bootstrap gate; recipe cache vs checkpoint-state caution; finish validation vs normal finish node; multi-node correction split vs single policy node.
Neutral cells	All `0.0` cells are either unmentioned or balanced; unmentioned cells use the user-selected neutral override.

Reproducibility Snippet

The numeric tables above were computed from the score matrix using this exact data shape:

M[evaluator][aspect] = score in [-1, 1]
w_i = 0.1
alpha_j = 1 / 19
c_ij = 1 after neutral fill
mean_j = sum_i M[i][j] / 10
variance_j = sum_i (M[i][j] - mean_j)^2 / 10
center_distance_i = sqrt(sum_j alpha_j * (M[i][j] - mean_j)^2)
pairwise_distance_i_k = sqrt(sum_j alpha_j * (M[i][j] - M[k][j])^2)
medoid_score_i = sum_k 0.1 * pairwise_distance_i_k
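The definitions above translate into a short, dependency-free Python script. This is a sketch of one way to reproduce the tables, not necessarily the script that generated them; the variable names are illustrative, and ties fall back to the evaluator code rather than the source filename, which makes no difference here because no ties occur.

```python
import math

# Scoring matrix from this report: evaluator code followed by A01..A19.
MATRIX_TEXT = """
DS  1.0 1.0 0.8 0.5 0.8 0.9 0.9 -0.8 1.0 1.0 0.8 -0.3 0.8 0.3 0.6 1.0 0.2 0.8 0.4
GE  1.0 0.8 1.0 0.0 0.8 1.0 0.8 -0.8 0.7 0.4 0.4 0.0 0.0 0.3 0.0 0.4 0.0 0.0 0.0
GL  0.8 0.8 0.7 1.0 1.0 0.8 0.8 0.3 0.8 1.0 0.8 0.2 0.8 0.6 1.0 0.9 0.8 0.9 0.0
G54 1.0 0.4 1.0 0.8 0.8 0.9 1.0 -0.6 1.0 1.0 1.0 -0.7 1.0 1.0 0.8 1.0 0.8 0.4 0.0
G55 1.0 0.4 1.0 0.8 0.8 0.9 0.9 -0.5 1.0 1.0 1.0 -0.5 1.0 0.9 0.9 1.0 0.0 0.4 0.0
KI  1.0 1.0 0.8 0.3 0.7 1.0 1.0 -0.7 1.0 1.0 0.5 0.8 0.2 0.2 0.4 1.0 0.0 0.6 0.0
MI  1.0 1.0 0.8 0.3 0.5 1.0 1.0 -0.7 1.0 1.0 0.6 -0.5 0.7 0.4 0.5 1.0 0.8 0.5 0.0
OP  1.0 0.8 1.0 1.0 0.9 0.9 1.0 -0.6 1.0 1.0 1.0 -0.5 1.0 1.0 1.0 1.0 0.0 0.8 0.7
Q36 1.0 0.5 1.0 0.4 0.5 0.8 1.0 -0.7 1.0 1.0 1.0 -0.6 1.0 0.8 0.8 1.0 0.8 0.8 1.0
Q37 1.0 0.9 1.0 0.8 0.8 0.9 1.0 -0.7 1.0 1.0 1.0 -0.7 0.9 0.2 0.8 1.0 1.0 0.8 1.0
"""

M = {}
for line in MATRIX_TEXT.strip().splitlines():
    code, *cells = line.split()
    M[code] = [float(cell) for cell in cells]

n_aspects = 19
assert len(M) == 10 and all(len(row) == n_aspects for row in M.values())

w = 1 / len(M)         # evaluator weight, 0.1
alpha = 1 / n_aspects  # aspect-distance weight


def distance(a, b):
    """Weighted Euclidean distance between two 19-dimensional vectors."""
    return math.sqrt(sum(alpha * (x - y) ** 2 for x, y in zip(a, b)))


mean = [sum(row[j] for row in M.values()) / len(M) for j in range(n_aspects)]
variance = [sum((row[j] - mean[j]) ** 2 for row in M.values()) / len(M)
            for j in range(n_aspects)]
center = {i: distance(M[i], mean) for i in M}
medoid = {i: sum(w * distance(M[i], M[k]) for k in M) for i in M}

# Sort on full-precision values; the code is only a fallback tie-break.
by_center = sorted(center, key=lambda i: (center[i], i))
by_medoid = sorted(medoid, key=lambda i: (medoid[i], i))

for code in by_center:
    print(f"{code:<4} center={center[code]:.6f} medoid={medoid[code]:.6f}")
```

Keeping the matrix as whitespace-separated text makes it easy to diff against the scoring-matrix table, and the row-length assertion catches a dropped or duplicated cell before any distance is computed.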

Final Answer

By closest opinion to the equal-weight overall vector, the ranking is:

DS
MI
G55
G54
OP
Q36
Q37
GL
KI
GE

By medoid centrality, the ranking is:

DS
MI
G55
G54
OP
Q36
Q37
GL
KI
GE

Both calculations select deepseek-4-pro-range.md as the most representative evaluator opinion in this set.
