Evaluator Consensus Ranking Report

Scope

This report applies docs/planner-graph-ref/analyse/analyse-evaluation-method.md to the 10 evaluator reports in this directory. The method document is not treated as an evaluator.

Evaluator inputs:

zdeepseek-4-pro-range.md
gemini-3.1-pro-range.md
glm-5.1-pro-range.md
gpt-5.4-range.md
gpt-5.5-range.md
kimi-2.6-range.md
mimo-2.5-pro-range.md
opus-4.7-range.md
qwen-3.6-range.md
qwen-3.7-max-range.md

The plan artifact names deepseek-4-pro-range.md, but the on-disk DeepSeek evaluator file is zdeepseek-4-pro-range.md; this report uses the on-disk source filename.

Method And Scoring Rules

Each evaluator report is represented as an aspect vector $x_i = (s_{i1}, \ldots, s_{i16})$ with scores in $[-1, 1]$. Equal evaluator weights are used: $w_i = 0.1$.

Scoring scale:

Score	Meaning
`1.0`	Strong support for the aspect.
`0.8`	Clear support with qualification or less centrality.
`0.5`	Moderate or partial support.
`0.3`	Weak positive support.
`0.0`	Neutral or unmentioned.
Negative values	Caution against or rejection of the aspect, with magnitude matching strength.

Missingness policy: unmentioned aspects are scored 0 by user-selected override. This intentionally differs from the method document's caution that missing aspects should not always pull toward zero. Here, after neutral filling, $c_{ij} = 1$ for every evaluator-aspect cell; mention status is tracked separately for audit only.
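The neutral-fill override can be sketched in a few lines. This is an illustration only: representing an unmentioned cell as `None` and the helper name `neutral_fill` are assumptions, not part of the method document. The row used is the `gemini-3.1-pro-range.md` row of the scoring matrix.

```python
from typing import Optional

ASPECTS = [f"A{j}" for j in range(1, 17)]

# Raw scores for gemini-3.1-pro-range.md; None marks an unmentioned aspect
# (the None encoding is an assumption made for this sketch).
raw_gemini: list[Optional[float]] = [
    1.0, 0.7, 0.8, 0.8, 0.5, 0.8, None, 0.6,
    0.4, None, 1.0, 0.8, 0.9, 0.8, None, None,
]


def neutral_fill(raw):
    """Replace unmentioned cells with 0.0 and report which aspects were filled."""
    scores = [0.0 if s is None else s for s in raw]
    unmentioned = [a for a, s in zip(ASPECTS, raw) if s is None]
    return scores, unmentioned


scores, unmentioned = neutral_fill(raw_gemini)
print(scores)
print(unmentioned)
```

After filling, every cell counts toward the aggregate with confidence 1, so the list of filled aspects is kept only as an audit trail, which matches the role of the `U` code.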

Mention-status codes:

Code	Meaning
`E`	Explicit claim or explicit verdict.
`I`	Implied by ranking, criticism, or recommended hybrid.
`U`	Unmentioned; neutral `0` by missingness override.

Normalization decisions:

Clearly equivalent terms were normalized: adjust, decision_policy, enforce_policy, correct_policy, and post-decision normalization all map to A4 when the evaluator supports a node-owned correction stage.
Unclear equivalents were kept separate: single interrupt handling is A7, checkpoint-safe state is A8, and recipe caching is A16.
A post-LLM correction stage split across many nodes scores positive on A4 only when the evaluator endorsed it as node-owned correction; the same text can still score high on A11 if it also rejected graph over-decomposition.
No independent planner-architecture judgment is added. Scores reflect only evaluator-report claims.
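The normalization decisions above amount to a small lookup. The sketch below is hypothetical: the names `A4_ALIASES`, `SEPARATE_TERMS`, and `normalize_term` are invented for illustration, and the `node_owned` flag stands in for the manual judgment of whether the evaluator supports a node-owned correction stage.

```python
from typing import Optional

# Terms the report treats as clearly equivalent and maps to A4.
A4_ALIASES = {
    "adjust",
    "decision_policy",
    "enforce_policy",
    "correct_policy",
    "post-decision normalization",
}

# Terms the report deliberately keeps separate.
SEPARATE_TERMS = {
    "single interrupt handling": "A7",
    "checkpoint-safe state": "A8",
    "recipe caching": "A16",
}


def normalize_term(term: str, node_owned: bool = True) -> Optional[str]:
    """Map an evaluator's term to an aspect code, or None if no rule applies."""
    key = term.strip().lower()
    if key in A4_ALIASES:
        # A4 applies only when the evaluator supports a node-owned correction stage.
        return "A4" if node_owned else None
    return SEPARATE_TERMS.get(key)


print(normalize_term("enforce_policy"))
print(normalize_term("adjust", node_owned=False))
print(normalize_term("recipe caching"))
```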

Ambiguity result: 0 / 160 = 0.000% unresolved ambiguous cells. No forced cells were needed.

Rounding policy: calculations used the full matrix values below. Aggregate, variance, and distance tables show 6 decimals so that rankings separated by small gaps remain distinguishable. Ties are broken by exact unrounded value, then by lexicographic filename if exact values are equal. Near ties are gaps <= 0.020000.

Aspect Taxonomy

Code	Aspect	Claim direction represented by positive scores
A1	Right-sized moderate decomposition	Prefer a 4-7 phase graph over minimal or atomized rewrites.
A2	Staged helper-first migration	Extract helpers and ship green stages before graph surgery.
A3	LLM-only planning node	Isolate prompt construction and structured LLM output from deterministic policy.
A4	Node-owned post-decision policy	Put redirects, caps, and calculation adjustment in a node-owned policy stage, not inline or in edges.
A5	Pure routing edges	Conditional edges inspect state and route only; they do not mutate decisions.
A6	Canonical loop-entry phase	Action nodes re-enter through one top-of-loop tick/guard/precheck phase.
A7	Single interrupt surface	Preserve `ask_user` / `observe_user` as the interrupt path instead of adding special bootstrap observe nodes.
A8	Checkpoint-safe state discipline	Minimize new state, avoid transient route fields, and add only serializer-safe state.
A9	Calculator lifecycle visibility	Preserve or expose calculator caps, success, blocked recovery, and finish-to-calculate policy.
A10	LLM failure provenance	Track `llm_failed` or `decision_origin` to avoid failed-LLM recovery loops.
A11	Reject over-decomposition	Avoid chatty 10+ node graphs and excessive PostgresSaver checkpoint hops.
A12	Reject under-decomposition	Avoid leaving a renamed 150-200 line god-node-lite as the target state.
A13	Graph honesty and observability	Make Mermaid/LangGraph topology reflect the real planner lifecycle.
A14	Test seams	Make node boundaries unit-testable without full LLM orchestration.
A15	Finish validation	Add or value a finish-check / premature-finish validation step.
A16	Recipe caching in state	Support caching dynamic acquisition recipes in planner state.

Scoring Matrix

Evaluator	A1	A2	A3	A4	A5	A6	A7	A8	A9	A10	A11	A12	A13	A14	A15	A16
`zdeepseek-4-pro-range.md`	1.0	1.0	1.0	1.0	0.8	1.0	0.5	0.4	0.8	0.7	1.0	1.0	1.0	1.0	0.0	0.8
`gemini-3.1-pro-range.md`	1.0	0.7	0.8	0.8	0.5	0.8	0.0	0.6	0.4	0.0	1.0	0.8	0.9	0.8	0.0	0.0
`glm-5.1-pro-range.md`	0.7	1.0	0.9	0.5	0.8	0.9	0.8	0.9	1.0	0.7	0.8	0.7	1.0	1.0	0.0	-0.4
`gpt-5.4-range.md`	1.0	1.0	1.0	0.9	1.0	1.0	1.0	1.0	0.9	0.7	1.0	0.9	1.0	1.0	-0.2	-0.6
`gpt-5.5-range.md`	1.0	1.0	1.0	0.8	1.0	1.0	1.0	1.0	0.9	1.0	1.0	0.9	1.0	1.0	0.3	-0.7
`kimi-2.6-range.md`	1.0	1.0	1.0	1.0	0.8	1.0	-0.3	0.7	0.9	0.0	0.9	1.0	1.0	1.0	0.0	1.0
`mimo-2.5-pro-range.md`	1.0	1.0	1.0	1.0	0.8	0.9	0.6	0.7	0.7	0.0	1.0	0.8	1.0	1.0	0.0	0.4
`opus-4.7-range.md`	1.0	1.0	1.0	0.8	1.0	1.0	1.0	1.0	0.8	1.0	1.0	1.0	1.0	1.0	0.6	-0.8
`qwen-3.6-range.md`	1.0	1.0	1.0	1.0	1.0	1.0	1.0	0.9	0.7	1.0	0.8	0.6	1.0	1.0	1.0	0.5
`qwen-3.7-max-range.md`	1.0	1.0	1.0	1.0	1.0	1.0	0.6	0.8	0.8	0.6	1.0	0.8	1.0	1.0	0.8	0.6

Mention Status Matrix

Evaluator	A1	A2	A3	A4	A5	A6	A7	A8	A9	A10	A11	A12	A13	A14	A15	A16
`zdeepseek-4-pro-range.md`	E	E	E	E	E	E	I	E	E	E	E	E	E	E	U	E
`gemini-3.1-pro-range.md`	E	E	E	E	I	E	U	I	I	U	E	E	E	E	U	U
`glm-5.1-pro-range.md`	E	E	E	E	E	E	E	E	E	E	E	E	E	E	U	E
`gpt-5.4-range.md`	E	E	E	E	E	E	E	E	E	E	E	E	E	E	I	E
`gpt-5.5-range.md`	E	E	E	E	E	E	E	E	E	E	E	E	E	E	I	E
`kimi-2.6-range.md`	E	E	E	E	E	E	E	E	E	U	E	E	E	E	U	E
`mimo-2.5-pro-range.md`	E	E	E	E	E	E	I	E	E	U	E	E	E	E	U	I
`opus-4.7-range.md`	E	E	E	E	E	E	E	E	E	E	E	E	E	E	E	E
`qwen-3.6-range.md`	E	E	E	E	E	E	E	E	E	E	E	E	E	E	E	E
`qwen-3.7-max-range.md`	E	E	E	E	E	E	E	E	E	E	E	E	E	E	E	E

Cell Evidence Ledger

Each item lists all 16 cell rationales for the source evaluator. U cells are neutral by the missingness override.

`zdeepseek-4-pro-range.md`

A1 1.0/E: calls the 4-node DeepSeek approach the "Goldilocks" solution; rationale: strong support for moderate granularity.
A2 1.0/E: praises a helper-first migration with verification gates; rationale: supports staged rollout.
A3 1.0/E: says the LLM call should be isolated in a thin node; rationale: direct LLM-isolation claim.
A4 1.0/E: identifies adjust as the strongest insight and canonical normalization stage; rationale: strong policy-node support.
A5 0.8/E: rejects redirect logic in edge functions when discussing Qwen-3.6; rationale: supports pure routing edges.
A6 1.0/E: requires loop-back through a canonical guard entry; rationale: strong loop-entry support.
A7 0.5/I: criticizes extra bootstrap-node variants but also accepts structural bootstrap visibility; rationale: partial support for single interrupt handling.
A8 0.4/E: criticizes _route fields but endorses recipe state; rationale: mixed checkpoint discipline.
A9 0.8/E: values calculator lifecycle and _adjust_calculation_decision reuse; rationale: supports calculator preservation.
A10 0.7/E: highlights llm_failed / origin as a useful safeguard in other proposals; rationale: supports failure provenance.
A11 1.0/E: rejects 10+ node proposals as over-engineered; rationale: strong anti-overdecomposition claim.
A12 1.0/E: says 2-3 node proposals leave too much hidden logic; rationale: strong anti-underdecomposition claim.
A13 1.0/E: centers the hidden-routing problem and graph honesty; rationale: strong observability claim.
A14 1.0/E: repeatedly rates proposals by test seams and isolated nodes; rationale: strong testability claim.
A15 0.0/U: no finish-validation claim; rationale: neutral by missingness.
A16 0.8/E: calls recipes in state a practical cache; rationale: supports recipe caching.

`gemini-3.1-pro-range.md`

A1 1.0/E: names GPT-5.4 and DeepSeek as top phased-pipeline approaches; rationale: strong moderate-granularity support.
A2 0.7/E: values testable phases but gives limited migration detail; rationale: moderate staged-migration support.
A3 0.8/E: endorses separating preparation/rules from LLM planning; rationale: supports LLM isolation.
A4 0.8/E: recommends decision_policy/DeepSeek-style post-policy over hidden rewrites; rationale: supports policy node.
A5 0.5/I: says state mutations should happen in designated nodes; rationale: implied edge-purity support.
A6 0.8/E: notes loopbacks must be handled carefully through the phase entry; rationale: supports canonical re-entry.
A7 0.0/U: no single-interrupt claim; rationale: neutral by missingness.
A8 0.6/I: values idempotent state mutation and edge readability; rationale: implied checkpoint discipline.
A9 0.4/I: mentions acquisition/calculator gate concerns but not as a central criterion; rationale: weak calculator support.
A10 0.0/U: no LLM-failure provenance claim; rationale: neutral by missingness.
A11 1.0/E: explicitly rejects graph noise and extreme decomposition; rationale: strong anti-overdecomposition claim.
A12 0.8/E: says conservative 2-3 node splits leave rewrites hidden; rationale: supports avoiding under-decomposition.
A13 0.9/E: frames the graph diagram as misleading today; rationale: strong graph-honesty support.
A14 0.8/E: ties phase boundaries to unit-testability; rationale: supports test seams.
A15 0.0/U: no finish-validation claim; rationale: neutral by missingness.
A16 0.0/U: no recipe-caching claim; rationale: neutral by missingness.

`glm-5.1-pro-range.md`

A1 0.7/E: ranks GPT-5.5 highest despite higher node count but still rejects extremes; rationale: moderate right-sizing support.
A2 1.0/E: praises staged extraction and migration discipline; rationale: strong staged-migration support.
A3 0.9/E: treats LLM planning as a separable phase; rationale: strong LLM-isolation support.
A4 0.5/E: favors separate normalization/retry/calculator policy more than one node; rationale: partial support for node-owned policy.
A5 0.8/E: rejects policy in routing functions and _route patterns; rationale: supports pure edges.
A6 0.9/E: supports one iteration lifecycle entry; rationale: strong canonical-loop support.
A7 0.8/E: says ask_user should remain the interrupt surface; rationale: supports single interrupt handling.
A8 0.9/E: emphasizes checkpoint safety and serializer-compatible state; rationale: strong state discipline.
A9 1.0/E: ranks calculator-visible maybe_calculate highly; rationale: strong calculator-lifecycle support.
A10 0.7/E: values llm_failed / origin handling; rationale: supports failure provenance.
A11 0.8/E: criticizes over-modeled graphs; rationale: anti-overdecomposition support.
A12 0.7/E: treats GLM-style minimal split as safe but not final; rationale: moderate anti-underdecomposition support.
A13 1.0/E: scores graph visibility as central; rationale: strong graph-honesty claim.
A14 1.0/E: maps tests to new node targets; rationale: strong test-seam support.
A15 0.0/U: no finish-validation claim; rationale: neutral by missingness.
A16 -0.4/E: warns against cached recipe/state additions unless proven safe; rationale: mild rejection of recipe caching.

`gpt-5.4-range.md`

A1 1.0/E: recommends the GPT-5.4 moderate phased graph as target; rationale: strong right-sizing support.
A2 1.0/E: says helper extraction and staged migration are the best playbook; rationale: strong staged-migration support.
A3 1.0/E: says the LLM node should be prompt build plus structured call only; rationale: strong LLM-isolation claim.
A4 0.9/E: recommends one decision_policy node; rationale: strong unified-policy support.
A5 1.0/E: explicitly says edges must be side-effect free; rationale: strong edge-purity claim.
A6 1.0/E: says only tick should increment iterations and loopbacks return to it; rationale: strong loop-entry support.
A7 1.0/E: preserves ask_user as the single interrupt node; rationale: strong single-interrupt support.
A8 1.0/E: rejects route tags and checkpoint-risky state; rationale: strong state discipline.
A9 0.9/E: requires calculator policy to remain complete in acquisition_gate/decision_policy; rationale: strong calculator support.
A10 0.7/E: notes LLM-failure handoff needs care; rationale: supports provenance but not as central.
A11 1.0/E: rejects Opus/Qwen atomized graphs; rationale: strong anti-overdecomposition claim.
A12 0.9/E: says minimal splits leave policy hidden; rationale: strong anti-underdecomposition claim.
A13 1.0/E: makes graph honesty a main criterion; rationale: strong observability support.
A14 1.0/E: emphasizes phase-specific unit tests; rationale: strong test-seam support.
A15 -0.2/I: treats finish-check as a back-loaded validation rather than core shape; rationale: mild caution.
A16 -0.6/E: warns not to cache recipes without serialization proof; rationale: rejects recipe caching for first pass.

`gpt-5.5-range.md`

A1 1.0/E: recommends GPT-5.4 spine with GPT-5.5 details; rationale: strong moderate-granularity support.
A2 1.0/E: endorses six green migration stages; rationale: strong staged-migration support.
A3 1.0/E: says LLM planning should be thin and isolated; rationale: strong LLM-isolation claim.
A4 0.8/E: supports node-owned normalization, but accepts merging post-LLM nodes for first pass; rationale: clear policy-node support.
A5 1.0/E: rejects mutation in edge functions; rationale: strong edge-purity support.
A6 1.0/E: requires loopbacks to top entry; rationale: strong canonical-loop support.
A7 1.0/E: keeps ask_user as single interrupt surface; rationale: strong single-interrupt support.
A8 1.0/E: emphasizes checkpoint-owned state and serializer-safe additions; rationale: strong state discipline.
A9 0.9/E: values calculator routing and blocked-loop preservation; rationale: strong calculator support.
A10 1.0/E: explicitly recommends llm_failed or decision_origin; rationale: strong failure-provenance support.
A11 1.0/E: rejects atomized corrector chains as too chatty; rationale: strong anti-overdecomposition support.
A12 0.9/E: says under-specified 2-3 node plans do not solve post-LLM policy; rationale: strong anti-underdecomposition support.
A13 1.0/E: treats graph honesty as a core criterion; rationale: strong observability support.
A14 1.0/E: gives test-target mapping; rationale: strong test-seam support.
A15 0.3/I: values completion correctness but does not make finish-check central; rationale: weak positive.
A16 -0.7/E: explicitly says not to store recipes unless serialization is proven; rationale: clear recipe-caching rejection.

`kimi-2.6-range.md`

A1 1.0/E: recommends a 5-node hybrid from top proposals; rationale: strong right-sized support.
A2 1.0/E: requires staged migration and helper extraction; rationale: strong staged-migration support.
A3 1.0/E: isolates the LLM call in plan/decide; rationale: strong LLM-isolation support.
A4 1.0/E: strongly supports adjust/enforce_policy as post-processing stage; rationale: strong policy-node support.
A5 0.8/E: rejects redirect-in-edge anti-pattern; rationale: supports edge purity.
A6 1.0/E: loops all action nodes back to guard/tick; rationale: strong canonical-loop support.
A7 -0.3/E: praises dedicated ask_region/ask_currency node visibility; rationale: mild rejection of a single generic interrupt shape.
A8 0.7/E: discusses checkpointer overhead and avoids _route; rationale: supports state discipline with some recipe-cache optimism.
A9 0.9/E: values calc_gate and calculator adjustment routing; rationale: strong calculator support.
A10 0.0/U: no failure-provenance claim; rationale: neutral by missingness.
A11 0.9/E: rejects Opus/Qwen over-decomposition; rationale: strong anti-overdecomposition support.
A12 1.0/E: says minimal splits leave too much hidden; rationale: strong anti-underdecomposition support.
A13 1.0/E: repeatedly emphasizes truthful Mermaid diagrams; rationale: strong graph-honesty support.
A14 1.0/E: makes testability a high-weight criterion; rationale: strong test-seam support.
A15 0.0/U: no finish-validation claim; rationale: neutral by missingness.
A16 1.0/E: endorses recipe state as useful deduplication; rationale: strong recipe-caching support.

`mimo-2.5-pro-range.md`

A1 1.0/E: ranks DeepSeek's 4-node split first; rationale: strong moderate-granularity support.
A2 1.0/E: praises helper-first migration; rationale: strong staged-migration support.
A3 1.0/E: requires LLM-only plan/decide separation; rationale: strong LLM-isolation support.
A4 1.0/E: calls adjust the clean post-LLM stage; rationale: strong policy-node support.
A5 0.8/E: rejects redirect-in-routing and _route patterns; rationale: supports pure edges.
A6 0.9/E: endorses action loopbacks to guard/tick; rationale: strong canonical-loop support.
A7 0.6/I: recommends avoiding separate region/currency nodes; rationale: partial single-interrupt support.
A8 0.7/E: values minimal state changes and flags transient route fields; rationale: supports state discipline.
A9 0.7/E: discusses calculator checks and adjustment but not as top criterion; rationale: clear but moderate support.
A10 0.0/U: no failure-provenance claim; rationale: neutral by missingness.
A11 1.0/E: ranks over-decomposed Opus/Qwen low; rationale: strong anti-overdecomposition support.
A12 0.8/E: says minimal 3-node proposals are still incomplete; rationale: supports anti-underdecomposition.
A13 1.0/E: centers graph routing honesty; rationale: strong observability support.
A14 1.0/E: scores proposals on testability; rationale: strong test-seam support.
A15 0.0/U: no finish-validation claim; rationale: neutral by missingness.
A16 0.4/I: treats recipe caching as sensible in the favored DeepSeek plan; rationale: weak to moderate recipe-caching support.

`opus-4.7-range.md`

A1 1.0/E: recommends the GPT-5.4 six-phase shape; rationale: strong right-sized support.
A2 1.0/E: gives a six-commit green migration; rationale: strong staged-migration support.
A3 1.0/E: says LLM node should shrink to prompt plus structured call; rationale: strong LLM-isolation claim.
A4 0.8/E: supports decision_policy, while noting possible helper substructure; rationale: clear policy-node support.
A5 1.0/E: calls redirect-in-edge the biggest anti-pattern; rationale: strong edge-purity claim.
A6 1.0/E: says all loopbacks re-enter at tick; rationale: strong canonical-loop support.
A7 1.0/E: keeps ask_user as single interrupt and rejects extra observe nodes; rationale: strong single-interrupt support.
A8 1.0/E: warns about state surface, route fields, and serializer safety; rationale: strong state discipline.
A9 0.8/E: values calculator routing and failure safeguards; rationale: clear calculator support.
A10 1.0/E: explicitly recommends llm_failed / decision_origin; rationale: strong failure-provenance support.
A11 1.0/E: rejects 13-node corrector chains as operationally heavy; rationale: strong anti-overdecomposition support.
A12 1.0/E: says three-node proposals do not solve the problem; rationale: strong anti-underdecomposition support.
A13 1.0/E: ranks by graph honesty and observability; rationale: strong graph-honesty support.
A14 1.0/E: rates test seams as a main axis; rationale: strong test-seam support.
A15 0.6/E: calls route_finish_check a sharp idea; rationale: moderate finish-validation support.
A16 -0.8/E: rejects recipe state until serializer compatibility is proven; rationale: strong recipe-caching rejection.

`qwen-3.6-range.md`

A1 1.0/E: final synthesis targets 5-6 nodes; rationale: strong moderate-granularity support.
A2 1.0/E: combines GPT-5.4 and GLM migration steps; rationale: strong staged-migration support.
A3 1.0/E: isolates LLM planning in llm_plan; rationale: strong LLM-isolation claim.
A4 1.0/E: recommends one decision_policy node; rationale: strong policy-node support.
A5 1.0/E: lists no edge mutation as a design rule; rationale: strong edge-purity claim.
A6 1.0/E: says loopbacks go to tick; rationale: strong canonical-loop support.
A7 1.0/E: says ask_user remains the single interrupt node; rationale: strong single-interrupt support.
A8 0.9/E: rejects _route and checkpoint pollution while allowing deliberate state additions; rationale: strong state discipline.
A9 0.7/E: includes calculator gate/acquisition policy in target graph; rationale: clear calculator support.
A10 1.0/E: recommends adding llm_failed; rationale: strong failure-provenance support.
A11 0.8/E: despite ranking Opus high for mapping quality, final target reduces node count; rationale: anti-overdecomposition support.
A12 0.6/E: ranks GLM conservative split high as a first PR but not final; rationale: moderate anti-underdecomposition support.
A13 1.0/E: frames Mermaid truthfulness as a central objective; rationale: strong graph-honesty support.
A14 1.0/E: emphasizes green staged tests; rationale: strong test-seam support.
A15 1.0/E: uniquely recommends finish_check; rationale: strong finish-validation support.
A16 0.5/E: recommends recipe state as a performance improvement; rationale: moderate recipe-caching support.

`qwen-3.7-max-range.md`

A1 1.0/E: states the sweet spot is 4-7 nodes and ranks GPT-5.4 first; rationale: strong right-sizing support.
A2 1.0/E: praises six-stage independent migration; rationale: strong staged-migration support.
A3 1.0/E: recommends llm_plan isolated from deterministic gates; rationale: strong LLM-isolation claim.
A4 1.0/E: endorses a single decision_policy/post-processing stage; rationale: strong policy-node support.
A5 1.0/E: rejects routing-state and edge mutation patterns; rationale: strong edge-purity support.
A6 1.0/E: recommends tick/guard loop entry; rationale: strong canonical-loop support.
A7 0.6/E: criticizes separate bootstrap nodes but also values bootstrap clarity; rationale: partial single-interrupt support.
A8 0.8/E: flags _route and checkpoint exclusion risk; rationale: clear state discipline.
A9 0.8/E: values acquisition and calculator gates; rationale: clear calculator support.
A10 0.6/E: praises decision_origin/llm_failed but does not make it the main shape; rationale: moderate failure-provenance support.
A11 1.0/E: rejects 10-node full decomposition as excessive; rationale: strong anti-overdecomposition support.
A12 0.8/E: says 2-node split remains incomplete; rationale: clear anti-underdecomposition support.
A13 1.0/E: uses architectural soundness and graph truthfulness as high-weight criteria; rationale: strong graph-honesty support.
A14 1.0/E: includes testability improvement as a criterion; rationale: strong test-seam support.
A15 0.8/E: identifies finish/check-completion validation as useful; rationale: clear finish-validation support.
A16 0.6/E: praises recipe state but notes serialization care; rationale: moderate recipe-caching support.

Aggregate Vector And Agreement

Equal weights produce the aggregate vector $\bar{s}_j = \operatorname{mean}_i(s_{ij})$ because all cells have $c_{ij} = 1$.

Aspect	Aggregate	Variance	Agreement signal
A1	0.970000	0.008100	Very strong consensus
A2	0.970000	0.008100	Very strong consensus
A3	0.970000	0.004100	Very strong consensus
A4	0.880000	0.023600	Strong consensus
A5	0.870000	0.024100	Strong consensus
A6	0.960000	0.004400	Very strong consensus
A7	0.620000	0.185600	Mixed
A8	0.800000	0.036000	Strong consensus
A9	0.790000	0.024900	Strong consensus
A10	0.570000	0.158100	Mixed
A11	0.950000	0.006500	Very strong consensus
A12	0.850000	0.016500	Strong consensus
A13	0.990000	0.000900	Very strong consensus
A14	0.980000	0.003600	Very strong consensus
A15	0.250000	0.150500	Weak, disputed, or sparse
A16	0.080000	0.399600	Highly disputed

Strongest aggregate claims are A13 graph honesty, A14 test seams, A1/A2/A3 right-sized staged LLM-isolated design, A6 canonical loop entry, and A11 rejection of over-decomposition.

The highest-variance claims are A16 recipe caching, A7 single interrupt versus explicit bootstrap nodes, A10 LLM-failure provenance, and A15 finish validation. These are the least stable consensus points.
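The aggregate and variance columns can be reproduced from the scoring matrix with the standard library. This is a sketch, not the tooling used for the report; it assumes the variance column is the population variance (divide by 10), which is what reproduces the tabulated values.

```python
from statistics import fmean, pvariance

# Rows copied from the scoring matrix, aspects A1..A16 in order.
SCORES = {
    "zdeepseek-4-pro-range.md": [1.0, 1.0, 1.0, 1.0, 0.8, 1.0, 0.5, 0.4, 0.8, 0.7, 1.0, 1.0, 1.0, 1.0, 0.0, 0.8],
    "gemini-3.1-pro-range.md": [1.0, 0.7, 0.8, 0.8, 0.5, 0.8, 0.0, 0.6, 0.4, 0.0, 1.0, 0.8, 0.9, 0.8, 0.0, 0.0],
    "glm-5.1-pro-range.md": [0.7, 1.0, 0.9, 0.5, 0.8, 0.9, 0.8, 0.9, 1.0, 0.7, 0.8, 0.7, 1.0, 1.0, 0.0, -0.4],
    "gpt-5.4-range.md": [1.0, 1.0, 1.0, 0.9, 1.0, 1.0, 1.0, 1.0, 0.9, 0.7, 1.0, 0.9, 1.0, 1.0, -0.2, -0.6],
    "gpt-5.5-range.md": [1.0, 1.0, 1.0, 0.8, 1.0, 1.0, 1.0, 1.0, 0.9, 1.0, 1.0, 0.9, 1.0, 1.0, 0.3, -0.7],
    "kimi-2.6-range.md": [1.0, 1.0, 1.0, 1.0, 0.8, 1.0, -0.3, 0.7, 0.9, 0.0, 0.9, 1.0, 1.0, 1.0, 0.0, 1.0],
    "mimo-2.5-pro-range.md": [1.0, 1.0, 1.0, 1.0, 0.8, 0.9, 0.6, 0.7, 0.7, 0.0, 1.0, 0.8, 1.0, 1.0, 0.0, 0.4],
    "opus-4.7-range.md": [1.0, 1.0, 1.0, 0.8, 1.0, 1.0, 1.0, 1.0, 0.8, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6, -0.8],
    "qwen-3.6-range.md": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.7, 1.0, 0.8, 0.6, 1.0, 1.0, 1.0, 0.5],
    "qwen-3.7-max-range.md": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6, 0.8, 0.8, 0.6, 1.0, 0.8, 1.0, 1.0, 0.8, 0.6],
}

# Transpose so each tuple holds one aspect's scores across all evaluators.
columns = list(zip(*SCORES.values()))

aggregate = [fmean(col) for col in columns]      # equal-weight mean per aspect
variance = [pvariance(col) for col in columns]   # population variance per aspect

for j, (mean_j, var_j) in enumerate(zip(aggregate, variance), start=1):
    print(f"A{j}\t{mean_j:.6f}\t{var_j:.6f}")
```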

Distance-To-Center Calculation

Aspect-distance weights are equal: $\alpha_j = 1/16 = 0.0625$.

Formula:

$$d_i = \sqrt{\sum_j \alpha_j \, (s_{ij} - \bar{s}_j)^2}$$

Clean ranking by weighted Euclidean distance to the aggregate center:

Rank	Evaluator	Distance
1	`mimo-2.5-pro-range.md`	0.183610
2	`qwen-3.7-max-range.md`	0.196119
3	`glm-5.1-pro-range.md`	0.205152
4	`zdeepseek-4-pro-range.md`	0.226578
5	`gpt-5.4-range.md`	0.237881
6	`gpt-5.5-range.md`	0.253155
7	`qwen-3.6-range.md`	0.274704
8	`gemini-3.1-pro-range.md`	0.283659
9	`opus-4.7-range.md`	0.287163
10	`kimi-2.6-range.md`	0.366691
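As a check on the ranking's endpoints, the distance formula can be applied to the closest and farthest evaluators. The sketch takes the center from the aggregate table and the two rows from the scoring matrix; the names `CENTER`, `ROWS`, and `distance_to_center` are illustrative.

```python
import math

# Aggregate vector (A1..A16) from the aggregate table.
CENTER = [0.97, 0.97, 0.97, 0.88, 0.87, 0.96, 0.62, 0.80,
          0.79, 0.57, 0.95, 0.85, 0.99, 0.98, 0.25, 0.08]

# Two rows from the scoring matrix: the closest and the farthest evaluator.
ROWS = {
    "mimo-2.5-pro-range.md": [1.0, 1.0, 1.0, 1.0, 0.8, 0.9, 0.6, 0.7,
                              0.7, 0.0, 1.0, 0.8, 1.0, 1.0, 0.0, 0.4],
    "kimi-2.6-range.md": [1.0, 1.0, 1.0, 1.0, 0.8, 1.0, -0.3, 0.7,
                          0.9, 0.0, 0.9, 1.0, 1.0, 1.0, 0.0, 1.0],
}

ALPHA = 1 / len(CENTER)  # equal aspect weight, 1/16


def distance_to_center(scores, center=CENTER):
    """Weighted Euclidean distance between one evaluator vector and the center."""
    return math.sqrt(sum(ALPHA * (s - c) ** 2 for s, c in zip(scores, center)))


for name, row in ROWS.items():
    print(f"{name}\t{distance_to_center(row):.6f}")
```

For `mimo-2.5-pro-range.md` the squared deviations sum to 0.5394, so the distance is sqrt(0.5394 / 16), which rounds to 0.183610.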

Adjacent-rank gaps (near ties are gaps <= 0.020000):

Adjacent pair	Gap
`mimo-2.5-pro-range.md` to `qwen-3.7-max-range.md`	0.012509
`qwen-3.7-max-range.md` to `glm-5.1-pro-range.md`	0.009034
`glm-5.1-pro-range.md` to `zdeepseek-4-pro-range.md`	0.021425
`zdeepseek-4-pro-range.md` to `gpt-5.4-range.md`	0.011303
`gpt-5.4-range.md` to `gpt-5.5-range.md`	0.015274
`gpt-5.5-range.md` to `qwen-3.6-range.md`	0.021549
`qwen-3.6-range.md` to `gemini-3.1-pro-range.md`	0.008955
`gemini-3.1-pro-range.md` to `opus-4.7-range.md`	0.003504
`opus-4.7-range.md` to `kimi-2.6-range.md`	0.079528
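The ordering and gap rules can be sketched against the published distances. Because the inputs here are the 6-decimal table values rather than the unrounded ones, the exact-value tie-break cannot be exercised; the filename fallback is shown only through the sort key, and the variable names are illustrative.

```python
# Distances to the aggregate center, copied from the ranking table (6 decimals).
DISTANCES = {
    "mimo-2.5-pro-range.md": 0.183610,
    "qwen-3.7-max-range.md": 0.196119,
    "glm-5.1-pro-range.md": 0.205152,
    "zdeepseek-4-pro-range.md": 0.226578,
    "gpt-5.4-range.md": 0.237881,
    "gpt-5.5-range.md": 0.253155,
    "qwen-3.6-range.md": 0.274704,
    "gemini-3.1-pro-range.md": 0.283659,
    "opus-4.7-range.md": 0.287163,
    "kimi-2.6-range.md": 0.366691,
}

NEAR_TIE = 0.02  # near-tie threshold from the rounding policy

# Sort by distance, then by filename as the lexicographic tie-break.
ranking = sorted(DISTANCES.items(), key=lambda kv: (kv[1], kv[0]))

# Gap between each evaluator and the next one in the ranking.
gaps = [(a[0], b[0], b[1] - a[1]) for a, b in zip(ranking, ranking[1:])]
near_ties = [(a, b, g) for a, b, g in gaps if g <= NEAR_TIE]

for a, b, g in gaps:
    flag = "near tie" if g <= NEAR_TIE else ""
    print(f"{a} to {b}\t{g:.6f}\t{flag}")
```

Under this threshold, six of the nine adjacent pairs are near ties, so most neighboring ranks should be read as weakly separated.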

Closest-to-center opinion: mimo-2.5-pro-range.md.

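This is a centrality result, not a claim that Mimo's own ranked recommendations were the most favored. It is closest because it strongly supports the shared mainstream claims and stays comparatively close to the aggregate on three of the four most disputed aspects (A7, A15, and A16). Its largest single deviation is on A10, where its unmentioned cell is neutral-filled to 0 against an aggregate of 0.570000.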

Pairwise Distance Matrix

Each value is the equal-aspect-weighted Euclidean distance between two evaluator vectors, using the same aspect weights (1/16 each) as the distance-to-center calculation.

Evaluator	zdeepseek	gemini	glm	gpt-5.4	gpt-5.5	kimi	mimo	opus	qwen-3.6	qwen-3.7
zdeepseek	0.000000	0.350892	0.379967	0.409268	0.443001	0.282843	0.225000	0.480885	0.347311	0.242384
gemini	0.350892	0.000000	0.372492	0.417582	0.465698	0.330719	0.242384	0.489260	0.501248	0.389711
glm	0.379967	0.372492	0.000000	0.182003	0.201556	0.510514	0.325960	0.258602	0.390512	0.369966
gpt-5.4	0.409268	0.417582	0.182003	0.000000	0.150000	0.555653	0.343693	0.223607	0.427931	0.409268
gpt-5.5	0.443001	0.465698	0.201556	0.150000	0.000000	0.605186	0.410030	0.086603	0.366572	0.384057
kimi	0.282843	0.330719	0.510514	0.555653	0.605186	0.000000	0.281736	0.636396	0.514174	0.360555
mimo	0.225000	0.242384	0.325960	0.343693	0.410030	0.281736	0.000000	0.446514	0.382426	0.263391
opus	0.480885	0.489260	0.258602	0.223607	0.086603	0.636396	0.446514	0.000000	0.363146	0.390512
qwen-3.6	0.347311	0.501248	0.390512	0.427931	0.366572	0.514174	0.382426	0.363146	0.000000	0.171391
qwen-3.7	0.242384	0.389711	0.369966	0.409268	0.384057	0.360555	0.263391	0.390512	0.171391	0.000000
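Individual cells of the pairwise matrix can be spot-checked from the scoring matrix rows. The sketch below uses three rows only; the function name `pairwise_distance` is illustrative.

```python
import math

# Three rows from the scoring matrix (aspects A1..A16).
GPT_54 = [1.0, 1.0, 1.0, 0.9, 1.0, 1.0, 1.0, 1.0, 0.9, 0.7, 1.0, 0.9, 1.0, 1.0, -0.2, -0.6]
GPT_55 = [1.0, 1.0, 1.0, 0.8, 1.0, 1.0, 1.0, 1.0, 0.9, 1.0, 1.0, 0.9, 1.0, 1.0, 0.3, -0.7]
OPUS_47 = [1.0, 1.0, 1.0, 0.8, 1.0, 1.0, 1.0, 1.0, 0.8, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6, -0.8]


def pairwise_distance(x, y):
    """Equal-aspect-weighted Euclidean distance between two evaluator vectors."""
    alpha = 1 / len(x)
    return math.sqrt(sum(alpha * (a - b) ** 2 for a, b in zip(x, y)))


print(f"gpt-5.5 vs opus:    {pairwise_distance(GPT_55, OPUS_47):.6f}")
print(f"gpt-5.4 vs gpt-5.5: {pairwise_distance(GPT_54, GPT_55):.6f}")
print(f"gpt-5.4 vs opus:    {pairwise_distance(GPT_54, OPUS_47):.6f}")
```

The gpt-5.5 and opus rows differ only on A9, A12, A15, and A16, which is why that pair has the smallest off-diagonal distance in the matrix.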

Medoid Calculation

Medoid total for evaluator i, where k ranges over all 10 evaluators and 0.1 is the equal evaluator weight:

$$M_i = \sum_k 0.1 \cdot d(i, k)$$

Clean medoid centrality ranking:

Rank	Evaluator	Weighted total distance
1	`mimo-2.5-pro-range.md`	0.292114
2	`qwen-3.7-max-range.md`	0.298124
3	`glm-5.1-pro-range.md`	0.299157
4	`gpt-5.5-range.md`	0.311270
5	`gpt-5.4-range.md`	0.311900
6	`zdeepseek-4-pro-range.md`	0.316155
7	`opus-4.7-range.md`	0.337552
8	`qwen-3.6-range.md`	0.346471
9	`gemini-3.1-pro-range.md`	0.355999
10	`kimi-2.6-range.md`	0.407778

Adjacent-rank gaps (near ties are gaps <= 0.020000):

Adjacent pair	Gap
`mimo-2.5-pro-range.md` to `qwen-3.7-max-range.md`	0.006010
`qwen-3.7-max-range.md` to `glm-5.1-pro-range.md`	0.001034
`glm-5.1-pro-range.md` to `gpt-5.5-range.md`	0.012113
`gpt-5.5-range.md` to `gpt-5.4-range.md`	0.000630
`gpt-5.4-range.md` to `zdeepseek-4-pro-range.md`	0.004255
`zdeepseek-4-pro-range.md` to `opus-4.7-range.md`	0.021397
`opus-4.7-range.md` to `qwen-3.6-range.md`	0.008919
`qwen-3.6-range.md` to `gemini-3.1-pro-range.md`	0.009527
`gemini-3.1-pro-range.md` to `kimi-2.6-range.md`	0.051779

Medoid opinion: mimo-2.5-pro-range.md.

The same report is both closest to the aggregate center and the medoid by total pairwise distance.
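The medoid totals follow directly from the pairwise distance matrix. The sketch below uses the 6-decimal table values, so totals can differ from the published ones in the last digit; the short evaluator labels and variable names are illustrative.

```python
NAMES = ["zdeepseek", "gemini", "glm", "gpt-5.4", "gpt-5.5",
         "kimi", "mimo", "opus", "qwen-3.6", "qwen-3.7"]

# Pairwise distance matrix copied from the table, rows and columns in NAMES order.
D = [
    [0.000000, 0.350892, 0.379967, 0.409268, 0.443001, 0.282843, 0.225000, 0.480885, 0.347311, 0.242384],
    [0.350892, 0.000000, 0.372492, 0.417582, 0.465698, 0.330719, 0.242384, 0.489260, 0.501248, 0.389711],
    [0.379967, 0.372492, 0.000000, 0.182003, 0.201556, 0.510514, 0.325960, 0.258602, 0.390512, 0.369966],
    [0.409268, 0.417582, 0.182003, 0.000000, 0.150000, 0.555653, 0.343693, 0.223607, 0.427931, 0.409268],
    [0.443001, 0.465698, 0.201556, 0.150000, 0.000000, 0.605186, 0.410030, 0.086603, 0.366572, 0.384057],
    [0.282843, 0.330719, 0.510514, 0.555653, 0.605186, 0.000000, 0.281736, 0.636396, 0.514174, 0.360555],
    [0.225000, 0.242384, 0.325960, 0.343693, 0.410030, 0.281736, 0.000000, 0.446514, 0.382426, 0.263391],
    [0.480885, 0.489260, 0.258602, 0.223607, 0.086603, 0.636396, 0.446514, 0.000000, 0.363146, 0.390512],
    [0.347311, 0.501248, 0.390512, 0.427931, 0.366572, 0.514174, 0.382426, 0.363146, 0.000000, 0.171391],
    [0.242384, 0.389711, 0.369966, 0.409268, 0.384057, 0.360555, 0.263391, 0.390512, 0.171391, 0.000000],
]

WEIGHT = 0.1  # equal evaluator weight

# Weighted total distance from each evaluator to all evaluators (self-distance is 0).
totals = {name: WEIGHT * sum(row) for name, row in zip(NAMES, D)}

# The medoid is the evaluator with the smallest weighted total distance.
medoid = min(totals, key=totals.get)

for name in sorted(totals, key=totals.get):
    print(f"{name}\t{totals[name]:.6f}")
print("medoid:", medoid)
```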

Matrix-Derived Overall Opinion

The aggregate evaluator opinion is strongly positive toward a moderate planner-graph refactor that makes the hidden planner lifecycle visible without turning every rule into a checkpointed node. The consensus favors a staged helper-first migration, a canonical loop-entry phase, a thin LLM planning node, and a node-owned post-decision policy phase. The evaluators strongly agree that routing edges must remain pure, action loopbacks should re-enter at the top of the planning loop, and tests should move toward phase-specific seams.

The aggregate is also strongly negative toward both extremes: a 10+ node policy lattice with excessive checkpoint writes and a minimal 2-3 node split that leaves most deterministic policy hidden in a renamed god node. The consensus synthesis points toward a GPT-5.4-like or DeepSeek-like moderate pipeline, with GPT-5.5-style state and failure-provenance caution where explicitly supported by the evaluator reports.

The unresolved consensus tensions are visible in the high-variance aspects. Recipe caching in planner state is the most disputed claim: some evaluators treat it as useful deduplication, while others warn that FieldAcquisition-like objects may be unsafe for checkpoint serialization. Single-interrupt handling is also mixed because some reports value explicit region/currency graph nodes, while others prefer preserving the existing ask_user / observe_user path. Finish validation has weak aggregate support because it is a sharp but sparsely mentioned idea. LLM-failure provenance is positively supported but not universal.

The nearest existing opinion is mimo-2.5-pro-range.md, and the medoid is also mimo-2.5-pro-range.md. This means it is the most central existing evaluator report under the selected aspect-space representation and equal weights.
