Evaluator Consensus Ranking Report
Scope
This report applies docs/planner-graph-ref/analyse/analyse-evaluation-method.md to the 10 evaluator reports in this directory. The method document is not treated as an evaluator.
Evaluator inputs:
- zdeepseek-4-pro-range.md
- gemini-3.1-pro-range.md
- glm-5.1-pro-range.md
- gpt-5.4-range.md
- gpt-5.5-range.md
- kimi-2.6-range.md
- mimo-2.5-pro-range.md
- opus-4.7-range.md
- qwen-3.6-range.md
- qwen-3.7-max-range.md
The plan artifact names deepseek-4-pro-range.md, but the on-disk DeepSeek evaluator file is zdeepseek-4-pro-range.md; this report uses the on-disk source filename.
Method And Scoring Rules
Each evaluator report is represented as an aspect vector x_i = (s_i1, ..., s_i16) with scores in [-1, 1]. Equal evaluator weights are used: w_i = 0.1.
Scoring scale:
| Score | Meaning |
|---|---|
| 1.0 | Strong support for the aspect. |
| 0.8 | Clear support with qualification or less centrality. |
| 0.5 | Moderate or partial support. |
| 0.3 | Weak positive support. |
| 0.0 | Neutral or unmentioned. |
| Negative values | Caution against or rejection of the aspect, with magnitude matching strength. |
Missingness policy: unmentioned aspects are scored 0 by user-selected override. This intentionally differs from the method document's caution that missing aspects should not always pull toward zero. Here, after neutral filling, c_ij = 1 for every evaluator-aspect cell; mention status is tracked separately for audit only.
Mention-status codes:
| Code | Meaning |
|---|---|
| E | Explicit claim or explicit verdict. |
| I | Implied by ranking, criticism, or recommended hybrid. |
| U | Unmentioned; neutral 0 by missingness override. |
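The missingness override and the mention-status codes can be illustrated with a short sketch. This is not the report's own tooling: the function name `apply_missingness_override` and the `claims` input shape are hypothetical, chosen only to show how an unmentioned aspect becomes a neutral 0.0 cell with status U while mentioned cells keep their score and their E or I code.

```python
# Hypothetical helper: the report does not publish its scoring code.
NEUTRAL_SCORE = 0.0


def apply_missingness_override(claims, aspects):
    """Return (scores, statuses) with unmentioned aspects filled as 0.0 / 'U'.

    claims maps an aspect code to a (score, status) pair, where status is
    'E' (explicit) or 'I' (implied). Aspects absent from claims count as
    unmentioned.
    """
    scores, statuses = {}, {}
    for aspect in aspects:
        if aspect in claims:
            score, status = claims[aspect]
            if status not in ("E", "I"):
                raise ValueError(f"{aspect}: mentioned cells must be E or I")
            if not -1.0 <= score <= 1.0:
                raise ValueError(f"{aspect}: score {score} is outside [-1, 1]")
        else:
            # Neutral fill; the U status is kept for audit only.
            score, status = NEUTRAL_SCORE, "U"
        scores[aspect] = score
        statuses[aspect] = status
    return scores, statuses


# Three cells from the gemini-3.1-pro-range.md row:
# A5 is implied, A7 is unmentioned, A13 is explicit.
claims = {"A5": (0.5, "I"), "A13": (0.9, "E")}
scores, statuses = apply_missingness_override(claims, ["A5", "A7", "A13"])
print(scores)
print(statuses)
```

Keeping the status in a separate mapping mirrors the report's choice to track mention status for audit without letting it change the numeric cell.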
Normalization decisions:
- Clearly equivalent terms were normalized: `adjust`, `decision_policy`, `enforce_policy`, `correct_policy`, and post-decision normalization all map to A4 when the evaluator supports a node-owned correction stage.
- Unclear equivalents were kept separate: single interrupt handling is A7, checkpoint-safe state is A8, and recipe caching is A16.
- Post-LLM correction split into many nodes is scored positive for A4 only when the evaluator endorsed it as node-owned correction, but the same text can still score high on A11 if it rejected graph over-decomposition.
- No independent planner-architecture judgment is added. Scores reflect only evaluator-report claims.
Ambiguity result: 0 / 160 = 0.000% unresolved ambiguous cells. No forced cells were needed.
Rounding policy: calculations used the full matrix values below. Aggregate, variance, and distance tables show 6 decimals. Ties are broken by exact unrounded value, then by lexicographic filename if the exact values are equal. Near ties are adjacent-rank gaps <= 0.020000.
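The tie-break and near-tie rules can be sketched as a small sorting routine. The function name `rank_with_near_ties` and the toy filenames and values below are hypothetical and are not taken from the report's data; only the 0.020000 threshold and the sort order (exact value, then filename) come from the policy above.

```python
NEAR_TIE_GAP = 0.020000  # threshold stated in the rounding policy


def rank_with_near_ties(values):
    """Rank by exact value, then filename; flag near ties with the next rank.

    values maps a filename to its unrounded distance. Each returned row is
    (rank, filename, value, gap_to_next, is_near_tie). The last row has no
    successor, so its gap is None and its flag is False.
    """
    ordered = sorted(values.items(), key=lambda item: (item[1], item[0]))
    rows = []
    for index, (name, value) in enumerate(ordered):
        if index + 1 < len(ordered):
            gap = ordered[index + 1][1] - value
            near = gap <= NEAR_TIE_GAP
        else:
            gap, near = None, False
        rows.append((index + 1, name, value, gap, near))
    return rows


# Toy input (hypothetical): a.md and b.md are exactly tied, so the
# filename decides their order.
toy = {"b.md": 0.10, "a.md": 0.10, "c.md": 0.115, "d.md": 0.20}
rows = rank_with_near_ties(toy)
for row in rows:
    print(row)
```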
Aspect Taxonomy
| Code | Aspect | Claim direction represented by positive scores |
|---|---|---|
| A1 | Right-sized moderate decomposition | Prefer a 4-7 phase graph over minimal or atomized rewrites. |
| A2 | Staged helper-first migration | Extract helpers and ship green stages before graph surgery. |
| A3 | LLM-only planning node | Isolate prompt construction and structured LLM output from deterministic policy. |
| A4 | Node-owned post-decision policy | Put redirects, caps, and calculation adjustment in a node-owned policy stage, not inline or in edges. |
| A5 | Pure routing edges | Conditional edges inspect state and route only; they do not mutate decisions. |
| A6 | Canonical loop-entry phase | Action nodes re-enter through one top-of-loop tick/guard/precheck phase. |
| A7 | Single interrupt surface | Preserve ask_user / observe_user as the interrupt path instead of adding special bootstrap observe nodes. |
| A8 | Checkpoint-safe state discipline | Minimize new state, avoid transient route fields, and add only serializer-safe state. |
| A9 | Calculator lifecycle visibility | Preserve or expose calculator caps, success, blocked recovery, and finish-to-calculate policy. |
| A10 | LLM failure provenance | Track llm_failed or decision_origin to avoid failed-LLM recovery loops. |
| A11 | Reject over-decomposition | Avoid chatty 10+ node graphs and excessive PostgresSaver checkpoint hops. |
| A12 | Reject under-decomposition | Avoid leaving a renamed 150-200 line god-node-lite as the target state. |
| A13 | Graph honesty and observability | Make Mermaid/LangGraph topology reflect the real planner lifecycle. |
| A14 | Test seams | Make node boundaries unit-testable without full LLM orchestration. |
| A15 | Finish validation | Add or value a finish-check / premature-finish validation step. |
| A16 | Recipe caching in state | Support caching dynamic acquisition recipes in planner state. |
Scoring Matrix
| Evaluator | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| zdeepseek-4-pro-range.md | 1.0 | 1.0 | 1.0 | 1.0 | 0.8 | 1.0 | 0.5 | 0.4 | 0.8 | 0.7 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.8 |
| gemini-3.1-pro-range.md | 1.0 | 0.7 | 0.8 | 0.8 | 0.5 | 0.8 | 0.0 | 0.6 | 0.4 | 0.0 | 1.0 | 0.8 | 0.9 | 0.8 | 0.0 | 0.0 |
| glm-5.1-pro-range.md | 0.7 | 1.0 | 0.9 | 0.5 | 0.8 | 0.9 | 0.8 | 0.9 | 1.0 | 0.7 | 0.8 | 0.7 | 1.0 | 1.0 | 0.0 | -0.4 |
| gpt-5.4-range.md | 1.0 | 1.0 | 1.0 | 0.9 | 1.0 | 1.0 | 1.0 | 1.0 | 0.9 | 0.7 | 1.0 | 0.9 | 1.0 | 1.0 | -0.2 | -0.6 |
| gpt-5.5-range.md | 1.0 | 1.0 | 1.0 | 0.8 | 1.0 | 1.0 | 1.0 | 1.0 | 0.9 | 1.0 | 1.0 | 0.9 | 1.0 | 1.0 | 0.3 | -0.7 |
| kimi-2.6-range.md | 1.0 | 1.0 | 1.0 | 1.0 | 0.8 | 1.0 | -0.3 | 0.7 | 0.9 | 0.0 | 0.9 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| mimo-2.5-pro-range.md | 1.0 | 1.0 | 1.0 | 1.0 | 0.8 | 0.9 | 0.6 | 0.7 | 0.7 | 0.0 | 1.0 | 0.8 | 1.0 | 1.0 | 0.0 | 0.4 |
| opus-4.7-range.md | 1.0 | 1.0 | 1.0 | 0.8 | 1.0 | 1.0 | 1.0 | 1.0 | 0.8 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.6 | -0.8 |
| qwen-3.6-range.md | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.9 | 0.7 | 1.0 | 0.8 | 0.6 | 1.0 | 1.0 | 1.0 | 0.5 |
| qwen-3.7-max-range.md | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.6 | 0.8 | 0.8 | 0.6 | 1.0 | 0.8 | 1.0 | 1.0 | 0.8 | 0.6 |
Mention Status Matrix
| Evaluator | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| zdeepseek-4-pro-range.md | E | E | E | E | E | E | I | E | E | E | E | E | E | E | U | E |
| gemini-3.1-pro-range.md | E | E | E | E | I | E | U | I | I | U | E | E | E | E | U | U |
| glm-5.1-pro-range.md | E | E | E | E | E | E | E | E | E | E | E | E | E | E | U | E |
| gpt-5.4-range.md | E | E | E | E | E | E | E | E | E | E | E | E | E | E | I | E |
| gpt-5.5-range.md | E | E | E | E | E | E | E | E | E | E | E | E | E | E | I | E |
| kimi-2.6-range.md | E | E | E | E | E | E | E | E | E | U | E | E | E | E | U | E |
| mimo-2.5-pro-range.md | E | E | E | E | E | E | I | E | E | U | E | E | E | E | U | I |
| opus-4.7-range.md | E | E | E | E | E | E | E | E | E | E | E | E | E | E | E | E |
| qwen-3.6-range.md | E | E | E | E | E | E | E | E | E | E | E | E | E | E | E | E |
| qwen-3.7-max-range.md | E | E | E | E | E | E | E | E | E | E | E | E | E | E | E | E |
Cell Evidence Ledger
Each item lists all 16 cell rationales for the source evaluator. U cells are neutral by the missingness override.
zdeepseek-4-pro-range.md
- A1 1.0/E: calls the 4-node DeepSeek approach the "Goldilocks" solution; rationale: strong support for moderate granularity.
- A2 1.0/E: praises a helper-first migration with verification gates; rationale: supports staged rollout.
- A3 1.0/E: says the LLM call should be isolated in a thin node; rationale: direct LLM-isolation claim.
- A4 1.0/E: identifies `adjust` as the strongest insight and canonical normalization stage; rationale: strong policy-node support.
- A5 0.8/E: rejects redirect logic in edge functions when discussing Qwen-3.6; rationale: supports pure routing edges.
- A6 1.0/E: requires loop-back through a canonical guard entry; rationale: strong loop-entry support.
- A7 0.5/I: criticizes extra bootstrap-node variants but also accepts structural bootstrap visibility; rationale: partial support for single interrupt handling.
- A8 0.4/E: criticizes `_route` fields but endorses recipe state; rationale: mixed checkpoint discipline.
- A9 0.8/E: values calculator lifecycle and `_adjust_calculation_decision` reuse; rationale: supports calculator preservation.
- A10 0.7/E: highlights `llm_failed` / origin as a useful safeguard in other proposals; rationale: supports failure provenance.
- A11 1.0/E: rejects 10+ node proposals as over-engineered; rationale: strong anti-overdecomposition claim.
- A12 1.0/E: says 2-3 node proposals leave too much hidden logic; rationale: strong anti-underdecomposition claim.
- A13 1.0/E: centers the hidden-routing problem and graph honesty; rationale: strong observability claim.
- A14 1.0/E: repeatedly rates proposals by test seams and isolated nodes; rationale: strong testability claim.
- A15 0.0/U: no finish-validation claim; rationale: neutral by missingness.
- A16 0.8/E: calls `recipes` in state a practical cache; rationale: supports recipe caching.
gemini-3.1-pro-range.md
- A1 1.0/E: names GPT-5.4 and DeepSeek as top phased-pipeline approaches; rationale: strong moderate-granularity support.
- A2 0.7/E: values testable phases but gives limited migration detail; rationale: moderate staged-migration support.
- A3 0.8/E: endorses separating preparation/rules from LLM planning; rationale: supports LLM isolation.
- A4 0.8/E: recommends `decision_policy`/DeepSeek-style post-policy over hidden rewrites; rationale: supports policy node.
- A5 0.5/I: says state mutations should happen in designated nodes; rationale: implied edge-purity support.
- A6 0.8/E: notes loopbacks must be handled carefully through the phase entry; rationale: supports canonical re-entry.
- A7 0.0/U: no single-interrupt claim; rationale: neutral by missingness.
- A8 0.6/I: values idempotent state mutation and edge readability; rationale: implied checkpoint discipline.
- A9 0.4/I: mentions acquisition/calculator gate concerns but not as a central criterion; rationale: weak calculator support.
- A10 0.0/U: no LLM-failure provenance claim; rationale: neutral by missingness.
- A11 1.0/E: explicitly rejects graph noise and extreme decomposition; rationale: strong anti-overdecomposition claim.
- A12 0.8/E: says conservative 2-3 node splits leave rewrites hidden; rationale: supports avoiding under-decomposition.
- A13 0.9/E: frames the graph diagram as misleading today; rationale: strong graph-honesty support.
- A14 0.8/E: ties phase boundaries to unit-testability; rationale: supports test seams.
- A15 0.0/U: no finish-validation claim; rationale: neutral by missingness.
- A16 0.0/U: no recipe-caching claim; rationale: neutral by missingness.
glm-5.1-pro-range.md
- A1 0.7/E: ranks GPT-5.5 highest despite higher node count but still rejects extremes; rationale: moderate right-sizing support.
- A2 1.0/E: praises staged extraction and migration discipline; rationale: strong staged-migration support.
- A3 0.9/E: treats LLM planning as a separable phase; rationale: strong LLM-isolation support.
- A4 0.5/E: favors separate normalization/retry/calculator policy more than one node; rationale: partial support for node-owned policy.
- A5 0.8/E: rejects policy in routing functions and `_route` patterns; rationale: supports pure edges.
- A6 0.9/E: supports one iteration lifecycle entry; rationale: strong canonical-loop support.
- A7 0.8/E: says `ask_user` should remain the interrupt surface; rationale: supports single interrupt handling.
- A8 0.9/E: emphasizes checkpoint safety and serializer-compatible state; rationale: strong state discipline.
- A9 1.0/E: ranks calculator-visible `maybe_calculate` highly; rationale: strong calculator-lifecycle support.
- A10 0.7/E: values `llm_failed` / origin handling; rationale: supports failure provenance.
- A11 0.8/E: criticizes over-modeled graphs; rationale: anti-overdecomposition support.
- A12 0.7/E: treats GLM-style minimal split as safe but not final; rationale: moderate anti-underdecomposition support.
- A13 1.0/E: scores graph visibility as central; rationale: strong graph-honesty claim.
- A14 1.0/E: maps tests to new node targets; rationale: strong test-seam support.
- A15 0.0/U: no finish-validation claim; rationale: neutral by missingness.
- A16 -0.4/E: warns against cached recipe/state additions unless proven safe; rationale: mild rejection of recipe caching.
gpt-5.4-range.md
- A1 1.0/E: recommends the GPT-5.4 moderate phased graph as target; rationale: strong right-sizing support.
- A2 1.0/E: says helper extraction and staged migration are the best playbook; rationale: strong staged-migration support.
- A3 1.0/E: says the LLM node should be prompt build plus structured call only; rationale: strong LLM-isolation claim.
- A4 0.9/E: recommends one `decision_policy` node; rationale: strong unified-policy support.
- A5 1.0/E: explicitly says edges must be side-effect free; rationale: strong edge-purity claim.
- A6 1.0/E: says only `tick` should increment iterations and loopbacks return to it; rationale: strong loop-entry support.
- A7 1.0/E: preserves `ask_user` as the single interrupt node; rationale: strong single-interrupt support.
- A8 1.0/E: rejects route tags and checkpoint-risky state; rationale: strong state discipline.
- A9 0.9/E: requires calculator policy to remain complete in `acquisition_gate`/`decision_policy`; rationale: strong calculator support.
- A10 0.7/E: notes LLM-failure handoff needs care; rationale: supports provenance but not as central.
- A11 1.0/E: rejects Opus/Qwen atomized graphs; rationale: strong anti-overdecomposition claim.
- A12 0.9/E: says minimal splits leave policy hidden; rationale: strong anti-underdecomposition claim.
- A13 1.0/E: makes graph honesty a main criterion; rationale: strong observability support.
- A14 1.0/E: emphasizes phase-specific unit tests; rationale: strong test-seam support.
- A15 -0.2/I: treats finish-check as a back-loaded validation rather than core shape; rationale: mild caution.
- A16 -0.6/E: warns not to cache recipes without serialization proof; rationale: rejects recipe caching for first pass.
gpt-5.5-range.md
- A1 1.0/E: recommends GPT-5.4 spine with GPT-5.5 details; rationale: strong moderate-granularity support.
- A2 1.0/E: endorses six green migration stages; rationale: strong staged-migration support.
- A3 1.0/E: says LLM planning should be thin and isolated; rationale: strong LLM-isolation claim.
- A4 0.8/E: supports node-owned normalization, but accepts merging post-LLM nodes for first pass; rationale: clear policy-node support.
- A5 1.0/E: rejects mutation in edge functions; rationale: strong edge-purity support.
- A6 1.0/E: requires loopbacks to top entry; rationale: strong canonical-loop support.
- A7 1.0/E: keeps `ask_user` as single interrupt surface; rationale: strong single-interrupt support.
- A8 1.0/E: emphasizes checkpoint-owned state and serializer-safe additions; rationale: strong state discipline.
- A9 0.9/E: values calculator routing and blocked-loop preservation; rationale: strong calculator support.
- A10 1.0/E: explicitly recommends `llm_failed` or `decision_origin`; rationale: strong failure-provenance support.
- A11 1.0/E: rejects atomized corrector chains as too chatty; rationale: strong anti-overdecomposition support.
- A12 0.9/E: says under-specified 2-3 node plans do not solve post-LLM policy; rationale: strong anti-underdecomposition support.
- A13 1.0/E: treats graph honesty as a core criterion; rationale: strong observability support.
- A14 1.0/E: gives test-target mapping; rationale: strong test-seam support.
- A15 0.3/I: values completion correctness but does not make finish-check central; rationale: weak positive.
- A16 -0.7/E: explicitly says not to store recipes unless serialization is proven; rationale: clear recipe-caching rejection.
kimi-2.6-range.md
- A1 1.0/E: recommends a 5-node hybrid from top proposals; rationale: strong right-sized support.
- A2 1.0/E: requires staged migration and helper extraction; rationale: strong staged-migration support.
- A3 1.0/E: isolates the LLM call in `plan`/`decide`; rationale: strong LLM-isolation support.
- A4 1.0/E: strongly supports `adjust`/`enforce_policy` as a post-processing stage; rationale: strong policy-node support.
- A5 0.8/E: rejects redirect-in-edge anti-pattern; rationale: supports edge purity.
- A6 1.0/E: loops all action nodes back to guard/tick; rationale: strong canonical-loop support.
- A7 -0.3/E: praises dedicated `ask_region`/`ask_currency` node visibility; rationale: mild rejection of a single generic interrupt shape.
- A8 0.7/E: discusses checkpointer overhead and avoids `_route`; rationale: supports state discipline with some recipe-cache optimism.
- A9 0.9/E: values `calc_gate` and calculator adjustment routing; rationale: strong calculator support.
- A10 0.0/U: no failure-provenance claim; rationale: neutral by missingness.
- A11 0.9/E: rejects Opus/Qwen over-decomposition; rationale: strong anti-overdecomposition support.
- A12 1.0/E: says minimal splits leave too much hidden; rationale: strong anti-underdecomposition support.
- A13 1.0/E: repeatedly emphasizes truthful Mermaid diagrams; rationale: strong graph-honesty support.
- A14 1.0/E: makes testability a high-weight criterion; rationale: strong test-seam support.
- A15 0.0/U: no finish-validation claim; rationale: neutral by missingness.
- A16 1.0/E: endorses recipe state as useful deduplication; rationale: strong recipe-caching support.
mimo-2.5-pro-range.md
- A1 1.0/E: ranks DeepSeek's 4-node split first; rationale: strong moderate-granularity support.
- A2 1.0/E: praises helper-first migration; rationale: strong staged-migration support.
- A3 1.0/E: requires LLM-only plan/decide separation; rationale: strong LLM-isolation support.
- A4 1.0/E: calls `adjust` the clean post-LLM stage; rationale: strong policy-node support.
- A5 0.8/E: rejects redirect-in-routing and `_route` patterns; rationale: supports pure edges.
- A6 0.9/E: endorses action loopbacks to guard/tick; rationale: strong canonical-loop support.
- A7 0.6/I: recommends avoiding separate region/currency nodes; rationale: partial single-interrupt support.
- A8 0.7/E: values minimal state changes and flags transient route fields; rationale: supports state discipline.
- A9 0.7/E: discusses calculator checks and adjustment but not as top criterion; rationale: clear but moderate support.
- A10 0.0/U: no failure-provenance claim; rationale: neutral by missingness.
- A11 1.0/E: ranks over-decomposed Opus/Qwen low; rationale: strong anti-overdecomposition support.
- A12 0.8/E: says minimal 3-node proposals are still incomplete; rationale: supports anti-underdecomposition.
- A13 1.0/E: centers graph routing honesty; rationale: strong observability support.
- A14 1.0/E: scores proposals on testability; rationale: strong test-seam support.
- A15 0.0/U: no finish-validation claim; rationale: neutral by missingness.
- A16 0.4/I: treats recipe caching as sensible in the favored DeepSeek plan; rationale: weak to moderate recipe-caching support.
opus-4.7-range.md
- A1 1.0/E: recommends the GPT-5.4 six-phase shape; rationale: strong right-sized support.
- A2 1.0/E: gives a six-commit green migration; rationale: strong staged-migration support.
- A3 1.0/E: says LLM node should shrink to prompt plus structured call; rationale: strong LLM-isolation claim.
- A4 0.8/E: supports `decision_policy`, while noting possible helper substructure; rationale: clear policy-node support.
- A5 1.0/E: calls redirect-in-edge the biggest anti-pattern; rationale: strong edge-purity claim.
- A6 1.0/E: says all loopbacks re-enter at `tick`; rationale: strong canonical-loop support.
- A7 1.0/E: keeps `ask_user` as single interrupt and rejects extra observe nodes; rationale: strong single-interrupt support.
- A8 1.0/E: warns about state surface, route fields, and serializer safety; rationale: strong state discipline.
- A9 0.8/E: values calculator routing and failure safeguards; rationale: clear calculator support.
- A10 1.0/E: explicitly recommends `llm_failed`/`decision_origin`; rationale: strong failure-provenance support.
- A11 1.0/E: rejects 13-node corrector chains as operationally heavy; rationale: strong anti-overdecomposition support.
- A12 1.0/E: says three-node proposals do not solve the problem; rationale: strong anti-underdecomposition support.
- A13 1.0/E: ranks by graph honesty and observability; rationale: strong graph-honesty support.
- A14 1.0/E: rates test seams as a main axis; rationale: strong test-seam support.
- A15 0.6/E: calls `route_finish_check` a sharp idea; rationale: moderate finish-validation support.
- A16 -0.8/E: rejects recipe state until serializer compatibility is proven; rationale: strong recipe-caching rejection.
qwen-3.6-range.md
- A1 1.0/E: final synthesis targets 5-6 nodes; rationale: strong moderate-granularity support.
- A2 1.0/E: combines GPT-5.4 and GLM migration steps; rationale: strong staged-migration support.
- A3 1.0/E: isolates LLM planning in `llm_plan`; rationale: strong LLM-isolation claim.
- A4 1.0/E: recommends one `decision_policy` node; rationale: strong policy-node support.
- A5 1.0/E: lists no edge mutation as a design rule; rationale: strong edge-purity claim.
- A6 1.0/E: says loopbacks go to `tick`; rationale: strong canonical-loop support.
- A7 1.0/E: says `ask_user` remains the single interrupt node; rationale: strong single-interrupt support.
- A8 0.9/E: rejects `_route` and checkpoint pollution while allowing deliberate state additions; rationale: strong state discipline.
- A9 0.7/E: includes calculator gate/acquisition policy in target graph; rationale: clear calculator support.
- A10 1.0/E: recommends adding `llm_failed`; rationale: strong failure-provenance support.
- A11 0.8/E: despite ranking Opus high for mapping quality, final target reduces node count; rationale: anti-overdecomposition support.
- A12 0.6/E: ranks GLM conservative split high as a first PR but not final; rationale: moderate anti-underdecomposition support.
- A13 1.0/E: frames Mermaid truthfulness as a central objective; rationale: strong graph-honesty support.
- A14 1.0/E: emphasizes green staged tests; rationale: strong test-seam support.
- A15 1.0/E: uniquely recommends `finish_check`; rationale: strong finish-validation support.
- A16 0.5/E: recommends recipe state as a performance improvement; rationale: moderate recipe-caching support.
qwen-3.7-max-range.md
- A1 1.0/E: states the sweet spot is 4-7 nodes and ranks GPT-5.4 first; rationale: strong right-sizing support.
- A2 1.0/E: praises six-stage independent migration; rationale: strong staged-migration support.
- A3 1.0/E: recommends `llm_plan` isolated from deterministic gates; rationale: strong LLM-isolation claim.
- A4 1.0/E: endorses a single `decision_policy`/post-processing stage; rationale: strong policy-node support.
- A5 1.0/E: rejects routing-state and edge mutation patterns; rationale: strong edge-purity support.
- A6 1.0/E: recommends tick/guard loop entry; rationale: strong canonical-loop support.
- A7 0.6/E: criticizes separate bootstrap nodes but also values bootstrap clarity; rationale: partial single-interrupt support.
- A8 0.8/E: flags `_route` and checkpoint exclusion risk; rationale: strong state discipline.
- A9 0.8/E: values acquisition and calculator gates; rationale: clear calculator support.
- A10 0.6/E: praises `decision_origin`/`llm_failed` but does not make it the main shape; rationale: moderate failure-provenance support.
- A11 1.0/E: rejects 10-node full decomposition as excessive; rationale: strong anti-overdecomposition support.
- A12 0.8/E: says 2-node split remains incomplete; rationale: strong anti-underdecomposition support.
- A13 1.0/E: uses architectural soundness and graph truthfulness as high-weight criteria; rationale: strong graph-honesty support.
- A14 1.0/E: includes testability improvement as a criterion; rationale: strong test-seam support.
- A15 0.8/E: identifies finish/check-completion validation as useful; rationale: strong finish-validation support.
- A16 0.6/E: praises recipe state but notes serialization care; rationale: moderate recipe-caching support.
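Aggregate Vector And Agreement
Because all cells have c_ij = 1 and every evaluator weight is w_i = 0.1, the aggregate vector reduces to the plain column mean: $\bar{s}_j = \sum_i w_i s_{ij} = \frac{1}{10}\sum_{i=1}^{10} s_{ij}$. The Variance column is the matching equal-weight (population) variance, $\sum_i w_i (s_{ij} - \bar{s}_j)^2$.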
| Aspect | Aggregate | Variance | Agreement signal |
|---|---|---|---|
| A1 | 0.970000 | 0.008100 | Very strong consensus |
| A2 | 0.970000 | 0.008100 | Very strong consensus |
| A3 | 0.970000 | 0.004100 | Very strong consensus |
| A4 | 0.880000 | 0.023600 | Strong consensus |
| A5 | 0.870000 | 0.024100 | Strong consensus |
| A6 | 0.960000 | 0.004400 | Very strong consensus |
| A7 | 0.620000 | 0.185600 | Mixed |
| A8 | 0.800000 | 0.036000 | Strong consensus |
| A9 | 0.790000 | 0.024900 | Strong consensus |
| A10 | 0.570000 | 0.158100 | Mixed |
| A11 | 0.950000 | 0.006500 | Very strong consensus |
| A12 | 0.850000 | 0.016500 | Strong consensus |
| A13 | 0.990000 | 0.000900 | Very strong consensus |
| A14 | 0.980000 | 0.003600 | Very strong consensus |
| A15 | 0.250000 | 0.150500 | Weak, disputed, or sparse |
| A16 | 0.080000 | 0.399600 | Highly disputed |
Strongest aggregate claims are A13 graph honesty, A14 test seams, A1/A2/A3 right-sized staged LLM-isolated design, A6 canonical loop entry, and A11 rejection of over-decomposition.
The highest-variance claims are A16 recipe caching, A7 single interrupt versus explicit bootstrap nodes, A10 LLM-failure provenance, and A15 finish validation. These are the least stable consensus points.
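The aggregate and variance columns can be recomputed from the scoring matrix with a few lines of standard-library Python. This is a sketch rather than the report's original tooling: the names `SCORES` and `aggregate_and_variance` are hypothetical, and the variance is assumed to be the equal-weight population variance, which is the definition that matches the published table values.

```python
# Rows copied from the Scoring Matrix, aspects A1..A16 in order.
SCORES = {
    "zdeepseek-4-pro-range.md": [1.0, 1.0, 1.0, 1.0, 0.8, 1.0, 0.5, 0.4, 0.8, 0.7, 1.0, 1.0, 1.0, 1.0, 0.0, 0.8],
    "gemini-3.1-pro-range.md": [1.0, 0.7, 0.8, 0.8, 0.5, 0.8, 0.0, 0.6, 0.4, 0.0, 1.0, 0.8, 0.9, 0.8, 0.0, 0.0],
    "glm-5.1-pro-range.md": [0.7, 1.0, 0.9, 0.5, 0.8, 0.9, 0.8, 0.9, 1.0, 0.7, 0.8, 0.7, 1.0, 1.0, 0.0, -0.4],
    "gpt-5.4-range.md": [1.0, 1.0, 1.0, 0.9, 1.0, 1.0, 1.0, 1.0, 0.9, 0.7, 1.0, 0.9, 1.0, 1.0, -0.2, -0.6],
    "gpt-5.5-range.md": [1.0, 1.0, 1.0, 0.8, 1.0, 1.0, 1.0, 1.0, 0.9, 1.0, 1.0, 0.9, 1.0, 1.0, 0.3, -0.7],
    "kimi-2.6-range.md": [1.0, 1.0, 1.0, 1.0, 0.8, 1.0, -0.3, 0.7, 0.9, 0.0, 0.9, 1.0, 1.0, 1.0, 0.0, 1.0],
    "mimo-2.5-pro-range.md": [1.0, 1.0, 1.0, 1.0, 0.8, 0.9, 0.6, 0.7, 0.7, 0.0, 1.0, 0.8, 1.0, 1.0, 0.0, 0.4],
    "opus-4.7-range.md": [1.0, 1.0, 1.0, 0.8, 1.0, 1.0, 1.0, 1.0, 0.8, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6, -0.8],
    "qwen-3.6-range.md": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.7, 1.0, 0.8, 0.6, 1.0, 1.0, 1.0, 0.5],
    "qwen-3.7-max-range.md": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6, 0.8, 0.8, 0.6, 1.0, 0.8, 1.0, 1.0, 0.8, 0.6],
}


def aggregate_and_variance(scores):
    """Return (means, variances) per aspect using equal evaluator weights."""
    rows = list(scores.values())
    weight = 1.0 / len(rows)  # w_i = 0.1 for ten evaluators
    means, variances = [], []
    for column in zip(*rows):  # one column per aspect
        mean_j = sum(weight * s for s in column)
        var_j = sum(weight * (s - mean_j) ** 2 for s in column)
        means.append(mean_j)
        variances.append(var_j)
    return means, variances


means, variances = aggregate_and_variance(SCORES)
for j, (m, v) in enumerate(zip(means, variances), start=1):
    print(f"A{j}: aggregate={m:.6f} variance={v:.6f}")
```

Run as written, the A1 line shows an aggregate of 0.970000 and the A16 line shows a variance of 0.399600, matching the table.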
Distance-To-Center Calculation
Aspect-distance weights are equal: alpha_j = 1 / 16 = 0.0625.
Formula:
$$d_i = \sqrt{\sum_{j=1}^{16} \alpha_j \,(s_{ij} - \bar{s}_j)^2}$$
Clean ranking by weighted Euclidean distance to the aggregate center:
| Rank | Evaluator | Distance |
|---|---|---|
| 1 | mimo-2.5-pro-range.md | 0.183610 |
| 2 | qwen-3.7-max-range.md | 0.196119 |
| 3 | glm-5.1-pro-range.md | 0.205152 |
| 4 | zdeepseek-4-pro-range.md | 0.226578 |
| 5 | gpt-5.4-range.md | 0.237881 |
| 6 | gpt-5.5-range.md | 0.253155 |
| 7 | qwen-3.6-range.md | 0.274704 |
| 8 | gemini-3.1-pro-range.md | 0.283659 |
| 9 | opus-4.7-range.md | 0.287163 |
| 10 | kimi-2.6-range.md | 0.366691 |
Adjacent-rank gaps (gaps <= 0.020000 are near ties):
| Adjacent pair | Gap |
|---|---|
| mimo-2.5-pro-range.md to qwen-3.7-max-range.md | 0.012509 |
| qwen-3.7-max-range.md to glm-5.1-pro-range.md | 0.009034 |
| glm-5.1-pro-range.md to zdeepseek-4-pro-range.md | 0.021425 |
| zdeepseek-4-pro-range.md to gpt-5.4-range.md | 0.011303 |
| gpt-5.4-range.md to gpt-5.5-range.md | 0.015274 |
| gpt-5.5-range.md to qwen-3.6-range.md | 0.021549 |
| qwen-3.6-range.md to gemini-3.1-pro-range.md | 0.008955 |
| gemini-3.1-pro-range.md to opus-4.7-range.md | 0.003504 |
| opus-4.7-range.md to kimi-2.6-range.md | 0.079528 |
Closest-to-center opinion: mimo-2.5-pro-range.md.
This is a centrality result, not a claim that Mimo's own ranked recommendations were the most favored. It is closest because it strongly supports the shared mainstream claims while remaining less extreme on the most disputed aspects: A7, A10, A15, and A16.
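As an illustration, the distance-to-center values for the closest and farthest reports can be recomputed from the published aggregate vector. The sketch assumes the distance is the square root of the alpha-weighted sum of squared deviations, which is the form that reproduces the table; `distance_to_center` is a hypothetical name, and the aggregate entries are the rounded values shown in the aggregate table.

```python
import math

# Aggregate vector copied from the aggregate table (A1..A16).
AGGREGATE = [0.97, 0.97, 0.97, 0.88, 0.87, 0.96, 0.62, 0.80,
             0.79, 0.57, 0.95, 0.85, 0.99, 0.98, 0.25, 0.08]

# Two rows copied from the scoring matrix.
MIMO = [1.0, 1.0, 1.0, 1.0, 0.8, 0.9, 0.6, 0.7,
        0.7, 0.0, 1.0, 0.8, 1.0, 1.0, 0.0, 0.4]
KIMI = [1.0, 1.0, 1.0, 1.0, 0.8, 1.0, -0.3, 0.7,
        0.9, 0.0, 0.9, 1.0, 1.0, 1.0, 0.0, 1.0]


def distance_to_center(row, center):
    """Weighted Euclidean distance with equal aspect weights alpha_j = 1/16."""
    alpha = 1.0 / len(center)
    return math.sqrt(sum(alpha * (s - c) ** 2 for s, c in zip(row, center)))


d_mimo = distance_to_center(MIMO, AGGREGATE)
d_kimi = distance_to_center(KIMI, AGGREGATE)
print(f"mimo: {d_mimo:.6f}")
print(f"kimi: {d_kimi:.6f}")
```

Most of Kimi's distance comes from A7 and A16, where its scores (-0.3 and 1.0) sit 0.92 away from the aggregate in opposite directions.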
Pairwise Distance Matrix
Each value is the same equal-aspect weighted Euclidean distance between two evaluator vectors.
| Evaluator | zdeepseek | gemini | glm | gpt-5.4 | gpt-5.5 | kimi | mimo | opus | qwen-3.6 | qwen-3.7 |
|---|---|---|---|---|---|---|---|---|---|---|
| zdeepseek | 0.000000 | 0.350892 | 0.379967 | 0.409268 | 0.443001 | 0.282843 | 0.225000 | 0.480885 | 0.347311 | 0.242384 |
| gemini | 0.350892 | 0.000000 | 0.372492 | 0.417582 | 0.465698 | 0.330719 | 0.242384 | 0.489260 | 0.501248 | 0.389711 |
| glm | 0.379967 | 0.372492 | 0.000000 | 0.182003 | 0.201556 | 0.510514 | 0.325960 | 0.258602 | 0.390512 | 0.369966 |
| gpt-5.4 | 0.409268 | 0.417582 | 0.182003 | 0.000000 | 0.150000 | 0.555653 | 0.343693 | 0.223607 | 0.427931 | 0.409268 |
| gpt-5.5 | 0.443001 | 0.465698 | 0.201556 | 0.150000 | 0.000000 | 0.605186 | 0.410030 | 0.086603 | 0.366572 | 0.384057 |
| kimi | 0.282843 | 0.330719 | 0.510514 | 0.555653 | 0.605186 | 0.000000 | 0.281736 | 0.636396 | 0.514174 | 0.360555 |
| mimo | 0.225000 | 0.242384 | 0.325960 | 0.343693 | 0.410030 | 0.281736 | 0.000000 | 0.446514 | 0.382426 | 0.263391 |
| opus | 0.480885 | 0.489260 | 0.258602 | 0.223607 | 0.086603 | 0.636396 | 0.446514 | 0.000000 | 0.363146 | 0.390512 |
| qwen-3.6 | 0.347311 | 0.501248 | 0.390512 | 0.427931 | 0.366572 | 0.514174 | 0.382426 | 0.363146 | 0.000000 | 0.171391 |
| qwen-3.7 | 0.242384 | 0.389711 | 0.369966 | 0.409268 | 0.384057 | 0.360555 | 0.263391 | 0.390512 | 0.171391 | 0.000000 |
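A short sketch can spot-check entries of the matrix above. It recomputes the three distances among gpt-5.4, gpt-5.5, and opus from their scoring-matrix rows, assuming the same equal-aspect weighted Euclidean form; the helper name `pairwise_distance` is hypothetical.

```python
import math
from itertools import combinations

# Rows copied from the scoring matrix (A1..A16).
ROWS = {
    "gpt-5.4": [1.0, 1.0, 1.0, 0.9, 1.0, 1.0, 1.0, 1.0,
                0.9, 0.7, 1.0, 0.9, 1.0, 1.0, -0.2, -0.6],
    "gpt-5.5": [1.0, 1.0, 1.0, 0.8, 1.0, 1.0, 1.0, 1.0,
                0.9, 1.0, 1.0, 0.9, 1.0, 1.0, 0.3, -0.7],
    "opus": [1.0, 1.0, 1.0, 0.8, 1.0, 1.0, 1.0, 1.0,
             0.8, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6, -0.8],
}


def pairwise_distance(a, b):
    """Weighted Euclidean distance between two score vectors, alpha_j = 1/16."""
    alpha = 1.0 / len(a)
    return math.sqrt(sum(alpha * (x - y) ** 2 for x, y in zip(a, b)))


distances = {
    (p, q): pairwise_distance(ROWS[p], ROWS[q])
    for p, q in combinations(ROWS, 2)
}
for (p, q), d in distances.items():
    print(f"{p} vs {q}: {d:.6f}")
```

The gpt-5.5 and opus rows differ only on A9, A12, A15, and A16, which is why that pair has the smallest off-diagonal distance in the matrix.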
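Medoid Calculation
Medoid total for evaluator i, using the evaluator weights w_k = 0.1 and the pairwise distance defined for the matrix above:
$$T_i = \sum_{k=1}^{10} w_k \, d(x_i, x_k), \qquad d(x_i, x_k) = \sqrt{\sum_{j=1}^{16} \alpha_j \,(s_{ij} - s_{kj})^2}$$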
Clean medoid centrality ranking:
| Rank | Evaluator | Weighted total distance |
|---|---|---|
| 1 | mimo-2.5-pro-range.md | 0.292114 |
| 2 | qwen-3.7-max-range.md | 0.298124 |
| 3 | glm-5.1-pro-range.md | 0.299157 |
| 4 | gpt-5.5-range.md | 0.311270 |
| 5 | gpt-5.4-range.md | 0.311900 |
| 6 | zdeepseek-4-pro-range.md | 0.316155 |
| 7 | opus-4.7-range.md | 0.337552 |
| 8 | qwen-3.6-range.md | 0.346471 |
| 9 | gemini-3.1-pro-range.md | 0.355999 |
| 10 | kimi-2.6-range.md | 0.407778 |
Adjacent-rank gaps (gaps <= 0.020000 are near ties):
| Adjacent pair | Gap |
|---|---|
| mimo-2.5-pro-range.md to qwen-3.7-max-range.md | 0.006010 |
| qwen-3.7-max-range.md to glm-5.1-pro-range.md | 0.001034 |
| glm-5.1-pro-range.md to gpt-5.5-range.md | 0.012113 |
| gpt-5.5-range.md to gpt-5.4-range.md | 0.000630 |
| gpt-5.4-range.md to zdeepseek-4-pro-range.md | 0.004255 |
| zdeepseek-4-pro-range.md to opus-4.7-range.md | 0.021397 |
| opus-4.7-range.md to qwen-3.6-range.md | 0.008919 |
| qwen-3.6-range.md to gemini-3.1-pro-range.md | 0.009527 |
| gemini-3.1-pro-range.md to kimi-2.6-range.md | 0.051779 |
Medoid opinion: mimo-2.5-pro-range.md.
The same report is both closest to the aggregate center and the medoid by total pairwise distance.
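The medoid can be cross-checked directly from the published pairwise distance matrix. The sketch below assumes the "weighted total distance" is the sum of an evaluator's row of distances multiplied by the equal evaluator weight 0.1 (the self-distance of 0 is included); the names `LABELS`, `DIST`, and `medoid` are hypothetical.

```python
LABELS = ["zdeepseek", "gemini", "glm", "gpt-5.4", "gpt-5.5",
          "kimi", "mimo", "opus", "qwen-3.6", "qwen-3.7"]

# Rows copied from the pairwise distance matrix, in LABELS order.
DIST = [
    [0.000000, 0.350892, 0.379967, 0.409268, 0.443001, 0.282843, 0.225000, 0.480885, 0.347311, 0.242384],
    [0.350892, 0.000000, 0.372492, 0.417582, 0.465698, 0.330719, 0.242384, 0.489260, 0.501248, 0.389711],
    [0.379967, 0.372492, 0.000000, 0.182003, 0.201556, 0.510514, 0.325960, 0.258602, 0.390512, 0.369966],
    [0.409268, 0.417582, 0.182003, 0.000000, 0.150000, 0.555653, 0.343693, 0.223607, 0.427931, 0.409268],
    [0.443001, 0.465698, 0.201556, 0.150000, 0.000000, 0.605186, 0.410030, 0.086603, 0.366572, 0.384057],
    [0.282843, 0.330719, 0.510514, 0.555653, 0.605186, 0.000000, 0.281736, 0.636396, 0.514174, 0.360555],
    [0.225000, 0.242384, 0.325960, 0.343693, 0.410030, 0.281736, 0.000000, 0.446514, 0.382426, 0.263391],
    [0.480885, 0.489260, 0.258602, 0.223607, 0.086603, 0.636396, 0.446514, 0.000000, 0.363146, 0.390512],
    [0.347311, 0.501248, 0.390512, 0.427931, 0.366572, 0.514174, 0.382426, 0.363146, 0.000000, 0.171391],
    [0.242384, 0.389711, 0.369966, 0.409268, 0.384057, 0.360555, 0.263391, 0.390512, 0.171391, 0.000000],
]


def medoid(labels, dist, weight=0.1):
    """Return (label, total) for the row with the smallest weighted total."""
    totals = {label: weight * sum(row) for label, row in zip(labels, dist)}
    # Ties on total fall back to lexicographic label order.
    best = min(totals, key=lambda label: (totals[label], label))
    return best, totals


best, totals = medoid(LABELS, DIST)
print(best, f"{totals[best]:.6f}")
```

Because the inputs are the 6-decimal values from the table, the recomputed totals can differ from the published ones in the last digit.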
Matrix-Derived Overall Opinion
The aggregate evaluator opinion is strongly positive toward a moderate planner-graph refactor that makes the hidden planner lifecycle visible without turning every rule into a checkpointed node. The consensus favors a staged helper-first migration, a canonical loop-entry phase, a thin LLM planning node, and a node-owned post-decision policy phase. The evaluators strongly agree that routing edges must remain pure, action loopbacks should re-enter at the top of the planning loop, and tests should move toward phase-specific seams.
The aggregate is also strongly negative toward both extremes: a 10+ node policy lattice with excessive checkpoint writes and a minimal 2-3 node split that leaves most deterministic policy hidden in a renamed god node. The consensus synthesis points toward a GPT-5.4-like or DeepSeek-like moderate pipeline, with GPT-5.5-style state and failure-provenance caution where explicitly supported by the evaluator reports.
The unresolved consensus tensions are visible in the high-variance aspects. Recipe caching in planner state is the most disputed claim: some evaluators treat it as useful deduplication, while others warn that FieldAcquisition-like objects may be unsafe for checkpoint serialization. Single-interrupt handling is also mixed because some reports value explicit region/currency graph nodes, while others prefer preserving the existing ask_user / observe_user path. Finish validation has weak aggregate support because it is a sharp but sparsely mentioned idea. LLM-failure provenance is positively supported but not universal.
The nearest existing opinion is mimo-2.5-pro-range.md, and the medoid is also mimo-2.5-pro-range.md. This means it is the most central existing evaluator report under the selected aspect-space representation and equal weights.