Evaluator Consensus Ranking Report
Scope
This report applies analyse-evaluation-method.md to the 10 evaluator reports in this directory. It ranks evaluator reports by:
- distance to the equal-weight overall opinion;
- medoid centrality under pairwise evaluator distances.
This is an evaluator-opinion synthesis only. It does not add an independent architecture judgment about the planner refactor.
Source Set
| Code | Evaluator report |
|---|---|
| DS | deepseek-4-pro-range.md |
| GE | gemini-3.1-pro-range.md |
| GL | glm-5.1-pro-range.md |
| G54 | gpt-5.4-range.md |
| G55 | gpt-5.5-range.md |
| KI | kimi-2.6-range.md |
| MI | mimo-2.5-pro-range.md |
| OP | opus-4.7-range.md |
| Q36 | qwen-3.6-range.md |
| Q37 | qwen-3.7-max-range.md |
analyse-evaluation-method.md is the method source, not an evaluator.
Result Summary
| Ranking type | Winner | Score | Interpretation |
|---|---|---|---|
| Closest to equal-weight overall opinion | DS | 0.131789 | Lowest weighted Euclidean distance from the aggregate vector. |
| Medoid | DS | 0.276161 | Lowest equal-weight total pairwise distance to all evaluator reports. |
Both methods select deepseek-4-pro-range.md as the most central evaluator opinion.
Scoring Method
The method file requires a formal feature-space representation before aggregation. This report uses 19 fine-grained semantic aspects induced from recurring evaluator claims.
Rules:
- Evaluator weights are equal: `w_i = 0.1`.
- Aspect-distance weights are equal: `alpha_j = 1 / 19`.
- Scores use `[-1, 1]`.
  - `+1` means strong support for the aspect statement.
  - `0` means neutral, balanced, or not mentioned.
  - `-1` means strong rejection of the aspect statement.
- Only clearly equivalent judgments are normalized into the same aspect.
- If equivalence is unclear, it remains a separate aspect.
- Unmentioned aspects are scored `0` by the user-selected override.
- After neutral filling, every matrix cell is present and `c_ij = 1`.
- Mention status is tracked separately in the evidence ledger.
- Unresolved ambiguous cells: `0 / 190 = 0.0%`, within the `<= 3%` requirement.
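The neutral-fill rule can be sketched in a few lines of Python. This is only an illustration, not the tooling used for this report: the `Cell` type, the `neutral_fill` helper, and the three-cell example row are hypothetical names and data introduced here.

```python
from dataclasses import dataclass

ASPECTS = [f"A{n:02d}" for n in range(1, 20)]  # A01 .. A19


@dataclass(frozen=True)
class Cell:
    score: float  # rubric score in [-1, 1]
    status: str   # "explicit", "implied", or "unmentioned"


def neutral_fill(mentioned: dict[str, Cell]) -> dict[str, Cell]:
    """Return a complete 19-aspect row; aspects without a judgment score 0."""
    row = {}
    for aspect in ASPECTS:
        cell = mentioned.get(aspect, Cell(0.0, "unmentioned"))
        if not -1.0 <= cell.score <= 1.0:
            raise ValueError(f"{aspect}: score {cell.score} is outside [-1, 1]")
        row[aspect] = cell
    return row


# Hypothetical evaluator that only judged three aspects.
partial_row = {
    "A01": Cell(1.0, "explicit"),
    "A03": Cell(1.0, "explicit"),
    "A11": Cell(0.4, "implied"),
}
full_row = neutral_fill(partial_row)
print(len(full_row), full_row["A04"])
```

Keeping `status` next to `score` is one way to preserve the difference between a balanced `0` and an unmentioned `0` after the fill, which is the distinction the evidence ledger records.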
Rounding policy:
- Source scores are shown to one decimal because scoring was assigned on a tenth-point rubric.
- Aggregate and variance values are shown to four decimals.
- Ranking metrics are shown to six decimals.
- Pairwise matrix entries are shown to four decimals.
- Clean ordering uses full-precision values before rounding; if a full-precision tie occurred, the tie-break would be source filename lexicographic order. No full-precision ties occurred.
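A minimal sketch of this ordering policy, assuming scores are held as floats keyed by source filename. The `clean_order` helper and the demo filenames and values are hypothetical; they only illustrate sorting on full precision before formatting to six decimals.

```python
def clean_order(scores: dict[str, float]) -> list[tuple[int, str, str]]:
    """Rank by ascending full-precision score; break exact ties by filename."""
    ordered = sorted(scores.items(), key=lambda item: (item[1], item[0]))
    return [
        (rank, name, f"{value:.6f}")  # rounding happens only for display
        for rank, (name, value) in enumerate(ordered, start=1)
    ]


# Hypothetical data: two scores that look identical at six decimals,
# and two scores that are exactly equal.
demo = {
    "a-range.md": 0.2000004,
    "b-range.md": 0.2000001,
    "d-range.md": 0.3,
    "c-range.md": 0.3,
}
for row in clean_order(demo):
    print(row)
```

Here `b-range.md` ranks ahead of `a-range.md` because the comparison uses the unrounded values, while the exact tie between `c-range.md` and `d-range.md` falls back to lexicographic filename order.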
Aspect Taxonomy
| Aspect | Directional statement scored positively |
|---|---|
| A01 | A moderate 4-6 phase graph is the target sweet spot. |
| A02 | DeepSeek's guard -> prepare -> plan -> adjust topology is a strong base. |
| A03 | GPT-5.4's tick -> bootstrap_gate -> prepare_context -> acquisition_gate -> llm_plan -> decision_policy topology is a strong base. |
| A04 | GPT-5.5's detailed state/test/non-goal hygiene should be valued or borrowed. |
| A05 | Two- or three-node minimal splits are insufficient as the final architecture. |
| A06 | Eight-plus-node or atomized graphs are over-decomposed and costly. |
| A07 | A single post-LLM decision_policy/adjust node is preferred. |
| A08 | Post-LLM correction should be split into several graph nodes. |
| A09 | Edge functions must remain pure and must not mutate state or decisions. |
| A10 | Action loopbacks should return to a canonical top-of-loop entry node. |
| A11 | A dedicated iteration tick/enter_iteration phase is important. |
| A12 | Dedicated region/currency bootstrap nodes are beneficial. |
| A13 | Existing public contracts should stay stable, including single ask_user interrupt and outer graph boundaries. |
| A14 | State/checkpoint discipline should avoid _route-style hints and rich cached objects unless proven safe. |
| A15 | LLM-failure or decision-origin state is valuable for calculator/reflect safety. |
| A16 | Helper-first staged migration with green tests is required. |
| A17 | _build_plan_output or equivalent output-state consolidation is valuable. |
| A18 | Opus-style state ownership, dispatch, or naming documentation is valuable as a reference. |
| A19 | Finish validation such as route_finish_check/finish_check is valuable. |
Claim Ledger
The taxonomy was induced from these recurring evaluator claims:
| Aspect | Recurring claim sources |
|---|---|
| A01 | DS, GE, G54, G55, KI, MI, OP, Q36, and Q37 converge on a moderate graph rather than status quo or graph explosion. |
| A02 | DS, KI, MI, and Q37 rank DeepSeek very high; OP and GL treat it as a strong middle-ground source. |
| A03 | GE, G54, G55, OP, Q36, and Q37 rank GPT-5.4 first or use it as the structural spine. |
| A04 | GL ranks GPT-5.5 first; G54, G55, OP, and Q37 value its state/test/non-goal detail even when not adopting all nodes. |
| A05 | Most reports mark GLM/Gemini/Mimo-style 2-3 node splits as useful first steps but incomplete final targets. |
| A06 | Most reports reject Opus/Qwen/GPT-5.5/Kimi-level atomization when it creates too many checkpointed hops. |
| A07 | DS, GE, G54, G55, KI, MI, OP, Q36, and Q37 favor one policy node or equivalent merged correction stage. |
| A08 | GL partly values splitting correction concerns; most other reports reject multi-node correction chains for first implementation. |
| A09 | DS, G54, G55, KI, MI, OP, Q36, and Q37 explicitly reject mutating edge functions. |
| A10 | DS, G54, G55, KI, MI, OP, Q36, and Q37 converge on tick, guard, pre_check, or equivalent loop entry. |
| A11 | G54, G55, OP, Q36, and Q37 strongly emphasize a dedicated iteration phase; DS and MI borrow or rename toward it. |
| A12 | KI supports explicit bootstrap nodes; many other reports prefer a merged bootstrap gate or existing ask_user flow. |
| A13 | G54, G55, OP, Q36, and Q37 strongly protect public planner/outer graph/interrupt contracts. |
| A14 | G54, G55, OP, Q36, and several others reject _route fields and unproven checkpoint caches; DS/KI/Q37 are more positive on recipes in state. |
| A15 | GL, G54, G55, OP, Q36, and Q37 value llm_failed or decision_origin. |
| A16 | All detailed reports reward helper extraction and staged migration. |
| A17 | GL, G54, MI, Q36, and Q37 value GLM's _build_plan_output helper; several others do not mention it. |
| A18 | DS, GL, OP, Q36, and Q37 value Opus's state ownership, dispatch, or naming reference material. |
| A19 | Q36 and Q37 strongly value finish validation; OP and DS note the idea; many reports leave it neutral. |
Full Scoring Matrix
| Evaluator | A01 | A02 | A03 | A04 | A05 | A06 | A07 | A08 | A09 | A10 | A11 | A12 | A13 | A14 | A15 | A16 | A17 | A18 | A19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DS | +1.0 | +1.0 | +0.8 | +0.5 | +0.8 | +0.9 | +0.9 | -0.8 | +1.0 | +1.0 | +0.8 | -0.3 | +0.8 | +0.3 | +0.6 | +1.0 | +0.2 | +0.8 | +0.4 |
| GE | +1.0 | +0.8 | +1.0 | 0.0 | +0.8 | +1.0 | +0.8 | -0.8 | +0.7 | +0.4 | +0.4 | 0.0 | 0.0 | +0.3 | 0.0 | +0.4 | 0.0 | 0.0 | 0.0 |
| GL | +0.8 | +0.8 | +0.7 | +1.0 | +1.0 | +0.8 | +0.8 | +0.3 | +0.8 | +1.0 | +0.8 | +0.2 | +0.8 | +0.6 | +1.0 | +0.9 | +0.8 | +0.9 | 0.0 |
| G54 | +1.0 | +0.4 | +1.0 | +0.8 | +0.8 | +0.9 | +1.0 | -0.6 | +1.0 | +1.0 | +1.0 | -0.7 | +1.0 | +1.0 | +0.8 | +1.0 | +0.8 | +0.4 | 0.0 |
| G55 | +1.0 | +0.4 | +1.0 | +0.8 | +0.8 | +0.9 | +0.9 | -0.5 | +1.0 | +1.0 | +1.0 | -0.5 | +1.0 | +0.9 | +0.9 | +1.0 | 0.0 | +0.4 | 0.0 |
| KI | +1.0 | +1.0 | +0.8 | +0.3 | +0.7 | +1.0 | +1.0 | -0.7 | +1.0 | +1.0 | +0.5 | +0.8 | +0.2 | +0.2 | +0.4 | +1.0 | 0.0 | +0.6 | 0.0 |
| MI | +1.0 | +1.0 | +0.8 | +0.3 | +0.5 | +1.0 | +1.0 | -0.7 | +1.0 | +1.0 | +0.6 | -0.5 | +0.7 | +0.4 | +0.5 | +1.0 | +0.8 | +0.5 | 0.0 |
| OP | +1.0 | +0.8 | +1.0 | +1.0 | +0.9 | +0.9 | +1.0 | -0.6 | +1.0 | +1.0 | +1.0 | -0.5 | +1.0 | +1.0 | +1.0 | +1.0 | 0.0 | +0.8 | +0.7 |
| Q36 | +1.0 | +0.5 | +1.0 | +0.4 | +0.5 | +0.8 | +1.0 | -0.7 | +1.0 | +1.0 | +1.0 | -0.6 | +1.0 | +0.8 | +0.8 | +1.0 | +0.8 | +0.8 | +1.0 |
| Q37 | +1.0 | +0.9 | +1.0 | +0.8 | +0.8 | +0.9 | +1.0 | -0.7 | +1.0 | +1.0 | +1.0 | -0.7 | +0.9 | +0.2 | +0.8 | +1.0 | +1.0 | +0.8 | +1.0 |
Aggregate Vector And Agreement
| Aspect | Mean score | Variance | Agreement note |
|---|---|---|---|
| A01 | 0.9800 | 0.0036 | Very strong consensus for moderate graph sizing. |
| A02 | 0.7600 | 0.0524 | Strong support for DeepSeek as one central topology source. |
| A03 | 0.9100 | 0.0129 | Very strong support for GPT-5.4 as topology source. |
| A04 | 0.5900 | 0.1029 | Moderate support; disagreement comes from GPT-5.5 being valued but often seen as node-heavy. |
| A05 | 0.7600 | 0.0224 | Strong consensus that minimal splits are not enough as final shape. |
| A06 | 0.9100 | 0.0049 | Very strong consensus against atomized graph shapes. |
| A07 | 0.9400 | 0.0064 | Very strong consensus for one policy/adjust node. |
| A08 | -0.5800 | 0.0936 | Moderate rejection of multi-node correction chains. |
| A09 | 0.9500 | 0.0105 | Very strong consensus for pure routing edges. |
| A10 | 0.9400 | 0.0324 | Very strong consensus for canonical loop entry. |
| A11 | 0.8100 | 0.0489 | Strong support for dedicated iteration lifecycle. |
| A12 | -0.2800 | 0.2076 | Highest disagreement; Kimi supports dedicated bootstrap nodes, many others reject them. |
| A13 | 0.7400 | 0.1144 | Strong support for preserving contracts, with some shorter reports neutral. |
| A14 | 0.5700 | 0.0981 | Moderate support for checkpoint caution; recipe caching creates disagreement. |
| A15 | 0.6800 | 0.0876 | Strong support for LLM failure/provenance tracking. |
| A16 | 0.9300 | 0.0321 | Very strong consensus for staged migration. |
| A17 | 0.4400 | 0.1664 | Moderate support but many reports do not mention the helper. |
| A18 | 0.6000 | 0.0700 | Moderate support for Opus reference material. |
| A19 | 0.3100 | 0.1689 | Weak-to-moderate support; only some reports emphasize finish validation. |
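As a worked check of one row, the A12 figures can be recomputed from the A12 column of the scoring matrix. The sketch below uses Python's standard `statistics` module; the variance is the population variance (divide by 10), matching the equal evaluator weights.

```python
from statistics import fmean, pvariance

# A12 column of the scoring matrix, in evaluator order
# DS, GE, GL, G54, G55, KI, MI, OP, Q36, Q37.
a12 = [-0.3, 0.0, 0.2, -0.7, -0.5, 0.8, -0.5, -0.5, -0.6, -0.7]

mean_a12 = fmean(a12)
var_a12 = pvariance(a12)  # population variance: divides by n = 10, not n - 1

print(f"{mean_a12:.4f} {var_a12:.4f}")  # -0.2800 0.2076
```

Using the sample variance (`statistics.variance`, dividing by 9) would give a larger value and would not match the table.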
Overall Opinion Generated From The Aggregate
The equal-weight overall evaluator opinion is strongly in favor of a moderate phased planner-graph refactor. The center of opinion most strongly supports a 4-6 phase topology, pure routing edges, a canonical loop-entry node, a small LLM-only planning node, a single post-LLM policy/adjustment node, and a staged helper-first migration. GPT-5.4 and DeepSeek are the two strongest topology anchors in the aggregate, with GPT-5.5 contributing state, test, non-goal, and LLM-failure hygiene. The clearest aggregate rejections are over-atomized graphs, mutation inside edge functions, and treating every post-LLM correction as a separate checkpointed node. The most disputed areas are dedicated region/currency bootstrap nodes, recipe caching in checkpoint state, _build_plan_output emphasis, and finish-validation placement.
Distance To Overall Opinion
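Formula, with $M_{ij}$ the score of evaluator $i$ on aspect $j$, $\bar{M}_j = \sum_i w_i M_{ij}$ the aggregate score, $w_i = 0.1$, and $\alpha_j = 1/19$:

$$d_i = \sqrt{\sum_{j=1}^{19} \alpha_j \left(M_{ij} - \bar{M}_j\right)^2}$$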
Clean forced ordering:
| Rank | Evaluator | Distance to center | Gap from previous |
|---|---|---|---|
| 1 | DS | 0.131789 | 0.000000 |
| 2 | MI | 0.184961 | 0.053172 |
| 3 | G55 | 0.210513 | 0.025552 |
| 4 | G54 | 0.222900 | 0.012387 |
| 5 | OP | 0.233734 | 0.010834 |
| 6 | Q36 | 0.244949 | 0.011215 |
| 7 | Q37 | 0.263778 | 0.018829 |
| 8 | GL | 0.306937 | 0.043159 |
| 9 | KI | 0.344429 | 0.037492 |
| 10 | GE | 0.400657 | 0.056228 |
Closest opinion to the overall opinion: deepseek-4-pro-range.md.
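The top-ranked distance can be reproduced by hand from the DS row of the scoring matrix and the mean column of the aggregate table. The `weighted_distance` helper name is introduced here for illustration only.

```python
from math import sqrt

# DS row of the scoring matrix (A01 .. A19).
ds = [1.0, 1.0, 0.8, 0.5, 0.8, 0.9, 0.9, -0.8, 1.0, 1.0,
      0.8, -0.3, 0.8, 0.3, 0.6, 1.0, 0.2, 0.8, 0.4]

# Mean score per aspect from the aggregate table (A01 .. A19).
center = [0.98, 0.76, 0.91, 0.59, 0.76, 0.91, 0.94, -0.58, 0.95, 0.94,
          0.81, -0.28, 0.74, 0.57, 0.68, 0.93, 0.44, 0.60, 0.31]


def weighted_distance(x: list[float], y: list[float]) -> float:
    """Equal-weight Euclidean distance: alpha_j = 1 / number of aspects."""
    alpha = 1 / len(x)
    return sqrt(sum(alpha * (a - b) ** 2 for a, b in zip(x, y)))


print(f"{weighted_distance(ds, center):.6f}")  # 0.131789
```

The squared differences sum to 0.33, so the distance is `sqrt(0.33 / 19)`, which rounds to the 0.131789 shown for DS.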
Pairwise Distance Matrix
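Formula, with $M_{ij}$ the score of evaluator $i$ on aspect $j$ and $\alpha_j = 1/19$:

$$d_{ik} = \sqrt{\sum_{j=1}^{19} \alpha_j \left(M_{ij} - M_{kj}\right)^2}$$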
| From \ To | DS | GE | GL | G54 | G55 | KI | MI | OP | Q36 | Q37 |
|---|---|---|---|---|---|---|---|---|---|---|
| DS | 0.0000 | 0.4136 | 0.3763 | 0.3236 | 0.2819 | 0.3269 | 0.2152 | 0.2606 | 0.2902 | 0.2734 |
| GE | 0.4136 | 0.0000 | 0.5813 | 0.5370 | 0.4952 | 0.3449 | 0.4007 | 0.5685 | 0.5680 | 0.5835 |
| GL | 0.3763 | 0.5813 | 0.0000 | 0.3713 | 0.3728 | 0.4507 | 0.3940 | 0.3920 | 0.4431 | 0.4249 |
| G54 | 0.3236 | 0.5370 | 0.3713 | 0.0000 | 0.1947 | 0.5282 | 0.2819 | 0.2884 | 0.2800 | 0.3332 |
| G55 | 0.2819 | 0.4952 | 0.3728 | 0.1947 | 0.0000 | 0.4634 | 0.3332 | 0.2176 | 0.3364 | 0.3980 |
| KI | 0.3269 | 0.3449 | 0.4507 | 0.5282 | 0.4634 | 0.0000 | 0.3763 | 0.4995 | 0.5346 | 0.5385 |
| MI | 0.2152 | 0.4007 | 0.3940 | 0.2819 | 0.3332 | 0.3763 | 0.0000 | 0.3859 | 0.3195 | 0.3162 |
| OP | 0.2606 | 0.5685 | 0.3920 | 0.2884 | 0.2176 | 0.4995 | 0.3859 | 0.0000 | 0.2763 | 0.3154 |
| Q36 | 0.2902 | 0.5680 | 0.4431 | 0.2800 | 0.3364 | 0.5346 | 0.3195 | 0.2763 | 0.0000 | 0.2103 |
| Q37 | 0.2734 | 0.5835 | 0.4249 | 0.3332 | 0.3980 | 0.5385 | 0.3162 | 0.3154 | 0.2103 | 0.0000 |
Medoid Ranking
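Formula, with $d_{ik}$ the pairwise distance between evaluators $i$ and $k$ and $w_k = 0.1$:

$$m_i = \sum_{k=1}^{10} w_k \, d_{ik}$$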
Clean forced ordering:
| Rank | Evaluator | Equal-weight pairwise total | Gap from previous |
|---|---|---|---|
| 1 | DS | 0.276161 | 0.000000 |
| 2 | MI | 0.302303 | 0.026142 |
| 3 | G55 | 0.309327 | 0.007024 |
| 4 | G54 | 0.313841 | 0.004514 |
| 5 | OP | 0.320413 | 0.006572 |
| 6 | Q36 | 0.325835 | 0.005422 |
| 7 | Q37 | 0.339348 | 0.013513 |
| 8 | GL | 0.380641 | 0.041293 |
| 9 | KI | 0.406289 | 0.025648 |
| 10 | GE | 0.449273 | 0.042984 |
Medoid: deepseek-4-pro-range.md.
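The medoid score is a weighted row sum of the pairwise matrix. The sketch below applies that to the matrix as printed (four decimals), so its totals can differ from the six-decimal table in the last digits; the `medoid_scores` helper is a name introduced here.

```python
CODES = ["DS", "GE", "GL", "G54", "G55", "KI", "MI", "OP", "Q36", "Q37"]

# Pairwise distance matrix as printed above (rounded to four decimals).
D = [
    [0.0000, 0.4136, 0.3763, 0.3236, 0.2819, 0.3269, 0.2152, 0.2606, 0.2902, 0.2734],
    [0.4136, 0.0000, 0.5813, 0.5370, 0.4952, 0.3449, 0.4007, 0.5685, 0.5680, 0.5835],
    [0.3763, 0.5813, 0.0000, 0.3713, 0.3728, 0.4507, 0.3940, 0.3920, 0.4431, 0.4249],
    [0.3236, 0.5370, 0.3713, 0.0000, 0.1947, 0.5282, 0.2819, 0.2884, 0.2800, 0.3332],
    [0.2819, 0.4952, 0.3728, 0.1947, 0.0000, 0.4634, 0.3332, 0.2176, 0.3364, 0.3980],
    [0.3269, 0.3449, 0.4507, 0.5282, 0.4634, 0.0000, 0.3763, 0.4995, 0.5346, 0.5385],
    [0.2152, 0.4007, 0.3940, 0.2819, 0.3332, 0.3763, 0.0000, 0.3859, 0.3195, 0.3162],
    [0.2606, 0.5685, 0.3920, 0.2884, 0.2176, 0.4995, 0.3859, 0.0000, 0.2763, 0.3154],
    [0.2902, 0.5680, 0.4431, 0.2800, 0.3364, 0.5346, 0.3195, 0.2763, 0.0000, 0.2103],
    [0.2734, 0.5835, 0.4249, 0.3332, 0.3980, 0.5385, 0.3162, 0.3154, 0.2103, 0.0000],
]


def medoid_scores(codes: list[str], dist: list[list[float]], weight: float) -> dict[str, float]:
    """Weighted total distance from each evaluator to all evaluators (self included, at 0)."""
    return {code: sum(weight * d for d in row) for code, row in zip(codes, dist)}


scores = medoid_scores(CODES, D, weight=0.1)
medoid = min(scores, key=scores.get)
print(medoid, f"{scores[medoid]:.4f}")
```

Because the self-distance is zero, including the diagonal does not change the ordering; it only fixes the scale so that every row is weighted over all 10 evaluators.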
Full Scoring Evidence Ledger
Each row gives the aspect, its score, the mention status, and an evidence note. A status of unmentioned means the evaluator report did not make a clear equivalent judgment for that aspect; the score is therefore a neutral 0 by the user-selected override.
DS: deepseek-4-pro-range.md
| Aspect | Score | Status | Evidence note |
|---|---|---|---|
| A01 | +1.0 | explicit | Calls the 4-node pipeline the "Goldilocks" balance and later recommends a 5-node synthesis. |
| A02 | +1.0 | explicit | Ranks DeepSeek #1 and adopts its guard/prepare/plan/adjust topology. |
| A03 | +0.8 | explicit | Ranks GPT-5.4 #2 and adopts its design principles and tick discipline. |
| A04 | +0.5 | explicit | Values GPT-5.5 organization, consequences, and decision_origin, but ranks it below top three. |
| A05 | +0.8 | explicit | Treats GLM, Mimo, and Gemini minimal splits as useful but incomplete. |
| A06 | +0.9 | explicit | Criticizes Opus/Qwen-style graph expansion and _route patterns as over-complex. |
| A07 | +0.9 | explicit | Says one decision_policy or adjust node is enough and rejects corrector explosion. |
| A08 | -0.8 | explicit | Says splitting every corrector adds graph noise without enough benefit. |
| A09 | +1.0 | explicit | Explicitly rejects redirects in edge functions as mutation outside nodes. |
| A10 | +1.0 | explicit | Identifies canonical loop-entry as an adopted cross-cutting pattern. |
| A11 | +0.8 | explicit | Adopts GPT-5.4's separate tick as an improvement over a combined guard. |
| A12 | -0.3 | implied | Notes separate bootstrap nodes can improve clarity, but final synthesis keeps them inside guard/bootstrap. |
| A13 | +0.8 | explicit | Adopts anti-pattern constraints such as keeping ask_user as the sole interrupt. |
| A14 | +0.3 | explicit | Rejects _route; mixed because it also recommends recipes in state with serialization verification. |
| A15 | +0.6 | explicit | Recommends borrowing GPT-5.5's decision_origin. |
| A16 | +1.0 | explicit | Treats helper extraction and staged verification as central to the best proposal. |
| A17 | +0.2 | explicit | Notes GLM's _build_plan_output helper as useful but not central. |
| A18 | +0.8 | explicit | Adopts Opus's state ownership table, naming convention, and dispatch idea. |
| A19 | +0.4 | explicit | Notes Qwen-3.6's route_finish_check as a good safety idea but does not center it. |
GE: gemini-3.1-pro-range.md
| Aspect | Score | Status | Evidence note |
|---|---|---|---|
| A01 | +1.0 | explicit | Recommends a 5-6 node phased pipeline as the best combination. |
| A02 | +0.8 | explicit | Names DeepSeek as runner-up and uses it as an architectural nod. |
| A03 | +1.0 | explicit | Names GPT-5.4 the winner and primary recommendation. |
| A04 | 0.0 | unmentioned | Does not make a clear specific judgment about GPT-5.5's hygiene contributions. |
| A05 | +0.8 | explicit | Says conservative 2-3 node splits leave too much bundled. |
| A06 | +1.0 | explicit | Rejects Qwen-3.7-style 10-node decomposition as graph noise and chatty state. |
| A07 | +0.8 | explicit | Says policy should be consolidated in one decision_policy node. |
| A08 | -0.8 | explicit | Rejects turning every policy rule into a node. |
| A09 | +0.7 | explicit | Says mutations belong in designated nodes and edges should read state and route. |
| A10 | +0.4 | explicit | Notes loop-back edge handling as a key concern. |
| A11 | +0.4 | implied | Recommends tick_and_guard, implying some iteration-entry support without separating it strongly. |
| A12 | 0.0 | unmentioned | No clear support or rejection of dedicated bootstrap nodes. |
| A13 | 0.0 | unmentioned | No clear public-contract or single-interrupt claim. |
| A14 | +0.3 | implied | Supports idempotent state mutations but does not discuss route fields or rich caches. |
| A15 | 0.0 | unmentioned | No clear LLM-failure or decision-origin claim. |
| A16 | +0.4 | explicit | Mentions testability and isolation, but not a detailed staged migration. |
| A17 | 0.0 | unmentioned | No _build_plan_output or equivalent helper claim. |
| A18 | 0.0 | unmentioned | No clear Opus state-ownership or dispatch-reference claim. |
| A19 | 0.0 | unmentioned | No finish-validation claim. |
GL: glm-5.1-pro-range.md
| Aspect | Score | Status | Evidence note |
|---|---|---|---|
| A01 | +0.8 | explicit | Final recommendation uses a moderate graph shape after selective borrowing. |
| A02 | +0.8 | explicit | Ranks DeepSeek runner-up and borrows the adjust starting point. |
| A03 | +0.7 | explicit | Values GPT-5.4 anti-patterns and tick, though ranks it fourth. |
| A04 | +1.0 | explicit | Ranks GPT-5.5 best overall and uses it as base. |
| A05 | +1.0 | explicit | Says Gemini/Mimo/GLM-level minimal shapes are incomplete final targets. |
| A06 | +0.8 | explicit | Critiques Qwen-3.7 as over-engineered and Opus as slightly over-decomposed. |
| A07 | +0.8 | explicit | Recommends starting with a single correct_policy node. |
| A08 | +0.3 | explicit | Also praises GPT-5.5's separation of normalize/retry/maybe-calculate concerns. |
| A09 | +0.8 | explicit | Adopts anti-pattern rules forbidding mutation in edges. |
| A10 | +1.0 | explicit | Final graph routes all loopbacks through enter_iteration. |
| A11 | +0.8 | explicit | Values tick/enter_iteration as a distinct lifecycle node. |
| A12 | +0.2 | implied | Sees dedicated region/currency nodes as visible, but final recommendation folds them into a gate. |
| A13 | +0.8 | explicit | Preserves non-goals and public planner boundaries from top proposals. |
| A14 | +0.6 | explicit | Warns about _route and checkpoint concerns, but not as strongly as G54/G55/OP. |
| A15 | +1.0 | explicit | Strongly values decision_origin or llm_failed. |
| A16 | +0.9 | explicit | Emphasizes staged migration and helper extraction. |
| A17 | +0.8 | explicit | Treats _build_plan_output as a useful standalone helper. |
| A18 | +0.9 | explicit | Borrows Opus naming and values state ownership documentation. |
| A19 | 0.0 | unmentioned | No clear finish-validation adoption. |
G54: gpt-5.4-range.md
| Aspect | Score | Status | Evidence note |
|---|---|---|---|
| A01 | +1.0 | explicit | Recommends GPT-5.4's moderate six-phase target graph. |
| A02 | +0.4 | explicit | Treats DeepSeek as a useful mental model but ranks it fourth. |
| A03 | +1.0 | explicit | Ranks GPT-5.4 first and recommends it as end-state architecture. |
| A04 | +0.8 | explicit | Ranks GPT-5.5 second and uses its migration/state/test discipline. |
| A05 | +0.8 | explicit | Says GLM/Gemini/Mimo are safer first slices but weak final targets. |
| A06 | +0.9 | explicit | Rejects atomized Opus/Qwen shapes and checkpoint-boundary excess. |
| A07 | +1.0 | explicit | Recommends a single decision_policy node. |
| A08 | -0.6 | explicit | Says GPT-5.5's normalize/retry/maybe-calculate split is one split too far. |
| A09 | +1.0 | explicit | Explicitly rejects policy in edge functions. |
| A10 | +1.0 | explicit | Requires all action loopbacks to return to tick. |
| A11 | +1.0 | explicit | Makes tick the first core target node. |
| A12 | -0.7 | explicit | Rejects separate bootstrap interrupt nodes unless existing flow becomes insufficient. |
| A13 | +1.0 | explicit | Strongly preserves ask_user, outer graph, thread naming, and public surfaces. |
| A14 | +1.0 | explicit | Rejects _route patterns and unproven rich state caches. |
| A15 | +0.8 | explicit | Borrows llm_failed/origin handling from GPT-5.5. |
| A16 | +1.0 | explicit | Gives staged migration sequence and helper-first guidance. |
| A17 | +0.8 | explicit | Borrows GLM's _build_plan_output helper. |
| A18 | +0.4 | explicit | Keeps Opus as a later checklist but not as graph shape. |
| A19 | 0.0 | unmentioned | No clear finish-validation claim. |
G55: gpt-5.5-range.md
| Aspect | Score | Status | Evidence note |
|---|---|---|---|
| A01 | +1.0 | explicit | Recommends a moderate phased graph rather than detailed atomization. |
| A02 | +0.4 | explicit | Treats DeepSeek as a close moderate split but below GPT-5.4/GPT-5.5. |
| A03 | +1.0 | explicit | Ranks GPT-5.4 as best implementation spine. |
| A04 | +0.8 | explicit | Ranks GPT-5.5 second and mines it for state/test details. |
| A05 | +0.8 | explicit | Marks Gemini/Mimo as too coarse and under-specified. |
| A06 | +0.9 | explicit | Rejects fully atomized Opus/Qwen shapes for first implementation. |
| A07 | +0.9 | explicit | Uses one decision_policy in the suggested target graph. |
| A08 | -0.5 | explicit | Says GPT-5.5's three post-LLM nodes are good long-term seams but too much first-step risk. |
| A09 | +1.0 | explicit | States edge functions should inspect only and mutations belong in nodes. |
| A10 | +1.0 | explicit | All loopbacks return to tick. |
| A11 | +1.0 | explicit | tick is the start node in the recommended graph. |
| A12 | -0.5 | explicit | Says require-region/currency may be conceptual but should still use existing ask_user. |
| A13 | +1.0 | explicit | Strongly preserves single interrupt and outer graph contracts. |
| A14 | +0.9 | explicit | Rejects rich cached recipes and informal transient state unless serializer-safe. |
| A15 | +0.9 | explicit | Values llm_failed or decision_origin for calculator-loop safety. |
| A16 | +1.0 | explicit | Provides helper-first implementation guidance. |
| A17 | 0.0 | unmentioned | No clear _build_plan_output claim. |
| A18 | +0.4 | explicit | Treats Opus as checklist material, not target graph. |
| A19 | 0.0 | unmentioned | No finish-validation claim. |
KI: kimi-2.6-range.md
| Aspect | Score | Status | Evidence note |
|---|---|---|---|
| A01 | +1.0 | explicit | Recommends a 5-node hybrid as the VentureScope pipeline. |
| A02 | +1.0 | explicit | Ranks DeepSeek #1 and adopts its primary architecture. |
| A03 | +0.8 | explicit | Ranks GPT-5.4 #2 and uses its staged migration. |
| A04 | +0.3 | explicit | Notes useful GPT-5.5 ideas but ranks it as over-engineered. |
| A05 | +0.7 | explicit | Says GLM/Mimo/Gemini are safe or decent but not sufficient. |
| A06 | +1.0 | explicit | Strongly rejects Opus/GPT-5.5/Qwen-3.7 over-decomposition. |
| A07 | +1.0 | explicit | Uses a single adjust node for policy corrections. |
| A08 | -0.7 | explicit | Says the separate correction chains should be collapsed into helpers. |
| A09 | +1.0 | explicit | Rejects Qwen-3.6 redirect-in-edge approach. |
| A10 | +1.0 | explicit | Final topology loops action nodes back to guard. |
| A11 | +0.5 | explicit | Values tick but accepts a combined guard in the final shape. |
| A12 | +0.8 | explicit | Strongly praises dedicated ask_region/ask_currency visibility. |
| A13 | +0.2 | implied | Does not strongly preserve single ask_user; bootstrap explicitness pulls against this aspect. |
| A14 | +0.2 | explicit | Rejects _route, but supports recipes in state. |
| A15 | +0.4 | explicit | Includes llm_failed in the hybrid sources. |
| A16 | +1.0 | explicit | Provides seven-stage staged migration and praises zero-risk helper extraction. |
| A17 | 0.0 | unmentioned | No clear _build_plan_output claim. |
| A18 | +0.6 | explicit | Values Opus dispatch/naming ideas as partial borrowings. |
| A19 | 0.0 | unmentioned | No clear finish-validation claim. |
MI: mimo-2.5-pro-range.md
| Aspect | Score | Status | Evidence note |
|---|---|---|---|
| A01 | +1.0 | explicit | Identifies DeepSeek/GPT-5.4-sized topology as the best combination. |
| A02 | +1.0 | explicit | Ranks DeepSeek #1 and recommends its graph structure. |
| A03 | +0.8 | explicit | Ranks GPT-5.4 #2 and borrows migration/anti-pattern guidance. |
| A04 | +0.3 | explicit | Values GPT-5.5 consequences and test guidance but sees it as over-decomposed. |
| A05 | +0.5 | explicit | Minimal splits are useful but criticized for incomplete post-LLM treatment. |
| A06 | +1.0 | explicit | Strongly rejects Opus/Qwen over-decomposition. |
| A07 | +1.0 | explicit | Calls DeepSeek's adjust the strongest differentiator. |
| A08 | -0.7 | explicit | Rejects splitting corrections into many nodes. |
| A09 | +1.0 | explicit | Rejects route-function mutation as an anti-pattern. |
| A10 | +1.0 | explicit | Says loopback to guard is correct. |
| A11 | +0.6 | explicit | Recommends renaming DeepSeek's guard to tick. |
| A12 | -0.5 | explicit | Says separate region/currency nodes add unnecessary complexity. |
| A13 | +0.7 | explicit | Adds GPT-5.4 anti-pattern warnings and preserves single bootstrap concern. |
| A14 | +0.4 | explicit | Criticizes _route, but accepts DeepSeek's recipe state as sensible. |
| A15 | +0.5 | explicit | Notes llm_failed or decision_origin as sensible. |
| A16 | +1.0 | explicit | Recommends helper extraction and staged migration. |
| A17 | +0.8 | explicit | Explicitly includes GLM's _build_plan_output helper. |
| A18 | +0.5 | explicit | Values Opus naming and ownership ideas but rejects the topology. |
| A19 | 0.0 | unmentioned | No clear finish-validation claim. |
OP: opus-4.7-range.md
| Aspect | Score | Status | Evidence note |
|---|---|---|---|
| A01 | +1.0 | explicit | Recommends GPT-5.4-level middle topology as the baseline. |
| A02 | +0.8 | explicit | Ranks DeepSeek #3 and harvests its self-loop idea. |
| A03 | +1.0 | explicit | Ranks GPT-5.4 #1 and uses its six-phase pipeline. |
| A04 | +1.0 | explicit | Ranks GPT-5.5 #2 and strongly borrows its llm_failed and state-cache caution. |
| A05 | +0.9 | explicit | Says Mimo/Gemini are under-decomposed and do not solve the problem. |
| A06 | +0.9 | explicit | Calls Opus itself operationally heavy and rejects 13-node chains. |
| A07 | +1.0 | explicit | Recommends one decision_policy node, not a corrector chain. |
| A08 | -0.6 | explicit | Rejects splitting the corrector chain despite recognizing its theoretical clarity. |
| A09 | +1.0 | explicit | Calls redirect-in-edge the biggest anti-pattern. |
| A10 | +1.0 | explicit | Lists top-node loopback as universal agreement. |
| A11 | +1.0 | explicit | Uses tick as first recommended phase. |
| A12 | -0.5 | explicit | Rejects split bootstrap nodes as graph-decorative. |
| A13 | +1.0 | explicit | Strongly preserves single ask_user interrupt and outer graph scope. |
| A14 | +1.0 | explicit | Rejects recipes in state unless serializer-safe and rejects state hint pollution. |
| A15 | +1.0 | explicit | Requires llm_failed or decision_origin. |
| A16 | +1.0 | explicit | Provides a six-commit green migration. |
| A17 | 0.0 | unmentioned | No clear _build_plan_output claim. |
| A18 | +0.8 | explicit | Values state ownership and write-surface discipline as reference material. |
| A19 | +0.7 | explicit | Calls route_finish_check a sharp idea, though not the central recommendation. |
Q36: qwen-3.6-range.md
| Aspect | Score | Status | Evidence note |
|---|---|---|---|
| A01 | +1.0 | explicit | Recommends a 5-6 node target count. |
| A02 | +0.5 | explicit | Treats DeepSeek as solid middle-ground but ranks it fifth. |
| A03 | +1.0 | explicit | Ranks GPT-5.4 best overall and uses its target nodes. |
| A04 | +0.4 | explicit | Praises GPT-5.5 hygiene but ranks it low for complexity. |
| A05 | +0.5 | explicit | GLM is ranked highly as first PR, but long-term still needs more decomposition. |
| A06 | +0.8 | explicit | Rejects 10+ and 14-node proposals for checkpoint overhead and graph noise. |
| A07 | +1.0 | explicit | Recommends one decision_policy node. |
| A08 | -0.7 | explicit | Explicitly says one policy node, not five corrector nodes. |
| A09 | +1.0 | explicit | States no state mutation in edge functions as a design rule. |
| A10 | +1.0 | explicit | Requires loopbacks to tick. |
| A11 | +1.0 | explicit | Makes tick the first target node. |
| A12 | -0.6 | explicit | Merges separate region/currency nodes into one bootstrap gate. |
| A13 | +1.0 | explicit | Keeps ask_user single interrupt and avoids outer graph changes. |
| A14 | +0.8 | explicit | Rejects _route and checkpoint pollution, though does not dwell on recipes as much as OP/G54. |
| A15 | +0.8 | explicit | Includes llm_failed from GPT-5.5. |
| A16 | +1.0 | explicit | Gives combined staged migration. |
| A17 | +0.8 | explicit | Includes _build_plan_output helper. |
| A18 | +0.8 | explicit | Ranks Opus high for mapping/state ownership reference. |
| A19 | +1.0 | explicit | Includes finish_check in the recommended graph. |
Q37: qwen-3.7-max-range.md
| Aspect | Score | Status | Evidence note |
|---|---|---|---|
| A01 | +1.0 | explicit | Recommends a 5-node pipeline plus optional finish check. |
| A02 | +0.9 | explicit | Ranks DeepSeek #2 and adopts its pragmatism. |
| A03 | +1.0 | explicit | Ranks GPT-5.4 #1 and uses it as target topology. |
| A04 | +0.8 | explicit | Ranks GPT-5.5 #3 and borrows llm_failed/non-goal detail. |
| A05 | +0.8 | explicit | Treats Gemini/Mimo minimal proposals as insufficient. |
| A06 | +0.9 | explicit | Rejects Opus/Qwen-3.7-style full decomposition as too many nodes. |
| A07 | +1.0 | explicit | Uses one decision_policy node. |
| A08 | -0.7 | explicit | Says multi-node post-LLM correction adds unnecessary checkpoint writes. |
| A09 | +1.0 | explicit | Explicitly rejects redirect-in-edge mutation. |
| A10 | +1.0 | explicit | Target graph loops all action paths back to tick. |
| A11 | +1.0 | explicit | tick is the first target node. |
| A12 | -0.7 | explicit | Rejects separate ask_region/ask_currency nodes as overkill. |
| A13 | +0.9 | explicit | Keeps ask_user and public graph contracts stable. |
| A14 | +0.2 | explicit | Rejects _route but supports recipes in state, creating a mixed checkpoint-discipline position. |
| A15 | +0.8 | explicit | Includes GPT-5.5's llm_failed state field. |
| A16 | +1.0 | explicit | Gives eight-step green migration. |
| A17 | +1.0 | explicit | Includes _build_plan_output as first migration step. |
| A18 | +0.8 | explicit | Values Opus mapping and state ownership table as reference material. |
| A19 | +1.0 | explicit | Includes finish_check in the resulting target graph. |
Normalization And Ambiguity Audit
| Check | Result |
|---|---|
| Evaluator count | 10 |
| Aspect count | 19 |
| Total evaluator-aspect cells | 190 |
| Unresolved ambiguous cells | 0 |
| Ambiguity rate | 0 / 190 = 0.0% |
| Clear-equivalent normalization | Applied only to equivalent statements such as decision_policy, adjust, enforce_policy, and correct_policy when they meant a single post-LLM policy stage. |
| Kept separate because equivalence was not clear | Dedicated bootstrap nodes vs bootstrap gate; recipe cache vs checkpoint-state caution; finish validation vs normal finish node; multi-node correction split vs single policy node. |
| Neutral cells | All 0.0 cells are either unmentioned or balanced; unmentioned cells use the user-selected neutral override. |
Reproducibility Snippet
The numeric tables above were computed from the score matrix using this exact data shape:

    M[evaluator][aspect] = score in [-1, 1]
    w_i = 0.1
    alpha_j = 1 / 19
    c_ij = 1 after neutral fill
    mean_j = sum_i M[i][j] / 10
    variance_j = sum_i (M[i][j] - mean_j)^2 / 10
    center_distance_i = sqrt(sum_j alpha_j * (M[i][j] - mean_j)^2)
    pairwise_distance_i_k = sqrt(sum_j alpha_j * (M[i][j] - M[k][j])^2)
    medoid_score_i = sum_k 0.1 * pairwise_distance_i_k
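The definitions above translate directly into a short script. This is a sketch under stated assumptions rather than the script that produced the tables: the matrix is copied from the Full Scoring Matrix, variable names are chosen here, and ties are broken by evaluator code as a stand-in for the filename rule.

```python
from math import sqrt

# Score matrix copied from the Full Scoring Matrix table (A01 .. A19 per row).
M = {
    "DS":  [1.0, 1.0, 0.8, 0.5, 0.8, 0.9, 0.9, -0.8, 1.0, 1.0, 0.8, -0.3, 0.8, 0.3, 0.6, 1.0, 0.2, 0.8, 0.4],
    "GE":  [1.0, 0.8, 1.0, 0.0, 0.8, 1.0, 0.8, -0.8, 0.7, 0.4, 0.4, 0.0, 0.0, 0.3, 0.0, 0.4, 0.0, 0.0, 0.0],
    "GL":  [0.8, 0.8, 0.7, 1.0, 1.0, 0.8, 0.8, 0.3, 0.8, 1.0, 0.8, 0.2, 0.8, 0.6, 1.0, 0.9, 0.8, 0.9, 0.0],
    "G54": [1.0, 0.4, 1.0, 0.8, 0.8, 0.9, 1.0, -0.6, 1.0, 1.0, 1.0, -0.7, 1.0, 1.0, 0.8, 1.0, 0.8, 0.4, 0.0],
    "G55": [1.0, 0.4, 1.0, 0.8, 0.8, 0.9, 0.9, -0.5, 1.0, 1.0, 1.0, -0.5, 1.0, 0.9, 0.9, 1.0, 0.0, 0.4, 0.0],
    "KI":  [1.0, 1.0, 0.8, 0.3, 0.7, 1.0, 1.0, -0.7, 1.0, 1.0, 0.5, 0.8, 0.2, 0.2, 0.4, 1.0, 0.0, 0.6, 0.0],
    "MI":  [1.0, 1.0, 0.8, 0.3, 0.5, 1.0, 1.0, -0.7, 1.0, 1.0, 0.6, -0.5, 0.7, 0.4, 0.5, 1.0, 0.8, 0.5, 0.0],
    "OP":  [1.0, 0.8, 1.0, 1.0, 0.9, 0.9, 1.0, -0.6, 1.0, 1.0, 1.0, -0.5, 1.0, 1.0, 1.0, 1.0, 0.0, 0.8, 0.7],
    "Q36": [1.0, 0.5, 1.0, 0.4, 0.5, 0.8, 1.0, -0.7, 1.0, 1.0, 1.0, -0.6, 1.0, 0.8, 0.8, 1.0, 0.8, 0.8, 1.0],
    "Q37": [1.0, 0.9, 1.0, 0.8, 0.8, 0.9, 1.0, -0.7, 1.0, 1.0, 1.0, -0.7, 0.9, 0.2, 0.8, 1.0, 1.0, 0.8, 1.0],
}

codes = list(M)
n_aspects = len(M["DS"])
w = 1 / len(codes)     # evaluator weight w_i = 0.1
alpha = 1 / n_aspects  # aspect weight alpha_j = 1 / 19

# Aggregate vector and per-aspect population variance.
mean = [sum(M[c][j] for c in codes) / len(codes) for j in range(n_aspects)]
variance = [sum((M[c][j] - mean[j]) ** 2 for c in codes) / len(codes)
            for j in range(n_aspects)]


def dist(x: list[float], y: list[float]) -> float:
    """Weighted Euclidean distance with equal aspect weights."""
    return sqrt(sum(alpha * (a - b) ** 2 for a, b in zip(x, y)))


center_distance = {c: dist(M[c], mean) for c in codes}
medoid_score = {c: sum(w * dist(M[c], M[k]) for k in codes) for c in codes}

# Sort on full precision; the code acts as the tie-break key in this sketch.
center_rank = sorted(codes, key=lambda c: (center_distance[c], c))
medoid_rank = sorted(codes, key=lambda c: (medoid_score[c], c))

for c in center_rank:
    print(f"{c:4s} {center_distance[c]:.6f} {medoid_score[c]:.6f}")
```

If the matrix is transcribed correctly, the printed values should agree with the distance and medoid tables above to six decimals.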
Final Answer
By closest opinion to the equal-weight overall vector, the ranking is:
1. DS
2. MI
3. G55
4. G54
5. OP
6. Q36
7. Q37
8. GL
9. KI
10. GE
By medoid centrality, the ranking is:
1. DS
2. MI
3. G55
4. G54
5. OP
6. Q36
7. Q37
8. GL
9. KI
10. GE
Both calculations select deepseek-4-pro-range.md as the most representative evaluator opinion in this set.