Thesis Index for docs/planner-graph-ref/analyse

This file normalizes the main theses from the evaluator analyses in this directory.

Source evaluator docs:

  • deepseek-4-pro-range.md
  • gemini-3.1-pro-range.md
  • glm-5.1-pro-range.md
  • gpt-5.4-range.md
  • gpt-5.5-range.md
  • kimi-2.6-range.md
  • mimo-2.5-pro-range.md
  • opus-4.7-range.md
  • qwen-3.6-range.md
  • qwen-3.7-max-range.md

Cross-check used for ranking counts:

  • rankings-matrix.md

Method

  • Theses are normalized: semantically equivalent statements are merged into one line.
  • "Agreeing evaluators" means the evaluator states the thesis directly or clearly endorses it in its final ranking/synthesis.
  • Some theses conflict with each other. This file preserves those disagreements instead of forcing a false consensus.
  • rankings-matrix.md is used to cross-check ranking counts, but it is not listed as an evaluator because it is an aggregate synthesis file.
  • The companion stance grid lives in thesis-matrix.md and marks each evaluator as agree / disagree / no-position for theses 1-40.
  • The ranked agreement chart lives in thesis-agreement-chart.md and sorts theses by evaluator agreement count.
  • The evaluator ranking by summed thesis commonness lives in evaluator-common-thesis-ranking.md.
  • The evaluator preferability ranking for "most deliberated and comprehensive analysis" lives in evaluator-preferability-ranking.md.
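
The counting behind the agreement chart and the commonness ranking can be sketched as follows. This is a minimal illustration, not the script used to produce the companion files: the thesis keys ("3.1", "3.2", "3.7") and the reading of "summed thesis commonness" as the sum of agreement counts over the theses an evaluator endorses are assumptions. The evaluator sets are copied from Section 3 of this file.

```python
from typing import Dict, List, Set, Tuple

# Agreeing evaluators for three contested theses, copied from Section 3.
# The "3.x" keys are hypothetical labels used only in this sketch.
AGREEING: Dict[str, Set[str]] = {
    "3.1": {"kimi-2.6", "qwen-3.6-plus"},
    "3.2": {"deepseek-4-pro", "gemini-3.1-pro", "gpt-5.4", "gpt-5.5",
            "glm-5.1", "mimo-2.5-pro", "opus-4.7", "qwen-3.7-max"},
    "3.7": {"deepseek-4-pro"},
}

def agreement_chart(agreeing: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    # Sort theses by how many evaluators agree, highest first.
    counts = [(thesis, len(names)) for thesis, names in agreeing.items()]
    return sorted(counts, key=lambda item: item[1], reverse=True)

def summed_commonness(agreeing: Dict[str, Set[str]]) -> Dict[str, int]:
    # For each evaluator, add up the agreement counts of the theses it endorses.
    totals: Dict[str, int] = {}
    for names in agreeing.values():
        for name in names:
            totals[name] = totals.get(name, 0) + len(names)
    return totals

chart = agreement_chart(AGREEING)
commonness = summed_commonness(AGREEING)
```

Under this reading, an evaluator scores higher when it endorses widely shared theses and gains little from positions it holds alone.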

1. Universal theses

All ten evaluator docs agree on every thesis in this section:

  • deepseek-4-pro
  • gemini-3.1-pro
  • glm-5.1
  • gpt-5.4
  • gpt-5.5
  • kimi-2.6
  • mimo-2.5-pro
  • opus-4.7
  • qwen-3.6-plus
  • qwen-3.7-max

Theses:

  • plan_node is a monolith / god node and should be split.
  • The current graph is misleading because too much routing intelligence is hidden inside plan_node.
  • Deterministic pre-LLM logic belongs in graph-visible phases.
  • The LLM step should shrink to a thin planning phase rather than remain a mixed orchestration node.
  • Loop-back edges from action nodes should re-enter at a top-of-loop phase, not jump back into the middle of planning.
  • ask_user should remain the single interrupting node.
  • The outer conversation graph is out of scope for this refactor.
  • The right end-state is a medium-granularity pipeline of durable phases, not a node per if branch.
  • Heavy over-decomposition adds graph noise and checkpoint overhead.
  • Conservative 2-3 node splits may be acceptable as migration slices, but they are not the strongest final architecture.
  • Post-LLM routing/policy must become explicit architecture, not remain buried in a catch-all planner function.
  • Redirect / mutation logic does not belong in edge functions in the final design.
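
The shape these theses describe can be sketched in plain Python, without any graph library. Everything named here is an assumption for illustration: the phase names (loop_entry, plan, policy, calculator), the state keys, and the stubbed planner that always requests a calculator pass. The sketch shows a top-of-loop phase, a thin planning phase, post-LLM policy as its own node, an edge function that only reads state, and an action node that loops back to the top. ask_user and the real LLM call are omitted.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]

def loop_entry(state: State) -> State:
    # Top-of-loop phase: deterministic pre-LLM work would live here.
    return {**state, "iterations": int(state["iterations"]) + 1}

def plan(state: State) -> State:
    # Thin planning phase. A real version would call the LLM; this stub
    # always asks for a calculator pass.
    return {**state, "decision": "calculator"}

def policy(state: State) -> State:
    # Post-LLM policy as an explicit node: the redirect mutates state here,
    # so the edge function below stays a pure read.
    if state["decision"] == "calculator" and state.get("calc_done"):
        return {**state, "decision": "finish"}
    return state

def calculator(state: State) -> State:
    # Action node (stub): records that the calculation has run.
    return {**state, "calc_done": True}

NODES: Dict[str, Callable[[State], State]] = {
    "loop_entry": loop_entry,
    "plan": plan,
    "policy": policy,
    "calculator": calculator,
}

def next_node(current: str, state: State) -> str:
    # Edge function: reads state, never writes it.
    if current == "loop_entry":
        return "plan"
    if current == "plan":
        return "policy"
    if current == "policy":
        return "calculator" if state["decision"] == "calculator" else "END"
    return "loop_entry"  # action nodes re-enter at the top of the loop

def run(state: State, max_steps: int = 20) -> Tuple[State, List[str]]:
    trace: List[str] = []
    current = "loop_entry"
    while current != "END" and len(trace) < max_steps:
        state = NODES[current](state)
        trace.append(current)
        current = next_node(current, state)
    return state, trace

final_state, trace = run({"iterations": 0})
```

In this run the calculator node hands control back to loop_entry rather than to plan, and the second pass ends because policy, not an edge, rewrites the decision to finish.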

2. Strong-majority theses

  1. gpt-5.4 is the strongest baseline and the consensus winner.

    • Agreeing evaluators: gemini-3.1-pro, gpt-5.4, gpt-5.5, opus-4.7, qwen-3.6-plus, qwen-3.7-max
  2. deepseek-4-pro is the strongest 4-node alternative and the most stable second-tier option.

    • Agreeing evaluators: deepseek-4-pro, gemini-3.1-pro, kimi-2.6, mimo-2.5-pro, qwen-3.7-max
  3. gpt-5.5 contributes the best state hygiene, consequence analysis, and test-migration ideas even if its full topology is too granular.

    • Agreeing evaluators: deepseek-4-pro, glm-5.1, gpt-5.5, opus-4.7, qwen-3.7-max
  4. The planner needs explicit LLM-failure provenance (llm_failed or decision_origin) so failure paths are not rewritten into bad calculator/reflect loops.

    • Agreeing evaluators: deepseek-4-pro, glm-5.1, gpt-5.5, opus-4.7, qwen-3.7-max
  5. Exactly one loop-entry node should own increments to the iterations counter.

    • Agreeing evaluators: deepseek-4-pro, gpt-5.4, gpt-5.5, kimi-2.6, opus-4.7, qwen-3.6-plus, qwen-3.7-max
  6. _route / route-tag fields in checkpointed state are a bad pattern.

    • Agreeing evaluators: deepseek-4-pro, gpt-5.4, kimi-2.6, mimo-2.5-pro, opus-4.7, qwen-3.6-plus, qwen-3.7-max
  7. Caching recipes / FieldAcquisition-like acquisition objects in planner state is risky until serializer compatibility is proven.

    • Agreeing evaluators: gpt-5.4, gpt-5.5, opus-4.7, qwen-3.7-max
  8. Explicit state-write ownership is worth adopting as implementation documentation.

    • Agreeing evaluators: deepseek-4-pro, glm-5.1, gpt-5.5, opus-4.7
  9. Calculator lifecycle deserves explicit named handling (gate, policy seam, or dedicated concern), not incidental inline logic.

    • Agreeing evaluators: deepseek-4-pro, glm-5.1, gpt-5.5, kimi-2.6, qwen-3.7-max
  10. GLM's _build_plan_output() helper is worth adopting regardless of the final graph shape.

    • Agreeing evaluators: deepseek-4-pro, glm-5.1, qwen-3.6-plus, qwen-3.7-max
  11. Kimi's decision-matrix / "what-goes-where" appendix is one of the best implementation reference artifacts in the set.

    • Agreeing evaluators: deepseek-4-pro, glm-5.1, gpt-5.5, kimi-2.6, qwen-3.7-max
  12. Opus's state/write mapping tables are valuable reference material even if the proposed graph is too chatty.

    • Agreeing evaluators: deepseek-4-pro, glm-5.1, gpt-5.5, opus-4.7
  13. qwen-3.7-max is useful as implementation pseudo-code / reference, but not as the target topology.

    • Agreeing evaluators: deepseek-4-pro, glm-5.1, opus-4.7, qwen-3.6-plus, qwen-3.7-max
  14. qwen-3.6-plus contributes a useful finish-validation idea (route_finish_check / finish re-check) even though the rest of the design should not be adopted as-is.

    • Agreeing evaluators: deepseek-4-pro, mimo-2.5-pro, opus-4.7, qwen-3.6-plus, qwen-3.7-max
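
Thesis 4 above (explicit LLM-failure provenance) can be illustrated with a small sketch. The field name decision_origin comes from the thesis; the three origin values, the action names, and the rewrite rule (a "finish" decision is upgraded to a calculator pass when inputs are ready) are hypothetical stand-ins for whatever the real post-LLM policy does.

```python
from typing import Literal, TypedDict

# Hypothetical origin values; only the field name appears in the analyses.
DecisionOrigin = Literal["llm", "deterministic", "llm_failed"]

class Decision(TypedDict):
    action: str
    decision_origin: DecisionOrigin

def apply_post_policy(decision: Decision, calculator_ready: bool) -> Decision:
    # A failure fallback passes through untouched, so policy cannot turn it
    # into a calculator or reflect step.
    if decision["decision_origin"] == "llm_failed":
        return decision
    # Illustrative rewrite rule (an assumption, not the real policy).
    if decision["action"] == "finish" and calculator_ready:
        return {"action": "calculator",
                "decision_origin": decision["decision_origin"]}
    return decision

from_llm = apply_post_policy(
    {"action": "finish", "decision_origin": "llm"}, calculator_ready=True)
from_failure = apply_post_policy(
    {"action": "finish", "decision_origin": "llm_failed"}, calculator_ready=True)
```

The two calls pass the same action and differ only in provenance: the first is rewritten to a calculator pass, the second is left alone. A bare llm_failed boolean would support the same guard; the enumerated field also records deterministic decisions.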

3. Contested theses

  1. Dedicated ask_region / ask_currency nodes improve bootstrap visibility and graph honesty.

    • Agreeing evaluators: kimi-2.6, qwen-3.6-plus
  2. Dedicated bootstrap nodes are mostly graph noise; one gate feeding the existing ask_user path is better.

    • Agreeing evaluators: deepseek-4-pro, gemini-3.1-pro, glm-5.1, gpt-5.4, gpt-5.5, mimo-2.5-pro, opus-4.7, qwen-3.7-max
  3. route_direct is a good adapter pattern for converging deterministic acquisition decisions and LLM decisions into one post-policy phase.

    • Agreeing evaluators: deepseek-4-pro, kimi-2.6, qwen-3.7-max
  4. decision_origin is a better long-term state field than a bare llm_failed boolean.

    • Agreeing evaluators: deepseek-4-pro, gpt-5.5
  5. Splitting post-LLM policy into normalize_decision -> retry_gate -> maybe_calculate improves correctness and testability despite the extra hops.

    • Agreeing evaluators: glm-5.1, gpt-5.5, qwen-3.7-max
  6. DeepSeek's plan -> plan self-loop is the cleanest way to surface late decomposition / composite-target re-prompting.

    • Agreeing evaluators: deepseek-4-pro, opus-4.7, qwen-3.7-max
  7. Keeping recipes in planner state is a worthwhile cache if compatibility is verified.

    • Agreeing evaluators: deepseek-4-pro
  8. Opus-style g_* / c_* naming improves graph self-documentation.

    • Agreeing evaluators: deepseek-4-pro, opus-4.7
  9. A dedicated dispatch phase is useful as a canonical final log+route point.

    • Agreeing evaluators: deepseek-4-pro, opus-4.7, qwen-3.7-max
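
The dispute over caching recipes in planner state (thesis 7 here and thesis 7 in Section 2) turns on whether such objects survive checkpoint serialization. A round-trip check along the following lines is one way to verify that before adopting the cache. JSON is used as a stand-in because the analyses do not name the real checkpoint serializer, and AcquisitionStub is a hypothetical placeholder, not the actual FieldAcquisition type.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class AcquisitionStub:
    # Hypothetical stand-in for a FieldAcquisition-like object.
    field: str
    source: str

def survives_round_trip(value: object) -> bool:
    # True only if the value serializes and comes back equal to the original.
    try:
        return json.loads(json.dumps(value)) == value
    except (TypeError, ValueError):
        return False

stub = AcquisitionStub(field="currency", source="ask_user")
raw_ok = survives_round_trip(stub)            # dataclass instance as-is
plain_ok = survives_round_trip(asdict(stub))  # flattened to a plain dict
```

With this stand-in serializer the raw object fails the check while its plain-dict form passes, which is the kind of evidence the cautious evaluators ask for before objects are cached in state.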

4. Proposal-specific rejection theses

  1. qwen-3.6-plus's redirect-in-edge pattern should be rejected even if some of its finish-validation ideas are retained.

    • Agreeing evaluators: deepseek-4-pro, gemini-3.1-pro, glm-5.1, gpt-5.4, gpt-5.5, kimi-2.6, mimo-2.5-pro, opus-4.7, qwen-3.7-max
  2. The full Opus corrector lattice is analytically excellent but too chatty for production.

    • Agreeing evaluators: deepseek-4-pro, gemini-3.1-pro, glm-5.1, gpt-5.4, gpt-5.5, kimi-2.6, mimo-2.5-pro, opus-4.7, qwen-3.6-plus, qwen-3.7-max
  3. Gemini's evaluate_rules shape is too coarse and would become a new god node.

    • Agreeing evaluators: deepseek-4-pro, gemini-3.1-pro, glm-5.1, gpt-5.4, gpt-5.5, kimi-2.6, mimo-2.5-pro, opus-4.7, qwen-3.6-plus, qwen-3.7-max
  4. Mimo's decide node still hides too much logic and does not solve enough of the original problem.

    • Agreeing evaluators: deepseek-4-pro, gemini-3.1-pro, glm-5.1, gpt-5.4, gpt-5.5, kimi-2.6, mimo-2.5-pro, opus-4.7, qwen-3.6-plus, qwen-3.7-max
  5. qwen-3.7-max's _route-heavy design is brittle even if its code sketches are useful.

    • Agreeing evaluators: deepseek-4-pro, gpt-5.4, kimi-2.6, mimo-2.5-pro, opus-4.7, qwen-3.6-plus

5. Ranking theses

  1. gpt-5.4 was ranked #1 by six evaluators.

    • Agreeing evaluators: gemini-3.1-pro, gpt-5.4, gpt-5.5, opus-4.7, qwen-3.6-plus, qwen-3.7-max
  2. deepseek-4-pro was ranked #1 by three evaluators.

    • Agreeing evaluators: deepseek-4-pro, kimi-2.6, mimo-2.5-pro
  3. gpt-5.5 was ranked #1 by one evaluator.

    • Agreeing evaluators: glm-5.1
  4. gpt-5.4 and deepseek-4-pro together form the clear top tier across the directory.

    • Agreeing evaluators: deepseek-4-pro, gemini-3.1-pro, gpt-5.5, kimi-2.6, mimo-2.5-pro, opus-4.7, qwen-3.6-plus, qwen-3.7-max
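
The first-place counts in theses 1-3 of this section can be re-derived with a short tally. The mapping below is copied from the evaluator lists above; the variable names are illustrative, and rankings-matrix.md remains the cross-check for the full rankings.

```python
from collections import Counter

# Each evaluator's #1 pick, as listed in ranking theses 1-3.
TOP_PICK = {
    "gemini-3.1-pro": "gpt-5.4",
    "gpt-5.4": "gpt-5.4",
    "gpt-5.5": "gpt-5.4",
    "opus-4.7": "gpt-5.4",
    "qwen-3.6-plus": "gpt-5.4",
    "qwen-3.7-max": "gpt-5.4",
    "deepseek-4-pro": "deepseek-4-pro",
    "kimi-2.6": "deepseek-4-pro",
    "mimo-2.5-pro": "deepseek-4-pro",
    "glm-5.1": "gpt-5.5",
}

# Count first-place votes per proposal, most-voted first.
first_place = Counter(TOP_PICK.values()).most_common()
```

The tally accounts for all ten evaluators: six votes for gpt-5.4, three for deepseek-4-pro, and one for gpt-5.5.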