First Theses Summary Recap

This file recaps the thesis-analysis work completed in this session under docs/planner-graph-ref/analyse.

Goal of the session

Build a normalized, cross-evaluator view of the analytic docs in this directory so the evaluator set can be compared at four levels:

  1. shared theses
  2. per-evaluator stance on those theses
  3. thesis popularity across evaluators
  4. evaluator ranking derived from those thesis patterns

Source set analyzed

Primary evaluator docs:

  • deepseek-4-pro-range.md
  • gemini-3.1-pro-range.md
  • glm-5.1-pro-range.md
  • gpt-5.4-range.md
  • gpt-5.5-range.md
  • kimi-2.6-range.md
  • mimo-2.5-pro-range.md
  • opus-4.7-range.md
  • qwen-3.6-range.md
  • qwen-3.7-max-range.md

Cross-check / aggregate source:

  • rankings-matrix.md

Artifacts created

1. thesis-index.md

Purpose:

  • normalize the evaluator docs into a single list of theses
  • record which evaluators explicitly agree with each thesis
  • separate universal, strong-majority, contested, rejection, and ranking theses

Key outcome:

  • the evaluator set converges strongly on a small common core
  • most disagreement lives in granularity, bootstrap-node treatment, state-field choices, and corrector placement

2. thesis-matrix.md

Purpose:

  • convert the thesis list into a stance matrix
  • mark each evaluator with A, D, or for theses 1-40

Key outcome:

  • the universal core became mechanically visible
  • contested theses became much easier to isolate and compare

3. thesis-agreement-chart.md

Purpose:

  • rank theses by evaluator agreement count
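A minimal sketch of that counting step, assuming the stance matrix is loaded as a mapping from evaluator to per-thesis markers. The evaluator names, thesis numbers, and stances below are made up for illustration; they are not taken from thesis-matrix.md.

```python
from collections import Counter

# Hypothetical stance matrix: evaluator -> {thesis number -> marker}.
# Illustrative values only, not the real contents of thesis-matrix.md.
stances = {
    "evaluator-a": {1: "A", 2: "A", 3: "D"},
    "evaluator-b": {1: "A", 2: "D", 3: "D"},
    "evaluator-c": {1: "A", 2: "A"},
}


def agreement_chart(stances):
    """Return (thesis, agreement count) pairs, most agreed first.

    Only the marker "A" counts as agreement; ties are broken by thesis number.
    """
    counts = Counter()
    theses = set()
    for marks in stances.values():
        for thesis, marker in marks.items():
            theses.add(thesis)
            if marker == "A":
                counts[thesis] += 1
    # Counter returns 0 for theses nobody marked A, so they still appear.
    return sorted(((t, counts[t]) for t in theses), key=lambda p: (-p[1], p[0]))


chart = agreement_chart(stances)
for thesis, count in chart:
    print(f"thesis {thesis}: {count} evaluator(s) agree")
```

Breaking ties by thesis number is an assumption made here to keep the ordering deterministic; the recap does not say how the chart orders theses with equal counts.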

Key outcome:

  • the most common theses are the shared diagnosis and shared architectural constraints
  • the least common theses are niche implementation preferences such as decision_origin, Opus naming conventions, and recipes caching support

4. evaluator-common-thesis-ranking.md

Purpose:

  • rank evaluators by the summed commonness of the theses they agree with

Scoring rule:

  • for each evaluator, sum the agreement-count values of every thesis they mark A
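The scoring rule can be sketched as follows. This is an illustration under assumptions, not the script used in the session: the stance data is hypothetical, and the alphabetical tie-break is a choice made here for determinism.

```python
# Hypothetical stance matrix: evaluator -> {thesis number -> marker}.
# Illustrative values only.
stances = {
    "evaluator-a": {1: "A", 2: "A", 3: "D"},
    "evaluator-b": {1: "A", 2: "D", 3: "D"},
    "evaluator-c": {1: "A", 2: "A"},
}


def common_thesis_scores(stances):
    """Sum, per evaluator, the agreement counts of every thesis they mark A."""
    # Step 1: agreement count per thesis (number of evaluators marking it A).
    counts = {}
    for marks in stances.values():
        for thesis, marker in marks.items():
            if marker == "A":
                counts[thesis] = counts.get(thesis, 0) + 1
    # Step 2: each evaluator's score is the sum of counts over its own A marks.
    return {
        evaluator: sum(counts[t] for t, m in marks.items() if m == "A")
        for evaluator, marks in stances.items()
    }


scores = common_thesis_scores(stances)
# Highest score first; equal scores are ordered by evaluator name (assumption).
ranking = sorted(scores, key=lambda ev: (-scores[ev], ev))
print(scores)
print(ranking)
```

In the sample data, thesis 1 has three agreements and thesis 2 has two, so an evaluator that marks both A scores 5 while one that marks only thesis 1 scores 3. An evaluator's own A mark is included in each count it sums.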

Key outcome:

  • evaluators with broader overlap with the directory-wide consensus rank highest on this metric

5. evaluator-preferability-ranking.md

Purpose:

  • answer the meta-question: which evaluator is most preferable if the goal is the most deliberated and comprehensive analysis?

Composite rule used:

  • 40% weighted common-thesis score
  • 20% thesis coverage count
  • 25% structural deliberation signals
  • 15% document depth
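One way the composite could be computed is sketched below. The weights come from the rule above; everything else is an assumption: the component names, the raw values, and the min-max normalization that puts the four components on a common 0 to 1 scale before weighting. The recap does not state how the components were normalized.

```python
# Weights from the composite rule; the key names are hypothetical labels.
WEIGHTS = {
    "common_thesis": 0.40,  # weighted common-thesis score
    "coverage": 0.20,       # thesis coverage count
    "deliberation": 0.25,   # structural deliberation signals
    "depth": 0.15,          # document depth
}

# Hypothetical raw component values per evaluator, for illustration only.
raw = {
    "evaluator-a": {"common_thesis": 200, "coverage": 30, "deliberation": 8, "depth": 400},
    "evaluator-b": {"common_thesis": 150, "coverage": 20, "deliberation": 4, "depth": 800},
    "evaluator-c": {"common_thesis": 175, "coverage": 25, "deliberation": 6, "depth": 600},
}


def normalize(raw):
    """Min-max scale each component to [0, 1] across evaluators (assumption)."""
    scaled = {evaluator: {} for evaluator in raw}
    for name in WEIGHTS:
        lo = min(values[name] for values in raw.values())
        hi = max(values[name] for values in raw.values())
        span = hi - lo
        for evaluator, values in raw.items():
            # If every evaluator has the same value, the component cannot
            # separate them, so it contributes 0 for everyone.
            scaled[evaluator][name] = (values[name] - lo) / span if span else 0.0
    return scaled


def composite_scores(raw):
    """Weighted sum of the normalized components for each evaluator."""
    scaled = normalize(raw)
    return {
        evaluator: sum(WEIGHTS[name] * values[name] for name in WEIGHTS)
        for evaluator, values in scaled.items()
    }


composite = composite_scores(raw)
for evaluator in sorted(composite, key=lambda ev: -composite[ev]):
    print(f"{evaluator}: {composite[evaluator]:.2f}")
```

Because the weights sum to 1.0, an evaluator that leads on every component scores 1.0 and one that trails on every component scores 0.0 under this normalization.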

Key outcome:

  • deepseek-4-pro ranked first for this meta-task
  • qwen-3.7-max and opus-4.7 followed as the strongest cross-check evaluators

Main findings from the session

Universal evaluator consensus

All evaluators agree on the following high-level points:

  • plan_node is too monolithic and should be split
  • the current graph hides important routing intelligence
  • deterministic pre-LLM logic should be graph-visible
  • the LLM phase should be thinner
  • loop-backs should re-enter at a top-of-loop phase
  • ask_user should remain the single interrupting node
  • the outer conversation graph should stay out of scope
  • the target architecture should use durable phases, not one node per if branch
  • redirect logic should not live in edge functions

Strong but not universal consensus

Examples:

  • gpt-5.4 is the strongest baseline proposal
  • _route fields are usually viewed as a bad pattern
  • explicit LLM-failure provenance is valuable
  • dedicated bootstrap nodes are usually considered graph noise

Most contested areas

Examples:

  • whether ask_region / ask_currency deserve dedicated nodes
  • whether decision_origin is worth a dedicated field
  • whether post-LLM policy should be one node or a smaller chain
  • whether caching recipes in state is useful or too risky

Final evaluator ranking for comprehensive analysis

From evaluator-preferability-ranking.md:

  1. deepseek-4-pro
  2. qwen-3.7-max
  3. opus-4.7
  4. kimi-2.6
  5. glm-5.1
  6. gpt-5.5
  7. qwen-3.6-plus
  8. mimo-2.5-pro
  9. gpt-5.4
  10. gemini-3.1-pro

Recommended practical use:

  • one evaluator only: deepseek-4-pro
  • primary + cross-check: deepseek-4-pro + opus-4.7
  • maximum exhaustiveness: deepseek-4-pro + qwen-3.7-max + opus-4.7

Caveats

  • these rankings are internal to this source set and scoring method
  • they measure analysis behavior, not guaranteed architectural correctness
  • universal theses compress differences; most rank separation comes from non-universal theses plus document structure/depth
  • rankings-matrix.md was used as a cross-check and aggregate source, not treated as another evaluator

Suggested next steps

  • add per-thesis source line references back to the original evaluator docs
  • add an alternative evaluator ranking based on inverse thesis-rank weight instead of raw agreement count
  • convert the final evaluator preferability result into a short decision note for future reuse
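The inverse thesis-rank alternative is not defined further in this recap. One possible reading, sketched here under stated assumptions: theses are ranked by agreement count (rank 1 = most agreed, equal counts share a rank), each thesis gets weight 1/rank, and an evaluator's score is the sum of the weights of the theses it marks A. The stance data is hypothetical.

```python
# Hypothetical stance matrix: evaluator -> {thesis number -> marker}.
# Illustrative values only.
stances = {
    "evaluator-a": {1: "A", 2: "A", 3: "D"},
    "evaluator-b": {1: "A", 2: "D", 3: "D"},
    "evaluator-c": {1: "A", 2: "A"},
}


def inverse_rank_scores(stances):
    """Score evaluators by summing 1/rank over the theses they mark A."""
    # Agreement count per thesis; theses nobody marks A keep a count of 0.
    counts = {}
    for marks in stances.values():
        for thesis, marker in marks.items():
            counts.setdefault(thesis, 0)
            if marker == "A":
                counts[thesis] += 1
    # Dense ranking (assumption): equal counts share the same rank.
    distinct = sorted(set(counts.values()), reverse=True)
    rank_of_count = {count: i + 1 for i, count in enumerate(distinct)}
    weight = {t: 1.0 / rank_of_count[c] for t, c in counts.items()}
    return {
        evaluator: sum(weight[t] for t, m in marks.items() if m == "A")
        for evaluator, marks in stances.items()
    }


alt_scores = inverse_rank_scores(stances)
print(alt_scores)
```

Compared with raw agreement counts, this weighting depends only on the order of theses in the agreement chart, not on how large the gaps between their counts are.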