First Theses Summary Recap¶
This file recaps the thesis-analysis work completed in this session under docs/planner-graph-ref/analyse.
Goal of the session¶
Build a normalized, cross-evaluator view of the analytic docs in this directory so the evaluator set can be compared at four levels:
- shared theses
- per-evaluator stance on those theses
- thesis popularity across evaluators
- evaluator ranking derived from those thesis patterns
Source set analyzed¶
Primary evaluator docs:
- `deepseek-4-pro-range.md`
- `gemini-3.1-pro-range.md`
- `glm-5.1-pro-range.md`
- `gpt-5.4-range.md`
- `gpt-5.5-range.md`
- `kimi-2.6-range.md`
- `mimo-2.5-pro-range.md`
- `opus-4.7-range.md`
- `qwen-3.6-range.md`
- `qwen-3.7-max-range.md`
Cross-check / aggregate source:
rankings-matrix.md
Artifacts created¶
1. thesis-index.md¶
Purpose:
- normalize the evaluator docs into a single list of theses
- record which evaluators explicitly agree with each thesis
- separate universal, strong-majority, contested, rejection, and ranking theses
Key outcome:
- the evaluator set converges strongly on a small common core
- most disagreement lives in granularity, bootstrap-node treatment, state-field choices, and corrector placement
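The count-based part of that separation can be sketched as a small bucketing function. This is only an illustration: `classify_thesis` and the 70% cut-off are assumptions, since the recap does not state the actual thresholds, and the rejection and ranking categories are left out because they appear to describe what a thesis says rather than how many evaluators hold it.

```python
def classify_thesis(agree_count: int, total_evaluators: int = 10) -> str:
    """Bucket a thesis by how many evaluators explicitly agree with it.

    The 70% cut-off for "strong-majority" is a hypothetical value.
    """
    if agree_count == total_evaluators:
        return "universal"
    # Integer comparison avoids floating-point edge cases at the boundary.
    if agree_count * 10 >= 7 * total_evaluators:
        return "strong-majority"
    return "contested"


for count in (10, 8, 4):
    print(count, classify_thesis(count))
```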
2. thesis-matrix.md¶
Purpose:
- convert the thesis list into a stance matrix
- mark each evaluator with `A`, `D`, or `—` for theses 1-40
Key outcome:
- the universal core became mechanically visible
- contested theses became much easier to isolate and compare
3. thesis-agreement-chart.md¶
Purpose:
- rank theses by evaluator agreement count
Key outcome:
- the most common theses are the shared diagnosis and shared architectural constraints
- the least common theses are niche implementation preferences such as `decision_origin`, Opus naming conventions, and `recipes` caching support
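A minimal sketch of how such an agreement chart could be derived from a stance matrix. The toy matrix, the evaluator names, and the helper names are all hypothetical, and reading `A` as "agrees" is an assumption based on the scoring rule used elsewhere in this recap.

```python
# Toy stance matrix: thesis id -> {evaluator: stance}. Values are illustrative only.
matrix = {
    "T1": {"eval-a": "A", "eval-b": "A", "eval-c": "A"},
    "T2": {"eval-a": "A", "eval-b": "D", "eval-c": "—"},
    "T3": {"eval-a": "A", "eval-b": "A", "eval-c": "D"},
}


def agreement_counts(matrix):
    """Count, per thesis, how many evaluators are marked "A"."""
    return {
        thesis: sum(1 for stance in stances.values() if stance == "A")
        for thesis, stances in matrix.items()
    }


def rank_theses(matrix):
    """Sort theses by agreement count (descending), breaking ties by thesis id."""
    counts = agreement_counts(matrix)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


print(rank_theses(matrix))  # [('T1', 3), ('T3', 2), ('T2', 1)]
```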
4. evaluator-common-thesis-ranking.md¶
Purpose:
- rank evaluators by the summed commonness of the theses they contain
Scoring rule:
- for each evaluator, sum the agreement-count values of every thesis they mark `A`
Key outcome:
- evaluators with broader overlap with the directory-wide consensus rank highest on this metric
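The scoring rule above can be sketched as follows. The matrix contents and function name are hypothetical; only the rule itself (sum the agreement counts of the theses an evaluator marks `A`) comes from this recap.

```python
# Toy stance matrix: thesis id -> {evaluator: stance}. Values are illustrative only.
matrix = {
    "T1": {"eval-a": "A", "eval-b": "A", "eval-c": "A"},
    "T2": {"eval-a": "A", "eval-b": "D", "eval-c": "—"},
    "T3": {"eval-a": "A", "eval-b": "A", "eval-c": "D"},
}


def common_thesis_scores(matrix):
    """Sum, per evaluator, the agreement counts of every thesis they mark "A"."""
    counts = {
        thesis: sum(1 for stance in stances.values() if stance == "A")
        for thesis, stances in matrix.items()
    }
    scores = {}
    for thesis, stances in matrix.items():
        for evaluator, stance in stances.items():
            scores.setdefault(evaluator, 0)
            if stance == "A":
                scores[evaluator] += counts[thesis]
    return scores


print(common_thesis_scores(matrix))  # {'eval-a': 6, 'eval-b': 5, 'eval-c': 3}
```

In this toy data `eval-a` ranks highest because it agrees with every thesis, including the most widely shared one.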
5. evaluator-preferability-ranking.md¶
Purpose:
- answer the meta-question: which evaluator is most preferable if the goal is the most deliberated and comprehensive analysis?
Composite rule used:
- 40% weighted common-thesis score
- 20% thesis coverage count
- 25% structural deliberation signals
- 15% document depth
Key outcome:
- `deepseek-4-pro` ranked first for this meta-task
- `qwen-3.7-max` and `opus-4.7` followed as the strongest cross-check evaluators
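A rough sketch of the composite rule, assuming each of the four components has already been normalized to the range 0 to 1. The recap does not say how the components were normalized, so the normalization, the component keys, and the sample values are all assumptions; only the four weights come from the rule above.

```python
# Weights from the composite rule; the dictionary keys are hypothetical names.
WEIGHTS = {
    "common_thesis": 0.40,
    "coverage": 0.20,
    "deliberation": 0.25,
    "depth": 0.15,
}


def preferability(components):
    """Weighted sum of components, each assumed to be normalized to [0, 1]."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Illustrative inputs, not real evaluator data.
candidates = {
    "eval-a": {"common_thesis": 1.0, "coverage": 0.9, "deliberation": 0.8, "depth": 1.0},
    "eval-b": {"common_thesis": 0.8, "coverage": 1.0, "deliberation": 0.6, "depth": 0.5},
}

ranking = sorted(candidates, key=lambda name: preferability(candidates[name]), reverse=True)
print(ranking)
```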
Main findings from the session¶
Universal evaluator consensus¶
All evaluators agree on the following high-level points:
- `plan_node` is too monolithic and should be split
- the current graph hides important routing intelligence
- deterministic pre-LLM logic should be graph-visible
- the LLM phase should be thinner
- loop-backs should re-enter at a top-of-loop phase
- `ask_user` should remain the single interrupting node
- the outer conversation graph should stay out of scope
- the target architecture should use durable phases, not a node per `if`
- redirect logic should not live in edge functions
Strong consensus but not universal¶
Examples:
- `gpt-5.4` is the strongest baseline proposal
- `_route` fields are usually viewed as a bad pattern
- explicit LLM-failure provenance is valuable
- dedicated bootstrap nodes are usually considered graph noise
Most contested areas¶
Examples:
- whether `ask_region` / `ask_currency` deserve dedicated nodes
- whether `decision_origin` is worth a dedicated field
- whether post-LLM policy should be one node or a smaller chain
- whether caching `recipes` in state is useful or too risky
Final evaluator ranking for comprehensive analysis¶
From evaluator-preferability-ranking.md:
1. `deepseek-4-pro`
2. `qwen-3.7-max`
3. `opus-4.7`
4. `kimi-2.6`
5. `glm-5.1`
6. `gpt-5.5`
7. `qwen-3.6-plus`
8. `mimo-2.5-pro`
9. `gpt-5.4`
10. `gemini-3.1-pro`
Recommended practical use:
- one evaluator only: `deepseek-4-pro`
- primary + cross-check: `deepseek-4-pro` + `opus-4.7`
- maximum exhaustiveness: `deepseek-4-pro` + `qwen-3.7-max` + `opus-4.7`
Caveats¶
- these rankings are internal to this source set and scoring method
- they measure analysis behavior, not guaranteed architectural correctness
- universal theses compress differences; most rank separation comes from non-universal theses plus document structure/depth
- `rankings-matrix.md` was used as a cross-check and aggregate source, not treated as another evaluator
Suggested next steps¶
- if needed, add per-thesis source line references back to the original evaluator docs
- if needed, add an alternative evaluator ranking based on inverse thesis-rank weight instead of raw agreement count
- if needed, convert the final evaluator preferability result into a short decision note for future reuse
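One possible reading of the inverse thesis-rank alternative, sketched under stated assumptions: theses are dense-ranked by agreement count (rank 1 is the most agreed), and each thesis an evaluator marks `A` contributes `1 / rank`. The weighting formula, the tie handling, and the toy data are all hypothetical, since the recap does not define this variant.

```python
# Toy stance matrix: thesis id -> {evaluator: stance}. Values are illustrative only.
matrix = {
    "T1": {"eval-a": "A", "eval-b": "A", "eval-c": "A"},
    "T2": {"eval-a": "A", "eval-b": "D", "eval-c": "—"},
    "T3": {"eval-a": "A", "eval-b": "A", "eval-c": "D"},
}


def inverse_rank_scores(matrix):
    """Score evaluators by summing 1/rank over the theses they mark "A"."""
    counts = {
        thesis: sum(1 for stance in stances.values() if stance == "A")
        for thesis, stances in matrix.items()
    }
    # Dense ranking: theses with equal agreement counts share a rank.
    distinct_counts = sorted(set(counts.values()), reverse=True)
    rank = {thesis: distinct_counts.index(c) + 1 for thesis, c in counts.items()}

    scores = {}
    for thesis, stances in matrix.items():
        for evaluator, stance in stances.items():
            weight = 1.0 / rank[thesis] if stance == "A" else 0.0
            scores[evaluator] = scores.get(evaluator, 0.0) + weight
    return scores


print(inverse_rank_scores(matrix))
```

Compared with raw agreement counts, this weighting depends only on the ordering of theses, so a large gap in agreement between two adjacent theses no longer changes the scores.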