First Theses Summary Recap

This file recaps the thesis-analysis work completed in this session under docs/planner-graph-ref/analyse.

Goal of the session

Build a normalized, cross-evaluator view of the analytic docs in this directory so the evaluator set can be compared at four levels:

shared theses
per-evaluator stance on those theses
thesis popularity across evaluators
evaluator ranking derived from those thesis patterns

Source set analyzed

Primary evaluator docs:

deepseek-4-pro-range.md
gemini-3.1-pro-range.md
glm-5.1-pro-range.md
gpt-5.4-range.md
gpt-5.5-range.md
kimi-2.6-range.md
mimo-2.5-pro-range.md
opus-4.7-range.md
qwen-3.6-range.md
qwen-3.7-max-range.md

Cross-check / aggregate source:

rankings-matrix.md

Artifacts created

1. `thesis-index.md`

Purpose:

normalize the evaluator docs into a single list of theses
record which evaluators explicitly agree with each thesis
separate universal, strong-majority, contested, rejection, and ranking theses

Key outcome:

the evaluator set converges strongly on a small common core
most disagreement lives in granularity, bootstrap-node treatment, state-field choices, and corrector placement

2. `thesis-matrix.md`

Purpose:

convert the thesis list into a stance matrix
mark each evaluator with A, D, or — for theses 1-40
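A stance matrix of this kind could be represented roughly as follows. This is only a sketch: the recap does not define the symbols, so the code assumes A means agree, D means disagree, and the dash means no stance was recorded. The evaluator names and stance values are placeholders, not data from thesis-matrix.md, and the helper names are hypothetical.

```python
from typing import Dict, List

# Assumed meanings: "A" = agrees, "D" = disagrees, dash = no stance recorded.
NO_STANCE = "\u2014"  # the dash symbol used in the matrix
VALID_STANCES = {"A", "D", NO_STANCE}

Matrix = Dict[int, Dict[str, str]]  # thesis number -> {evaluator: stance}

# Placeholder rows for illustration only.
SAMPLE: Matrix = {
    1: {"eval-a": "A", "eval-b": "A", "eval-c": "A"},
    2: {"eval-a": "A", "eval-b": "D", "eval-c": NO_STANCE},
    3: {"eval-a": NO_STANCE, "eval-b": "A", "eval-c": "D"},
}


def validate(matrix: Matrix) -> None:
    """Raise ValueError if any cell holds something other than A, D, or the dash."""
    for thesis, row in matrix.items():
        for evaluator, stance in row.items():
            if stance not in VALID_STANCES:
                raise ValueError(f"thesis {thesis}, {evaluator}: bad stance {stance!r}")


def universal_theses(matrix: Matrix) -> List[int]:
    """Theses that every evaluator marks A."""
    return sorted(t for t, row in matrix.items()
                  if all(s == "A" for s in row.values()))


def contested_theses(matrix: Matrix) -> List[int]:
    """Theses with at least one A and at least one D."""
    return sorted(t for t, row in matrix.items()
                  if {"A", "D"} <= set(row.values()))
```

With a layout like this, the universal core and the contested theses fall out of two simple filters over the rows.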

Key outcome:

the universal core became mechanically visible
contested theses became much easier to isolate and compare

3. `thesis-agreement-chart.md`

Purpose:

rank theses by evaluator agreement count
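The agreement-count ranking can be sketched as below, assuming the count for a thesis is the number of evaluators marked A for it. The sample rows and the tie-break (lower thesis number first) are illustrative assumptions, not taken from thesis-agreement-chart.md.

```python
from typing import Dict, List, Tuple

Matrix = Dict[int, Dict[str, str]]  # thesis number -> {evaluator: stance}

# Placeholder rows for illustration only.
SAMPLE: Matrix = {
    1: {"eval-a": "A", "eval-b": "D", "eval-c": "A"},
    2: {"eval-a": "A", "eval-b": "A", "eval-c": "A"},
    3: {"eval-a": "D", "eval-b": "A", "eval-c": "D"},
}


def agreement_counts(matrix: Matrix) -> Dict[int, int]:
    """Count how many evaluators mark each thesis A."""
    return {thesis: sum(1 for stance in row.values() if stance == "A")
            for thesis, row in matrix.items()}


def rank_theses(matrix: Matrix) -> List[Tuple[int, int]]:
    """Return (thesis, count) pairs, most agreed first; ties by thesis number."""
    counts = agreement_counts(matrix)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```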

Key outcome:

the most common theses are the shared diagnosis and shared architectural constraints
the least common theses are niche implementation preferences such as decision_origin, Opus naming conventions, and recipes caching support

4. `evaluator-common-thesis-ranking.md`

Purpose:

rank evaluators by the summed agreement counts of the theses they endorse

Scoring rule:

for each evaluator, sum the agreement-count values of every thesis they mark A
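The scoring rule above can be sketched as follows. The sample matrix is placeholder data, and the function names are hypothetical. Only the rule itself (sum the agreement counts of every thesis an evaluator marks A) comes from the recap.

```python
from typing import Dict

Matrix = Dict[int, Dict[str, str]]  # thesis number -> {evaluator: stance}

# Placeholder rows for illustration only.
SAMPLE: Matrix = {
    1: {"eval-a": "A", "eval-b": "A", "eval-c": "A"},  # agreement count 3
    2: {"eval-a": "A", "eval-b": "A", "eval-c": "D"},  # agreement count 2
    3: {"eval-a": "A", "eval-b": "D", "eval-c": "D"},  # agreement count 1
}


def agreement_counts(matrix: Matrix) -> Dict[int, int]:
    """Number of evaluators marking each thesis A."""
    return {t: sum(1 for s in row.values() if s == "A")
            for t, row in matrix.items()}


def common_thesis_scores(matrix: Matrix) -> Dict[str, int]:
    """For each evaluator, sum the agreement counts of the theses it marks A."""
    counts = agreement_counts(matrix)
    scores: Dict[str, int] = {}
    for thesis, row in matrix.items():
        for evaluator, stance in row.items():
            scores.setdefault(evaluator, 0)
            if stance == "A":
                scores[evaluator] += counts[thesis]
    return scores
```

In the sample, eval-a endorses all three theses and scores 3 + 2 + 1 = 6, while eval-c endorses only the universal thesis and scores 3. This illustrates why broad overlap with the consensus pushes an evaluator up on this metric.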

Key outcome:

evaluators with broader overlap with the directory-wide consensus rank highest on this metric

5. `evaluator-preferability-ranking.md`

Purpose:

answer the meta-question: which evaluator is most preferable if the goal is the most deliberated and comprehensive analysis?

Composite rule used:

40% weighted common-thesis score
20% thesis coverage count
25% structural deliberation signals
15% document depth
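The composite could be computed along these lines. The four weights come from the rule above. Everything else is an assumption made for illustration: the recap does not say how each component is scaled, so this sketch divides each component by its maximum across evaluators before weighting. The component keys and the raw numbers are hypothetical.

```python
from typing import Dict

# Weights from the composite rule; the key names are hypothetical labels.
WEIGHTS: Dict[str, float] = {
    "common_thesis": 0.40,
    "coverage": 0.20,
    "deliberation": 0.25,
    "depth": 0.15,
}


def normalize(values: Dict[str, float]) -> Dict[str, float]:
    """Scale values to [0, 1] by dividing by the maximum (assumed scaling)."""
    top = max(values.values())
    return {name: (value / top if top else 0.0) for name, value in values.items()}


def composite_scores(raw: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """raw maps component -> {evaluator: raw value}; returns weighted totals."""
    scaled = {component: normalize(raw[component]) for component in WEIGHTS}
    evaluators = raw["common_thesis"].keys()
    return {e: sum(WEIGHTS[c] * scaled[c][e] for c in WEIGHTS) for e in evaluators}


# Placeholder inputs for illustration only.
RAW = {
    "common_thesis": {"eval-a": 200.0, "eval-b": 100.0},
    "coverage": {"eval-a": 30.0, "eval-b": 30.0},
    "deliberation": {"eval-a": 4.0, "eval-b": 8.0},
    "depth": {"eval-a": 1000.0, "eval-b": 500.0},
}
```

A different scaling choice (for example min-max or rank-based) could reorder closely scored evaluators, so the scaling step should be read as one plausible option rather than the method used in the session.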

Key outcome:

deepseek-4-pro ranked first for this meta-task
qwen-3.7-max and opus-4.7 followed as the strongest cross-check evaluators

Main findings from the session

Universal evaluator consensus

All evaluators agree on the following high-level points:

plan_node is too monolithic and should be split
the current graph hides important routing intelligence
deterministic pre-LLM logic should be graph-visible
the LLM phase should be thinner
loop-backs should re-enter at a top-of-loop phase
ask_user should remain the single interrupting node
the outer conversation graph should stay out of scope
the target architecture should use durable phases rather than one node per if statement
redirect logic should not live in edge functions

Strong consensus but not universal

Examples:

gpt-5.4 is the strongest baseline proposal
_route fields are usually viewed as a bad pattern
explicit LLM-failure provenance is valuable
dedicated bootstrap nodes are usually considered graph noise

Most contested areas

Examples:

whether ask_region / ask_currency deserve dedicated nodes
whether decision_origin is worth a dedicated field
whether post-LLM policy should be one node or a smaller chain
whether caching recipes in state is useful or too risky

Final evaluator ranking for comprehensive analysis

From evaluator-preferability-ranking.md:

1. deepseek-4-pro
2. qwen-3.7-max
3. opus-4.7
4. kimi-2.6
5. glm-5.1
6. gpt-5.5
7. qwen-3.6-plus
8. mimo-2.5-pro
9. gpt-5.4
10. gemini-3.1-pro

Recommended practical use:

one evaluator only: deepseek-4-pro
primary + cross-check: deepseek-4-pro + opus-4.7
maximum exhaustiveness: deepseek-4-pro + qwen-3.7-max + opus-4.7

Caveats

these rankings are internal to this source set and scoring method
they measure analysis behavior, not guaranteed architectural correctness
universal theses compress differences; most rank separation comes from non-universal theses plus document structure/depth
rankings-matrix.md was used as a cross-check and aggregate source, not treated as another evaluator

Suggested next steps

if needed, add per-thesis source line references back to the original evaluator docs
if needed, add an alternative evaluator ranking based on inverse thesis-rank weight instead of raw agreement count
if needed, convert the final evaluator preferability result into a short decision note for future reuse
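One possible reading of the inverse thesis-rank weighting mentioned above is sketched here. The recap does not define it, so this is an interpretation: theses are ranked by agreement count (rank 1 is the most agreed, tied theses share a rank), each thesis is weighted 1/rank, and an evaluator's score is the sum of the weights of the theses it marks A. The sample matrix and function names are hypothetical.

```python
from typing import Dict

Matrix = Dict[int, Dict[str, str]]  # thesis number -> {evaluator: stance}

# Placeholder rows for illustration only.
SAMPLE: Matrix = {
    1: {"eval-a": "A", "eval-b": "A", "eval-c": "A"},  # count 3 -> rank 1
    2: {"eval-a": "A", "eval-b": "A", "eval-c": "D"},  # count 2 -> rank 2
    3: {"eval-a": "A", "eval-b": "D", "eval-c": "D"},  # count 1 -> rank 3
}


def inverse_rank_weights(matrix: Matrix) -> Dict[int, float]:
    """Weight each thesis by 1 / dense rank of its agreement count."""
    counts = {t: sum(1 for s in row.values() if s == "A")
              for t, row in matrix.items()}
    distinct = sorted(set(counts.values()), reverse=True)
    rank_of = {count: position + 1 for position, count in enumerate(distinct)}
    return {t: 1.0 / rank_of[c] for t, c in counts.items()}


def inverse_rank_scores(matrix: Matrix) -> Dict[str, float]:
    """Sum the inverse-rank weights of the theses each evaluator marks A."""
    weights = inverse_rank_weights(matrix)
    scores: Dict[str, float] = {}
    for thesis, row in matrix.items():
        for evaluator, stance in row.items():
            scores.setdefault(evaluator, 0.0)
            if stance == "A":
                scores[evaluator] += weights[thesis]
    return scores
```

Compared with raw agreement counts, this weighting depends only on the ordering of theses, not on how many evaluators separate one thesis from the next.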
