Twilight of the Gods. Comparing how 11 LLMs approach a code-reorganization task.¶

Other languages

This article is also available in Russian: Гибель богов.

This is a detailed write-up of one experiment. I took a god node from a real LangGraph agent and asked 5 American and 6 Chinese models first to propose how to untangle it, then to evaluate each other's proposals. After that, I tried three different ways to figure out which of them to trust on the matter.

Contents

The original problem: What a god node is and why it's dangerous
What the plan node actually did
Lemonade from lemons: Why all this, and how the experiment is set up
Stage one. The models generate proposals
The proposals table
A bit more on each proposal
Stage two. The models evaluate the proposals
The reviews table
A bit more on each review
Stage three. The wet swimsuit contest: Deciding who's good at what
Approach one. Do the scores agree? Picking the best proposal
Approach two. Comparing reviews by theses: Picking the best analyst
Approach three. Center of opinion and medoid: Picking the best analyst again
Deus ex machina. Picking the best analyst one more time
Takeaways: Which model to use as a generator, which as an evaluator, and where your heart will find peace.

The original problem¶

You know how it goes: you're building a practice AI agent with the fellas on a course by Data Sanity, and amid the colorful whirl of rapidly accreting features you suddenly notice that one of the project's internal agents has a state graph (LangGraph) that looks like this:

flowchart TD
    planner_start([START]) --> plan[plan]

    plan -->|search| search[search]
    plan -->|ask_user| ask_user[ask_user / interrupt]
    plan -->|reflect| reflect[reflect]
    plan -->|calculate| calculate[calculate]
    plan -->|finish| finish[finish]

    search -->|last_observation| observe[observe]
    search -->|no hits / backend failure| plan
    observe --> plan
    calculate --> plan
    ask_user --> observe_user[observe_user]
    observe_user --> plan
    reflect --> plan
    finish --> planner_end([END])

At first glance this is just a cute little octopus — nothing to worry about. But once you know how much logic this octopus has to hold in its modest eight-legged head, it becomes clear right away that we're looking at an anti-pattern. In this case, let's call it a god node.

The plan node hides about 350 lines of logic, including iteration checks, bootstrap questions about region and currency, schema preparation, acquisition-task routing, the LLM call, the subsequent correction of the decision, and so on.

The problem isn't just the size of the function. When important orchestration is hidden inside a single node, the graph stops being a representation of the system. It's harder to explain, harder to debug, harder to test, and more dangerous to change. So the obvious task isn't merely to "chop a big function into pieces" but to lift the hidden control logic up to the graph level, so that the resulting architecture becomes clearer and more amenable to further development.

What the `plan` node actually did¶

The agent this graph was meant to describe was, broadly, in the business of collecting various parameters for downstream calculations. Some of these parameters it cleverly searched for on the internet; some it asked the user about. And it did all this by a not-fully-deterministic algorithm, because depending on the context of a particular conversation, the right way to obtain the same parameter could vary considerably. Here is the set of responsibilities that had actually been packed into the plan node:

Responsibility	What logic was hidden inside `plan`
Iteration loop	Incrementing `iterations`, entering a new planning step, checking `status == "aborted"` and `max_iters`
Region bootstrap question	The `_needs_region_question()` check and a forced transition to `ask_user` for `core.region`
Currency bootstrap question	The `_needs_currency_question()` check and a forced transition to `ask_user` for `core.currency`
Proactive decomposition	Generating `dynamic_decompositions` for fields that need to be broken into components
Assembling acquisition recipes	Calling `build_dynamic_recipes()` and preparing the task structure for the subsequent field collection
Schema preparation	Calling `compose_ready_fields()`, merging ready component fields into aggregates, and updating `schema`
Calculator limits	Checking the calculator attempt limit, a successful or already-current calculation, and other stopping conditions
Recovery after a blocked calculation	Handling the blocked-calculator scenario, finding the next data-collection task, and, if needed, a fallback decomposition for the problem field
General data-collection routing	Choosing the next data-collection task without the LLM, including a fast pass over already-opened and component tasks
Auto-finish logic	Checking whether all source data has been collected, whether any fields remain to act on, and whether the loop can finish without extra steps
LLM planning	Gathering prompt context, calling `_llm().structured(...)`, and obtaining a `PlannerDecision`
Post-LLM decomposition	Generating an additional decomposition for the model-chosen field if it needs to be broken into components
Redirecting and normalizing decisions	Redirecting decisions for derived fields, forcing `ask_user -> search` for fields better found via search, and other deterministic rewrites after the LLM
Per-field retries and limits	Detecting repeated searches, limits on `search` and `ask_user`, and switching to `ask_user`, `reflect`, or `finish` once the limits are exhausted
Calculation-decision correction	Fixing a premature `finish`: if the calculation isn't complete yet, the decision is rewritten to `calculate` or routed into an extra re-evaluation step
Bookkeeping state management	Resetting or updating `decision`, `decision_origin`, `llm_failed`, `status`, and other transient flags that accompanied the branching
Logging and event dispatch	Logging the final decision, emitting progress events, and assembling the final state update before returning from the node

Lemonade from lemons¶

So, we find ourselves in a situation many people will recognize all too well these days. Monsieur Claude has sculpted us a little masterpiece out of spaghetti. Combing out that spaghetti is nowhere near as much fun as bolting on feature after feature with no review whatsoever. But that's only true until you realize you can reduce the entropy with the same tool that raised it. And that's a far more pleasant prospect.

But can you trust the untangling of code to the very model that, given half a chance, so gleefully tangles it up?

To answer that, I decided to collect several independent architectural proposals from different models, and compare what they'd advise.

Eleven models were invited onto the runway for judging:

GPT-5.4
GPT-5.5
DeepSeek-4-pro
Gemini-3.1-pro
GLM-5.1
Kimi-2.6
MiMo-2.5-pro
Opus-4.7
Qwen-3.6-plus
Qwen-3.7-max
Fable-5

First, each of them made its own proposal for splitting plan. Then the models switched into evaluator mode, read the entire set of finished proposals, and ranked them.

To make sure I was collecting independent opinions rather than a relay of retellings of one lucky text, the following conditions were enforced:

While the proposals were being generated, the models couldn't see one another's work.
While the analyses were being generated, they saw every proposal but none of the other analyses.
Each run took place in a fresh session.

All of the work was done in OpenCode with the Oh My Openagent plugin, at maximum reasoning effort for every model.

Stage one. The models generate proposals¶

In the first stage, each model proposed its own way to lift the plan node's logic up to the graph level. The prompt used to generate each proposal (only the output filename changed):

look at docs/planner-graph-ref/current-graph.md. Looks like "plan" node contains too many logic in it. give a proposal of how to move this logic to graph level in <model>-proposal.md

The proposals table¶

Model	Graph style	Core idea
Fable-5	Balanced, 5-stage	Split `plan` into `tick -> prepare -> select -> decide -> guard`: move the bootstrap questions and limits into `tick`, state preparation into `prepare`, the deterministic branches into `select`, the model call into `decide`, and the post-model decision fixes into `guard`
GPT-5.4	Moderate, phase-based	Almost the same layout as Fable-5, but the initial region and currency questions are pulled out of the entry node into a separate `bootstrap_gate`
GPT-5.5	More detailed, disciplined	Formalize the loop's bookkeeping logic as much as possible: separately extract schema preparation, decision normalization, retry, and calculator policy
DeepSeek-4-pro	Compact, 4-phase	Collapse the whole loop into four large phases and keep post-LLM adjustments in a single `adjust` node
Gemini-3.1-pro	Coarse-grained, minimalist	Heavily coarsen the graph: fold almost all deterministic checks into `evaluate_rules` and leave the LLM as a separate final phase
GLM-5.1	Conservative, two-step	Make the minimal safe split: one pre-check before the loop and one shared node that picks the next action
Kimi-2.6	Detailed pipeline	Explicitly separate bootstrap, a calculator gate, task acquisition, and a dedicated forced-policy layer after the decision
MiMo-2.5-pro	Moderately coarse	Split the graph into large blocks `guards -> acquire -> decide`, without surfacing every individual policy check
Opus-4.7	Maximally decomposed	Turn nearly every hidden policy into a separate gate or corrector, then funnel everything into `dispatch`
Qwen-3.6-plus	Medium detail	Extract preflight and decomposition as separate phases, and run completion through an extra finish-check
Qwen-3.7-max	Very detailed pipeline	Unfold almost the entire hidden state machine: separate checks, post-processing, cap enforcement, and final routing

A bit more on each proposal¶

Fable-5¶

flowchart TD
    planner_start([START]) --> tick[tick]

    tick -->|aborted / max_iters| finish[finish]
    tick -->|region or currency missing| ask_user[ask_user / interrupt]
    tick -->|otherwise| prepare[prepare]

    prepare --> select[select]
    select -->|deterministic decision found| guard[guard]
    select -->|no decision| decide[decide / LLM]
    decide --> guard

    guard -->|search| search[search]
    guard -->|ask_user| ask_user
    guard -->|reflect| reflect[reflect]
    guard -->|calculate| calculate[calculate]
    guard -->|finish| finish

    search -->|last_observation present| observe[observe]
    search -->|no hits or backend failure| tick
    observe --> tick
    calculate --> tick
    ask_user --> observe_user[observe_user]
    observe_user --> tick
    reflect --> tick
    finish --> planner_end([END])

Fable-5 proposed spreading the hidden logic across five stages. tick takes on the start of the iteration: the step counter, the status == "aborted" stop, the max_iters limit, and the initial questions about region and currency.

The preparatory layer goes into prepare: proactive decomposition, gathering recipes for later data collection, and the compose_ready_fields call.

Next, select runs the chain of deterministic choices without the main model call. This is where the calculator checks land, along with the fast pass over already-collected data, auto-finish, and the data-collection branches that follow a blocked calculation.

If select finds no deterministic answer, control passes to decide, which only calls the model and builds a structured decision.

Then guard collects in one place the redirection of decisions for derived fields, the forced switch to search for fields better looked up online, the limits on repeated actions, and the adjustment before the final routing to search, ask_user, reflect, calculate, or finish. Previously all these rewrites and limits were smeared across the tail of plan. The action nodes don't change in the process; they simply return not to the old plan node but back to tick.

This split makes the architecture more observable. From the graph you can already tell where a deterministic decision is made, where the LLM is needed, and where a decision passes through the layer of limits and fixes. The most important find is decision_origin. It lets the shared guard tell which decision came from the LLM and which was found deterministically, so it doesn't apply the same policy to every branch indiscriminately. The design has weak spots too: not all LLM calls are moved into decide, and guard remains a fairly dense node.
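To make the decision_origin idea concrete, here is a minimal sketch of a shared `guard`, written as plain functions in the shape of a graph node (state in, partial update out) and a router (state in, node name out). The origin labels `"llm"` and `"deterministic"`, the `search_counts` key, and the cap value are my assumptions for illustration, not something taken from the Fable-5 proposal.

```python
from typing import Any

MAX_SEARCHES_PER_FIELD = 3  # hypothetical cap, not from the proposal


def guard(state: dict[str, Any]) -> dict[str, Any]:
    """Apply post-decision policy and return a partial state update."""
    decision = dict(state["decision"])
    origin = state.get("decision_origin", "llm")

    # Assumption: decisions found deterministically in `select` were already
    # validated there, so only model decisions are rewritten here.
    if origin != "llm":
        return {"decision": decision}

    field = decision.get("field")
    searches = state.get("search_counts", {}).get(field, 0)
    if decision["action"] == "search" and searches >= MAX_SEARCHES_PER_FIELD:
        decision["action"] = "ask_user"
        decision["reason"] = "search cap reached"
    return {"decision": decision}


def route_after_guard(state: dict[str, Any]) -> str:
    """Router: only reads state and names the next node."""
    return state["decision"]["action"]
```

The point of the sketch is the early return: without `decision_origin`, the same cap logic would fire on every branch that reaches `guard`, whether or not it makes sense there.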

GPT-5.4¶

flowchart TD
    START --> tick
    tick -->|continue| bootstrap_gate
    tick -->|terminal| finish

    bootstrap_gate -->|needs region/currency| ask_user
    bootstrap_gate -->|ready| prepare_context

    prepare_context --> acquisition_gate
    acquisition_gate -->|deterministic decision| decision_policy
    acquisition_gate -->|needs LLM| llm_plan

    llm_plan --> decision_policy

    decision_policy -->|search| search
    decision_policy -->|ask_user| ask_user
    decision_policy -->|reflect| reflect
    decision_policy -->|calculate| calculate
    decision_policy -->|finish| finish

    search -->|last_observation| observe
    search -->|no observation| tick
    observe --> tick
    calculate --> tick
    reflect --> tick
    ask_user --> observe_user
    observe_user --> tick
    finish --> END

GPT-5.4 is arranged almost exactly like Fable-5. It also has a loop-entry node, context preparation, a deterministic fork before the model, a separate model call, and a single decision-fixing layer after it.

The key difference is that the initial region and currency questions don't stay inside tick but are pulled into a separate bootstrap_gate. So tick is responsible only for the start of the step, the iteration counter, and the stop checks, while bootstrap_gate decides whether you can proceed or must first send the user to ask_user.
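A rough sketch of that division of labor, again as plain functions in node/router shape. The route labels are the ones on the diagram; the state keys (`iterations`, `max_iters`, `core`, `decision`) and the default limit are assumptions of mine.

```python
from typing import Any


def tick(state: dict[str, Any]) -> dict[str, Any]:
    """Start of an iteration: bump the counter and nothing else."""
    return {"iterations": state.get("iterations", 0) + 1}


def route_after_tick(state: dict[str, Any]) -> str:
    """Stop checks only; no bootstrap questions here."""
    if state.get("status") == "aborted":
        return "terminal"
    if state["iterations"] > state.get("max_iters", 20):  # hypothetical default
        return "terminal"
    return "continue"


def bootstrap_gate(state: dict[str, Any]) -> dict[str, Any]:
    """Write an ask_user decision if region or currency is still missing."""
    core = state.get("core", {})
    for name in ("region", "currency"):
        if not core.get(name):
            return {"decision": {"action": "ask_user", "field": f"core.{name}"}}
    return {"decision": None}  # clear any stale decision


def route_after_bootstrap_gate(state: dict[str, Any]) -> str:
    decision = state.get("decision")
    if decision and decision["action"] == "ask_user":
        return "needs region/currency"
    return "ready"
```

Note that the gate is a node, not just a router: it has to write the question into state so that ask_user knows what to ask.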

The remaining stages mostly match Fable-5 in substance, just named more explicitly.

GPT-5.4 gives clear working guidance. Its section on anti-patterns makes it possible to carry out the refactor well, with a real grasp of the goals.

GPT-5.5¶

flowchart TD
    planner_start([START]) --> enter_iteration

    enter_iteration -->|aborted or max_iters| finish
    enter_iteration --> prepare_schema

    prepare_schema -->|region missing| require_region
    prepare_schema -->|currency missing| require_currency
    prepare_schema -->|calculator cap hit| finish
    prepare_schema -->|calculator current| finish
    prepare_schema --> acquisition_gate

    require_region --> ask_user
    require_currency --> ask_user

    acquisition_gate -->|component task| normalize_decision
    acquisition_gate -->|all raw inputs ready| maybe_calculate
    acquisition_gate -->|LLM needed| plan

    plan --> normalize_decision
    normalize_decision --> retry_gate
    retry_gate --> maybe_calculate

    maybe_calculate -->|calculator required| calculate
    maybe_calculate -->|search| search
    maybe_calculate -->|ask_user| ask_user
    maybe_calculate -->|reflect| reflect
    maybe_calculate -->|finish| finish

    search -->|last_observation| observe
    search -->|no hits| enter_iteration
    observe --> enter_iteration
    calculate --> enter_iteration
    ask_user --> observe_user
    observe_user --> enter_iteration
    reflect --> enter_iteration
    finish --> planner_end([END])

GPT-5.5 goes further than the first two and splits the loop's bookkeeping part more aggressively. enter_iteration remains the entry node: it increments the counter and checks for a stop. Then prepare_schema gathers the state preparation, separately checks for region and currency, and also holds the early exits for the calculator limit and an already-finished calculation. If region or currency is needed, control goes to require_region or require_currency, and then to the ordinary ask_user.

Next, acquisition_gate tries to find a deterministic answer before calling the model: to determine which missing component of a composite field to collect next, to recognize that all the source data is already gathered, or to admit that an LLM call is needed (plan).

After the model, the decision passes through normalize_decision, which gathers fixes such as redirecting derived fields and generating additional decomposition.

Then retry_gate separately handles the limits on repeated searches and questions, and maybe_calculate decides whether control should be sent to calculate before the final transition.

So it's still the same general approach, but GPT-5.5 surfaces not only the large stages but also the rules for handling a PlannerDecision: how to fix the chosen action, when to stop repeated searches and questions, and when to send control to calculate instead of finishing.

DeepSeek-4-pro¶

flowchart TD
    planner_start([START]) --> guard[guard]

    guard -->|ask region/currency| ask_user
    guard -->|aborted or max_iters| finish
    guard -->|continue| prepare

    prepare -->|calculator cap or success| finish
    prepare -->|blocked calculator / auto-complete| adjust
    prepare -->|continue| plan

    plan -->|decomposition needed| plan
    plan -->|continue| adjust

    adjust -->|search| search
    adjust -->|ask_user| ask_user
    adjust -->|calculate| calculate
    adjust -->|reflect| reflect
    adjust -->|finish| finish

    search -->|last_observation| observe
    search -->|no hits| guard
    observe --> guard
    calculate --> guard
    ask_user --> observe_user
    observe_user --> guard
    reflect --> guard
    finish --> planner_end([END])

DeepSeek-4-pro opts for a more compact split. Here guard combines the loop entry, the stop checks, and the initial region and currency questions. If everything checks out, control passes to prepare, which assembles the pre-decision state, checks the calculator limits and success, handles a blocked calculation, and performs auto-finish if all the data is already gathered.

If prepare finds no deterministic answer, the graph goes to plan, where the model is called to find the next action.

That said, the repeat call with a new decomposition isn't pulled out of plan. If the model chose a composite field with no ready recipe, the same node adds the missing decomposition and calls the model again with the updated context. In the Mermaid diagram this is, for some reason, drawn as a plan -> plan loop, though it would make more sense not to depict it at all.

After that, everything lands in adjust. This node gathers the decision rewrites, the retry limits, the calculation fix, and the final routing to search, ask_user, calculate, reflect, or finish. The result is a shorter graph, but more logic stays inside prepare, plan, and adjust.

The graph came out compact yet still coherent, though prepare and adjust stay fairly chunky. There are a few unpleasant mistakes. There's no reason at all to keep recipes in the checkpoints, and the plan -> plan loop isn't needed.

Gemini-3.1-pro¶

flowchart TD
    planner_start([START]) --> prepare_state
    prepare_state --> evaluate_rules

    evaluate_rules --> route_after_rules{route_after_rules}
    route_after_rules -->|ask_user| ask_user
    route_after_rules -->|calculate| calculate
    route_after_rules -->|search| search
    route_after_rules -->|finish| finish
    route_after_rules -->|needs_llm| llm_plan

    llm_plan --> route_after_llm{route_after_llm}
    route_after_llm -->|search| search
    route_after_llm -->|ask_user| ask_user
    route_after_llm -->|reflect| reflect
    route_after_llm -->|finish| finish

    search --> route_after_search{route_after_search}
    route_after_search -->|found| observe
    route_after_search -->|failed| prepare_state
    observe --> prepare_state
    calculate --> prepare_state
    ask_user --> observe_user
    observe_user --> prepare_state
    reflect --> prepare_state
    finish --> planner_end([END])

Gemini-3.1-pro has a certain style. It seems to care more about its carbon footprint than about software architecture: it puts out a very rough solution, moderately thought through, but it spends a minimal number of tokens.

prepare_state takes on state preparation. The node increments the iteration counter, kicks off proactive decomposition, gathers the ready component fields, and updates the schema.

After that, evaluate_rules checks all the deterministic conditions: stopping by limits, region and currency, the calculator state, a blocked calculation, the next data-collection task, and auto-finish.

If a rule fires, the graph goes straight to the appropriate action: ask_user, calculate, search, or finish. If not, control reaches llm_plan, where the model is finally called and a PlannerDecision is built.

Gemini-3.1-pro gives a crude first sketch. State preparation, deterministic rules, and the LLM call are separated out, but evaluate_rules still bundles too much heterogeneous logic. The post-LLM policy is barely described. It's hard to tell whether the redirects, retry limits, and calculation fix are tucked away inside the nodes or simply missing.

GLM-5.1¶

flowchart TD
    START --> pre_check

    pre_check -->|abort or finish| finish
    pre_check -->|ask region| ask_region
    pre_check -->|ask currency| ask_currency
    pre_check -->|continue| acquire_or_plan

    ask_region --> observe_user_region[observe_user]
    ask_currency --> observe_user_currency[observe_user]
    observe_user_region --> pre_check
    observe_user_currency --> pre_check

    acquire_or_plan -->|search| search
    acquire_or_plan -->|ask_user| ask_user
    acquire_or_plan -->|calculate| calculate
    acquire_or_plan -->|reflect| reflect
    acquire_or_plan -->|finish| finish

    search -->|observe| observe
    search -->|retry| pre_check
    observe --> pre_check
    calculate --> pre_check
    ask_user --> observe_user
    observe_user --> pre_check
    reflect --> pre_check
    finish --> END

GLM-5.1 seems to take only a first step toward refactoring the node rather than actually doing it. All the early logic is folded into pre_check: the loop entry, stopping by status and limits, the region and currency questions, proactive decomposition, schema assembly, and the early calculator exits.

After pre_check, one large working node remains, acquire_or_plan. It either finds the next data-collection task without calling the model, or checks for auto-finish, or calls the model and immediately applies the post-decision fixes.

In essence, almost no real splitting happens — most of the logic still lives inside acquire_or_plan, just under a new name.

On top of that, GLM-5.1 didn't even produce a Mermaid diagram, describing it in text instead. Here it's redrawn for the article's consistency, without changing the proposal's structure itself.

Kimi-2.6¶

flowchart TD
    START([START]) --> tick

    tick -->|aborted or max_iters| finish
    tick -->|needs region| ask_region
    tick -->|needs currency| ask_currency
    tick -->|ready| prepare

    prepare --> calc_gate
    calc_gate -->|calc success| finish
    calc_gate -->|calc cap reached| finish
    calc_gate -->|continue| acquire

    acquire -->|auto_finish| calc_adjust
    acquire -->|blocked_task| route_direct
    acquire -->|needs LLM| decide

    decide --> enforce
    route_direct --> enforce

    enforce -->|search| search
    enforce -->|ask_user| ask_user
    enforce -->|reflect| reflect
    enforce -->|calculate| calculate
    enforce -->|finish| finish

    search -->|has observation| observe
    search -->|no hits| tick
    observe --> tick
    calculate --> tick
    ask_user --> observe_user
    observe_user --> tick
    reflect --> tick
    ask_region --> observe_user
    ask_currency --> observe_user
    finish --> END([END])

Kimi-2.6 almost made it. tick handles the loop entry and the stop checks, and the initial questions are pulled into separate ask_region and ask_currency branches. Then prepare handles state preparation: proactive decomposition, recipes, and assembling the ready fields. After it, calc_gate separately checks the calculator's lifecycle, choosing between a successful calculation, exhausted attempts, and continuing.

The next node, acquire, looks for a deterministic path: auto-finish, a blocked calculation task, or the need to call the model. If there's already a data-collection task, it goes through route_direct; if not, control passes to decide, where the model is called.

The route_direct and decide branches then converge in enforce, where the constraints, redirects, limits, and calculation fixes are applied before the final transition to an action.

One thing worth noting: Kimi-2.6's proposal has an auto_finish -> calc_adjust branch, but calc_adjust is never described and leads nowhere. Otherwise the graph looks quite reasonable. A pity.

Kimi-2.6 produced a fairly balanced solution, close in form to GPT-5.4, say. But the picture is spoiled by the hallucinated calc_adjust branch, a somewhat bloated enforce, and a route_direct that barely does anything.
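A dangling branch like calc_adjust is easy to catch mechanically. Below is a small sketch that takes the edge list transcribed from the diagram above (labels dropped, duplicate edges merged) and reports nodes that are entered but never left. The helper name `dead_ends` is mine.

```python
# Edges transcribed from the Kimi-2.6 diagram above.
KIMI_EDGES = [
    ("START", "tick"),
    ("tick", "finish"), ("tick", "ask_region"),
    ("tick", "ask_currency"), ("tick", "prepare"),
    ("prepare", "calc_gate"),
    ("calc_gate", "finish"), ("calc_gate", "acquire"),
    ("acquire", "calc_adjust"), ("acquire", "route_direct"),
    ("acquire", "decide"),
    ("decide", "enforce"), ("route_direct", "enforce"),
    ("enforce", "search"), ("enforce", "ask_user"), ("enforce", "reflect"),
    ("enforce", "calculate"), ("enforce", "finish"),
    ("search", "observe"), ("search", "tick"),
    ("observe", "tick"), ("calculate", "tick"), ("reflect", "tick"),
    ("ask_user", "observe_user"), ("observe_user", "tick"),
    ("ask_region", "observe_user"), ("ask_currency", "observe_user"),
    ("finish", "END"),
]


def dead_ends(edges, terminals=("END",)):
    """Nodes that are entered but never left, excluding real terminals."""
    sources = {src for src, _ in edges}
    targets = {dst for _, dst in edges}
    return sorted(targets - sources - set(terminals))


print(dead_ends(KIMI_EDGES))  # ['calc_adjust']
```

A check like this is cheap enough to run on every generated proposal before a human or another model reads it.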

MiMo-2.5-pro¶

flowchart TD
    START --> guards

    guards -->|bootstrap gate| ask_user
    guards -->|early_finish| finish
    guards -->|proceed| acquire

    acquire -->|has_task| decide
    acquire -->|no_task| finish

    decide --> plan_router[post_process / route_after_plan]
    plan_router -->|search| search
    plan_router -->|ask_user| ask_user
    plan_router -->|reflect| reflect
    plan_router -->|calculate| calculate
    plan_router -->|finish| finish

    search -->|observe| observe
    search -->|retry| guards
    observe --> guards
    calculate --> guards
    ask_user --> observe_user
    observe_user --> guards
    reflect --> guards
    finish --> END

The first node, guards, takes on the loop entry, the stop checks, the region and currency questions, and the early calculator exits. If nothing stopped the loop, control goes to acquire.

acquire handles state preparation: proactive decomposition, schema assembly, and finding the next data-collection task. If there's a task, the graph goes to decide; if no tasks remain, it can finish.

decide not only calls the model when needed but also turns the found data-collection task into a decision, and applies the redirects, limits, and calculation fixes.

All in all, the split is a bit more substantial than GLM-5.1's, but still too cautious.

Here as well, the original was plain text, not Mermaid. I redrew it.

Opus-4.7¶

flowchart TD
    START([START]) --> g_iter
    g_iter -->|aborted/cap| dispatch
    g_iter -->|ok| g_region

    g_region -->|missing| emit_region --> dispatch
    g_region -->|present| g_currency
    g_currency -->|missing| emit_currency --> dispatch
    g_currency -->|present| enrich

    enrich --> g_calc_caps
    g_calc_caps -->|cap reached| emit_calc_abort --> dispatch
    g_calc_caps -->|success| emit_calc_done --> dispatch
    g_calc_caps -->|continue| acquire

    acquire -->|task found| emit_acq --> c_calc_adjust
    acquire -->|auto-finish| emit_auto_finish --> c_calc_adjust
    acquire -->|work remains| plan_llm

    plan_llm --> c_target_decompose
    c_target_decompose --> c_redirect_derived
    c_redirect_derived --> c_redirect_web
    c_redirect_web --> c_search_cap
    c_search_cap --> c_ask_cap
    c_ask_cap --> c_calc_adjust
    c_calc_adjust --> dispatch

    dispatch -->|search| search
    dispatch -->|ask_user| ask_user
    dispatch -->|calculate| calculate
    dispatch -->|reflect| reflect
    dispatch -->|finish| finish

    search -->|hits| observe --> START
    search -->|no hits| START
    ask_user --> observe_user --> START
    calculate --> START
    reflect --> START
    finish --> END([END])

Opus-4.7 unfolds almost all the hidden logic into the open. The entry section is a chain of checks. g_iter handles iteration and stopping, g_region and g_currency separately check the initial fields, enrich prepares the schema and decompositions, and g_calc_caps handles the early calculator exits.

If there's a deterministic data-collection task or an auto-finish ahead, acquire turns it into a decision via emit_acq or emit_auto_finish; if not, control goes to plan_llm.

After the model call, Opus-4.7 doesn't keep a single guard but spreads the decision-fixing across a chain of correctors. c_target_decompose generates additional decomposition for a composite field, c_redirect_derived redirects derived fields, c_redirect_web turns a too-early user question into a search, c_search_cap and c_ask_cap watch the limits, and c_calc_adjust fixes a premature finish before a calculation.

All the branches then converge in dispatch, which sends the graph on to a specific action.
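The corrector chain can be pictured as a list of small functions, each returning a partial update that is merged into state before the next one runs. This is only a sketch: in the Opus-4.7 proposal each corrector is a graph node, and the state keys used here (`web_fields`, `search_counts`, `search_cap`, `calculation_done`) are hypothetical.

```python
from typing import Any, Callable

State = dict[str, Any]
Corrector = Callable[[State], State]


def c_redirect_web(state: State) -> State:
    """Turn a too-early user question into a search."""
    d = state["decision"]
    if d["action"] == "ask_user" and d.get("field") in state.get("web_fields", ()):
        return {"decision": {**d, "action": "search"}}
    return {}


def c_search_cap(state: State) -> State:
    """Fall back to asking the user once the search cap is hit."""
    d = state["decision"]
    used = state.get("search_counts", {}).get(d.get("field"), 0)
    if d["action"] == "search" and used >= state.get("search_cap", 3):
        return {"decision": {**d, "action": "ask_user"}}
    return {}


def c_calc_adjust(state: State) -> State:
    """Fix a premature finish: calculate first."""
    d = state["decision"]
    if d["action"] == "finish" and not state.get("calculation_done", False):
        return {"decision": {**d, "action": "calculate"}}
    return {}


def run_chain(state: State, chain: list[Corrector]) -> State:
    """Apply correctors in order, merging each partial update into state."""
    for corrector in chain:
        state = {**state, **corrector(state)}
    return state
```

The order matters: with `c_redirect_web` before `c_search_cap`, a question about a web-searchable field becomes a search and then, if the cap is already exhausted, goes back to being a question. With one node per corrector, that ordering is visible in the graph itself.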

As a map of the hidden policies it's very clear, but as a working design it looks heavy. Almost every rule becomes its own node.

The way I see it, this is the best solution to settle on while you're still in the exploratory phase of building the agent and it isn't yet entirely clear how it will be structured. This level of detail helps you trace the agent's behavior through the graph's checkpoints as it talks with testers or users. And once it's clear the design has stabilized and the graph rarely changes, you can coarsen it. With LLM-assisted development, that's cheap.

Qwen-3.6-plus¶

flowchart TD
    planner_start([START]) --> preflight

    preflight -->|needs region| ask_region
    preflight -->|needs currency| ask_currency
    preflight -->|calc cap exceeded| finish
    preflight -->|calc success| finish
    preflight -->|ok| decompose

    ask_region --> observe_region
    ask_currency --> observe_currency
    observe_region --> preflight
    observe_currency --> preflight

    decompose --> plan
    plan -->|search| search
    plan -->|ask_user| ask_user
    plan -->|reflect| reflect
    plan -->|calculate| calculate
    plan -->|finish| route_finish_check

    search -->|last_observation| observe
    search -->|no hits| plan
    observe --> plan
    calculate --> plan
    ask_user --> observe_user
    observe_user --> plan
    reflect --> plan

    route_finish_check -->|caps ok, all done| planner_end([END])
    route_finish_check -->|cap exceeded| finish
    finish --> planner_end

Qwen-3.6-plus starts with a large preflight. It absorbs the stop checks, the region and currency questions, the calculator limit, and a successful calculation.

Region and currency are pulled into separate ask_region and ask_currency branches, after which the answers return control to preflight. If all the initial checks pass, the graph goes to decompose, where the dynamic decompositions and the schema are prepared.

Then control reaches plan. This node stays fairly fat; it builds the model's decision. And where did our little rascal Qwen stash the logic for redirects, limits, and calculation fixes? In a conditional edge function, route_after_plan!

This is the design's main flaw. An edge in LangGraph should only read state and return the name of the next node, not rewrite decision/status. This isn't just some anti-pattern. It simply won't work. The edge returns a route string and can't persist the rewritten decision, so when a limit fires, ask_user will interrupt with the old search decision and no question text, and the aborted status will be lost.
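Here is the failure mode in miniature. The `step` function is a toy stand-in for one graph step, not LangGraph itself: it persists what the node returns and uses the router's return value only to pick the next node. Handing the router a copy is my way of modeling the behavior described above, where nothing the edge function changes is saved. The cap value and question text are made up.

```python
import copy


def step(state, node, router):
    """Toy graph step: persist the node's update, then ask the router."""
    state = {**state, **node(state)}
    # The router gets a copy: whatever it mutates is thrown away.
    next_node = router(copy.deepcopy(state))
    return state, next_node


def plan(state):
    return {"decision": {"action": "search", "field": "core.rent"}}


def bad_route_after_plan(state):
    # Anti-pattern: a policy rewrite inside the edge function.
    if state["search_count"] >= 3:
        state["decision"] = {"action": "ask_user", "question": "What is the rent?"}
    return state["decision"]["action"]


def enforce_caps(state):
    # Fix: the rewrite lives in a node, so the update is persisted.
    if state["decision"]["action"] == "search" and state["search_count"] >= 3:
        return {"decision": {"action": "ask_user", "question": "What is the rent?"}}
    return {}


def route_by_action(state):
    return state["decision"]["action"]


state, nxt = step({"search_count": 3}, plan, bad_route_after_plan)
# nxt is "ask_user", but state["decision"] is still the old search decision.
```

With `enforce_caps` as a node followed by the pure `route_by_action`, the route and the stored decision agree, and ask_user gets its question text.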

The diagram also reveals a second problem. The working loops (search, observe, calculate, observe_user, reflect) return to plan rather than preflight, so the stop checks and max_iters on the main loop are never re-checked.
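This second problem can also be checked mechanically: drop the stop-check node from the edge list and see whether any cycle survives. The sketch below does that with a plain depth-first search. The edge lists are the working loops transcribed from the Qwen-3.6-plus and Fable-5 diagrams, and the function name is mine.

```python
def has_cycle_avoiding(edges, guard_node):
    """True if some cycle in the graph never passes through guard_node."""
    graph = {}
    for src, dst in edges:
        if guard_node in (src, dst):
            continue  # drop the guard; any remaining cycle bypasses it
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:  # back edge: a cycle
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)


QWEN_LOOP = [
    ("preflight", "decompose"), ("decompose", "plan"),
    ("plan", "search"), ("plan", "calculate"),
    ("plan", "reflect"), ("plan", "ask_user"),
    ("search", "observe"), ("search", "plan"), ("observe", "plan"),
    ("calculate", "plan"), ("reflect", "plan"),
    ("ask_user", "observe_user"), ("observe_user", "plan"),
]
FABLE_LOOP = [
    ("tick", "prepare"), ("prepare", "select"),
    ("select", "guard"), ("select", "decide"), ("decide", "guard"),
    ("guard", "search"), ("guard", "calculate"),
    ("guard", "reflect"), ("guard", "ask_user"),
    ("search", "observe"), ("search", "tick"), ("observe", "tick"),
    ("calculate", "tick"), ("reflect", "tick"),
    ("ask_user", "observe_user"), ("observe_user", "tick"),
]

print(has_cycle_avoiding(QWEN_LOOP, "preflight"))  # True
print(has_cycle_avoiding(FABLE_LOOP, "tick"))      # False
```

For Qwen-3.6-plus the plan -> search -> plan loop survives without preflight, which is exactly the bypass described above. In Fable-5 every loop goes back through tick.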

Only the finish check gets its own node. If the model decided on finish, the graph first goes to route_finish_check, which verifies whether it's really safe to finish or whether it needs to return to work.

There's some ambiguity around finish in this design. According to the proposal text, it's the existing terminal node that simply ends the planner graph. But the Mermaid draws some completions straight into planner_end and others through finish, so the role of finish itself is unclear.

All in all, the split came out slapdash. Some important stages are extracted quite sensibly, but the area around plan, routing, and finish is a kind of flight-of-fancy zone. It's a thing that looks like a refactoring proposal from afar, but you'd be wise not to get any closer.

Qwen-3.7-max¶

flowchart TD
    planner_start([START]) --> check_termination

    check_termination -->|aborted or max_iters| finish
    check_termination -->|continue| check_region_currency

    check_region_currency -->|needs region| ask_region
    check_region_currency -->|needs currency| ask_currency
    check_region_currency -->|ready| compose_schema

    ask_region --> observe_user
    ask_currency --> observe_user
    observe_user --> check_region_currency

    compose_schema --> check_calculator
    check_calculator -->|calc cap reached| finish
    check_calculator -->|calc succeeded| finish
    check_calculator -->|calc blocked| acquisition_routing
    check_calculator -->|pending| acquisition_routing

    acquisition_routing -->|has task| decide_action
    acquisition_routing -->|no task| check_completion
    check_completion -->|all collected| finish
    check_completion -->|missing fields| decide_action

    decide_action --> llm_decide
    llm_decide -->|LLM failure| finish
    llm_decide -->|valid decision| post_process
    post_process --> enforce_caps

    enforce_caps -->|search cap| ask_user
    enforce_caps -->|ask cap| finish
    enforce_caps -->|ok| route_decision

    route_decision -->|search| search
    route_decision -->|ask_user| ask_user
    route_decision -->|reflect| reflect
    route_decision -->|calculate| calculate
    route_decision -->|finish| finish

    search -->|has observation| observe
    search -->|no hits| check_termination
    observe --> check_termination
    calculate --> check_termination
    ask_user --> observe_user
    reflect --> check_termination
    finish --> planner_end([END])

Qwen-3.7-max proposes even more nodes than Opus-4.7.

First, check_termination checks for a stop and the iteration limit, then check_region_currency separately handles the region and currency questions.

After that, compose_schema assembles the schema and decompositions, and check_calculator separately checks the calculator limit, a successful calculation, a blocked state, and ordinary continuation.

Then acquisition_routing handles the case where the calculation is blocked on a missing field. Using the static_field_acquisition recipes, it looks for the next acquisition_task, and if there's no ready recipe, it tries to generate dynamic_decompositions for the blocked field.

If acquisition_routing finds no acquisition_task, check_completion separately checks whether work can be finished or whether the schema still has missing fields. Only after these nodes does the graph reach llm_decide, where the model is called.

After the model, the decision still passes through post_process, where additional decompositions are generated and redirects performed, then through enforce_caps, where the search and question limits are applied, and only then through route_decision does it head to a specific action.

In its degree of splitting, the proposal resembles Opus-4.7. But there are strong differences in how the idea is implemented.

Opus-4.7 has the better explanatory layer: a 17-line map of the old plan's responsibilities, a clear g_* / c_* taxonomy, the unifying invariant that "every branch first reduces its result to a PlannerDecision, and then dispatch sends it to the right action," a table of state-field ownership, and a coherent three-step migration. Opus-4.7 gives the best map of the control logic, even though it keeps a bit of routing in state.

Qwen-3.7-max, by contrast, is far sloppier and more self-contradictory. All routing hangs on a bookkeeping field, _route. Each node writes to that field to tell the edge where to go next, and the edge obeys. It's a homemade goto layered over LangGraph's conditional edges. As a result, the routing decision is hidden inside the nodes again, and the graph stops reflecting the real control flow — which partly defeats the whole point of the refactor. On top of that, the field has to be set in every node, and the moment you forget it somewhere or leave a stale value, you get silent misrouting.
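Here is a toy model of that failure mode in plain Python. The node names are borrowed from the diagram, but the node bodies, the `calc_done` flag, and the `run_step` helper are invented for illustration and are not taken from the Qwen-3.7-max proposal.

```python
def route_by_field(state: dict) -> str:
    """Edge that blindly obeys whatever a node last wrote into _route."""
    return state["_route"]


def check_calculator(state: dict) -> dict:
    # Sets _route, as the pattern demands of every node.
    return {"_route": "finish" if state.get("calc_done") else "acquisition_routing"}


def acquisition_routing_buggy(state: dict) -> dict:
    # Hypothetical bug: the node finds a task but forgets to set _route.
    return {"acquisition_task": "fetch_price"}


def run_step(state: dict, node) -> tuple[dict, str]:
    """Merge the node's update into state, then ask the edge where to go."""
    new_state = {**state, **node(state)}
    return new_state, route_by_field(new_state)


state = {"calc_done": False}
state, after_calc = run_step(state, check_calculator)
state, after_acq = run_step(state, acquisition_routing_buggy)
# after_acq is still the stale "acquisition_routing" left by check_calculator,
# and nothing raised an error along the way.
```

The graph would send control back into `acquisition_routing` instead of moving on to `decide_action`, and no exception or log line points at the node that forgot the field.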

route_decision is called a node in the text, but it isn't registered as a node when the graph is built and is used as a routing function with a side effect. acquisition_task is looked up repeatedly in several places.

The final disappointment is the observe_user -> check_region_currency loop, which bypasses check_termination — the only node that increments iterations. So on the most heavily traveled "ask → answer" loop the counter never fires: max_iters and a number of search mechanisms stop working.
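A bypass like this can be caught mechanically before any code is written. The sketch below transcribes the edges from the Mermaid diagram above into a dictionary and lists every node that sits on a loop which never passes through `check_termination`. The function name `loops_avoiding` is mine, and the check only looks at graph shape, not at what the nodes do.

```python
# Edges transcribed from the Qwen-3.7-max diagram above.
EDGES = {
    "planner_start": ["check_termination"],
    "check_termination": ["finish", "check_region_currency"],
    "check_region_currency": ["ask_region", "ask_currency", "compose_schema"],
    "ask_region": ["observe_user"],
    "ask_currency": ["observe_user"],
    "observe_user": ["check_region_currency"],
    "compose_schema": ["check_calculator"],
    "check_calculator": ["finish", "acquisition_routing"],
    "acquisition_routing": ["decide_action", "check_completion"],
    "check_completion": ["finish", "decide_action"],
    "decide_action": ["llm_decide"],
    "llm_decide": ["finish", "post_process"],
    "post_process": ["enforce_caps"],
    "enforce_caps": ["ask_user", "finish", "route_decision"],
    "route_decision": ["search", "ask_user", "reflect", "calculate", "finish"],
    "search": ["observe", "check_termination"],
    "observe": ["check_termination"],
    "calculate": ["check_termination"],
    "ask_user": ["observe_user"],
    "reflect": ["check_termination"],
    "finish": ["planner_end"],
}


def loops_avoiding(edges: dict, guard: str) -> list:
    """Return the nodes that lie on a cycle which never visits `guard`."""

    def reachable(start: str) -> set:
        # Nodes reachable from `start` in one or more steps without entering guard.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, []):
                if nxt != guard and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    return sorted(n for n in edges if n != guard and n in reachable(n))


bypass = loops_avoiding(EDGES, "check_termination")
```

`observe_user` and `ask_user` both land in `bypass`, while `search`, `observe`, `calculate`, and `reflect` do not, because their only way back into the loop leads through the guard.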

Stage two. The models evaluate the proposals¶

In the second stage, each model read all eleven proposals and ranked them. The prompt used to generate each analysis (only the output filename changed):

You are skilled software architect specializing in llm agentic development. In docs/planner-graph-ref/proposals there are some proposals to refactor current planner component graph. Evaluate those proposals and create ranged list of them in docs/planner-graph-ref/analyse/<model>-range.md with explanation of what good and bad sides you can see in each. Make a conclusion of what proposal or combination of them you can advice as best solution.

The reviews table¶

Model	Review style	The gist
Fable-5	Meticulous, with code checks	The only report with verifiable facts; found both bugs; full coverage and a recommended composition with an execution order
GPT-5.4	Pragmatic, architectural	The only one to check the contents of `tests/planner/` (and caught Opus-4.7 out by doing so); the sober premise that "the best option depends on the goal." Ranked its own proposal first
GPT-5.5	Careful, disciplined	Not a single false fact; the best treatment of the constraints (checkpoints, edges, namespace)
DeepSeek-4-pro	Maximally dense	The most information-dense artifact and clearly worth reading, but crowns a bugged favorite
Gemini-3.1-pro	Minimalist, broad strokes	No per-proposal evaluation; even the explicit instructions from the prompt went unmet. Last place on every metric
GLM-5.1	Systematic, but scored on a homemade scale	Caught Kimi-2.6's double calculation and gave a node-name map, but mixed up the GPT-5.5 and Qwen-3.6-plus proposals
Kimi-2.6	Broad, self-critical	Found the iteration bug in Qwen-3.7-max on its own, but drags Gemini-3.1-pro up to 4th place and recommends the questionable `recipes`-in-`state` design
MiMo-2.5-pro	Systematic, with scales	Came up with an interesting proposal-convergence idea and tried to derive the best proposal from it, but its scores are falsely precise and it's too easygoing about risks
Opus-4.7	Deep, taxonomic	The richest "decision package," and the only one to propose rules that keep the node from regrowing. Spoiled by two factual errors — a nonexistent test file, and calling Kimi-2.6's bug a "nice touch"
Qwen-3.6-plus	Thoughtful about the problem itself	The best line-by-line map of `plan_node` in the whole set and a merciless self-assessment, but the middle of its ranking can't be trusted
Qwen-3.7-max	Didactic, explanatory	A systematic take on the question of how many nodes the graph should keep. Praises DeepSeek-4-pro's dubious self-loop

A bit more on each review¶

Fable-5¶

Fable-5's review is the most thorough in the set. It's the only report with a verifiable set of facts: six claims about the current code with precise file:line citations, and all six survive independent re-verification.

It's the only one to find both bugs that actually affect the choice: Kimi-2.6's double _adjust_calculation_decision and Qwen-3.7-max's skipped iteration increment. Most reviews found neither. Each bug was independently confirmed by exactly one other review: GLM-5.1 caught the first and Kimi-2.6 the second.

Full coverage of all 11 proposals under unified criteria, a comparison matrix, an open list of rejected ideas with their authors named. It closes with a concrete recommendation and a three-step plan for assembling the final design from several proposals.

It does have weak spots, though. The text is dense, it leans on line numbers (which will go stale with the first change to agent.py), and there are no numeric scores. It ranked its own model family's proposal first. To its credit, it flagged that conflict of interest right in the report.

GPT-5.4¶

GPT-5.4 is a solid practitioner. It's the only one to check what's actually in tests/planner/. That lets it ground the test-migration plan in reality and, along the way, catch Opus-4.7 citing a nonexistent test file. It also makes sharp observations the others lack: for example, which designs let observe_user parse the bootstrap answers itself, and which keep state safe to serialize.

It rests on the reasonable premise that the best option depends on what you need: minimal changes or maximal structure.

Its downsides are the mirror image of Fable-5's strengths. Each proposal gets a fairly shallow review, without line-level evidence, and it missed both bugs. It ranks its own proposal first but, unlike Fable-5, never flags the fact.

GPT-5.5¶

GPT-5.5's review is the cleanest text in the whole set. Re-verification found not a single false claim in it. And it explains, better than anyone, the constraints the entire task actually revolves around. A node boundary is a checkpoint boundary (the planner sits on an external Postgres saver, so every extra node is a write on every turn); edge functions are not "free" transit code; and a checkpoint-namespace change is a decision that touches the user experience, since a production update will drop some already-started conversations.

It's the only review that, once read, lets you make your own balanced call on how many nodes you really need. The model also criticizes its own proposal for a poorly placed node. The ending is crisp and to the point: a hybrid recommendation, a ready Mermaid diagram, and a seven-step migration.

On the downside, it's a bit short on the per-proposal coverage, has no comparison matrix, and caught neither of the two bugs.

DeepSeek-4-pro¶

DeepSeek-4-pro is a less clear-cut case. There's plenty to praise in the review, but you can't trust its conclusions, and the latter outweighs the former.

What earns praise. It's the most information-dense artifact of all eleven. Agree/disagree tables for each proposal, a unique table for resolving the contradictions between reports, a full synthesized design with Mermaid, a line count, and a migration plan.

What lets it down is precisely the conclusion. It awards Kimi-2.6's proposal first place (though by consensus it's mid-pack) and praises the route_direct → enforce_policy wiring as a "clean pattern." But that's exactly the bug where _adjust_calculation_decision fires twice. Meanwhile, Opus-4.7 is buried in last place. Its own recommendation bakes recipes into state, which the serializer simply won't allow.

Gemini-3.1-pro¶

Gemini-3.1-pro is true to form. A minimalist proposal, a minimalist review. An outright also-ran.

There is something useful, even so. Its taxonomy of four paradigms (phase pipeline / minimalism / micro-nodes / edge routing) gives a decent quick lay of the land. And its own four-node design hangs together well.

What buries it as a review is that it contains no per-proposal evaluation at all. Eleven options are collapsed into four groups, so you can't weigh any specific one through it. And the single substantive claim it makes turns out to be demonstrably false. Neither bug was noticed.

A good one-page cheat sheet, but one to read after a proper report, not instead of one.

GLM-5.1¶

GLM-5.1's review is solid and systematic in form. Aspect tables with a unified structure for each proposal, a genuinely useful name-mapping table (which name each model gives each node), and a phase 0–5 migration.

And beyond form, it's one of only two reports to catch Kimi-2.6's double _adjust_calculation_decision, even if only in passing.

A couple of things spoil the picture. It confuses the GPT-5.5 proposal with the observe_region/observe_currency micro-nodes that actually belong to Qwen-3.6-plus's design. The scores it hands out don't match the written rationale beneath them. Some risks (the same recipes, the plan → plan loop) are quietly glossed over. It does no fact-checking of its own.

Kimi-2.6¶

Kimi-2.6 has strong finds, but you can't trust its conclusions. It caught the skipped-iteration-increment bug in Qwen-3.7-max, one of only two reports to do so. It placed its own proposal eighth, and its own "what to avoid" list at points directly contradicts its own design.

But the misses outweigh that. It drags Gemini-3.1-pro's proposal up to fourth place on a confused reading of its return edges. The return-past-the-guard-node that Kimi-2.6 calls a bug in Qwen-3.7-max is treated as a trifle in Gemini-3.1-pro.

The number-two item in its recommendation is DeepSeek-4-pro's recipes caching, which the serializer won't accept.

MiMo-2.5-pro¶

The most valuable part of MiMo-2.5-pro's review is the convergence appendix. It tallies where all eleven proposals independently agreed (10 of 11 for extracting search, all 11 for a single interrupt, and so on).

Coverage is full, the bottom line is restrained, and the recommendation is sensible (Fable-5 as the base + GPT-5.4's anti-patterns + optional splits from GPT-5.5).

But the middle of the table lets it down. Scores given to one decimal place promise a precision nowhere to be found in the written rationale. It accepts risky ideas (recipes, separate bootstrap nodes) too complacently, found no bugs, and has no fact-checking layer.

Opus-4.7¶

Opus-4.7's proposal was contentious, but its review turned out to be one of the best in the set.

It delivered the deepest "decision package" of all: seven explicit criteria, letter grades, thorough verdicts on each proposal, the best taxonomy (that same invariant — "first reduce to a PlannerDecision, then dispatch") and two concrete amendments. And it's the only one that thought about how to keep the graph clean after the refactor, proposing as a step zero to write the guardrails into planner/AGENTS.md before touching any code.

But two baffling inaccuracies crept into the text. It cites the test file tests/planner/test_planner_agent.py, which doesn't exist, twice. And it calls Kimi-2.6's route_direct → enforce_policy wiring a "nice touch," even though that's the double-calculation bug. Plus verbosity and links that break when moved.

Qwen-3.6-plus¶

Another pleasant turnaround. With my characteristic tact, I'd called the Qwen-3.6-plus proposal a flight-of-fancy zone. Yet its review turned out far more sensible.

What's more, it has one of the most useful artifacts in the whole set. A table with a line-by-line breakdown of plan_node (lines 848–1192, with an "is an LLM needed here?" column). If you want any sense of why and along which seams to cut a god node, this is where to start.

Its realistic self-assessment wins you over. Qwen-3.6-plus places its own proposal eighth and explains its return-to-plan regression itself.

Now for the bad. The top and bottom of the chart look carefully done, while the middle is slapped together carelessly. For some reason DeepSeek-4-pro is filed under "over-engineered," though it has only four nodes. Kimi-2.6 is in fifth place without a single word about the double calculation. Confused nitpicking over Qwen-3.7-max's _route branch. No fact-checking of its own.

Qwen-3.7-max¶

Qwen-3.7-max's review is didactically strong. For each proposal it has "why not more nodes" and "why not fewer" sections, so the reader comes away not with a bare verdict but with an understanding of the decision space itself. Eight weighted axes, full coverage, a sober self-assessment (its own proposal in eighth place).

There are weak spots too. It praises DeepSeek-4-pro's plan → plan loop as "elegant," even though the other reviews rightly read it as just more checkpoint writes to the database. And DeepSeek-4-pro's inflated second place rests partly on that praise. It's also lenient toward the problems in its own design, and it has no code-checking layer of its own.

Stage three. The wet swimsuit contest¶

Here I'll try to find some answers to the burning questions of our time:

Which model is best for generating architectural solutions?
Which model is best for evaluating a set of architectural solutions?
And how do you dump responsibility for architectural decisions onto a model completely?

Approach one. Do the scores agree?¶

The first thing you want to do is simply compare the models' average scores. It turned out that opinions on the architectural proposals themselves can be boiled down to a fairly clear consensus.

In the chart below, each line shows how one proposal was scored by the different models, and the colored dashed line is its average across all the scores.

Combined chart of proposal scores across the models

The lines for Gemini-3.1-pro "clump" because it sorted the solutions into 4 groups instead of producing a full ranking.

You can also note the suspicious tangle of lines for Kimi-2.6. It's as if it handed out its scores somewhat at random.

Best proposals: the summary¶

Rank	Proposal	Average score
1	Fable-5	10.7
2	GPT-5.4	9.2
3	GPT-5.5	8.0
4	DeepSeek-4-pro	7.9
5	Kimi-2.6	6.0
6	Opus-4.7	5.9
7	GLM-5.1	5.5
8	Gemini-3.1-pro	3.8
9	MiMo-2.5-pro	3.6
10	Qwen-3.7-max	2.7
11	Qwen-3.6-plus	2.6

For picking the best architecture proposal, this method is good enough. It's simple, transparent, and reproducible.

Approach two. Comparing reviews by theses¶

The average ratings show which proposal most of the field likes. But running a series of reviews from different models for every decision you make is simply too expensive. At least for now. So I'd like to figure out which model is preferable to use as an evaluator of solutions. The goal is to identify the best analyst.

How to do that? As a first method, I tried a thesis-based analysis:

collect a list of the theses that appear in the evaluators' reports;
mark which models support which thesis;
build an agreement matrix and rank the evaluators by how well their analysis matches the common core;
separately account for breadth of coverage, text structure, and depth of reasoning.

Let me explain the last point. Breadth of coverage is how many theses from the master list a review touched at all (the broadest reached 36, the stingiest 19). Text structure is how many substantive components of an analysis the review assembles: evaluation criteria, strengths and weaknesses, a synthesis with a recommendation, a migration plan, a risk analysis, and a comparison table. Depth of reasoning is essentially just the length of the review.

Across all the reviews, the theses normalized down to about forty.

Examples of consensus theses (the number in parentheses shows how many reviews stated it):

plan_node is a god node and needs to be split (10);
the graph is deceptive: too much routing logic is hidden inside plan_node (10);
the deterministic logic before the LLM call should be pulled out into phases visible on the graph (10);
the LLM call itself should shrink to a thin planning phase (10);
returns from the action nodes should enter the top phase of the loop, not the middle of planning (10);
ask_user remains the only interrupt node (10);
the goal is a pipeline of stable, medium-granularity phases. Not a node per if, but not so over-split that it piles up checkpoint writes (10);
the post-LLM policy and redirects should become explicit architecture rather than staying buried in a single planner function or in the edges (10).

Examples of contested theses that few stated:

giving region and currency their own bootstrap nodes (2);
using DeepSeek-4-pro's plan → plan loop for late (post-LLM) decomposition (3);
adding a route_direct adapter to funnel deterministic and LLM decisions into a single post-phase (3);
introducing a separate decision_origin field instead of a plain llm_failed (2);
keeping recipes in state (1);
breaking the post-LLM policy into a normalize_decision → retry_gate → maybe_calculate chain (3).

I ran this algorithm three times.

On the first pass I ranked the evaluators by the popularity of the theses each review states and adjusted a little for coverage, structure, and depth. The winner was DeepSeek-4-pro.

What is "the popularity of the theses stated"? Each thesis has a popularity — how many reviews stated it. A review's score is the sum of the popularities of all the theses in it. So a widely shared thesis earned more points than a niche one, and the review that gathered the most widely held ideas came out on top. DeepSeek-4-pro pulled ahead exactly that way. It signed on to the most theses, even if not the most popular ones on average.
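The scoring rule is easy to restate in a few lines of Python. The three reviews and five theses below are made up, and the small adjustments for coverage, structure, and depth are left out, so treat this as a sketch of the idea rather than the actual pipeline.

```python
from collections import Counter

# Hypothetical data: which theses each review states.
reviews = {
    "review_a": {"t1", "t2", "t3", "t4"},
    "review_b": {"t1", "t2"},
    "review_c": {"t1", "t3", "t5"},
}

# Popularity of a thesis = number of reviews that state it.
popularity = Counter(t for theses in reviews.values() for t in theses)

# Score of a review = sum of the popularities of its theses.
scores = {name: sum(popularity[t] for t in theses)
          for name, theses in reviews.items()}

# Average popularity per thesis, for comparison with the raw sum.
averages = {name: scores[name] / len(theses)
            for name, theses in reviews.items()}

ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy data `review_a` wins on the sum because it states the most theses, even though `review_b` has the higher average popularity per thesis. That is the same effect that put DeepSeek-4-pro on top.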

The second time I formalized the comparison. I broke each review down into per-aspect scores, averaged them, and looked at whose review was closest to that average. DeepSeek-4-pro came out first again. But the rest of the table reshuffled quite a bit.

An aspect is a recurring architectural claim into which several closely related theses from different reviews are consolidated. For example: "aim for a 4–6 phase graph," "one policy node after the LLM is enough," "you need separate bootstrap nodes for region and currency". There were nineteen of these in all. For each aspect a review received a number from −1 to +1 depending on whether it supported the aspect, stayed silent, or rejected it. That turned each text into a vector of 19 numbers. A mean vector is derived from these, and the reports are sorted by their closeness to it.
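As a sketch of that computation: the stances below are invented and cover four aspects instead of nineteen, and Euclidean distance is my assumption here, since the description above only speaks of "closeness" to the mean vector.

```python
import math

# Hypothetical stances on 4 aspects: +1 supports, 0 silent, -1 rejects.
vectors = {
    "review_a": [1, 1, 0, -1],
    "review_b": [1, 0, 0, -1],
    "review_c": [1, -1, 1, 1],
}

# Mean vector: the per-aspect average across all reviews.
mean = [sum(column) / len(vectors) for column in zip(*vectors.values())]


def distance(u, v):
    """Euclidean distance between two aspect vectors (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


closeness = {name: distance(vec, mean) for name, vec in vectors.items()}

# Smallest distance first: the review closest to the common position wins.
ranking = sorted(closeness, key=closeness.get)
```

A review that stays silent on contested aspects, like `review_b` here, ends up closest to the mean, which is one reason this ranking can differ sharply from the popularity-sum ranking.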

The third time I left the method untouched and only renamed the DeepSeek-4-pro file so it would sort to the end of the list, because I suspected that its position at the top of the report list might influence how the thesis list itself gets determined. As a result, DeepSeek-4-pro slid to fourth place, and the thesis set itself came out shorter.

If you write all three rankings side by side, you can see how the evaluators get tossed around from run to run (Fable-5 did not yet take part in these experiments):

Evaluator	Run 1	Run 2	Run 3
DeepSeek-4-pro	1	1	4
Qwen-3.7-max	2	7	2
Opus-4.7	3	5	9
Kimi-2.6	4	9	10
GLM-5.1	5	8	3
GPT-5.5	6	3	6
Qwen-3.6-plus	7	6	7
MiMo-2.5-pro	8	2	1
GPT-5.4	9	4	5
Gemini-3.1-pro	10	10	8

At this point I realized I couldn't pin down the best analyst semantically. I made a couple more attempts using embeddings for the comparison instead of thesis normalization. But I won't tire you further, weary reader, with the nonsense that came out of it.

Approach three. Center of opinion and medoid¶

I couldn't compute the most skilled analyst from the texts of the analyses. So I'll compute it from metrics. And the only metric I really have is the ranks each review handed out to the proposals.

First an artificial center is built: the arithmetic mean of each proposal's rank across all the analyses. Then each report is compared against that average ranking. As a control, the medoid is computed separately by measuring the total distance from each real report to all the others.
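Both computations fit in a short Python sketch. The five reports and their ranks are invented, and the sum of absolute rank differences is an assumed distance, since the text above does not name the metric that was used.

```python
# Hypothetical ranks: report -> rank it gave to each of four proposals (1 = best).
ranks = {
    "report_a": [1, 2, 3, 4],
    "report_b": [1, 2, 4, 3],
    "report_c": [2, 1, 3, 4],
    "report_d": [1, 3, 2, 4],
    "report_e": [4, 3, 2, 1],
}


def distance(u, v):
    """Sum of absolute rank differences (an assumed metric)."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Artificial center: the arithmetic mean of each proposal's rank.
center = [sum(column) / len(ranks) for column in zip(*ranks.values())]
to_center = {name: distance(r, center) for name, r in ranks.items()}

# Medoid: the real report with the smallest total distance to all reports.
to_others = {name: sum(distance(r, other) for other in ranks.values())
             for name, r in ranks.items()}

closest_to_center = min(to_center, key=to_center.get)
medoid = min(to_others, key=to_others.get)
```

The center is a ranking nobody actually wrote, while the medoid is always one of the real reports. When the two methods name the same report, as they do in this toy data, the result is less likely to be an artifact of averaging.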

This method shows how accurately an analyst predicted the distribution of the refactoring proposals' ranks in the aggregate rating.

Evaluator ranks compared by closeness to the average ranking and by medoid

Both methods showed that the best predictor of the overall rating was GPT-5.5. Further down the list there are minor discrepancies between the two computation methods, but minor is exactly what they are. One position either way. And the medoid distances show that the models that swapped places differ by only hundredths and thousandths.

It's fairly expected that GPT-5.4 and GPT-5.5 predicted the final rating well. And it's fairly surprising that MiMo-2.5-pro and Qwen-3.7-max creep up toward the top of the table. Then again, the score chart bears it out. Those Chinese analysts land fairly close to the average.

Deus ex machina¶

But how do you ultimately choose the model whose report counts as sufficient for picking the best refactoring solution?

The fact that a model ranked the proposals close to the consensus doesn't mean at all that it also gave enough information to make a deliberate decision. A person may well have reasons to decide in favor of a proposal that didn't even place near the top of the ranking. And they're better off basing that decision on the fullest and highest-quality review from an analyst model.

Since I haven't yet managed to identify the best analyst strictly by semantic methods, I decided to enlist the help of "experts." There's a clear winner among the architecture proposals and a clear winner at predicting the analysts' proposal rating. Those are Fable-5 and GPT-5.5, respectively.

So I'll ask them to rank the analytic reports by the quality of their analysis. Naturally, neither contestant gets to see the other's work. The prompt was:

You are skilled software architect specializing in llm agentic development. In docs/planner-graph-ref/analyse there are some analitic reports about proposals to refactor current planner component graph. Evaluate those reports and create ranged list of them in docs/planner-graph-ref/best-analyst/<model>-ba-range.md with explanation of what good and bad sides you can see in each. Range them by principle of what report is the best only report to read for having enough information to make a deliberated decision of what proposal to choose for implementation.

The results of the two meta-analyses diverged in the details but not in the main thrust:

Analyst-report ranks in the Fable-5 and GPT-5.5 meta-analysis

The chart shows unanimous endorsement of Fable-5's analysis. After that, each judge rates its own corporate approach to the problem a touch higher. Among the Chinese analysts, the first ones to look at appear to be Qwen-3.7-max and GLM-5.1, and perhaps DeepSeek-4-pro too.

And all these assessments diverge quite a bit from the results of the best rating predictor. Which is no surprise. These are somewhat different questions, after all. That's exactly why they're split into separate parts of the analysis.

Let's go ahead and see exactly how they diverge:

Best analyst: Fable-5 and GPT-5.5 meta-analyses on the sides, ranking closeness to the average in the center

As for trends, I'll point out that the GPT family tops the ratings on both dimensions, GLM-5.1 and Qwen-3.7-max sit in the middle, and Gemini-3.1-pro is at the bottom.

Methods for evaluating the analysts¶

Fable-5 rated each report on five dimensions:

Fact integrity. Whether every claim about the current code and the proposals is checked against the source.
Coverage. Whether all 11 proposals are assessed individually.
Decision power. Whether the report ends in a concrete adopt/combine/reject recommendation with a migration order.
Risk coverage. Whether the pinch points are addressed: a checkpoint write at every node boundary, the single user-question interrupt, the state-serialization constraints, the real composition of the test suite.
Calibration. Whether the bottom line stays in line with the broader consensus.

And it didn't just tally how many errors it found in a report. It looked at where each error sits and what role it plays in the argument.

GPT-5.5 took almost the same route, but across six criteria:

Current-code groundedness. Whether claims are checked against the current code, the planner's state, and the real interrupt/resume surface.
Decision-critical risk coverage. Checkpoint compatibility, state serialization, edge-function purity, the single user-question interrupt, iteration semantics, LLM-failure behavior, and the scope of rewriting deterministic vs. LLM decisions.
Proposal coverage. Whether all 11 are assessed individually.
Implementation usefulness. Whether there's a recommendation, a migration path, and a clear sense of what to adopt and what to reject.
Reliability. No factual mistakes and no inflated scores for its own proposal.
Clarity. Whether the report is written clearly enough to serve as a single briefing.

The verdicts themselves¶

Each expert gave every report under review a set of positive and negative points, capped with a short verdict. In the table below I include only those verdicts.

Report	Fable-5 verdict	GPT-5.5 verdict
`fable-5-range`	Sufficient. True facts, full coverage, both decisive bug-finds, and a concrete execution order. This is the report to read if you read one.	A+. The only report that is both comprehensive and hard-nosed about the hidden operational risks. You could pick an implementation plan from this one alone.
`opus-4.7-range`	Sufficient with corrections. The richest "what to do next" in the corpus. Just ignore the nonexistent test file and read Kimi-2.6's "nice touch" as the bug it is.	A−. Architecturally strong and nearly the best. It loses only because GPT-5.5 is the cleaner single read.
`gpt-5.5-range`	Sufficient, and the safest. Nothing here will mislead you; you'll simply learn less about each option than Fable-5 or Opus-4.7 would have told you.	A−. A strong, compact basis for a decision. It trails the top pair only because a deliberated choice wants more evidence.
`gpt-5.4-range`	Sufficient-minus. Accurate and decision-shaped, but shallower per option, and it ranks itself first without disclosure.	A. The best lens on maintainable architecture and rollout. Slightly less self-contained, because it borrows some of its mechanics from others.
`glm-5.1-range`	Adequate. The headline is right, coverage is systematic, and there's one real catch. It falls short of "sufficient" because of a cross-attribution error and noisy scores.	B−. An easy read and often right, but it misses too many key risks to rely on alone.
`qwen-3.7-max-range`	Adequate. The best didactics of the lot. The one miss: DeepSeek-4-pro is placed too high, at #2, and that #2 rests on praise for the looping `plan → plan` node.	B+. A solid secondary review. Good enough to confirm the direction, too weak as the only basis.
`deepseek-4-pro-range`	Insufficient alone, excellent second. Deep, but it crowns a proposal with a bug and a risky state design.	C−. A lot of content, but as the only source it could steer a team toward the wrong base proposal and unsafe state choices.
`qwen-3.6-plus-range`	Adequate-minus. Excellent for understanding the problem, but its mid-table judgments are too noisy to act on.	B. Useful and mostly headed the right way, but without a stronger review the reader risks over-complicating the graph.
`mimo-2.5-pro-range`	Adequate-minus. The convergence appendix and the headline are right, but the middle of the table sounds confident while resting on thin evidence.	C. Good for a cross-check, but not as a single source of truth, and too lenient toward risky designs.
`kimi-2.6-range`	Insufficient alone. Real catches and a sober self-assessment, but it could lead you to shortlist Gemini-3.1-pro's design and a state shape that's unsafe for the serializer.	C+. The observations are useful, but trusting it alone to pick a proposal is risky.
`gemini-3.1-pro-range`	Insufficient. A handy one-page taxonomy, but only as a supplement to a real review. On its own it will nudge you to reject a good proposal for a flaw it doesn't have.	D. Fine for a first sketch of the picture, but not for a decision. On its own it misses the most important implementation risks.

Takeaways¶

Which model to use for generating architecture.

The simpler the decision you need to make, the more readily you can just take a proposal from Fable or GPT-5.5. And now that Fable is unavailable outside the US, the choice between Opus and GPT is far from obvious. As a quick default I'd lean toward GPT.

But if the problem really deserves careful thought, it's better to generate several proposals from whatever models you have access to or prefer. What's more, several proposals from the same model can differ quite a lot. Using a few cheap generations from GLM, Kimi, or Qwen, you can try to reach a result comparable to using the expensive US flagships.

In practice, for the project I chose GPT-5.4's solution with minor improvements borrowed from Opus and GPT-5.5. And not because Fable hadn't been released yet at the time. It's because pulling bootstrap_gate out into its own node struck me as a good move for the project's anticipated future development. And I'd most likely have picked the same GPT-5.4 variant even with Fable's proposal on the table, borrowing only the neat decision_origin idea.

Which model to use as an evaluator.

Again, it splits two ways.

If you have a set of options and just need to quickly pick the best one (or assemble the best combination) without overthinking it, take GPT-5.5 or 5.4. In an all-Chinese setting, you can fall back on MiMo or Qwen-3.7.

But if you want to dig into the details and make your own choice, then Fable. Better yet, Fable + GPT-5.5. Or Opus + GPT-5.5. With the Chinese models here I can't recommend any of them with confidence. They make too many mistakes in their analysis. You'd have to either add an extra layer of analysis to catch those errors or read the reports with a very critical eye.

How to offload all the responsibility onto the model.

For now, you can't. You still have to think for yourself. A fine way to pass the time while waiting for the AI singularity. And modern LLMs help a great deal with that, supplying decent-quality material to analyze.
