Extended Thinking Results - QuantEcon Style Guide Checker

Testing results from the v0.7.0 extended thinking integration (2026-02-13).

The Problem: False Positives from Autoregressive Generation¶

Prior to v0.7.0, the style checker had a persistent false positive problem where the model would report violations that didn’t exist — the “suggested fix” was identical to the “current text.” This happened at rates of 40–60% depending on the rule.

Root Cause¶

The root cause is inherent to how autoregressive language models generate text:

The model starts writing a violation block (commits tokens)
Mid-way through, it realizes the text is actually compliant
But it can’t “undo” the tokens already written
So it either retracts inline or emits identical current/suggested text

No amount of prompt engineering can fix this — the model must commit tokens sequentially and can’t look ahead.

Solution: Extended Thinking¶

Extended thinking (Anthropic’s feature for Claude Sonnet 4.5) lets the model reason internally in a “thinking” phase before producing any output tokens:

The model analyzes the entire document silently
Verifies every candidate violation before committing any output
Only includes confirmed violations in the response
Result: 0% false positive rate

Configuration¶

thinking = {
    "type": "enabled",
    "budget_tokens": 10000,  # Max tokens for internal reasoning
}
temperature = 1.0  # Required by Anthropic for extended thinking

Experiment Results¶

All experiments used rule qe-writing-001 (one sentence per paragraph) against the test lecture markov_chains_jax.md.

Prompt Iteration Results¶

#	Approach	Violations Found	False Positives	FP Rate
1	Baseline (verbose prompt, no thinking)	21	9	43%
2	Minimal prompt, no thinking	27	13	48%
3	Minimal prompt + “verify first”	16	10	63%
4	Minimal prompt + “analyze then report”	20	8	40%
5	Minimal prompt + extended thinking	6	0	0%

Ground Truth Validation¶

A deterministic Python script (find_multisentence.py) identified 5–6 genuine violations in the test lecture. The extended thinking result (6 violations, all genuine) aligned correctly with ground truth.

The baseline’s 21 “violations” included:

9 outright false positives (identical text)
Several debatable items (compound sentences, list items)
Only ~6 genuine violations

Full Production Validation¶

After deploying to production, extended thinking was validated across all 8 writing rules:

Metric	Result
Total issues found	40
Applied fixes (rule-type)	25
Style suggestions (advisory)	15
False positives	0

All 25 applied fixes were legitimate corrections. All 15 style suggestions were reasonable recommendations.

Prompt Design Learnings¶

What Hurt (v0.6.1 approach)¶

Category-specific instructions — 8 verbose prompts (~120 lines each) with ~60% boilerplate diluted the actual task signal
“Decision process” instructions (e.g., “for each paragraph, decide if...”) triggered exhaustive classify-everything behavior, generating more false positives
Scope instructions in the prompt (e.g., “skip code blocks”) conflicted across categories — scope is rule-specific, not prompt-level

What Worked (v0.7.0 approach)¶

Minimal rule-agnostic prompt (~40 lines) — identity line + task + format template
Rules carry their own context — each rule defines its scope, unit of analysis, and violation criteria
Extended thinking — model reasons internally before any output
“Verify before reporting” in prompt — combined with extended thinking, this ensures only confirmed violations are output

The Winning Prompt¶

You are a style checker for QuantEcon lecture files written in MyST Markdown.

## Task

Find all violations of the provided rule in the lecture document.

First, silently analyze the entire document and identify candidate violations.
Then, verify each candidate — confirm the current text actually violates the rule
and the fix changes the text.
Only include confirmed violations in your response. Report 0 if none exist.

## Response Format
[template...]

This is 40 lines vs the previous 120-line category-specific prompts, and it produces better results.

Key Decisions¶

Decision	Rationale
`thinking_budget=10000`	Enough for careful analysis, not excessive cost
`temperature=1.0`	Required by Anthropic for extended thinking
8 identical prompt files (for now)	Consolidation to single file planned (validated on writing, pending other categories)
Archive v0.6.1 prompts	Reference for regression testing and comparison