Testing results from the v0.7.0 extended thinking integration (2026-02-13).
The Problem: False Positives from Autoregressive Generation¶
Prior to v0.7.0, the style checker had a persistent false positive problem where the model would report violations that didn’t exist — the “suggested fix” was identical to the “current text.” This happened at rates of 40–60% depending on the rule.
Root Cause¶
The root cause is inherent to how autoregressive language models generate text:
The model starts writing a violation block (commits tokens)
Mid-way through, it realizes the text is actually compliant
But it can’t “undo” the tokens already written
So it either retracts inline or emits identical current/suggested text
No amount of prompt engineering can fix this — the model must commit tokens sequentially and can’t look ahead.
Solution: Extended Thinking¶
Extended thinking (Anthropic’s feature for Claude Sonnet 4.5) lets the model reason internally in a “thinking” phase before producing any output tokens:
The model analyzes the entire document silently
Verifies every candidate violation before committing any output
Only includes confirmed violations in the response
Result: 0% false positive rate
Configuration¶
thinking = {
"type": "enabled",
"budget_tokens": 10000, # Max tokens for internal reasoning
}
temperature = 1.0 # Required by Anthropic for extended thinkingExperiment Results¶
All experiments used rule qe-writing-001 (one sentence per paragraph) against the test lecture markov_chains_jax.md.
Prompt Iteration Results¶
| # | Approach | Violations Found | False Positives | FP Rate |
|---|---|---|---|---|
| 1 | Baseline (verbose prompt, no thinking) | 21 | 9 | 43% |
| 2 | Minimal prompt, no thinking | 27 | 13 | 48% |
| 3 | Minimal prompt + “verify first” | 16 | 10 | 63% |
| 4 | Minimal prompt + “analyze then report” | 20 | 8 | 40% |
| 5 | Minimal prompt + extended thinking | 6 | 0 | 0% |
Ground Truth Validation¶
A deterministic Python script (find_multisentence.py) identified 5–6 genuine violations in the test lecture. The extended thinking result (6 violations, all genuine) aligned correctly with ground truth.
The baseline’s 21 “violations” included:
9 outright false positives (identical text)
Several debatable items (compound sentences, list items)
Only ~6 genuine violations
Full Production Validation¶
After deploying to production, extended thinking was validated across all 8 writing rules:
| Metric | Result |
|---|---|
| Total issues found | 40 |
| Applied fixes (rule-type) | 25 |
| Style suggestions (advisory) | 15 |
| False positives | 0 |
All 25 applied fixes were legitimate corrections. All 15 style suggestions were reasonable recommendations.
Prompt Design Learnings¶
What Hurt (v0.6.1 approach)¶
Category-specific instructions — 8 verbose prompts (~120 lines each) with ~60% boilerplate diluted the actual task signal
“Decision process” instructions (e.g., “for each paragraph, decide if...”) triggered exhaustive classify-everything behavior, generating more false positives
Scope instructions in the prompt (e.g., “skip code blocks”) conflicted across categories — scope is rule-specific, not prompt-level
What Worked (v0.7.0 approach)¶
Minimal rule-agnostic prompt (~40 lines) — identity line + task + format template
Rules carry their own context — each rule defines its scope, unit of analysis, and violation criteria
Extended thinking — model reasons internally before any output
“Verify before reporting” in prompt — combined with extended thinking, this ensures only confirmed violations are output
The Winning Prompt¶
You are a style checker for QuantEcon lecture files written in MyST Markdown.
## Task
Find all violations of the provided rule in the lecture document.
First, silently analyze the entire document and identify candidate violations.
Then, verify each candidate — confirm the current text actually violates the rule
and the fix changes the text.
Only include confirmed violations in your response. Report 0 if none exist.
## Response Format
[template...]This is 40 lines vs the previous 120-line category-specific prompts, and it produces better results.
Key Decisions¶
| Decision | Rationale |
|---|---|
thinking_budget=10000 | Enough for careful analysis, not excessive cost |
temperature=1.0 | Required by Anthropic for extended thinking |
| 8 identical prompt files (for now) | Consolidation to single file planned (validated on writing, pending other categories) |
| Archive v0.6.1 prompts | Reference for regression testing and comparison |