Overview¶
qebench is a Python CLI tool built on three layers:
┌─────────────────────────────────────┐
│ CLI Commands                        │ ← Typer commands (translate, add, stats)
├─────────────────────────────────────┤
│ Scoring          │ Providers        │ ← Elo, XP, glossary │ Claude, OpenAI
├─────────────────────────────────────┤
│ Data Layer (models + utils)         │ ← Pydantic models, JSON I/O
└─────────────────────────────────────┘
Module Map¶
src/qebench/
├── cli.py              # Typer app — 10 commands: stats, add, translate, export, submit, doctor, update, validate, run, judge
├── models.py           # Pydantic models: Term, Sentence, Paragraph, DataFile
├── commands/
│   ├── stats.py        # Dataset coverage + XP leaderboard (Rich panels/tables)
│   ├── add.py          # Interactive entry creation → saves to per-user file
│   ├── translate.py    # Translation practice game loop
│   ├── export.py       # Export 6 JSON files for the dashboard
│   ├── submit.py       # Git pull/commit/push workflow
│   ├── doctor.py       # 8 preflight checks (gh, git, repo, data, etc.)
│   ├── update.py       # Pull latest code + uv sync dependencies
│   ├── validate.py     # Schema validation for all dataset files
│   ├── run.py          # Batch translate via LLM providers
│   └── judge.py        # Anonymous head-to-head translation judging
├── scoring/
│   ├── elo.py          # Elo rating for model comparison
│   ├── formatting.py   # MyST formatting fidelity checks (directive balance, punctuation, etc.)
│   ├── glossary.py     # Glossary compliance + reference overlap scoring
│   ├── judgments.py    # Judgment persistence + Elo update orchestration
│   └── xp.py           # XP tracking per user
├── providers/
│   ├── base.py         # Abstract TranslationProvider + TranslationResult
│   ├── claude.py       # Anthropic Claude provider
│   ├── openai.py       # OpenAI provider
│   └── prompts.py      # Prompt template loading/validation (supports {glossary} placeholder)
└── utils/
    ├── dataset.py      # Load/save JSON data, config, domain list, glossary loading
    ├── display.py      # Rich console singleton
    └── github.py       # get_github_username() via gh CLI (cached)
Data Flow¶
Translation Session¶
User runs: qebench translate -n 5
        │
        ▼
github.py ──→ auto-detects username via `gh api user`
        │
        ▼
dataset.py ──→ loads terms/sentences/paragraphs from data/**/*.json
        │       (merges _seed_*.json + per-user files)
        ▼
translate.py ──→ picks entries, presents English, collects Chinese
        │
        ├──→ confidence prompt ──→ 1–5 rating of translator certainty
        ├──→ notes prompt ──→ optional context / reasoning
        ├──→ _reference_panel() ──→ shows reference (educational, no score)
        ├──→ _save_attempt() ──→ appends to results/translations/{username}.jsonl
        │       (includes cli_version for schema migration)
        └──→ xp.award_xp() ──→ updates results/xp/{username}.json
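The results file is append-only JSONL, which is what keeps per-user result files merge-friendly. A minimal sketch of the append that _save_attempt() performs; everything beyond the confidence, notes, and cli_version fields named in the flow above is an illustrative assumption, not the actual schema:

```python
# Hypothetical sketch of the append performed by _save_attempt(); field
# names other than confidence, notes, and cli_version are assumptions.
import json
from datetime import datetime, timezone
from pathlib import Path

def save_attempt(username: str, record: dict) -> None:
    path = Path("results/translations") / f"{username}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    record.setdefault("timestamp", datetime.now(timezone.utc).isoformat())
    # One JSON object per line: append-only writes keep concurrent
    # sessions and git merges conflict-free.
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

save_attempt("alice", {                # "alice" is a made-up username
    "entry_id": "term-0042",           # assumed field
    "translation": "通货膨胀",
    "confidence": 4,
    "notes": "standard textbook term",
    "cli_version": "0.3.0",            # assumed version string
})
```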
Add Entry¶
User runs: qebench add
        │
        ▼
add.py ──→ questionary prompts for entry type + fields
        │
        ▼
models.py ──→ validates entry via Pydantic
        │
        ▼
add.py ──→ _save_to_user_file() ──→ appends to data/terms/{username}.json
Submit Results¶
User runs: qebench submit
        │
        ▼
submit.py ──→ git pull --rebase
        │  ──→ git add data/ results/
        │  ──→ git commit -m "benchmark: add data ..."
        │  ──→ git push
        ▼
Dashboard CI rebuilds automatically on push
Export & Dashboard¶
User (or CI) runs: qebench export
        │
        ▼
export.py ──→ loads all data + results
        │  ──→ computes coverage, domain stats, difficulty stats,
        │       leaderboard, activity feed, term samples
        │  ──→ writes 6 JSON files to docs/_static/dashboard/data/
        ▼
MyST build + gh-pages deploys the dashboard
Scoring Module¶
Formatting Fidelity (scoring/formatting.py)¶
Automated checks that verify structural integrity of LLM translations. These
are invoked by qebench judge and displayed in the reveal panel.
| Function | Returns | What it checks |
|---|---|---|
| check_directive_balance(source, translated) | bool | Fence count (```) matches between source and translation |
| check_fence_consistency(translated) | bool | No mixed $$ / ```{math} markers |
| check_code_block_integrity(source, translated) | bool | Code blocks preserved verbatim |
| check_fullwidth_punctuation(text) | float | Fraction (0–1) of punctuation that is fullwidth (,。!?) |
| check_directive_spacing(text) | float | Fraction (0–1) of CJK→directive boundaries with proper spacing |
| formatting_score(source, translated) | dict | Runs all checks, returns per-check results |
Helper functions:
_extract_code_blocks(text) — extracts fenced code blocks from markdown
_strip_code_and_math(text) — removes code and math blocks before punctuation analysis
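As an illustration of how lightweight these checks are, here is a minimal sketch of the fence-count comparison behind check_directive_balance(); the real implementation in scoring/formatting.py may be stricter:

```python
# Minimal sketch of the fence-balance idea, not the actual code.
import re

_FENCE = re.compile(r"^```", re.MULTILINE)

def check_directive_balance(source: str, translated: str) -> bool:
    """True when source and translation contain the same number of
    ``` fences, i.e. no directive was dropped or left unclosed."""
    return len(_FENCE.findall(source)) == len(_FENCE.findall(translated))

assert check_directive_balance("```{math}\nx\n```", "```{math}\ny\n```")
assert not check_directive_balance("```{math}\nx\n```", "no fences here")
```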
Glossary Loading (utils/dataset.py)¶
The load_glossary() function fetches the glossary from config.yaml’s
glossary_path (URL or local path):
load_glossary(force_refresh=False)
        │
        ├── glossary_path is URL?
        │     ├── Fetch via urllib.request.urlopen()
        │     ├── Cache to .cache/glossary.json
        │     └── On network failure: fall back to cache
        │
        └── glossary_path is local path?
              └── Read directly
        │
        ▼
_extract_glossary_terms(data)
        └── Parses glossary JSON → list[dict] (each dict has en + zh-cn keys)
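A minimal sketch of that fetch / cache-fallback flow. The config lookup, the exact force_refresh semantics, and the parser stub are illustrative assumptions, not the real implementation:

```python
import json
import urllib.request
from pathlib import Path

CACHE = Path(".cache/glossary.json")
GLOSSARY_PATH = "https://example.com/glossary.json"  # stand-in for config.yaml's glossary_path

def _extract_glossary_terms(data) -> list[dict]:
    # Assumption about the glossary JSON shape: each entry yields a
    # dict with "en" and "zh-cn" keys.
    return [{"en": t["en"], "zh-cn": t["zh-cn"]} for t in data]

def load_glossary(force_refresh: bool = False) -> list[dict]:
    if GLOSSARY_PATH.startswith(("http://", "https://")):
        # Assumption: a cached copy is reused unless force_refresh is set.
        if CACHE.exists() and not force_refresh:
            raw = CACHE.read_text(encoding="utf-8")
        else:
            try:
                with urllib.request.urlopen(GLOSSARY_PATH) as resp:
                    raw = resp.read().decode("utf-8")
                CACHE.parent.mkdir(parents=True, exist_ok=True)
                CACHE.write_text(raw, encoding="utf-8")
            except OSError:
                # Network failure: fall back to the last cached copy.
                raw = CACHE.read_text(encoding="utf-8")
    else:
        raw = Path(GLOSSARY_PATH).read_text(encoding="utf-8")
    return _extract_glossary_terms(json.loads(raw))
```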
Prompt Template System (providers/prompts.py)¶
Templates use {placeholder} syntax. Required placeholders:
{source_lang}, {target_lang}, {domain}, {text}.
Optional placeholder: {glossary} — auto-populated from the glossary when
present in a template. Double braces {{...}} are treated as literal braces
(e.g., {{math}} renders as {math} in the final prompt).
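These are exactly str.format() semantics. A short sketch; the template text here is made up, and whether prompts.py calls str.format() directly is an assumption:

```python
# Single-brace placeholders are substituted; {{...}} survives as
# literal braces, so MyST directive syntax can appear in prompts.
TEMPLATE = (
    "Translate the following {domain} text from {source_lang} to {target_lang}.\n"
    "Wrap display math in {{math}} directives.\n"
    "\n"
    "{text}"
)

prompt = TEMPLATE.format(
    source_lang="en",
    target_lang="zh-cn",
    domain="economics",
    text="GDP rose by 3%.",
)
print(prompt)  # "{{math}}" has become the literal "{math}"
```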
Key Design Decisions¶
JSON files over SQLite¶
Git-friendly, transparent, easy for RAs to inspect and edit manually. Per-user files avoid merge conflicts when multiple RAs work simultaneously.
Per-user data files¶
Each contributor gets their own file (data/terms/{username}.json). Seed data
uses the _seed_ prefix (data/terms/_seed_economics.json). All files are
loaded together at runtime via glob — the distinction is purely organizational.
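A sketch of what that glob merge can look like, assuming for illustration that each file holds a plain JSON array (the real loader validates entries through the Pydantic models):

```python
import json
from pathlib import Path

def load_terms(data_dir: str = "data/terms") -> list[dict]:
    entries: list[dict] = []
    # _seed_*.json and {username}.json match the same glob and are
    # treated identically at load time.
    for path in sorted(Path(data_dir).glob("*.json")):
        entries.extend(json.loads(path.read_text(encoding="utf-8")))
    return entries
```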
GitHub identity via gh CLI¶
Username auto-detected with gh api user --jq .login (cached with lru_cache).
No manual --user flags needed. Requires gh auth login as a one-time setup.
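A sketch of the cached lookup; the gh invocation is the one quoted above, the error handling is left out for brevity:

```python
import subprocess
from functools import lru_cache

@lru_cache(maxsize=1)
def get_github_username() -> str:
    # One subprocess call per session; lru_cache memoizes the result.
    result = subprocess.run(
        ["gh", "api", "user", "--jq", ".login"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```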
Pydantic for schemas¶
Type safety + auto JSON Schema generation + validation in one place. Models serve double duty as the validation layer and the documentation of the data format.
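A hypothetical illustration of the pattern; the real Term model in models.py defines its own fields. The point is one class doing triple duty as type, validator, and schema:

```python
from pydantic import BaseModel, Field

class Term(BaseModel):
    en: str
    zh_cn: str = Field(alias="zh-cn")   # JSON key uses a hyphen
    domain: str
    difficulty: str | None = None       # assumed optional field

# Validation on load...
term = Term.model_validate(
    {"en": "inflation", "zh-cn": "通货膨胀", "domain": "economics"}
)
# ...and auto-generated JSON Schema for validation tooling and docs.
print(Term.model_json_schema())
```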
Similarity as a trigger, not a grade¶
Character-level Jaccard similarity (_char_overlap) is computed for each
translation, but it is an informational metric and a trigger, not a grade: when
similarity falls below 85%, the user is prompted to explain why their
translation differs (formal/informal register, regional preference, context,
abbreviation, alternative technical term, etc.). This captures both the
variation and the reasoning behind it — the most valuable data for improving
the translator.
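A sketch of a character-level Jaccard similarity in the spirit of _char_overlap; the 0.85 threshold comes from the text above, the implementation itself is an assumption:

```python
def char_overlap(a: str, b: str) -> float:
    # Jaccard index over character sets: |A ∩ B| / |A ∪ B|.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

user, reference = "通胀", "通货膨胀"
if char_overlap(user, reference) < 0.85:   # 0.5 here: 2 shared chars of 4
    print("Below 85%: prompt the user to explain the difference.")
```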
XP stored per-user in JSON¶
Each user gets a separate file (results/xp/{username}.json). Avoids write
conflicts when multiple RAs work simultaneously. Aggregation happens at display
time.
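A sketch of the per-user update; the JSON layout inside the file is an assumption:

```python
import json
from pathlib import Path

def award_xp(username: str, points: int) -> int:
    # Read-modify-write on this user's file only, so two RAs never
    # touch the same file.
    path = Path("results/xp") / f"{username}.json"
    data = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {"xp": 0}
    data["xp"] += points
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return data["xp"]
```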
Recursive add() for “add another”¶
The add command calls itself recursively to allow adding multiple entries in
one session without restarting. Simple and works well for CLI UX.
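The shape of the pattern, with input() standing in for the questionary prompts and a print() for _save_to_user_file():

```python
def add() -> None:
    entry = input("Term (en): ")
    print(f"saved {entry!r}")
    if input("Add another? [y/N] ").strip().lower() == "y":
        add()   # recurse instead of looping back to the top
```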
Directory Layout¶
benchmark.translate-zh-cn/
├── data/
│   ├── terms/
│   │   ├── _seed_economics.json    # Seeded terms (read-only, by domain)
│   │   ├── _seed_mathematics.json
│   │   ├── ...                     # 15 seed files total, 314 terms
│   │   └── {username}.json         # Per-user contributions
│   ├── sentences/
│   │   ├── _seed_lectures.json     # 80 sentences from lecture repos
│   │   └── {username}.json
│   └── paragraphs/
│       ├── _seed_lectures.json     # 17 paragraphs with math/code/directives
│       └── {username}.json
├── prompts/
│   ├── default.txt                 # General-purpose translation prompt
│   ├── academic.txt                # Academic register emphasis
│   ├── action-basic.txt            # MyST-aware rules (no glossary)
│   └── action-new.txt              # MyST-aware rules + glossary injection
├── scripts/
│   ├── seed_from_glossary.py       # Seed terms from action-translation glossary
│   ├── seed_from_lectures.py       # Seed sentences/paragraphs from lecture repos
│   └── classify_difficulty.py      # Auto-classify term difficulty
├── results/
│   ├── translations/               # User translation attempts (JSONL per user)
│   ├── model-outputs/              # LLM translations (JSONL per model×prompt)
│   ├── judgments/                  # Judge results (JSONL per user)
│   ├── xp/                         # XP totals per user (JSON per user)
│   └── elo.json                    # Model Elo ratings
├── .cache/
│   ├── glossary.json               # Cached glossary from action-translation
│   └── lectures/                   # Cloned lecture repos (gitignored)
├── docs/
│   ├── _static/dashboard/          # Chart.js dashboard + exported JSON
│   └── ...                         # MyST documentation
├── config.yaml                     # Language pair, domains, targets, glossary URL
├── REVIEW.md                       # Design review & gap analysis
└── src/qebench/                    # Python package (see Module Map above)
Configuration¶
All language-specific settings live in config.yaml:
language_pair:
  source: en
  target: zh-cn
domains:
  - economics
  - mathematics
  - statistics
  # ...
targets:
  terms: 500
  sentences: 100
  paragraphs: 30

The CLI code is language-agnostic — it reads domain lists and targets from config at runtime. This makes it possible to extract the tool for other language pairs later.
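A sketch of that runtime read, assuming PyYAML; the keys are the ones shown above:

```python
import yaml

with open("config.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

domains = config["domains"]        # drives prompts, stats, and entry creation
targets = config["targets"]        # e.g. {"terms": 500, "sentences": 100, ...}
pair = config["language_pair"]     # {"source": "en", "target": "zh-cn"}
```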