Architecture

Overview

qebench is a Python CLI tool built on three layers:

┌─────────────────────────────────────┐
│          CLI Commands               │  ← Typer commands (translate, add, stats)
├─────────────────────────────────────┤
│     Scoring    │    Providers       │  ← Elo, XP, glossary │ Claude, OpenAI
├─────────────────────────────────────┤
│     Data Layer (models + utils)     │  ← Pydantic models, JSON I/O
└─────────────────────────────────────┘

Module Map

src/qebench/
├── cli.py                 # Typer app — 10 commands: stats, add, translate, export, submit, doctor, update, validate, run, judge
├── models.py              # Pydantic models: Term, Sentence, Paragraph, DataFile
├── commands/
│   ├── stats.py           # Dataset coverage + XP leaderboard (Rich panels/tables)
│   ├── add.py             # Interactive entry creation → saves to per-user file
│   ├── translate.py       # Translation practice game loop
│   ├── export.py          # Export 6 JSON files for the dashboard
│   ├── submit.py          # Git pull/commit/push workflow
│   ├── doctor.py          # 8 preflight checks (gh, git, repo, data, etc.)
│   ├── update.py          # Pull latest code + uv sync dependencies
│   ├── validate.py        # Schema validation for all dataset files
│   ├── run.py             # Batch translate via LLM providers
│   └── judge.py           # Anonymous head-to-head translation judging
├── scoring/
│   ├── elo.py             # Elo rating for model comparison
│   ├── formatting.py      # MyST formatting fidelity checks (directive balance, punctuation, etc.)
│   ├── glossary.py        # Glossary compliance + reference overlap scoring
│   ├── judgments.py       # Judgment persistence + Elo update orchestration
│   └── xp.py              # XP tracking per user
├── providers/
│   ├── base.py            # Abstract TranslationProvider + TranslationResult
│   ├── claude.py          # Anthropic Claude provider
│   ├── openai.py          # OpenAI provider
│   └── prompts.py         # Prompt template loading/validation (supports {glossary} placeholder)
└── utils/
    ├── dataset.py         # Load/save JSON data, config, domain list, glossary loading
    ├── display.py         # Rich console singleton
    └── github.py          # get_github_username() via gh CLI (cached)
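The Pydantic models in models.py act as small validated records shared by every command. A minimal sketch of what a `Term` model could look like (field names and defaults here are illustrative assumptions, not the actual schema):

```python
from pydantic import BaseModel


class Term(BaseModel):
    """One bilingual terminology entry (illustrative fields, not the real schema)."""
    en: str                     # English term
    zh: str                     # Chinese translation
    domain: str                 # e.g. "economics"
    difficulty: str = "medium"  # defaulted when omitted


# Valid entry: constructed and type-checked in one step.
term = Term(en="supply curve", zh="供给曲线", domain="economics")

# Invalid entry (missing required field) raises a ValidationError.
```

The same model definition doubles as documentation of the JSON format, which is the design intent noted under "Pydantic for schemas" below.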

Data Flow

Translation Session

User runs: qebench translate -n 5
    │
    ▼
github.py ──→ auto-detects username via `gh api user`
    │
    ▼
dataset.py ──→ loads terms/sentences/paragraphs from data/**/*.json
    │           (merges _seed_*.json + per-user files)
    ▼
translate.py ──→ picks entries, presents English, collects Chinese
    │
    ├──→ confidence prompt ──→ 1–5 rating of translator certainty
    ├──→ notes prompt      ──→ optional context / reasoning
    ├──→ _reference_panel() ─→ shows reference (educational, no score)
    ├──→ _save_attempt()   ──→ appends to results/translations/{username}.jsonl
    │                        (includes cli_version for schema migration)
    └──→ xp.award_xp()    ──→ updates results/xp/{username}.json
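The tail of this flow, appending one attempt per line to the user's JSONL file, can be sketched as follows (the exact record fields beyond `cli_version` are assumptions; `save_attempt` here stands in for the real `_save_attempt`):

```python
import json
import time
from pathlib import Path


def save_attempt(username: str, record: dict,
                 root: Path = Path("results/translations")) -> Path:
    """Append one translation attempt as a JSON line to the user's file."""
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{username}.jsonl"
    # Stamp each record; cli_version supports later schema migration.
    full = {"timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
            "cli_version": "0.0.0", **record}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(full, ensure_ascii=False) + "\n")
    return path
```

Append-only JSONL means concurrent sessions by the same user never rewrite earlier attempts.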

Add Entry

User runs: qebench add
    │
    ▼
add.py ──→ questionary prompts for entry type + fields
    │
    ▼
models.py ──→ validates entry via Pydantic
    │
    ▼
add.py ──→ _save_to_user_file() ──→ appends to data/terms/{username}.json
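The final append step differs from the translation flow: per-user data files are JSON arrays, not JSONL, so the file is read, extended, and rewritten. A sketch (`save_to_user_file` is a stand-in name for the real `_save_to_user_file`):

```python
import json
from pathlib import Path


def save_to_user_file(entry: dict, path: Path) -> int:
    """Append a validated entry to a per-user JSON array file; returns new count."""
    items = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    items.append(entry)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Pretty-print so the file stays easy to inspect and diff in git.
    path.write_text(json.dumps(items, ensure_ascii=False, indent=2) + "\n",
                    encoding="utf-8")
    return len(items)
```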

Submit Results

User runs: qebench submit
    │
    ▼
submit.py ──→ git pull --rebase
    │       ──→ git add data/ results/
    │       ──→ git commit -m "benchmark: add data ..."
    │       ──→ git push
    ▼
Dashboard CI rebuilds automatically on push
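The submit workflow is a fixed sequence of git commands, so it can be expressed as data plus a loop. A sketch with an injectable runner (the commit message is abbreviated from the diagram; the injectable `run` parameter is an addition for testability, not the real signature):

```python
import subprocess

SUBMIT_STEPS = [
    ["git", "pull", "--rebase"],
    ["git", "add", "data/", "results/"],
    ["git", "commit", "-m", "benchmark: add data"],
    ["git", "push"],
]


def submit(run=subprocess.run) -> None:
    """Run the submit workflow; check=True aborts on the first failing step."""
    for cmd in SUBMIT_STEPS:
        run(cmd, check=True)
```

Pulling with `--rebase` first keeps per-user files linear and avoids merge commits in the shared repo.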

Export & Dashboard

User (or CI) runs: qebench export
    │
    ▼
export.py ──→ loads all data + results
    │       ──→ computes coverage, domain stats, difficulty stats,
    │           leaderboard, activity feed, term samples
    │       ──→ writes 6 JSON files to docs/_static/dashboard/data/
    ▼
MyST build + gh-pages deploys the dashboard
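The write step at the end of this flow is one JSON file per computed payload. A sketch (the payload names in the example are illustrative, not the tool's actual set of six):

```python
import json
from pathlib import Path


def export_dashboard(payloads: dict, outdir: Path) -> list:
    """Write each payload dict to its own JSON file under the dashboard data dir."""
    outdir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, payload in payloads.items():
        path = outdir / f"{name}.json"
        path.write_text(json.dumps(payload, ensure_ascii=False, indent=2),
                        encoding="utf-8")
        written.append(path)
    return written
```

Static JSON files keep the dashboard a pure Chart.js frontend with no server component.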

Scoring Module

Formatting Fidelity (scoring/formatting.py)

Automated checks that verify structural integrity of LLM translations. These are invoked by qebench judge and displayed in the reveal panel.

| Function | Returns | What it checks |
|---|---|---|
| `check_directive_balance(source, translated)` | `bool` | Fence count (`` ``` ``) matches between source and translation |
| `check_fence_consistency(translated)` | `bool` | No mixed `$$` / `` ```{math} `` markers |
| `check_code_block_integrity(source, translated)` | `bool` | Code blocks preserved verbatim |
| `check_fullwidth_punctuation(text)` | `float` | Fraction (0–1) of punctuation that is fullwidth (,。!?) |
| `check_directive_spacing(text)` | `float` | Fraction (0–1) of CJK→directive boundaries with proper spacing |
| `formatting_score(source, translated)` | `dict` | Runs all checks, returns per-check results |

Glossary Loading (utils/dataset.py)

The load_glossary() function fetches the glossary from config.yaml’s glossary_path (URL or local path):

load_glossary(force_refresh=False)
    │
    ├── glossary_path is URL?
    │     ├── Fetch via urllib.request.urlopen()
    │     ├── Cache to .cache/glossary.json
    │     └── On network failure: fall back to cache
    │
    └── glossary_path is local path?
          └── Read directly
    │
    ▼
_extract_glossary_terms(data)
    └── Parses glossary JSON → list[dict] (each dict has en + zh-cn keys)
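The URL-with-cache-fallback branch above can be sketched as follows (function names follow the diagram; the glossary JSON shape, a list of `en`/`zh-cn` dicts, is assumed from the description):

```python
import json
import urllib.request
from pathlib import Path


def load_glossary(glossary_path: str, cache: Path = Path(".cache/glossary.json")) -> list:
    """Fetch a URL-or-local-path glossary, falling back to cache on network failure."""
    if glossary_path.startswith(("http://", "https://")):
        try:
            with urllib.request.urlopen(glossary_path) as resp:
                raw = resp.read().decode("utf-8")
            cache.parent.mkdir(parents=True, exist_ok=True)
            cache.write_text(raw, encoding="utf-8")  # refresh the cache
        except OSError:
            raw = cache.read_text(encoding="utf-8")  # offline: use last fetch
    else:
        raw = Path(glossary_path).read_text(encoding="utf-8")
    return extract_glossary_terms(json.loads(raw))


def extract_glossary_terms(data) -> list:
    """Keep only complete entries with both en and zh-cn keys."""
    return [t for t in data if "en" in t and "zh-cn" in t]
```

Catching `OSError` covers `urllib.error.URLError`, which subclasses it, so DNS failures and timeouts both fall back to the cache.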

Prompt Template System (providers/prompts.py)

Templates use {placeholder} syntax. Required placeholders: {source_lang}, {target_lang}, {domain}, {text}.

Optional placeholder: {glossary} — auto-populated from the glossary when present in a template. Double braces {{...}} are treated as literal braces (e.g., {{math}} renders as {math} in the final prompt).
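Python's `str.format` gives exactly these semantics out of the box: `{name}` substitutes and `{{...}}` escapes to literal braces. A hedged sketch of a renderer (`render_prompt` and `REQUIRED` are hypothetical names, not the module's real API):

```python
REQUIRED = {"source_lang", "target_lang", "domain", "text"}


def render_prompt(template: str, glossary_block: str = "", **fields) -> str:
    """Fill a {placeholder} template; {glossary} is injected only when present."""
    missing = REQUIRED - fields.keys()
    if missing:
        raise ValueError(f"missing placeholders: {sorted(missing)}")
    if "{glossary}" in template:
        fields["glossary"] = glossary_block
    # str.format turns {{math}} into the literal {math} required by MyST.
    return template.format(**fields)
```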

Key Design Decisions

JSON files over SQLite

Git-friendly, transparent, easy for RAs to inspect and edit manually. Per-user files avoid merge conflicts when multiple RAs work simultaneously.

Per-user data files

Each contributor gets their own file (data/terms/{username}.json). Seed data uses the _seed_ prefix (data/terms/_seed_economics.json). All files are loaded together at runtime via glob — the distinction is purely organizational.
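The runtime merge via glob might look like this minimal sketch (the function name is illustrative):

```python
import json
from pathlib import Path


def load_all_terms(terms_dir: Path) -> list:
    """Merge every *.json in the directory; seed and per-user files are equals."""
    merged = []
    for path in sorted(terms_dir.glob("*.json")):  # _seed_* sorts before usernames
        merged.extend(json.loads(path.read_text(encoding="utf-8")))
    return merged
```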

GitHub identity via gh CLI

Username auto-detected with gh api user --jq .login (cached with lru_cache). No manual --user flags needed. Requires gh auth login as a one-time setup.
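A sketch of the cached lookup (the `fetch` parameter is an addition for offline testability, not part of the real signature):

```python
import subprocess
from functools import lru_cache


def _gh_login() -> str:
    """Shell out to `gh api user --jq .login`; requires `gh auth login` once."""
    out = subprocess.run(["gh", "api", "user", "--jq", ".login"],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()


@lru_cache(maxsize=None)
def get_github_username(fetch=_gh_login) -> str:
    """Cached so the gh CLI is invoked at most once per process per fetcher."""
    return fetch()
```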

Pydantic for schemas

Type safety + auto JSON Schema generation + validation in one place. Models serve double duty as the validation layer and the documentation of the data format.

Similarity as a trigger, not a grade

Character-level Jaccard similarity (_char_overlap) is computed for each translation, but it’s used as an informational metric and a trigger: when similarity falls below 85%, the user is prompted for why their translation differs (formal/informal register, regional preference, context, abbreviation, alternative technical term, etc.). This captures the variation and the reasoning behind it — the most valuable data for improving the translator.
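Character-level Jaccard similarity compares the character *sets* of the two strings. A sketch of the metric and the 85% trigger (`char_overlap` stands in for the real `_char_overlap`; the helper name `needs_explanation` is illustrative):

```python
def char_overlap(a: str, b: str) -> float:
    """Jaccard similarity over character sets: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty strings are identical
    return len(sa & sb) / len(sa | sb)


SIMILARITY_TRIGGER = 0.85  # below this, ask the translator why they differ


def needs_explanation(user: str, reference: str) -> bool:
    return char_overlap(user, reference) < SIMILARITY_TRIGGER
```

Because CJK characters carry more information per character than Latin letters, set overlap is a rough but cheap proxy, which is precisely why it is used as a trigger rather than a grade.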

XP stored per-user in JSON

Each user gets a separate file (results/xp/{username}.json). Avoids write conflicts when multiple RAs work simultaneously. Aggregation happens at display time.

Recursive add() for “add another”

The add command calls itself recursively to allow adding multiple entries in one session without restarting. Simple and works well for CLI UX.

Directory Layout

benchmark.translate-zh-cn/
├── data/
│   ├── terms/
│   │   ├── _seed_economics.json   # Seeded terms (read-only, by domain)
│   │   ├── _seed_mathematics.json
│   │   ├── ...                    # 15 seed files total, 314 terms
│   │   └── {username}.json        # Per-user contributions
│   ├── sentences/
│   │   ├── _seed_lectures.json    # 80 sentences from lecture repos
│   │   └── {username}.json
│   └── paragraphs/
│       ├── _seed_lectures.json    # 17 paragraphs with math/code/directives
│       └── {username}.json
├── prompts/
│   ├── default.txt                # General-purpose translation prompt
│   ├── academic.txt               # Academic register emphasis
│   ├── action-basic.txt           # MyST-aware rules (no glossary)
│   └── action-new.txt             # MyST-aware rules + glossary injection
├── scripts/
│   ├── seed_from_glossary.py      # Seed terms from action-translation glossary
│   ├── seed_from_lectures.py      # Seed sentences/paragraphs from lecture repos
│   └── classify_difficulty.py     # Auto-classify term difficulty
├── results/
│   ├── translations/              # User translation attempts (JSONL per user)
│   ├── model-outputs/             # LLM translations (JSONL per model×prompt)
│   ├── judgments/                 # Judge results (JSONL per user)
│   ├── xp/                        # XP totals per user (JSON per user)
│   └── elo.json                   # Model Elo ratings
├── .cache/
│   ├── glossary.json              # Cached glossary from action-translation
│   └── lectures/                  # Cloned lecture repos (gitignored)
├── docs/
│   ├── _static/dashboard/         # Chart.js dashboard + exported JSON
│   └── ...                        # MyST documentation
├── config.yaml                    # Language pair, domains, targets, glossary URL
├── REVIEW.md                      # Design review & gap analysis
└── src/qebench/                   # Python package (see Module Map above)

Configuration

All language-specific settings live in config.yaml:

language_pair:
  source: en
  target: zh-cn

domains:
  - economics
  - mathematics
  - statistics
  # ...

targets:
  terms: 500
  sentences: 100
  paragraphs: 30

The CLI code is language-agnostic — it reads domain lists and targets from config at runtime. This makes it possible to extract the tool for other language pairs later.
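As a concrete instance of reading targets from config at runtime, coverage percentages like those the stats command reports reduce to one expression. The seed counts below (314 terms, 80 sentences, 17 paragraphs) come from the directory layout above; the function name is illustrative:

```python
def coverage(counts: dict, targets: dict) -> dict:
    """Percent progress toward each config.yaml target, rounded to one decimal."""
    return {k: round(100 * counts.get(k, 0) / t, 1) for k, t in targets.items()}


targets = {"terms": 500, "sentences": 100, "paragraphs": 30}
seeded = {"terms": 314, "sentences": 80, "paragraphs": 17}
```

Because both dicts come from files, swapping in another language pair's config changes the numbers without touching the code.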