All commands available in qebench, organized by the typical daily workflow.
## qebench update
Pull the latest code, data, and dependencies from GitHub, then enrich term contexts from QuantEcon lecture repos. Run this at the start of every session to ensure you have everyone’s latest contributions and any CLI updates.
```bash
uv run qebench update
```

No options — it runs three steps:
1. **Pull** — `git pull --rebase` to get the latest code and data
2. **Sync** — `uv sync` to install any new or updated dependencies
3. **Enrich** — clone/update QuantEcon lecture repos into `.cache/lectures/` and add context sentences to terms that don't have them yet
The enrichment step scans four lecture repositories for sentences that use
each term, storing up to 5 example sentences per term. These context sentences
are shown during qebench translate to help you choose the right Chinese
translation. The lecture repos are cached locally (shallow clones, gitignored)
so subsequent runs only pull changes.
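The enrichment pass can be approximated with a small helper. The function name and the naive sentence-splitting heuristic below are illustrative only, not qebench's actual implementation:

```python
import re

def find_context_sentences(text: str, term: str, limit: int = 5) -> list[str]:
    """Collect up to `limit` sentences that mention `term`, case-insensitively."""
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    matches = []
    for sentence in sentences:
        if term.lower() in sentence.lower():
            matches.append(sentence.strip())
            if len(matches) >= limit:
                break
    return matches

lecture = (
    "The Bellman equation is central to dynamic programming. "
    "We solve the Bellman equation by value function iteration. "
    "Other topics follow."
)
print(find_context_sentences(lecture, "Bellman equation"))
```

The real enrichment step does this across four lecture repositories and stores the results alongside each term.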
If already up to date, it tells you so. If the pull fails (e.g. you have uncommitted changes), resolve them first then try again.
## qebench stats
Show dataset coverage, domain breakdown, and progress toward targets.
```bash
uv run qebench stats
```

Output includes:
- Progress bars for terms, sentences, and paragraphs vs. targets
- Domain breakdown table with entry counts
- XP leaderboard ranked by total XP (with translate/add/judge breakdown)
- Total entries summary
No options — always shows the full dataset overview.
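The progress bars reduce to a fraction-to-width mapping. A rough sketch, not the actual rendering code:

```python
def progress_bar(count: int, target: int, width: int = 30) -> str:
    """Render a text progress bar in the spirit of qebench stats output."""
    frac = min(count / target, 1.0) if target else 1.0
    filled = int(frac * width)
    bar = "#" * filled + "-" * (width - filled)
    return f"[{bar}] {count}/{target} ({frac:.0%})"

print(progress_bar(120, 300))
```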
## qebench translate
Collect human translations. Presents English text, collects your Chinese translation and a confidence rating, then reveals the reference for learning. Every translation — including ones that differ from the reference — is valuable data for understanding translation variation.
```bash
uv run qebench translate [OPTIONS]
```

| Option | Short | Default | Description |
|---|---|---|---|
| --count | -n | 5 | Number of entries per session |
| --domain | -d | all | Filter by domain (e.g. economics) |
| --difficulty | | all | Filter: basic, intermediate, or advanced |
Your GitHub username is detected automatically via gh auth.
Examples:
```bash
# Quick 3-term session on economics
uv run qebench translate -n 3 -d economics

# Practice advanced terms
uv run qebench translate --difficulty advanced

# Default session (5 random entries)
uv run qebench translate
```

What’s recorded per entry:
- Your Chinese translation
- Confidence level (1–5)
- Character similarity to the reference (informational, not a grade)
- If your translation differs: the reason why (formal/informal register, regional preference, contextual, abbreviation, alternative technical term, or other)
- Optional notes for further explanation
Divergent translations are valuable — they help us understand cultural nuance and variation.
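The docs don't specify which similarity metric is used. One common standard-library choice is `difflib.SequenceMatcher`, sketched here as an assumption:

```python
from difflib import SequenceMatcher

def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()

print(char_similarity("边际成本", "边际成本"))  # → 1.0
print(char_similarity("边际成本", "边际费用"))  # → 0.5 (only 边际 matches)
```

Whatever the actual metric, the score is informational: a low score with a good reason recorded is still useful data.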
For terms that have context sentences (populated by qebench update),
a random example sentence from a QuantEcon lecture is shown alongside the
term. This helps you understand how the term is used in practice and choose
the most appropriate Chinese translation.
Each completed entry earns 10 XP. A cli_version field is automatically saved with every record for future schema migration.
## qebench add
Contribute new terms, sentences, or paragraphs to the dataset through interactive prompts.
```bash
uv run qebench add
```

No options — the command walks you through the process:
1. **Choose entry type** — term, sentence, or paragraph
2. **Fill in fields** — English text, Chinese translation, domain, difficulty, etc.
3. **Preview** — see a summary before saving
4. **Confirm** — save to the appropriate domain JSON file
5. **Continue?** — option to add another entry
Each contributed entry earns 15 XP. A cli_version field is automatically saved with every entry.
## qebench judge
Judge anonymous translations head-to-head. Shows two translations of the same source text, you rate each on accuracy and fluency, then pick a winner. Results update Elo ratings for the models.
```bash
uv run qebench judge                # Default: 10 rounds
uv run qebench judge -n 5           # Quick 5-round session
uv run qebench judge -d economics   # Filter to economics entries
```

### Options
| Option | Short | Default | Description |
|---|---|---|---|
| --count | -n | 10 | Number of rounds per session |
| --domain | -d | all | Filter by domain |
### Prerequisites
Model outputs must exist in results/model-outputs/. Generate them with qebench run first.
### How It Works
1. Entries are paired with model translations from `results/model-outputs/`
2. Two translations are shown anonymously as A and B
3. You rate each on accuracy (1–10) and fluency (1–10)
4. You pick a winner (A, B, tie, or neither)
5. Elo ratings are updated; results are saved to `results/judgments/`
Pick Tie if both translations are equally good. Pick Neither if both translations are poor and neither is acceptable.
If two models have translated the same entry, they’re paired directly. If only one model has output, it’s paired against the human reference. Identical translation pairs are automatically skipped.
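The rating change follows the standard Elo update rule. The K-factor of 32 below is a conventional default, an assumption rather than qebench's documented value:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update. score_a is 1.0 (A wins), 0.0 (B wins), or 0.5 (tie)."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

print(elo_update(1500, 1500, 1.0))  # → (1516.0, 1484.0)
```

A tie between equally rated models leaves both ratings unchanged; an upset win moves more points than an expected win.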
Each judgment earns 5 XP.
## qebench submit
Pull latest changes, commit your data and results, and push to GitHub. This is the primary way to share your contributions.
```bash
uv run qebench submit
```

No options — it handles the full git workflow:
1. **Pull** — `git pull --rebase` to get the latest changes
2. **Stage** — adds the `data/` and `results/` directories
3. **Commit** — creates a commit attributed to your GitHub username
4. **Push** — pushes to `main`, which triggers a dashboard rebuild
If there are no local changes in data/ or results/, it exits early.
## qebench doctor
Run preflight checks to verify your environment is set up correctly.
```bash
uv run qebench doctor
```

Checks performed:
- GitHub CLI (`gh`) installed
- GitHub authentication configured
- Git installed and inside a repo
- Remote origin configured
- `config.yaml` found
- Dataset has entries
- `uv` package manager available
Run this once after initial setup, or whenever something seems wrong.
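A couple of these checks are easy to reproduce with the standard library. A rough sketch, not qebench's actual code:

```python
import shutil
import subprocess

def check_tool(name: str) -> bool:
    """True if an executable named `name` is on PATH (e.g. gh, git, uv)."""
    return shutil.which(name) is not None

def inside_git_repo() -> bool:
    """True if the current directory is inside a git work tree."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--is-inside-work-tree"],
            capture_output=True, text=True,
        )
        return out.stdout.strip() == "true"
    except FileNotFoundError:
        return False  # git itself is not installed

for tool in ("git", "gh", "uv"):
    print(f"{tool}: {'ok' if check_tool(tool) else 'missing'}")
```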
## qebench validate
Validate all dataset JSON files against the Pydantic schemas. Useful for checking your contributed entries before submitting.
```bash
uv run qebench validate
```

Checks every file in `data/terms/`, `data/sentences/`, and `data/paragraphs/`
against the corresponding model (Term, Sentence, Paragraph). Reports all
validation errors with file names and entry IDs, then exits non-zero if any
were found.
This also runs automatically in CI on every push and PR.
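The real command validates against the Pydantic Term, Sentence, and Paragraph models; the stdlib sketch below only approximates the idea, with a hypothetical set of required keys:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical required fields for a term entry; the real schema is the Pydantic Term model.
REQUIRED_TERM_KEYS = {"id", "english", "chinese", "domain", "difficulty"}

def validate_file(path: Path) -> list[str]:
    """Return a human-readable error per entry missing required keys."""
    errors = []
    entries = json.loads(path.read_text(encoding="utf-8"))
    for entry in entries:
        missing = REQUIRED_TERM_KEYS - entry.keys()
        if missing:
            errors.append(f"{path.name}: entry {entry.get('id', '?')} missing {sorted(missing)}")
    return errors

# Demo against a temporary file instead of data/terms/:
sample = [
    {"id": "t1", "english": "utility", "chinese": "效用",
     "domain": "economics", "difficulty": "basic"},
    {"id": "t2", "english": "elasticity"},  # deliberately incomplete
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)
    tmp = Path(f.name)
print(validate_file(tmp))
```

Like the real command, the sketch reports every error rather than stopping at the first one.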
## qebench run
Batch translate dataset entries using an LLM provider. Results are saved to
results/model-outputs/ as JSONL files.
```bash
uv run qebench run                                      # Default: claude, all terms
uv run qebench run --provider openai                    # Use OpenAI
uv run qebench run --model gpt-5.4-mini                 # Override model
uv run qebench run --prompt academic                    # Use academic prompt template
uv run qebench run --type sentences --domain economics  # Filter entries
uv run qebench run --count 10 --dry-run                 # Preview without API calls
```

### Options
| Option | Default | Description |
|---|---|---|
| --provider, -p | claude | LLM provider: claude, openai |
| --model, -m | (provider default) | Override the default model (Claude: claude-sonnet-4-6, OpenAI: gpt-5.4) |
| --prompt | default | Prompt template name from prompts/ |
| --count, -n | 0 (all) | Max entries to translate |
| --domain, -d | (all) | Filter entries by domain |
| --type, -t | terms | Entry type: terms, sentences, paragraphs |
| --dry-run | false | Preview entries without calling the API |
### Prerequisites
Install LLM dependencies:
```bash
uv sync --extra llm
```

Set your API key via environment variable (`ANTHROPIC_API_KEY` or `OPENAI_API_KEY`).
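The output files follow the JSONL convention: one JSON record per line. A minimal reader sketch; the field names in the demo records are hypothetical, not the real output schema:

```python
import json
import tempfile
from pathlib import Path

def load_jsonl(path: Path) -> list[dict]:
    """Read one JSON object per line, skipping blank lines."""
    records = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Demo with invented field names; inspect a real file in results/model-outputs/ for the actual shape.
rows = [
    {"entry_id": "t1", "model": "claude", "translation": "效用"},
    {"entry_id": "t2", "model": "claude", "translation": "弹性"},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
    out_path = Path(f.name)

print(len(load_jsonl(out_path)))  # → 2
```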
## qebench export
Export dataset statistics and results to JSON files for the dashboard website.
```bash
uv run qebench export
```

Writes 6 JSON files to `docs/_static/dashboard/data/`:
- `coverage.json` — terms/sentences/paragraphs vs. targets
- `domains.json` — per-domain entry counts
- `difficulty.json` — basic/intermediate/advanced distribution
- `leaderboard.json` — XP rankings across users
- `activity.json` — recent translation attempts
- `samples.json` — sample terms for the browse section
This is run automatically by CI when changes are pushed.
## XP System
Actions earn experience points tracked per user:
| Action | XP per item |
|---|---|
| Translate an entry | 10 |
| Add a new entry | 15 |
| Judge a comparison | 5 |
XP is stored in results/xp/{username}.json and shown at the end of each session.
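Given the table above, a session's XP is a weighted sum of action counts. A sketch using the documented XP values (the on-disk format of results/xp/ is not shown here):

```python
# XP values from the table above.
XP_PER_ACTION = {"translate": 10, "add": 15, "judge": 5}

def session_xp(counts: dict[str, int]) -> int:
    """Total XP for a session given per-action counts."""
    return sum(XP_PER_ACTION[action] * n for action, n in counts.items())

# e.g. 3 translations, 1 new entry, 10 judgments:
print(session_xp({"translate": 3, "add": 1, "judge": 10}))  # → 95
```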