
CLI Reference

All commands available in qebench, organized by the typical daily workflow.

qebench update

Pull the latest code, data, and dependencies from GitHub, then enrich term contexts from QuantEcon lecture repos. Run this at the start of every session to ensure you have everyone’s latest contributions and any CLI updates.

uv run qebench update

No options — it runs three steps:

  1. Pull — git pull --rebase to get the latest code and data

  2. Sync — uv sync to install any new or updated dependencies

  3. Enrich — clone/update QuantEcon lecture repos into .cache/lectures/ and add context sentences to terms that don’t have them yet

The enrichment step scans four lecture repositories for sentences that use each term, storing up to 5 example sentences per term. These context sentences are shown during qebench translate to help you choose the right Chinese translation. The lecture repos are cached locally (shallow clones, gitignored) so subsequent runs only pull changes.
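The scan can be pictured as follows. This is a minimal sketch, not the actual qebench implementation; the function name, the sentence-splitting regex, and the lecture-text input are all illustrative. Only the cap of 5 sentences per term comes from the description above.

```python
import re

MAX_CONTEXTS = 5  # qebench stores up to 5 example sentences per term


def find_context_sentences(term: str, lecture_text: str,
                           limit: int = MAX_CONTEXTS) -> list[str]:
    """Collect up to `limit` sentences from lecture text that use `term`."""
    # Naive sentence split on end punctuation; the real tool may be more robust.
    sentences = re.split(r"(?<=[.!?])\s+", lecture_text)
    matches = [s.strip() for s in sentences if term.lower() in s.lower()]
    return matches[:limit]
```

For example, scanning a lecture paragraph for "equilibrium" keeps only the sentences that mention it, up to the cap.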

If you’re already up to date, it tells you so. If the pull fails (e.g. you have uncommitted changes), resolve them first, then try again.


qebench stats

Show dataset coverage, domain breakdown, and progress toward targets.

uv run qebench stats

No options — always shows the full dataset overview: coverage counts, per-domain breakdown, and progress toward targets.


qebench translate

Collect human translations. Presents English text, collects your Chinese translation and a confidence rating, then reveals the reference for learning. Every translation — including ones that differ from the reference — is valuable data for understanding translation variation.

uv run qebench translate [OPTIONS]
| Option | Short | Default | Description |
|---|---|---|---|
| `--count` | `-n` | 5 | Number of entries per session |
| `--domain` | `-d` | all | Filter by domain (e.g. `economics`) |
| `--difficulty` | | all | Filter: `basic`, `intermediate`, or `advanced` |

Your GitHub username is detected automatically via gh auth.

Examples:

# Quick 3-term session on economics
uv run qebench translate -n 3 -d economics

# Practice advanced terms
uv run qebench translate --difficulty advanced

# Default session (5 random entries)
uv run qebench translate

What’s recorded per entry: your Chinese translation, your confidence rating, and your GitHub username.

Divergent translations are valuable — they help us understand cultural nuance and variation.

For terms that have context sentences (populated by qebench update), a random example sentence from a QuantEcon lecture is shown alongside the term. This helps you understand how the term is used in practice and choose the most appropriate Chinese translation.

Each completed entry earns 10 XP. A cli_version field is automatically saved with every record for future schema migration.
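A saved record can be sketched as a small JSON object. The field names below are hypothetical — only the pieces named in this section (your translation, your confidence rating, your GitHub username, and the `cli_version` stamp) are grounded in the docs; the actual schema may differ.

```python
import json

# Hypothetical shape of a saved translation record; real field names may differ.
record = {
    "entry_id": "term-0042",   # illustrative entry ID
    "translation": "均衡",      # your Chinese translation
    "confidence": 4,           # your self-reported confidence rating
    "user": "octocat",         # GitHub username detected via gh auth
    "cli_version": "0.3.0",    # stamped on every record for schema migration
}
print(json.dumps(record, ensure_ascii=False))
```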


qebench add

Contribute new terms, sentences, or paragraphs to the dataset through interactive prompts.

uv run qebench add

No options — the command walks you through the process:

  1. Choose entry type — term, sentence, or paragraph

  2. Fill in fields — English text, Chinese translation, domain, difficulty, etc.

  3. Preview — see a summary before saving

  4. Confirm — save to the appropriate domain JSON file

  5. Continue? — option to add another entry

Each contributed entry earns 15 XP. A cli_version field is automatically saved with every entry.


qebench judge

Judge anonymous translations head-to-head. It shows two translations of the same source text; you rate each on accuracy and fluency, then pick a winner. Results update Elo ratings for the models.

uv run qebench judge                       # Default: 10 rounds
uv run qebench judge -n 5                  # Quick 5-round session
uv run qebench judge -d economics          # Filter to economics entries

Options

| Option | Short | Default | Description |
|---|---|---|---|
| `--count` | `-n` | 10 | Number of rounds per session |
| `--domain` | `-d` | all | Filter by domain |

Prerequisites

Model outputs must exist in results/model-outputs/. Generate them with qebench run first.

How It Works

  1. Entries are paired with model translations from results/model-outputs/

  2. Two translations are shown anonymously as A and B

  3. You rate each on accuracy (1–10) and fluency (1–10)

  4. You pick a winner (A, B, tie, or neither)

  5. Elo ratings are updated; results saved to results/judgments/

Pick Tie if both translations are equally good. Pick Neither if both translations are poor and neither is acceptable.

If two models have translated the same entry, they’re paired directly. If only one model has output, it’s paired against the human reference. Identical translation pairs are automatically skipped.
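The rating update follows the standard Elo formula. The K-factor below and the scoring of "neither" are assumptions — qebench’s exact parameters aren’t documented here — but the shape of the update is the textbook one.

```python
def elo_update(r_a: float, r_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update. score_a: 1.0 = A wins, 0.0 = B wins, 0.5 = tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

With equal ratings, a win moves 16 points from the loser to the winner, and a tie leaves both unchanged.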

Each judgment earns 5 XP.


qebench submit

Pull latest changes, commit your data and results, and push to GitHub. This is the primary way to share your contributions.

uv run qebench submit

No options — it handles the full git workflow:

  1. Pull — git pull --rebase to get latest changes

  2. Stage — adds data/ and results/ directories

  3. Commit — creates a commit attributed to your GitHub username

  4. Push — pushes to main, which triggers a dashboard rebuild

If there are no local changes in data/ or results/, it exits early.


qebench doctor

Run preflight checks to verify your environment is set up correctly.

uv run qebench doctor

Checks performed:

Run this once after initial setup, or whenever something seems wrong.


qebench validate

Validate all dataset JSON files against the Pydantic schemas. Useful for checking your contributed entries before submitting.

uv run qebench validate

Checks every file in data/terms/, data/sentences/, and data/paragraphs/ against the corresponding model (Term, Sentence, Paragraph). Reports all validation errors with file names and entry IDs, then exits non-zero if any were found.
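The validation loop can be sketched like this. The real command uses the Pydantic models (Term, Sentence, Paragraph); the plain dict checks and required-field names below are stand-ins to show the shape of the loop, collecting every error before reporting.

```python
import json
from pathlib import Path

REQUIRED = {"id", "english", "chinese", "domain"}  # illustrative required fields


def validate_dir(directory: Path) -> list[str]:
    """Return human-readable errors for every JSON file in `directory`."""
    errors = []
    for path in sorted(directory.glob("*.json")):
        entries = json.loads(path.read_text(encoding="utf-8"))
        for entry in entries:
            missing = REQUIRED - entry.keys()
            if missing:
                errors.append(
                    f"{path.name}: entry {entry.get('id', '?')} "
                    f"missing {sorted(missing)}"
                )
    return errors  # caller exits non-zero if this is non-empty
```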

This also runs automatically in CI on every push and PR.


qebench run

Batch translate dataset entries using an LLM provider. Results are saved to results/model-outputs/ as JSONL files.

uv run qebench run                             # Default: claude, all terms
uv run qebench run --provider openai            # Use OpenAI
uv run qebench run --model gpt-5.4-mini         # Override model
uv run qebench run --prompt academic            # Use academic prompt template
uv run qebench run --type sentences --domain economics  # Filter entries
uv run qebench run --count 10 --dry-run         # Preview without API calls

Options

| Option | Default | Description |
|---|---|---|
| `--provider`, `-p` | `claude` | LLM provider: `claude`, `openai` |
| `--model`, `-m` | (provider default) | Override the default model (Claude: `claude-sonnet-4-6`, OpenAI: `gpt-5.4`) |
| `--prompt` | `default` | Prompt template name from `prompts/` |
| `--count`, `-n` | 0 (all) | Max entries to translate |
| `--domain`, `-d` | (all) | Filter entries by domain |
| `--type`, `-t` | `terms` | Entry type: `terms`, `sentences`, `paragraphs` |
| `--dry-run` | false | Preview entries without calling the API |

Prerequisites

Install LLM dependencies:

uv sync --extra llm

Set your API key via environment variable (ANTHROPIC_API_KEY or OPENAI_API_KEY).
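Results are JSONL: one JSON object per line. A minimal reader for such a file might look like this — the record fields inside each line are whatever `qebench run` writes, so none are assumed here.

```python
import json
from pathlib import Path


def load_model_outputs(path: Path) -> list[dict]:
    """Parse a JSONL file: one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Blank lines are skipped, so files remain loadable even if a writer leaves trailing newlines.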


qebench export

Export dataset statistics and results to JSON files for the dashboard website.

uv run qebench export

Writes 6 JSON files to docs/_static/dashboard/data/:

This is run automatically by CI when changes are pushed.


XP System

Actions earn experience points tracked per user:

| Action | XP per item |
|---|---|
| Translate an entry | 10 |
| Add a new entry | 15 |
| Judge a comparison | 5 |

XP is stored in results/xp/{username}.json and shown at the end of each session.
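The bookkeeping amounts to a per-user JSON tally. The per-action values below come from the table above; the file layout (a single `total_xp` field) is an assumption — the real `results/xp/{username}.json` may store more detail.

```python
import json
from pathlib import Path

# XP values from the table above; file shape is an illustrative assumption.
XP_PER_ACTION = {"translate": 10, "add": 15, "judge": 5}


def award_xp(xp_file: Path, action: str, count: int = 1) -> int:
    """Add XP for `action` to the user's tally and return the new total."""
    data = json.loads(xp_file.read_text()) if xp_file.exists() else {"total_xp": 0}
    data["total_xp"] += XP_PER_ACTION[action] * count
    xp_file.write_text(json.dumps(data))
    return data["total_xp"]
```

So a session of two translations followed by one judgment would end at 25 XP.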