This tutorial walks you through a judge session — comparing anonymous translations side-by-side, scoring them, and building Elo ratings that show which translation approaches work best.
Prerequisites¶
You’ve completed Getting Started
Model outputs exist in results/model-outputs/. If they don’t, run qebench run first (see Running LLM Benchmarks).
Why Judge?¶
Human judgments are the gold standard for translation quality. qebench judge
pairs two translations of the same source text — from LLM models or the human
reference — and asks you to rate each on accuracy and fluency, then pick a
winner. Your judgments update Elo ratings that rank the models over time.
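For intuition, a single judgment could move two ratings as in the sketch below. It uses the textbook Elo update, which is not necessarily what qebench implements. The K-factor of 32, the 400-point scale, and the function names are assumptions for illustration, and this tutorial does not say how Tie or Neither outcomes are handled.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a draw.
    k=32 is an assumed K-factor, not a value taken from qebench.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two equally rated models where B wins: A drops k/2, B gains k/2.
    print(update_elo(1500.0, 1500.0, score_a=0.0))  # (1484.0, 1516.0)
```

Because the update depends on the rating gap, beating a higher-rated opponent moves the ratings more than beating a lower-rated one.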
Step 1: Update and Check Stats¶
Always start with the latest data:
uv run qebench update
uv run qebench stats
Step 2: Start a Judge Session¶
Run a quick 5-round session:
uv run qebench judge -n 5
You’ll see a session header:
╭──────── Judge Session ────────╮
│ Rounds: 5 Domain: all │
│ Models: 2 User: alice │
╰───────────────────────────────╯
Step 3: Read the Source Text¶
Each round shows the English source in a panel:
╭──── Judge (Round 1) TERM ────╮
│ │
│ inflation │
│ │
│ term-001 · economics · basic │
╰─────────────────────────────────╯
Read the source carefully before looking at the translations.
Step 4: Compare Translations A and B¶
Two translations appear side by side:
╭── Translation A ──╮ ╭── Translation B ──╮
│ 通货膨胀 │ │ 通胀 │
╰───────────────────╯ ╰───────────────────╯
The labels A and B are randomized, so you don’t know which model produced which translation until the reveal.
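A blind assignment of this kind can be sketched in a few lines. This is an illustration only: the function name and data layout are hypothetical and do not describe qebench's internals.

```python
import random


def assign_labels(candidates: dict[str, str], rng: random.Random) -> dict[str, tuple[str, str]]:
    """Map the blind labels 'A' and 'B' to (source_name, translation) in random order."""
    items = list(candidates.items())
    if len(items) != 2:
        raise ValueError("exactly two candidates are required")
    rng.shuffle(items)
    return {"A": items[0], "B": items[1]}


if __name__ == "__main__":
    pair = {"claude": "通货膨胀", "human": "通胀"}
    labels = assign_labels(pair, random.Random())
    # Show only the text while judging; the source names stay hidden until the reveal.
    for label, (_source, text) in labels.items():
        print(label, text)
```

Shuffling per round means the position of a translation carries no information about who produced it.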
Step 5: Pick the Winner¶
After comparing both translations, pick the overall winner:
Which is better overall?
A is better
❯ B is better
Tie — equally good
Neither — both are poor
If both are equally good, pick Tie. If both are poor and neither is acceptable, pick Neither. Don’t overthink it; go with your first instinct after reading both.
Step 6: Score Each Translation¶
If you picked A or B as the winner, you’ll be asked to rate each translation on two dimensions. (For Tie and Neither, scoring is skipped.)
Rate Translation A:
Accuracy (1-10): ▸ 9
Fluency (1-10): ▸ 8
Then rate Translation B the same way:
Rate Translation B:
Accuracy (1-10): ▸ 7
Fluency (1-10): ▸ 9
Accuracy means how faithfully the translation captures the meaning of the source. Fluency means how natural and readable the Chinese is.
Step 7: See the Reveal¶
After you pick a winner and enter any scores, the result panel reveals who produced each translation, who won, and the automated scores:
╭──────────── Result ────────────╮
│ A (claude) B (human) │
│ Winner B wins! │
│ Elo 1520 1480 │
│ Ref. overlap 85% 100% │
│ Glossary 90% 100% │
╰────────────────────────────────╯
Elo: model skill rating (higher = better track record)
Ref. overlap: character similarity to the reference translation
Glossary: percentage of key terms correctly translated
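To make the two automated scores concrete, here is one plausible way to compute them. The tutorial does not say which similarity measure or glossary matching rule qebench uses, so the difflib ratio, the substring check, the function names, and the sample strings are all assumptions.

```python
from difflib import SequenceMatcher


def ref_overlap(candidate: str, reference: str) -> float:
    """Character-level similarity in [0, 1] (difflib ratio, an assumed metric)."""
    return SequenceMatcher(None, candidate, reference).ratio()


def glossary_hit_rate(candidate: str, expected_terms: list[str]) -> float:
    """Fraction of expected target-language terms that appear in the candidate."""
    if not expected_terms:
        return 1.0  # nothing to check, so nothing was missed
    hits = sum(1 for term in expected_terms if term in candidate)
    return hits / len(expected_terms)


if __name__ == "__main__":
    # A candidate identical to the reference scores 100% overlap.
    print(f"{ref_overlap('通胀', '通胀'):.0%}")
    # Made-up sentence: one of the two expected terms is present.
    print(f"{glossary_hit_rate('通货膨胀率上升', ['通货膨胀', '利率']):.0%}")
```

Under this reading, a human reference compared against itself always shows 100% for both metrics, which matches the result panel above.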
Step 8: Complete the Session¶
After all rounds, you’ll see:
╭─── Session Summary ───╮
│ Rounds completed: 5/5 │
│ XP earned: +25 │
│ Total XP: 75 │
╰─────────────────────────╯
Each judgment earns 5 XP.
Step 9: Submit Your Judgments¶
Push your results to GitHub:
uv run qebench submit
Your judgments are saved in results/judgments/{your-username}.jsonl, and Elo ratings are updated in results/elo.json.
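If you want to inspect your own judgments file before submitting, a small reader like the one below may help. It relies only on the JSONL convention of one JSON object per line. The record fields are not documented in this tutorial, so the sketch prints whatever keys it finds instead of assuming a schema. The helper name and the username alice are illustrative.

```python
import json
from pathlib import Path


def load_judgments(path: Path) -> list[dict]:
    """Read one JSON object per non-empty line from a JSONL file."""
    records = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Replace "alice" with your own username.
    path = Path("results/judgments/alice.jsonl")
    if path.exists():
        records = load_judgments(path)
        print(f"{len(records)} judgments recorded")
        if records:
            # Print the field names rather than assuming a schema.
            print(sorted(records[0].keys()))
```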
Filtering by Domain¶
Focus your judgments on a specific domain:
uv run qebench judge -n 10 -d economics
This is useful when you have domain expertise: your ratings will be more precise for terms you know well.
Matchup Strategy¶
The judge system pairs translations intelligently:
2+ models translated the same entry → two models are paired
1 model translated an entry → model is paired against the human reference
0 models → entry is skipped (nothing to compare)
Identical pairs → automatically skipped (nothing to judge)
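The four rules above can be sketched as a single selection function. This is a hypothetical illustration, not qebench's code: the tutorial does not say how two models are chosen when more than two are available, so random sampling is an assumption, as are the function name and data layout.

```python
import random
from typing import Optional

Candidate = tuple[str, str]  # (source_name, translation)


def pick_matchup(
    model_outputs: dict[str, str],
    human_reference: str,
    rng: random.Random,
) -> Optional[tuple[Candidate, Candidate]]:
    """Return two candidates to compare, or None if the entry should be skipped."""
    if not model_outputs:
        return None  # 0 models: nothing to compare
    if len(model_outputs) >= 2:
        # 2+ models: pair two of them (random choice is an assumption).
        first, second = rng.sample(sorted(model_outputs), 2)
        pair = ((first, model_outputs[first]), (second, model_outputs[second]))
    else:
        # 1 model: pair it against the human reference.
        name, text = next(iter(model_outputs.items()))
        pair = ((name, text), ("human", human_reference))
    if pair[0][1] == pair[1][1]:
        return None  # identical translations: nothing to judge
    return pair


if __name__ == "__main__":
    print(pick_matchup({"claude": "通货膨胀"}, "通胀", random.Random()))
```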
You can exit a session early at any prompt by pressing Ctrl+C — completed rounds are saved.
Next Steps¶
Translate more entries: See Your First Translation Session to collect more data
Run more models: See Running LLM Benchmarks to generate model outputs
Check the leaderboard: qebench stats shows the XP leaderboard and dataset coverage