Tutorial: Judging Translations

This tutorial walks you through a judge session — comparing anonymous translations side-by-side, scoring them, and building Elo ratings that show which translation approaches work best.

Prerequisites

Why Judge?

Human judgments are the gold standard for translation quality. qebench judge pairs two translations of the same source text, each produced by an LLM or taken from the human reference, and asks you to pick a winner and then rate each translation on accuracy and fluency. Your judgments update Elo ratings that rank the translation sources over time.

Step 1: Update and Check Stats

Always start with the latest data:

uv run qebench update
uv run qebench stats

Step 2: Start a Judge Session

Run a quick 5-round session:

uv run qebench judge -n 5

You’ll see a session header:

╭──────── Judge Session ────────╮
│  Rounds: 5   Domain: all      │
│  Models: 2   User: alice      │
╰───────────────────────────────╯

Step 3: Read the Source Text

Each round shows the English source in a panel:

╭──── Judge  (Round 1)  TERM ─────╮
│                                 │
│  inflation                      │
│                                 │
│  term-001 · economics · basic   │
╰─────────────────────────────────╯

Read the source carefully before looking at the translations.

Step 4: Compare Translations A and B

Two translations appear side by side:

╭── Translation A ──╮   ╭── Translation B ──╮
│  通货膨胀          │   │  通胀              │
╰───────────────────╯   ╰───────────────────╯

The labels A and B are randomized — you don’t know which model produced which translation until the reveal.
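
The blind labeling described above can be sketched in a few lines of Python. This is only an illustration of the idea, not qebench's actual code: the function name blind_pair and the dictionary layout are hypothetical.

```python
import random


def blind_pair(candidates, rng=random):
    """Pick two translations and assign them to labels A and B at random.

    `candidates` maps a source name (hypothetical, e.g. "claude" or "human")
    to its translation. Returns the texts to show the judge plus a hidden
    key that is only used at reveal time.
    """
    # sample() returns the two chosen items in random order, so either
    # source can end up as A.
    (name1, text1), (name2, text2) = rng.sample(sorted(candidates.items()), 2)
    shown = {"A": text1, "B": text2}
    hidden_key = {"A": name1, "B": name2}
    return shown, hidden_key


shown, hidden_key = blind_pair({"claude": "通货膨胀", "human": "通胀"})
print(shown)       # what the judge sees
print(hidden_key)  # kept back until the reveal
```

Keeping the label-to-source mapping separate from the displayed text is what makes the comparison blind: nothing shown before the reveal identifies the source.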

Step 5: Pick the Winner

After comparing both translations, pick the overall winner:

Which is better overall?
  A is better
❯ B is better
  Tie — equally good
  Neither — both are poor

If both are equally good, pick Tie. If both are poor and neither is acceptable, pick Neither. Don’t overthink it — go with your first instinct after reading both.

Step 6: Score Each Translation

If you picked A or B as the winner, you’ll be asked to rate each translation on two dimensions. (For Tie and Neither, scoring is skipped.)

Rate Translation A:
  Accuracy (1-10): ▸ 9
  Fluency (1-10):  ▸ 8

Then rate Translation B the same way:

Rate Translation B:
  Accuracy (1-10): ▸ 7
  Fluency (1-10):  ▸ 9

Accuracy means how faithfully the translation captures the meaning. Fluency means how natural and readable the Chinese is.
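
One way to picture what a single round records is the small Python sketch below. The class and field names are hypothetical and the real storage format may differ. It only encodes the rules stated in this tutorial: four possible verdicts, 1-10 scores, and scores collected only when A or B wins.

```python
from dataclasses import dataclass
from typing import Optional

VERDICTS = {"A", "B", "tie", "neither"}  # labels are assumptions


@dataclass
class Scores:
    accuracy: int
    fluency: int

    def __post_init__(self):
        # Both dimensions use the 1-10 scale shown in the prompts.
        for name in ("accuracy", "fluency"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")


@dataclass
class Judgment:
    verdict: str
    scores_a: Optional[Scores] = None
    scores_b: Optional[Scores] = None

    def __post_init__(self):
        if self.verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {self.verdict}")
        has_both = self.scores_a is not None and self.scores_b is not None
        has_any = self.scores_a is not None or self.scores_b is not None
        # Scoring happens only when A or B wins; it is skipped otherwise.
        if self.verdict in {"A", "B"} and not has_both:
            raise ValueError("a winning verdict needs scores for both translations")
        if self.verdict in {"tie", "neither"} and has_any:
            raise ValueError("tie and neither rounds carry no scores")


# The round from the example prompts: B wins, A rated 9/8, B rated 7/9.
example = Judgment("B", Scores(accuracy=9, fluency=8), Scores(accuracy=7, fluency=9))
```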

Step 7: See the Reveal

After you finish the round, the result panel reveals which source produced each translation, who won, and the automated scores:

╭──────────── Result ────────────╮
│          A (claude)  B (human) │
│ Winner                B wins!  │
│ Elo         1520       1480    │
│ Ref. overlap  85%      100%    │
│ Glossary      90%      100%    │
╰────────────────────────────────╯
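
For intuition about how one judgment moves the ratings, here is the standard Elo update as a Python sketch. The tutorial does not state qebench's exact parameters, so the K-factor of 32 and the treatment of a tie as half a win are assumptions. It is also not stated whether the panel shows ratings before or after the update.

```python
def expected_score(rating, opponent):
    """Probability-like expectation that `rating` beats `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400))


def update_elo(rating_a, rating_b, score_a, k=32):
    """Return new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie (assumed).
    k=32 is a common default, not a documented qebench setting.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Using the panel's numbers as starting ratings, with B winning:
new_a, new_b = update_elo(1520, 1480, score_a=0.0)
print(round(new_a, 1), round(new_b, 1))
```

Because A started with the higher rating, losing costs it more than half of K, and B gains the same amount.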

Step 8: Complete the Session

After all rounds, you’ll see:

╭──── Session Summary ────╮
│  Rounds completed: 5/5  │
│  XP earned:  +25        │
│  Total XP:   75         │
╰─────────────────────────╯

Each judgment earns 5 XP.

Step 9: Submit Your Judgments

Push your results to GitHub:

uv run qebench submit

Your judgments are saved in results/judgments/{your-username}.jsonl and Elo ratings are updated in results/elo.json.
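
If you want to inspect your own file locally, a JSONL file holds one JSON object per line, so it can be read with the standard library alone. The sketch below makes no assumption about the fields inside each record. The helper names are hypothetical, and the XP figure simply applies the 5 XP per judgment rule from Step 8.

```python
import json
from pathlib import Path

XP_PER_JUDGMENT = 5  # from "Each judgment earns 5 XP"


def load_judgments(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def xp_for(records):
    """XP implied by a list of judgments."""
    return XP_PER_JUDGMENT * len(records)
```

For example, calling load_judgments("results/judgments/alice.jsonl") and passing the result to xp_for would give 25 for a file containing the five judgments from the session above (alice being the user shown in the session header).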

Filtering by Domain

Focus your judgments on a specific domain:

uv run qebench judge -n 10 -d economics

This is useful when you have domain expertise, because your ratings will be more reliable for terms you know well.

Matchup Strategy

The judge system pairs translations intelligently:

You can exit a session early at any prompt by pressing Ctrl+C — completed rounds are saved.

Next Steps