All data models are defined in src/qebench/models.py using Pydantic v2.
Entry Types¶
Term¶
The simplest entry type — a single English term and its Chinese translation.
class Term(BaseModel):
    id: str  # "term-001", auto-generated by qebench add
    en: str  # "Bellman equation"
    zh: str  # "贝尔曼方程"
    domain: str  # "dynamic-programming"
    difficulty: Difficulty  # basic | intermediate | advanced
    alternatives: list[str] = []  # ["贝尔曼等式"]
    contexts: list[TermContext] = []  # usage sentences from lectures
    source: str = ""  # "quantecon/dp-intro"

Validation rules:
- id must match the pattern ^term-\d{3,}$
- en and zh must be non-empty
- domain must be non-empty (checked against config at runtime)
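The validation rules above could be expressed with Pydantic v2 field constraints. The sketch below is a trimmed stand-in, not the contents of src/qebench/models.py: TermSketch and is_valid are hypothetical names, several fields are omitted, and the runtime check of domain against the config is not modeled.

```python
from pydantic import BaseModel, Field, ValidationError


class TermSketch(BaseModel):
    """Hypothetical, trimmed stand-in for Term (not the project's code)."""

    id: str = Field(pattern=r"^term-\d{3,}$")  # "term-" plus at least three digits
    en: str = Field(min_length=1)  # must be non-empty
    zh: str = Field(min_length=1)  # must be non-empty
    domain: str = Field(min_length=1)  # the config lookup is not modeled here


def is_valid(data: dict) -> bool:
    """Return True when the data satisfies the constraints above."""
    try:
        TermSketch(**data)
    except ValidationError:
        return False
    return True


good = {
    "id": "term-001",
    "en": "Bellman equation",
    "zh": "贝尔曼方程",
    "domain": "dynamic-programming",
}
print(is_valid(good))
print(is_valid({**good, "id": "term-1"}))  # too few digits in the id
print(is_valid({**good, "zh": ""}))  # empty translation
```

Declaring the constraints on the fields keeps them visible in the generated JSON Schema as well, which is one reason to prefer them over a hand-written validator for rules this simple.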
TermContext¶
A supporting model that stores a usage sentence from a QuantEcon lecture.
Populated by qebench update when it scans lecture repositories.
class TermContext(BaseModel):
    text: str  # "Dynamic programming is a powerful technique for solving..."
    source: str  # "lecture-python-intro/intro.md"

Up to 5 context sentences are stored per term (deterministic selection for
stable version-controlled output). During qebench translate, one random
context is shown to help the translator understand how the term is used.
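The two behaviours described above (a stable selection of at most 5 contexts, and a random pick at translation time) can be sketched as follows. This is an illustration under stated assumptions: the actual selection rule used by qebench update is not documented here, so sorting by (source, text) is only one plausible way to get deterministic output, and both function names are hypothetical.

```python
from __future__ import annotations

import random

MAX_CONTEXTS = 5  # the per-term limit stated above


def select_contexts(candidates: list[dict], limit: int = MAX_CONTEXTS) -> list[dict]:
    """Keep at most `limit` contexts, independent of the order they were scanned in.

    Assumption: de-duplicating and sorting by (source, text) is used here
    only to make the result reproducible; qebench may use a different rule.
    """
    unique = {(c["source"], c["text"]): c for c in candidates}
    return [unique[key] for key in sorted(unique)][:limit]


def pick_context_for_display(
    contexts: list[dict], rng: random.Random | None = None
) -> dict | None:
    """Choose one stored context at random, or None when the term has none."""
    if not contexts:
        return None
    return (rng or random).choice(contexts)


candidates = [
    {"text": f"Sentence {i}", "source": f"lecture-{i % 3}/page.md"} for i in range(7)
]
stored = select_contexts(candidates)
shown = pick_context_for_display(stored, random.Random(0))
```

Because the stored list does not depend on scan order, re-running the scan on unchanged lectures should leave the data files byte-identical, which keeps version-control diffs quiet.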
Sentence¶
A complete sentence with optional human evaluation scores.
class Sentence(BaseModel):
    id: str  # "sent-042"
    en: str
    zh: str
    domain: str
    difficulty: Difficulty
    key_terms: list[str] = []  # ["term-001", "term-005"]
    human_scores: HumanScores | None = None
    source: str = ""

key_terms links sentences to the terms they contain, which is useful for measuring
whether term-level accuracy translates to sentence-level quality.
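As an illustration of how the key_terms links might be used, the sketch below works on plain dictionaries shaped like the stored entries. Both helpers are hypothetical and not part of qebench: one groups sentence ids by the terms they reference, the other reports references to term ids that do not exist.

```python
from collections import defaultdict


def sentences_by_term(sentences: list[dict]) -> dict[str, list[str]]:
    """Map each referenced term id to the ids of the sentences that contain it."""
    index: dict[str, list[str]] = defaultdict(list)
    for sentence in sentences:
        for term_id in sentence.get("key_terms", []):
            index[term_id].append(sentence["id"])
    return dict(index)


def dangling_key_terms(sentences: list[dict], terms: list[dict]) -> dict[str, list[str]]:
    """Return, per sentence id, the key_terms that match no known term id."""
    known = {term["id"] for term in terms}
    missing: dict[str, list[str]] = {}
    for sentence in sentences:
        unknown = [t for t in sentence.get("key_terms", []) if t not in known]
        if unknown:
            missing[sentence["id"]] = unknown
    return missing


terms = [{"id": "term-001"}, {"id": "term-005"}]
sentences = [
    {"id": "sent-042", "key_terms": ["term-001", "term-005"]},
    {"id": "sent-043", "key_terms": ["term-001", "term-999"]},
    {"id": "sent-044"},  # key_terms defaults to an empty list
]
print(sentences_by_term(sentences))
print(dangling_key_terms(sentences, terms))
```

With an index like this, sentence-level scores can be grouped by term and compared against how the same term fared in isolation.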
Paragraph¶
The richest entry type — paragraphs can contain math, code, and mixed content.
class Paragraph(BaseModel):
    id: str  # "para-007"
    en: str
    zh: str
    domain: str
    difficulty: Difficulty
    key_terms: list[str] = []
    contains_math: bool = False
    contains_code: bool = False
    human_scores: HumanScores | None = None
    source: str = ""

Supporting Types¶
Difficulty¶
class Difficulty(str, Enum):
    basic = "basic"
    intermediate = "intermediate"
    advanced = "advanced"

HumanScores¶
class HumanScores(BaseModel):
    accuracy: int  # 1-10
    fluency: int  # 1-10

Both fields are constrained to the range [1, 10] via Field(ge=1, le=10).
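Written out in full, the range constraint could look like the sketch below. The class is named HumanScoresSketch to make clear it is an illustration rather than a copy of the project's code, and scores_ok is a hypothetical helper.

```python
from pydantic import BaseModel, Field, ValidationError


class HumanScoresSketch(BaseModel):
    """Illustrative stand-in for HumanScores with the bounds spelled out."""

    accuracy: int = Field(ge=1, le=10)  # 1 is the lowest score, 10 the highest
    fluency: int = Field(ge=1, le=10)


def scores_ok(accuracy: int, fluency: int) -> bool:
    """Return True when both scores fall inside [1, 10]."""
    try:
        HumanScoresSketch(accuracy=accuracy, fluency=fluency)
    except ValidationError:
        return False
    return True


print(scores_ok(7, 9))
print(scores_ok(0, 5))  # accuracy below the lower bound
print(scores_ok(5, 11))  # fluency above the upper bound
```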
JSON Schema Generation¶
Pydantic models auto-generate JSON Schema for CI validation:
from qebench.models import Term

schema = Term.model_json_schema()

This can be used with jsonschema to validate data files in CI without
loading the full Python package.
Storage Format¶
Data files are stored as bare JSON arrays:
[
  {
    "id": "term-001",
    "en": "inflation",
    "zh": "通货膨胀",
    "domain": "economics",
    "difficulty": "basic",
    "alternatives": ["通胀"],
    "source": ""
  }
]

The loader also supports a wrapped format (for future versioning):
{
  "version": "1.0",
  "entries": [...]
}
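A loader that accepts both layouts can be sketched in a few lines. This is a minimal illustration, not the loader shipped with qebench: load_entries is a hypothetical name, and the version field is read past without being interpreted.

```python
import json


def load_entries(text: str) -> list[dict]:
    """Return the entry list from either the bare or the wrapped format."""
    data = json.loads(text)
    if isinstance(data, list):
        return data  # bare JSON array
    if isinstance(data, dict) and isinstance(data.get("entries"), list):
        return data["entries"]  # wrapped format; "version" is ignored here
    raise ValueError("expected a JSON array or an object with an 'entries' list")


bare = '[{"id": "term-001", "en": "inflation", "zh": "通货膨胀"}]'
wrapped = '{"version": "1.0", "entries": [{"id": "term-001", "en": "inflation", "zh": "通货膨胀"}]}'
print(load_entries(bare) == load_entries(wrapped))
```

Normalizing to a plain list at load time means the rest of the code never needs to know which layout a file used.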