-
Notifications
You must be signed in to change notification settings - Fork 17
feat: Support DSV4 example. #304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 1 commit
898621d
fa3eb91
ab25f96
4ba9fe0
7432582
45fd1dc
0de2a14
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -63,8 +63,6 @@ | |||||||||||||||||||||||||||||
| APIType, | ||||||||||||||||||||||||||||||
| BenchmarkConfig, | ||||||||||||||||||||||||||||||
| DatasetType, | ||||||||||||||||||||||||||||||
| LoadPattern, | ||||||||||||||||||||||||||||||
| LoadPatternType, | ||||||||||||||||||||||||||||||
| StreamingMode, | ||||||||||||||||||||||||||||||
| TestMode, | ||||||||||||||||||||||||||||||
| TestType, | ||||||||||||||||||||||||||||||
|
|
@@ -302,7 +300,15 @@ def setup_benchmark(config: BenchmarkConfig, test_mode: TestMode) -> BenchmarkCo | |||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| # Tokenizer check (light API call, no download) | ||||||||||||||||||||||||||||||
| model_name = config.model_params.name | ||||||||||||||||||||||||||||||
| tokenizer_name = model_name if _check_tokenizer_exists(model_name) else None | ||||||||||||||||||||||||||||||
| tokenizer_override = config.model_params.tokenizer_name | ||||||||||||||||||||||||||||||
| tokenizer_name: str | None | ||||||||||||||||||||||||||||||
| if tokenizer_override: | ||||||||||||||||||||||||||||||
| tokenizer_name = tokenizer_override | ||||||||||||||||||||||||||||||
| logger.info( | ||||||||||||||||||||||||||||||
| f"Tokenizer available for model: {model_name} (override: {tokenizer_override})" | ||||||||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||||||||
|
Comment on lines
+306
to
+309
|
||||||||||||||||||||||||||||||
| tokenizer_name = tokenizer_override | |
| logger.info( | |
| f"Tokenizer available for model: {model_name} (override: {tokenizer_override})" | |
| ) | |
| if _check_tokenizer_exists(tokenizer_override): | |
| tokenizer_name = tokenizer_override | |
| logger.info( | |
| f"Using tokenizer override for model: {model_name} ({tokenizer_override})" | |
| ) | |
| else: | |
| tokenizer_name = None | |
| logger.warning( | |
| f"Tokenizer override not available for model: {model_name} ({tokenizer_override})" | |
| ) |
Copilot
AI
May 1, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In accuracy phases, load_pattern=ctx.rt_settings.load_pattern means accuracy evaluation will be throttled by the main run’s load pattern (e.g., Poisson target QPS / concurrency), which can make scoring unnecessarily slow. Previously this forced MAX_THROUGHPUT; consider keeping Burst/MAX_THROUGHPUT for accuracy phases (or making it explicitly configurable) to avoid long eval runs.
Copilot
AI
May 1, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
except Exception as e: logger.error(...) logs only str(e) and drops the traceback, which makes scorer failures hard to diagnose in CI/user runs. Consider including exc_info=True (and/or logging the exception type) so the report includes actionable stack traces while still continuing past failures.
| logger.error(f"Scoring failed for {eval_cfg.dataset_name}: {e}") | |
| logger.error( | |
| "Scoring failed for %s: %s: %s", | |
| eval_cfg.dataset_name, | |
| type(e).__name__, | |
| e, | |
| exc_info=True, | |
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The accuracy scoring loop contains significant code duplication for the result dictionary and uses assert statements for data validation inside an exception handler. If dataset.data is missing, the assert will raise an AssertionError which, while caught by the except Exception, makes the logic brittle. Refactoring this to use a base dictionary and safer access improves maintainability and ensures the loop continues gracefully as intended.
for eval_cfg in ctx.eval_configs:
# Prepare base results; ensure data exists to avoid TypeError in len()
num_samples = len(eval_cfg.dataset.data) if eval_cfg.dataset.data is not None else 0
res_base = {
"dataset_name": eval_cfg.dataset_name,
"num_samples": num_samples,
"extractor": eval_cfg.extractor.__name__,
"ground_truth_column": eval_cfg.ground_truth_column,
}
try:
scorer_instance = eval_cfg.scorer(
eval_cfg.dataset_name,
eval_cfg.dataset,
eval_cfg.report_dir,
extractor=eval_cfg.extractor,
ground_truth_column=eval_cfg.ground_truth_column,
)
score, n_repeats = scorer_instance.score()
accuracy_scores[eval_cfg.dataset_name] = {
**res_base,
"score": score,
"n_repeats": n_repeats,
}
logger.info(
f"Score for {eval_cfg.dataset_name}: {score} ({n_repeats} repeats)"
)
except Exception as e:
scoring_failed = True
logger.error(f"Scoring failed for {eval_cfg.dataset_name}: {e}")
accuracy_scores[eval_cfg.dataset_name] = {
**res_base,
"score": None,
"error": str(e),
}| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -196,6 +196,12 @@ class ModelParams(BaseModel): | |
| StreamingMode, | ||
| cyclopts.Parameter(alias="--streaming", help="Streaming mode: auto/on/off"), | ||
| ] = StreamingMode.AUTO | ||
| tokenizer_name: str | None = Field( | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. should we add this to CLI as well? (by using cyclopts.Parameter) |
||
| None, | ||
| description="Local tokenizer path override. Use when AutoTokenizer.from_pretrained " | ||
| "fails for the HF model name (e.g. transformers ≥5.4 rope_theta regression " | ||
| "for DeepSeek-V4). Defaults to the model name if unset.", | ||
| ) | ||
|
|
||
|
|
||
| class SubmissionReference(BaseModel): | ||
|
|
||
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -133,9 +133,13 @@ def text_after_first_chunk(self) -> str: | |||||||||||||||||||
| """ | ||||||||||||||||||||
| parts: list[str] = [] | ||||||||||||||||||||
| if self.reasoning: | ||||||||||||||||||||
| if isinstance(self.reasoning, tuple) and len(self.reasoning) > 1: | ||||||||||||||||||||
| if isinstance(self.reasoning, str): | ||||||||||||||||||||
| # str reasoning is the fully joined streaming trace — include it | ||||||||||||||||||||
| # in the TPOT denominator. Over-counts by one token (the first | ||||||||||||||||||||
| # token is not excluded), but the error is negligible in practice. | ||||||||||||||||||||
| parts.append(self.reasoning) | ||||||||||||||||||||
| elif isinstance(self.reasoning, tuple) and len(self.reasoning) > 1: | ||||||||||||||||||||
|
||||||||||||||||||||
| parts.extend(self.reasoning[1:]) | ||||||||||||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This logic to include string reasoning in the TPOT denominator introduces a slight inaccuracy in metrics by failing to exclude the first token of the response. If the accumulator preserves reasoning as a tuple of chunks (as suggested in
Suggested change
|
||||||||||||||||||||
| # str reasoning: single chunk, skip entirely (it IS the first chunk) | ||||||||||||||||||||
| if self.output: | ||||||||||||||||||||
| if isinstance(self.output, str): | ||||||||||||||||||||
| # Non-streaming: if reasoning was present and was the first chunk, | ||||||||||||||||||||
|
|
||||||||||||||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -123,7 +123,12 @@ def __init__(self, data: dict[str, Any]): | |
| def __call__(self, df: pd.DataFrame) -> pd.DataFrame: | ||
| """Add the static columns to the row.""" | ||
| for key, value in self.data.items(): | ||
| df[key] = value | ||
| # Wrap dict/list values in a list so pandas doesn't try to align | ||
| # on index keys (e.g. {"thinking": True} would produce NaN otherwise). | ||
| if isinstance(value, (dict, list)): | ||
| df[key] = [value] * len(df) | ||
| else: | ||
| df[key] = value | ||
|
Comment on lines
+126
to
+131
|
||
| return df | ||
|
|
||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -103,7 +103,7 @@ class ABCDExtractor(Extractor, extractor_id="abcd_extractor"): | |
| Returns: | ||
| "choice" key (see GQPA dataset columns) or empty string if no answer is found. | ||
| Examples: | ||
| >>> ABCDExtractor.extract("The answer is B") | ||
| >>> ABCDExtractor.extract("Answer: B") | ||
| 'choice2' | ||
| >>> ABCDExtractor.extract("**Answer:** C") | ||
| 'choice3' | ||
|
|
@@ -220,6 +220,121 @@ def extract(cls, text: str, default: str | None = None) -> str | None: | |
| return default if default is not None else "" | ||
|
|
||
|
|
||
| class LetterExtractor(Extractor, extractor_id="letter_extractor"): | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: abcd extractor and letter abstractor names are bit confusing |
||
| """Extract MCQ answer letter (A–J) from response text, returning the letter directly. | ||
|
|
||
| Like ABCDExtractor but returns the raw letter ("A", "B", … "J") instead of | ||
| mapping to "choice1"–"choice4". Supports datasets with up to ten answer | ||
| options (e.g. MMLU-Pro A–J) where ground-truth labels are stored as the | ||
| letter itself. | ||
|
|
||
| Examples: | ||
| >>> LetterExtractor.extract("Answer: B") | ||
| 'B' | ||
| >>> LetterExtractor.extract("**Answer:** G") | ||
| 'G' | ||
| >>> LetterExtractor.extract("\\\\boxed{E}") | ||
| 'E' | ||
| """ | ||
|
|
||
| LETTERS = frozenset("ABCDEFGHIJ") | ||
|
|
||
| PATTERNS = [ | ||
| # 0) **Answer:** A or *Answers* – B | ||
| re.compile( | ||
| r"""(?ix) | ||
| (?:\*{1,2}|_{1,2}) | ||
| Answer[s]? | ||
| \s*[:\-–]? | ||
| (?:\*{1,2}|_{1,2}) | ||
| \s* | ||
| ([A-J])\b | ||
| """, | ||
| re.X, | ||
| ), | ||
| # 0.1) Answer: A (with optional markdown) | ||
| re.compile( | ||
| r"""(?ix) | ||
| ^\s* | ||
| (?:\*{1,2}|_{1,2})? | ||
| Answer:? | ||
| (?:\*{1,2}|_{1,2})? | ||
| \s*:?\s* | ||
| (?:\*{1,2}|_{1,2})? | ||
| ([A-J]) | ||
| (?:\*{1,2}|_{1,2})? | ||
| \s* | ||
| """, | ||
| re.MULTILINE, | ||
| ), | ||
| # 1) Answer: (C) | ||
| re.compile(r"(?ix)\bAnswer[s]?\b\s*[:\-–]?\s*\(\s*([A-J])\s*\)"), | ||
| # 2) Answer: C | ||
| re.compile(r"(?ix)\bAnswer[s]?\b\s*[:\-–]?\s*([A-J])\b"), | ||
| # 3) Option B or Choice: C | ||
| re.compile(r"(?ix)\b(?:Option|Choice)\b\s*[:\-–]?\s*([A-J])\b"), | ||
| # 7) \boxed{A} | ||
| re.compile(r"(?x)\\boxed\{[^}]*?([A-J])[^}]*\}", re.MULTILINE), | ||
| # 7.5) \boxed{\textbf{C}} | ||
| re.compile( | ||
| r"(?x)\\boxed\{[^}]*?\\textbf\{[^}]*?([A-J])[^}]*\}[^}]*\}", re.MULTILINE | ||
| ), | ||
| # 7.51) \boxed{\text{C}} | ||
| re.compile( | ||
| r"(?x)\\boxed\{[^}]*?\\text\{[^}]*?([A-J])[^}]*\}[^}]*\}", re.MULTILINE | ||
| ), | ||
| # 4) bare singletons: (A) [B] | ||
| re.compile(r"(?x)(?<![A-Za-z0-9])[\(\[]\s*([A-J])\s*[\)\]](?![A-Za-z0-9])"), | ||
| # 5) Markdown-wrapped: *A* **B** | ||
| re.compile( | ||
| r"(?x)(?<![A-Za-z0-9])(?:\*{1,2}|_{1,2})([A-J])(?:\*{1,2}|_{1,2})(?![A-Za-z0-9])" | ||
| ), | ||
| # 6) \textbf{C} | ||
| re.compile(r"(?x)\\textbf\{[^}]*?([A-J])[^}]*\}"), | ||
| # 8) **D) description** | ||
| re.compile(r"""(?x) | ||
| (?<![A-Za-z0-9]) | ||
| (?:\*{1,2}|_{1,2}) | ||
| \s*([A-J])\) | ||
| [^*_\n]+? | ||
| (?:\*{1,2}|_{1,2}) | ||
| (?![A-Za-z0-9]) | ||
| """), | ||
| # 9) final fallback: line starting with a single letter | ||
| re.compile( | ||
| r"""(?x)^\s* | ||
| (?:\*{1,2}|_{1,2})? | ||
| ([A-J]) | ||
| (?:\*{1,2}|_{1,2})? | ||
| \s*[\.\)\-–:]? | ||
| \s*.*$ | ||
| """, | ||
| re.MULTILINE, | ||
| ), | ||
| ] | ||
|
|
||
| @classmethod | ||
| def extract(cls, text: str, default: str | None = None) -> str | None: | ||
| matches = [] | ||
| for prio, pat in enumerate(cls.PATTERNS): | ||
| m = pat.search(text) | ||
| if m: | ||
| letter = m.group(1).upper() | ||
| if letter in cls.LETTERS: | ||
| matches.append((prio, m, letter)) | ||
|
|
||
| matches.sort(key=lambda triple: (triple[0], len(triple[1].group(0)))) | ||
|
|
||
| for _, _, letter in matches: | ||
| return letter | ||
|
|
||
| stripped = text.removeprefix("**") | ||
| if stripped and stripped[0].upper() in cls.LETTERS: | ||
| return stripped[0].upper() | ||
|
|
||
| return default if default is not None else "" | ||
|
|
||
|
|
||
| class BoxedMathExtractor(Extractor, extractor_id="boxed_math_extractor"): | ||
| """Extract boxed math answer from response text. | ||
| Based on OpenAI's extract_boxed_math function from GPT-OSS. | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -68,15 +68,14 @@ def add_chunk(self, delta: OpenAISSEDelta) -> StreamChunk | None: | |
|
|
||
| def get_final_output(self) -> QueryResult: | ||
| if self.reasoning_chunks: | ||
| # If there are reasoning chunks, then the first chunk received | ||
| # is the first reasoning chunk. The rest of the reasoning chunks, | ||
| # as well as the output chunks can be joined together. | ||
| resp_reasoning: list[str] = [self.reasoning_chunks[0]] | ||
| if len(self.reasoning_chunks) > 1: | ||
| resp_reasoning.append("".join(self.reasoning_chunks[1:])) | ||
| # All reasoning chunks are joined into a single string so the full | ||
| # thinking trace is captured as-is in events.jsonl. TPOT still uses | ||
| # text_after_first_chunk(), which includes string reasoning in the | ||
| # denominator (off by one token vs. the true "after first chunk" | ||
| # count, which is negligible). | ||
|
||
| text_output = TextModelOutput( | ||
| output="".join(self.output_chunks), | ||
| reasoning=resp_reasoning, | ||
| reasoning="".join(self.reasoning_chunks), | ||
| ) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Joining all reasoning chunks into a single string causes a loss of precision in TPOT (Time Per Output Token) calculations. The Since # Preserve reasoning chunks as a tuple to allow accurate TPOT
# calculation (excluding the first chunk). TextModelOutput
# handles joining for display/logging via __str__.
text_output = TextModelOutput(
output=tuple(self.output_chunks),
reasoning=tuple(self.reasoning_chunks),
) |
||
| elif self.output_chunks: | ||
| # If there are only output chunks, the first chunk is used for | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Catching a broad
ExceptionfromAutoTokenizer.from_pretrainedwithout logging can mask unrelated configuration issues (e.g., missing files, permission errors, offline mode) and make token-metric failures opaque. Consider catching a narrower set of expected failures (or at least logging the original exception at debug/warning level) before falling back toPreTrainedTokenizerFast.