Rename TokenizerConfig.__post__init__ to __post_init__#292
Open
Chessing234 wants to merge 1 commit into
Open
Conversation
TokenizerConfig.__post__init__ has a stray extra underscore in the method name. Python's @DataClass only invokes __post_init__ after initialization, so the current name is never called. Consequences: - The pad_token_id fallback to eos_token_id (line 66) never runs; a missing pad_token_id silently propagates as None into tokenize_in_parallel(..., pad_token_id=...). - The EOS / BOS / "segment_before_tokenization is experimental" warnings are never emitted, so users configuring the CLI with incomplete tokens get no indication that something is missing. Rename to __post_init__ so @DataClass actually runs the hook. No behaviour change beyond restoring the intended post-init logic.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Bug
In
python/dolma/cli/tokenizer.py,TokenizerConfigdeclaresbut the method name has an extra underscore. The
@dataclassmachineryonly invokes
__post_init__(single underscore betweenpostandinit), so this method is never called.Root cause
Plain typo in the method name (
__post__init__instead of__post_init__). Nothing else in the file references it, so the typois silent: the code compiles and imports fine, and the tokenizer CLI
just runs without the intended post-init validation.
Why the fix is correct
Renaming to
__post_init__makes@dataclassinvoke it exactly asintended:
pad_token_idfalls back toeos_token_idwhen the user doesn'tpass one (line 66). Without the fix,
pad_token_idstaysNoneand propagates into
tokenize_in_parallel(..., pad_token_id=None, ...)at line 222 of
TokenizerCli.run."segment_before_tokenization is experimental" warnings fire for
misconfigured runs, instead of being silently skipped.
This is a one-character change, no logic in the body of the method is
touched, and no call site needs to change because nothing was
explicitly calling the misnamed method.
Change
python/dolma/cli/tokenizer.py:def __post__init__(self):→def __post_init__(self):.