Rename TokenizerConfig.__post__init__ to __post_init__ #292

Open
Chessing234 wants to merge 1 commit into allenai:main from Chessing234:fix/tokenizer-config-post-init-typo

Conversation

@Chessing234

Bug

In python/dolma/cli/tokenizer.py, TokenizerConfig declares

def __post__init__(self):
    ...
    if self.pad_token_id is None:
        self.pad_token_id = self.eos_token_id
    ...

but the method name has an extra underscore. The @dataclass machinery
only invokes __post_init__ (single underscore between post and
init), so this method is never called.
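
To make the failure mode concrete, here is a minimal sketch using two stand-in dataclasses. `MisnamedConfig` and `FixedConfig` are hypothetical names for illustration, not the real `TokenizerConfig`, and the fields are reduced to the two relevant ones:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MisnamedConfig:
    """Hypothetical stand-in: the hook name has the extra underscore."""

    eos_token_id: Optional[int] = None
    pad_token_id: Optional[int] = None

    def __post__init__(self):  # typo: @dataclass never calls this
        if self.pad_token_id is None:
            self.pad_token_id = self.eos_token_id


@dataclass
class FixedConfig:
    """Hypothetical stand-in: same body, correctly named hook."""

    eos_token_id: Optional[int] = None
    pad_token_id: Optional[int] = None

    def __post_init__(self):  # called by the generated __init__
        if self.pad_token_id is None:
            self.pad_token_id = self.eos_token_id


broken = MisnamedConfig(eos_token_id=2)
fixed = FixedConfig(eos_token_id=2)

print(broken.pad_token_id)  # None: the fallback never ran
print(fixed.pad_token_id)   # 2: the fallback ran
```

Both classes import and instantiate without error, which is why the typo goes unnoticed.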

Root cause

Plain typo in the method name (__post__init__ instead of
__post_init__). Nothing else in the file references it, so the typo
is silent: the code compiles and imports fine, and the tokenizer CLI
just runs without the intended post-init fallback and warnings.

Why the fix is correct

Renaming to __post_init__ makes @dataclass invoke it exactly as
intended:

  • pad_token_id falls back to eos_token_id when the user doesn't
    pass one (line 66). Without the fix, pad_token_id stays None
    and propagates into tokenize_in_parallel(..., pad_token_id=None, ...)
    at line 222 of TokenizerCli.run.
  • The "NO EOS TOKEN PROVIDED", "NO BOS TOKEN PROVIDED", and
    "segment_before_tokenization is experimental" warnings fire for
    misconfigured runs, instead of being silently skipped.

This is a one-character change: no logic in the body of the method is
touched, and no call site needs to change because nothing was
explicitly calling the misnamed method.
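
Because the typo is silent, one possible guard is a small check that flags attribute names which look like a misspelled hook. This is only a sketch and not part of this PR: `misnamed_post_init_hooks`, `TypoExample`, and `FixedExample` are hypothetical names, and the matching rule (ignore underscores, compare to "postinit") is an assumption about which typos are worth catching.

```python
from dataclasses import dataclass
from typing import List, Optional


def misnamed_post_init_hooks(cls: type) -> List[str]:
    """Return names defined on cls that look like a misspelled __post_init__."""
    return [
        name
        for name in vars(cls)
        # Same letters as "post_init" once underscores are ignored,
        # but not the exact name that @dataclass looks for.
        if name != "__post_init__" and name.replace("_", "") == "postinit"
    ]


@dataclass
class TypoExample:
    pad_token_id: Optional[int] = None

    def __post__init__(self):  # deliberate typo for the demonstration
        pass


@dataclass
class FixedExample:
    pad_token_id: Optional[int] = None

    def __post_init__(self):
        pass


suspects = misnamed_post_init_hooks(TypoExample)
clean = misnamed_post_init_hooks(FixedExample)
print(suspects)
print(clean)
```

A check like this could run in a unit test over the config classes; whether it is worth adding is a judgment call for the maintainers.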

Change

python/dolma/cli/tokenizer.py: def __post__init__(self): ->
def __post_init__(self):

TokenizerConfig.__post__init__ has a stray extra underscore in the
method name. Python's @dataclass only invokes __post_init__ after
initialization, so the current name is never called. Consequences:

- The pad_token_id fallback to eos_token_id (line 66) never runs; a
  missing pad_token_id silently propagates as None into
  tokenize_in_parallel(..., pad_token_id=...).
- The EOS / BOS / "segment_before_tokenization is experimental"
  warnings are never emitted, so users configuring the CLI with
  incomplete tokens get no indication that something is missing.

Rename to __post_init__ so @dataclass actually runs the hook. No
behaviour change beyond restoring the intended post-init logic.