Fix ft_dataset write_results dropping messages and opening output in binary mode #287

Open
Chessing234 wants to merge 1 commit into allenai:main from Chessing234:fix/ft-dataset-write-results-binary-mode-and-double-get

Conversation

@Chessing234
Bug

write_results in python/dolma/core/ft_dataset.py:

with smart_open.open(config.out_path, "wb") as o:
    while True:
        msg = q.get()

        if msg == _WRITER_EXIT_MSG:
            break

        if not flag.is_set():
            o.write(q.get())
            o.write("\n")
            written += 1

Two bugs in five lines:

  1. o.write(q.get()) calls q.get() a second time, so the msg retrieved at the top of the loop is discarded (only used for the exit check) and a fresh message is pulled and written. Every other producer message is silently dropped. The second q.get() can also pull the _WRITER_EXIT_MSG sentinel ("__WRITE__EXIT__") and write it into the output file; because the sentinel has then been consumed, the writer blocks indefinitely on the next q.get().

  2. The file is opened with mode "wb" (binary), but the producer (process_file) puts plain str values on the queue (q.put(f"__label__{label} {final_text}")) and line 112 writes the literal "\n". Either call raises TypeError: a bytes-like object is required, not 'str', so the first o.write fails the moment any record is produced.
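Both failure modes can be reproduced in isolation. The sketch below is illustrative only: it assumes `queue.Queue` and `io` buffers as stand-ins for the real queue and the `smart_open` handle, and `buggy_drain` is a hypothetical helper that mimics the double `q.get()` pattern rather than the actual `write_results`.

```python
import io
import queue

EXIT = "__WRITE__EXIT__"  # sentinel value quoted in the description above


def buggy_drain(q, out):
    """Hypothetical stand-in that repeats the double q.get() pattern."""
    while True:
        msg = q.get()
        if msg == EXIT:
            break
        out.write(q.get())  # second get: msg is discarded, the next item is written
        out.write("\n")


# An even number of records, so the sentinel is seen by the exit check
# and the loop terminates instead of blocking.
q = queue.Queue()
for item in ["a", "b", "c", "d", EXIT]:
    q.put(item)

text_out = io.StringIO()
buggy_drain(q, text_out)
written = text_out.getvalue().splitlines()  # only every second record survives

# Writing a str to a binary buffer fails in the same way as a file opened with "wb".
try:
    io.BytesIO().write("text")
    error = None
except TypeError as exc:
    error = str(exc)
```

With an odd number of records the second `q.get()` would consume the sentinel instead, which is why the example uses four.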

Root cause

  1. Typo / copy-paste: the retrieved msg was never written — q.get() was called twice.
  2. Mode mismatch between the opened file (binary) and the str payloads produced upstream.

Fix

  • Write the already-retrieved msg (o.write(msg)).
  • Open the output path in text mode ("w") to match the str payloads.

No change to the sentinel-based shutdown or the flag-based early-exit logic.

… in binary mode

`write_results` reads a message off the queue, checks it against the
exit sentinel, and is supposed to write it out:

    while True:
        msg = q.get()

        if msg == _WRITER_EXIT_MSG:
            break

        if not flag.is_set():
            o.write(q.get())   # <-- second q.get(): discards msg, pulls next
            o.write("\n")
            written += 1

Two bugs in five lines:

1. `o.write(q.get())` calls `q.get()` a second time, so the `msg`
   already retrieved at the top of the loop is never written — it was
   only used for the exit check. Half of the producer's messages are
   dropped, and the write itself can even pull and write the
   `_WRITER_EXIT_MSG` sentinel into the output file.

2. The output file is opened with mode `"wb"` (binary), but the
   producer puts `str` values on the queue and line 112 writes the
   literal `"\n"`. Both calls raise
   `TypeError: a bytes-like object is required, not 'str'` the moment
   anything is produced.

Write `msg` instead of re-polling the queue, and open the path in
text mode (`"w"`) to match the `str` payloads produced by
`process_file`.