Fix ft_dataset write_results dropping messages and opening output in binary mode#287
Open
Chessing234 wants to merge 1 commit intoallenai:mainfrom
Conversation
… in binary mode
`write_results` reads a message off the queue, checks it against the
exit sentinel, and is supposed to write it out:
while True:
msg = q.get()
if msg == _WRITER_EXIT_MSG:
break
if not flag.is_set():
o.write(q.get()) # <-- second q.get(): discards msg, pulls next
o.write("\n")
written += 1
Two bugs in five lines:
1. `o.write(q.get())` calls `q.get()` a second time, so the `msg`
already retrieved at the top of the loop is never written — it was
only used for the exit check. Half of the producer's messages are
dropped, and the write itself can even pull and write the
`_WRITER_EXIT_MSG` sentinel into the output file.
2. The output file is opened with mode `"wb"` (binary), but the
producer puts `str` values on the queue and line 112 writes the
literal `"\n"`. Both calls raise
`TypeError: a bytes-like object is required, not 'str'` the moment
anything is produced.
Write `msg` instead of re-polling the queue, and open the path in
text mode (`"w"`) to match the `str` payloads produced by
`process_file`.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Bug
write_resultsinpython/dolma/core/ft_dataset.py:Two bugs in five lines:
o.write(q.get())callsq.get()a second time, so themsgretrieved at the top of the loop is discarded (only used for the exit check) and a fresh message is pulled and written. Every other producer message is silently dropped, and the write itself can even pull and write the_WRITER_EXIT_MSGsentinel ("__WRITE__EXIT__") into the output file, leaving the writer blocked on the nextq.get().The file is opened with mode
"wb"(binary), but the producer (process_file) puts plainstrvalues on the queue (q.put(f"__label__{label} {final_text}")) and line 112 writes the literal"\n". Both calls raiseTypeError: a bytes-like object is required, not 'str'the moment any record is produced.Root cause
msgwas never written —q.get()was called twice.strpayloads produced upstream.Fix
msg(o.write(msg))."w") to match thestrpayloads.No change to the sentinel-based shutdown or the
flag-based early-exit logic.