Fix get_nl_ratio ZeroDivisionError on empty document text #291
Open
Chessing234 wants to merge 1 commit into
starcoder.get_nl_ratio computes comment-to-code ratio as:
    if language == "python":
        comments = get_text_python(text)
        ratio = len(comments) / len(text)
    else:
        ratio = comment_size(text, language) / len(text)
    return ratio
With empty text (0 bytes), both branches divide by 0 and raise
ZeroDivisionError. This is reachable: CodeStarCoderTaggers2.predict
(code_taggers.py:253) calls get_nl_ratio(doc.text, lang) without any
try/except when lang is python / java / javascript, so an empty-source
document aborts the tagger mid-run and takes the current shard's
processing with it.
The older CodeStarCoderTaggers (v1, code_taggers.py:209) does wrap the
call in a bare except, so it silently coerces the crash into nl_ratio =
-1.0, but v2 has no such safety net.
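To make the difference between the two call patterns concrete, here is a minimal sketch. It is not the actual tagger code: nl_ratio_stub, tag_v1_style, and tag_v2_style are hypothetical names, and the stub only imitates the division by len(text) described above.

```python
def nl_ratio_stub(text):
    # Hypothetical stand-in for the pre-fix get_nl_ratio:
    # like the real function, it divides by len(text).
    return text.count("#") / len(text)


def tag_v1_style(texts):
    # v1 pattern as described above: a bare except turns any failure into -1.0.
    scores = []
    for text in texts:
        try:
            scores.append(nl_ratio_stub(text))
        except:  # noqa: E722 (bare except mirrors the v1 description)
            scores.append(-1.0)
    return scores


def tag_v2_style(texts):
    # v2 pattern as described above: no guard, so the first empty text raises
    # and none of the remaining documents are scored.
    return [nl_ratio_stub(text) for text in texts]


shard = ["# a\nx = 1\n", "", "y = 2\n"]

print(tag_v1_style(shard))  # the empty document is scored with the -1.0 sentinel

try:
    tag_v2_style(shard)
except ZeroDivisionError as exc:
    print("shard aborted:", exc)
```

In the v1-style loop the empty document still gets a score, while in the v2-style loop the third document is never reached.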
Returning 0.0 for empty text is the natural zero-comment-zero-code
answer and keeps the tagger's output schema consistent: any downstream
filter that relies on the ratio score sees a well-defined float instead
of a missing / erroring span. No behaviour change for non-empty text.
Bug

python/dolma/taggers/code/starcoder.py::get_nl_ratio computes the comment-to-code ratio by dividing by len(text) in both the Python branch and the generic branch. When text is empty (0 bytes), both branches divide by 0 and raise ZeroDivisionError.

This is reachable from CodeStarCoderTaggers2.predict in python/dolma/taggers/code/code_taggers.py:253, which calls get_nl_ratio(doc.text, lang) without a try/except when the detected language is Python/Java/JavaScript. An empty-source document (e.g. a 0-byte file that made it through earlier filters, or a document whose body got stripped upstream) therefore aborts the tagger mid-run and takes the current shard's processing with it.

The older CodeStarCoderTaggers (v1, code_taggers.py:209) does wrap the call in try: ... except: nl_ratio = -1.0, so v1 silently coerces the crash into a sentinel; v2 has no such safety net.

Fix

Short-circuit in get_nl_ratio for empty text, before any division, returning 0.0:

     def get_nl_ratio(text, language):
         """get the ratio of comments to code in a program"""
    +    if not text:
    +        return 0.0
         if language == "python":
             comments = get_text_python(text)
             ratio = len(comments) / len(text)
         else:
             ratio = comment_size(text, language) / len(text)
         return ratio

0.0: the natural "zero comments over zero code" answer. It keeps the tagger's output schema consistent: downstream filters that check nl_ratio_doc / code_to_comment_ratio spans see a well-defined float instead of a missing / erroring span or v1's -1.0 magic number.

Two added lines, no other changes.
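
A quick way to check the change in isolation is sketched below. The bodies of get_text_python and comment_size here are simplified stand-ins written for this example (assumptions, not the real starcoder.py helpers), and get_nl_ratio_unguarded is a hypothetical name for the pre-fix logic.

```python
import re


def get_text_python(text):
    # Stand-in (assumption): concatenate the bodies of '#' line comments.
    return "".join(re.findall(r"#(.*)", text))


def comment_size(text, language):
    # Stand-in (assumption): count characters inside '//' line comments.
    return sum(len(body) for body in re.findall(r"//(.*)", text))


def get_nl_ratio_unguarded(text, language):
    """Pre-fix logic: both branches divide by len(text)."""
    if language == "python":
        comments = get_text_python(text)
        ratio = len(comments) / len(text)
    else:
        ratio = comment_size(text, language) / len(text)
    return ratio


def get_nl_ratio(text, language):
    """Patched logic: short-circuit on empty text before any division."""
    if not text:
        return 0.0
    return get_nl_ratio_unguarded(text, language)


for lang in ("python", "java"):
    try:
        get_nl_ratio_unguarded("", lang)
    except ZeroDivisionError:
        print(f"{lang}: unguarded version raises ZeroDivisionError")
    print(f"{lang}: guarded version returns {get_nl_ratio('', lang)}")
```

Because the guard runs before either branch, non-empty inputs take exactly the same path as before, which matches the "no behaviour change for non-empty text" claim.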