Fix get_nl_ratio ZeroDivisionError on empty document text#291

Open
Chessing234 wants to merge 1 commit into
allenai:mainfrom
Chessing234:fix/get-nl-ratio-zero-division-empty-text

@Chessing234

Bug

python/dolma/taggers/code/starcoder.py::get_nl_ratio computes the comment-to-code ratio as:

def get_nl_ratio(text, language):
    """get the ratio of comments to code in a program"""
    if language == "python":
        comments = get_text_python(text)
        ratio = len(comments) / len(text)
    else:
        ratio = comment_size(text, language) / len(text)
    return ratio

When text is empty (0 bytes), both branches divide by 0 and raise ZeroDivisionError.
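The failure can be reproduced without the dolma package. The sketch below keeps the control flow quoted above but swaps `get_text_python` and `comment_size` for hypothetical stand-ins (`get_text_python_stub`, `comment_size_stub`), since their real implementations are not shown here; only the division by `len(text)` matters for the bug.

```python
def get_text_python_stub(text):
    """Hypothetical stand-in: return the text of '#' comment lines, joined."""
    return "".join(
        line for line in text.splitlines() if line.lstrip().startswith("#")
    )


def comment_size_stub(text, language):
    """Hypothetical stand-in: count characters on lines starting with '//'."""
    return sum(
        len(line) for line in text.splitlines() if line.lstrip().startswith("//")
    )


def get_nl_ratio_unpatched(text, language):
    """Same control flow as the pre-fix function, using the stubs above."""
    if language == "python":
        comments = get_text_python_stub(text)
        ratio = len(comments) / len(text)
    else:
        ratio = comment_size_stub(text, language) / len(text)
    return ratio


def raises_zero_division(text, language):
    """Return True if the unpatched function raises ZeroDivisionError."""
    try:
        get_nl_ratio_unpatched(text, language)
    except ZeroDivisionError:
        return True
    return False


if __name__ == "__main__":
    # Both branches fail on empty input, whatever the language.
    for language in ("python", "java", "javascript"):
        print(language, raises_zero_division("", language))
```

Because the denominator is `len(text)` in both branches, the language argument has no effect on whether the empty case fails.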

This is reachable from CodeStarCoderTaggers2.predict in python/dolma/taggers/code/code_taggers.py:253, which calls get_nl_ratio(doc.text, lang) without a try / except when the detected language is Python/Java/JavaScript. An empty-source document (e.g. a 0-byte file that made it through earlier filters, or a document whose body got stripped upstream) therefore aborts the tagger mid-run and takes the current shard's processing with it.

The older CodeStarCoderTaggers (v1, code_taggers.py:209) does wrap the call in try: ... except: nl_ratio = -1.0, so v1 silently coerces the crash into a sentinel; v2 has no such safety net.
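The difference between the two call sites can be sketched as follows. This is not the code from `code_taggers.py`: the function names and the `ratio_that_divides_by_length` helper are hypothetical, and the v1-style wrapper catches `Exception` rather than using the bare `except:` described above.

```python
SENTINEL = -1.0  # value the description attributes to the v1 fallback


def ratio_that_divides_by_length(text, language):
    """Hypothetical stand-in for the pre-fix get_nl_ratio: divides by len(text)."""
    return 0 / len(text)


def v1_style_call(text, language, ratio_fn=ratio_that_divides_by_length):
    """v1-like pattern: any failure in ratio_fn becomes the sentinel."""
    try:
        return ratio_fn(text, language)
    except Exception:  # narrowed from a bare except for this sketch
        return SENTINEL


def v2_style_call(text, language, ratio_fn=ratio_that_divides_by_length):
    """v2-like pattern: no guard, so the exception reaches the caller."""
    return ratio_fn(text, language)


if __name__ == "__main__":
    print(v1_style_call("", "python"))  # sentinel, no exception
    try:
        v2_style_call("", "python")
    except ZeroDivisionError as exc:
        print("v2-style call raised:", exc)
```

The v1-style wrapper hides the failure from the caller, which is why the bug only surfaces through the v2 path.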

Fix

Short-circuit in get_nl_ratio for empty text — before any division — returning 0.0:

 def get_nl_ratio(text, language):
     """get the ratio of comments to code in a program"""
+    if not text:
+        return 0.0
     if language == "python":
         comments = get_text_python(text)
         ratio = len(comments) / len(text)
     else:
         ratio = comment_size(text, language) / len(text)
     return ratio
  • Empty text → 0.0: with no code there are no comments to count, so 0.0 is a sensible convention for the otherwise undefined 0/0. It also keeps the tagger's output schema consistent: downstream filters that check the nl_ratio_doc / code_to_comment_ratio spans see a well-defined float instead of a missing or erroring span, or v1's -1.0 magic number.
  • Non-empty text: unchanged — both language branches execute exactly as before.

Two added lines, no other changes.
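A regression check for the guard might look like the sketch below. It is an illustration under assumptions, not a test from the dolma repository: `get_text_python_stub` and `comment_size_stub` are hypothetical stand-ins for the real helpers, and `get_nl_ratio_patched` copies only the control flow shown in the diff.

```python
def get_text_python_stub(text):
    """Hypothetical stand-in: return the text of '#' comment lines, joined."""
    return "".join(
        line for line in text.splitlines() if line.lstrip().startswith("#")
    )


def comment_size_stub(text, language):
    """Hypothetical stand-in: count characters on lines starting with '//'."""
    return sum(
        len(line) for line in text.splitlines() if line.lstrip().startswith("//")
    )


def get_nl_ratio_patched(text, language):
    """Control flow of the patched function, using the stubs above."""
    if not text:
        # Guard added by the fix: return before any division happens.
        return 0.0
    if language == "python":
        comments = get_text_python_stub(text)
        ratio = len(comments) / len(text)
    else:
        ratio = comment_size_stub(text, language) / len(text)
    return ratio


def check_patched_behaviour():
    """Empty input yields 0.0; non-empty input still goes through the division."""
    for language in ("python", "java", "javascript"):
        assert get_nl_ratio_patched("", language) == 0.0
    # "# c" is 3 characters out of 10 in total.
    assert get_nl_ratio_patched("# c\nx = 1\n", "python") == 3 / 10
    # "// hi" is 5 characters out of 13 in total.
    assert get_nl_ratio_patched("// hi\nx = 1;\n", "java") == 5 / 13


if __name__ == "__main__":
    check_patched_behaviour()
    print("all checks passed")
```

Note that `if not text` is also true for `None`, so with this guard a `None` input would return 0.0 instead of failing in `len(text)`.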

starcoder.get_nl_ratio computes comment-to-code ratio as:

    if language == "python":
        comments = get_text_python(text)
        ratio = len(comments) / len(text)
    else:
        ratio = comment_size(text, language) / len(text)
    return ratio

With empty text (0 bytes), both branches divide by 0 and raise
ZeroDivisionError. This is reachable: CodeStarCoderTaggers2.predict
(code_taggers.py:253) calls get_nl_ratio(doc.text, lang) without any
try/except when lang is python / java / javascript, so an empty-source
document aborts the tagger mid-run and, with it, the current shard's
processing.

The older CodeStarCoderTaggers (v1, code_taggers.py:209) does wrap the
call in a bare except, so it silently coerces the crash into nl_ratio =
-1.0, but v2 has no such safety net.

Returning 0.0 for empty text is the natural zero-comment-zero-code
answer and keeps the tagger's output schema consistent: any downstream
filter that relies on the ratio score sees a well-defined float instead
of a missing / erroring span. No behaviour change for non-empty text.