Fix get_nl_ratio ZeroDivisionError on empty document text by Chessing234 · Pull Request #291 · allenai/dolma

Chessing234 · 2026-04-20T06:16:35Z

Bug

python/dolma/taggers/code/starcoder.py::get_nl_ratio computes the comment-to-code ratio as:

def get_nl_ratio(text, language):
    """get the ratio of comments to code in a program"""
    if language == "python":
        comments = get_text_python(text)
        ratio = len(comments) / len(text)
    else:
        ratio = comment_size(text, language) / len(text)
    return ratio

When text is empty (0 bytes), len(text) is 0, so both branches divide by zero and raise ZeroDivisionError.
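The failure can be reproduced in isolation with a minimal sketch. It does not import dolma: comment_chars_stub is a hypothetical stand-in for get_text_python / comment_size, and get_nl_ratio_unpatched only mirrors the shape of the original function (a division by len(text) with no guard).

```python
def comment_chars_stub(text):
    """Hypothetical stand-in for dolma's comment extractors.

    Counts the characters on lines that start with '#'. The real
    helpers (get_text_python, comment_size) are not reproduced here.
    """
    return sum(
        len(line) for line in text.splitlines() if line.lstrip().startswith("#")
    )


def get_nl_ratio_unpatched(text):
    # Same shape as the pre-fix function: no guard before the division.
    return comment_chars_stub(text) / len(text)


def raises_zero_division(text):
    """Return True if the unpatched ratio raises ZeroDivisionError."""
    try:
        get_nl_ratio_unpatched(text)
    except ZeroDivisionError:
        return True
    return False


print(raises_zero_division(""))              # empty document
print(raises_zero_division("# hi\nx = 1\n"))  # non-empty document
```

Only the empty string triggers the exception. Any non-empty input has a non-zero denominator.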

This is reachable from CodeStarCoderTaggers2.predict in python/dolma/taggers/code/code_taggers.py:253, which calls get_nl_ratio(doc.text, lang) without a try / except when the detected language is Python/Java/JavaScript. An empty-source document (e.g. a 0-byte file that made it through earlier filters, or a document whose body got stripped upstream) therefore aborts the tagger mid-run and takes the current shard's processing with it.

The older CodeStarCoderTaggers (v1, code_taggers.py:209) does wrap the call in try: ... except: nl_ratio = -1.0, so v1 silently coerces the crash into a sentinel; v2 has no such safety net.
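The difference between the two call sites can be sketched as follows. This is an illustration under assumptions, not the code in code_taggers.py: unguarded_ratio, v1_style_call, and v2_style_call are hypothetical names, and the sketch catches Exception where the PR describes a bare except.

```python
def unguarded_ratio(text):
    # Hypothetical stand-in for the pre-fix get_nl_ratio:
    # the numerator is irrelevant here, only the division matters.
    return 0 / len(text)


def v1_style_call(text):
    """Mimics the v1 tagger: any failure becomes the -1.0 sentinel."""
    try:
        return unguarded_ratio(text)
    except Exception:  # the PR describes a bare `except:` at this point
        return -1.0


def v2_style_call(text):
    """Mimics the v2 tagger: the call is made directly, errors propagate."""
    return unguarded_ratio(text)


print(v1_style_call(""))  # sentinel value, no exception

try:
    v2_style_call("")
except ZeroDivisionError as exc:
    print("v2-style call raised:", type(exc).__name__)
```

The v1 pattern hides the bug behind a magic number, while the v2 pattern surfaces it as a crash. Guarding inside get_nl_ratio itself addresses both call sites at once.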

Fix

Short-circuit in get_nl_ratio for empty text — before any division — returning 0.0:

 def get_nl_ratio(text, language):
     """get the ratio of comments to code in a program"""
+    if not text:
+        return 0.0
     if language == "python":
         comments = get_text_python(text)
         ratio = len(comments) / len(text)
     else:
         ratio = comment_size(text, language) / len(text)
     return ratio

Empty text → 0.0: the natural "zero comments over zero code" answer. It keeps the tagger's output schema consistent: downstream filters that check nl_ratio_doc / code_to_comment_ratio spans see a well-defined float instead of a missing or erroring span, or v1's -1.0 magic number.
Non-empty text: unchanged. Both language branches execute exactly as before.

Two added lines, no other changes.
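A regression check for the guard could look like the sketch below. It is self-contained and does not import dolma: comment_chars_stub is a hypothetical stand-in for the real comment extractors, and get_nl_ratio_patched only mirrors the patched control flow (early return on empty text, then the unchanged division).

```python
def comment_chars_stub(text):
    """Hypothetical stand-in: counts characters on lines starting with '#'."""
    return sum(
        len(line) for line in text.splitlines() if line.lstrip().startswith("#")
    )


def get_nl_ratio_patched(text):
    # The two added lines from the diff: short-circuit before any division.
    if not text:
        return 0.0
    return comment_chars_stub(text) / len(text)


def check_patched_behaviour():
    """Assert the two properties claimed for the fix."""
    # Empty text yields a well-defined float instead of raising.
    empty = get_nl_ratio_patched("")
    assert isinstance(empty, float) and empty == 0.0

    # Non-empty text is computed exactly as before the patch.
    sample = "# hi\nx = 1\n"  # 4 comment characters out of 11 total
    assert get_nl_ratio_patched(sample) == 4 / 11
    return True


print(check_patched_behaviour())
```

Because `not text` is true only for the empty string (or None), the guard cannot change the result for any document that has at least one character.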


Chessing234 wants to merge 1 commit into allenai:main from Chessing234:fix/get-nl-ratio-zero-division-empty-text
