docs: clarify temporal deduplication strategies and document types (Issue #267) #276
Open
ada-ggf25 wants to merge 3 commits into allenai:main from
Describe temporal deduplication behaviour when sequencing dedupe runs and reusing Bloom filters. Clarify the role of document structure, timestamps and document types, and add an example temporal paragraph-level deduplication configuration with key points.
Update the main documentation index to signal that the deduplication page now covers temporal strategies and document type handling.
docs: clarify temporal deduplication strategies and document types
Fixes #267
Summary
This PR improves the Dolma documentation around temporal deduplication and document types, addressing conceptual questions raised in allenai/dolma#267. It explains how temporal behaviour emerges from the existing deduplication and mixer pipeline, and how different document categories can be handled in practice.
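As a rough illustration of how temporal behaviour can emerge from processing order alone, here is a minimal sketch. It is not Dolma's implementation: a plain Python `set` stands in for the persisted Bloom filter (so there are no false positives), and the names `paragraph_key`, `dedupe_snapshot`, and `run`, the snapshot labels, and the sample documents are all hypothetical.

```python
import hashlib


def paragraph_key(paragraph: str) -> str:
    """Hash a whitespace-normalised paragraph (stand-in for a Bloom filter entry)."""
    normalised = " ".join(paragraph.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def dedupe_snapshot(docs, seen):
    """Flag paragraphs already in `seen`; add unseen paragraphs to `seen` in place."""
    flagged = {}
    for doc in docs:
        dupes = []
        for paragraph in doc["text"].split("\n"):
            if not paragraph.strip():
                continue  # skip empty paragraphs
            key = paragraph_key(paragraph)
            if key in seen:
                dupes.append(paragraph)
            else:
                seen.add(key)
        flagged[doc["id"]] = dupes
    return flagged


def run(snapshots, order):
    """Process snapshots in `order`, sharing one `seen` set across all of them."""
    seen = set()  # assumption: plays the role of a filter file reused between runs
    return {name: dedupe_snapshot(snapshots[name], seen) for name in order}


# Hypothetical monthly snapshots, one document each.
snapshots = {
    "2019-08": [{"id": "a", "text": "Shared notice.\nAugust news."}],
    "2019-09": [{"id": "b", "text": "Shared notice.\nSeptember news."}],
    "2019-10": [{"id": "c", "text": "September news.\nOctober news."}],
}

oldest_first = run(snapshots, ["2019-08", "2019-09", "2019-10"])
newest_first = run(snapshots, ["2019-10", "2019-09", "2019-08"])
```

With the oldest snapshot processed first, the earliest copy of each paragraph is kept and later copies are flagged. Reversing the order flags the older copies instead, which is the sense in which processing order acts as the temporal policy.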
Motivation
In issue #267, users asked:
The existing documentation briefly covers deduplication and the document format, but it does not make the temporal aspects or the role of document types explicit. This PR fills that gap using the current implementation behaviour, without changing any code.
What this PR changes
docs/deduplication.md
- Adds a new section “Temporal deduplication and document types” that:
  - 2019-08, 2019-09, 2019-10) are processed.
  - `source` (top-level) for high-level source/category.
  - `added`/`created` (where present) for acquisition and creation times.
  - `metadata` fields for more fine-grained document-type information.
  - `documents` list order encodes the temporal policy.
  - `bloom_filter.file` across snapshots leads to temporal deduplication.
- Adds a short reference at the top of the file pointing to `data-format.md` for background on the Dolma document structure, so the new section has a clear foundation.
docs/README.md
- Deduplication (including temporal strategies and document types)
Implementation notes
- `bff_duplicate_paragraph_spans` and Bloom filter parameters similar to the current docs).
Testing
- `docs/deduplication.md` and `docs/README.md` in a Markdown preview to check:
  - `data-format.md`, `mixer.md`) render correctly.
No automated tests are affected, as this is a documentation-only change.
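The field roles listed under “What this PR changes” (`source`, `added`/`created`, `metadata`) can be illustrated with a small sketch that orders documents by time and groups them by type. It assumes ISO 8601 timestamps with an explicit UTC offset or a trailing `Z`. The `doc_type` metadata key, the helper names, and the sample documents are hypothetical and are not part of Dolma.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Documents without any timestamp sort last.
FAR_FUTURE = datetime.max.replace(tzinfo=timezone.utc)


def temporal_key(doc):
    """Prefer `created`, fall back to `added`, else sort the document last."""
    raw = doc.get("created") or doc.get("added")
    if raw is None:
        return FAR_FUTURE
    # Replace a trailing "Z" so older Python versions can parse the value.
    return datetime.fromisoformat(raw.replace("Z", "+00:00"))


def type_key(doc):
    """Combine top-level `source` with a hypothetical `metadata.doc_type` field."""
    metadata = doc.get("metadata") or {}
    return (doc["source"], metadata.get("doc_type", "unknown"))


docs = [
    {"id": "n1", "text": "...", "source": "news",
     "created": "2019-09-03T00:00:00Z", "metadata": {"doc_type": "article"}},
    {"id": "w1", "text": "...", "source": "web",
     "added": "2019-08-15T00:00:00Z"},
    {"id": "n2", "text": "...", "source": "news",
     "created": "2019-08-01T00:00:00Z", "added": "2019-10-01T00:00:00Z"},
]

# Oldest first, so earlier documents would be processed before later ones.
ordered_ids = [doc["id"] for doc in sorted(docs, key=temporal_key)]

# Group ids by (source, doc_type) so each category can be handled separately.
by_type = defaultdict(list)
for doc in docs:
    by_type[type_key(doc)].append(doc["id"])
```

Preferring `created` over `added` is a design choice made for this sketch only; a pipeline that cares about acquisition time rather than creation time would reverse the fallback.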