2 changes: 1 addition & 1 deletion docs/README.md
@@ -38,7 +38,7 @@ To read Dolma toolkit's documentation, visit the following pages:
- [Getting Started](getting-started.md)
- [Data Format](data-format.md)
- [Taggers](taggers.md)
- [Deduplication](deduplication.md)
- [Deduplication](deduplication.md) (including temporal strategies and document types)
- [Mixer](mixer.md)
- [Tokenization](tokenize.md)
- [Writing a Parallel Processor](parallel-processor.md)
118 changes: 118 additions & 0 deletions docs/deduplication.md
@@ -12,6 +12,8 @@ Dropping any documents that are identified as duplicates, or deleting the duplic

See sample config files [dedupe-by-url.json](examples/dedupe-by-url.json) and [dedupe-paragraphs.json](examples/dedupe-paragraphs.json).

For an overview of the Dolma document format (including timestamps and metadata fields that are useful for temporal strategies), see the [data format documentation](data-format.md).

## Parameters

The following parameters are supported either via CLI (e.g. `dolma dedupe --parameter.name value`) or via config file (e.g. `dolma -c config.json dedupe`, where `config.json` contains `{"parameter": {"name": "value"}}`):
@@ -45,3 +47,119 @@ If running with lots of parallelism, you might need to increase the number of op
```shell
ulimit -n 65536
```

## Temporal deduplication and document types

In many curation pipelines, Dolma is applied repeatedly over **multiple temporal snapshots** of the same underlying source (for example, monthly crawls of the web, or periodically updated copies of a code or paper corpus). In these settings, it is often desirable to:

- keep **only one copy** of each document across all snapshots, and
- choose **which copy to keep** based on time (for example, preferring the most recent crawl) or document type.

Dolma does not have a separate “temporal deduper”; instead, temporal behaviour is obtained by **reusing the same Bloom filter across runs** and by controlling the **order in which snapshots are processed**.

### Document structure and timestamps

Dolma documents follow the unified format described in [data-format.md](data-format.md). Two fields are particularly relevant for temporal strategies:

- `source`: identifies the high-level data source (for example, `common-crawl`, `github`, `s2ag`).
- `added` / `created` (optional): timestamps indicating when AI2 acquired the document and when the original document was created (where available).
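
For concreteness, a single document line might look like the following (all values are hypothetical; see the data format documentation for the authoritative schema):

```json
{
  "id": "doc-0001",
  "text": "Example page text...",
  "source": "common-crawl",
  "added": "2019-09-15T00:00:00Z",
  "created": "2019-08-02T00:00:00Z",
  "metadata": {"url": "https://example.com/page"}
}
```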

Temporal policies are usually expressed in terms of:

- **which snapshots to consider** (for example, directories such as `documents/2019-08/`, `documents/2019-09/`), and
- **which document to keep** when multiple snapshots contain equivalent content.

The deduper itself is agnostic to timestamps; it only sees the stream of documents and the state of the Bloom filter. Temporal behaviour comes from how you **sequence your runs**.

### Basic temporal strategy: “keep newest” or “keep oldest”

Suppose you have a directory structure like the one recommended in [data-format.md](data-format.md):

```plain-text
dataset-name/
    documents/
        2019-08/
        2019-09/
        2019-10/
```

You can implement simple temporal policies as follows:

- **Keep the newest copy of each document**
  - Create a single Bloom filter file (for example, `bloom_filter.file = "bloom_filters/web.bin"`).
  - Process snapshots **from newest to oldest**, reusing the same Bloom filter file each time.
  - The first time a document (or paragraph) is seen, its key is inserted into the Bloom filter and it is left unmarked. When the deduper later encounters the same content in an older snapshot, that copy is treated as a duplicate and marked in attributes.

- **Keep the oldest copy of each document**
  - Use the same single Bloom filter file.
  - Process snapshots **from oldest to newest**.
  - The first copy you see is considered canonical; later copies are marked as duplicates when encountered.

In both cases, temporal deduplication is entirely controlled by:

- the **order in which you invoke** `dolma dedupe`, and
- the fact that the **Bloom filter file persists** and accumulates keys across invocations (with `bloom_filter.read_only = false`).
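
The mechanics above can be sketched with a small simulation, where a plain Python set plays the role of the persisted Bloom filter (snapshot names and contents are made up; this is not dolma's implementation):

```python
# A set stands in for the Bloom filter file that persists across runs,
# to show how run order alone decides which copy of a document survives.
snapshots = {
    "2019-08": {"example.com/a": "a-v1", "example.com/b": "b-v1"},
    "2019-09": {"example.com/a": "a-v2"},
    "2019-10": {"example.com/a": "a-v3", "example.com/b": "b-v2"},
}

def replay_dedupe(order):
    """Replay consecutive dedupe runs over `order`, sharing one filter."""
    seen = set()   # stands in for the Bloom filter persisted across runs
    kept = {}
    for snapshot in order:
        for key, text in snapshots[snapshot].items():
            if key not in seen:   # first sighting is canonical
                seen.add(key)
                kept[key] = text
            # later sightings would only be marked as duplicates in attributes
    return kept

# Newest-to-oldest keeps the newest copy; oldest-to-newest keeps the oldest.
print(replay_dedupe(["2019-10", "2019-09", "2019-08"])["example.com/a"])  # a-v3
print(replay_dedupe(["2019-08", "2019-09", "2019-10"])["example.com/a"])  # a-v1
```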

### Document types and multiple filters

Real-world corpora often contain multiple **document types** (for example, web pages, academic articles, code repositories) that you may or may not want to deduplicate against one another. Dolma does not prescribe a specific notion of document type; instead, it is usually encoded via:

- the `source` field on each document, and/or
- fields inside `metadata` (for example, `metadata.doc_type`, `metadata.source_dataset`).

You can express different policies by choosing **how many Bloom filters to maintain**:

- **Shared filter across types**
  - Use a **single** `bloom_filter.file` for all relevant document types.
  - All documents that hash to the same key (for example, by URL in `metadata.url` or by paragraph content) will be considered duplicates, even if they come from different sources.

- **Per-type filters**
  - Use a **separate** `bloom_filter.file` for each type or source (for example, one for `common-crawl`, one for `s2ag`).
  - This prevents cross-type deduplication while still deduplicating within each type over time.

The choice depends on whether, for a given corpus, you want to treat the same content found in multiple places (for example, a paper PDF vs a preprint) as duplicates or as distinct documents.
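
The difference between the two setups can be sketched as follows, again with plain sets standing in for Bloom filters (sources and URLs are hypothetical):

```python
# One stand-in filter per key: a single shared key means cross-type
# deduplication; keying by source keeps each type's duplicates separate.
from collections import defaultdict

def dedupe(docs, shared):
    filters = defaultdict(set)
    kept = []
    for doc in docs:
        f = filters["all"] if shared else filters[doc["source"]]
        if doc["url"] not in f:
            f.add(doc["url"])
            kept.append(doc)
    return kept

docs = [  # the same hypothetical paper seen on the web and in a papers corpus
    {"source": "common-crawl", "url": "arxiv.org/abs/1234.5678"},
    {"source": "s2ag", "url": "arxiv.org/abs/1234.5678"},
]

print(len(dedupe(docs, shared=True)))   # 1: cross-type duplicates collapse
print(len(dedupe(docs, shared=False)))  # 2: each type keeps its own copy
```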

### Example: temporal paragraph deduplication by URL

The following (simplified) configuration illustrates a temporal paragraph-level deduplication strategy where we:

- deduplicate paragraphs based on content,
- reuse a single Bloom filter across monthly web snapshots, and
- treat all snapshots as a single logical stream.

```json
{
  "documents": [
    "dataset-name/documents/2024-03/*.jsonl.gz",
    "dataset-name/documents/2024-02/*.jsonl.gz",
    "dataset-name/documents/2024-01/*.jsonl.gz"
  ],
  "dedupe": {
    "name": "paragraph_duplicates_temporal",
    "paragraphs": {
      "attribute_name": "bff_duplicate_paragraph_spans"
    },
    "skip_empty": true,
    "min_length": 0,
    "min_words": 0
  },
  "bloom_filter": {
    "file": "bloom_filters/web_paragraphs_temporal.bin",
    "read_only": false,
    "estimated_doc_count": 6000000,
    "desired_false_positive_rate": 1e-4
  },
  "processes": 16
}
```

Key points:

- The **order of paths** in `documents` reflects the temporal policy (here, newest first, so newer snapshots are treated as canonical).
- The **same Bloom filter file** is used across all snapshots, so once a paragraph has been inserted, later encounters of the same content will be marked as duplicates.
- Document type distinctions (if any) are determined by `source` or `metadata` fields and can be further exploited in downstream [mixer](mixer.md) configs (for example, by filtering or weighting documents differently).

For more complex temporal policies (for example, favouring newer web pages but keeping older scientific articles), you can combine:

- separate runs of `dolma dedupe` with different `bloom_filter.file` values, and
- filtering by `source` or `metadata` fields in subsequent `dolma mix` configurations.
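
As a sketch, the article-specific run might reuse the structure of the earlier example while pointing at its own filter file (all paths and counts are placeholders, not a recommended configuration):

```json
{
  "documents": ["s2ag/documents/*/*.jsonl.gz"],
  "dedupe": {
    "name": "paragraph_duplicates_temporal",
    "paragraphs": {
      "attribute_name": "bff_duplicate_paragraph_spans"
    },
    "skip_empty": true
  },
  "bloom_filter": {
    "file": "bloom_filters/s2ag_paragraphs.bin",
    "read_only": false,
    "estimated_doc_count": 1000000,
    "desired_false_positive_rate": 1e-4
  },
  "processes": 16
}
```

Because this run never touches `bloom_filters/web.bin` (or whatever file the web run uses), the two policies cannot interfere with each other.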