6 changes: 2 additions & 4 deletions docs/data-format.md
@@ -85,7 +85,7 @@ These are flat JSONs that look like:

where the `source` and `id` keys uniquely identify which document carries these attributes.

-The mixer create a unified `attributes` dictionary by merging all of the individual `attributes` dictionaries.
+The mixer creates a unified `attributes` dictionary by merging all of the individual `attributes` dictionaries.

Note that it's very important that the `*.jsonl.gz` files for attributes line up exactly (same number of rows, same sort order) with the `*.jsonl.gz` files for the associated documents. It'll save us a lot of headache in the future.
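That alignment can be verified with a short sketch like the following (an illustrative helper, not part of Dolma; it assumes gzip'ed JSONL files whose rows carry `source` and `id` keys as described above):

```python
# Verify that an attributes file lines up row-by-row with its documents file.
# A sketch only: the function name and error messages are illustrative.
import gzip
import json
from itertools import zip_longest

def check_alignment(docs_path: str, attrs_path: str) -> None:
    with gzip.open(docs_path, "rt") as docs, gzip.open(attrs_path, "rt") as attrs:
        for i, (doc_line, attr_line) in enumerate(zip_longest(docs, attrs)):
            # zip_longest pads the shorter file with None, catching row-count drift
            assert doc_line is not None and attr_line is not None, f"row count differs at row {i}"
            doc, attr = json.loads(doc_line), json.loads(attr_line)
            assert (doc["source"], doc["id"]) == (attr["source"], attr["id"]), f"row {i} out of order"
```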

@@ -109,6 +109,4 @@ Each attribute can have one or more scores associated with it; in the example ab
For each paragraph, the tuple indicates the start and end indices of the paragraph, and the score associated with it.

The idea that we're going with is that attributes identify spans of text within a document that might be problematic.
-These signal get cached during tagging and allow for "building" of the dataset to happen as a configuration afterwards. so for example, given signal data like this, we might try different confidence thresholds on mean_word_length when creating final data mixture
-how does your signals data look?
-}
+These signals get cached during tagging and allow for "building" of the dataset to happen as a configuration afterwards. For example, given signal data like this, we might try different confidence thresholds on mean_word_length when creating the final data mixture.
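A minimal sketch of that thresholding idea (the attribute name and cutoff below are illustrative, not part of the format):

```python
# Keep only the paragraphs whose cached span score meets a threshold.
# `attributes` maps an attribute name to a list of [start, end, score]
# spans, as in the example above; name and threshold are illustrative.

def keep_paragraphs(text, attributes, name="mean_word_length", threshold=4.0):
    """Return the paragraph substrings whose span score meets the threshold."""
    kept = []
    for start, end, score in attributes.get(name, []):
        if score >= threshold:
            kept.append(text[start:end])
    return kept
```

Because the spans are cached, re-running this with a different threshold rebuilds the mixture without re-tagging.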
2 changes: 1 addition & 1 deletion docs/getting-started.md
@@ -83,7 +83,7 @@ dolma tag \
ft_lang_id_en_paragraph_with_doc_score_v2 \
char_length_with_paragraphs_v1 \
whitespace_tokenizer_with_paragraphs_v1 \
---processes 16 # run on 96 cores
+--processes 16
```

To learn more about the taggers, see the [taggers documentation](taggers.md).
2 changes: 1 addition & 1 deletion docs/mixer.md
@@ -25,7 +25,7 @@ The following parameters are supported either via CLI (e.g. `dolma mix --paramet
|`streams[].span_replacement`|No| A list of objects specifying spans of text to be replaced. |
|`streams[].span_replacement[].span`|No| A json-path expression for an attribute that contains an array of spans. Each span should be a list of length three: `[start, end, score]`. |
|`streams[].span_replacement[].min_score`|No| If the span score is less than this value, the span will not be replaced. |
-|`streams[].span_replacement[].replacement`|No| The text that should be inserted in place of the span. Use `{}` to represent the original text. Field selection from the document is also supported by prefixing a jq selector with `$`. Note: Escape a leading $ if you do not with to use jq selector pattern. |
+|`streams[].span_replacement[].replacement`|No| The text that should be inserted in place of the span. Use `{}` to represent the original text. Field selection from the document is also supported by prefixing a jq selector with `$`. Note: Escape a leading $ if you do not wish to use jq selector pattern. |
|`work_dir.input`|No| Path to a local scratch directory where temporary input files can be placed. If not provided, Dolma will make one for you and delete it upon completion. |
|`work_dir.output`|No| Path to a local scratch directory where temporary output files can be placed. If not provided, Dolma will make one for you and delete it upon completion. |
|`processes`|No| Number of processes to use for mixing. By default 1 process is used. |
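The span-replacement rules above can be sketched as follows (simplified: the real mixer resolves span arrays via json-path expressions and supports `$`-prefixed jq selectors in replacements, which this sketch omits):

```python
# Replace flagged spans in `text` with `replacement`, skipping spans whose
# score falls below `min_score` (matching the table: a span scored below
# min_score is left untouched). `{}` in the replacement stands for the
# original span text. Spans are [start, end, score] and are applied
# right-to-left so earlier offsets stay valid after each substitution.

def replace_spans(text, spans, replacement="", min_score=0.5):
    for start, end, score in sorted(spans, reverse=True):
        if score >= min_score:
            text = text[:start] + replacement.replace("{}", text[start:end]) + text[end:]
    return text
```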
2 changes: 1 addition & 1 deletion docs/parallel-processor.md
@@ -55,7 +55,7 @@ class CustomParallelProcessor(BaseParallelProcessor):
...
```

-Let's dive a bit deeper into one might implement the `process_single` method in the case of removing empty documents.
+Let's dive a bit deeper into how one might implement the `process_single` method in the case of removing empty documents.
We assume `source_path` is a path to either a local or remote JSONL gzip'ed file, and use `smart_open` to deal with that.

```python
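# NOTE: the body of this example is truncated in this view; below is a
# minimal sketch of what `process_single` might look like (signature
# simplified; Dolma's version also receives a progress queue and kwargs).
# `gzip.open` stands in for `smart_open.open`, which additionally handles
# remote paths.
import gzip
import json

def process_single(source_path: str, destination_path: str) -> None:
    with gzip.open(source_path, "rt") as src, gzip.open(destination_path, "wt") as dst:
        for line in src:
            document = json.loads(line)
            if document.get("text", "").strip():  # keep only non-empty documents
                dst.write(json.dumps(document) + "\n")
```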
8 changes: 4 additions & 4 deletions docs/taggers.md
@@ -60,16 +60,16 @@ A list of built-in taggers can be obtained by running `dolma list` command. At t
| `jigsaw_hatespeech_sentence_v2` | Tags spans of documents as containing hate speech or not using a FastText classifier trained on the [Jigsaw](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification) hate speech dataset. |
| `jigsaw_nsfw_document_v1` | Tags documents as containing NSFW content or not using a FastText classifier trained on the [Jigsaw](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification) NSFW dataset. |
| `jigsaw_nsfw_sentence_v2` | Tags spans of documents as containing NSFW content or not using a FastText classifier trained on the [Jigsaw](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification) NSFW dataset. |
-| `olmo_pretokenizer_v1` | Count the number of tokens in each document using pre-tokenizer used by [OLMo v1](https://allenai.org/olmo), which is a the same as [GPT Neo-X 20B](https://huggingface.co/EleutherAI/gpt-neox-20b). |
-| `olmo_pretokenizer_with_paragraphs_v1` | Count the number of tokens in each document and each paragraph using pre-tokenizer used by [OLMo v1](https://allenai.org/olmo), which is a the same as [GPT Neo-X 20B](https://huggingface.co/EleutherAI/gpt-neox-20b). |
+| `olmo_pretokenizer_v1` | Count the number of tokens in each document using pre-tokenizer used by [OLMo v1](https://allenai.org/olmo), which is the same as [GPT Neo-X 20B](https://huggingface.co/EleutherAI/gpt-neox-20b). |
+| `olmo_pretokenizer_with_paragraphs_v1` | Count the number of tokens in each document and each paragraph using pre-tokenizer used by [OLMo v1](https://allenai.org/olmo), which is the same as [GPT Neo-X 20B](https://huggingface.co/EleutherAI/gpt-neox-20b). |
| `pii_presidio_v1` | Tags spans of documents that contain personally identifiable information (PII) using the [Presidio Analyzer](https://microsoft.github.io/presidio/analyzer/) library. |
| `pii_regex_v1` | Tags spans of documents that contain personally identifiable information (PII) using a set of regular expressions. |
| `pii_regex_v2` | Faster implementation of `pii_regex_v1`. |
| `pii_regex_with_counts_v2` | Tags spans of documents that contain personally identifiable information (PII) using a set of regular expressions. It also counts the number of matches for each regular expression. |
| `pii_regex_with_counts_fast_v2` | Faster implementation of `pii_regex_with_counts_v2`. |
| `random_number_v1` | Assigns a random number to each document. This allows us to split the dataset into train, validation, and test sets. |
-| `uniseg_length_paragraphs_v1` | Count the number of [unicode "words" (grapheme clusers)](https://www.unicode.org/reports/tr29/) in each paragraph. |
-| `uniseg_length_paragraphs_with_doc_length_v1` | Count the number of [unicode "words" (grapheme clusers)](https://www.unicode.org/reports/tr29/) in each paragraph and the document. |
+| `uniseg_length_paragraphs_v1` | Count the number of [unicode "words" (grapheme clusters)](https://www.unicode.org/reports/tr29/) in each paragraph. |
+| `uniseg_length_paragraphs_with_doc_length_v1` | Count the number of [unicode "words" (grapheme clusters)](https://www.unicode.org/reports/tr29/) in each paragraph and the document. |
| `whitespace_tokenizer_v1` | Count the number of whitespace-separated tokens in each document. |
| `whitespace_tokenizer_with_paragraphs_v1` | Count the number of whitespace-separated tokens in each document and each paragraph. |

2 changes: 1 addition & 1 deletion docs/tokenize.md
@@ -34,7 +34,7 @@ The following parameters are supported either via CLI (e.g. `dolma tokens --para
|`documents`|Yes| One or more paths for input document files. Paths can contain arbitrary wildcards. Can be local, or an S3-compatible cloud path. |
|`destination`|Yes| One or more paths for output files. Should match number of `documents` paths. Can be local, or an S3-compatible cloud path. |
|`tokenizer.name_or_path`|Yes| Name or path of the tokenizer to use. Must be a HuggingFace-compatible tokenizer. |
-| `tokenzier.bos_token_id`| Yes if `tokenizer.eos_token_id` is missing | The id of the beginning-of-sequence token. |
+| `tokenizer.bos_token_id`| Yes if `tokenizer.eos_token_id` is missing | The id of the beginning-of-sequence token. |
| `tokenizer.eos_token_id`| Yes if `tokenizer.bos_token_id` is missing | The id of the end-of-sequence token. |
| `tokenizer.pad_token_id`| No | The id of the padding token. |
| `tokenizer.segment_before_tokenization`| No | Whether to segment documents by paragraph before tokenization. This is useful for tokenizers like Llama that are very slow on long documents. Might not be needed once [this bugfix is merged](https://github.com/huggingface/tokenizers/pull/1413). Defaults to False.|