diff --git a/docs/data-format.md b/docs/data-format.md
index 2848ff1b..9a4464a5 100644
--- a/docs/data-format.md
+++ b/docs/data-format.md
@@ -85,7 +85,7 @@ These are flat JSONs that look like:
 
 where the `source` and `id` keys uniquely identify which document carries these attributes.
 
-The mixer create a unified `attributes` dictionary by merging all of the individual `attributes` dictionaries.
+The mixer creates a unified `attributes` dictionary by merging all of the individual `attributes` dictionaries.
 Note that it's very important that the `*.jsonl.gz` files for attributes lines up exactly (same number of rows, same sort order) with the `*.jsonl.gz` files for the associated documents.
 It'll save us a lot of headache in the future.
 
@@ -109,6 +109,4 @@ Each attribute can have one or more scores associated with it; in the example ab
 For each paragraph, the tuple indicate the start and end index of the paragraph, and the score associated with it.
 
 The idea that we're going with is that attributes identify spans of text within a document that might be problematic.
-These signal get cached during tagging and allow for "building" of the dataset to happen as a configuration afterwards. so for example, given signal data like this, we might try different confidence thresholds on mean_word_length when creating final data mixture
-how does your signals data look?
-}
+These signals get cached during tagging and allow for "building" of the dataset to happen as a configuration afterwards. For example, given signal data like this, we might try different confidence thresholds on mean_word_length when creating the final data mixture.
diff --git a/docs/getting-started.md b/docs/getting-started.md
index 3f053340..729241ad 100644
--- a/docs/getting-started.md
+++ b/docs/getting-started.md
@@ -83,7 +83,7 @@ dolma tag \
     ft_lang_id_en_paragraph_with_doc_score_v2 \
     char_length_with_paragraphs_v1 \
     whitespace_tokenizer_with_paragraphs_v1 \
-    --processes 16 # run on 96 cores
+    --processes 16
 ```
 
 To learn more about the taggers, see the [taggers documentation](taggers.md).
diff --git a/docs/mixer.md b/docs/mixer.md
index 6a20add0..6dbdc8ba 100644
--- a/docs/mixer.md
+++ b/docs/mixer.md
@@ -25,7 +25,7 @@ The following parameters are supported either via CLI (e.g. `dolma mix --paramet
 |`streams[].span_replacement`|No| A list of objects specifying spans of text to be replaced. |
 |`streams[].span_replacement[].span`|No| A json-path expression for an attribute that contains an array of spans. Each span should be list of length three: `[start, end, score]`. |
 |`streams[].span_replacement[].min_score`|No| If the span score is less than this value, the span will not be replaced. |
-|`streams[].span_replacement[].replacement`|No| The text that should be inserted in place of the span. Use `{}` to represent the original text. Field selection from the document is also supported by prefixing a jq selector with `$`. Note: Escape a leading $ if you do not with to use jq selector pattern. |
+|`streams[].span_replacement[].replacement`|No| The text that should be inserted in place of the span. Use `{}` to represent the original text. Field selection from the document is also supported by prefixing a jq selector with `$`. Note: Escape a leading $ if you do not wish to use jq selector pattern. |
 |`work_dir.input`|No| Path to a local scratch directory where temporary input files can be placed. If not provided, Dolma will make one for you and delete it upon completion. |
 |`work_dir.output`|No| Path to a local scratch directory where temporary output files can be placed. If not provided, Dolma will make one for you and delete it upon completion. |
 |`processes`|No| Number of processes to use for mixing. By default 1 process is used. |
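To make the `[start, end, score]` span format and the `span_replacement` / `min_score` behaviour above concrete, here is a minimal sketch in plain Python. It is not the mixer itself; the document, the attribute name `exp__pii_regex_v2__phone`, and the `|||PII|||` replacement string are all made up for illustration.

```python
# Hedged illustration only: the real merging and replacement is done by `dolma mix`.
# The document, attribute name, and replacement string below are invented for the example.
doc = {"id": "doc-1", "text": "My phone number is 555-0100, call me!"}
attrs = {"id": "doc-1", "attributes": {"exp__pii_regex_v2__phone": [[19, 27, 1.0]]}}


def replace_spans(text: str, spans, min_score: float = 0.5, replacement: str = "|||PII|||") -> str:
    """Replace every [start, end, score] span whose score is at least min_score."""
    # Apply spans right to left so earlier offsets stay valid after each replacement.
    for start, end, score in sorted(spans, key=lambda s: s[0], reverse=True):
        if score >= min_score:
            text = text[:start] + replacement + text[end:]
    return text


print(replace_spans(doc["text"], attrs["attributes"]["exp__pii_regex_v2__phone"]))
# -> "My phone number is |||PII|||, call me!"
```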
diff --git a/docs/parallel-processor.md b/docs/parallel-processor.md
index fa15e200..8eeec1ab 100644
--- a/docs/parallel-processor.md
+++ b/docs/parallel-processor.md
@@ -55,7 +55,7 @@ class CustomParallelProcessor(BaseParallelProcessor):
         ...
 ```
 
-Let's dive a bit deeper into one might implement the `process_single` method in the case of removing empty documents.
+Let's dive a bit deeper into how one might implement the `process_single` method in the case of removing empty documents.
 We assume `source_path` is a path to a either local or remote JSONL gzip'ed file, and use `smart_open` to deal with that.
 
 ```python
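As a rough sketch of the idea behind that `process_single` example (not Dolma's actual implementation), a standalone function that streams a JSONL file with `smart_open` and drops empty documents might look like this; the function name and return value are invented for the example.

```python
import json

import smart_open


def remove_empty_docs(source_path: str, destination_path: str) -> int:
    """Copy a JSONL file (optionally gzip'ed, local or remote), dropping rows with empty text.

    Simplified sketch of the idea only; the real `process_single` classmethod also
    reports progress and takes additional arguments.
    """
    kept = 0
    # smart_open handles local paths, s3:// URLs, and .gz compression transparently.
    with smart_open.open(source_path, "rt") as src, smart_open.open(destination_path, "wt") as dst:
        for line in src:
            row = json.loads(line)
            if row.get("text", "").strip():  # keep only non-empty documents
                dst.write(json.dumps(row) + "\n")
                kept += 1
    return kept
```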
diff --git a/docs/taggers.md b/docs/taggers.md
index 4fdc0615..7f40df4f 100644
--- a/docs/taggers.md
+++ b/docs/taggers.md
@@ -60,16 +60,16 @@ A list of built-in taggers can be obtained by running `dolma list` command. At t
 | `jigsaw_hatespeech_sentence_v2` | Tags spans of documents as containing hate speech or not using a FastText classifier trained on the [Jigsaw](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification) hate speech dataset. |
 | `jigsaw_nsfw_document_v1` | Tags documents as containing NSFW content or not using a FastText classifier trained on the [Jigsaw](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification) NSFW dataset. |
 | `jigsaw_nsfw_sentence_v2` | Tags spans of documents as containing NSFW content or not using a FastText classifier trained on the [Jigsaw](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification) NSFW dataset. |
-| `olmo_pretokenizer_v1` | Count the number of tokens in each document using pre-tokenizer used by [OLMo v1](https://allenai.org/olmo), which is a the same as [GPT Neo-X 20B](https://huggingface.co/EleutherAI/gpt-neox-20b). |
-| `olmo_pretokenizer_with_paragraphs_v1` | Count the number of tokens in each document and each paragraph using pre-tokenizer used by [OLMo v1](https://allenai.org/olmo), which is a the same as [GPT Neo-X 20B](https://huggingface.co/EleutherAI/gpt-neox-20b). |
+| `olmo_pretokenizer_v1` | Count the number of tokens in each document using pre-tokenizer used by [OLMo v1](https://allenai.org/olmo), which is the same as [GPT Neo-X 20B](https://huggingface.co/EleutherAI/gpt-neox-20b). |
+| `olmo_pretokenizer_with_paragraphs_v1` | Count the number of tokens in each document and each paragraph using pre-tokenizer used by [OLMo v1](https://allenai.org/olmo), which is the same as [GPT Neo-X 20B](https://huggingface.co/EleutherAI/gpt-neox-20b). |
 | `pii_presidio_v1` | Tags spans of documents that contain personally identifiable information (PII) using the [Presidio Analyzer](https://microsoft.github.io/presidio/analyzer/) library. |
 | `pii_regex_v1` | Tags spans of documents that contain personally identifiable information (PII) using a set of regular expressions. |
 | `pii_regex_v2` | Faster implementation of `pii_regex_v1`. |
 | `pii_regex_with_counts_v2` | Tags spans of documents that contain personally identifiable information (PII) using a set of regular expressions. It also counts the number of matches for each regular expression. |
 | `pii_regex_with_counts_fast_v2` | Faster implementation of `pii_regex_with_counts_v2`. |
 | `random_number_v1` | Assigns a random number to each document. This allows us to split the dataset into train, validation, and test sets. |
-| `uniseg_length_paragraphs_v1` | Count the number of [unicode "words" (grapheme clusers)](https://www.unicode.org/reports/tr29/) in each paragraph. |
-| `uniseg_length_paragraphs_with_doc_length_v1` | Count the number of [unicode "words" (grapheme clusers)](https://www.unicode.org/reports/tr29/) in each paragraph and the document. |
+| `uniseg_length_paragraphs_v1` | Count the number of [unicode "words" (grapheme clusters)](https://www.unicode.org/reports/tr29/) in each paragraph. |
+| `uniseg_length_paragraphs_with_doc_length_v1` | Count the number of [unicode "words" (grapheme clusters)](https://www.unicode.org/reports/tr29/) in each paragraph and the document. |
 | `whitespace_tokenizer_v1` | Count the number of whitespace-separated tokens in each document. |
 | `whitespace_tokenizer_with_paragraphs_v1` | Count the number of whitespace-separated tokens in each document and each paragraph. |
 
diff --git a/docs/tokenize.md b/docs/tokenize.md
index d9d3d19c..dc2dfa07 100644
--- a/docs/tokenize.md
+++ b/docs/tokenize.md
@@ -34,7 +34,7 @@ The following parameters are supported either via CLI (e.g. `dolma tokens --para
 |`documents`|Yes| One or more paths for input document files. Paths can contain arbitrary wildcards. Can be local, or an S3-compatible cloud path. |
 |`destination`|Yes| One or more paths for output files. Should match number of `documents` paths. Can be local, or an S3-compatible cloud path. |
 |`tokenizer.name_or_path`|Yes| Name or path of the tokenizer to use. Must be a HuggingFace-compatible tokenizer. |
-| `tokenzier.bos_token_id`| Yes if `tokenizer.eos_token_id` is missing | The id of the beginning-of-sequence token. |
+| `tokenizer.bos_token_id`| Yes if `tokenizer.eos_token_id` is missing | The id of the beginning-of-sequence token. |
 | `tokenizer.eos_token_id`| Yes if `tokenizer.bos_token_id` is missing | The id of the end-of-sequence token. |
 | `tokenizer.pad_token_id`| No | The id of the padding token. |
 | `tokenizer.segment_before_tokenization`| No | Whether to segment documents by paragraph before tokenization. This is useful for tokenizers like Llama that are very slow on long documents. Might not be needed once [this bugfix is merged](https://github.com/huggingface/tokenizers/pull/1413). Defaults to False.|
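Since `tokenizer.bos_token_id` and `tokenizer.eos_token_id` take numeric token ids rather than token strings, it can help to look the ids up from the tokenizer itself. A small, illustrative sketch, assuming the `transformers` package is installed; the GPT Neo-X 20B tokenizer is used here only because it is referenced above.

```python
# Illustrative only: one way to find the values to pass as
# tokenizer.bos_token_id / tokenizer.eos_token_id.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
print("eos_token_id:", tok.eos_token_id)  # use this value for tokenizer.eos_token_id
print("bos_token_id:", tok.bos_token_id)  # may be None; if so, supply eos_token_id instead
```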