# added pipeline sub command and update docs #1874 (merged)
Commits (11 total, all by jperez999):

- 839cb93 added pipeline sub command and update docs
- 1523e27 merge in main
- 6a1ed14 fix exception on table open
- d5d3e22 Merge branch 'main' into docs-replace
- 01b8444 fix tests and updates to main
- 2d9d71c fix greptile comments
- 7693c60 add uv.lock changes
- 0cd4dba revert uv lock
- 9cf71cc fix src uv lock to python increase
- 852a449 remove uv lock files
- df914eb fix format on file for pre-commit hooks
# Retriever CLI — Replacement Examples for `nv-ingest-cli`

This folder contains `retriever` command-line examples that deliver the same
end-user outcomes as the `nv-ingest-cli` examples in `nv-ingest/docs/`,
`nv-ingest/api/`, `nv-ingest/client/`, and `nv-ingest/deploy/`.

The original `nv-ingest-cli` documentation is **not removed** — these files sit
alongside it as a new-CLI counterpart you can link to or migrate to.

## Key shape difference

`nv-ingest-cli` is a **single command that talks to a running REST service on
`localhost:7670`** and composes work via repeated
`--task extract|split|caption|embed|dedup|filter|udf`.

`retriever` is a **multi-subcommand Typer app**. Most of the old CLI examples
map to `retriever pipeline run INPUT_PATH`, which runs the graph pipeline
locally (in-process or via Ray) and writes results to LanceDB and, optionally,
to Parquet / object storage. Other subcommands cover focused tasks:

| Old intent | New subcommand |
|------------|----------------|
| Extract + embed + store a batch of documents | `retriever pipeline run` |
| Run an ad-hoc PDF extraction stage | `retriever pdf stage` |
| Run an HTML / text / audio / chart stage | `retriever html run`, `retriever txt run`, `retriever audio extract`, `retriever chart run` |
| Upload stage output to LanceDB | `retriever vector-store stage` |
| Query LanceDB + compute recall@k | `retriever recall vdb-recall` |
| Run a QA evaluation sweep | `retriever eval run` |
| Serve / submit to the online REST API | `retriever online serve` / `retriever online stream-pdf` |
| Benchmark stage throughput | `retriever benchmark {split,extract,audio-extract,page-elements,ocr,all}` |
| Benchmark orchestration | `retriever harness {run,sweep,nightly,summary,compare}` |
## Contents

| New file | Replaces example(s) in |
|----------|------------------------|
| [`retriever_cli.md`](retriever_cli.md) | `nv-ingest/docs/docs/extraction/nv-ingest_cli.md` and the rebranded mirror `cli-reference.md` |
| [`quickstart.md`](quickstart.md) | `nv-ingest/docs/docs/extraction/quickstart-guide.md` (the `nv-ingest-cli` section) |
| [`pdf-split-tuning.md`](pdf-split-tuning.md) | `nv-ingest/docs/docs/extraction/v2-api-guide.md` (CLI example) |
| [`smoke-test.md`](smoke-test.md) | `nv-ingest/api/api_tests/smoke_test.sh` |
| [`cli-client-usage.md`](cli-client-usage.md) | `nv-ingest/client/client_examples/examples/cli_client_usage.ipynb` |
| [`pdf-blueprint.md`](pdf-blueprint.md) | `nv-ingest/deploy/pdf-blueprint.ipynb` (CLI cell) |
| [`benchmarking.md`](benchmarking.md) | `nv-ingest/docs/docs/extraction/benchmarking.md` and `nv-ingest/tools/harness/README.md` |
## Gaps with no retriever-CLI equivalent (kept out of this folder)

The following `nv-ingest-cli` examples are **not** migrated here because the
new CLI does not yet expose an equivalent — continue to use `nv-ingest-cli`
for these cases:

- `--task 'udf:{…}'` — user-defined functions
  (`nv-ingest/docs/docs/extraction/user-defined-functions.md`,
  `nv-ingest/examples/udfs/README.md`). `retriever` does not expose UDFs.
- `--task 'filter:{content_type:"image", min_size:…, min_aspect_ratio:…, max_aspect_ratio:…}'`.
  The image scale/aspect-ratio filter stage is not reproduced in the new CLI.
- Bare service submission (`nv-ingest-cli --doc foo.pdf` with no extract tasks
  and full content-type metadata returned by the service). `retriever online submit`
  is currently a stub — only `retriever online stream-pdf` is implemented.
- `gen_dataset.py` dataset creation with enumeration and sampling.
- `--collect_profiling_traces --zipkin_host --zipkin_port`. Use
  `--runtime-metrics-dir` / `--runtime-metrics-prefix` instead for a different
  metrics flavor.
## Conventions used in the examples

- Input paths assume you invoke `retriever` from the `nv-ingest/nemo_retriever`
  directory (or point at absolute paths).
- `--save-intermediate <dir>` writes the extraction DataFrame as Parquet for
  inspection. LanceDB output goes to `--lancedb-uri` (defaults to `./lancedb`).
- `--store-images-uri <uri>` stores extracted images to a local path or an
  fsspec URI (e.g. `s3://bucket/prefix`).
- `--run-mode inprocess` skips Ray and is ideal for single-file demos and CI;
  `--run-mode batch` (the default) uses Ray Data for throughput.

Run `retriever pipeline run --help` for the authoritative flag list.
# Benchmarking with the `retriever` CLI

This page is the `retriever`-CLI counterpart to
`nv-ingest/docs/docs/extraction/benchmarking.md` and
`nv-ingest/tools/harness/README.md`.

The old benchmarking workflow is driven by `tools/harness` and
`uv run nv-ingest-harness-run`. The `retriever` CLI exposes the harness (and
per-stage micro-benchmarks) as first-class subcommands, so you can run
benchmarks without `uv run` or a separate harness repo.

## Harness (end-to-end benchmarks)

Old:

```bash
cd tools/harness
uv sync
uv run nv-ingest-harness-run --case=e2e --dataset=bo767
uv run nv-ingest-harness-run --case=e2e --dataset=/path/to/your/data
```

New — the harness is a subcommand on the main CLI (full parity):

```bash
retriever harness run --case=e2e --dataset=bo767
retriever harness run --case=e2e --dataset=/path/to/your/data
```

Related commands (browse with `--help`):

```bash
retriever harness --help    # run, sweep, nightly, summary, compare
retriever harness run --help
retriever harness sweep --help
retriever harness nightly --help
retriever harness summary --help
retriever harness compare --help
```
### Harness with image / text storage

Old:

```bash
cd tools/harness
uv run nv-ingest-harness-run --dataset bo20 --preset single_gpu \
  --override store_images_uri=stored_images --override store_text=true
```

New (unchanged apart from the launcher — the flags are already the
`retriever` CLI's):

```bash
retriever harness run --dataset bo20 --preset single_gpu \
  --override store_images_uri=stored_images --override store_text=true
```

When `store_images_uri` is a relative path it resolves to
`artifact_dir/stored_images/` per run; absolute paths and fsspec URIs
(e.g. `s3://bucket/prefix`) are passed through unchanged.
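The resolution rule above can be sketched in Python. This is a minimal illustration of the stated behavior, not the harness's actual code; `resolve_store_images_uri` and its arguments are hypothetical names:

```python
import os
from urllib.parse import urlparse


def resolve_store_images_uri(uri: str, artifact_dir: str) -> str:
    # fsspec-style URIs (s3://, gs://, ...) pass through unchanged.
    if urlparse(uri).scheme:
        return uri
    # Absolute local paths also pass through unchanged.
    if os.path.isabs(uri):
        return uri
    # Relative paths resolve under the per-run artifact directory.
    return os.path.join(artifact_dir, uri)
```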
## Per-stage micro-benchmarks

The new CLI also exposes stage-level throughput benchmarks that had no direct
counterpart in `nv-ingest-cli`:

```bash
retriever benchmark --help    # split, extract, audio-extract, page-elements, ocr, all
retriever benchmark split --help
retriever benchmark extract --help
retriever benchmark audio-extract --help
retriever benchmark page-elements --help
retriever benchmark ocr --help
retriever benchmark all --help
```

Example — benchmark the PDF extraction actor:

```bash
retriever benchmark extract ./data/pdf_corpus \
  --pdf-extract-batch-size 8 \
  --pdf-extract-actors 4
```

Each benchmark reports rows/sec (or chunk rows/sec for audio) for its actor.
Use these when you want focused numbers for a single stage instead of an
end-to-end run.
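Rows/sec here is processed rows divided by wall-clock time. A minimal sketch of that measurement (a hypothetical helper, not the benchmark's internals):

```python
import time


def rows_per_sec(stage_fn, rows):
    # Run one stage function over a batch of rows and report throughput.
    start = time.perf_counter()
    stage_fn(rows)
    elapsed = time.perf_counter() - start
    return len(rows) / elapsed
```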
## Parity notes

- The harness use-cases in the old docs (`--case=e2e`, `--dataset=bo767`,
  `--dataset=/path/...`, `--override ...`) are preserved verbatim — only the
  launcher changes (`retriever harness run …` instead of
  `uv run nv-ingest-harness-run …`).
- If you have a repo-local `uv` environment, `uv run retriever harness run …`
  still works.
- Stage benchmarks (`retriever benchmark …`) are net-new relative to the old
  `nv-ingest-cli` examples — they are the recommended way to profile
  individual actors before tuning `pipeline run` flags.
# `retriever` CLI — Client-Usage Walk-through

This page is the `retriever`-CLI counterpart to
`nv-ingest/client/client_examples/examples/cli_client_usage.ipynb`.

The original notebook walks through `nv-ingest-cli` by:

1. Printing `--help`.
2. Submitting a single PDF with `extract + dedup + filter` tasks.
3. Submitting a dataset of PDFs with the same task set.

The equivalent `retriever` workflow is shown below. You can drop these cells
into a new notebook (e.g. `retriever_client_usage.ipynb`) alongside the old
one.

## 1. Help

```bash
retriever --help
retriever pipeline run --help
```

Top-level `--help` lists the subcommand tree; `pipeline run --help` shows the
ingest-specific flags you will actually use in this walk-through.
## 2. Submit a single PDF

Old notebook cell:

```bash
nv-ingest-cli \
  --doc ${SAMPLE_PDF0} \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_text": true, "extract_images": true, "extract_tables": true, "extract_tables_method": "yolox"}' \
  --task='dedup:{"content_type": "image", "filter": true}' \
  --task='filter:{"content_type": "image", "min_size": 128, "max_aspect_ratio": 5.0, "min_aspect_ratio": 0.2, "filter": true}' \
  --client_host=${REDIS_HOST} \
  --client_port=${REDIS_PORT} \
  --output_directory=${OUTPUT_DIRECTORY_SINGLE}
```

New:

```bash
retriever pipeline run "${SAMPLE_PDF0}" \
  --input-type pdf \
  --method pdfium \
  --extract-text --extract-tables --extract-charts \
  --dedup --dedup-iou-thres 0.45 \
  --store-images-uri "${OUTPUT_DIRECTORY_SINGLE}/images" \
  --strip-base64 \
  --save-intermediate "${OUTPUT_DIRECTORY_SINGLE}"
```

### Parity notes

- `extract_tables_method:"yolox"` is not a CLI selector — the pipeline picks
  its table/structure detectors automatically. Tables are still extracted.
- `dedup:{content_type:"image", filter:true}` maps to `--dedup` (with
  `--dedup-iou-thres` for the IoU threshold).
- `filter:{content_type:"image", min_size, min/max_aspect_ratio, filter:true}`
  **has no parity.** There is no image scale/aspect-ratio filter in the
  `retriever` CLI today. If that matters, drop to the Python API or keep the
  old `nv-ingest-cli` for that example.
- `extract_images:true` is implicitly satisfied by `--store-images-uri`
  (images are extracted and persisted to the URI).
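If you do need the old image scale/aspect-ratio behavior, one option is a small post-filter over the saved extraction DataFrame. A hedged sketch using the old task's thresholds; the `content_type`, `width`, and `height` column names are assumptions about the Parquet schema, and `filter_images` is a hypothetical helper:

```python
import pandas as pd


def filter_images(df, min_size=128, min_aspect_ratio=0.2, max_aspect_ratio=5.0):
    # Mirror the old filter task: drop image rows that are too small or
    # too elongated; leave non-image rows untouched.
    is_image = df["content_type"] == "image"
    aspect = df["width"] / df["height"]
    too_small = df[["width", "height"]].min(axis=1) < min_size
    bad_aspect = (aspect < min_aspect_ratio) | (aspect > max_aspect_ratio)
    return df[~(is_image & (too_small | bad_aspect))].reset_index(drop=True)
```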
## 3. Submit a dataset of PDFs

Old notebook cell:

```bash
nv-ingest-cli \
  --dataset ${BATCH_FILE} \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_text": true, "extract_images": true, "extract_tables": true, "extract_tables_method": "yolox"}' \
  --task='dedup:{"content_type": "image", "filter": true}' \
  --task='filter:{"content_type": "image", "min_size": 128, "max_aspect_ratio": 5.0, "min_aspect_ratio": 0.2, "filter": true}' \
  --client_host=${REDIS_HOST} \
  --client_port=${REDIS_PORT} \
  --output_directory=${OUTPUT_DIRECTORY_BATCH}
```

New — point `retriever` at a directory of PDFs instead of a dataset JSON:

```bash
# Assume $PDF_DIR is a directory holding your batch of PDFs.
retriever pipeline run "${PDF_DIR}" \
  --input-type pdf \
  --method pdfium \
  --extract-text --extract-tables --extract-charts \
  --dedup --dedup-iou-thres 0.45 \
  --store-images-uri "${OUTPUT_DIRECTORY_BATCH}/images" \
  --strip-base64 \
  --save-intermediate "${OUTPUT_DIRECTORY_BATCH}"
```

### Parity notes

- The `dataset.json` (`sampled_files`) format and `gen_dataset.py` sampler
  are not reproduced. Materialize a directory (or glob) containing the files
  you want to process.
- The `--shuffle_dataset` knob is not present; set Ray block / batch sizes
  via `--pdf-split-batch`, `--pdf-split-batch-size`, etc. for throughput.
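One hedged way to materialize that directory is to symlink the dataset's files into a staging folder. The `sampled_files` key reflects the old `dataset.json` format; the helper itself is hypothetical:

```python
import json
import os


def materialize_dataset(dataset_json, target_dir):
    # Link every sampled file into target_dir so that
    # `retriever pipeline run <target_dir>` processes exactly the
    # old dataset's file list.
    with open(dataset_json) as f:
        files = json.load(f)["sampled_files"]
    os.makedirs(target_dir, exist_ok=True)
    for path in files:
        dest = os.path.join(target_dir, os.path.basename(path))
        if not os.path.lexists(dest):
            os.symlink(os.path.abspath(path), dest)
    return target_dir
```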
## 4. Inspect results

```python
import os

import lancedb
import pyarrow.parquet as pq

# Mirror the shell variable used in the cells above.
OUTPUT_DIRECTORY_BATCH = os.environ["OUTPUT_DIRECTORY_BATCH"]

# Parquet extraction dumps written by --save-intermediate:
df = pq.read_table(OUTPUT_DIRECTORY_BATCH).to_pandas()
print(df[["source_id", "text", "content_type"]].head())

# LanceDB rows (default table name "nv-ingest"):
db = lancedb.connect("./lancedb")
tbl = db.open_table("nv-ingest")
print(tbl.to_pandas().head())
```
## Migration summary

| Old notebook cell | New `retriever` form | Parity |
|-------------------|----------------------|--------|
| `!nv-ingest-cli --help` | `!retriever --help` (plus `retriever pipeline run --help`) | Full |
| Single-file extract + dedup + filter | `retriever pipeline run <file> … --dedup …` | Partial — no image-size/aspect filter, `extract_tables_method` auto-selected |
| Dataset extract + dedup + filter | `retriever pipeline run <dir> …` | Partial — no `dataset.json` loader; use a directory |
# PDF Blueprint — `retriever` CLI Replacement

This page is the `retriever`-CLI counterpart to the CLI cell in
`nv-ingest/deploy/pdf-blueprint.ipynb`.

## Original blueprint cell

```bash
nv-ingest-cli \
  --doc nv-ingest/data/multimodal_test.pdf \
  --output_directory ./processed_docs \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_tables": "true", "extract_images": "true", "extract_charts": "true"}' \
  --client_host=host.docker.internal \
  --client_port=7670
```

This submits the blueprint's multimodal sample PDF to the running ingest
service and asks for text + tables + charts + images.
## `retriever` equivalent

```bash
retriever pipeline run nv-ingest/data/multimodal_test.pdf \
  --input-type pdf \
  --method pdfium \
  --extract-text --extract-tables --extract-charts \
  --store-images-uri ./processed_docs/images \
  --strip-base64 \
  --save-intermediate ./processed_docs
```

### What you get (end-user outcome)

- The same multimodal content (text, table markdown, chart descriptions,
  extracted images) is produced.
- Text / table / chart rows land in LanceDB at `./lancedb/nv-ingest.lance`.
- Parquet extraction rows are written under `./processed_docs/`.
- Extracted images are written under `./processed_docs/images/`, referenced by
  `content_url` in the row metadata.
### Notebook-friendly form

To keep the notebook self-contained, prefix the shell cell with `!`:

```bash
!retriever pipeline run nv-ingest/data/multimodal_test.pdf \
  --input-type pdf \
  --method pdfium \
  --extract-text --extract-tables --extract-charts \
  --store-images-uri ./processed_docs/images \
  --strip-base64 \
  --save-intermediate ./processed_docs
```

And inspect the results in the next cell:

```python
import lancedb
import pyarrow.parquet as pq

df = pq.read_table("./processed_docs").to_pandas()
print(df[["source_id", "content_type"]].value_counts())

db = lancedb.connect("./lancedb")
tbl = db.open_table("nv-ingest")
print(tbl.to_pandas().head())
```
## Migrating the blueprint `pip install` cell

The blueprint also installs `nv-ingest-client==25.9.0`. For the `retriever`
path, install `nemo-retriever` instead (see `nemo_retriever/README.md` for
current pinned versions):

```bash
pip install "nemo-retriever==26.3.0" \
  nv-ingest-client==26.3.0 nv-ingest==26.3.0 nv-ingest-api==26.3.0 \
  "pymilvus[bulk_writer,model]" \
  minio \
  tritonclient \
  langchain_milvus
```
## Parity notes

- `client_host=host.docker.internal` / `client_port=7670` are irrelevant here:
  `retriever pipeline run` is in-process, so the blueprint no longer needs a
  running `nv-ingest-ms-runtime` container for the CLI cell.
- If you still want the blueprint to hit a live service (for example to
  exercise the REST API), replace the CLI cell with a `retriever online serve`
  container plus `retriever online stream-pdf` for per-page NDJSON output.
  Note that `retriever online submit` is currently a stub.
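`stream-pdf`'s per-page NDJSON is one JSON object per line, so it can be consumed incrementally. A generic parsing sketch — the record fields depend on the service and are not shown here; `iter_ndjson` is a hypothetical helper:

```python
import json


def iter_ndjson(lines):
    # Yield one decoded record per non-empty NDJSON line.
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)
```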