# added pipeline sub command and update docs #1874 (merged)
Commits (11 total, all by jperez999):

- 839cb93 added pipeline sub command and update docs
- 1523e27 merge in main
- 6a1ed14 fix exception on table open
- d5d3e22 Merge branch 'main' into docs-replace
- 01b8444 fix tests and updates to main
- 2d9d71c fix greptile comments
- 7693c60 add uv.lock changes
- 0cd4dba revert uv lock
- 9cf71cc fix src uv lock to python increase
- 852a449 remove uv lock files
- df914eb fix format on file for pre-commit hooks
# Retriever CLI — Replacement Examples for `nv-ingest-cli`

This folder contains `retriever` command-line examples that deliver the same
end-user outcomes as the `nv-ingest-cli` examples in `nv-ingest/docs/`,
`nv-ingest/api/`, `nv-ingest/client/`, and `nv-ingest/deploy/`.

The original `nv-ingest-cli` documentation is **not removed** — these files sit
alongside it as a new-CLI counterpart you can link to or migrate to.

## Key shape difference

`nv-ingest-cli` is a **single command that talks to a running REST service on
`localhost:7670`** and composes work via repeated
`--task extract|split|caption|embed|dedup|filter|udf`.

`retriever` is a **multi-subcommand Typer app**. Most of the old CLI examples
map to `retriever pipeline run INPUT_PATH`, which runs the graph pipeline
locally (in-process or via Ray) and writes results to LanceDB and, optionally,
to Parquet / object storage. Other subcommands cover focused tasks:

| Old intent | New subcommand |
|------------|----------------|
| Extract + embed + store a batch of documents | `retriever pipeline run` |
| Run an ad-hoc PDF extraction stage | `retriever pdf stage` |
| Run an HTML / text / audio / chart stage | `retriever html run`, `retriever txt run`, `retriever audio extract`, `retriever chart run` |
| Upload stage output to LanceDB | `retriever vector-store stage` |
| Query LanceDB + compute recall@k | `retriever recall vdb-recall` |
| Run a QA evaluation sweep | `retriever eval run` |
| Serve / submit to the online REST API | `retriever online serve` / `retriever online stream-pdf` |
| Benchmark stage throughput | `retriever benchmark {split,extract,audio-extract,page-elements,ocr,all}` |
| Benchmark orchestration | `retriever harness {run,sweep,nightly,summary,compare}` |
## Contents

| New file | Replaces example(s) in |
|----------|------------------------|
| [`retriever_cli.md`](retriever_cli.md) | `nv-ingest/docs/docs/extraction/nv-ingest_cli.md` and the rebranded mirror `cli-reference.md` |
| [`quickstart.md`](quickstart.md) | `nv-ingest/docs/docs/extraction/quickstart-guide.md` (the `nv-ingest-cli` section) |
| [`pdf-split-tuning.md`](pdf-split-tuning.md) | `nv-ingest/docs/docs/extraction/v2-api-guide.md` (CLI example) |
| [`smoke-test.md`](smoke-test.md) | `nv-ingest/api/api_tests/smoke_test.sh` |
| [`cli-client-usage.md`](cli-client-usage.md) | `nv-ingest/client/client_examples/examples/cli_client_usage.ipynb` |
| [`pdf-blueprint.md`](pdf-blueprint.md) | `nv-ingest/deploy/pdf-blueprint.ipynb` (CLI cell) |
| [`benchmarking.md`](benchmarking.md) | `nv-ingest/docs/docs/extraction/benchmarking.md` and `nv-ingest/tools/harness/README.md` |
## Gaps with no retriever-CLI equivalent (kept out of this folder)

The following `nv-ingest-cli` examples are **not** migrated here because the
new CLI does not yet expose an equivalent — continue to use `nv-ingest-cli`
for these cases:

- `--task 'udf:{…}'` — user-defined functions
  (`nv-ingest/docs/docs/extraction/user-defined-functions.md`,
  `nv-ingest/examples/udfs/README.md`). `retriever` does not expose UDFs.
- `--task 'filter:{content_type:"image", min_size:…, min_aspect_ratio:…, max_aspect_ratio:…}'`.
  The image scale/aspect-ratio filter stage is not reproduced in the new CLI.
- Bare service submission (`nv-ingest-cli --doc foo.pdf` with no extract tasks
  and full content-type metadata returned by the service). `retriever online submit`
  is currently a stub — only `retriever online stream-pdf` is implemented.
- `gen_dataset.py` dataset creation with enumeration and sampling.
- `--collect_profiling_traces --zipkin_host --zipkin_port`. Use
  `--runtime-metrics-dir` / `--runtime-metrics-prefix` instead for a different
  metrics flavor.
## Conventions used in the examples

- Input paths assume you invoke `retriever` from the `nv-ingest/nemo_retriever`
  directory (or point at absolute paths).
- `--save-intermediate <dir>` writes the extraction DataFrame as Parquet for
  inspection. LanceDB output goes to `--lancedb-uri` (defaults to `./lancedb`).
- `--store-images-uri <uri>` stores extracted images to a local path or an
  fsspec URI (e.g. `s3://bucket/prefix`).
- `--run-mode inprocess` skips Ray and is ideal for single-file demos and CI;
  `--run-mode batch` (the default) uses Ray Data for throughput.

Run `retriever pipeline run --help` for the authoritative flag list.
# Benchmarking with the `retriever` CLI

This page is the `retriever`-CLI counterpart to
`nv-ingest/docs/docs/extraction/benchmarking.md` and
`nv-ingest/tools/harness/README.md`.

The old benchmarking workflow is driven by `tools/harness` and
`uv run nv-ingest-harness-run`. The `retriever` CLI exposes the harness (and
per-stage micro-benchmarks) as first-class subcommands, so you can run
benchmarks without `uv run` or a separate harness repo.

## Harness (end-to-end benchmarks)

Old:

```bash
cd tools/harness
uv sync
uv run nv-ingest-harness-run --case=e2e --dataset=bo767
uv run nv-ingest-harness-run --case=e2e --dataset=/path/to/your/data
```

New — the harness is a subcommand on the main CLI (full parity):

```bash
retriever harness run --case=e2e --dataset=bo767
retriever harness run --case=e2e --dataset=/path/to/your/data
```

Related commands (browse with `--help`):

```bash
retriever harness --help    # run, sweep, nightly, summary, compare
retriever harness run --help
retriever harness sweep --help
retriever harness nightly --help
retriever harness summary --help
retriever harness compare --help
```
### Harness with image / text storage

Old:

```bash
cd tools/harness
uv run nv-ingest-harness-run --dataset bo20 --preset single_gpu \
  --override store_images_uri=stored_images --override store_text=true
```

New (unchanged apart from the launcher — the flags are already the
`retriever` CLI's):

```bash
retriever harness run --dataset bo20 --preset single_gpu \
  --override store_images_uri=stored_images --override store_text=true
```

When `store_images_uri` is a relative path it resolves to
`artifact_dir/stored_images/` per run; absolute paths and fsspec URIs
(e.g. `s3://bucket/prefix`) are passed through unchanged.
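The resolution rule above can be sketched in Python. This is a minimal illustration of the stated behavior, not the harness's actual code; `resolve_store_images_uri` and its arguments are hypothetical names:

```python
import os
from urllib.parse import urlparse


def resolve_store_images_uri(uri: str, artifact_dir: str) -> str:
    # fsspec-style URIs (s3://, gs://, ...) pass through unchanged.
    if urlparse(uri).scheme:
        return uri
    # Absolute local paths also pass through unchanged.
    if os.path.isabs(uri):
        return uri
    # Relative paths resolve under the per-run artifact directory.
    return os.path.join(artifact_dir, uri)
```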
## Per-stage micro-benchmarks

The new CLI also exposes stage-level throughput benchmarks that had no direct
counterpart in `nv-ingest-cli`:

```bash
retriever benchmark --help    # split, extract, audio-extract, page-elements, ocr, all
retriever benchmark split --help
retriever benchmark extract --help
retriever benchmark audio-extract --help
retriever benchmark page-elements --help
retriever benchmark ocr --help
retriever benchmark all --help
```

Example — benchmark the PDF extraction actor:

```bash
retriever benchmark extract ./data/pdf_corpus \
  --pdf-extract-batch-size 8 \
  --pdf-extract-actors 4
```

Each benchmark reports rows/sec (or chunk rows/sec for audio) for its actor.
Use these when you want focused numbers for a single stage instead of an
end-to-end run.
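Rows/sec here is processed rows divided by wall-clock time. A minimal sketch of that measurement (a hypothetical helper, not the benchmark's internals):

```python
import time


def rows_per_sec(stage_fn, rows):
    # Run one stage function over a batch of rows and report throughput.
    start = time.perf_counter()
    stage_fn(rows)
    elapsed = time.perf_counter() - start
    return len(rows) / elapsed
```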
## Parity notes

- The harness use-cases in the old docs (`--case=e2e`, `--dataset=bo767`,
  `--dataset=/path/...`, `--override ...`) are preserved verbatim — only the
  launcher changes (`retriever harness run …` instead of
  `uv run nv-ingest-harness-run …`).
- If you have a repo-local `uv` environment, `uv run retriever harness run …`
  still works.
- Stage benchmarks (`retriever benchmark …`) are net-new relative to the old
  `nv-ingest-cli` examples — they are the recommended way to profile
  individual actors before tuning `pipeline run` flags.
# `retriever` CLI — Client-Usage Walk-through

This page is the `retriever`-CLI counterpart to
`nv-ingest/client/client_examples/examples/cli_client_usage.ipynb`.

The original notebook walks through `nv-ingest-cli` by:

1. Printing `--help`.
2. Submitting a single PDF with `extract + dedup + filter` tasks.
3. Submitting a dataset of PDFs with the same task set.

The equivalent `retriever` workflow is shown below. You can drop these cells
into a new notebook (e.g. `retriever_client_usage.ipynb`) alongside the old
one.

## 1. Help

```bash
retriever --help
retriever pipeline run --help
```

Top-level `--help` lists the subcommand tree; `pipeline run --help` shows the
ingest-specific flags you will actually use in this walk-through.
## 2. Submit a single PDF

Old notebook cell:

```bash
nv-ingest-cli \
  --doc ${SAMPLE_PDF0} \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_text": true, "extract_images": true, "extract_tables": true, "extract_tables_method": "yolox"}' \
  --task='dedup:{"content_type": "image", "filter": true}' \
  --task='filter:{"content_type": "image", "min_size": 128, "max_aspect_ratio": 5.0, "min_aspect_ratio": 0.2, "filter": true}' \
  --client_host=${REDIS_HOST} \
  --client_port=${REDIS_PORT} \
  --output_directory=${OUTPUT_DIRECTORY_SINGLE}
```

New:

```bash
retriever pipeline run "${SAMPLE_PDF0}" \
  --input-type pdf \
  --method pdfium \
  --extract-text --extract-tables --extract-charts \
  --dedup --dedup-iou-thres 0.45 \
  --store-images-uri "${OUTPUT_DIRECTORY_SINGLE}/images" \
  --strip-base64 \
  --save-intermediate "${OUTPUT_DIRECTORY_SINGLE}"
```

### Parity notes

- `extract_tables_method:"yolox"` is not a CLI selector — the pipeline picks
  its table/structure detectors automatically. Tables are still extracted.
- `dedup:{content_type:"image", filter:true}` maps to `--dedup` (with
  `--dedup-iou-thres` for the IoU threshold).
- `filter:{content_type:"image", min_size, min/max_aspect_ratio, filter:true}`
  **has no parity.** There is no image scale/aspect-ratio filter in the
  `retriever` CLI today. If that matters, drop to the Python API or keep the
  old `nv-ingest-cli` for that example.
- `extract_images:true` is implicitly satisfied by `--store-images-uri`
  (images are extracted and persisted to the URI).
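If you do need the old image scale/aspect-ratio behavior, one option is a small post-filter over the saved extraction DataFrame. A hedged sketch using the old task's thresholds; the `content_type`, `width`, and `height` column names are assumptions about the Parquet schema, and `filter_images` is a hypothetical helper:

```python
import pandas as pd


def filter_images(df, min_size=128, min_aspect_ratio=0.2, max_aspect_ratio=5.0):
    # Mirror the old filter task: drop image rows that are too small or
    # too elongated; leave non-image rows untouched.
    is_image = df["content_type"] == "image"
    aspect = df["width"] / df["height"]
    too_small = df[["width", "height"]].min(axis=1) < min_size
    bad_aspect = (aspect < min_aspect_ratio) | (aspect > max_aspect_ratio)
    return df[~(is_image & (too_small | bad_aspect))].reset_index(drop=True)
```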
## 3. Submit a dataset of PDFs

Old notebook cell:

```bash
nv-ingest-cli \
  --dataset ${BATCH_FILE} \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_text": true, "extract_images": true, "extract_tables": true, "extract_tables_method": "yolox"}' \
  --task='dedup:{"content_type": "image", "filter": true}' \
  --task='filter:{"content_type": "image", "min_size": 128, "max_aspect_ratio": 5.0, "min_aspect_ratio": 0.2, "filter": true}' \
  --client_host=${REDIS_HOST} \
  --client_port=${REDIS_PORT} \
  --output_directory=${OUTPUT_DIRECTORY_BATCH}
```

New — point `retriever` at a directory of PDFs instead of a dataset JSON:

```bash
# Assume $PDF_DIR is a directory holding your batch of PDFs.
retriever pipeline run "${PDF_DIR}" \
  --input-type pdf \
  --method pdfium \
  --extract-text --extract-tables --extract-charts \
  --dedup --dedup-iou-thres 0.45 \
  --store-images-uri "${OUTPUT_DIRECTORY_BATCH}/images" \
  --strip-base64 \
  --save-intermediate "${OUTPUT_DIRECTORY_BATCH}"
```

### Parity notes

- The `dataset.json` (`sampled_files`) format and `gen_dataset.py` sampler
  are not reproduced. Materialize a directory (or glob) containing the files
  you want to process.
- The `--shuffle_dataset` knob is not present; set Ray block / batch sizes
  via `--pdf-split-batch`, `--pdf-split-batch-size`, etc. for throughput.
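One hedged way to materialize that directory is to symlink the dataset's files into a staging folder. The `sampled_files` key reflects the old `dataset.json` format; the helper itself is hypothetical:

```python
import json
import os


def materialize_dataset(dataset_json, target_dir):
    # Link every sampled file into target_dir so that
    # `retriever pipeline run <target_dir>` processes exactly the
    # old dataset's file list.
    with open(dataset_json) as f:
        files = json.load(f)["sampled_files"]
    os.makedirs(target_dir, exist_ok=True)
    for path in files:
        dest = os.path.join(target_dir, os.path.basename(path))
        if not os.path.lexists(dest):
            os.symlink(os.path.abspath(path), dest)
    return target_dir
```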
## 4. Inspect results

```python
import os

import lancedb
import pyarrow.parquet as pq

# Mirror the shell variable used in the cells above.
OUTPUT_DIRECTORY_BATCH = os.environ["OUTPUT_DIRECTORY_BATCH"]

# Parquet extraction dumps written by --save-intermediate:
df = pq.read_table(OUTPUT_DIRECTORY_BATCH).to_pandas()
print(df[["source_id", "text", "content_type"]].head())

# LanceDB rows (default table name "nv-ingest"):
db = lancedb.connect("./lancedb")
tbl = db.open_table("nv-ingest")
print(tbl.to_pandas().head())
```
## Migration summary

| Old notebook cell | New `retriever` form | Parity |
|-------------------|----------------------|--------|
| `!nv-ingest-cli --help` | `!retriever --help` (plus `retriever pipeline run --help`) | Full |
| Single-file extract + dedup + filter | `retriever pipeline run <file> … --dedup …` | Partial — no image-size/aspect filter, `extract_tables_method` auto-selected |
| Dataset extract + dedup + filter | `retriever pipeline run <dir> …` | Partial — no `dataset.json` loader; use a directory |
# PDF Blueprint — `retriever` CLI Replacement

This page is the `retriever`-CLI counterpart to the CLI cell in
`nv-ingest/deploy/pdf-blueprint.ipynb`.

## Original blueprint cell

```bash
nv-ingest-cli \
  --doc nv-ingest/data/multimodal_test.pdf \
  --output_directory ./processed_docs \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_tables": "true", "extract_images": "true", "extract_charts": "true"}' \
  --client_host=host.docker.internal \
  --client_port=7670
```

This submits the blueprint's multimodal sample PDF to the running ingest
service and asks for text + tables + charts + images.
## `retriever` equivalent

```bash
retriever pipeline run nv-ingest/data/multimodal_test.pdf \
  --input-type pdf \
  --method pdfium \
  --extract-text --extract-tables --extract-charts \
  --store-images-uri ./processed_docs/images \
  --strip-base64 \
  --save-intermediate ./processed_docs
```

### What you get (end-user outcome)

- The same multimodal content (text, table markdown, chart descriptions,
  extracted images) is produced.
- Text / table / chart rows land in LanceDB at `./lancedb/nv-ingest.lance`.
- Parquet extraction rows are written under `./processed_docs/`.
- Extracted images are written under `./processed_docs/images/`, referenced by
  `content_url` in the row metadata.
### Notebook-friendly form

To keep the notebook self-contained, prefix the shell cell with `!`:

```bash
!retriever pipeline run nv-ingest/data/multimodal_test.pdf \
  --input-type pdf \
  --method pdfium \
  --extract-text --extract-tables --extract-charts \
  --store-images-uri ./processed_docs/images \
  --strip-base64 \
  --save-intermediate ./processed_docs
```

And inspect the results in the next cell:

```python
import lancedb
import pyarrow.parquet as pq

df = pq.read_table("./processed_docs").to_pandas()
print(df[["source_id", "content_type"]].value_counts())

db = lancedb.connect("./lancedb")
tbl = db.open_table("nv-ingest")
print(tbl.to_pandas().head())
```
## Migrating the blueprint `pip install` cell

The blueprint also installs `nv-ingest-client==25.9.0`. For the `retriever`
path, install `nemo-retriever` instead (see `nemo_retriever/README.md` for
current pinned versions):

```bash
pip install "nemo-retriever==26.3.0" \
  nv-ingest-client==26.3.0 nv-ingest==26.3.0 nv-ingest-api==26.3.0 \
  "pymilvus[bulk_writer,model]" \
  minio \
  tritonclient \
  langchain_milvus
```
## Parity notes

- `client_host=host.docker.internal` / `client_port=7670` are irrelevant here:
  `retriever pipeline run` is in-process, so the blueprint no longer needs a
  running `nv-ingest-ms-runtime` container for the CLI cell.
- If you still want the blueprint to hit a live service (for example to
  exercise the REST API), replace the CLI cell with a `retriever online serve`
  container plus `retriever online stream-pdf` for per-page NDJSON output.
  Note that `retriever online submit` is currently a stub.
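`stream-pdf`'s per-page NDJSON is one JSON object per line, so it can be consumed incrementally. A generic parsing sketch — the record fields depend on the service and are not shown here; `iter_ndjson` is a hypothetical helper:

```python
import json


def iter_ndjson(lines):
    # Yield one decoded record per non-empty NDJSON line.
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)
```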