NRL only content for GitHub pages #1855
Update all hardcoded version references from 26.1.2 to 26.3.0-RC1 across helm charts, docker-compose, FastAPI, docs, and examples. Made-with: Cursor
Co-authored-by: Kurt Heiss <[email protected]>
Co-authored-by: Jeremy Dyer <[email protected]>
…ing long VLM captioning

Large PDFs with VLM captioning enabled can take 2-22+ hours depending on hardware. The previous defaults (STATE_TTL=7200s, RESULT_DATA_TTL=3600s) caused job state to expire mid-processing, resulting in 404 "Job ID not found or state has expired" errors even though the pipeline completed successfully.

Raises both defaults to 172800s (48 hours), providing sufficient headroom for all observed workloads. Users can still override via RESULT_DATA_TTL_SECONDS and STATE_TTL_SECONDS environment variables.

Fixes: Customer bug 5914605

Co-Authored-By: Claude Opus 4.6 <[email protected]>
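The override behavior described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper name `ttl_from_env` and the fallback to the default on empty, non-numeric, or non-positive values are assumptions; only the two environment variable names and the 172800-second default come from the message above.

```python
import os

# 48 hours, the new default stated in the commit message.
DEFAULT_TTL_SECONDS = 172800


def ttl_from_env(name, default=DEFAULT_TTL_SECONDS, env=None):
    """Return a positive TTL in seconds read from env[name], else the default.

    Hypothetical helper: falling back on bad input is an assumption,
    not confirmed behavior of the real service.
    """
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or not raw.strip():
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


# Variable names taken from the commit message.
state_ttl = ttl_from_env("STATE_TTL_SECONDS")
result_data_ttl = ttl_from_env("RESULT_DATA_TTL_SECONDS")
```

With neither variable set, both values resolve to 48 hours, so a job that takes 22 hours no longer outlives its own state.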
Made-with: Cursor
Greptile Summary
This PR introduces an NRL-scoped MkDocs documentation build and GitHub Pages publishing pipeline, reorganizes the nav into 13 sections with new hub/topic pages, renames
| Filename | Overview |
|---|---|
| .github/workflows/nrl-docs-github-pages.yml | New workflow for NRL docs build/deploy; has structural issues (missing steps: in build job, duplicate checkout actions, mutable tag pins) and over-broad permissions at workflow level instead of job level. |
| docs/mkdocs.nrl-github-pages.yml | New MkDocs config for NRL-only GitHub Pages site; theme.announcement is not a recognized Material key so the staging banner will be silently dropped from the rendered site. |
| docs/scripts/print_nrl_mkdocs_nav.py | New utility script to print nav tree for pre-deploy review; missing required SPDX license header (flagged in prior thread). |
| docs/scripts/scan_non_nrl_doc_references.py | New scan script for legacy naming; missing SPDX header (prior thread) and has a dead NV-Ingest regex pattern shadowed by the preceding case-insensitive superset entry. |
| docs/overrides-nrl-staging/main.html | Minimal staging theme override; does not implement {% block announce %}, so the staging banner in mkdocs.nrl-github-pages.yml will not render. |
| docs/docs/extraction/releasenotes.md | Renamed from releasenotes-nv-ingest.md; nav reference in mkdocs config now correctly points to this file, resolving the prior broken-nav finding. |
| docs/docs/extraction/overview.md | Updated NRL product overview page; "NIVIDIA" typo was flagged in a prior thread and needs correction. |
Sequence Diagram
sequenceDiagram
participant Push as Push / Schedule
participant Build as build job
participant MkDocs as MkDocs --strict
participant Artifact as Pages artifact
participant Deploy as deploy job
participant Pages as GitHub Pages
Push->>Build: trigger (docs/**, schedule, dispatch)
Build->>Build: actions/checkout
Build->>Build: pip install mkdocs + nemo_retriever
Build->>Build: print_nrl_mkdocs_nav.py
Build->>Build: scan_non_nrl_doc_references.py
Build->>MkDocs: mkdocs build -f mkdocs.nrl-github-pages.yml --strict
MkDocs-->>Artifact: docs/site/
Build->>Artifact: upload-pages-artifact
Artifact-->>Deploy: needs: build
Deploy->>Pages: actions/deploy-pages
Pages-->>Deploy: page_url
Prompt To Fix All With AI
This is a comment left during a code review.
Path: .github/workflows/nrl-docs-github-pages.yml
Line: 18-21
Comment:
**Over-broad workflow-level permissions**
The `permissions:` block is declared at the workflow level, which grants `pages: write` and `id-token: write` to every job — including `build`, which needs neither. Under the `github-actions-security` rule, permissions should be scoped to the minimum required at the job level. Move the elevated permissions to the `deploy` job only and restrict `build` to `contents: read`:
```yaml
permissions:
  contents: read  # workflow-level default
jobs:
  build:
    permissions:
      contents: read
    ...
  deploy:
    permissions:
      pages: write
      id-token: write
```
This limits the blast radius if the `build` job is ever compromised by a supply-chain attack.
**Rule Used:** GitHub Actions workflows must: pin third-party act... ([source](https://app.greptile.com/review/custom-context?memory=github-actions-security))
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: docs/mkdocs.nrl-github-pages.yml
Line: 33-34
Comment:
**`theme.announcement` is not a valid Material for MkDocs config key**
Material for MkDocs does not read an `announcement:` key from the `theme:` block — MkDocs will silently ignore it, and the staging warning banner will never appear on the site. The correct approach is to override the `{% block announce %}` in the theme's custom template.
Add the block to `docs/overrides-nrl-staging/main.html`:
```html
{% block announce %}
<strong>Staging (nightly):</strong> NeMo Retriever documentation only.
This site is not a production or release publication.
{% endblock %}
```
Without this, the site looks identical to a production build and could mislead readers into treating draft content as authoritative.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: docs/scripts/scan_non_nrl_doc_references.py
Line: 19-21
Comment:
**Dead `NV-Ingest` pattern — always shadowed by the case-insensitive superset above it**
`PATTERNS` is evaluated in order with a `break` after the first match. The first entry uses `re.IGNORECASE`, which matches `NV-Ingest` (and any other casing), so the second entry `re.compile(r"NV-Ingest")` is unreachable — it will never fire. Both lines that match `NV-Ingest` will be reported under the `nv-ingest (substring)` label instead.
If you need to distinguish the mixed-case product name from the lowercase CLI artifact in the output, move the case-sensitive patterns before the case-insensitive superset, or remove the redundant entry.
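The shadowing can be reproduced with a small sketch. The labels, the `first_label` helper, and the exact regexes here are hypothetical stand-ins for the real `PATTERNS` list; only the first-match-wins ordering mirrors what the comment describes.

```python
import re

# Hypothetical patterns and labels; the real script's entries may differ.
SHADOWED = [
    ("nv-ingest (substring)", re.compile(r"nv-ingest", re.IGNORECASE)),
    ("NV-Ingest (product name)", re.compile(r"NV-Ingest")),  # never reached
]

# Same entries with the case-sensitive pattern moved first.
REORDERED = [SHADOWED[1], SHADOWED[0]]


def first_label(line, patterns):
    """Return the label of the first pattern that matches, or None."""
    for label, pattern in patterns:
        if pattern.search(line):
            return label  # first match wins, like the `break` in the script
    return None


print(first_label("Install NV-Ingest today", SHADOWED))
print(first_label("Install NV-Ingest today", REORDERED))
```

With the original order, the mixed-case line is reported under the case-insensitive label; after reordering, it is reported under the case-sensitive one, and lowercase `nv-ingest` still falls through to the superset pattern.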
How can I resolve this? If you propose a fix, please make it concise.
Reviews (6): Last reviewed commit: "Update docs/docs/extraction/overview.md"
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
- Fix mkdocs.nrl-github-pages.yml so Workflow: Document ingestion targets workflow-document-ingestion.md (not the V2 API guide); add workflow and metadata schema pages.
- Add workflow-document-ingestion.md, workflow pages, and multimodal-metadata-schema.md; update workflow cross-links.
- Align quick start links and GitHub URLs with NeMo-Retriever; use Python and CLI Quick Start Guide labels; refresh quickstart-guide examples and MIG references.
- Insert the standard NVIDIA Ingest (nv-ingest) rename note after the H1 on every extraction topic page for consistent messaging.
Made-with: Cursor
- Point environment and troubleshooting links to environment-config.md and troubleshoot.md.
- In user-defined-functions, link to content-metadata, multimodal-metadata-schema, nimclient.md, and GitHub default_pipeline.yaml.
- Add explicit HTML anchors in content-metadata.md so schema table fragment links resolve without macro/attr_list issues.
Made-with: Cursor
- Replace "see" with "refer to" for consistency in linking to the Support matrix and Benchmarking documentation.
- Update cross-references and table notes to use refer to / Refer to for consistency.
- Rename 'See also' to 'Related topics' in key-features.md.
- Remove temporary docs/scripts/replace_see_with_refer_to.py helper.
Made-with: Cursor
randerzander
left a comment
Approving as an initial re-factor for the doc content
@jdye64 to review the scripts and gh pages changes
What is the point of this new workflow? There is already a workflow that does the same thing
Existing Pages workflows run the full Docker/Sphinx docs build; this one is a lightweight NRL-only MkDocs path for staging/nightly without that pipeline.
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Summary
This PR adds a NeMo Retriever Library (NRL)–scoped documentation build and publishing path, reorganizes the table of contents to match the proposed NeMo Retriever doc structure, and introduces short hub pages so major topics have a clear landing page before deeper guides. It aligns naming with NeMo Retriever Library where we mean the library product, and applies NVIDIA style guidance (plain language, “and” instead of “&”, descriptive links, list lead-ins).
Use this for a rendered preview of the current draft content:
https://sw-docs-dgx-station.nvidia.com/nemo/retriever/latest/extraction/overview/
Discussion, scope, and merge status live here in this PR.
Feedback welcome on either link.
Note that this is draft / staging quality, not a final production publication.
Scope
Documentation and CI for the NRL-focused site; it does not replace the full multi-package Sphinx/API doc pipeline for the whole repo.
What’s included
Build and automation
.github/workflows/nrl-docs-github-pages.yml — Builds and deploys the NRL docs to GitHub Pages (staging/nightly: push to main under docs/**, schedule, workflow_dispatch).
docs/mkdocs.nrl-github-pages.yml — MkDocs config for the NRL-only site (Material theme, staging overrides, exclude_docs for pages outside this build).
docs/overrides-nrl-staging/main.html — Staging theme override.
docs/scripts/print_nrl_mkdocs_nav.py — Prints the nav tree from the NRL MkDocs config for review.
docs/scripts/scan_non_nrl_doc_references.py — Optional scan for legacy naming (informational).
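To make the nav-review step concrete, here is a rough sketch of how a MkDocs-style `nav` structure could be flattened into an indented outline. It is not the contents of `print_nrl_mkdocs_nav.py`: the `format_nav` function, the sample entries, and their placement under particular sections are assumptions, and the nav is passed in as already-parsed Python data so the example needs no YAML library.

```python
def format_nav(items, depth=0):
    """Return indented outline lines for a MkDocs-style nav list.

    Each item is either a bare path string, a {title: path} mapping,
    or a {title: [children]} mapping for a nested section.
    """
    lines = []
    indent = "  " * depth
    for item in items:
        if isinstance(item, str):
            lines.append(f"{indent}- {item}")
            continue
        for title, target in item.items():
            if isinstance(target, list):
                lines.append(f"{indent}- {title}")
                lines.extend(format_nav(target, depth + 1))
            else:
                lines.append(f"{indent}- {title}: {target}")
    return lines


# Hypothetical excerpt; the real nav has 13 sections.
sample_nav = [
    {"Introduction": "extraction/overview.md"},
    {"Reference": [{"Release notes": "extraction/releasenotes.md"}]},
]

print("\n".join(format_nav(sample_nav)))
```

Printing the outline before a deploy gives reviewers a quick way to spot a section that points at the wrong file without building the whole site.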
Information architecture and content
Nav (13 sections) — Introduction → Get started → Choose deployment → Core workflows → Multimodal extraction → Embedding/indexing → Retrieval → Deployment → Customize → Integrations → Evaluation → Reference → Support; plus additional resources.
New hub / topic pages (examples): how to use this documentation, hosted vs self-hosted NIMs, workflows (query/rerank, agentic, video OCR), semantic/hybrid retrieval, reranking, production checklist, multimodal extraction stubs, embedding NIMs, vector DB partners, published metrics, and related links.
Supporting updates to existing pages (choose-your-path, getting-started-about, key-features, concepts, integrations, resources, small fixes across FAQ, data-store, audio, etc.).