
NRL only content for GitHub pages #1855

Merged
jdye64 merged 115 commits into main from kheiss/NRLonly
Apr 20, 2026

Conversation

@kheiss-uwzoo
Collaborator

@kheiss-uwzoo kheiss-uwzoo commented Apr 15, 2026

Summary

This PR adds a NeMo Retriever Library (NRL)–scoped documentation build and publishing path, reorganizes the table of contents to match the proposed NeMo Retriever doc structure, and introduces short hub pages so major topics have a clear landing page before deeper guides. It aligns naming with NeMo Retriever Library where we mean the library product, and applies NVIDIA style guidance (plain language, “and” instead of “&”, descriptive links, list lead-ins).

Use this for a rendered preview of the current draft content:

https://sw-docs-dgx-station.nvidia.com/nemo/retriever/latest/extraction/overview/

Discussion, scope, and merge status live here in this PR.

Feedback welcome on either link.

Note that this is draft / staging quality, not a final production publication.

Scope

Documentation and CI for the NRL-focused site; it does not replace the full multi-package Sphinx/API doc pipeline for the whole repo.

What’s included

Build and automation

  • .github/workflows/nrl-docs-github-pages.yml — Builds and deploys the NRL docs to GitHub Pages (staging/nightly: push to main under docs/**, schedule, workflow_dispatch).

  • docs/mkdocs.nrl-github-pages.yml — MkDocs config for the NRL-only site (Material theme, staging overrides, exclude_docs for pages outside this build).

  • docs/overrides-nrl-staging/main.html — Staging theme override.

  • docs/scripts/print_nrl_mkdocs_nav.py — Prints the nav tree from the NRL MkDocs config for review.

  • docs/scripts/scan_non_nrl_doc_references.py — Optional scan for legacy naming (informational).
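As a rough illustration of what a nav-printing helper such as print_nrl_mkdocs_nav.py might do, the sketch below walks a MkDocs-style nav structure and emits one indented line per entry. It is not the script from this PR: the `format_nav` function, the in-memory `example_nav`, and its titles and paths are all assumptions made for the example, and a real script would load the nav from the MkDocs YAML config instead.

```python
from typing import Any


def format_nav(nav: list[Any], depth: int = 0) -> list[str]:
    """Return one indented line per nav entry (hypothetical helper)."""
    lines: list[str] = []
    indent = "  " * depth
    for entry in nav:
        if isinstance(entry, str):
            # A bare path with no explicit title.
            lines.append(indent + entry)
            continue
        # MkDocs nav entries are single-key mappings: {title: path-or-children}.
        for title, target in entry.items():
            if isinstance(target, list):
                lines.append(f"{indent}{title}/")
                lines.extend(format_nav(target, depth + 1))
            else:
                lines.append(f"{indent}{title}: {target}")
    return lines


# Illustrative nav only; these titles and paths are made up for the sketch.
example_nav = [
    {"Introduction": "index.md"},
    {"Get started": [
        {"Quick start": "quickstart.md"},
        "faq.md",
    ]},
]

if __name__ == "__main__":
    print("\n".join(format_nav(example_nav)))
```

Returning a list of lines rather than printing directly keeps the traversal easy to check in review or in a test.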

Information architecture and content

  • Nav (13 sections) — Introduction → Get started → Choose deployment → Core workflows → Multimodal extraction → Embedding/indexing → Retrieval → Deployment → Customize → Integrations → Evaluation → Reference → Support; plus additional resources.

  • New hub / topic pages (examples): how to use this documentation, hosted vs self-hosted NIMs, workflows (query/rerank, agentic, video OCR), semantic/hybrid retrieval, reranking, production checklist, multimodal extraction stubs, embedding NIMs, vector DB partners, published metrics, and related links.

  • Supporting updates to existing pages (choose-your-path, getting-started-about, key-features, concepts, integrations, resources, small fixes across FAQ, data-store, audio, etc.).

kheiss-uwzoo and others added 30 commits February 19, 2026 10:36
Update all hardcoded version references from 26.1.2 to 26.3.0-RC1
across helm charts, docker-compose, FastAPI, docs, and examples.

Made-with: Cursor
…ing long VLM captioning

Large PDFs with VLM captioning enabled can take 2-22+ hours depending on hardware.
The previous defaults (STATE_TTL=7200s, RESULT_DATA_TTL=3600s) caused job state to
expire mid-processing, resulting in 404 "Job ID not found or state has expired" errors
even though the pipeline completed successfully.

Raises both defaults to 172800s (48 hours), providing sufficient headroom for all
observed workloads. Users can still override via RESULT_DATA_TTL_SECONDS and
STATE_TTL_SECONDS environment variables.

Fixes: Customer bug 5914605

Co-Authored-By: Claude Opus 4.6 <[email protected]>
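The override behavior described in the commit message above can be sketched as follows. This is a minimal illustration, not the service's actual code: the `ttl_from_env` helper and its validation are assumptions, and only the two environment variable names and the 172800-second default come from the commit message.

```python
import os
from typing import Mapping, Optional

# 48 hours, the new default named in the commit message.
DEFAULT_TTL_SECONDS = 172_800


def ttl_from_env(
    name: str,
    default: int = DEFAULT_TTL_SECONDS,
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Read a TTL in seconds from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        # Variable unset or empty: fall back to the default.
        return default
    value = int(raw)
    if value <= 0:
        raise ValueError(f"{name} must be a positive number of seconds, got {raw!r}")
    return value


if __name__ == "__main__":
    print("STATE_TTL_SECONDS =", ttl_from_env("STATE_TTL_SECONDS"))
    print("RESULT_DATA_TTL_SECONDS =", ttl_from_env("RESULT_DATA_TTL_SECONDS"))
```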
@kheiss-uwzoo kheiss-uwzoo marked this pull request as ready for review April 17, 2026 15:24
@kheiss-uwzoo kheiss-uwzoo requested review from a team as code owners April 17, 2026 15:25
@kheiss-uwzoo kheiss-uwzoo added the doc Improvements or additions to documentation label Apr 17, 2026
@greptile-apps
Contributor

greptile-apps Bot commented Apr 17, 2026

Greptile Summary

This PR introduces an NRL-scoped MkDocs documentation build and GitHub Pages publishing pipeline, reorganizes the nav into 13 sections with new hub/topic pages, renames releasenotes-nv-ingest.md to releasenotes.md, and adds two helper scripts for pre-deploy nav inspection and legacy-naming scanning.

  • Workflow permissions are scoped too broadly: pages: write and id-token: write are declared at the workflow level and inherited by the build job, which does not need them; these should be moved to the deploy job only.
  • Staging banner will not render: theme.announcement is not a recognized Material for MkDocs YAML key; the warning will be silently dropped, and the {% block announce %} override is not implemented in main.html.

Confidence Score: 3/5

Not safe to merge until the broken build job structure is fixed and the over-broad token permissions are addressed.

The build job in the workflow is structurally invalid (missing steps: key — flagged in prior thread) and will fail immediately on push. The new security finding of workflow-level pages: write / id-token: write gives the build job unnecessary token access. The staging banner misconfiguration (theme.announcement) means the site will silently deploy without any staging indicator. Together these are two new P1s plus one unresolved P1 from a prior thread.

.github/workflows/nrl-docs-github-pages.yml (broken job structure + over-broad permissions) and docs/mkdocs.nrl-github-pages.yml + docs/overrides-nrl-staging/main.html (missing announcement block).

Security Review

  • Over-broad token grants (.github/workflows/nrl-docs-github-pages.yml): pages: write and id-token: write are granted at the workflow level, giving the build job OIDC and Pages-write permissions it does not need. A compromised third-party action in the build job (all six are pinned to mutable version tags, not full SHAs) could abuse these tokens. Scope both permissions to the deploy job only.
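One way to spot the mutable tag pins mentioned above is a small text scan over the workflow file. The sketch below is an assumption-laden illustration, not part of the PR: the `unpinned_actions` helper is hypothetical, the sample workflow text and its 40-character hash are placeholders, and the scan uses a regular expression instead of a YAML parser.

```python
import re

# Matches "uses: owner/repo@ref" with or without a leading list dash.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*(\S+)")
# A full commit SHA is 40 lowercase hex characters.
FULL_SHA_RE = re.compile(r"[0-9a-f]{40}")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return action references not pinned to a full commit SHA."""
    found: list[str] = []
    for line in workflow_text.splitlines():
        match = USES_RE.match(line)
        if not match:
            continue
        ref = match.group(1).strip("'\"")
        if ref.startswith("./") or ref.startswith("docker://"):
            # Local actions and container images are out of scope here.
            continue
        _, _, version = ref.partition("@")
        if not FULL_SHA_RE.fullmatch(version):
            found.append(ref)
    return found


# Placeholder workflow text; the SHA below is not a real commit.
sample = """\
steps:
  - uses: actions/checkout@v4
  - uses: actions/deploy-pages@0123456789abcdef0123456789abcdef01234567
  - uses: ./local-action
"""

if __name__ == "__main__":
    for ref in unpinned_actions(sample):
        print("not pinned to a full SHA:", ref)
```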

Important Files Changed

| Filename | Overview |
| --- | --- |
| .github/workflows/nrl-docs-github-pages.yml | New workflow for NRL docs build/deploy; has structural issues (missing steps: in build job, duplicate checkout actions, mutable tag pins) and over-broad permissions at workflow level instead of job level. |
| docs/mkdocs.nrl-github-pages.yml | New MkDocs config for NRL-only GitHub Pages site; theme.announcement is not a recognized Material key so the staging banner will be silently dropped from the rendered site. |
| docs/scripts/print_nrl_mkdocs_nav.py | New utility script to print nav tree for pre-deploy review; missing required SPDX license header (flagged in prior thread). |
| docs/scripts/scan_non_nrl_doc_references.py | New scan script for legacy naming; missing SPDX header (prior thread) and has a dead NV-Ingest regex pattern shadowed by the preceding case-insensitive superset entry. |
| docs/overrides-nrl-staging/main.html | Minimal staging theme override; does not implement {% block announce %}, so the staging banner in mkdocs.nrl-github-pages.yml will not render. |
| docs/docs/extraction/releasenotes.md | Renamed from releasenotes-nv-ingest.md; nav reference in mkdocs config now correctly points to this file, resolving the prior broken-nav finding. |
| docs/docs/extraction/overview.md | Updated NRL product overview page; "NIVIDIA" typo was flagged in a prior thread and needs correction. |

Sequence Diagram

sequenceDiagram
    participant Push as Push / Schedule
    participant Build as build job
    participant MkDocs as MkDocs --strict
    participant Artifact as Pages artifact
    participant Deploy as deploy job
    participant Pages as GitHub Pages

    Push->>Build: trigger (docs/**, schedule, dispatch)
    Build->>Build: actions/checkout
    Build->>Build: pip install mkdocs + nemo_retriever
    Build->>Build: print_nrl_mkdocs_nav.py
    Build->>Build: scan_non_nrl_doc_references.py
    Build->>MkDocs: mkdocs build -f mkdocs.nrl-github-pages.yml --strict
    MkDocs-->>Artifact: docs/site/
    Build->>Artifact: upload-pages-artifact
    Artifact-->>Deploy: needs: build
    Deploy->>Pages: actions/deploy-pages
    Pages-->>Deploy: page_url
Prompt To Fix All With AI
This is a comment left during a code review.
Path: .github/workflows/nrl-docs-github-pages.yml
Line: 18-21

Comment:
**Over-broad workflow-level permissions**

The `permissions:` block is declared at the workflow level, which grants `pages: write` and `id-token: write` to every job — including `build`, which needs neither. Under the `github-actions-security` rule, permissions should be scoped to the minimum required at the job level. Move the elevated permissions to the `deploy` job only and restrict `build` to `contents: read`:

```yaml
permissions:
  contents: read   # workflow-level default

jobs:
  build:
    permissions:
      contents: read
    # ... (rest of the build job unchanged)
  deploy:
    permissions:
      pages: write
      id-token: write
```

This limits the blast radius if the `build` job is ever compromised by a supply-chain attack.

**Rule Used:** GitHub Actions workflows must: pin third-party act... ([source](https://app.greptile.com/review/custom-context?memory=github-actions-security))

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: docs/mkdocs.nrl-github-pages.yml
Line: 33-34

Comment:
**`theme.announcement` is not a valid Material for MkDocs config key**

Material for MkDocs does not read an `announcement:` key from the `theme:` block — MkDocs will silently ignore it, and the staging warning banner will never appear on the site. The correct approach is to override the `{% block announce %}` in the theme's custom template.

Add the block to `docs/overrides-nrl-staging/main.html`:

```html
{% block announce %}
  <strong>Staging (nightly):</strong> NeMo Retriever documentation only.
  This site is not a production or release publication.
{% endblock %}
```

Without this, the site looks identical to a production build and could mislead readers into treating draft content as authoritative.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: docs/scripts/scan_non_nrl_doc_references.py
Line: 19-21

Comment:
**Dead `NV-Ingest` pattern — always shadowed by the case-insensitive superset above it**

`PATTERNS` is evaluated in order with a `break` after the first match. The first entry uses `re.IGNORECASE`, which matches `NV-Ingest` (and any other casing), so the second entry `re.compile(r"NV-Ingest")` is unreachable — it will never fire. Both lines that match `NV-Ingest` will be reported under the `nv-ingest (substring)` label instead.

If you need to distinguish the mixed-case product name from the lowercase CLI artifact in the output, move the case-sensitive patterns before the case-insensitive superset, or remove the redundant entry.

How can I resolve this? If you propose a fix, please make it concise.
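The ordering fix suggested in the comment above can be sketched like this. It is a minimal illustration under assumptions, not the contents of scan_non_nrl_doc_references.py: the `classify` function and the "NV-Ingest (product name)" label are hypothetical, and only the first-match-wins behavior and the `nv-ingest (substring)` label come from the review comment.

```python
import re
from typing import Optional

# Case-sensitive, specific pattern first; the case-insensitive superset last.
PATTERNS = [
    ("NV-Ingest (product name)", re.compile(r"NV-Ingest")),
    ("nv-ingest (substring)", re.compile(r"nv-ingest", re.IGNORECASE)),
]


def classify(line: str) -> Optional[str]:
    """Return the label of the first pattern that matches, or None."""
    for label, pattern in PATTERNS:
        if pattern.search(line):
            # First match wins, mirroring the break described in the comment.
            return label
    return None


if __name__ == "__main__":
    for text in ("Install NV-Ingest today", "pip install nv-ingest", "NeMo Retriever"):
        print(f"{text!r} -> {classify(text)}")
```

With the order reversed, the case-insensitive entry would match every casing first and the specific entry would never be reached, which is the shadowing the comment describes.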

Reviews (6): Last reviewed commit: "Update docs/docs/extraction/overview.md"

Comment thread .github/workflows/nrl-docs-github-pages.yml Outdated
Comment thread docs/scripts/print_nrl_mkdocs_nav.py
Comment thread docs/mkdocs.nrl-github-pages.yml
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Comment thread .github/workflows/nrl-docs-github-pages.yml
- Fix mkdocs.nrl-github-pages.yml so Workflow: Document ingestion targets workflow-document-ingestion.md (not the V2 API guide); add workflow and metadata schema pages.

- Add workflow-document-ingestion.md, workflow pages, and multimodal-metadata-schema.md; update workflow cross-links.

- Align quick start links and GitHub URLs with NeMo-Retriever; use Python and CLI Quick Start Guide labels; refresh quickstart-guide examples and MIG references.

- Insert the standard NVIDIA Ingest (nv-ingest) rename note after the H1 on every extraction topic page for consistent messaging.

Made-with: Cursor
- Point environment and troubleshooting links to environment-config.md and troubleshoot.md.

- In user-defined-functions, link to content-metadata, multimodal-metadata-schema, nimclient.md, and GitHub default_pipeline.yaml.

- Add explicit HTML anchors in content-metadata.md so schema table fragment links resolve without macro/attr_list issues.

Made-with: Cursor
Comment thread docs/docs/extraction/overview.md Outdated
- Replace "see" with "refer to" for consistency in linking to the Support matrix and Benchmarking documentation.
- Update cross-references and table notes to use refer to / Refer to for consistency.

- Rename 'See also' to 'Related topics' in key-features.md.

- Remove temporary docs/scripts/replace_see_with_refer_to.py helper.

Made-with: Cursor
Comment thread docs/mkdocs.nrl-github-pages.yml Outdated
Collaborator

@randerzander randerzander left a comment

Approving as an initial re-factor for the doc content

@jdye64 to review the scripts and gh pages changes

Collaborator

What is the point of this new workflow? There is already a workflow that does the same thing

Collaborator Author

Existing Pages workflows run the full Docker/Sphinx docs build; this one is a lightweight NRL-only MkDocs path for staging/nightly without that pipeline.

kheiss-uwzoo and others added 3 commits April 20, 2026 12:49
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Comment thread .github/workflows/nrl-docs-github-pages.yml
@kheiss-uwzoo kheiss-uwzoo requested a review from jdye64 April 20, 2026 21:25
@jdye64 jdye64 merged commit e8ac134 into main Apr 20, 2026
5 checks passed
@kheiss-uwzoo kheiss-uwzoo mentioned this pull request Apr 20, 2026

Labels

doc Improvements or additions to documentation

8 participants