Commit 883ff42

docs(k8s): document evaluation-only patterns and production alternatives
The k8s/README.md calls the manifest "experimental" but does not spell out which specific patterns are unsafe in production. A user deploying to a real cluster has no way to know that `privileged: true`, `DOCKER_TLS_CERTDIR=""`, `POLICY_MODE=skip`, `curl | bash` at pod start, the `dummy` placeholder API key, the absence of any NetworkPolicy, and the absence of resource limits are all *intentional* tradeoffs for a kubectl-apply-and-try-it-out flow, not a production blueprint.

Add k8s/SECURITY.md, walking through every risky pattern in the manifest, why it is unsafe in production, and what a production alternative would look like. Cross-link from k8s/README.md so the warning is discoverable from the existing entry point.

Refs: #1442
Signed-off-by: ColinM-sys <cmcdonough@50words.com>
1 parent 8fac9d6 commit 883ff42

2 files changed

Lines changed: 190 additions & 0 deletions

k8s/README.md

Lines changed: 2 additions & 0 deletions
@@ -1,6 +1,8 @@
# NemoClaw on Kubernetes

> **⚠️ Experimental**: This deployment method is intended for **trying out NemoClaw on Kubernetes**, not for production use. It requires a **privileged pod** running **Docker-in-Docker (DinD)** to create isolated sandbox environments. Operational requirements (storage, runtime, security policies) vary by cluster configuration.
>
> See **[SECURITY.md](./SECURITY.md)** for the specific patterns that make this manifest evaluation-only and what a production-ready deployment would look like instead.

Run [NemoClaw](https://github.com/NVIDIA/NemoClaw) on Kubernetes with GPU inference powered by [Dynamo](https://github.com/ai-dynamo/dynamo) or any OpenAI-compatible endpoint.

k8s/SECURITY.md

Lines changed: 188 additions & 0 deletions
@@ -0,0 +1,188 @@
<!-- SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
<!-- SPDX-License-Identifier: Apache-2.0 -->

# Kubernetes Deployment: Security Considerations

> **The manifest in [`nemoclaw-k8s.yaml`](./nemoclaw-k8s.yaml) is for evaluation only. Do not run it as-is in a production cluster.**

The existing `k8s/README.md` already calls the deployment "experimental",
but the specific patterns that make it experimental are not spelled out.
This page lists each one, why it is unsafe in production, and what a
production-ready alternative would look like. It addresses the gap
flagged in [#1442](https://github.com/NVIDIA/NemoClaw/issues/1442).

## What the evaluation manifest does

The pod runs **two containers** plus an init container:

| Container | Image | Purpose |
|---|---|---|
| `dind` | `docker:24-dind` | Docker-in-Docker daemon. Required because OpenShell sandboxes are Docker containers and a sandbox-on-sandbox needs a real daemon. |
| `workspace` | `node:22` | Runs the official NemoClaw installer over the DinD socket. |
| `init-docker-config` | `busybox` | Writes `daemon.json` so DinD uses host cgroup namespacing. |

That arrangement is the simplest possible way to get NemoClaw onto a
Kubernetes cluster, and also the most dangerous one. The patterns
below are intentional for an *evaluation* deployment but would be
unacceptable in *production*.

## Security risks in the evaluation manifest

### 1. `privileged: true` on the DinD container

```yaml
securityContext:
  privileged: true
```

A privileged container has effectively **no isolation from the node**.
It can load kernel modules, mount the host filesystem, access every
device, and (with a single misstep) escalate to full node compromise.
A nested Docker daemon requires this mode because it needs
unrestricted access to cgroups, namespaces, and `/var/lib/docker`,
but the consequence is that a successful exploit inside the sandbox
compromises not just the pod but the entire node.

**Production alternative:** run the sandbox container directly on the
node's container runtime, selected through a runtime class
(`runc`, `kata`, `gvisor`), and skip DinD entirely. NemoClaw's
OpenShell runtime does not require Docker-in-Docker if the host
already has a compatible runtime.
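
As a rough illustration of the runtime-class approach, the sketch below
registers a `RuntimeClass` and runs an unprivileged pod under it. It
assumes gVisor is already installed on the nodes with its `runsc`
handler configured; the pod name, container name, and image reference
are hypothetical and do not come from the NemoClaw manifest.

```yaml
# Assumption: the cluster admin has installed gVisor on the nodes and
# registered its "runsc" handler with the container runtime.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
apiVersion: v1
kind: Pod
metadata:
  name: nemoclaw-sandbox            # hypothetical name
  namespace: nemoclaw
spec:
  runtimeClassName: gvisor          # selects the RuntimeClass above
  containers:
    - name: sandbox                 # hypothetical container name
      image: registry.example.com/nemoclaw-sandbox:TAG   # hypothetical image
      securityContext:
        privileged: false
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
```

Whether the sandbox workload actually runs under a given handler
depends on the node setup, so treat this as a starting point rather
than a tested configuration.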

### 2. Docker TLS disabled

```yaml
env:
  - name: DOCKER_TLS_CERTDIR
    value: ""
```

Setting `DOCKER_TLS_CERTDIR=""` turns off TLS for the DinD daemon, so
its API accepts requests with no client authentication. Any process
inside the workspace container that can reach `/var/run/docker.sock`
can issue arbitrary Docker API calls, including
`docker run -v /:/host` to escape the sandbox.

**Production alternative:** leave `DOCKER_TLS_CERTDIR` at its default
so the daemon generates TLS certificates, then mount only the client
certs (not the socket) into the workspace container.
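
A minimal sketch of that layout follows. It is a fragment of a pod
`spec`, not a complete manifest. The `docker-certs` volume name is
hypothetical, and the example assumes the certificate generated by the
`docker:24-dind` image is valid for `localhost`, which should be
verified before relying on it.

```yaml
# Sketch only: the volume name and the localhost address are assumptions.
containers:
  - name: dind
    image: docker:24-dind
    env:
      - name: DOCKER_TLS_CERTDIR
        value: /certs               # the image's default; certs are generated here
    volumeMounts:
      - name: docker-certs
        mountPath: /certs
  - name: workspace
    image: node:22
    env:
      - name: DOCKER_HOST
        value: tcp://localhost:2376 # TLS port instead of the Unix socket
      - name: DOCKER_TLS_VERIFY
        value: "1"
      - name: DOCKER_CERT_PATH
        value: /certs/client
    volumeMounts:
      - name: docker-certs
        mountPath: /certs/client
        subPath: client             # expose the client certs only
        readOnly: true
volumes:
  - name: docker-certs
    emptyDir: {}
```

The Docker client in the workspace container then authenticates with
the mounted client certificate and no longer needs access to
`/var/run/docker.sock`.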

### 3. `NEMOCLAW_POLICY_MODE=skip`

```yaml
- name: NEMOCLAW_POLICY_MODE
  value: "skip"
```

`NEMOCLAW_POLICY_MODE=skip` disables NemoClaw's network policy
enforcement inside the sandbox. The agent inside the sandbox can reach
**any** host on the cluster network, exfiltrate data, or pivot to other
services. Policy presets (`pypi`, `npm`, `github`, `huggingface`, etc.)
have no effect.

**Production alternative:** drop the env var (or set
`NEMOCLAW_POLICY_MODE=enforce`) and pick the smallest set of policy
presets the agent actually needs during onboard.

### 4. `curl | bash` installer over the network

```yaml
command:
  - bash
  - -c
  - |
    ...
    curl -fsSL https://nvidia.com/nemoclaw.sh | bash
```

Pulling the installer over the network at pod start time means the
deployed version of NemoClaw is whatever is live on
`nvidia.com/nemoclaw.sh` at the moment the pod boots. There is no
checksum verification, no version pinning, and no offline path. A
compromised installer URL, or even a transient redirect, becomes a
supply-chain compromise of every pod that starts or restarts while it
is in effect.

**Production alternative:** build a NemoClaw image at a known tag,
publish it to your own registry pinned by digest (see #1438), and
deploy that image instead of running the installer at pod start.
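
Pinning by digest could look roughly like the fragment below. The
registry host and repository path are hypothetical, and `<digest>` is a
placeholder for the value your own build pipeline reports after pushing
the image.

```yaml
containers:
  - name: workspace
    # Hypothetical registry and repository. Replace <digest> with the
    # sha256 digest reported for the image you built and pushed.
    image: registry.example.com/nemoclaw/workspace@sha256:<digest>
    imagePullPolicy: IfNotPresent
```

A digest reference resolves to exactly one image, so a pod restart
cannot silently pick up different contents the way a mutable tag or a
live installer script can.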

### 5. Placeholder API key

```yaml
- name: COMPATIBLE_API_KEY
  value: "dummy"
```

The manifest hardcodes a placeholder credential. In a production
deployment this needs to be a real key, sourced from a Kubernetes
`Secret` rather than written as a literal value in the manifest.

**Production alternative:**

```yaml
- name: COMPATIBLE_API_KEY
  valueFrom:
    secretKeyRef:
      name: nemoclaw-credentials
      key: compatible-api-key
```

### 6. No `NetworkPolicy`

The pod has no Kubernetes `NetworkPolicy` attached. With the default
"allow all" cluster behavior, the workspace container can reach any
service in the cluster, including the kube-apiserver, over the cluster
network, and `NEMOCLAW_POLICY_MODE=skip` removes the NemoClaw-side
guardrail too.

**Production alternative:** ship a default-deny `NetworkPolicy` for
the `nemoclaw` namespace and explicitly allow only the inference
endpoint and DNS.
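
One way to express that pair of policies is sketched below. The
`app: nemoclaw` pod label, the inference endpoint address
(`192.0.2.10`), and its port are hypothetical. The DNS rule assumes
cluster DNS runs in `kube-system` with the common `k8s-app: kube-dns`
label, which varies by distribution. The policies only take effect if
the cluster's network plugin enforces `NetworkPolicy`.

```yaml
# Deny all ingress and egress for every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: nemoclaw
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Re-allow only DNS and the inference endpoint for NemoClaw pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-and-inference
  namespace: nemoclaw
spec:
  podSelector:
    matchLabels:
      app: nemoclaw                 # hypothetical pod label
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns     # assumption: common cluster DNS label
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    - to:
        - ipBlock:
            cidr: 192.0.2.10/32     # hypothetical inference endpoint
      ports:
        - protocol: TCP
          port: 8000                # hypothetical port
```

Policies are additive, so the second policy opens only the listed
egress paths while the first keeps everything else closed.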
141+
142+
### 7. No `limits` (only `requests`)
143+
144+
```yaml
145+
resources:
146+
requests:
147+
memory: "8Gi"
148+
cpu: "2"
149+
```
150+
151+
Without `resources.limits`, a runaway agent or a memory leak in the
152+
sandbox can consume unbounded CPU and memory on the node, causing
153+
OOMKills of unrelated workloads. This is the gap flagged in
154+
[#1447](https://github.com/NVIDIA/NemoClaw/issues/1447).
155+
156+
**Production alternative:**
157+
158+
```yaml
159+
resources:
160+
requests:
161+
memory: "8Gi"
162+
cpu: "2"
163+
limits:
164+
memory: "16Gi"
165+
cpu: "4"
166+
```
167+
168+
## Minimum bar for production

If you need to run NemoClaw on a real Kubernetes cluster, none of the
above is acceptable as-is. At a minimum:

1. **Drop `privileged: true`.** Use a runtime class instead of DinD.
2. **Build and pin a NemoClaw image** by digest. Do not `curl | bash`
   at pod start.
3. **Source credentials from `Secret` resources**, not literal values
   in the manifest.
4. **Set `NEMOCLAW_POLICY_MODE=enforce`** and select only the policy
   presets the agent actually needs.
5. **Attach a default-deny `NetworkPolicy`** to the `nemoclaw`
   namespace.
6. **Set `resources.limits`** so a sandbox cannot starve the node.
7. **Add `livenessProbe` / `readinessProbe`** so the kubelet can
   restart unhealthy containers and keep traffic away from pods that
   are not ready.
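
The probes in the last item are the only part of the list not
illustrated earlier on this page. The fragment below is a sketch under
stated assumptions: the `/healthz` path and port `8080` are
hypothetical, since this page does not document a NemoClaw health
endpoint, and the timings are placeholders to tune.

```yaml
containers:
  - name: workspace
    # Hypothetical health endpoint and port. Substitute whatever your
    # NemoClaw image actually exposes.
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 20
      failureThreshold: 3
```

If the image exposes no HTTP endpoint, an `exec` or `tcpSocket` probe
can serve the same purpose.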

The current manifest deliberately ships **none** of those because it
optimizes for "kubectl apply and try it out". That tradeoff is fine
for evaluation, dangerous for production, and the reason this page
exists.
