@@ -128,3 +128,63 @@
- otelcol_exporter_sent_log_records_total
- otelcol_process_cpu_seconds_total
```

---

## Memory Metrics: What We Measure and How

### `docker_component` — `container.memory.usage`

- **Source**: Docker Python SDK — `container.stats(stream=False)["memory_stats"]["usage"]`
- **What it is**: The total memory charged to the container's **Linux cgroup**.
This includes the process's own allocations **plus** kernel page cache, buffers,
and other memory the kernel charged to the cgroup.
- **Unit**: bytes
- **OTel instrument**: `Gauge("container.memory.usage")`
- **Polling**: Background thread at a configurable interval (default 1 s).
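A minimal sketch of the sampling loop described above, assuming a stats payload shaped like the one returned by `container.stats(stream=False)`. The names `read_cgroup_usage` and `Poller` are hypothetical and are not taken from the orchestrator's code; `record` stands in for whatever updates the OTel gauge.

```python
import threading
from typing import Callable


def read_cgroup_usage(stats: dict) -> int:
    """Return memory_stats.usage (bytes) from a Docker stats payload."""
    return int(stats["memory_stats"]["usage"])


class Poller:
    """Hypothetical helper: calls `sample` every `interval_s` seconds on a daemon thread."""

    def __init__(
        self,
        sample: Callable[[], int],
        record: Callable[[int], None],
        interval_s: float = 1.0,  # mirrors the 1 s default mentioned above
    ) -> None:
        self._sample = sample
        self._record = record
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        # Event.wait doubles as an interruptible sleep: it returns True as soon
        # as stop() is called, so the first sample is taken after one interval.
        while not self._stop.wait(self._interval_s):
            self._record(self._sample())
```

With the Docker SDK, `sample` could be something like `lambda: read_cgroup_usage(container.stats(stream=False))`, where `container` is a container object obtained elsewhere.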

### `process_component` — `process.memory.usage`

- **Source**: `psutil.Process(pid).memory_info().rss` (summed over the process and
all its children, recursively).
- **What it is**: The **Resident Set Size (RSS)** — physical RAM pages currently
mapped into the process's address space. Does *not* include swap or
file-backed pages that have been evicted.
- **Unit**: bytes
- **OTel instrument**: `Gauge("process.memory.usage")`
- **Polling**: Background thread at a configurable interval (default 1 s).
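The recursive sum can be sketched as follows. This is an illustration under stated assumptions, not the component's actual code: `total_rss` and `sample_pid` are hypothetical names, and `total_rss` is duck-typed so it only needs an object with psutil-style `memory_info()` and `children(recursive=True)` methods.

```python
from typing import Tuple, Type


def total_rss(proc, ignore: Tuple[Type[BaseException], ...] = ()) -> int:
    """Sum RSS (bytes) over `proc` and all of its descendants.

    `proc` only needs psutil.Process-like `memory_info()` and
    `children(recursive=True)` methods.
    """
    total = proc.memory_info().rss
    for child in proc.children(recursive=True):
        try:
            total += child.memory_info().rss
        except ignore:
            # The child exited (or became unreadable) between listing and sampling.
            continue
    return total


def sample_pid(pid: int) -> int:
    """Sample a live process tree; requires the third-party psutil package."""
    import psutil  # imported lazily so total_rss stays dependency-free

    return total_rss(
        psutil.Process(pid),
        ignore=(psutil.NoSuchProcess, psutil.AccessDenied),
    )
```

Note that pages shared between a parent and its children (for example, shared libraries) are counted once per process, so the summed value can exceed the physical memory the tree actually occupies.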

### How These Compare to Production Observability Tools

| Tool / Metric | Underlying value | Relation to our metrics |
| --- | --- | --- |
| `kubectl top pods` | `container_memory_working_set_bytes` = cgroup usage − inactive file cache | **Lower** than `container.memory.usage`, **close to but ≥** RSS |
| `docker stats` (MEM USAGE column) | cgroup `usage` minus file cache (the Docker CLI subtracts cache on Linux; the API value we read is unadjusted) | **Lower than or equal to** `container.memory.usage` |
| `htop` / `ps rss` | per-process RSS | **Equal** to `process.memory.usage` for a process with no children; ours is the sum over the whole process tree |
| Kubernetes OOM killer | triggers when cgroup `usage` reaches the limit and the kernel cannot reclaim enough memory | Same quantity as `container.memory.usage`, compared against the memory limit |
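| Prometheus `container_memory_rss` (cAdvisor) | cgroup-level RSS: anonymous memory summed over all processes in the cgroup | Close to `process.memory.usage`, but can be lower because process RSS also counts resident file-backed mappings (e.g. shared libraries) |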
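The working-set relation in the first row can be sketched against a Docker stats payload. This is an illustration only: `working_set_bytes` is a hypothetical helper, and the key lookup assumes that cgroup v1 hosts expose `total_inactive_file` while cgroup v2 hosts expose `inactive_file` under `memory_stats.stats`.

```python
def working_set_bytes(stats: dict) -> int:
    """Approximate the working set as cgroup usage minus inactive file cache."""
    mem = stats["memory_stats"]
    usage = int(mem["usage"])
    detail = mem.get("stats") or {}

    # Assumption: prefer the hierarchical cgroup v1 counter when present,
    # otherwise fall back to the cgroup v2 name; treat a missing key as 0.
    inactive_file = int(
        detail.get("total_inactive_file", detail.get("inactive_file", 0))
    )

    # Clamp at zero so a momentarily stale counter never yields a negative value.
    return max(usage - inactive_file, 0)
```

By construction the result is never higher than `container.memory.usage`, which matches the "Lower" relation stated in the table.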

### TODO / Open Questions

- **Emit working-set alongside cgroup usage for Docker components.**
The Docker stats response contains `memory_stats.stats.inactive_file`
(`total_inactive_file` on cgroup v1 hosts), so we could compute
`working_set = usage - inactive_file` to mirror what `kubectl top pods`
reports. This would make perf-test numbers directly comparable to what users
see in Kubernetes dashboards.

- **Emit RSS for Docker components too.**
On cgroup v1 hosts the Docker stats payload includes `memory_stats.stats.rss`
(cgroup v2 hosts report `anon` instead). Either value would give a number
comparable to `process.memory.usage`, enabling apples-to-apples comparison
between Docker-deployed and process-deployed components.

- **Clarify which metric matters most for different audiences.**
Platform teams care about cgroup usage (it determines OOM kills and resource
quota). Developers optimizing code care about RSS (isolates their code's
footprint from kernel caching). Should we surface both in reports, or pick a
primary and make the other opt-in?

- **Peak vs. average.**
We currently report instantaneous samples at the polling interval. Reports
aggregate min/mean/max over observation windows. Consider whether we need a
high-water-mark (peak RSS or peak cgroup usage) captured at a finer
granularity than the polling interval.
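
To make the peak-versus-average question concrete, here is a rough sketch of the min/mean/max aggregation over one observation window. `WindowStats` is a hypothetical name and does not come from the reporting code; the maximum it tracks is only the largest polled sample, so it is a lower bound on the true peak between polls.

```python
import math
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Running min/mean/max over the samples of one observation window."""

    count: int = 0
    total: int = 0
    minimum: float = math.inf
    maximum: float = -math.inf

    def add(self, sample_bytes: int) -> None:
        """Fold one polled sample into the running aggregates."""
        self.count += 1
        self.total += sample_bytes
        self.minimum = min(self.minimum, sample_bytes)
        self.maximum = max(self.maximum, sample_bytes)

    @property
    def mean(self) -> float:
        # NaN signals "no samples yet" instead of raising ZeroDivisionError.
        return self.total / self.count if self.count else math.nan
```

A spike shorter than the polling interval never reaches `add`, which is why a kernel-maintained high-water mark (for example `VmHWM` in `/proc/<pid>/status` on Linux) might be a better source for a true peak than a faster polling loop.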