Changes from 14 commits (17 commits in the pull request):
fbb2bc1
feat: integrate Axiom MetricsDB/MPL querying and prioritize over Grafana
Licenser Feb 12, 2026
d875a9b
refactor: extract range_to_rfc3339 into shared scripts/lib-time
Licenser Feb 12, 2026
dd4ec0d
fix: file_mtime tries GNU stat before BSD to avoid stdout pollution o…
Licenser Feb 12, 2026
dd64dfd
fix: add auth headers to --spec OPTIONS request
Licenser Feb 12, 2026
f5daae2
fix: error on partial --start/--end in axiom-metrics-discover
Licenser Feb 12, 2026
5d7177d
fix(discover-axiom): tag MetricsDB datasets inline in fallback path
Licenser Feb 12, 2026
89cf6ba
fix(lib-time): pluralize time units for GNU date compatibility
Licenser Feb 12, 2026
bd15d04
fix(metrics): validate time conversion results before API calls
Licenser Feb 12, 2026
73730bc
fix: consolidate axiom-link for APL+MPL, fix discover-axiom dataset kind
tsenart Feb 16, 2026
b4f242f
Merge main into axiom-metrics — resolve conflicts in SKILL.core.md, s…
Feb 18, 2026
11bfe16
fix: copy metrics scripts to skill/scripts/ for distribution
Licenser Feb 18, 2026
365e372
fix: suppress SIGPIPE in discover-axiom Strategy 2 fallback
Licenser Feb 18, 2026
b13e261
fix: remove priority encoding from grafana.md
Licenser Feb 18, 2026
c7154ea
fix: remove priority encoding from SKILL.core.md
Licenser Feb 18, 2026
0d46335
fix: SIGPIPE in test-build frontmatter name check
Licenser Feb 18, 2026
2be3ead
fix: URL-encode timestamps in axiom-metrics-discover query string
Licenser Feb 18, 2026
cd5bfb2
fix: add SIGPIPE guard to discover-axiom unlisted datasets pipeline
Licenser Feb 18, 2026
10 changes: 7 additions & 3 deletions README.md
@@ -13,6 +13,7 @@ You're welcome.
- **Finds root causes** — Hypothesis-driven investigation. No hunches. No vibes. Data.
- **Systematic triage** — Golden signals, USE/RED methods. The stuff you should already know.
- **Remembers everything** — Persistent memory for patterns, queries, and incidents. Unlike you, I learn.
- **Metrics querying** — OTel metrics via MPL. Logs via APL. One agent, both engines.
- **Unified observability** — One config, all your tools. Because having four config files is amateur hour.

## Installation
@@ -67,9 +68,12 @@ Auth options per deployment:
## Usage

```bash
# Query logs
# Query logs (APL)
scripts/axiom-query prod "['dataset'] | where _time > ago(1h) | where status >= 500 | project _time, message, status | take 10"

# Query metrics (MPL)
scripts/axiom-metrics-query prod --range 1h <<< "otel-metrics:http.server.request.duration | align to 5m using avg | group by service.name"

# Check what's on fire
scripts/grafana-alerts prod firing

@@ -84,7 +88,7 @@ scripts/slack default chat.postMessage channel=incidents text="Fixed. You're wel

| Category | Scripts |
|----------|---------|
| **Axiom** | `axiom-query`, `axiom-api`, `axiom-link`, `axiom-deployments` |
| **Axiom** | `axiom-query`, `axiom-metrics-query`, `axiom-api`, `axiom-link`, `axiom-deployments` |
| **Grafana** | `grafana-query`, `grafana-alerts`, `grafana-datasources`, `grafana-api` |
| **Pyroscope** | `pyroscope-flamegraph`, `pyroscope-diff`, `pyroscope-services`, `pyroscope-api` |
| **Slack** | `slack` |
@@ -105,7 +109,7 @@ scripts/test-config-toml # TOML parsing with indented sections
2. **State facts.** "The logs show X" not "this is probably X."
3. **Disprove, don't confirm.** Design queries to falsify your hypothesis.
4. **Time filter first.** Always. No exceptions.
5. **Discover schema.** Run `getschema` before querying unfamiliar datasets.
5. **Discover schema.** Run `getschema` (APL) or `--spec` (MPL) before querying unfamiliar datasets.

## Memory

29 changes: 18 additions & 11 deletions SKILL.core.md
@@ -155,8 +155,10 @@ Follow this loop strictly.

### D. EXECUTE (Query)
- **Select methodology:** Golden Signals (customer-facing health), RED (request-driven services), USE (infrastructure resources)
- **Select telemetry:** Use whatever's available—metrics, logs, traces, profiles
- **Run query:** `scripts/axiom-query` (logs), `scripts/grafana-query` (metrics), `scripts/pyroscope-diff` (profiles)
- **Metrics:** Axiom MetricsDB (`[MPL]` datasets from `scripts/init`), Grafana/PromQL, alerts/dashboards via Grafana
- **Discover metrics:** `scripts/axiom-metrics-discover` (list metrics, tags, tag values in MetricsDB datasets)
- **Alerts & dashboards:** Grafana only — `scripts/grafana-alerts`, `scripts/grafana-dashboards`
- **Run query:** `scripts/axiom-query` (logs/APL), `scripts/axiom-metrics-query` (metrics/MPL), `scripts/grafana-query` (PromQL), `scripts/pyroscope-diff` (profiles)

### E. VERIFY & REFLECT
- **Methodology check:** Service → RED. Resource → USE.
@@ -309,7 +311,7 @@ For request-driven services. Measures the *work* the service does.
| **Errors** | Error rate (5xx / total) |
| **Duration** | Latency percentiles (p50, p95, p99) |

Measure via logs (APL — see `reference/apl.md`) or metrics (PromQL — see `reference/grafana.md`).
Measure via logs (APL — see `reference/apl.md`), OTel metrics (MPL — see `reference/metrics.md`), or PromQL fallback (see `reference/grafana.md`).

### C. USE METHOD (Resources)

@@ -321,7 +323,7 @@ For infrastructure resources (CPU, memory, disk, network). Measures the *capacit
| **Saturation** | Queue depth, load average, waiting threads |
| **Errors** | Hardware/network errors |

Typically measured via metrics. See `reference/grafana.md` for PromQL patterns.
Check Axiom MetricsDB first (OTel resource metrics). Fall back to Grafana/PromQL if not available. See `reference/grafana.md` for PromQL patterns.
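As a sketch, a USE-style utilization check against a MetricsDB dataset might look like this (the dataset and metric names are hypothetical — confirm what actually exists with `scripts/axiom-metrics-discover` first), following the same MPL shape as the examples elsewhere in this skill:

```
otel-metrics:system.cpu.utilization
| align to 5m using avg
| group by host.name
```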

### D. DIFFERENTIAL ANALYSIS

@@ -358,6 +360,8 @@ See `reference/apl.md` for full operator, function, and pattern reference.
- **Avoid `search`**—scans ALL fields. Last resort only.
- **Field escaping**—dots need `\\.`: `['kubernetes.node_labels.nodepool\\.axiom\\.co/name']`

**MetricsDB/MPL:** For OTel metrics (`[MPL]` datasets), discover with `scripts/axiom-metrics-discover`, query with `scripts/axiom-metrics-query`. See `reference/metrics.md`.

**Need more?** Open `reference/apl.md` for operators/functions, `reference/query-patterns.md` for ready-to-use investigation queries.

---
@@ -374,15 +378,16 @@ Every finding must link to its source — dashboards, queries, error reports, PR
5. **Data responses**—Any answer citing tool-derived numbers (e.g. burn rates, error counts, usage stats, etc). Questions don't require investigation, but if you cite numbers from a query, include the source link.

**Rule: If you ran a query and cite its results, generate a permalink.** Run the appropriate link tool for every query whose results appear in your response:
- **Axiom:** `scripts/axiom-link`
- **Axiom:** `scripts/axiom-link` (works for both APL and MPL queries)
- **Grafana:** `scripts/grafana-link`
- **Pyroscope:** `scripts/pyroscope-link`
- **Sentry:** `scripts/sentry-link`

**Permalinks:**
```bash
# Axiom
# Axiom (APL or MPL — same script handles both)
scripts/axiom-link <env> "['logs'] | where status >= 500 | take 100" "1h"
scripts/axiom-link <env> "dataset:metric.name | align to 5m using avg" "1h"
# Grafana (metrics)
scripts/grafana-link <env> <datasource-uid> "rate(http_requests_total[5m])" "1h"
# Pyroscope (profiling)
@@ -480,20 +485,21 @@ See `reference/postmortem-template.md` for retrospective format.

## 15. TOOL REFERENCE

### Axiom (Logs & Events)
### Axiom (Logs & Events — APL)
```bash
scripts/axiom-query <env> <<< "['dataset'] | getschema"
scripts/axiom-query <env> <<< "['dataset'] | where _time > ago(1h) | project _time, message, level | take 5"
scripts/axiom-query <env> --ndjson <<< "['dataset'] | where _time > ago(1h) | project _time, message | take 1"
```

### Grafana (Metrics)
### Axiom (MetricsDB — MPL)
```bash
scripts/grafana-query <env> prometheus 'rate(http_requests_total[5m])'
scripts/axiom-metrics-discover <env> <dataset> metrics|tags|tag-values|search
scripts/axiom-metrics-query <env> --range 1h <<< "dataset:metric.name | align to 5m using avg"
```

### Pyroscope (Profiling)
### Grafana (PromQL fallback) / Pyroscope / Slack
```bash
scripts/grafana-query <env> prometheus 'rate(http_requests_total[5m])'
scripts/pyroscope-diff <env> <app_name> -2h -1h -1h now
```

@@ -518,6 +524,7 @@ scripts/slack-upload <env> <channel> ./file.png --comment "Description" --thread

- `reference/apl.md`—APL operators, functions, and spotlight analysis
- `reference/axiom.md`—Axiom API endpoints (70+)
- `reference/metrics.md`—MetricsDB MPL querying, discovery, and patterns
- `reference/blocks.md`—Slack Block Kit formatting
- `reference/failure-modes.md`—Common failure patterns
- `reference/grafana.md`—Grafana queries and PromQL patterns
161 changes: 161 additions & 0 deletions scripts/axiom-metrics-discover
@@ -0,0 +1,161 @@
#!/usr/bin/env bash
# Axiom MetricsDB info endpoint helper - discover metrics, tags, and tag values
#
# Usage: axiom-metrics-discover <deployment> <dataset> [options] <command> [args...]
#
# Commands:
# metrics List all metrics in dataset
# tags List all tags in dataset
# tag-values <tag> List values for a tag
# metric-tags <metric> List tags for a metric
# metric-tag-values <metric> <tag> List tag values for metric+tag
# search <value> Find metrics matching a tag value (POST)
#
# Options:
# --range <r> Time range from now (e.g. 1h, 24h, 7d). Default: 1h
# --start <ts> Start time (RFC3339)
# --end <ts> End time (RFC3339)
#
# Examples:
# axiom-metrics-discover prod otel-metrics metrics
# axiom-metrics-discover prod otel-metrics --range 24h tags
# axiom-metrics-discover prod otel-metrics tag-values service.name
# axiom-metrics-discover prod otel-metrics metric-tags http.server.request.duration
# axiom-metrics-discover prod otel-metrics metric-tag-values http.server.request.duration service.name
# axiom-metrics-discover prod otel-metrics search "api-gateway"

set -euo pipefail

if [[ $# -lt 3 ]]; then
echo "Usage: axiom-metrics-discover <deployment> <dataset> [options] <command> [args...]" >&2
exit 1
fi

DEPLOYMENT="$1"
DATASET="$2"
shift 2

START_TIME="${START_TIME:-}"
END_TIME="${END_TIME:-}"
RANGE="${RANGE:-}"

# Parse options before command
while [[ $# -gt 0 ]]; do
case "$1" in
--start)
START_TIME="$2"
shift 2
;;
--end)
END_TIME="$2"
shift 2
;;
--range)
RANGE="$2"
shift 2
;;
-*)
echo "Error: Unknown option '$1'." >&2
exit 1
;;
*)
break
;;
esac
done

if [[ $# -lt 1 ]]; then
echo "Error: No command specified. Use: metrics, tags, tag-values, metric-tags, metric-tag-values, search." >&2
exit 1
fi

COMMAND="$1"
shift

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# shellcheck disable=SC1091
source "$SCRIPT_DIR/lib-time"

# Validate time arguments
if [[ -n "$RANGE" && ( -n "$START_TIME" || -n "$END_TIME" ) ]]; then
echo "Error: --range cannot be combined with --start/--end." >&2
exit 1
fi

if [[ -n "$RANGE" ]]; then
START_TIME=$(range_to_rfc3339 "$RANGE") || exit 1
END_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ) || exit 1
if [[ -z "$START_TIME" || -z "$END_TIME" ]]; then
echo "Error: Failed to compute time range from '$RANGE'." >&2
exit 1
fi
elif [[ -n "$START_TIME" && -n "$END_TIME" ]]; then
: # explicit start/end provided
elif [[ -n "$START_TIME" || -n "$END_TIME" ]]; then
echo "Error: Both --start and --end are required when specifying explicit times." >&2
exit 1
else
# Default to 1h
START_TIME=$(range_to_rfc3339 "1h") || exit 1
END_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ) || exit 1
if [[ -z "$START_TIME" || -z "$END_TIME" ]]; then
echo "Error: Failed to compute default time range." >&2
exit 1
fi
fi

# URL-encode a path segment
uriencode() {
jq -rn --arg x "$1" '$x|@uri'
}

DATASET_ENC=$(uriencode "$DATASET")
BASE="/v1/query/metrics/info/datasets/${DATASET_ENC}"
QS="start=$(uriencode "$START_TIME")&end=$(uriencode "$END_TIME")"

case "$COMMAND" in
metrics)
"$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" GET "${BASE}/metrics?${QS}" | jq .
;;
tags)
"$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" GET "${BASE}/tags?${QS}" | jq .
;;
tag-values)
if [[ $# -lt 1 ]]; then
echo "Error: tag-values requires a <tag> argument." >&2
exit 1
fi
TAG_ENC=$(uriencode "$1")
"$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" GET "${BASE}/tags/${TAG_ENC}/values?${QS}" | jq .
;;
metric-tags)
if [[ $# -lt 1 ]]; then
echo "Error: metric-tags requires a <metric> argument." >&2
exit 1
fi
METRIC_ENC=$(uriencode "$1")
"$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" GET "${BASE}/metrics/${METRIC_ENC}/tags?${QS}" | jq .
;;
metric-tag-values)
if [[ $# -lt 2 ]]; then
echo "Error: metric-tag-values requires <metric> and <tag> arguments." >&2
exit 1
fi
METRIC_ENC=$(uriencode "$1")
TAG_ENC=$(uriencode "$2")
"$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" GET "${BASE}/metrics/${METRIC_ENC}/tags/${TAG_ENC}/values?${QS}" | jq .
;;
search)
if [[ $# -lt 1 ]]; then
echo "Error: search requires a <value> argument." >&2
exit 1
fi
BODY=$(jq -nc --arg v "$1" '{"value": $v}')
"$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" POST "${BASE}/metrics?${QS}" "$BODY" | jq .
;;
*)
echo "Error: Unknown command '$COMMAND'. Use: metrics, tags, tag-values, metric-tags, metric-tag-values, search." >&2
exit 1
;;
esac
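The `uriencode` helper leans on jq's `@uri` filter, which percent-encodes RFC 3986 reserved characters — including the `:` characters in RFC3339 timestamps, which would otherwise land raw in the query string. A quick standalone check:

```shell
#!/usr/bin/env bash
set -euo pipefail

# jq's @uri filter percent-encodes reserved characters,
# so ':' in an RFC3339 timestamp becomes %3A.
uriencode() {
  jq -rn --arg x "$1" '$x|@uri'
}

uriencode "2026-02-18T10:00:00Z"   # -> 2026-02-18T10%3A00%3A00Z
```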