diff --git a/README.md b/README.md index 0e2e0ea..ea684b6 100644 --- a/README.md +++ b/README.md @@ -13,6 +13,7 @@ You're welcome. - **Finds root causes** — Hypothesis-driven investigation. No hunches. No vibes. Data. - **Systematic triage** — Golden signals, USE/RED methods. The stuff you should already know. - **Remembers everything** — Persistent memory for patterns, queries, and incidents. Unlike you, I learn. +- **Metrics querying** — OTel metrics via MPL. Logs via APL. One agent, both engines. - **Unified observability** — One config, all your tools. Because having four config files is amateur hour. ## Installation @@ -67,9 +68,12 @@ Auth options per deployment: ## Usage ```bash -# Query logs +# Query logs (APL) scripts/axiom-query prod "['dataset'] | where _time > ago(1h) | where status >= 500 | project _time, message, status | take 10" +# Query metrics (MPL) +scripts/axiom-metrics-query prod --range 1h <<< "otel-metrics:http.server.request.duration | align to 5m using avg | group by service.name" + # Check what's on fire scripts/grafana-alerts prod firing @@ -84,7 +88,7 @@ scripts/slack default chat.postMessage channel=incidents text="Fixed. You're wel | Category | Scripts | |----------|---------| -| **Axiom** | `axiom-query`, `axiom-api`, `axiom-link`, `axiom-deployments` | +| **Axiom** | `axiom-query`, `axiom-metrics-query`, `axiom-api`, `axiom-link`, `axiom-deployments` | | **Grafana** | `grafana-query`, `grafana-alerts`, `grafana-datasources`, `grafana-api` | | **Pyroscope** | `pyroscope-flamegraph`, `pyroscope-diff`, `pyroscope-services`, `pyroscope-api` | | **Slack** | `slack` | @@ -105,7 +109,7 @@ scripts/test-config-toml # TOML parsing with indented sections 2. **State facts.** "The logs show X" not "this is probably X." 3. **Disprove, don't confirm.** Design queries to falsify your hypothesis. 4. **Time filter first.** Always. No exceptions. -5. **Discover schema.** Run `getschema` before querying unfamiliar datasets. +5. 
**Discover schema.** Run `getschema` (APL) or `--spec` (MPL) before querying unfamiliar datasets. ## Memory diff --git a/SKILL.core.md b/SKILL.core.md index e69c7ff..325b70d 100644 --- a/SKILL.core.md +++ b/SKILL.core.md @@ -155,8 +155,10 @@ Follow this loop strictly. ### D. EXECUTE (Query) - **Select methodology:** Golden Signals (customer-facing health), RED (request-driven services), USE (infrastructure resources) -- **Select telemetry:** Use whatever's available—metrics, logs, traces, profiles -- **Run query:** `scripts/axiom-query` (logs), `scripts/grafana-query` (metrics), `scripts/pyroscope-diff` (profiles) +- **Metrics:** Axiom MetricsDB (`[MPL]` datasets from `scripts/init`), Grafana/PromQL, alerts/dashboards via Grafana +- **Discover metrics:** `scripts/axiom-metrics-discover` (list metrics, tags, tag values in MetricsDB datasets) +- **Alerts & dashboards:** Grafana only — `scripts/grafana-alerts`, `scripts/grafana-dashboards` +- **Run query:** `scripts/axiom-query` (logs/APL), `scripts/axiom-metrics-query` (metrics/MPL), `scripts/grafana-query` (PromQL), `scripts/pyroscope-diff` (profiles) ### E. VERIFY & REFLECT - **Methodology check:** Service → RED. Resource → USE. @@ -309,7 +311,7 @@ For request-driven services. Measures the *work* the service does. | **Errors** | Error rate (5xx / total) | | **Duration** | Latency percentiles (p50, p95, p99) | -Measure via logs (APL — see `reference/apl.md`) or metrics (PromQL — see `reference/grafana.md`). +Measure via logs (APL — see `reference/apl.md`), OTel metrics (MPL — see `reference/metrics.md`), or PromQL fallback (see `reference/grafana.md`). ### C. USE METHOD (Resources) @@ -321,7 +323,7 @@ For infrastructure resources (CPU, memory, disk, network). Measures the *capacit | **Saturation** | Queue depth, load average, waiting threads | | **Errors** | Hardware/network errors | -Typically measured via metrics. See `reference/grafana.md` for PromQL patterns. +Check Axiom MetricsDB first (OTel resource metrics). 
Fall back to Grafana/PromQL if not available. See `reference/grafana.md` for PromQL patterns. ### D. DIFFERENTIAL ANALYSIS @@ -358,6 +360,8 @@ See `reference/apl.md` for full operator, function, and pattern reference. - **Avoid `search`**—scans ALL fields. Last resort only. - **Field escaping**—dots need `\\.`: `['kubernetes.node_labels.nodepool\\.axiom\\.co/name']` +**MetricsDB/MPL:** For OTel metrics (`[MPL]` datasets), discover with `scripts/axiom-metrics-discover`, query with `scripts/axiom-metrics-query`. See `reference/metrics.md`. + **Need more?** Open `reference/apl.md` for operators/functions, `reference/query-patterns.md` for ready-to-use investigation queries. --- @@ -374,15 +378,16 @@ Every finding must link to its source — dashboards, queries, error reports, PR 5. **Data responses**—Any answer citing tool-derived numbers (e.g. burn rates, error counts, usage stats, etc). Questions don't require investigation, but if you cite numbers from a query, include the source link. **Rule: If you ran a query and cite its results, generate a permalink.** Run the appropriate link tool for every query whose results appear in your response: -- **Axiom:** `scripts/axiom-link` +- **Axiom:** `scripts/axiom-link` (works for both APL and MPL queries) - **Grafana:** `scripts/grafana-link` - **Pyroscope:** `scripts/pyroscope-link` - **Sentry:** `scripts/sentry-link` **Permalinks:** ```bash -# Axiom +# Axiom (APL or MPL — same script handles both) scripts/axiom-link "['logs'] | where status >= 500 | take 100" "1h" +scripts/axiom-link "dataset:metric.name | align to 5m using avg" "1h" # Grafana (metrics) scripts/grafana-link "rate(http_requests_total[5m])" "1h" # Pyroscope (profiling) @@ -480,20 +485,21 @@ See `reference/postmortem-template.md` for retrospective format. ## 15. 
TOOL REFERENCE -### Axiom (Logs & Events) +### Axiom (Logs & Events — APL) ```bash scripts/axiom-query <<< "['dataset'] | getschema" scripts/axiom-query <<< "['dataset'] | where _time > ago(1h) | project _time, message, level | take 5" -scripts/axiom-query --ndjson <<< "['dataset'] | where _time > ago(1h) | project _time, message | take 1" ``` -### Grafana (Metrics) +### Axiom (MetricsDB — MPL) ```bash -scripts/grafana-query prometheus 'rate(http_requests_total[5m])' +scripts/axiom-metrics-discover <deployment> <dataset> metrics|tags|tag-values|search +scripts/axiom-metrics-query <deployment> --range 1h <<< "dataset:metric.name | align to 5m using avg" ``` -### Pyroscope (Profiling) +### Grafana (PromQL fallback) / Pyroscope / Slack ```bash +scripts/grafana-query prometheus 'rate(http_requests_total[5m])' scripts/pyroscope-diff -2h -1h -1h now ``` @@ -518,6 +524,7 @@ scripts/slack-upload ./file.png --comment "Description" --thread - `reference/apl.md`—APL operators, functions, and spotlight analysis - `reference/axiom.md`—Axiom API endpoints (70+) +- `reference/metrics.md`—MetricsDB MPL querying, discovery, and patterns - `reference/blocks.md`—Slack Block Kit formatting - `reference/failure-modes.md`—Common failure patterns - `reference/grafana.md`—Grafana queries and PromQL patterns diff --git a/scripts/axiom-metrics-discover b/scripts/axiom-metrics-discover new file mode 100755 index 0000000..ab4ca14 --- /dev/null +++ b/scripts/axiom-metrics-discover @@ -0,0 +1,163 @@ +#!/usr/bin/env bash +# Axiom MetricsDB info endpoint helper - discover metrics, tags, and tag values +# +# Usage: axiom-metrics-discover <deployment> <dataset> [options] <command> [args...] +# +# Commands: +# metrics List all metrics in dataset +# tags List all tags in dataset +# tag-values <tag> List values for a tag +# metric-tags <metric> List tags for a metric +# metric-tag-values <metric> <tag> List tag values for metric+tag +# search <value> Find metrics matching a tag value (POST) +# +# Options: +# --range <range> Time range from now (e.g. 1h, 24h, 7d).
Default: 1h +# --start <rfc3339> Start time (RFC3339) +# --end <rfc3339> End time (RFC3339) +# +# Examples: +# axiom-metrics-discover prod otel-metrics metrics +# axiom-metrics-discover prod otel-metrics --range 24h tags +# axiom-metrics-discover prod otel-metrics tag-values service.name +# axiom-metrics-discover prod otel-metrics metric-tags http.server.request.duration +# axiom-metrics-discover prod otel-metrics metric-tag-values http.server.request.duration service.name +# axiom-metrics-discover prod otel-metrics search "api-gateway" + +set -euo pipefail + +if [[ $# -lt 3 ]]; then + echo "Usage: axiom-metrics-discover <deployment> <dataset> [options] <command> [args...]" >&2 + exit 1 +fi + +DEPLOYMENT="$1" +DATASET="$2" +shift 2 + +START_TIME="${START_TIME:-}" +END_TIME="${END_TIME:-}" +RANGE="${RANGE:-}" + +# Parse options before command +while [[ $# -gt 0 ]]; do + case "$1" in + --start) + START_TIME="$2" + shift 2 + ;; + --end) + END_TIME="$2" + shift 2 + ;; + --range) + RANGE="$2" + shift 2 + ;; + -*) + echo "Error: Unknown option '$1'." >&2 + exit 1 + ;; + *) + break + ;; + esac +done + +if [[ $# -lt 1 ]]; then + echo "Error: No command specified. Use: metrics, tags, tag-values, metric-tags, metric-tag-values, search." >&2 + exit 1 +fi + +COMMAND="$1" +shift + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +# shellcheck disable=SC1091 +source "$SCRIPT_DIR/lib-time" + +# Validate time arguments +if [[ -n "$RANGE" && ( -n "$START_TIME" || -n "$END_TIME" ) ]]; then + echo "Error: --range cannot be combined with --start/--end." >&2 + exit 1 +fi + +if [[ -n "$RANGE" ]]; then + START_TIME=$(range_to_rfc3339 "$RANGE") || exit 1 + END_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ) || exit 1 + if [[ -z "$START_TIME" || -z "$END_TIME" ]]; then + echo "Error: Failed to compute time range from '$RANGE'."
>&2 + exit 1 + fi +elif [[ -n "$START_TIME" && -n "$END_TIME" ]]; then + : # explicit start/end provided +elif [[ -n "$START_TIME" || -n "$END_TIME" ]]; then + echo "Error: Both --start and --end are required when specifying explicit times." >&2 + exit 1 +else + # Default to 1h + START_TIME=$(range_to_rfc3339 "1h") || exit 1 + END_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ) || exit 1 + if [[ -z "$START_TIME" || -z "$END_TIME" ]]; then + echo "Error: Failed to compute default time range." >&2 + exit 1 + fi +fi + +# URL-encode a path segment +uriencode() { + jq -rn --arg x "$1" '$x|@uri' +} + +DATASET_ENC=$(uriencode "$DATASET") +START_ENC=$(uriencode "$START_TIME") +END_ENC=$(uriencode "$END_TIME") +BASE="/v1/query/metrics/info/datasets/${DATASET_ENC}" +QS="start=${START_ENC}&end=${END_ENC}" + +case "$COMMAND" in + metrics) + "$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" GET "${BASE}/metrics?${QS}" | jq . + ;; + tags) + "$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" GET "${BASE}/tags?${QS}" | jq . + ;; + tag-values) + if [[ $# -lt 1 ]]; then + echo "Error: tag-values requires a <tag> argument." >&2 + exit 1 + fi + TAG_ENC=$(uriencode "$1") + "$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" GET "${BASE}/tags/${TAG_ENC}/values?${QS}" | jq . + ;; + metric-tags) + if [[ $# -lt 1 ]]; then + echo "Error: metric-tags requires a <metric> argument." >&2 + exit 1 + fi + METRIC_ENC=$(uriencode "$1") + "$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" GET "${BASE}/metrics/${METRIC_ENC}/tags?${QS}" | jq . + ;; + metric-tag-values) + if [[ $# -lt 2 ]]; then + echo "Error: metric-tag-values requires <metric> and <tag> arguments." >&2 + exit 1 + fi + METRIC_ENC=$(uriencode "$1") + TAG_ENC=$(uriencode "$2") + "$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" GET "${BASE}/metrics/${METRIC_ENC}/tags/${TAG_ENC}/values?${QS}" | jq . + ;; + search) + if [[ $# -lt 1 ]]; then + echo "Error: search requires a <value> argument." >&2 + exit 1 + fi + BODY=$(jq -nc --arg v "$1" '{"value": $v}') + "$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" POST "${BASE}/metrics?${QS}" "$BODY" | jq .
+ ;; + *) + echo "Error: Unknown command '$COMMAND'. Use: metrics, tags, tag-values, metric-tags, metric-tag-values, search." >&2 + exit 1 + ;; +esac diff --git a/scripts/axiom-metrics-query b/scripts/axiom-metrics-query new file mode 100755 index 0000000..1318575 --- /dev/null +++ b/scripts/axiom-metrics-query @@ -0,0 +1,159 @@ +#!/usr/bin/env bash +# Axiom MetricsDB MPL query helper - reads query from stdin +# +# Usage: axiom-metrics-query <deployment> [options] <<< "mpl query" +# +# Options: +# --start <rfc3339> Start time (RFC3339, e.g. 2025-01-01T00:00:00Z) +# --end <rfc3339> End time (RFC3339, e.g. 2025-01-02T00:00:00Z) +# --range <range> Convenience range from now (e.g. 1h, 24h, 7d) +# --trace Print x-axiom-trace-id on success +# --spec Fetch MPL language specification (no query needed) +# +# Time: Either (--start + --end) or --range is required (not both). +# MPL does NOT support relative time expressions — RFC3339 only. +# +# Examples: +# axiom-metrics-query prod --range 1h <<< "dataset:metric.name | align to 5m using avg" +# axiom-metrics-query prod --start 2025-01-01T00:00:00Z --end 2025-01-02T00:00:00Z <<< "dataset:cpu.usage" +# axiom-metrics-query prod --spec + +set -euo pipefail + +if [[ $# -lt 1 ]]; then + echo "Usage: axiom-metrics-query <deployment> [options] <<< 'mpl query'" >&2 + exit 1 +fi + +DEPLOYMENT="$1" +shift + +PRINT_TRACE=false +FETCH_SPEC=false +START_TIME="${START_TIME:-}" +END_TIME="${END_TIME:-}" +RANGE="${RANGE:-}" + +while [[ $# -gt 0 ]]; do + case "$1" in + --start) + START_TIME="$2" + shift 2 + ;; + --end) + END_TIME="$2" + shift 2 + ;; + --range) + RANGE="$2" + shift 2 + ;; + --trace) + PRINT_TRACE=true + shift + ;; + --spec) + FETCH_SPEC=true + shift + ;; + *) + echo "Error: Unknown argument '$1'. Queries must be passed via stdin."
>&2 + exit 1 + ;; + esac +done + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +# Load config from unified config file +# shellcheck disable=SC1090 +eval "$("$SCRIPT_DIR/config" axiom "$DEPLOYMENT")" + +RESP_HEADERS=$(mktemp) +RESP_BODY=$(mktemp) +cleanup() { + rm -f "$RESP_HEADERS" "$RESP_BODY" +} +trap cleanup EXIT + +# --spec: fetch MPL language specification via OPTIONS and exit +if [[ "$FETCH_SPEC" == true ]]; then + HTTP_CODE=$(curl -sS -o "$RESP_BODY" -D "$RESP_HEADERS" -w "%{http_code}" \ + -X OPTIONS "$AXIOM_URL/v1/query/_metrics" \ + -H "Authorization: Bearer $AXIOM_TOKEN" \ + -H "X-Axiom-Org-Id: $AXIOM_ORG_ID") + + if [[ "$HTTP_CODE" -lt 200 || "$HTTP_CODE" -ge 300 ]]; then + msg=$(jq -r '.message // empty' "$RESP_BODY" 2>/dev/null) + trace=$(grep -i '^x-axiom-trace-id:' "$RESP_HEADERS" | tail -1 | awk '{print $2}' | tr -d '\r') + echo "error: ${msg:-http $HTTP_CODE}" >&2 + if [[ -n "$trace" ]]; then + echo "trace_id: $trace" >&2 + fi + exit 1 + fi + + cat "$RESP_BODY" + exit 0 +fi + +# Require query from stdin +if [[ -t 0 ]]; then + echo "Error: No query provided. Pipe a query to stdin." >&2 + echo "" >&2 + echo "Examples:" >&2 + echo " axiom-metrics-query $DEPLOYMENT --range 1h <<< \"dataset:metric.name | align to 5m using avg\"" >&2 + exit 1 +fi + +# shellcheck disable=SC1091 +source "$SCRIPT_DIR/lib-time" + +# Validate time arguments +if [[ -n "$RANGE" && ( -n "$START_TIME" || -n "$END_TIME" ) ]]; then + echo "Error: --range cannot be combined with --start/--end." >&2 + exit 1 +fi + +if [[ -n "$RANGE" ]]; then + START_TIME=$(range_to_rfc3339 "$RANGE") || exit 1 + END_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ) || exit 1 + if [[ -z "$START_TIME" || -z "$END_TIME" ]]; then + echo "Error: Failed to compute time range from '$RANGE'." >&2 + exit 1 + fi +elif [[ -z "$START_TIME" || -z "$END_TIME" ]]; then + echo "Error: Either (--start + --end) or --range is required." >&2 + exit 1 +fi + +APL=$(cat) +APL_JSON=$(printf '%s' "$APL" | jq -Rs .) 
+START_JSON=$(printf '%s' "$START_TIME" | jq -Rs .) +END_JSON=$(printf '%s' "$END_TIME" | jq -Rs .) + +HTTP_CODE=$(curl -sS -o "$RESP_BODY" -D "$RESP_HEADERS" -w "%{http_code}" \ + -X POST "$AXIOM_URL/v1/query/_metrics?format=metrics-v1" \ + -H "Authorization: Bearer $AXIOM_TOKEN" \ + -H "X-Axiom-Org-Id: $AXIOM_ORG_ID" \ + -H "Content-Type: application/json" \ + -d "{\"apl\": $APL_JSON, \"startTime\": $START_JSON, \"endTime\": $END_JSON}") + +if [[ "$HTTP_CODE" -lt 200 || "$HTTP_CODE" -ge 300 ]]; then + msg=$(jq -r '.message // empty' "$RESP_BODY" 2>/dev/null) + trace=$(grep -i '^x-axiom-trace-id:' "$RESP_HEADERS" | tail -1 | awk '{print $2}' | tr -d '\r') + echo "error: ${msg:-http $HTTP_CODE}" >&2 + if [[ -n "$trace" ]]; then + echo "trace_id: $trace" >&2 + fi + exit 1 +fi + +if [[ "$PRINT_TRACE" == true ]]; then + trace=$(grep -i '^x-axiom-trace-id:' "$RESP_HEADERS" | tail -1 | awk '{print $2}' | tr -d '\r') + if [[ -n "$trace" ]]; then + echo "trace_id: $trace" >&2 + fi +fi + +cat "$RESP_BODY" diff --git a/scripts/lib-time b/scripts/lib-time new file mode 100755 index 0000000..3d2001a --- /dev/null +++ b/scripts/lib-time @@ -0,0 +1,48 @@ +#!/usr/bin/env bash +# Shared time utilities for Gilfoyle scripts +# Source this file: source "$SCRIPT_DIR/lib-time" + +# range_to_rfc3339 converts a human range (e.g. 1h, 24h, 7d) to an RFC3339 timestamp that many seconds ago +range_to_rfc3339() { + local range="$1" + local value="${range%[smhd]}" + local suffix="${range: -1}" + + if ! [[ "$value" =~ ^[0-9]+$ ]]; then + echo "Error: Invalid range value '$range'. Expected number + suffix (s/m/h/d)." >&2 + return 1 + fi + + local label + case "$suffix" in + s) label="second" ;; + h) label="hour" ;; + d) label="day" ;; + m) label="minute" ;; + *) + echo "Error: Invalid range suffix '$suffix'. Use s (seconds), m (minutes), h (hours), or d (days)." 
>&2 + return 1 + ;; + esac + + # Pluralize for values other than 1 + if [[ "$value" -ne 1 ]]; then + label="${label}s" + fi + + # Try GNU date first (linux, or gdate on macOS), then fall back to macOS date + if date -u -d "1 hour ago" +%Y-%m-%dT%H:%M:%SZ &>/dev/null; then + # GNU date + date -u -d "$value $label ago" +%Y-%m-%dT%H:%M:%SZ + else + # macOS date: -v flag with uppercase suffix + local date_flag + case "$suffix" in + s) date_flag="-v-${value}S" ;; + h) date_flag="-v-${value}H" ;; + d) date_flag="-v-${value}d" ;; + m) date_flag="-v-${value}M" ;; + esac + date -u "$date_flag" +%Y-%m-%dT%H:%M:%SZ + fi +} diff --git a/scripts/sync-to-skills b/scripts/sync-to-skills index f83997e..cadca98 100755 --- a/scripts/sync-to-skills +++ b/scripts/sync-to-skills @@ -88,6 +88,7 @@ find "$TARGET/scripts" "$TARGET/reference" "$TARGET/templates" -type f | while r # Skip binary files if file "$file" | grep -q "text"; then sedi \ + -e 's|GILFOYLE_NO_CACHE|SRE_NO_CACHE|g' \ -e 's|GILFOYLE_INIT_TIMEOUT|SRE_INIT_TIMEOUT|g' \ -e 's|GILFOYLE_CONFIG_DIR|SRE_CONFIG_DIR|g' \ -e 's|GILFOYLE_CONFIG|SRE_CONFIG|g' \ diff --git a/scripts/test-build b/scripts/test-build index 88903c4..14ab47f 100755 --- a/scripts/test-build +++ b/scripts/test-build @@ -41,7 +41,7 @@ else fi # Test 4: correct frontmatter name -if echo "$OUTPUT" | head -3 | grep -q "name: axiom-sre"; then +if echo "$OUTPUT" | grep -m1 -q "name: axiom-sre"; then pass "axiom-sre frontmatter name correct" else fail "axiom-sre frontmatter name wrong" diff --git a/skill/SKILL.md b/skill/SKILL.md index aaf9f95..c58a780 100644 --- a/skill/SKILL.md +++ b/skill/SKILL.md @@ -181,8 +181,10 @@ Follow this loop strictly. ### D. 
EXECUTE (Query) - **Select methodology:** Golden Signals (customer-facing health), RED (request-driven services), USE (infrastructure resources) -- **Select telemetry:** Use whatever's available—metrics, logs, traces, profiles -- **Run query:** `scripts/axiom-query` (logs), `scripts/grafana-query` (metrics), `scripts/pyroscope-diff` (profiles) +- **Metrics:** Axiom MetricsDB (`[MPL]` datasets from `scripts/init`), Grafana/PromQL, alerts/dashboards via Grafana +- **Discover metrics:** `scripts/axiom-metrics-discover` (list metrics, tags, tag values in MetricsDB datasets) +- **Alerts & dashboards:** Grafana only — `scripts/grafana-alerts`, `scripts/grafana-dashboards` +- **Run query:** `scripts/axiom-query` (logs/APL), `scripts/axiom-metrics-query` (metrics/MPL), `scripts/grafana-query` (PromQL), `scripts/pyroscope-diff` (profiles) ### E. VERIFY & REFLECT - **Methodology check:** Service → RED. Resource → USE. @@ -335,7 +337,7 @@ For request-driven services. Measures the *work* the service does. | **Errors** | Error rate (5xx / total) | | **Duration** | Latency percentiles (p50, p95, p99) | -Measure via logs (APL — see `reference/apl.md`) or metrics (PromQL — see `reference/grafana.md`). +Measure via logs (APL — see `reference/apl.md`), OTel metrics (MPL — see `reference/metrics.md`), or PromQL fallback (see `reference/grafana.md`). ### C. USE METHOD (Resources) @@ -347,7 +349,7 @@ For infrastructure resources (CPU, memory, disk, network). Measures the *capacit | **Saturation** | Queue depth, load average, waiting threads | | **Errors** | Hardware/network errors | -Typically measured via metrics. See `reference/grafana.md` for PromQL patterns. +Check Axiom MetricsDB first (OTel resource metrics). Fall back to Grafana/PromQL if not available. See `reference/grafana.md` for PromQL patterns. ### D. DIFFERENTIAL ANALYSIS @@ -384,6 +386,8 @@ See `reference/apl.md` for full operator, function, and pattern reference. - **Avoid `search`**—scans ALL fields. Last resort only. 
- **Field escaping**—dots need `\\.`: `['kubernetes.node_labels.nodepool\\.axiom\\.co/name']` +**MetricsDB/MPL:** For OTel metrics (`[MPL]` datasets), discover with `scripts/axiom-metrics-discover`, query with `scripts/axiom-metrics-query`. See `reference/metrics.md`. + **Need more?** Open `reference/apl.md` for operators/functions, `reference/query-patterns.md` for ready-to-use investigation queries. --- @@ -400,15 +404,16 @@ Every finding must link to its source — dashboards, queries, error reports, PR 5. **Data responses**—Any answer citing tool-derived numbers (e.g. burn rates, error counts, usage stats, etc). Questions don't require investigation, but if you cite numbers from a query, include the source link. **Rule: If you ran a query and cite its results, generate a permalink.** Run the appropriate link tool for every query whose results appear in your response: -- **Axiom:** `scripts/axiom-link` +- **Axiom:** `scripts/axiom-link` (works for both APL and MPL queries) - **Grafana:** `scripts/grafana-link` - **Pyroscope:** `scripts/pyroscope-link` - **Sentry:** `scripts/sentry-link` **Permalinks:** ```bash -# Axiom +# Axiom (APL or MPL — same script handles both) scripts/axiom-link "['logs'] | where status >= 500 | take 100" "1h" +scripts/axiom-link "dataset:metric.name | align to 5m using avg" "1h" # Grafana (metrics) scripts/grafana-link "rate(http_requests_total[5m])" "1h" # Pyroscope (profiling) @@ -506,20 +511,21 @@ See `reference/postmortem-template.md` for retrospective format. ## 15. 
TOOL REFERENCE -### Axiom (Logs & Events) +### Axiom (Logs & Events — APL) ```bash scripts/axiom-query <<< "['dataset'] | getschema" scripts/axiom-query <<< "['dataset'] | where _time > ago(1h) | project _time, message, level | take 5" -scripts/axiom-query --ndjson <<< "['dataset'] | where _time > ago(1h) | project _time, message | take 1" ``` -### Grafana (Metrics) +### Axiom (MetricsDB — MPL) ```bash -scripts/grafana-query prometheus 'rate(http_requests_total[5m])' +scripts/axiom-metrics-discover <deployment> <dataset> metrics|tags|tag-values|search +scripts/axiom-metrics-query <deployment> --range 1h <<< "dataset:metric.name | align to 5m using avg" ``` -### Pyroscope (Profiling) +### Grafana (PromQL fallback) / Pyroscope / Slack ```bash +scripts/grafana-query prometheus 'rate(http_requests_total[5m])' scripts/pyroscope-diff -2h -1h -1h now ``` @@ -544,6 +550,7 @@ scripts/slack-upload ./file.png --comment "Description" --thread - `reference/apl.md`—APL operators, functions, and spotlight analysis - `reference/axiom.md`—Axiom API endpoints (70+) +- `reference/metrics.md`—MetricsDB MPL querying, discovery, and patterns - `reference/blocks.md`—Slack Block Kit formatting - `reference/failure-modes.md`—Common failure patterns - `reference/grafana.md`—Grafana queries and PromQL patterns diff --git a/skill/reference/grafana.md b/skill/reference/grafana.md index a6c2914..d657ec3 100644 --- a/skill/reference/grafana.md +++ b/skill/reference/grafana.md @@ -67,14 +67,14 @@ Summary view shows: Samples, Range, **Min/Max with timestamps**, Avg ## Integration with Axiom -Use Grafana alongside Axiom queries for complete incident investigation. Axiom provides logs, Grafana provides infrastructure metrics. +Grafana covers Prometheus-native metrics not shipped to Axiom and provides alerts/dashboards. For OTel metrics (application and infrastructure), Axiom MetricsDB (`[MPL]` datasets) is available. -### Typical Workflow +### Available Data Sources -1. **Axiom**: Find errors/anomalies in application logs -2.
**Grafana**: Correlate with infrastructure metrics from Prometheus -3. **Grafana**: Check what alerts fired during the incident window -4. **Pyroscope**: If CPU/memory issue, get flame graphs +- **Axiom MetricsDB**: OTel metrics — application and infrastructure (MPL) +- **Axiom EventDB**: Logs, traces, error events (APL) +- **Grafana**: Prometheus-native metrics, alerts, dashboards +- **Pyroscope**: CPU and memory flame graphs ### Example: Investigating High Latency diff --git a/skill/reference/metrics.md b/skill/reference/metrics.md new file mode 100644 index 0000000..7055d66 --- /dev/null +++ b/skill/reference/metrics.md @@ -0,0 +1,178 @@ +# MetricsDB Reference + +## MetricsDB vs EventDB + +Axiom has two query engines with distinct query languages and endpoints. + +| | EventDB | MetricsDB | +|--|---------|-----------| +| **Data** | Logs, traces, spans | OTel metrics (counters, gauges, histograms) | +| **Datasets** | Standard datasets | `otel-metrics-v1` datasets | +| **Query language** | APL | MPL | +| **Query script** | `scripts/axiom-query` | `scripts/axiom-metrics-query` | +| **API endpoint** | `POST /v1/datasets/_apl` | `POST /v1/query/_metrics` | +| **Time expressions** | `ago()`, `now()`, absolute | RFC3339 timestamps only — no relative expressions | + +EventDB is general-purpose event storage. MetricsDB is purpose-built for time-series metrics — optimized for aggregation, alignment, and high-cardinality tag queries on counter/gauge/histogram data. + +Do not query MetricsDB datasets with APL. Do not query EventDB datasets with MPL. They are separate systems. + +--- + +## MPL Basics + +### Self-Describing Spec + +MPL's query endpoint documents itself. Always fetch the spec before writing queries: + +```bash +scripts/axiom-metrics-query prod --spec +``` + +This calls `OPTIONS /v1/query/_metrics` and returns the complete MPL language specification — syntax, operators, and examples. + +### Query Format + +``` +DATASET_NAME:METRIC_NAME | operator1 | operator2 | ...
+``` + +The dataset and metric are specified as a single identifier separated by `:`, followed by a pipeline of operators. + +### Key Operators + +| Operator | Purpose | Example | +|----------|---------|---------| +| `align` | Align data to time buckets | `align to 5m using avg` | +| `group` | Group by tag values | `group by service.name` | +| `filter` | Filter by tag values | `filter service.name == "api"` | +| `map` | Transform values | `map value * 100` | +| `bucket` | Histogram bucket operations | `bucket percentile(0.99)` | + +### Time Constraint (CRITICAL) + +MPL requires RFC3339 timestamps. Relative expressions like `ago()`, `now()`, or `now-1h` are **not supported**. + +```bash +# Correct: RFC3339 timestamps +scripts/axiom-metrics-query prod --start "2025-06-01T00:00:00Z" --end "2025-06-01T01:00:00Z" <<< "my-dataset:cpu.usage | align to 5m using avg" + +# Wrong: relative time (will fail) +scripts/axiom-metrics-query prod --start "now-1h" <<< "my-dataset:cpu.usage | align to 5m using avg" +``` + +Always use `--range` or explicit `--start`/`--end` with the query script. + +--- + +## Discovery + +Use `scripts/axiom-metrics-discover` to explore metrics, tags, and tag values. Defaults to last 1 hour. 
+ +```bash +# List all metrics +scripts/axiom-metrics-discover prod otel-metrics metrics + +# List all tags +scripts/axiom-metrics-discover prod otel-metrics tags + +# List values for a tag +scripts/axiom-metrics-discover prod otel-metrics tag-values service.name + +# List tags for a specific metric +scripts/axiom-metrics-discover prod otel-metrics metric-tags http.server.request.duration + +# List tag values for a specific metric+tag +scripts/axiom-metrics-discover prod otel-metrics metric-tag-values http.server.request.duration service.name + +# Find metrics matching a tag value (fastest path from "I know the service" to "what metrics exist") +scripts/axiom-metrics-discover prod otel-metrics search "api-gateway" + +# Custom time range +scripts/axiom-metrics-discover prod otel-metrics --range 24h metrics +scripts/axiom-metrics-discover prod otel-metrics --start 2025-06-01T00:00:00Z --end 2025-06-02T00:00:00Z tags +``` + +Under the hood this calls `/v1/query/metrics/info/` endpoints via `scripts/axiom-api`. For raw access, see the API paths in the script header. + +--- + +## Query Patterns + +### CPU usage by service + +```mpl +otel-metrics:system.cpu.utilization | align to 5m using avg | group by service.name +``` + +### Request rate + +```mpl +otel-metrics:http.server.request.duration | align to 1m using count | group by service.name +``` + +### Error rate from metrics + +```mpl +otel-metrics:http.server.request.duration | filter http.status_code >= 500 | align to 5m using count | group by service.name +``` + +### Memory utilization + +```mpl +otel-metrics:process.runtime.go.mem.heap_alloc | align to 5m using avg | group by service.name +``` + +### Histogram percentiles (p99 latency) + +```mpl +otel-metrics:http.server.request.duration | align to 5m using avg | bucket percentile(0.99) | group by service.name +``` + +### Filter by service.name + +```mpl +otel-metrics:http.server.request.duration | filter service.name == "api-gateway" | align to 1m using avg +``` + +### Combine filter and group + +```mpl +otel-metrics:http.server.request.duration | filter service.namespace == "production" | align to 5m
using count | group by service.name, http.method +``` + +Note: Metric and tag names depend on the OTel instrumentation. Use the discovery endpoints to find the actual names in your datasets. + +--- + +## Error Handling + +| Code | Meaning | Action | +|------|---------|--------| +| 400 | Bad query syntax or invalid dataset | Check MPL syntax via `--spec` flag | +| 401 | Missing or invalid authentication | Verify `AXIOM_TOKEN` is set and valid | +| 403 | No permission to query this dataset | Check token scopes | +| 404 | Dataset not found | Verify dataset name via `scripts/init` | +| 429 | Rate limited | Back off and retry | +| 500 | Internal server error | Report `x-axiom-trace-id` to backend team | + +On **500 errors**: the query script captures the `x-axiom-trace-id` response header automatically. Report this trace ID — it is essential for backend debugging. + +On **400 errors**: the most common cause is invalid MPL syntax. Fetch the spec (`--spec`) and compare your query against it. Common mistakes: +- Using relative time expressions (`ago()`, `now()`) +- Missing `align` operator (most queries need one) +- Wrong metric or tag names (use discovery endpoints to verify) + +--- + +## Workflow + +1. **Identify metrics datasets.** Run `scripts/init` — Axiom deployments list their datasets, including `otel-metrics-v1` types. + +2. **Learn MPL syntax.** Run `scripts/axiom-metrics-query <deployment> --spec` to get the full language specification. Read it before writing queries. + +3. **Discover available metrics.** Use `scripts/axiom-metrics-discover` to list metrics and tags in the target dataset. If you know a service name, use its `search` command to find matching metrics. + +4. **Compose and execute MPL query.** Build the query incrementally — start with the metric, add `align`, then `filter`/`group` as needed. + +5. **Iterate.** Refine filters, aggregations, and time ranges based on results. Narrow the time window for faster responses.
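The "compose incrementally" workflow above can be sketched in plain shell: build the pipeline string one operator at a time, inspect it, then hand it to the query script. Dataset, metric, and tag names here are illustrative, and `prod` is a hypothetical deployment name — verify yours via discovery first.

```shell
#!/usr/bin/env bash
# Sketch: incrementally compose an MPL query (hypothetical names throughout).
set -euo pipefail

q="otel-metrics:http.server.request.duration"   # 1. bare metric selector
q="$q | align to 5m using avg"                  # 2. most queries need align
q="$q | filter service.name == \"api-gateway\"" # 3. narrow by tag
q="$q | group by http.method"                   # 4. split the series

printf '%s\n' "$q"
# Once the string looks right, pipe it to the query script:
# scripts/axiom-metrics-query prod --range 1h <<< "$q"
```

Building the string in a variable keeps each refinement cheap to test: rerun after every appended operator and widen the pipeline only when the previous step returns sensible data.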
diff --git a/skill/scripts/axiom-metrics-discover b/skill/scripts/axiom-metrics-discover new file mode 100755 index 0000000..ab4ca14 --- /dev/null +++ b/skill/scripts/axiom-metrics-discover @@ -0,0 +1,163 @@ +#!/usr/bin/env bash +# Axiom MetricsDB info endpoint helper - discover metrics, tags, and tag values +# +# Usage: axiom-metrics-discover <deployment> <dataset> [options] <command> [args...] +# +# Commands: +# metrics List all metrics in dataset +# tags List all tags in dataset +# tag-values <tag> List values for a tag +# metric-tags <metric> List tags for a metric +# metric-tag-values <metric> <tag> List tag values for metric+tag +# search <value> Find metrics matching a tag value (POST) +# +# Options: +# --range <range> Time range from now (e.g. 1h, 24h, 7d). Default: 1h +# --start <rfc3339> Start time (RFC3339) +# --end <rfc3339> End time (RFC3339) +# +# Examples: +# axiom-metrics-discover prod otel-metrics metrics +# axiom-metrics-discover prod otel-metrics --range 24h tags +# axiom-metrics-discover prod otel-metrics tag-values service.name +# axiom-metrics-discover prod otel-metrics metric-tags http.server.request.duration +# axiom-metrics-discover prod otel-metrics metric-tag-values http.server.request.duration service.name +# axiom-metrics-discover prod otel-metrics search "api-gateway" + +set -euo pipefail + +if [[ $# -lt 3 ]]; then + echo "Usage: axiom-metrics-discover <deployment> <dataset> [options] <command> [args...]" >&2 + exit 1 +fi + +DEPLOYMENT="$1" +DATASET="$2" +shift 2 + +START_TIME="${START_TIME:-}" +END_TIME="${END_TIME:-}" +RANGE="${RANGE:-}" + +# Parse options before command +while [[ $# -gt 0 ]]; do + case "$1" in + --start) + START_TIME="$2" + shift 2 + ;; + --end) + END_TIME="$2" + shift 2 + ;; + --range) + RANGE="$2" + shift 2 + ;; + -*) + echo "Error: Unknown option '$1'." >&2 + exit 1 + ;; + *) + break + ;; + esac +done + +if [[ $# -lt 1 ]]; then + echo "Error: No command specified. Use: metrics, tags, tag-values, metric-tags, metric-tag-values, search."
>&2 + exit 1 +fi + +COMMAND="$1" +shift + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +# shellcheck disable=SC1091 +source "$SCRIPT_DIR/lib-time" + +# Validate time arguments +if [[ -n "$RANGE" && ( -n "$START_TIME" || -n "$END_TIME" ) ]]; then + echo "Error: --range cannot be combined with --start/--end." >&2 + exit 1 +fi + +if [[ -n "$RANGE" ]]; then + START_TIME=$(range_to_rfc3339 "$RANGE") || exit 1 + END_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ) || exit 1 + if [[ -z "$START_TIME" || -z "$END_TIME" ]]; then + echo "Error: Failed to compute time range from '$RANGE'." >&2 + exit 1 + fi +elif [[ -n "$START_TIME" && -n "$END_TIME" ]]; then + : # explicit start/end provided +elif [[ -n "$START_TIME" || -n "$END_TIME" ]]; then + echo "Error: Both --start and --end are required when specifying explicit times." >&2 + exit 1 +else + # Default to 1h + START_TIME=$(range_to_rfc3339 "1h") || exit 1 + END_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ) || exit 1 + if [[ -z "$START_TIME" || -z "$END_TIME" ]]; then + echo "Error: Failed to compute default time range." >&2 + exit 1 + fi +fi + +# URL-encode a path segment +uriencode() { + jq -rn --arg x "$1" '$x|@uri' +} + +DATASET_ENC=$(uriencode "$DATASET") +START_ENC=$(uriencode "$START_TIME") +END_ENC=$(uriencode "$END_TIME") +BASE="/v1/query/metrics/info/datasets/${DATASET_ENC}" +QS="start=${START_ENC}&end=${END_ENC}" + +case "$COMMAND" in + metrics) + "$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" GET "${BASE}/metrics?${QS}" | jq . + ;; + tags) + "$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" GET "${BASE}/tags?${QS}" | jq . + ;; + tag-values) + if [[ $# -lt 1 ]]; then + echo "Error: tag-values requires a argument." >&2 + exit 1 + fi + TAG_ENC=$(uriencode "$1") + "$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" GET "${BASE}/tags/${TAG_ENC}/values?${QS}" | jq . + ;; + metric-tags) + if [[ $# -lt 1 ]]; then + echo "Error: metric-tags requires a argument." 
>&2 + exit 1 + fi + METRIC_ENC=$(uriencode "$1") + "$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" GET "${BASE}/metrics/${METRIC_ENC}/tags?${QS}" | jq . + ;; + metric-tag-values) + if [[ $# -lt 2 ]]; then + echo "Error: metric-tag-values requires and arguments." >&2 + exit 1 + fi + METRIC_ENC=$(uriencode "$1") + TAG_ENC=$(uriencode "$2") + "$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" GET "${BASE}/metrics/${METRIC_ENC}/tags/${TAG_ENC}/values?${QS}" | jq . + ;; + search) + if [[ $# -lt 1 ]]; then + echo "Error: search requires a argument." >&2 + exit 1 + fi + BODY=$(jq -nc --arg v "$1" '{"value": $v}') + "$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" POST "${BASE}/metrics?${QS}" "$BODY" | jq . + ;; + *) + echo "Error: Unknown command '$COMMAND'. Use: metrics, tags, tag-values, metric-tags, metric-tag-values, search." >&2 + exit 1 + ;; +esac diff --git a/skill/scripts/axiom-metrics-query b/skill/scripts/axiom-metrics-query new file mode 100755 index 0000000..1318575 --- /dev/null +++ b/skill/scripts/axiom-metrics-query @@ -0,0 +1,159 @@ +#!/usr/bin/env bash +# Axiom MetricsDB MPL query helper - reads query from stdin +# +# Usage: axiom-metrics-query [options] <<< "mpl query" +# +# Options: +# --start Start time (RFC3339, e.g. 2025-01-01T00:00:00Z) +# --end End time (RFC3339, e.g. 2025-01-02T00:00:00Z) +# --range Convenience range from now (e.g. 1h, 24h, 7d) +# --trace Print x-axiom-trace-id on success +# --spec Fetch MPL language specification (no query needed) +# +# Time: Either (--start + --end) or --range is required (not both). +# MPL does NOT support relative time expressions — RFC3339 only. 
+#
+# Examples:
+#   axiom-metrics-query prod --range 1h <<< "dataset:metric.name | align to 5m using avg"
+#   axiom-metrics-query prod --start 2025-01-01T00:00:00Z --end 2025-01-02T00:00:00Z <<< "dataset:cpu.usage"
+#   axiom-metrics-query prod --spec
+
+set -euo pipefail
+
+if [[ $# -lt 1 ]]; then
+  echo "Usage: axiom-metrics-query <deployment> [options] <<< 'mpl query'" >&2
+  exit 1
+fi
+
+DEPLOYMENT="$1"
+shift
+
+PRINT_TRACE=false
+FETCH_SPEC=false
+START_TIME="${START_TIME:-}"
+END_TIME="${END_TIME:-}"
+RANGE="${RANGE:-}"
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --start)
+      START_TIME="$2"
+      shift 2
+      ;;
+    --end)
+      END_TIME="$2"
+      shift 2
+      ;;
+    --range)
+      RANGE="$2"
+      shift 2
+      ;;
+    --trace)
+      PRINT_TRACE=true
+      shift
+      ;;
+    --spec)
+      FETCH_SPEC=true
+      shift
+      ;;
+    *)
+      echo "Error: Unknown argument '$1'. Queries must be passed via stdin." >&2
+      exit 1
+      ;;
+  esac
+done
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+# Load config from unified config file
+# shellcheck disable=SC1090
+eval "$("$SCRIPT_DIR/config" axiom "$DEPLOYMENT")"
+
+RESP_HEADERS=$(mktemp)
+RESP_BODY=$(mktemp)
+cleanup() {
+  rm -f "$RESP_HEADERS" "$RESP_BODY"
+}
+trap cleanup EXIT
+
+# --spec: fetch MPL language specification via OPTIONS and exit
+if [[ "$FETCH_SPEC" == true ]]; then
+  HTTP_CODE=$(curl -sS -o "$RESP_BODY" -D "$RESP_HEADERS" -w "%{http_code}" \
+    -X OPTIONS "$AXIOM_URL/v1/query/_metrics" \
+    -H "Authorization: Bearer $AXIOM_TOKEN" \
+    -H "X-Axiom-Org-Id: $AXIOM_ORG_ID")
+
+  if [[ "$HTTP_CODE" -lt 200 || "$HTTP_CODE" -ge 300 ]]; then
+    msg=$(jq -r '.message // empty' "$RESP_BODY" 2>/dev/null)
+    trace=$(grep -i '^x-axiom-trace-id:' "$RESP_HEADERS" | tail -1 | awk '{print $2}' | tr -d '\r')
+    echo "error: ${msg:-http $HTTP_CODE}" >&2
+    if [[ -n "$trace" ]]; then
+      echo "trace_id: $trace" >&2
+    fi
+    exit 1
+  fi
+
+  cat "$RESP_BODY"
+  exit 0
+fi
+
+# Require query from stdin
+if [[ -t 0 ]]; then
+  echo "Error: No query provided. Pipe a query to stdin."
>&2 + echo "" >&2 + echo "Examples:" >&2 + echo " axiom-metrics-query $DEPLOYMENT --range 1h <<< \"dataset:metric.name | align to 5m using avg\"" >&2 + exit 1 +fi + +# shellcheck disable=SC1091 +source "$SCRIPT_DIR/lib-time" + +# Validate time arguments +if [[ -n "$RANGE" && ( -n "$START_TIME" || -n "$END_TIME" ) ]]; then + echo "Error: --range cannot be combined with --start/--end." >&2 + exit 1 +fi + +if [[ -n "$RANGE" ]]; then + START_TIME=$(range_to_rfc3339 "$RANGE") || exit 1 + END_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ) || exit 1 + if [[ -z "$START_TIME" || -z "$END_TIME" ]]; then + echo "Error: Failed to compute time range from '$RANGE'." >&2 + exit 1 + fi +elif [[ -z "$START_TIME" || -z "$END_TIME" ]]; then + echo "Error: Either (--start + --end) or --range is required." >&2 + exit 1 +fi + +APL=$(cat) +APL_JSON=$(printf '%s' "$APL" | jq -Rs .) +START_JSON=$(printf '%s' "$START_TIME" | jq -Rs .) +END_JSON=$(printf '%s' "$END_TIME" | jq -Rs .) + +HTTP_CODE=$(curl -sS -o "$RESP_BODY" -D "$RESP_HEADERS" -w "%{http_code}" \ + -X POST "$AXIOM_URL/v1/query/_metrics?format=metrics-v1" \ + -H "Authorization: Bearer $AXIOM_TOKEN" \ + -H "X-Axiom-Org-Id: $AXIOM_ORG_ID" \ + -H "Content-Type: application/json" \ + -d "{\"apl\": $APL_JSON, \"startTime\": $START_JSON, \"endTime\": $END_JSON}") + +if [[ "$HTTP_CODE" -lt 200 || "$HTTP_CODE" -ge 300 ]]; then + msg=$(jq -r '.message // empty' "$RESP_BODY" 2>/dev/null) + trace=$(grep -i '^x-axiom-trace-id:' "$RESP_HEADERS" | tail -1 | awk '{print $2}' | tr -d '\r') + echo "error: ${msg:-http $HTTP_CODE}" >&2 + if [[ -n "$trace" ]]; then + echo "trace_id: $trace" >&2 + fi + exit 1 +fi + +if [[ "$PRINT_TRACE" == true ]]; then + trace=$(grep -i '^x-axiom-trace-id:' "$RESP_HEADERS" | tail -1 | awk '{print $2}' | tr -d '\r') + if [[ -n "$trace" ]]; then + echo "trace_id: $trace" >&2 + fi +fi + +cat "$RESP_BODY" diff --git a/skill/scripts/discover-axiom b/skill/scripts/discover-axiom index b92f060..f324b22 100755 --- 
a/skill/scripts/discover-axiom +++ b/skill/scripts/discover-axiom @@ -30,6 +30,48 @@ echo -e "${BLUE}=== Axiom Deployments ===${NC}" TMP_DIR=$(mktemp -d) trap 'rm -rf "$TMP_DIR"' EXIT +# Cache config +CACHE_DIR="${GILFOYLE_CONFIG_DIR:-$HOME/.config/gilfoyle}/cache/axiom" +CACHE_TTL=600 # 10 minutes +mkdir -p "$CACHE_DIR" + +# Get file mtime as epoch seconds (Linux first, then macOS) +# GNU stat -f means --file-system, not format — must try GNU form first +file_mtime() { + local f="$1" + stat -c %Y "$f" 2>/dev/null || stat -f %m "$f" 2>/dev/null +} + +# Fetch /v1/datasets with per-deployment caching +get_catalog() { + local dep="$1" + local cache_file="$CACHE_DIR/$dep/datasets.json" + + if [[ "${GILFOYLE_NO_CACHE:-}" != "1" && -f "$cache_file" ]]; then + local now mtime age + now=$(date +%s) + mtime=$(file_mtime "$cache_file") + age=$(( now - mtime )) + if [[ "$age" -lt "$CACHE_TTL" ]]; then + cat "$cache_file" + return + fi + fi + + local data + data=$("$SCRIPT_DIR/axiom-api" "$dep" GET "/v1/datasets" 2>/dev/null || echo "") + + if [[ -n "$data" ]]; then + mkdir -p "$CACHE_DIR/$dep" + local tmp_file="$cache_file.tmp.$$" + printf '%s' "$data" > "$tmp_file" + chmod 600 "$tmp_file" + mv "$tmp_file" "$cache_file" + fi + + printf '%s' "$data" +} + # Helper for millisecond timestamp using Bash built-in current_time_ms() { # EPOCHREALTIME is available in Bash 5.0+ @@ -60,11 +102,47 @@ discover_dep() { if [[ -n "$POPULAR_DATASETS" ]]; then count=$(echo "$POPULAR_DATASETS" | grep -c .) 
- echo -e " ${GREEN}Top datasets found ($count)${NC} (${DURATION_QUERY}ms)" - echo "$POPULAR_DATASETS" | sed 's/^/ - /' + + # Fetch dataset catalog to identify MetricsDB datasets + catalog=$(get_catalog "$dep") + metrics_set=$(echo "$catalog" | jq -r '.[] | select(.kind == "otel:metrics:v1") | .name' 2>/dev/null || echo "") + + END_CATALOG=$(current_time_ms) + DURATION_CATALOG=$(( END_CATALOG - END_QUERY )) + + echo -e " ${GREEN}Top datasets found ($count)${NC} (query: ${DURATION_QUERY}ms, catalog: ${DURATION_CATALOG}ms)" + + # Tag popular datasets: [MPL] for MetricsDB, plain for EventDB + while IFS= read -r ds; do + if echo "$metrics_set" | grep -qxF "$ds"; then + echo " - [MPL] $ds" + else + echo " - $ds" + fi + done <<< "$POPULAR_DATASETS" + + # Surface MetricsDB datasets not in the popular list + if [[ -n "$metrics_set" ]]; then + unlisted="" + while IFS= read -r mds; do + if ! echo "$POPULAR_DATASETS" | grep -qxF "$mds"; then + unlisted="${unlisted:+$unlisted +}$mds" + fi + done <<< "$metrics_set" + + metrics_total=$(echo "$metrics_set" | grep -c .) + if [[ -n "$unlisted" ]]; then + unlisted_count=$(echo "$unlisted" | grep -c .) 
+ echo -e " ${GREEN}MetricsDB datasets ($metrics_total total, $unlisted_count not in top):${NC}" + echo "$unlisted" | sort | head -n 10 | sed 's/^/ - [MPL] /' || true + else + echo -e " ${GREEN}MetricsDB datasets ($metrics_total total, all in top list)${NC}" + fi + fi else # Strategy 2: Fallback - response=$("$SCRIPT_DIR/axiom-api" "$dep" GET "/v1/datasets" 2>/dev/null || echo "") + response=$(get_catalog "$dep") END_FALLBACK=$(current_time_ms) DURATION_FALLBACK=$(( END_FALLBACK - END_QUERY )) @@ -72,11 +150,27 @@ discover_dep() { if [[ "$count" -gt 0 ]]; then echo -e " ${GREEN}$count datasets found${NC} (query: ${DURATION_QUERY}ms, fallback: ${DURATION_FALLBACK}ms)" - echo "$response" | jq -r '.[] | " - " + .name' | sort | head -n 10 + + # Identify MetricsDB datasets (otel-metrics-v1) + metrics_datasets=$(echo "$response" | jq -r '.[] | select(.kind == "otel:metrics:v1") | .name' 2>/dev/null || echo "") + + # Tag MetricsDB datasets inline, consistent with Strategy 1 + echo "$response" | jq -r '.[] | .name' | sort | while IFS= read -r ds; do + if [[ -n "$metrics_datasets" ]] && echo "$metrics_datasets" | grep -qxF "$ds"; then + echo " - [MPL] $ds" + else + echo " - $ds" + fi + done | head -n 10 || true if [[ "$count" -gt 10 ]]; then echo " - ... (and $((count - 10)) more)" echo -e " ${BOLD}To search:${NC} scripts/axiom-api $dep GET \"/v1/datasets\" | jq -r '.[].name' | grep \"pattern\"" fi + + if [[ -n "$metrics_datasets" ]]; then + metrics_count=$(echo "$metrics_datasets" | grep -c .) 
+ echo -e " ${GREEN}MetricsDB datasets ($metrics_count total)${NC}" + fi else echo -e " ${RED}No datasets found or auth failed${NC} (total: $((DURATION_QUERY + DURATION_FALLBACK))ms)" fi diff --git a/skill/scripts/lib-time b/skill/scripts/lib-time new file mode 100755 index 0000000..3d2001a --- /dev/null +++ b/skill/scripts/lib-time @@ -0,0 +1,48 @@ +#!/usr/bin/env bash +# Shared time utilities for Gilfoyle scripts +# Source this file: source "$SCRIPT_DIR/lib-time" + +# range_to_rfc3339 converts a human range (e.g. 1h, 24h, 7d) to an RFC3339 timestamp that many seconds ago +range_to_rfc3339() { + local range="$1" + local value="${range%[smhd]}" + local suffix="${range: -1}" + + if ! [[ "$value" =~ ^[0-9]+$ ]]; then + echo "Error: Invalid range value '$range'. Expected number + suffix (s/m/h/d)." >&2 + return 1 + fi + + local label + case "$suffix" in + s) label="second" ;; + h) label="hour" ;; + d) label="day" ;; + m) label="minute" ;; + *) + echo "Error: Invalid range suffix '$suffix'. Use s (seconds), m (minutes), h (hours), or d (days)." >&2 + return 1 + ;; + esac + + # Pluralize for values other than 1 + if [[ "$value" -ne 1 ]]; then + label="${label}s" + fi + + # Try GNU date first (linux, or gdate on macOS), then fall back to macOS date + if date -u -d "1 hour ago" +%Y-%m-%dT%H:%M:%SZ &>/dev/null; then + # GNU date + date -u -d "$value $label ago" +%Y-%m-%dT%H:%M:%SZ + else + # macOS date: -v flag with uppercase suffix + local date_flag + case "$suffix" in + s) date_flag="-v-${value}S" ;; + h) date_flag="-v-${value}H" ;; + d) date_flag="-v-${value}d" ;; + m) date_flag="-v-${value}M" ;; + esac + date -u "$date_flag" +%Y-%m-%dT%H:%M:%SZ + fi +}
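The caller pattern for `lib-time` is the same in both query scripts: convert a `--range` value to an RFC3339 start time, pair it with "now" as the end time. A minimal sketch of that pattern, restating only the GNU-date branch of `range_to_rfc3339` for illustration (the real helper above also handles macOS `date -v`):

```shell
# Illustrative re-statement of the GNU-date branch; assumes GNU date.
range_to_rfc3339() {
  local value="${1%[smhd]}" suffix="${1: -1}" label
  [[ "$value" =~ ^[0-9]+$ ]] || return 1
  case "$suffix" in
    s) label=seconds ;; m) label=minutes ;; h) label=hours ;; d) label=days ;;
    *) return 1 ;;
  esac
  date -u -d "$value $label ago" +%Y-%m-%dT%H:%M:%SZ
}

# Caller pattern used by axiom-metrics-query and axiom-metrics-discover:
START_TIME=$(range_to_rfc3339 24h) || exit 1
END_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo "window: $START_TIME .. $END_TIME"
```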