Skip to content

feat: add attestation aggregate coverage metrics#386

Merged
MegaRedHand merged 10 commits into
mainfrom
add-attestation-aggregate-coverage-metrics
May 27, 2026
Merged

feat: add attestation aggregate coverage metrics#386
MegaRedHand merged 10 commits into
mainfrom
add-attestation-aggregate-coverage-metrics

Conversation

@pablodeymo
Copy link
Copy Markdown
Collaborator

@pablodeymo pablodeymo commented May 20, 2026

Description / Motivation

Ports two upstream changes that together add attestation aggregate coverage observability:

After this PR, all 14 coverage series receive real per-slot updates from chain activity.

What Changed

Registration (crates/blockchain/src/metrics.rs)

  • Two pub const &[&str] label-set constants — single source of truth for sections and directions.
  • Three IntGaugeVec statics:
    • lean_attestation_aggregate_coverage_validators{section, subnet}subnet="combined" is the section total; subnet="subnet_N" is per-subnet coverage (lazy).
    • lean_attestation_aggregate_coverage_subnets{section} — covered-subnet count per section.
    • lean_attestation_aggregate_coverage_diff_validators{direction} — symmetric difference between block-included and locally-aggregated pre-merge aggregates.
  • init() forces the statics and seeds 14 default zero-valued series.

Sections: timely, late, block, combined, agg_start_new, proposal_combined.
Directions: block_only, timely_only.

Emission (4 sites)

Every per-slot section is keyed by the attestation data.slot (the voting round being reported), so timely/late/block/combined describe the same cohort (matching zeam #876).

  • accept_new_attestations (blockchain/store.rs) — before each promote, snapshots new_payloads participant bits, tagged per entry with their data.slot, stashed on the Store. Read at the next slot boundary to populate the timely section ("prev_new" in zeam).
  • on_tick interval 1 (blockchain/lib.rs) — emits the post-block report for slot - 1. Fired at interval 1, not 0, so the block carrying that round's votes (proposed at interval 0 of this slot) has typically been received and processed, letting the block section observe the same round. Computes timely (pre-merge snapshot filtered to the round), late (current new_payloads for the round), block (attestations in the canonical head block filtered to the round — canonical by construction, so a fork block can't poison it), and combined (their union); then emits diff_validators as the symmetric difference between block and timely. Emits nothing when no section has coverage for the round, so a missed slot doesn't surface as a misleading block_only=0, timely_only=N.
  • start_aggregation_session (blockchain/lib.rs) — emits agg_start_new from current new_payloads right before fork-choice aggregation at interval 2.
  • propose_block (blockchain/lib.rs) — emits proposal_combined (the full set of validators included across the block's aggregated attestations) after the block is built. (A payload-vs-gossip split was considered but dropped: ethlambda builds blocks exclusively from known_aggregated_payloads, so every included validator is payload-sourced by construction — proposal_gossip would always be 0 and proposal_payloads would always equal proposal_combined.)

Supporting changes

  • The three emitters live as private free fns at the bottom of blockchain/src/lib.rs. They use Vec<bool> locals (seen for validators, has_subnet for subnets) plus three small helpers (cov_add, cov_record, or_into). Subnet derivation is vid % committee_count, matching the gossip subnet assignment in crates/net/p2p/src/lib.rs.
  • Store gets a single CoverageSnapshot field (pre_merge_new_coverage) holding (data.slot, AggregationBits) entries captured before each promote — only participant bits, no proof duplication. The block section needs no stored state: it is read from the canonical head block at report time.
  • BlockChain::spawn now takes attestation_committee_count (resolved in main.rs from CLI > validator-config.yaml > 1; previously only known to P2P for subnet subscriptions).

Operator interpretation of diff_validators

The aggregation pipeline produces two pools for the same slot: timely (locally aggregated pre-merge) and block (what the proposer included). The diff counts validators in the symmetric difference:

  • block_only persistently high → this node was slow to receive/aggregate via gossip; proposer had a better view.
  • timely_only persistently high → proposer omitted attestations the network had time to gossip.
  • Both near zero → local aggregation tracks proposers; the network is converging.

Correctness / Behavior Guarantees

  • No fork-choice or state-transition change. Coverage snapshots are pure observability and do not feed back into block processing.
  • The diff_validators help text intentionally diverges from upstream's terse phrasing to spell out the symmetric-difference semantics. Metric name, labels, and values are unchanged; dashboards built against any client's schema work as-is.
  • Cardinality: 14 series at registration; per-subnet series appear lazily when written. Even with full per-subnet instrumentation at 64 subnets, the validators gauge tops out at 6 × 65 = 390 series.

Tests Added / Run

No new unit tests. The earlier draft of this PR included unit tests for an internal Coverage bitset wrapper; once the wrapper was inlined as Vec<bool> locals, those tests were exercising trivial iterator ops rather than the emission contract, so they were dropped.

Commands run:

  • cargo fmt --all -- --check — clean
  • make lint (clippy -D warnings) — clean
  • cargo test -p ethlambda-blockchain --release --lib --bins — 21 passed
  • cargo test -p ethlambda-storage --release — 38 passed
  • cargo test -p ethlambda-blockchain --release --test forkchoice_spectests — 62 passed, 8 failed (all AttestationTooFarInFuture, pre-existing fixture flakes on main, unrelated to this PR).

Related Issues / PRs

✅ Verification Checklist

  • Ran make fmt — clean
  • Ran make lint (clippy with -D warnings) — clean
  • Ran cargo test --workspace --release — passing modulo 8 pre-existing forkchoice fixture flakes documented above (unrelated to this PR)

@github-actions
Copy link
Copy Markdown

🤖 Kimi Code Review

This PR adds attestation aggregate coverage metrics following the existing patterns in the codebase. The implementation is correct and idiomatic.

Assessment:

  1. Code correctness: The use of std::sync::LazyLock for static initialization and the force() pattern in init() is consistent with the existing metrics in the file. Label value construction (&[section, "combined"]) is correct.

  2. Cardinality management: Item 1 (lines 469-488) correctly seeds only the combined subnet series at startup, deferring per-subnet series to lazy creation as noted in the comment. This prevents unbounded cardinality explosions in Prometheus.

  3. Error handling: The .unwrap() calls on register_int_gauge_vec! (lines 131, 141, 152) match the existing codebase's fail-fast approach for metrics registration failures. This is acceptable for initialization-time panics.

  4. Documentation: Metric help strings (lines 129-130, 143-144, 165-168) clearly describe the label semantics and update frequency, including the important distinction between subnet=combined and subnet=subnet_N.

Minor observation:

  • The ATTESTATION_AGGREGATE_COVERAGE_SECTIONS slice (lines 10-20) contains 8 sections. Ensure the instrumentation code that actually writes these metrics uses the same string literals to avoid label mismatch errors at runtime.

No security vulnerabilities or consensus risks identified—this is pure instrumentation with no state transition logic.

Verdict: LGTM. Clean, well-documented addition that follows established patterns.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions
Copy link
Copy Markdown

🤖 Codex Code Review

No findings.

This PR is limited to new Prometheus label constants and gauge registration in metrics.rs and startup seeding in metrics.rs. I don’t see any impact on fork choice, attestation validation, justification/finalization, state transition, XMSS handling, or SSZ behavior. The added code is straightforward, uses the existing LazyLock/Prometheus patterns consistently, and doesn’t introduce new memory-safety or error-handling concerns beyond the existing .unwrap() registration style already used in this module.

One residual note: this patch only registers/seeds the new metrics; there are no writer APIs or call sites in this PR, so the metrics will stay at their seeded values until follow-up instrumentation lands. I also couldn’t complete cargo check in this environment because Cargo/Rustup cannot write to /home/runner/.cargo and /home/runner/.rustup, so the review is based on static inspection rather than a successful local build.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@github-actions
Copy link
Copy Markdown

🤖 Claude Code Review

Here is the full code review:


PR #386feat: add attestation aggregate coverage metrics

Overview

Ports leanSpec PR #735: registers three IntGaugeVec Prometheus metrics for attestation aggregate coverage, pre-seeding 18 zero-valued series in init() so dashboards render from node startup. Registration-only — no instrumentation/setter functions yet (coming in a follow-up). Single file changed: crates/blockchain/src/metrics.rs.


Code Quality

Correct and idiomatic. The three statics follow the established LazyLock + register_*_vec! + .unwrap() pattern exactly. IntGaugeVec is the right choice — integer counts, consistent with all other labeled gauges (LEAN_NODE_INFO, LEAN_TABLE_BYTES, LEAN_NODE_SYNC_STATUS). Labels are compact and match the upstream schema, so cross-client Grafana dashboards remain compatible.


Specific Observations

1. Redundant force() calls before seeding loops (lines 472–474)

std::sync::LazyLock::force(&LEAN_ATTESTATION_AGGREGATE_COVERAGE_VALIDATORS);
std::sync::LazyLock::force(&LEAN_ATTESTATION_AGGREGATE_COVERAGE_SUBNETS);
std::sync::LazyLock::force(&LEAN_ATTESTATION_AGGREGATE_COVERAGE_DIFF_VALIDATORS);
for &section in ATTESTATION_AGGREGATE_COVERAGE_SECTIONS {
    LEAN_ATTESTATION_AGGREGATE_COVERAGE_VALIDATORS
        .with_label_values(&[section, "combined"])
        .set(0);
    ...
}

Calling .with_label_values(...).set(0) forces the LazyLock internally, so the preceding explicit force() calls are redundant. They're harmless and consistent with the existing style (every static gets an explicit force() in init()), so this is fine to keep — just worth knowing it doesn't add behavior.

2. pub const label sets vs. SyncStatus::ALL inconsistency

SyncStatus::ALL is a private const on the enum and is only used internally in set_node_sync_status(). The new ATTESTATION_AGGREGATE_COVERAGE_SECTIONS and ATTESTATION_AGGREGATE_COVERAGE_DIFF_DIRECTIONS are pub const at module scope. The pub is justified because setter functions in the follow-up PR may need these from call sites outside metrics.rs (e.g., store.rs) — so this is forward-looking but reasonable. Consider keeping them pub(crate) if they won't be exported from the crate's public API, to limit exposure.

3. LEAN_NODE_SYNC_STATUS is not seeded in init() — pre-existing inconsistency

The existing LEAN_NODE_SYNC_STATUS is force()d in init() but its label values are not pre-seeded with zero; they appear only after the first set_node_sync_status() call. The new coverage metrics are seeded up front, which is the correct behavior per the PR's stated goal ("appear at /metrics from node startup"). No action needed in this PR, but the existing LEAN_NODE_SYNC_STATUS inconsistency could be addressed separately.

4. The three force() calls for all three statics happen before either seeding loop

LEAN_ATTESTATION_AGGREGATE_COVERAGE_DIFF_VALIDATORS is forced on line 474, but its seeding loop doesn't start until line 483. No bug, but co-locating each force() with its respective seeding block would make the grouping clearer. Again, a trivial style point given the existing init() structure.


Documentation

Excellent. The doc comments are precise and operationally useful:

  • The distinction between subnet=combined (section total) and subnet=subnet_N (per-subnet) is clearly stated.
  • The diff_validators help text spelling out symmetric-difference semantics is better than the upstream terse version and doesn't break dashboard compatibility.
  • The init() comment correctly explains the lazy-seeding strategy.

Correctness / Safety

  • No behavior change: confirmed. State transition, fork choice, attestation processing — all untouched.
  • Cardinality: 18 seeded series now; 520 max at 64 subnets — well within Prometheus limits.
  • No panics introduced: the register_*!(...).unwrap() calls are consistent with all existing statics and fail only at program startup (acceptable for misconfigured registries).
  • No security concerns.

Verdict

Approved. The implementation is correct, follows project conventions faithfully, and is appropriately scoped. The two actionable points worth considering before merge:

  • Consider pub(crate) instead of pub for the label-set constants if they won't be part of the crate's public API.
  • Track the LEAN_NODE_SYNC_STATUS zero-seeding gap as a follow-up (not a blocker for this PR).

Automated review by Claude (Anthropic) · sonnet · custom prompt

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 20, 2026

Greptile Summary

Registers three new IntGaugeVec Prometheus metrics for attestation aggregate coverage, porting leanSpec#735. The change is registration-only — no code path yet produces or consumes the new gauges.

  • Three new statics follow the exact same LazyLock<IntGaugeVec> pattern used by every other labeled gauge in the file.
  • init() forces all three and seeds 18 zero-valued series, keeping per-subnet series lazy for the follow-up instrumentation PR.
  • Two pub const &[&str] label-set constants are exported so the follow-up setter functions can reference them without hard-coding strings.

Confidence Score: 5/5

This PR is safe to merge — it is purely additive metric registration with no changes to any logic, state, or existing code paths.

The change adds three new IntGaugeVec statics and seeds 18 default zero series in init(). No existing code is modified, no consumers of the new metrics exist yet, and the implementation closely mirrors the patterns already established in the file.

No files require special attention. The single changed file is straightforward metric registration.

Important Files Changed

Filename Overview
crates/blockchain/src/metrics.rs Adds three IntGaugeVec statics and two pub label-set constants; seeds 18 default zero series in init(). Follows all existing patterns; no logic, no consumers yet.

Reviews (1): Last reviewed commit: "feat: add attestation aggregate coverage..." | Re-trigger Greptile

Copy link
Copy Markdown
Collaborator

@MegaRedHand MegaRedHand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need to emit the metrics

@pablodeymo
Copy link
Copy Markdown
Collaborator Author

Instrumentation added in 855f56d. Five emission sites now write to the 18 series registered in e9a04f7:

  • timely / late / block / combined / diff_validators — emitted at the slot boundary (on_tick interval 0) for slot - 1, sourcing from a pre-merge snapshot of new_payloads, current late arrivals, and the last-imported block's aggregation bits.
  • agg_start_new — emitted right before fork-choice aggregation runs (interval 2).
  • proposal_payloads / proposal_gossip / proposal_combined — emitted from propose_block after produce_block_with_signatures, classifying validators in the proposed block as known-payload-covered vs. gossip-only.

Coverage lives in a new crates/blockchain/src/coverage.rs (~280 lines, 6 unit tests covering add_bits, merge_from, diff_counts, plus empty/zero/out-of-range edge cases). Subnet derivation uses vid % committee_count to match the gossip subnet assignment in crates/net/p2p/src/lib.rs:241.

Store gets a small CoverageSnapshot type and two Arc<Mutex<Option<…>>> fields — no proofs duplicated, only AggregationBits. The pre-merge capture happens inside accept_new_attestations just before promote_new_aggregated_payloads, so consumer-side timing concerns stay in the existing tick path.

BlockChain::spawn gains an attestation_committee_count: u64 parameter — only bin/ethlambda/src/main.rs calls it, and the value was already resolved there (CLI > validator-config.yaml > 1).

Ready for another look.

@pablodeymo pablodeymo requested a review from MegaRedHand May 22, 2026 20:57
@pablodeymo
Copy link
Copy Markdown
Collaborator Author

@MegaRedHand please re-review the PR

  Port leanSpec PR #735: register three IntGaugeVec metrics describing
  attestation aggregate coverage, with default zero-valued series so
  dashboards render from a fresh node startup.

    - lean_attestation_aggregate_coverage_validators (section, subnet)
    - lean_attestation_aggregate_coverage_subnets    (section)
    - lean_attestation_aggregate_coverage_diff_validators (direction)

  ATTESTATION_AGGREGATE_COVERAGE_SECTIONS and
  ATTESTATION_AGGREGATE_COVERAGE_DIFF_DIRECTIONS are exported as the
  single source of truth for label sets. init() forces the new statics
  and seeds combined-subnet, section, and direction series to 0 (18
  default series total). Per-subnet (subnet="subnet_N") series appear
  lazily when instrumentation writes them.

  Registration only. The producer side (per-slot coverage computation,
  de-duplication across payloads, and the chain-status log line) ports
  zeam #876 and lands in a follow-up PR.

  The diff_validators help text diverges from upstreams terse phrasing
  to spell out the symmetric-difference semantics (block_only: in block
  but not in local timely pool; timely_only: the reverse). Metric name,
  labels, and values are unchanged.
@pablodeymo pablodeymo force-pushed the add-attestation-aggregate-coverage-metrics branch from b879490 to 6622da5 Compare May 26, 2026 16:26
  Port the producer side of zeam #876 on top of the metrics registered in
  the previous commit. After this commit, all 18 coverage series receive
  real per-slot updates from chain activity.

  Five emission sites:

    - accept_new_attestations (store.rs): captures `new_payloads`
      participant bits BEFORE promote and stashes them as a
      CoverageSnapshot on the Store. Read at the next slot boundary to
      populate the `timely` section ("prev_new" in zeam).
    - on_block_core (store.rs): mirrors the imported block s per-AttData
      aggregation bits into Store::last_block_coverage. Observability-only;
      fork choice is unchanged.
    - on_tick interval 0 (lib.rs): emits the post-block-merge report for
      `slot - 1`. Computes `timely`/`late`/`block`/`combined` from the
      stashed snapshots and the current `new_payloads`, then emits the
      diff_validators direction counts as the symmetric difference between
      `block` and `timely`.
    - start_aggregation_session (lib.rs): emits `agg_start_new` from the
      current `new_payloads` right before fork-choice aggregation runs at
      interval 2.
    - propose_block (lib.rs): emits `proposal_payloads`,
      `proposal_gossip`, and `proposal_combined` after the block is built.
      Each validator set in the block is classified by whether the
      AttestationData has a matching known-payload proof.

  New module crates/blockchain/src/coverage.rs holds the Coverage type
  (seen + has_subnet bitsets, derived subnet via vid % committee_count to
  match the gossip subnet assignment) plus the 3 emission helpers and 6
  unit tests covering add_bits, merge_from, diff_counts, empty/zero/out-
  of-range edge cases.

  Storage gets a CoverageSnapshot type and two Arc<Mutex<Option<…>>>
  fields on Store. No proofs are duplicated — only AggregationBits are
  captured, keeping the per-slot allocation in the tens of bytes per
  entry. The pre-merge capture happens inside accept_new_attestations
  just before promote_new_aggregated_payloads, so consumer-side timing
  concerns stay in the existing tick path.

  BlockChain::spawn now takes attestation_committee_count as a
  parameter; bin/ethlambda/src/main.rs already resolves the value
  (CLI > validator-config.yaml > 1) and passes it through. The number
  of attestation committees was previously only known to P2P (for
  subnet subscriptions); the coverage emitters need it to derive
  subnet ids.
  Delete crates/blockchain/src/coverage.rs and move the three coverage
  emitters (post_block, agg_start_new, proposal) into lib.rs as private
  free functions. The Coverage struct is replaced with Vec<bool> locals
  (seen for validators, has_subnet for subnets), plus three small
  helpers (cov_add, cov_record, or_into).

  Each emit_* function was called from exactly one site, so the previous
  abstraction added no DRY benefit. The 6 unit tests in coverage.rs were
  exercising the wrappers bitset ops, not the emission contract, so they
  go away with the wrapper. Behavior, metric names, labels, and values
  are unchanged.

  Net effect on the PR: roughly -200 lines.
…sip coverage

  emit_proposal_coverage classified block-included validators as payload- vs
  gossip-sourced by checking known_aggregated_payloads. But ethlambda builds
  blocks exclusively from that same pool (build_block -> extend_proofs_greedily),
  so every included validator is by construction payload-seen: proposal_gossip
  was always 0 and proposal_payloads always equalled proposal_combined.

  Keep only proposal_combined (the set of validators in the proposed block).
  This also removes the per-attestation hash_tree_root lookup on the block-
  building path and the cross-AttestationData payload_seen contamination the
  old classification carried.

  Default-seeded series drop from 18 to 14.
Comment thread crates/storage/src/store.rs Outdated
…gate-coverage-metrics

# Conflicts:
#	crates/blockchain/src/store.rs
  Key every coverage section by the attestation data.slot (voting round)
  and fire the per-slot report at interval 1, matching zeam #876:

  - timely: CoverageSnapshot now stores (data.slot, bits) per entry and the
    report filters by data.slot, removing the non-deterministic slot picked
    from HashMap iteration order.
  - block: read from the canonical head block at report time (filtered to
    data.slot) instead of an unconditionally-overwritten per-block snapshot,
    so fork blocks cant poison it. Removes last_block_coverage from Store.
  - trigger: emit at interval 1 (not 0) so the block carrying the reporting
    slots votes has been received and processed.
  - has_any gate: emit nothing when no section has coverage for the round,
    instead of a misleading block_only=0 / timely_only=N reading.
  The storage Store carried observability-only state for the attestation
  aggregate coverage report (the CoverageSnapshot type and a
  pre_merge_new_coverage field), which has no fork-choice or
  state-transition role. Move it into the blockchain crate behind a new
  PreMergeCoverage handle owned by the BlockChainServer actor, threaded
  through on_tick / accept_new_attestations / get_proposal_head /
  produce_block_with_signatures as an Option so the snapshot is captured
  at promotion and read at the next slot boundary exactly as before.
  Tests and the RPC test-driver pass None. Pure refactor, no metric
  behavior change.
Comment thread crates/blockchain/src/lib.rs Outdated
Comment thread crates/blockchain/src/lib.rs Outdated
Comment thread crates/blockchain/src/store.rs Outdated
Comment thread crates/blockchain/src/store.rs Outdated
… module

Address review feedback: relocate all attestation aggregate coverage
observability into a new crates/blockchain/src/coverage.rs. The pre-merge
new_payloads snapshot is now captured by the actor in BlockChainServer::on_tick
(when proposing or at the end of the slot) instead of being threaded as a
parameter through store::on_tick / accept_new_attestations /
get_proposal_head / produce_block_with_signatures. Since the snapshot is owned
solely by the actor and only touched from its single-threaded message loop,
the Arc<Mutex> handle is replaced by a plain Option<CoverageSnapshot> field.
No behavior change - coverage emission is observability-only.
@pablodeymo
Copy link
Copy Markdown
Collaborator Author

@MegaRedHand addressed all the review comments in c751426 — one refactor that pulls the coverage observability out of the storage/STF path:

  • Moved all the coverage code into a new crates/blockchain/src/coverage.rs moduleCoverageSnapshot, the cov_*/or_into helpers, the three emit_* functions, and the snapshot fn (renamed snapshot_new_payloads). Nothing coverage-related lives at the bottom of lib.rs anymore.
  • Removed the coverage parameter from store::on_tick / accept_new_attestations / get_proposal_head / produce_block_with_signatures. The pre-merge snapshot is now taken in BlockChainServer::on_tick itself, gated on interval == 4 || (interval == 0 && proposing) — i.e. when proposing or at the end of the slot, exactly as you suggested. It runs right before store::on_tick, so the captured new_payloads set is identical to the old in-store capture.
  • Dropped the Arc<Mutex<…>>. Now that the snapshot is owned by the actor and only touched from the single-threaded message loop, it's a plain Option<CoverageSnapshot> field.

No behavior change — coverage emission stays observability-only. fmt/clippy -D warnings clean; blockchain/storage/rpc/signature tests green, forkchoice unchanged (same 8 pre-existing AttestationTooFarInFuture fixture flakes as main).

Copy link
Copy Markdown
Collaborator

@MegaRedHand MegaRedHand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of the attestation aggregate coverage metrics. All findings are in observability-only code (no fork-choice or state-transition impact), but two of them make the new metrics misleading under specific conditions. Inline comments below. Findings 1 and 2 are the ones worth addressing before merge; 3 is an efficiency cleanup and 5 is a schema/doc gap.

# Finding Severity
1 Proposer interval-0 snapshot overwrites the reported round's timely cohort Medium
2 Emit-gate doesn't suppress the missed-slot case its comment promises Medium-low
3 new_aggregated_payloads clones 1 MiB proof blobs to keep only bits (×2/slot) Medium (perf)
5 subnet=subnet_N series never written despite help text Low (doc)

Comment thread crates/blockchain/src/lib.rs Outdated
// the end of the slot (interval 4); skip empty snapshots so a missed
// round keeps the last set we saw. Pure observability.
let will_promote = interval == 4 || (interval == 0 && proposer_validator_id.is_some());
if will_promote && let Some(snapshot) = coverage::snapshot_new_payloads(&self.store) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finding 1 (Medium): proposer interval-0 snapshot can corrupt the timely cohort it's about to report.

Walking the timeline for a slot S this node proposes:

  1. Slot S-1 interval 4: snapshot_new_payloads captures round S-1 aggregates into pre_merge_coverage, then store::on_tick promotes (drains new_payloads).
  2. Between then and slot S interval 0, gossiped aggregates arrive — store.rs insert_new_aggregated_payload pushes into new_payloads at arbitrary network-driven times.
  3. Slot S interval 0: since we're proposer, will_promote fires and snapshot_new_payloads returns Some (non-empty due to those stragglers), overwriting the good S-1 snapshot with a partial set.
  4. Slot S interval 1: emit_post_block_coverage filters the overwritten snapshot to data.slot == S-1 and finds only the stragglers.

Result: timely under-reports and diff_validators{block_only} is inflated, but only on slots this node proposes, so the bias is intermittent and proposer-correlated. The interval-0 snapshot doesn't actually serve the round it's reporting; consider only refreshing pre_merge_coverage when the new snapshot covers the round not yet reported, or skip the interval-0 capture entirely.

Comment thread crates/blockchain/src/coverage.rs Outdated
// Emit nothing when there is no coverage for this round, rather than
// pushing a misleading all-zero report (e.g. `block_only=0, timely_only=N`
// on a missed slot). Gauges retain their previous value.
if !combined_v.iter().any(|&b| b) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finding 2 (Medium-low): this guard doesn't suppress the case the comment promises.

The comment above says it emits nothing to avoid "a misleading all-zero report (e.g. block_only=0, timely_only=N on a missed slot)". But the guard keys on combined, and combined = timely | late | block.

On a missed slot, store.get_block(store.head()) returns an older block whose att.data.slot != reporting_slot, so block_v is all-false — but the timely snapshot for S-1 is populated, so combined is non-empty and the report is emitted, pushing block_only=0, timely_only=N: exactly the case the comment claims to suppress.

To match the stated intent, gate on block_v (or require both block_v and timely_v non-empty) rather than combined_v.

Comment thread crates/storage/src/store.rs Outdated
let buf = self.new_payloads.lock().unwrap();
buf.data
.iter()
.map(|(root, entry)| (*root, (entry.data.clone(), entry.proofs.clone())))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finding 3 (Medium, perf): clones full proof blobs only to keep the participant bits.

entry.proofs.clone() deep-clones Vec<AggregatedSignatureProof>, and each proof carries proof_data: ByteList<1_048_576> (block.rs:88). The only consumers — snapshot_new_payloads and the late loop in emit_post_block_coverage — keep just proof.participants and drop the bytes. Worse, emit_post_block_coverage calls new_aggregated_payloads() a second time within the same tick, so a proposer-slot tick deep-clones the entire pending proof pool twice purely to read bitfields.

A participants-only accessor (e.g. returning Vec<(u64 /*data.slot*/, AggregationBits)>) eliminates both clones.

fn cov_record(section: &str, seen: &[bool], has_subnet: &[bool]) {
metrics::set_attestation_aggregate_coverage_validators(
section,
"combined",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finding 5 (Low, doc/schema): per-subnet subnet=subnet_N series are never written.

cov_record only ever calls set_attestation_aggregate_coverage_validators(section, "combined", ...). The has_subnet bitsets are computed but only feed the subnet-count gauge, never the per-subnet validator series. Yet the lean_attestation_aggregate_coverage_validators help text advertises "subnet=subnet_N is per-subnet coverage", and the gauge carries a subnet label dimension for it.

As written, querying subnet=subnet_N returns nothing — the schema overstates what's emitted. Either wire up per-subnet emission or trim the help text/label so dashboards don't expect a series that never appears.

Address MegaRedHand's review findings on the coverage observability:

- Drop the proposer interval-0 pre-merge snapshot. By interval 0 the round's
  votes are already promoted, so new_payloads holds only stragglers; capturing
  them overwrote the good interval-4 snapshot and made the timely section
  under-report on slots this node proposes. Snapshot only at interval 4;
  stragglers now surface in the late section as intended.
- Gate the post-block report on block_v (the round's votes appearing in the
  canonical head block) instead of combined_v. Gating on combined still fired
  on a missed slot, pushing the misleading block_only=0, timely_only=N the diff
  is meant to avoid.
- Add Store::new_aggregated_payload_participants() returning only the participant
  bitfields, replacing new_aggregated_payloads() which deep-cloned the
  multi-megabyte proof_data blobs on every read just to extract bits.
- Correct the validators-gauge help text: only subnet=combined is emitted; the
  per-subnet subnet_N breakdown is reserved and not yet populated.

No fork-choice or state-transition change - coverage stays observability-only.
@pablodeymo
Copy link
Copy Markdown
Collaborator Author

Thanks for the thorough pass @MegaRedHand — all four addressed in 6fcefec (rebased onto the latest main merge):

Finding 1 (proposer interval-0 snapshot corrupts timely). Spot on, and it was worse than "catch-up only" — it biased every slot this node proposes. Dropped the interval-0 capture entirely; the pre-merge snapshot is now taken only at the interval-4 promote, which is the one that actually serves the reported round. The stragglers that used to clobber the snapshot now show up in the late section, where they belong. Gate simplified to if interval == 4.

Finding 2 (emit-gate doesn't suppress the missed-slot case). Right — combined includes timely, so a missed slot still emitted block_only=0, timely_only=N. Now gating on block_v (the round's votes appearing in the canonical head block), so we only report a round once there's actually a block carrying it. I went with block_v alone rather than "both non-empty" so the genuine block_only=N, timely_only=0 signal (node was slow to aggregate) still gets reported.

Finding 3 (1 MiB proof clones to read bits). Replaced new_aggregated_payloads() with new_aggregated_payload_participants() -> Vec<(u64, AggregationBits)>, which clones only the bitfields and never touches proof_data. All three consumers (snapshot_new_payloads, the late loop, agg_start_new) use it now, so no proof blob is deep-copied anywhere in the coverage path.

Finding 5 (subnet=subnet_N never written). Per-subnet validator emission isn't wired (only the subnet count gauge uses has_subnet), so I corrected the help text to state only subnet=combined is currently populated and the per-subnet breakdown is reserved. Kept the subnet label for schema parity with the upstream/zeam dashboards rather than dropping a dimension they expect.

fmt/clippy -D warnings clean; blockchain/storage/rpc/signature tests green, forkchoice unchanged (same 8 pre-existing AttestationTooFarInFuture fixture flakes).

@MegaRedHand MegaRedHand merged commit 873ddf8 into main May 27, 2026
2 checks passed
@MegaRedHand MegaRedHand deleted the add-attestation-aggregate-coverage-metrics branch May 27, 2026 21:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants