
Add pruning aggregate functions #7934

Merged
gatesn merged 4 commits into ngates/stats-7707/typed-stats-rewrite-api from ngates/stats-7707/min-max-aggregate-fns
May 19, 2026


@gatesn
Contributor

@gatesn gatesn commented May 14, 2026

Summary

  • add separate Min and Max aggregate function modules that delegate to the existing min_max implementation for now
  • return nullable min/max aggregate partials and results so empty or all-null groups can be represented
  • add AllNull and AllNonNull aggregate functions for null pruning proofs
  • register the new aggregate functions and expose stats::all_null(...) / stats::all_non_null(...)
  • bridge legacy stats into aggregate accumulation, including nullability-only dtype differences
  • bridge AllNull / AllNonNull through legacy NullCount stats when the proof is sound

Notes

Min and Max currently delegate through min_max, so each computes the unused bound. A future dedicated implementation can avoid that extra compute.
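
To make the extra compute concrete, here is a minimal standalone sketch in plain Rust over `Option<i64>` slices. It does not use the Vortex API; `min_max`, `min_via_min_max`, and `dedicated_min` are illustrative stand-ins, and the real implementation works on arrays and scalars rather than slices.

```rust
/// Combined pass: tracks both bounds over the non-null values.
/// Returns `None` for empty or all-null input.
fn min_max(values: &[Option<i64>]) -> Option<(i64, i64)> {
    values.iter().flatten().fold(None, |acc, &v| match acc {
        None => Some((v, v)),
        Some((lo, hi)) => Some((lo.min(v), hi.max(v))),
    })
}

/// Delegating min: correct, but pays for the max bound and then drops it.
fn min_via_min_max(values: &[Option<i64>]) -> Option<i64> {
    min_max(values).map(|(min, _unused_max)| min)
}

/// Dedicated min: only tracks the bound that is actually needed.
fn dedicated_min(values: &[Option<i64>]) -> Option<i64> {
    values.iter().flatten().copied().min()
}
```

Both variants agree on every input; the difference is only the work done per value.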

AllNull and AllNonNull are proof-oriented null aggregates. They let rewrites express null pruning without introducing new row_count() placeholders, while existing legacy NullCount stats can still feed them when exact enough.

Tests

  • cargo +nightly fmt --all
  • cargo test -p vortex-array min_
  • cargo test -p vortex-array max_
  • cargo test -p vortex-array stat_expr
  • ./scripts/public-api.sh
  • cargo clippy --all-targets --all-features

@gatesn gatesn added the changelog/feature A new feature label May 14, 2026
@gatesn gatesn changed the title 4: Add Min and Max aggregate functions Add Min and Max aggregate functions May 14, 2026
@codspeed-hq

codspeed-hq Bot commented May 14, 2026

Merging this PR will degrade performance by 13.89%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

❌ 2 regressed benchmarks
✅ 1235 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| Simulation | new_alp_prim_test_between[f32, 32768] | 153.3 µs | 181.9 µs | -15.71% |
| Simulation | new_alp_prim_test_between[f32, 16384] | 103.9 µs | 118.1 µs | -12.02% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing ngates/stats-7707/min-max-aggregate-fns (e09432d) with develop (ba5064a) [1]

Open in CodSpeed

Footnotes

  1. No successful run was found on ngates/stats-7707/typed-stats-rewrite-api (5c04e3d) during the generation of this report, so develop (ba5064a) was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@gatesn gatesn force-pushed the ngates/stats-7707/min-max-aggregate-fns branch from 988447b to b22ebfe Compare May 14, 2026 22:32
@gatesn gatesn mentioned this pull request May 14, 2026
42 tasks
@gatesn gatesn force-pushed the ngates/stats-7707/min-max-aggregate-fns branch from b22ebfe to c1e72a3 Compare May 14, 2026 23:02
@gatesn gatesn changed the title Add Min and Max aggregate functions Add pruning aggregate functions May 14, 2026
@gatesn gatesn force-pushed the ngates/stats-7707/min-max-aggregate-fns branch 3 times, most recently from 78f93f1 to aa967fc Compare May 15, 2026 09:23
@gatesn gatesn marked this pull request as ready for review May 15, 2026 11:32
Base automatically changed from ngates/stats-7707/stats-rewrite-session-api to develop May 15, 2026 11:47
@gatesn gatesn enabled auto-merge (squash) May 15, 2026 11:48
@joseph-isaacs
Contributor

@claude can you review this PR, what are the system changes here?

@github-actions
Contributor

github-actions Bot commented May 15, 2026

Claude finished @joseph-isaacs's task in 3m 12s


PR Review: Add pruning aggregate functions

Tasks:

  • Understand the request (PR review + system changes overview)
  • Read changed files and gather context
  • Analyze new aggregate function modules
  • Review stats/expr/rewrite plumbing
  • Check accumulator and session changes
  • Evaluate scalar_fn stat bridging
  • Look at performance regression context
  • Post final review

System-Level Summary

This PR wires in a new layer of pruning aggregate functions that let expression rewrites prove predicates false or true at zone-map granularity without needing ad-hoc row_count() placeholders. Here's a map of the moving parts:

1. New Aggregate Function Modules (6 files)

| Module | Semantics | Partial type |
| --- | --- | --- |
| fns/min | Min non-null value; null for empty/all-null groups | Option<Scalar> (MinPartial) |
| fns/max | Max non-null value; null for empty/all-null groups | Option<Scalar> (MaxPartial) |
| fns/all_null | true iff every value is null | bool |
| fns/all_non_null | true iff no value is null | bool |
| fns/all_nan | true iff every value is NaN | bool |
| fns/all_non_nan | true iff no value is NaN | bool |

Min and Max deliberately return nullable output (as_nullable()) so empty groups serialize as null instead of propagating a sentinel value.
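
As a rough illustration of why a nullable partial avoids sentinels, the sketch below models the partial as an `Option<i64>`. `MinPartial` here is a hypothetical simplification (the table above lists the real partial as `Option<Scalar>`), and the method names are assumptions chosen for the example.

```rust
/// Hypothetical stand-in for a nullable min partial: `None` means
/// "no non-null value seen yet", so no sentinel value is needed.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct MinPartial(Option<i64>);

impl MinPartial {
    /// Fold one batch into the partial, skipping nulls.
    fn accumulate(&mut self, batch: &[Option<i64>]) {
        for v in batch.iter().flatten() {
            self.0 = Some(self.0.map_or(*v, |cur| cur.min(*v)));
        }
    }

    /// Merge another group's partial; an empty partial contributes nothing.
    fn combine(&mut self, other: MinPartial) {
        if let Some(v) = other.0 {
            self.0 = Some(self.0.map_or(v, |cur| cur.min(v)));
        }
    }

    /// The final result stays null for empty or all-null groups.
    fn finish(self) -> Option<i64> {
        self.0
    }
}
```

With a sentinel such as `i64::MAX` an all-null group would be indistinguishable from a group whose true minimum is `i64::MAX`; the `Option` keeps the two apart.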

2. Stats Expression Plumbing (stats/expr.rs)

Adds public expression constructors stats::all_null(expr), stats::all_non_null(expr), etc. that call stat(expr, AggFn). A key design choice: AllNull/AllNonNull bridge from legacy NullCount stats in stat_array() with explicit precision semantics:

  • AllNull: NullCount == len (exact) → true; NullCount < len (inexact) → false; NullCount == len (inexact) → null (unknown)
  • AllNonNull: NullCount == 0 (exact or inexact) → true; non-zero inexact → null

This gives safe access to legacy file stats without recomputing.
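
The mapping above can be sketched as two small functions. This is a simplified model, not the code in stat_array(): `Precision` is reduced to a two-variant enum, the function names are hypothetical, and the exact-count cases not listed above (exact count below len for AllNull, exact non-zero count for AllNonNull) are assumed to yield false.

```rust
/// Simplified stand-in for the precision attached to a legacy stat.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Precision {
    Exact,
    Inexact,
}

/// `Some(true/false)` is a proof; `None` means "unknown".
fn all_null_from_null_count(null_count: u64, precision: Precision, len: u64) -> Option<bool> {
    match precision {
        // An exact count decides the question either way (assumed for count < len).
        Precision::Exact => Some(null_count == len),
        // An inexact count below len still rules out "all null".
        Precision::Inexact if null_count < len => Some(false),
        // An inexact count equal to len proves nothing.
        Precision::Inexact => None,
    }
}

fn all_non_null_from_null_count(null_count: u64, precision: Precision) -> Option<bool> {
    match (null_count, precision) {
        // Zero nulls is a proof whether exact or inexact.
        (0, _) => Some(true),
        // Assumed: an exact non-zero count disproves "all non-null".
        (_, Precision::Exact) => Some(false),
        // A non-zero inexact count leaves the answer unknown.
        (_, Precision::Inexact) => None,
    }
}
```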

3. Rewrite Rule System (stats/rewrite.rs, stats/session.rs)

  • StatsRewriteRule trait: extensible per-ScalarFnId plugin with falsify / satisfy methods
  • StatsRewriteSession: session-scoped registry keyed by ScalarFnId, multiple rules OR'd together
  • New public Expression::falsify(&session) and Expression::satisfy(&session) methods — the external API for callers that need pruning proofs
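
A toy model of the registry shape described above, assuming a string key in place of ScalarFnId and a tiny `Proof` enum in place of real expressions. `RewriteRule`, `FixedRule`, and `RewriteRegistry` are hypothetical names for this sketch; only the "keyed registry, results OR'd together" idea comes from the review.

```rust
use std::collections::HashMap;

/// Stand-in for a rewritten stats expression.
#[derive(Clone, Debug, PartialEq)]
enum Proof {
    Stat(&'static str),
    Or(Box<Proof>, Box<Proof>),
}

/// Hypothetical rule interface: a rule may or may not produce a falsifier.
trait RewriteRule {
    fn falsify(&self) -> Option<Proof>;
}

/// Trivial rule that always returns the same answer, for demonstration.
struct FixedRule(Option<Proof>);

impl RewriteRule for FixedRule {
    fn falsify(&self) -> Option<Proof> {
        self.0.clone()
    }
}

/// Registry keyed by function id; several rules may share one key.
#[derive(Default)]
struct RewriteRegistry {
    rules: HashMap<&'static str, Vec<Box<dyn RewriteRule>>>,
}

impl RewriteRegistry {
    fn register(&mut self, fn_id: &'static str, rule: Box<dyn RewriteRule>) {
        self.rules.entry(fn_id).or_default().push(rule);
    }

    /// OR together every proof produced by the rules registered for `fn_id`.
    fn falsify(&self, fn_id: &str) -> Option<Proof> {
        self.rules
            .get(fn_id)?
            .iter()
            .filter_map(|rule| rule.falsify())
            .reduce(|a, b| Proof::Or(Box::new(a), Box::new(b)))
    }
}
```

Rules that produce no rewrite are simply skipped, so one rule declining never blocks another rule's proof.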

4. Legacy Stats Bridge in Accumulator (accumulator.rs)

Step 0 of accumulation now handles the case where the legacy stat dtype differs from the new partial dtype by nullability only, cast()-ing from non-nullable to nullable. This is critical for Min/Max since legacy Stat::Min/Max store scalars without nullability on their dtype.
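
The guard-then-cast idea can be sketched with a toy dtype. `Kind`, `DType`, and `bridge_legacy_dtype` are hypothetical simplifications for this example; the real check uses eq_ignore_nullability and cast() on Vortex types, and rejecting the nullable-to-non-nullable direction is an assumption of the sketch.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Kind {
    I32,
    F32,
    Utf8,
}

/// Toy dtype: a kind plus a nullability flag.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct DType {
    kind: Kind,
    nullable: bool,
}

impl DType {
    fn eq_ignore_nullability(&self, other: &DType) -> bool {
        self.kind == other.kind
    }
}

/// Returns the dtype the legacy stat should be cast to, or an error when
/// the mismatch is more than a nullability difference.
fn bridge_legacy_dtype(legacy: DType, partial: DType) -> Result<DType, String> {
    if !legacy.eq_ignore_nullability(&partial) {
        return Err(format!("legacy dtype {legacy:?} is incompatible with {partial:?}"));
    }
    if legacy.nullable && !partial.nullable {
        // Assumption: only the widening direction is treated as safe here.
        return Err("cannot narrow a nullable legacy stat".to_string());
    }
    Ok(partial)
}
```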

5. Session Wiring

StatsRewriteSession is wired into VortexSessionDefault in vortex/src/lib.rs, and all 6 new aggregates are registered in AggregateFnSession::default().


Review Findings

🔴 Issue: AllNan/AllNonNan vacuous truth inconsistency for empty float arrays

AllNan::empty_partial initializes to has_nans(input_dtype): true for floats, false for integers. This means an empty f32 array will report all_nan = true AND all_non_nan = true (both empty_partial = true). While vacuously defensible, this is a surprising edge case:

  • all_nan(empty_f32) == true suggests "the column could be all-NaN" — safe for falsification
  • all_non_nan(empty_f32) == true — also vacuously true, safe for satisfaction

The inconsistency is that all_null for an empty array is also true, but all_nan for an empty integer array is false. This asymmetry between the NaN and null families could confuse future rewrite rule authors. There's no test covering an empty float array.
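
The vacuous-truth behaviour for empty float input is the same one Rust's `Iterator::all` exhibits, which makes it easy to demonstrate outside Vortex. The two helpers below are plain-slice illustrations, not the aggregate implementations.

```rust
/// True iff every value is NaN; vacuously true for an empty slice.
fn all_nan(values: &[f32]) -> bool {
    values.iter().all(|v| v.is_nan())
}

/// True iff no value is NaN; also vacuously true for an empty slice.
fn all_non_nan(values: &[f32]) -> bool {
    values.iter().all(|v| !v.is_nan())
}
```

For an empty slice both predicates hold at once, which is the edge case a dedicated test would pin down.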

🟡 Issue: has_nans is duplicated

has_nans(dtype: &DType) -> bool is defined identically in both fns/all_nan/mod.rs:117 and scalar_fn/fns/stat.rs:210. This should be extracted to a shared location (e.g., dtype utilities or a nan helper module).

🟡 Issue: Min/Max compute both bounds

Both Min::accumulate and Max::accumulate call min_max(), paying for both bounds. The PR acknowledges this and the comment is clear, but the performance regression report showing 10-25% regressions in fastlanes between benchmarks may be related — between typically exercises both min/max bounds simultaneously. Worth investigating whether the regression pre-dates this PR or is caused by it triggering more min_max calls than before.

🟡 Issue: StatsRewriteRule and StatsRewriteSession methods have #[allow(dead_code)]

register and register_ref on StatsRewriteSession, and the StatsRewriteRule trait itself carry #[allow(dead_code)]. This is fine as foundation plumbing, but signals these APIs aren't yet exercised in production paths. Consider adding a note in the module doc about when they become active.

🟢 Design: MinMax::serialize fix (unimplemented! → Ok(None))

Good call fixing this — Min and Max delegate to MinMax and inherit the assumption that serialization works, so leaving a panic there would be a latent correctness issue.

🟢 Design: Nullable output for Min/Max

Returning nullable output that becomes null for empty groups is the correct SQL semantics and avoids the need for sentinel values or separate "saw any values" flags.

🟢 Design: Precision-aware bridging from NullCount to AllNull/AllNonNull

The asymmetry between the two proofs (AllNull returns false when inexact count definitively doesn't equal len, vs. AllNonNull returning true when inexact count is zero) is well-reasoned and tested. The test matrix covering exact/inexact with various counts is thorough.

🟢 Design: Legacy stats cast in accumulator

The dtype-normalizing cast (eq_ignore_nullability check + cast()) is correct and the vortex_ensure! guard ensures it only fires when the dtypes differ only in nullability, not in kind.

Minor nitpick: test naming convention

STYLE.md (and CLAUDE.md) note the convention for test module names is tests (not test). All new test modules here use mod tests { ... } ✓.


Performance Regression

The CodSpeed report flags 6 benchmarks in fastlanes (bitpack/alp between) regressing 10–25%. These benchmarks compare against ngates/stats-7707/stats-rewrite-session-api, not develop. If those regressions also appear against develop, they may be pre-existing or caused by an earlier PR in this series. Worth a quick git bisect or comparison against develop directly.


Overall: Solid architecture with clear abstractions and good test coverage for the precision semantics. Main asks before merge: confirm the performance regression source, add a test for empty float AllNan/AllNonNan, and deduplicate has_nans.

Comment thread vortex-array/src/aggregate_fn/fns/all_nan/mod.rs Outdated
@gatesn gatesn added the tldr label May 17, 2026
@gatesn gatesn force-pushed the ngates/stats-7707/min-max-aggregate-fns branch from aa967fc to d963a44 Compare May 17, 2026 15:51
@tl-dr-review tl-dr-review Bot added tldr and removed tldr labels May 17, 2026
@gatesn gatesn requested a review from joseph-isaacs May 17, 2026 18:41
@gatesn
Contributor Author

gatesn commented May 17, 2026

@claude give this another pass

@github-actions
Contributor

github-actions Bot commented May 17, 2026

Claude finished @gatesn's task in 6m 7s


PR Review: Add pruning aggregate functions (second pass)

Tasks:

  • Understand the request (fresh full review pass)
  • Read new aggregate function modules (Min, Max, AllNull, AllNonNull, AllNan, AllNonNan)
  • Read BoundedMin / BoundedMax
  • Read accumulator changes
  • Read stats expr / rewrite / session
  • Read scalar_fn stat bridging
  • Synthesize and post findings

Overview

The PR is in good shape overall. The legacy stats bridge, nullable Min/Max, and precision-aware NullCount→AllNull/AllNonNull are all correctly implemented with solid tests. Below are the new findings from this pass.


🟡 BoundedMax conflates Empty and Unknown during cross-shard merging

BoundedMaxPartial::to_scalar returns Scalar::null for both BoundedMaxState::Empty (no values seen yet) and BoundedMaxState::Unknown (no valid upper bound). When that null is later received via combine_partials → merge(null), the state becomes Unknown:

// bounded_max/mod.rs
fn merge(&mut self, max: Scalar) {
    if max.is_null() {
        self.state = BoundedMaxState::Unknown;  // <-- Empty and Unknown both land here
        return;
    }
    ...
}

This means merging an empty-group shard (e.g. a file row group with all-null values, which serializes as null) poisons a valid upper bound on the other shards. For example:

  • Shard A: values ["aardvark"], BoundedMax = "aardvas" (valid upper bound)
  • Shard B: all-null, BoundedMax = Empty → serialized as null
  • After merge: A's state becomes Unknown → result is null (no upper bound)

The existing test bounded_max_empty_does_not_poison_later_values only covers the single-accumulator path (where the min_max early-return handles empty batches before merge is called). The multi-shard merge path has no test coverage.

BoundedMin is correctly exempt because empty/null scalar merges are silently ignored (lower bounds are always computable, so None from lower_bound never occurs, and empty-group partials correctly contribute nothing).

The fix would require a 3-way serialized form (e.g., Option<Option<Scalar>> where None = Empty, Some(None) = Unknown, Some(Some(v)) = Value) or tagging Unknown differently in to_scalar.
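
One possible shape for the three-way form, sketched with `String` in place of Scalar. The `to_wire`, `from_wire`, and `merge` names are hypothetical, and this is not the code in bounded_max/mod.rs; it only shows that keeping Empty and Unknown distinct across serialization stops an empty shard from poisoning a valid bound.

```rust
#[derive(Clone, Debug, PartialEq)]
enum BoundedMaxState {
    /// No values seen yet.
    Empty,
    /// Values were seen but no valid upper bound exists.
    Unknown,
    /// A valid upper bound.
    Value(String),
}

impl BoundedMaxState {
    /// Three-way serialized form: None = Empty, Some(None) = Unknown.
    fn to_wire(&self) -> Option<Option<String>> {
        match self {
            Self::Empty => None,
            Self::Unknown => Some(None),
            Self::Value(v) => Some(Some(v.clone())),
        }
    }

    fn from_wire(wire: Option<Option<String>>) -> Self {
        match wire {
            None => Self::Empty,
            Some(None) => Self::Unknown,
            Some(Some(v)) => Self::Value(v),
        }
    }

    /// Unknown is absorbing, Empty is the identity, values take the larger bound.
    fn merge(self, other: Self) -> Self {
        match (self, other) {
            (Self::Unknown, _) | (_, Self::Unknown) => Self::Unknown,
            (Self::Empty, s) | (s, Self::Empty) => s,
            (Self::Value(a), Self::Value(b)) => Self::Value(a.max(b)),
        }
    }
}
```

Replaying the shard example: merging the round-tripped Empty partial from shard B leaves shard A's "aardvas" bound intact, while a genuine Unknown still wins.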


🟡 AllNan / AllNonNan accumulate bodies are unreachable dead code

All four pruning boolean aggregates (AllNull, AllNonNull, AllNan, AllNonNan) override try_accumulate to always return Ok(true):

// all_nan/mod.rs
fn try_accumulate(&self, state: &mut Self::Partial, batch: &ArrayRef, ctx: &mut ExecutionCtx) -> VortexResult<bool> {
    ...
    *state &= nan_count(batch, ctx)? == batch.len();
    Ok(true)  // always short-circuits
}

Because DynAccumulator::accumulate returns early if try_accumulate returns true, the accumulate(&Columnar) bodies are never reached via the standard dispatch path. The dead bodies are 10–15 lines each, with non-trivial logic that doesn't match try_accumulate (e.g. AllNull::accumulate has a special Columnar::Constant fast path that try_accumulate doesn't, and AllNan::accumulate re-checks the dtype in a slightly different way).

A comment like // try_accumulate always short-circuits; this body is a fallback if dispatch is bypassed (or an unreachable!()) would prevent future confusion and avoid drift between the two implementations.
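
The dispatch pattern behind this finding can be modelled in a few lines. The `Aggregate` trait and `dispatch` function below are hypothetical simplifications of the behaviour described for DynAccumulator::accumulate; the `fallback_calls` counter exists only to make the unreachable path observable.

```rust
/// Hypothetical two-step interface: a fast path that may claim the batch,
/// and a fallback that runs only when the fast path declines.
trait Aggregate {
    /// Returns `true` when the batch has been fully handled.
    fn try_accumulate(&mut self, batch: &[Option<f32>]) -> bool;
    fn accumulate(&mut self, batch: &[Option<f32>]);
}

/// Standard dispatch: the fallback is skipped whenever the fast path claims the batch.
fn dispatch<A: Aggregate>(agg: &mut A, batch: &[Option<f32>]) {
    if agg.try_accumulate(batch) {
        return;
    }
    agg.accumulate(batch);
}

struct AllNull {
    state: bool,
    fallback_calls: usize,
}

impl Aggregate for AllNull {
    fn try_accumulate(&mut self, batch: &[Option<f32>]) -> bool {
        self.state &= batch.iter().all(Option::is_none);
        true // always claims the batch
    }

    fn accumulate(&mut self, _batch: &[Option<f32>]) {
        // Never reached through `dispatch`, so logic here can silently drift.
        self.fallback_calls += 1;
    }
}
```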


🟢 Non-float guards in stat_array for AllNan/AllNonNan are dead code

In scalar_fn/fns/stat.rs, stat_array has explicit non-float guards for AllNan and AllNonNan:

} else if aggregate_fn.is::<AllNan>() {
    if !matches!(array.dtype(), DType::Primitive(ptype, _) if ptype.is_float()) {
        Some(false)   // unreachable
    } else { ... }
} else if aggregate_fn.is::<AllNonNan>() {
    if !matches!(array.dtype(), DType::Primitive(ptype, _) if ptype.is_float()) {
        Some(true)    // unreachable
    } else { ... }
}

These branches are unreachable because stat_dtype (called before stat_array in execute) bails for non-float input — the test stat_expr_rejects_all_nan_for_non_float confirms the error originates there.

The more interesting observation: AllNonNan returning true for integers is semantically correct (integers can never be NaN). If it's desired that stat(integer_col, all_non_nan) succeeds and returns true, then AllNonNan::return_dtype should not return None for non-float types. Currently it's inconsistent: the dead guard says "yes, this is true for integers" but the gate says "unsupported."
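
The two positions can be put side by side in a small sketch. `PType` here is a two-variant toy, and both function names are hypothetical; the first mirrors the gate as described (unsupported for non-floats), the second mirrors what the unreachable guard implies (a constant answer for non-floats).

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum PType {
    I32,
    F32,
}

impl PType {
    fn is_float(self) -> bool {
        matches!(self, PType::F32)
    }
}

/// Gate as described: no return dtype (unsupported) for non-float input.
fn all_non_nan_return_dtype_gated(ptype: PType) -> Option<&'static str> {
    ptype.is_float().then_some("bool")
}

/// What the dead guard implies: non-floats can never hold NaN, so the
/// answer is statically `true`; floats still need their data or stats.
fn all_non_nan_static_answer(ptype: PType) -> Option<bool> {
    if ptype.is_float() {
        None
    } else {
        Some(true)
    }
}
```

Picking one of the two and removing the other would resolve the inconsistency either way.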


🟢 BoundedMin/BoundedMax missing comment on min_max double-compute

BoundedMin::accumulate and BoundedMax::accumulate both delegate to min_max() (computing the unused bound). min.rs and max.rs have an explicit comment:

"Delegate to the existing min_max implementation for now. A dedicated min aggregate would avoid computing max when only min is needed."

bounded_min.rs and bounded_max.rs have the same performance concern but no matching comment. Minor, but worth keeping consistent.


Confirmed from previous review

  • has_nans duplication: Not present in the current code — both files use inline matches!(). Previous review appears to have been looking at a different revision.
  • StatsRewriteRule/StatsRewriteSession dead_code allows: Still present; consistent with the "foundation plumbing" intent and the tests in rewrite.rs exercise the machinery end-to-end.
  • Min/Max double-compute via min_max: Noted in the PR comments and documented inline.

Strengths

  • Precision-aware NullCount → AllNull/AllNonNull bridging is correct and thoroughly tested across all four combinations (exact/inexact × zero/nonzero/len).
  • Min/Max nullable output for empty groups is correct SQL semantics.
  • Legacy stats cast in accumulator.rs (eq_ignore_nullability + cast()) is sound with the right guard.
  • BoundedMax has good edge-case test coverage for the single-accumulator path (empty prefix, unknown poison, round-trip options).
  • StatsRewriteCtx OR-collects multiple rule rewrites correctly and has tests verifying both falsifier and satisfier combination.

Contributor

@robert3005 robert3005 left a comment


I think it's mostly nits but I also feel like this could use self review

})
.map(ScalarValue::Bool)
} else if aggregate_fn.is::<AllNan>() {
let len = u64::try_from(len)?;

why do this if you only check the ptype afterwards?

} else if aggregate_fn.is::<AllNonNull>() {
array
.statistics()
.with_typed_stats_set(|stats| stats.get_as::<u64>(Stat::NullCount))

there's a array.statistics().get_as that's shorter and does the same thing?

root(),
AggregateFn::new(AllNonNull, EmptyOptions).erased(),
))?
.execute::<Canonical>(&mut LEGACY_SESSION.create_execution_ctx())?

no new usages of legacy session allowed, add a static to the test

.set(Stat::NullCount, Precision::inexact(ScalarValue::from(2u64)));

let result = array
.apply(&super::all_null(root()))?

those super look ugly but I guess they're not explicitly forbidden by lint rules

Comment on lines +249 to +252
.apply(&stat(
    root(),
    AggregateFn::new(AllNull, EmptyOptions).erased(),
))?

this is inconsistent, sometimes you use the shorthand and sometimes you don't?

Comment thread vortex-array/src/expr/expression.rs Outdated

I think you now have semantic conflict with develop

}
}

/// Compute a byte-bounded upper bound for the maximum non-null value of a UTF8/Binary array.

I think I am missing something, this is Max in the context of pruning. Why do we need another expression? Is this because we don't have aggregate -> stats expression conversion logic we talked about?


/// Partial accumulator state for the bounded minimum aggregate.
pub struct BoundedMinPartial {
    min: Option<Scalar>,

I think you should change this to be an enum of enum Value { min(Scalar), null(dtype) }

let batch = PrimitiveArray::from_option_iter([Some(1i32), None, Some(3)]).into_array();
acc.accumulate(&batch, &mut ctx)?;

assert!(!bool::try_from(&acc.finish()?)?);

are as bool casts forbidden? This looks weird

#[test]
fn all_non_nan_unsupported_for_non_float() -> VortexResult<()> {
    let dtype = DType::Primitive(PType::I32, Nullability::Nullable);
    assert!(Accumulator::try_new(AllNonNan, EmptyOptions, dtype).is_err());

note to self: we need some shorthand for this but that's the current way to do this


/// Compute a byte-bounded upper bound for the maximum non-null value of a UTF8/Binary array.
#[derive(Clone, Debug)]
pub struct BoundedMax;

I would call this truncated max, but this does feel like a stat/pruning concept not an aggregation one?

Contributor Author


But the whole epic is saying Stat == AggregateFn...

gatesn added 3 commits May 19, 2026 16:21
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
@gatesn gatesn force-pushed the ngates/stats-7707/min-max-aggregate-fns branch from 30b42c6 to e09432d Compare May 19, 2026 20:46
@gatesn gatesn changed the base branch from develop to ngates/stats-7707/typed-stats-rewrite-api May 19, 2026 20:49
@gatesn gatesn merged commit cbc5046 into ngates/stats-7707/typed-stats-rewrite-api May 19, 2026
61 checks passed
@gatesn gatesn deleted the ngates/stats-7707/min-max-aggregate-fns branch May 19, 2026 20:49
@gatesn gatesn restored the ngates/stats-7707/min-max-aggregate-fns branch May 19, 2026 20:50
