Add pruning aggregate functions#7934
Conversation
Merging this PR will degrade performance by 13.89%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ❌ | Simulation | new_alp_prim_test_between[f32, 32768] |
153.3 µs | 181.9 µs | -15.71% |
| ❌ | Simulation | new_alp_prim_test_between[f32, 16384] |
103.9 µs | 118.1 µs | -12.02% |
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing ngates/stats-7707/min-max-aggregate-fns (e09432d) with develop (ba5064a)1
Footnotes
988447b to
b22ebfe
Compare
b22ebfe to
c1e72a3
Compare
78f93f1 to
aa967fc
Compare
|
@claude cna you rteview this pr, waht are system changes here |
|
Claude finished @joseph-isaacs's task in 3m 12s —— View job PR Review: Add pruning aggregate functionsTasks:
System-Level SummaryThis PR wires in a new layer of pruning aggregate functions that let expression rewrites prove predicates false or true at zone-map granularity without needing ad-hoc 1. New Aggregate Function Modules (6 files)
2. Stats Expression Plumbing (
|
aa967fc to
d963a44
Compare
|
@claude give this another pass |
|
Claude finished @gatesn's task in 6m 7s —— View job PR Review: Add pruning aggregate functions (second pass)Tasks:
OverviewThe PR is in good shape overall. The legacy stats bridge, nullable Min/Max, and precision-aware NullCount→AllNull/AllNonNull are all correctly implemented with solid tests. Below are the new findings from this pass. 🟡
|
robert3005
left a comment
There was a problem hiding this comment.
I think it's mostly nits but I also feel like this could use self review
| }) | ||
| .map(ScalarValue::Bool) | ||
| } else if aggregate_fn.is::<AllNan>() { | ||
| let len = u64::try_from(len)?; |
There was a problem hiding this comment.
why do this if you only check the ptype afterwards?
| } else if aggregate_fn.is::<AllNonNull>() { | ||
| array | ||
| .statistics() | ||
| .with_typed_stats_set(|stats| stats.get_as::<u64>(Stat::NullCount)) |
There was a problem hiding this comment.
there's a array.statistics().get_as that's shorter and does the same thing?
| root(), | ||
| AggregateFn::new(AllNonNull, EmptyOptions).erased(), | ||
| ))? | ||
| .execute::<Canonical>(&mut LEGACY_SESSION.create_execution_ctx())? |
There was a problem hiding this comment.
no new usages of legacy session allowed, add a static to the test
| .set(Stat::NullCount, Precision::inexact(ScalarValue::from(2u64))); | ||
|
|
||
| let result = array | ||
| .apply(&super::all_null(root()))? |
There was a problem hiding this comment.
those super look ugly but I guess they're not explicitly forbidden by lint rules
| .apply(&stat( | ||
| root(), | ||
| AggregateFn::new(AllNull, EmptyOptions).erased(), | ||
| ))? |
There was a problem hiding this comment.
this is inconsistent, sometimes you use the shorthand and sometimes you don't?
There was a problem hiding this comment.
I think you now have semantic conflict with develop
| } | ||
| } | ||
|
|
||
| /// Compute a byte-bounded upper bound for the maximum non-null value of a UTF8/Binary array. |
There was a problem hiding this comment.
I think I am missing something, this is Max in the context of pruning. Why do we need another expression? Is this because we don't have aggregate -> stats expression conversion logic we talked about?
|
|
||
| /// Partial accumulator state for the bounded minimum aggregate. | ||
| pub struct BoundedMinPartial { | ||
| min: Option<Scalar>, |
There was a problem hiding this comment.
I think you should change this to be an enum of enum Value { min(Scalar), null(dtype) }
| let batch = PrimitiveArray::from_option_iter([Some(1i32), None, Some(3)]).into_array(); | ||
| acc.accumulate(&batch, &mut ctx)?; | ||
|
|
||
| assert!(!bool::try_from(&acc.finish()?)?); |
There was a problem hiding this comment.
are as bool casts forbidden? This looks weird
| #[test] | ||
| fn all_non_nan_unsupported_for_non_float() -> VortexResult<()> { | ||
| let dtype = DType::Primitive(PType::I32, Nullability::Nullable); | ||
| assert!(Accumulator::try_new(AllNonNan, EmptyOptions, dtype).is_err()); |
There was a problem hiding this comment.
note to self: we need some shorthand for this but that's the current way to do this
|
|
||
| /// Compute a byte-bounded upper bound for the maximum non-null value of a UTF8/Binary array. | ||
| #[derive(Clone, Debug)] | ||
| pub struct BoundedMax; |
There was a problem hiding this comment.
I would call this truncated max, but this does feel like a stat/pruning concept not an aggregation one?
There was a problem hiding this comment.
But the whole epic is saying Stat == AggregateFn...
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
30b42c6 to
e09432d
Compare
cbc5046
into
ngates/stats-7707/typed-stats-rewrite-api
Summary
MinandMaxaggregate function modules that delegate to the existingmin_maximplementation for nowAllNullandAllNonNullaggregate functions for null pruning proofsstats::all_null(...)/stats::all_non_null(...)AllNull/AllNonNullthrough legacyNullCountstats when the proof is soundNotes
MinandMaxcurrently delegate throughmin_max, so each computes the unused bound. A future dedicated implementation can avoid that extra compute.AllNullandAllNonNullare proof-oriented null aggregates. They let rewrites express null pruning without introducing newrow_count()placeholders, while existing legacyNullCountstats can still feed them when exact enough.Tests
cargo +nightly fmt --allcargo test -p vortex-array min_cargo test -p vortex-array max_cargo test -p vortex-array stat_expr./scripts/public-api.shcargo clippy --all-targets --all-features