Switch average implementation to utilize running average to prevent overflow #1066
rgruener wants to merge 3 commits into airbnb:main
Conversation
Force-pushed from fdbde1e to 89e4735
aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala
  StructType(
    "AvgIr",
-   Array(StructField("sum", DoubleType), StructField("count", IntType))
+   Array(StructField("running_average", DoubleType), StructField("count", IntType))
Unfortunately, this might break pipelines that are already in prod. The best way to deal with this is to add a new aggregation and leave this one as is.
Is the concern that changing the implementation would introduce skew? As a minimal fix, changing count to a Long would be helpful.
Though I do think adding another implementation is warranted to unblock certain use cases.
There is Avro-encoded data sitting in the kvStore in the old format; the new aggregation logic will probably fail to parse it.
Echoing Nikhil's comment.
I see. After a bit of further investigation, I will add this as RunningAverage, since I believe we will hit overflow issues (especially with count being an Int).
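For context, the shape such an aggregator could take looks roughly like the following. This is a minimal standalone sketch, not Chronon's actual code: the IR carries a (mean, count) pair with count as a Long, and the mean is updated incrementally so no unbounded running sum is ever materialized.

```scala
// Hypothetical sketch of a running-average aggregator (names and structure
// assumed for illustration; not the PR's actual implementation).
object RunningAverageSketch {
  // IR: the current mean and the number of records folded in so far.
  final case class Ir(mean: Double, count: Long)

  def prepare(x: Double): Ir = Ir(x, 1L)

  // Incremental update: mean' = mean + (x - mean) / (count + 1).
  // The sum of all inputs is never computed, so it cannot overflow.
  def update(ir: Ir, x: Double): Ir = {
    val n = ir.count + 1
    Ir(ir.mean + (x - ir.mean) / n, n)
  }

  // Merge two partial IRs via a weighted combination of the two means.
  def merge(a: Ir, b: Ir): Ir = {
    val n = a.count + b.count
    Ir((a.count * a.mean + b.count * b.mean) / n, n)
  }

  def result(ir: Ir): Double = ir.mean

  def main(args: Array[String]): Unit = {
    val xs = Seq(1.0, 2.0, 3.0, 4.0)
    val ir = xs.tail.foldLeft(prepare(xs.head))(update)
    println(result(ir)) // 2.5
  }
}
```

Note that the weighted merge above can itself lose precision when the two sides are very different in size, which is where a stability branch (discussed further down this thread) comes in.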
Force-pushed from 89e4735 to daa4ae3, then from daa4ae3 to bab464a
  override def isDeletable: Boolean = true
}

class RunningAverage extends SimpleAggregator[Double, Array[Any], Double] {
Instead of a new operator, would it make sense to add an argument (prevent_overflow or running_average) to the Average operator that defaults to false?
Yeah, I like that! It should default to None, which gets interpreted as false, to keep the semantic hashes as they were.
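A minimal sketch of that None-as-false convention (argument and field names are assumed for illustration, not taken from the PR): configs that never set the flag carry no new value, so their semantic hash is unchanged, while new configs can opt in explicitly.

```scala
// Hypothetical sketch of the suggested optional flag on Average.
object AverageArgsSketch {
  // Field name assumed; an absent flag is represented as None.
  final case class AverageArgs(preventOverflow: Option[Boolean] = None)

  // None is read as false, so pre-existing configs behave (and hash)
  // exactly as they did before the flag was introduced.
  def useRunningAverage(args: AverageArgs): Boolean =
    args.preventOverflow.getOrElse(false)

  def main(args: Array[String]): Unit = {
    println(useRunningAverage(AverageArgs()))           // false
    println(useRunningAverage(AverageArgs(Some(true)))) // true
  }
}
```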
Force-pushed from bab464a to 0e78450
Will update the docs assuming the change looks OK.
 * Uses a more stable online algorithm which should be suitable for large numbers of records, similar to:
 * http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
 */
private def computeRunningAverage(ir: Array[Any], right: Double, rightWeight: Double): Array[Any] = {
There is apparently already a getCombinedMean below in the moments code that does this.
val scaling = rightWeight / newCount
if (scaling < STABILITY_CONSTANT) {
  left + (right - left) * scaling
} else {
  (leftWeight * left + rightWeight * right) / newCount
}
I tried to do this in place (replacing the logic in Average directly), and it actually makes the tests fail: the average operation stops being commutative due to slight errors in the double multiply and divide. I also tried (lw * la + rw * ra) / (lw + rw), without luck.
We ended up merging the following change instead: zipline-ai/chronon#1292
nikhil-zlai left a comment:
Approving to negate my request for changes. Given the loss of commutativity with the running-average computation (due to double multiply/divide errors), I don't know what the right thing to do is. :-/
We ended up just changing the denominator to a Long in our fork.
I understand how this isn't strictly commutative (I hit that issue in the tests originally and introduced tolerance to get them to pass). Are there larger implications to that?
Summary
Currently the average implementation uses sum / count. For large aggregations this can cause the sum to overflow. This change switches the implementation to a running average that will not overflow.
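The failure mode is easy to demonstrate: a Double sum saturates to Infinity once it exceeds roughly 1.8e308, while an incrementally maintained mean of the same values stays finite. A minimal illustration (not Chronon code):

```scala
// Demonstrates why sum / count overflows where a running average does not.
object OverflowDemo {
  def main(args: Array[String]): Unit = {
    val big = 1.0e308 // close to Double.MaxValue (~1.8e308)

    // sum / count approach: the intermediate sum overflows to Infinity,
    // and the average is unrecoverable.
    val sum = big + big
    println(sum.isInfinity) // true
    println(sum / 2)        // Infinity

    // Running-average approach: the sum is never materialized, so the
    // mean of two equal values stays finite and exact.
    val mean = big + (big - big) / 2
    println(mean == big) // true
  }
}
```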
Why / Goal
We utilize large (global) aggregations when doing tensor computations. This guarantees that average will work even with these larger aggregations.
Test Plan
Reviewers