Chronon reads data partitions based on window sizes to compute features for event sources. For example
sum_clicks_past_30days ← 30 day window count_visits_past_90days ← 90 day window
When computing aggregations, one of the major time taking tasks is reading data from disk. If a feature needs 90 days lookback, Chronon reads 90 days worth of data to compute. When this feature is computed in production everyday, consecutive days have 89 partitions in common.
This CHIP is to add support for incremental aggregation which will avoid reading 89 common partitions between consecutive days and take advantage of previous day’s computation in feature store.
Eg: how many listings purchased by the user since signup.
These features are not windowed, which means the feature calculation happens over the lifetime.
These features are windowed and has a inverse operator
Eg: Number of listings(Count Aggregation) purchased by the user in the past 90 days
GroupBy(
sources=[table_name],
keys=["key_col1", "key_col2"],
aggregations=[
Aggregation(
input_column="inp_col",
operation=Operation.COUNT,
windows=[
Window(length=3, timeUnit=TimeUnit.DAYS),
Window(length=10, timeUnit=TimeUnit.DAYS)
]
)],
accuracy=Accuracy.SNAPSHOT
)To compute above groupBy incrementally
- Read the output table from groupby to get previous day’s aggregated values
- Read day 0 to add the latest activity
- Read user activity on day 4, day 11 to delete the unwanted user data
These features are windowed and does not have an inverse operator
Eg: What is the max purchase price by the user in the past 30 days
For non-deletable operators, we will go with the current behavior of Chronon where we load all the data/partitions needed to compute feature.
Add incremental=True if the feature compute needs to happen in incremental way in GroupBy API
GroupBy(
sources=[table_name],
keys=["key_col1", "key_col2"],
aggregations=[....],
incremental_agg=True,
accuracy=Accuracy.SNAPSHOT
)Need thrift changes for the flag


