Add LocalTestingPlatform for local PySpark feature execution#1104

Open
camkweston wants to merge 3 commits into airbnb:main from camkweston:camweston/local-testing-platform

@camkweston camkweston commented Apr 1, 2026

Summary

This PR introduces a LocalTestingPlatform — a BYOD (Bring Your Own Data) local testing platform that enables executing Chronon GroupBy, Join, and StagingQuery definitions against user-provided DataFrames entirely in-process, without any external infrastructure (no Hive metastore, no HDFS, no Airflow).

Why this matters: Enabling agentic feature engineering

The local testing platform is designed to close the feedback loop for AI-assisted feature engineering. Today, iterating on Chronon feature definitions requires compiling configs, submitting Spark jobs, and waiting for cluster results — a cycle that can take minutes to hours. This latency makes it impractical for an AI coding agent (like Claude Code) to autonomously iterate on feature definitions.

With the LocalTestingPlatform, an agent can:

  1. Read existing GroupBy/Join definitions from the repo
  2. Create or modify feature definitions in Python
  3. Generate synthetic or sample data as DataFrames
  4. Execute the full Chronon computation engine locally in seconds
  5. Inspect results, validate correctness, and iterate — all without leaving the development environment

This turns feature engineering from a slow, infrastructure-dependent process into a rapid local loop where an AI agent can explore, prototype, and validate features autonomously. The BYOD model means the agent only needs to provide DataFrames mapped to table names — the platform handles SparkSession management, Hive table registration, JAR discovery, and JVM orchestration transparently.

What's included

Core platform (api/py/ai/chronon/pyspark/local.py):

  • Three convenience functions: run_local_group_by(), run_local_join(), run_local_staging_query()
  • reset_local_session() for clean-slate testing
  • Singleton SparkSession with in-memory Derby metastore (no external dependencies)
  • Automatic JAR discovery via CHRONON_SPARK_JAR env var (Bazel) or SBT build dirs
  • register_tables() creates properly partitioned Hive tables from DataFrames
  • LocalTestingPlatform subclass with correct catalog.TableUtils path resolution
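The JAR discovery step described above can be sketched as follows. The CHRONON_SPARK_JAR environment variable comes from this PR; the function name and the SBT glob pattern are illustrative assumptions, not the platform's actual code:

```python
import glob
import os
from typing import Optional

def discover_chronon_jar() -> Optional[str]:
    """Locate the Chronon Spark deploy JAR for the local session (illustrative sketch)."""
    # Bazel exports CHRONON_SPARK_JAR explicitly; honor it first.
    jar = os.environ.get("CHRONON_SPARK_JAR")
    if jar:
        return jar
    # Fall back to scanning a typical SBT build output dir (pattern is an assumption).
    matches = glob.glob("spark/target/scala-*/*assembly*.jar")
    return matches[0] if matches else None
```

Checking the environment variable before scanning build directories lets Bazel-driven test runs pin an exact JAR while interactive SBT users get a sensible default.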

Tests (api/py/test/test_local_platform.py, api/py/test/test_pyspark.py):

  • Integration tests using the existing quickstart sample definitions (group_bys/quickstart/purchases, joins/quickstart/training_set)
  • Tests use import_module_set_name() to derive metaData.name and metaData.team via the standard Chronon module naming convention
  • Removed test_helpers.py — its run_group_by_with_inputs() is superseded by the local platform which supports GroupBy, Join, and StagingQuery (not just TEMPORAL GroupBy)
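The naming convention the tests rely on can be illustrated with a small sketch; `derive_metadata` and its exact output are simplified, hypothetical stand-ins for what import_module_set_name does:

```python
def derive_metadata(module_name: str, variable_name: str) -> tuple:
    """Sketch of Chronon's module naming convention (simplified, hypothetical).

    A definition in group_bys/quickstart/purchases.py assigned to `v1`
    conventionally yields team "quickstart" and name "quickstart.purchases.v1".
    """
    parts = module_name.split(".")  # e.g. ["group_bys", "quickstart", "purchases"]
    team = parts[1]
    name = ".".join(parts[1:] + [variable_name])
    return team, name
```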

Build integration:

  • spark/BUILD.bazel: Added filegroup target exposing the deploy JAR cross-package
  • api/py/BUILD.bazel: Added pyspark_test target with JAR dependency and sample data
  • requirements.txt: Added pyspark==3.5.5 and typing-extensions

Bug fix (api/py/ai/chronon/utils.py):

  • Fixed get_max_window_for_gb_in_days() crash on aggregations with windows=None (e.g. LAST_K) — pre-existing bug exposed by running the quickstart samples through the local platform
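The gist of the fix is a guard for unwindowed aggregations. The classes below are simplified stand-ins for the Thrift objects, not the real API:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Window:
    days: int  # simplified; real Chronon windows carry a length and a time unit

@dataclass
class Aggregation:
    windows: Optional[List[Window]] = None  # None for unwindowed ops such as LAST_K

def max_window_days(aggregations: List[Aggregation]) -> int:
    # Treating windows=None as "no windows" avoids the crash this PR fixes.
    all_days = [w.days for agg in aggregations for w in (agg.windows or [])]
    return max(all_days, default=0)
```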

Usage example

from ai.chronon.pyspark.local import run_local_group_by

# Use any existing Chronon GroupBy definition
from group_bys.quickstart.purchases import v1 as purchases_gb

# Provide your own data
result = run_local_group_by(
    group_by=purchases_gb,
    tables={"data.purchases": my_purchases_df},
    start_date="20240101",
    end_date="20240131",
)
result.show()
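The platform manages the SparkSession transparently, but for context, an in-memory Derby metastore session is typically configured along these lines. This is a sketch using standard Spark/Hive settings, not the platform's actual code:

```python
from pyspark.sql import SparkSession

# Hive catalog backed by an in-memory Derby metastore: no external services needed.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("chronon-local-testing")
    .config("spark.sql.catalogImplementation", "hive")
    .config(
        "javax.jdo.option.ConnectionURL",
        "jdbc:derby:memory:chronon_metastore;create=true",
    )
    .enableHiveSupport()
    .getOrCreate()
)
```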

Test plan

  • test_run_local_group_by — Executes quickstart purchases GroupBy against synthetic data
  • test_run_local_join — Executes quickstart training_set Join (purchases + returns + users) against synthetic data
  • test_session_reuse — Verifies singleton SparkSession pattern
  • test_register_tables_missing_partition_column — Validates error handling for missing ds column
  • test_reset_local_session — Verifies session cleanup and recreation
  • test_group_by (test_pyspark.py) — Migrated from test_helpers.py to use local platform
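The error case exercised by test_register_tables_missing_partition_column amounts to a pre-registration check like the following; the function name and message are illustrative, not the platform's actual code:

```python
from typing import Sequence

def validate_partition_column(columns: Sequence[str], partition_column: str = "ds") -> None:
    """Fail fast if a user-provided DataFrame lacks the partition column (sketch)."""
    if partition_column not in columns:
        raise ValueError(
            f"DataFrame is missing required partition column '{partition_column}'; "
            f"got columns: {list(columns)}"
        )
```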

Introduce a BYOD (Bring Your Own Data) local testing platform that
enables executing Chronon GroupBy, Join, and StagingQuery definitions
against user-provided DataFrames without external infrastructure.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@camkweston camkweston changed the title Add LocalTestingPlatform for in-process PySpark feature execution Add LocalTestingPlatform for local PySpark feature execution Apr 1, 2026
Cam Weston and others added 2 commits March 31, 2026 23:10