Add LocalTestingPlatform for local PySpark feature execution#1104
Open
camkweston wants to merge 3 commits into airbnb:main
Introduce a BYOD (Bring Your Own Data) local testing platform that enables executing Chronon GroupBy, Join, and StagingQuery definitions against user-provided DataFrames without external infrastructure.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Summary
This PR introduces a LocalTestingPlatform — a BYOD (Bring Your Own Data) local testing platform that enables executing Chronon GroupBy, Join, and StagingQuery definitions against user-provided DataFrames entirely in-process, without any external infrastructure (no Hive metastore, no HDFS, no Airflow).
Why this matters: Enabling agentic feature engineering
The local testing platform is designed to close the feedback loop for AI-assisted feature engineering. Today, iterating on Chronon feature definitions requires compiling configs, submitting Spark jobs, and waiting for cluster results — a cycle that can take minutes to hours. This latency makes it impractical for an AI coding agent (like Claude Code) to autonomously iterate on feature definitions.
With the LocalTestingPlatform, an agent can compile a feature definition, execute it against synthetic DataFrames, and inspect the results entirely in-process. This turns feature engineering from a slow, infrastructure-dependent process into a rapid local loop where an AI agent can explore, prototype, and validate features autonomously. The BYOD model means the agent only needs to provide DataFrames mapped to table names — the platform handles SparkSession management, Hive table registration, JAR discovery, and JVM orchestration transparently.
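To make the session-management point concrete, here is a minimal, hypothetical sketch of a singleton-session-plus-reset pattern. The `LocalSession` class and `get_local_session()` helper are illustrative stand-ins (only `reset_local_session()` is an API named in this PR), and a plain class stands in for a real SparkSession:

```python
# Hypothetical sketch: LocalSession and get_local_session() are stand-ins;
# only reset_local_session() is an API named in this PR.

class LocalSession:
    """Stand-in for a locally managed SparkSession."""

    def __init__(self):
        # BYOD registry: table name -> DataFrame supplied by the caller
        self.tables = {}


_session = None  # module-level singleton


def get_local_session():
    """Reuse one session across runs so repeated executions stay fast."""
    global _session
    if _session is None:
        _session = LocalSession()
    return _session


def reset_local_session():
    """Clean-slate testing: drop the session and its registered tables."""
    global _session
    _session = None


s1 = get_local_session()
s2 = get_local_session()
print(s1 is s2)  # True: the session is reused
reset_local_session()
print(get_local_session() is s1)  # False: a fresh session after reset
```

The singleton avoids paying Spark-session startup cost on every feature execution, while the reset hook keeps tests isolated.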
What's included
Core platform (`api/py/ai/chronon/pyspark/local.py`):

- `run_local_group_by()`, `run_local_join()`, `run_local_staging_query()` entry points
- `reset_local_session()` for clean-slate testing
- JAR discovery via the `CHRONON_SPARK_JAR` env var (Bazel) or SBT build dirs
- `register_tables()` creates properly partitioned Hive tables from DataFrames
- `LocalTestingPlatform` subclass with correct `catalog.TableUtils` path resolution

Tests (`api/py/test/test_local_platform.py`, `api/py/test/test_pyspark.py`):

- Exercises the quickstart sample configs (`group_bys/quickstart/purchases`, `joins/quickstart/training_set`)
- Uses `import_module_set_name()` to derive `metaData.name` and `metaData.team` via the standard Chronon module naming convention
- Removes `test_helpers.py`, whose `run_group_by_with_inputs()` is superseded by the local platform, which supports GroupBy, Join, and StagingQuery (not just TEMPORAL GroupBy)

Build integration:
- `spark/BUILD.bazel`: Added a `filegroup` target exposing the deploy JAR cross-package
- `api/py/BUILD.bazel`: Added a `pyspark_test` target with the JAR dependency and sample data
- `requirements.txt`: Added `pyspark==3.5.5` and `typing-extensions`

Bug fix (`api/py/ai/chronon/utils.py`):

- Fixed a crash in `get_max_window_for_gb_in_days()` on aggregations with `windows=None` (e.g. `LAST_K`), a pre-existing bug exposed by running the quickstart samples through the local platform

Usage example
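The usage example itself did not survive extraction, so here is a hedged sketch of the intended call shape based on the API names above. The table name `data.purchases`, the column layout, and the exact signatures of `register_tables()` / `run_local_group_by()` are assumptions, and the Spark portion is skipped when PySpark (or a local JVM) is unavailable:

```python
# Hedged usage sketch; the register_tables()/run_local_group_by() call
# shapes are assumptions inferred from the PR summary.
purchases = [
    ("u1", 29.99, "2023-11-01"),  # (user_id, purchase_price, ds)
    ("u2", 10.00, "2023-11-01"),
]

try:
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("chronon-local-sketch")
        .getOrCreate()
    )
    df = spark.createDataFrame(purchases, ["user_id", "purchase_price", "ds"])
    # With the platform, the intended loop would be roughly:
    #   register_tables({"data.purchases": df})    # hypothetical table name
    #   result = run_local_group_by(purchases_v1)  # compiled GroupBy object
    #   result.show()
    spark.stop()
except Exception:
    # PySpark or a local JVM may be unavailable here; the BYOD mapping of
    # table names to DataFrames above is the part being illustrated.
    pass
```

The key point is that the caller supplies only the DataFrames and their table names; session setup and teardown stay inside the platform.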
Test plan
- `test_run_local_group_by` — Executes the quickstart purchases GroupBy against synthetic data
- `test_run_local_join` — Executes the quickstart training_set Join (purchases + returns + users) against synthetic data
- `test_session_reuse` — Verifies the singleton SparkSession pattern
- `test_register_tables_missing_partition_column` — Validates error handling for a missing `ds` column
- `test_reset_local_session` — Verifies session cleanup and recreation
- `test_group_by` (`test_pyspark.py`) — Migrated from `test_helpers.py` to use the local platform
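As an illustration of the error handling that `test_register_tables_missing_partition_column` covers, here is a hedged, Spark-free sketch in which a list of column names stands in for a DataFrame schema; the `check_partition_column()` helper is hypothetical, not the PR's actual implementation:

```python
# Hypothetical sketch: check_partition_column() is illustrative, not the
# PR's real register_tables() validation. Column-name lists stand in for
# DataFrame schemas.
PARTITION_COLUMN = "ds"


def check_partition_column(tables):
    """Fail fast if any supplied table lacks the `ds` partition column."""
    for name, columns in tables.items():
        if PARTITION_COLUMN not in columns:
            raise ValueError(
                f"table {name!r} is missing partition column {PARTITION_COLUMN!r}"
            )


# A table with `ds` passes validation...
check_partition_column({"data.purchases": ["user_id", "purchase_price", "ds"]})

# ...while one without it is rejected before anything touches Spark.
try:
    check_partition_column({"data.users": ["user_id", "email"]})
except ValueError as err:
    print(err)  # table 'data.users' is missing partition column 'ds'
```

Failing at registration time, rather than deep inside a Spark job, is what keeps the local feedback loop fast.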