Add LocalTestingPlatform for local PySpark feature execution#1104

Open
camkweston wants to merge 3 commits into airbnb:main from camkweston:camweston/local-testing-platform

@camkweston camkweston commented Apr 1, 2026

Summary

This PR introduces a LocalTestingPlatform — a BYOD (Bring Your Own Data) local testing platform that enables executing Chronon GroupBy, Join, and StagingQuery definitions against user-provided DataFrames entirely in-process, without any external infrastructure (no Hive metastore, no HDFS, no Airflow).

Why this matters: Enabling agentic feature engineering

The local testing platform is designed to close the feedback loop for AI-assisted feature engineering. Today, iterating on Chronon feature definitions requires compiling configs, submitting Spark jobs, and waiting for cluster results — a cycle that can take minutes to hours. This latency makes it impractical for an AI coding agent (like Claude Code) to autonomously iterate on feature definitions.

With the LocalTestingPlatform, an agent can:

  1. Read existing GroupBy/Join definitions from the repo
  2. Create or modify feature definitions in Python
  3. Generate synthetic or sample data as DataFrames
  4. Execute the full Chronon computation engine locally in seconds
  5. Inspect results, validate correctness, and iterate — all without leaving the development environment

This turns feature engineering from a slow, infrastructure-dependent process into a rapid local loop where an AI agent can explore, prototype, and validate features autonomously. The BYOD model means the agent only needs to provide DataFrames mapped to table names — the platform handles SparkSession management, Hive table registration, JAR discovery, and JVM orchestration transparently.

What's included

Core platform (api/py/ai/chronon/pyspark/local.py):

  • Three convenience functions: run_local_group_by(), run_local_join(), run_local_staging_query()
  • reset_local_session() for clean-slate testing
  • Singleton SparkSession with in-memory Derby metastore (no external dependencies)
  • Automatic JAR discovery via CHRONON_SPARK_JAR env var (Bazel) or SBT build dirs
  • register_tables() creates properly partitioned Hive tables from DataFrames
  • LocalTestingPlatform subclass with correct catalog.TableUtils path resolution
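The JAR discovery step described above can be sketched as follows. The CHRONON_SPARK_JAR environment variable comes from this PR; the function name and the SBT glob pattern are illustrative assumptions, not the platform's actual code:

```python
import glob
import os
from typing import Optional

def discover_chronon_jar() -> Optional[str]:
    """Locate the Chronon Spark deploy JAR for the local session (illustrative sketch)."""
    # Bazel exports CHRONON_SPARK_JAR explicitly; honor it first.
    jar = os.environ.get("CHRONON_SPARK_JAR")
    if jar:
        return jar
    # Fall back to scanning a typical SBT build output dir (pattern is an assumption).
    matches = glob.glob("spark/target/scala-*/*assembly*.jar")
    return matches[0] if matches else None
```

Checking the environment variable before scanning build directories lets Bazel-driven test runs pin an exact JAR while interactive SBT users get a sensible default.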

Tests (api/py/test/test_local_platform.py, api/py/test/test_pyspark.py):

  • Integration tests using the existing quickstart sample definitions (group_bys/quickstart/purchases, joins/quickstart/training_set)
  • Tests use import_module_set_name() to derive metaData.name and metaData.team via the standard Chronon module naming convention
  • Removed test_helpers.py — its run_group_by_with_inputs() is superseded by the local platform which supports GroupBy, Join, and StagingQuery (not just TEMPORAL GroupBy)
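The naming convention the tests rely on can be illustrated with a small sketch; `derive_metadata` and its exact output are simplified, hypothetical stand-ins for what import_module_set_name does:

```python
def derive_metadata(module_name: str, variable_name: str) -> tuple:
    """Sketch of Chronon's module naming convention (simplified, hypothetical).

    A definition in group_bys/quickstart/purchases.py assigned to `v1`
    conventionally yields team "quickstart" and name "quickstart.purchases.v1".
    """
    parts = module_name.split(".")  # e.g. ["group_bys", "quickstart", "purchases"]
    team = parts[1]
    name = ".".join(parts[1:] + [variable_name])
    return team, name
```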

Build integration:

  • spark/BUILD.bazel: Added filegroup target exposing the deploy JAR cross-package
  • api/py/BUILD.bazel: Added pyspark_test target with JAR dependency and sample data
  • requirements.txt: Added pyspark==3.5.5 and typing-extensions

Bug fix (api/py/ai/chronon/utils.py):

  • Fixed get_max_window_for_gb_in_days() crash on aggregations with windows=None (e.g. LAST_K) — pre-existing bug exposed by running the quickstart samples through the local platform
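The gist of the fix is a guard for unwindowed aggregations. The classes below are simplified stand-ins for the Thrift objects, not the real API:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Window:
    days: int  # simplified; real Chronon windows carry a length and a time unit

@dataclass
class Aggregation:
    windows: Optional[List[Window]] = None  # None for unwindowed ops such as LAST_K

def max_window_days(aggregations: List[Aggregation]) -> int:
    # Treating windows=None as "no windows" avoids the crash this PR fixes.
    all_days = [w.days for agg in aggregations for w in (agg.windows or [])]
    return max(all_days, default=0)
```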

Usage example

from ai.chronon.pyspark.local import run_local_group_by

# Use any existing Chronon GroupBy definition
from group_bys.quickstart.purchases import v1 as purchases_gb

# Provide your own data
result = run_local_group_by(
    group_by=purchases_gb,
    tables={"data.purchases": my_purchases_df},
    start_date="20240101",
    end_date="20240131",
)
result.show()
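The platform manages the SparkSession transparently, but for context, an in-memory Derby metastore session is typically configured along these lines. This is a sketch using standard Spark/Hive settings, not the platform's actual code:

```python
from pyspark.sql import SparkSession

# Hive catalog backed by an in-memory Derby metastore: no external services needed.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("chronon-local-testing")
    .config("spark.sql.catalogImplementation", "hive")
    .config(
        "javax.jdo.option.ConnectionURL",
        "jdbc:derby:memory:chronon_metastore;create=true",
    )
    .enableHiveSupport()
    .getOrCreate()
)
```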

Test plan

  • test_run_local_group_by — Executes quickstart purchases GroupBy against synthetic data
  • test_run_local_join — Executes quickstart training_set Join (purchases + returns + users) against synthetic data
  • test_session_reuse — Verifies singleton SparkSession pattern
  • test_register_tables_missing_partition_column — Validates error handling for missing ds column
  • test_reset_local_session — Verifies session cleanup and recreation
  • test_group_by (test_pyspark.py) — Migrated from test_helpers.py to use local platform
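The error case exercised by test_register_tables_missing_partition_column amounts to a pre-registration check like the following; the function name and message are illustrative, not the platform's actual code:

```python
from typing import Sequence

def validate_partition_column(columns: Sequence[str], partition_column: str = "ds") -> None:
    """Fail fast if a user-provided DataFrame lacks the partition column (sketch)."""
    if partition_column not in columns:
        raise ValueError(
            f"DataFrame is missing required partition column '{partition_column}'; "
            f"got columns: {list(columns)}"
        )
```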

Introduce a BYOD (Bring Your Own Data) local testing platform that
enables executing Chronon GroupBy, Join, and StagingQuery definitions
against user-provided DataFrames without external infrastructure.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@camkweston camkweston changed the title Add LocalTestingPlatform for in-process PySpark feature execution Add LocalTestingPlatform for local PySpark feature execution Apr 1, 2026
Cam Weston and others added 2 commits March 31, 2026 23:10