kafka/protocol: make schemata codegen reproducible #30468

Merged
travisdowns merged 1 commit into redpanda-data:dev from travisdowns:td-kafka-schemata-reproducible
May 14, 2026

@travisdowns
Member

@travisdowns travisdowns commented May 13, 2026

The kafka schemata generator emits C++ source from JSON schemas. Earlier
versions iterated Python sets when emitting #include lines, so the
byte-level output of each codegen action varied with PYTHONHASHSEED
(which CPython picks fresh per interpreter).
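
The effect is easy to reproduce outside the generator. The following is a minimal sketch, not code from this PR: it starts fresh interpreters with different PYTHONHASHSEED values and prints the iteration order of a set of strings. The header names and the seed values are made up for illustration.

```python
import os
import subprocess
import sys

# Child program: print the iteration order of a set of (made-up) header names.
CHILD = 'print(",".join({"a.h", "b.h", "c.h", "d.h", "e.h", "f.h", "g.h", "h.h"}))'


def set_order(seed: str) -> str:
    """Run CHILD in a fresh interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


orders = {seed: set_order(seed) for seed in ("1", "2", "3", "12345")}
for seed, order in orders.items():
    print(f"PYTHONHASHSEED={seed}: {order}")

# A fixed seed reproduces the same order on every run...
assert set_order("1") == orders["1"]
# ...and every seed yields the same elements, whatever their order.
reference = sorted(orders["1"].split(","))
assert all(sorted(o.split(",")) == reference for o in orders.values())
```

The printed orders will typically differ from seed to seed, which is the variation that ended up in the generated #include lines.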

The object files produced by compilation were unchanged by the header
reordering, so the final binary was also unchanged and the problem was
invisible to a top-level hash check of the redpanda binary.

But Bazel keys its action cache on input content hashes, so
non-deterministic codegen output invalidates every downstream compile's
cache key. On a shared remote cache that means every developer compiling
the kafka layer misses on every action that consumes the generated
headers.
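
As a rough illustration of why reordering alone is enough to cause a miss, the sketch below hashes two hypothetical generated headers that differ only in #include order. It is a simplified model: a real Bazel action key covers the command line and all inputs, not a single file, and the file contents here are invented.

```python
import hashlib

# Two hypothetical generated headers that differ only in #include order.
run_a = b'#include "a.h"\n#include "b.h"\n\nstruct request {};\n'
run_b = b'#include "b.h"\n#include "a.h"\n\nstruct request {};\n'


def digest(content: bytes) -> str:
    """Content digest of one input file (SHA-256 is Bazel's default)."""
    return hashlib.sha256(content).hexdigest()


print(digest(run_a))
print(digest(run_b))

# Same set of includes, different bytes, therefore different digests:
# any action that takes this file as an input gets a different cache key.
assert digest(run_a) != digest(run_b)
```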

This PR:

  • Sorts the header sets at the two sites where they're iterated by
    Jinja: StructType.headers() (per-schema header template) and the
    extra_schema_headers set passed into COMBINED_SOURCE_TEMPLATE
    (per-schema source template). Two-line, behavior-preserving fix.
  • Adds a small py_test that runs the generator with two very
    different PYTHONHASHSEED values across a handful of representative
    schemata and fails if any output byte differs. Test runs in ~4s.
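
The shape of the sorting fix can be sketched as follows. This is not the generator's actual code: `render_includes` is a hypothetical stand-in for the Jinja template loop, and the header names are invented.

```python
def render_includes(headers) -> str:
    """Hypothetical stand-in for a template loop emitting one #include per header."""
    return "".join(f'#include "{h}"\n' for h in headers)


headers = {"zeta.h", "alpha.h", "mid.h"}

# Before: iterating the set directly, so the order depends on string hashing.
unstable = render_includes(headers)

# After: sort first, so the order depends only on the header names.
stable = render_includes(sorted(headers))

print(stable)
assert stable == '#include "alpha.h"\n#include "mid.h"\n#include "zeta.h"\n'
# Both variants emit the same set of lines; only the order can differ.
assert sorted(unstable.splitlines()) == stable.splitlines()
```

Because `sorted()` returns a list with the same elements, a template that only iterates over the value needs no change, which is why the fix is behavior-preserving.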

Found while running a broader Bazel hermeticity check on dev: the
top-level //:redpanda binary already hashes identically across runs
(thanks to #30187), but a wider bazel-out diff surfaced ~20 schemata
.h files differing on #include line order. With this fix those
files are now byte-identical across runs.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

The generator iterated Python sets when emitting #include lines, so the
output of each codegen action varied with PYTHONHASHSEED (which CPython
picks fresh per interpreter). The C++ files were preprocessor-equivalent
across runs, so the final binary was unchanged, but Bazel keys its
action cache on input content hashes — non-deterministic codegen output
invalidates every downstream compile's cache key. On a shared remote
cache this means every developer compiling the kafka layer misses on
every action that consumes the generated headers.

Sort the header sets at the two sites where they're iterated by Jinja:
StructType.headers() (per-schema header template) and the
extra_schema_headers passed into COMBINED_SOURCE_TEMPLATE (per-schema
source template).

Add a small py_test that runs the generator with two very different
PYTHONHASHSEED values across a handful of representative schemata and
fails if any output byte differs.
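
A test of this kind might look roughly like the sketch below. It is an assumption-laden miniature, not the py_test added here: `GENERATOR` is a tiny stand-in script rather than the real generator.py, and the two seed values are arbitrary picks from the valid PYTHONHASHSEED range.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical miniature "generator": emits sorted #include lines from a set.
GENERATOR = '''
import sys
headers = {"alpha.h", "beta.h", "gamma.h", "delta.h", "epsilon.h"}
sys.stdout.write("".join(f'#include "{h}"\\n' for h in sorted(headers)))
'''


def run_generator(script_path: str, seed: str) -> bytes:
    """Run the script in a fresh interpreter and return its raw stdout bytes."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    completed = subprocess.run(
        [sys.executable, script_path],
        env=env,
        capture_output=True,
        check=True,
    )
    return completed.stdout


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "mini_generator.py")
    with open(script, "w") as f:
        f.write(GENERATOR)
    # Two very different seeds (assumed values; 0 disables randomization).
    out_a = run_generator(script, "0")
    out_b = run_generator(script, "4294967295")

# Fail if any output byte differs between the two runs.
assert out_a == out_b, "generator output depends on PYTHONHASHSEED"
```

Dropping the `sorted()` call from the stand-in script makes this check liable to fail, which is the regression the real test guards against.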
Contributor

Copilot AI left a comment

Pull request overview

Fixes non-determinism in the kafka schemata code generator by sorting Python sets at two iteration sites, ensuring byte-identical output across runs with different PYTHONHASHSEED values. This restores Bazel remote action cache hits for downstream compiles that consume the generated headers.

Changes:

  • Sort StructType.headers() output and extra_schema_headers before passing to Jinja templates.
  • Add a py_test that runs the generator with two PYTHONHASHSEED values across six representative schemata and asserts byte-equivalent output.
  • Register the new test in the schemata BUILD file with the JSON data dependencies it needs.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/v/kafka/protocol/schemata/generator.py | Sort header sets at the two Jinja iteration sites for deterministic output. |
| src/v/kafka/protocol/schemata/generator_reproducibility_test.py | New regression test that runs codegen under varying PYTHONHASHSEED and diffs the output. |
| src/v/kafka/protocol/schemata/BUILD | Declare the new py_test with required schema JSON and generator data deps. |

@vbotbuildovich
Collaborator

CI test results

test results on build#84419
| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FLAKY(PASS) | ShadowLinkBasicTests | test_link_creation_checks | {"source_cluster_spec": {"cluster_type": "redpanda"}} | integration | https://buildkite.com/redpanda/redpanda/builds/84419#019e2364-523a-4a4f-bf63-021095c8ffb5 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0336, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkBasicTests&test_method=test_link_creation_checks |
| FLAKY(PASS) | ShadowLinkBasicTests | test_link_creation_checks | {"source_cluster_spec": {"cluster_type": "redpanda"}} | integration | https://buildkite.com/redpanda/redpanda/builds/84419#019e2365-8575-41b9-bfb5-1c59fb93395e | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0336, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkBasicTests&test_method=test_link_creation_checks |

@travisdowns
Member Author

> The resulting C++ files were preprocessor-equivalent across runs, so
> the final binary was unchanged and this was invisible to a top-level

This is wrong: changing include order is definitely not "pre-processor equivalent", because the input file to the compiler will be different! The reordered files can (and do in this case) produce equivalent .o files, however, which I think is what Claude is getting at.

@travisdowns travisdowns merged commit 978a55c into redpanda-data:dev May 14, 2026
24 checks passed
@dotnwat
Member

dotnwat commented May 14, 2026

> > The resulting C++ files were preprocessor-equivalent across runs, so
> > the final binary was unchanged and this was invisible to a top-level
>
> This is wrong: changing include order is definitely not "pre-processor equivalent", because the input file to the compiler will be different! The reordered files can (and do in this case) produce equivalent .o files, however, which I think is what Claude is getting at.

Just to clarify: we are only critiquing Claude's statement about "pre-processor equivalent", and this PR still correctly produces a consistent order of includes?

6 participants