kafka/protocol: make schemata codegen reproducible by travisdowns · Pull Request #30468 · redpanda-data/redpanda

travisdowns · 2026-05-13T21:55:53Z

The kafka schemata generator emits C++ source from JSON schemas. Earlier
versions iterated Python sets when emitting #include lines, so the
byte-level output of each codegen action varied with PYTHONHASHSEED
(which CPython randomizes per interpreter process when it is unset).
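
A minimal sketch of the effect, not taken from the generator itself: it prints the iteration order of a small set of strings in fresh interpreters started with different PYTHONHASHSEED values. The header names and the `order_under_seed` helper are hypothetical.

```python
import os
import subprocess
import sys

# Child program: print the iteration order of a small set of header names.
# The header names are placeholders, not real Redpanda headers.
CHILD = 'print(",".join({"a.h", "b.h", "c.h", "d.h", "e.h"}))'


def order_under_seed(seed: str) -> list[str]:
    """Run CHILD in a fresh interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    ).stdout.strip()
    return out.split(",")


if __name__ == "__main__":
    # The same elements come back each time, but their order may differ
    # from seed to seed because str hashes depend on the seed.
    for seed in ("1", "2", "12345"):
        print(seed, order_under_seed(seed))
```

A fixed seed yields a stable order, which is why a test has to set PYTHONHASHSEED explicitly to exercise more than one ordering.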

Compiling with the reordered headers produced identical object files,
so the final binary was also unchanged and the problem was invisible to
a top-level hash check of the "redpanda" binary.

But Bazel keys its action cache on input content
hashes — non-deterministic codegen output invalidates every
downstream compile's cache key. On a shared remote cache that means
every developer compiling the kafka layer misses on every action that
consumes the generated headers.
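
As a simplified illustration (this models a content-addressed key with a plain SHA-256 of the file text, which is an assumption and not Bazel's actual key computation), two renderings of the same includes in a different order hash differently, while sorted renderings agree. The `render` and `digest` helpers and the header names are hypothetical.

```python
import hashlib


def digest(text: str) -> str:
    """Content hash of a generated file (stand-in for a cache-key input)."""
    return hashlib.sha256(text.encode()).hexdigest()


def render(includes) -> str:
    """Render a tiny header: one #include per entry, then a fixed body."""
    lines = [f'#include "{h}"\n' for h in includes]
    return "".join(lines) + "struct example {};\n"


# Same set of includes, emitted in two different orders.
run_a = render(["a.h", "b.h"])
run_b = render(["b.h", "a.h"])

if __name__ == "__main__":
    print(digest(run_a) == digest(run_b))  # different bytes, different hash
    print(digest(render(sorted(["b.h", "a.h"]))) == digest(run_a))
```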

This PR:

- Sorts the header sets at the two sites where they're iterated by
  Jinja: StructType.headers() (per-schema header template) and the
  extra_schema_headers set passed into COMBINED_SOURCE_TEMPLATE
  (per-schema source template). Two-line, behavior-preserving fix.
- Adds a small py_test that runs the generator with two very
  different PYTHONHASHSEED values across a handful of representative
  schemata and fails if any output byte differs. Test runs in ~4s.
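
The shape of the fix and of the regression test can be sketched as follows. This is not the code from generator.py or generator_reproducibility_test.py: the child script, the header names, and the `generate` and `is_reproducible` helpers are hypothetical stand-ins, and plain string formatting replaces the Jinja templates.

```python
import os
import subprocess
import sys

# Stand-in "generator": emits #include lines from a set, either in raw set
# order or sorted, depending on its first argument.
CHILD = r"""
import sys
headers = {"x/a.h", "x/b.h", "x/c.h", "x/d.h", "x/e.h"}
ordered = sorted(headers) if sys.argv[1] == "sorted" else list(headers)
sys.stdout.write("".join('#include "%s"\n' % h for h in ordered))
"""


def generate(mode: str, seed: str) -> bytes:
    """Run the stand-in generator in a fresh interpreter under one seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", CHILD, mode],
        env=env,
        capture_output=True,
        check=True,
    )
    return result.stdout


def is_reproducible(mode: str, seeds=("1", "4294967295")) -> bool:
    """True if the output bytes are identical under every seed."""
    return len({generate(mode, seed) for seed in seeds}) == 1


if __name__ == "__main__":
    print("sorted:", is_reproducible("sorted"))
    print("raw:", is_reproducible("raw"))  # may be False, depending on seeds
```

Comparing raw bytes rather than parsed content matches the goal here, since a content-hash cache key is sensitive to any byte difference.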

Found while running a broader Bazel hermeticity check on dev: the
top-level //:redpanda binary already hashes identically across runs
(thanks to #30187), but a wider bazel-out diff surfaced ~20 schemata
.h files differing on #include line order. With this fix those
files are now byte-identical across runs.
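
One way such a diff between two output trees could be done is sketched below. It assumes two copies of an output directory captured from separate runs; the `tree_digests` and `differing_files` helpers are hypothetical and are not the hermeticity check used for this PR.

```python
import hashlib
import tempfile
from pathlib import Path


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def differing_files(a: Path, b: Path) -> list[str]:
    """Relative paths whose content differs, or that exist in only one tree."""
    da, db = tree_digests(a), tree_digests(b)
    return sorted(k for k in da.keys() | db.keys() if da.get(k) != db.get(k))


if __name__ == "__main__":
    # Tiny demo with two temporary trees standing in for two build outputs.
    with tempfile.TemporaryDirectory() as tmp:
        run1, run2 = Path(tmp, "run1"), Path(tmp, "run2")
        run1.mkdir()
        run2.mkdir()
        (run1 / "same.h").write_text('#include "a.h"\n')
        (run2 / "same.h").write_text('#include "a.h"\n')
        (run1 / "diff.h").write_text('#include "a.h"\n#include "b.h"\n')
        (run2 / "diff.h").write_text('#include "b.h"\n#include "a.h"\n')
        print(differing_files(run1, run2))
```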

Backports Required

Release Notes

none

The generator iterated Python sets when emitting #include lines, so the output of each codegen action varied with PYTHONHASHSEED (which CPython picks fresh per interpreter).

The reordered C++ files compiled to identical object files, so the final binary was unchanged, but Bazel keys its action cache on input content hashes, so non-deterministic codegen output invalidates every downstream compile's cache key. On a shared remote cache this means every developer compiling the kafka layer misses on every action that consumes the generated headers.

Sort the header sets at the two sites where they're iterated by Jinja: StructType.headers() (per-schema header template) and the extra_schema_headers passed into COMBINED_SOURCE_TEMPLATE (per-schema source template).

Add a small py_test that runs the generator with two very different PYTHONHASHSEED values across a handful of representative schemata and fails if any output byte differs.

Copilot

Pull request overview

Fixes non-determinism in the kafka schemata code generator by sorting Python sets at two iteration sites, ensuring byte-identical output across runs with different PYTHONHASHSEED values. This restores Bazel remote action cache hits for downstream compiles that consume the generated headers.

Changes:

- Sort StructType.headers() output and extra_schema_headers before passing to Jinja templates.
- Add a py_test that runs the generator with two PYTHONHASHSEED values across six representative schemata and asserts byte-equivalent output.
- Register the new test in the schemata BUILD file with the JSON data dependencies it needs.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
src/v/kafka/protocol/schemata/generator.py	Sort header sets at the two Jinja iteration sites for deterministic output.
src/v/kafka/protocol/schemata/generator_reproducibility_test.py	New regression test that runs codegen under varying `PYTHONHASHSEED` and diffs the output.
src/v/kafka/protocol/schemata/BUILD	Declare the new `py_test` with required schema JSON and generator data deps.

vbotbuildovich · 2026-05-13T23:12:05Z

CI test results

test results on build#84419

test_status	test_class	test_method	test_arguments	test_kind	job_url	passed	reason	test_history
FLAKY(PASS)	ShadowLinkBasicTests	test_link_creation_checks	{"source_cluster_spec": {"cluster_type": "redpanda"}}	integration	https://buildkite.com/redpanda/redpanda/builds/84419#019e2364-523a-4a4f-bf63-021095c8ffb5	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0336, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkBasicTests&test_method=test_link_creation_checks
FLAKY(PASS)	ShadowLinkBasicTests	test_link_creation_checks	{"source_cluster_spec": {"cluster_type": "redpanda"}}	integration	https://buildkite.com/redpanda/redpanda/builds/84419#019e2365-8575-41b9-bfb5-1c59fb93395e	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0336, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkBasicTests&test_method=test_link_creation_checks

travisdowns · 2026-05-14T15:18:45Z

> The resulting C++ files were preprocessor-equivalent across runs, so
> the final binary was unchanged and this was invisible to a top-level

This is wrong: changing include order is definitely not "pre-processor equivalent", since the input file to the compiler will be different! The two orderings can (and do in this case) produce equivalent .o files, however, which I think is what Claude is getting at.

dotnwat · 2026-05-14T16:46:01Z

> > The resulting C++ files were preprocessor-equivalent across runs, so
> > the final binary was unchanged and this was invisible to a top-level
>
> This is wrong: changing include order is definitely not "pre-processor equivalent", since the input file to the compiler will be different! The two orderings can (and do in this case) produce equivalent .o files, however, which I think is what Claude is getting at.

Just to clarify: we are only critiquing Claude's statement about "pre-processor equivalent", and this PR still correctly produces a consistent include order?

Copilot AI review requested due to automatic review settings May 13, 2026 21:55

github-actions Bot added area/build area/redpanda labels May 13, 2026

travisdowns requested review from StephanDollberg and pgellert May 13, 2026 21:56

Copilot started reviewing on behalf of travisdowns May 13, 2026 21:56

Copilot AI reviewed May 13, 2026

dotnwat approved these changes May 13, 2026

pgellert approved these changes May 14, 2026

StephanDollberg approved these changes May 14, 2026

travisdowns merged commit 978a55c into redpanda-data:dev May 14, 2026
24 checks passed

travisdowns merged 1 commit into redpanda-data:dev from travisdowns:td-kafka-schemata-reproducible

6 participants
