kafka/protocol: make schemata codegen reproducible #30468
The generator iterated Python sets when emitting #include lines, so the output of each codegen action varied with PYTHONHASHSEED (which CPython picks fresh per interpreter). The C++ files were preprocessor-equivalent across runs, so the final binary was unchanged, but Bazel keys its action cache on input content hashes — non-deterministic codegen output invalidates every downstream compile's cache key. On a shared remote cache this means every developer compiling the kafka layer misses on every action that consumes the generated headers. Sort the header sets at the two sites where they're iterated by Jinja: StructType.headers() (per-schema header template) and the extra_schema_headers passed into COMBINED_SOURCE_TEMPLATE (per-schema source template). Add a small py_test that runs the generator with two very different PYTHONHASHSEED values across a handful of representative schemata and fails if any output byte differs.
Pull request overview
Fixes non-determinism in the kafka schemata code generator by sorting Python sets at two iteration sites, ensuring byte-identical output across runs with different PYTHONHASHSEED values. This restores Bazel remote action cache hits for downstream compiles that consume the generated headers.
Changes:
- Sort `StructType.headers()` output and `extra_schema_headers` before passing to Jinja templates.
- Add a `py_test` that runs the generator with two `PYTHONHASHSEED` values across six representative schemata and asserts byte-equivalent output.
- Register the new test in the schemata `BUILD` file with the JSON data dependencies it needs.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/v/kafka/protocol/schemata/generator.py | Sort header sets at the two Jinja iteration sites for deterministic output. |
| src/v/kafka/protocol/schemata/generator_reproducibility_test.py | New regression test that runs codegen under varying PYTHONHASHSEED and diffs the output. |
| src/v/kafka/protocol/schemata/BUILD | Declare the new py_test with required schema JSON and generator data deps. |
CI test results: test results on build#84419
This is wrong: changing include order is definitely not "preprocessor-equivalent", because the input file to the compiler will be different! The two orderings can (and do in this case) produce equivalent .o files, however, which I think is what Claude is getting at.
Just to clarify, we are only critiquing Claude's statement about "preprocessor-equivalent", and this PR is still correctly producing a consistent order of includes?
The kafka schemata generator emits C++ source from JSON schemas. Earlier versions iterated Python sets when emitting `#include` lines, so the byte-level output of each codegen action varied with `PYTHONHASHSEED` (which CPython picks fresh per interpreter). The object file produced by compilation was unchanged by the header swaps, so the final binary was also unchanged and the problem was invisible to a top-level hash check of the `redpanda` binary. But Bazel keys its action cache on input content hashes, so non-deterministic codegen output invalidates every downstream compile's cache key. On a shared remote cache that means every developer compiling the kafka layer misses on every action that consumes the generated headers.
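The cache-key effect can be illustrated with a minimal sketch. It assumes two made-up header names and uses a plain SHA-256 of the file content as a stand-in for an input content hash; Bazel's real action key covers more than one file's bytes, so this only shows why a reordered include list counts as a different input.

```python
import hashlib


def render_includes(headers):
    """Emit one #include line per header, in the order given."""
    return "".join(f'#include "{h}"\n' for h in headers)


# Hypothetical header names; the real generator's headers are not shown here.
run_a = render_includes(["kafka/protocol/types.h", "model/fundamental.h"])
run_b = render_includes(["model/fundamental.h", "kafka/protocol/types.h"])

# Same set of include lines, different byte sequence.
same_includes = set(run_a.splitlines()) == set(run_b.splitlines())
digest_a = hashlib.sha256(run_a.encode()).hexdigest()
digest_b = hashlib.sha256(run_b.encode()).hexdigest()

print(same_includes)         # True
print(digest_a == digest_b)  # False
```

Any consumer keyed on the content digest sees `run_a` and `run_b` as unrelated inputs, even though a compiler may produce the same object file from both.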
This PR:
- Sorts the header sets at the two sites where they're iterated by Jinja: `StructType.headers()` (per-schema header template) and the `extra_schema_headers` set passed into `COMBINED_SOURCE_TEMPLATE` (per-schema source template). Two-line, behavior-preserving fix.
- Adds a small `py_test` that runs the generator with two very different `PYTHONHASHSEED` values across a handful of representative schemata and fails if any output byte differs. Test runs in ~4s.
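The shape of such a check can be sketched as follows. This is not the test added by the PR: the child script, its header names, and the `generate` helper are hypothetical stand-ins for the real generator, and the sketch only assumes that `PYTHONHASHSEED` is read at interpreter startup, which is why each run happens in a fresh subprocess.

```python
import os
import subprocess
import sys

# Hypothetical stand-in for the generator: emits #include lines from a set,
# either sorted (the fix) or in raw set-iteration order (the old behavior).
CHILD = r'''
import sys
headers = {"a/one.h", "b/two.h", "c/three.h", "d/four.h", "e/five.h"}
ordered = sorted(headers) if sys.argv[1] == "sorted" else list(headers)
sys.stdout.write("".join(f'#include "{h}"\n' for h in ordered))
'''


def generate(seed, mode):
    """Run the stand-in generator in a fresh interpreter with a fixed hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", CHILD, mode],
        env=env,
        check=True,
        capture_output=True,
    )
    return result.stdout


# Two very different seeds; 4294967295 is the largest value CPython accepts.
seeds = (0, 4294967295)
sorted_outputs = {generate(seed, "sorted") for seed in seeds}

# With sorting, every seed must yield the same bytes.
assert len(sorted_outputs) == 1, "sorted codegen output differs across seeds"

# Without sorting, the number of distinct outputs depends on the seeds tried,
# so it is only reported here, not asserted.
unsorted_outputs = {generate(seed, "unsorted") for seed in range(8)}
print(len(sorted_outputs), len(unsorted_outputs))
```

Comparing raw bytes rather than parsed include sets is deliberate: the cache cares about the exact file content, so any byte difference is a failure.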
Found while running a broader Bazel hermeticity check on `dev`: the top-level `//:redpanda` binary already hashes identically across runs (thanks to #30187), but a wider bazel-out diff surfaced ~20 schemata `.h` files differing on `#include` line order. With this fix those files are now byte-identical across runs.
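A wider output-tree diff of this kind can be approximated with a short script. The sketch below is an assumption about how such a comparison might be done, not the check that was actually run: `tree_digests` and `differing_files` are hypothetical helpers, and the two temporary directories stand in for the output trees of two separate builds.

```python
import hashlib
import tempfile
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def differing_files(root_a, root_b):
    """Return relative paths that are missing from one tree or differ in content."""
    a, b = tree_digests(root_a), tree_digests(root_b)
    return sorted(path for path in a.keys() | b.keys() if a.get(path) != b.get(path))


# Simulate two build outputs: one file identical, one differing only in include order.
with tempfile.TemporaryDirectory() as run_a, tempfile.TemporaryDirectory() as run_b:
    for root, order in ((run_a, ("x.h", "y.h")), (run_b, ("y.h", "x.h"))):
        Path(root, "same.h").write_text("#pragma once\n")
        Path(root, "schema.h").write_text(
            "".join(f'#include "{h}"\n' for h in order)
        )
    changed = differing_files(run_a, run_b)

print(changed)  # ['schema.h']
```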
Backports Required
Release Notes