Datalake schema registry context #30132

Open

wdberkeley wants to merge 4 commits into dev from datalake-schema-registry-context

Conversation

@wdberkeley
Contributor

Add per-topic Schema Registry context support for datalake/Iceberg translation.

  • New topic property redpanda.schema.registry.context: binds a topic to a specific SR context namespace (e.g. .my_context) for schema ID resolution. Validated on set — must start with ., cannot contain :, cannot be the reserved .__GLOBAL name. Defaults to the default context (.).
  • Datalake translator wiring: the translator and coordinator now resolve schema IDs in the topic's configured context instead of always using the default. In-memory schema and resolved-type caches are keyed by (context, schema_id) to prevent cross-context cache poisoning.
  • E2E tests: ducktape tests verify context isolation (same schema ID in different contexts produces different Iceberg columns) and that lookups in the wrong context route records to the DLQ.

Design note

The SR context is not persisted alongside the schema_identifier in the coordinator STM. The coordinator reads the context from the topic's current configuration at resolution time. To safely change a topic's context mid-stream: disable translation, let the coordinator commit pending entries, change the context, then re-enable.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

Features

  • Iceberg translation now supports Schema Registry contexts. To have a topic resolve schemas in a particular context, set the redpanda.schema.registry.context topic property to the context name.

Copilot AI review requested due to automatic review settings April 10, 2026 21:02
@wdberkeley wdberkeley force-pushed the datalake-schema-registry-context branch from fb6fc42 to 3661d1d Compare April 10, 2026 21:02
Contributor

Copilot AI left a comment

Pull request overview

Adds per-topic Schema Registry context support for datalake/Iceberg translation, ensuring schema resolution and caching are isolated by SR context.

Changes:

  • Introduces a new topic property redpanda.schema.registry.context, including validation, alter-config handling, and config reporting.
  • Wires the datalake translator/coordinator to resolve schema IDs within the topic’s configured SR context and isolates in-memory caches by (context, schema_id).
  • Adds unit + ducktape E2E tests covering non-default context resolution, strict no-fallback behavior, and cache isolation.

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 2 comments.

Summary per file:

  • tools/offline_log_viewer/controller.py: Extends topic-properties decoding to include schema_registry_context for newer serde versions.
  • tests/rptest/tests/datalake/schema_registry_context_test.py: New ducktape E2E coverage for context isolation and DLQ behavior when resolving in the wrong context.
  • tests/rptest/clients/types.py: Adds TopicSpec constant for the new topic property name.
  • src/v/pandaproxy/schema_registry/types.h / types.cc: Adds validate_context() helper for context format validation.
  • src/v/kafka/server/handlers/topics/{types.h,types.cc,validators.h}: Declares the topic property and validates it on CreateTopics.
  • src/v/kafka/server/handlers/{alter_configs.cc,incremental_alter_configs.cc}: Supports altering the new property via (incremental) AlterConfigs.
  • src/v/kafka/server/handlers/configs/{config_utils.h,config_response_utils.cc,storage_mode_properties.h}: Adds validation and DescribeConfigs/reporting support for the new context type/property.
  • src/v/datalake/{schema_identifier.h,record_schema_resolver.h,record_schema_resolver.cc,datalake_manager.cc}: Makes schema/type caches context-aware and threads context through resolvers used by translators.
  • src/v/datalake/coordinator/coordinator.cc: Resolves identifiers using the topic’s current configured context at resolution time.
  • src/v/datalake/tests/{test_utils.cc,record_schema_resolver_test.cc}: Updates and adds unit tests for context-aware resolution and cache isolation.
  • src/v/cluster/{topic_properties.h,topic_properties.cc,types.h,types.cc,topic_table.cc}: Persists and propagates the new topic property through cluster topic configuration/update plumbing.
  • src/v/cluster/tests/topic_properties_generator.h: Generates randomized topic properties including non-default contexts for tests.
  • src/v/cluster_link/utils/topic_properties_utils.cc: Propagates the new property through cluster-link update parsing.
  • src/v/{kafka/server/BUILD,cluster/BUILD,cluster_link/utils/BUILD}: Adds build deps for schema registry types where needed.

Comment thread src/v/kafka/server/handlers/configs/config_utils.h
Comment thread tests/rptest/tests/datalake/schema_registry_context_test.py Outdated
@wdberkeley wdberkeley force-pushed the datalake-schema-registry-context branch from 3661d1d to d32b1bf Compare April 10, 2026 21:58
@wdberkeley wdberkeley requested review from a team, kbatuigas and r-vasquez as code owners April 10, 2026 21:58
@wdberkeley wdberkeley force-pushed the datalake-schema-registry-context branch from d32b1bf to 864f6f8 Compare April 10, 2026 21:59
@github-actions

The latest Buf updates on your PR. Results from workflow Buf CI / validate (pull_request).

Build: ✅ passed · Format: ⏩ skipped · Lint: ✅ passed · Breaking: ✅ passed · Updated (UTC): Apr 10, 2026, 9:59 PM

@wdberkeley wdberkeley removed request for a team, kbatuigas and r-vasquez April 10, 2026 21:59
@vbotbuildovich
Collaborator

vbotbuildovich commented Apr 10, 2026

Retry command for Build#83041

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/describe_topics_test.py::DescribeTopicsTest.test_describe_topics_with_documentation_and_types

@vbotbuildovich
Collaborator

vbotbuildovich commented Apr 10, 2026

CI test results

test results on build#83041
  • FAIL: DescribeTopicsTest.test_describe_topics_with_documentation_and_types (integration, passed 0/11). Test FAILS after retries. Significant increase in flaky rate (baseline=0.0000, p0=0.0000, reject_threshold=0.0100). Job: https://buildkite.com/redpanda/redpanda/builds/83041#019d7970-8009-4e62-858b-4f1b87af817b Test history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DescribeTopicsTest&test_method=test_describe_topics_with_documentation_and_types
  • FAIL: DescribeTopicsTest.test_describe_topics_with_documentation_and_types (integration, passed 0/11). Test FAILS after retries. Significant increase in flaky rate (baseline=0.0000, p0=0.0000, reject_threshold=0.0100). Job: https://buildkite.com/redpanda/redpanda/builds/83041#019d7971-b2d6-47cb-a6e6-af5c89a6e55d Test history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DescribeTopicsTest&test_method=test_describe_topics_with_documentation_and_types

test results on build#83415
  • FLAKY(PASS): WriteCachingFailureInjectionE2ETest.test_crash_all {"use_transactions": false} (integration, passed 9/11). Test PASSES after retries. No significant increase in flaky rate (baseline=0.0688, p0=0.5096, reject_threshold=0.0100; adj_baseline=0.1925, p1=0.3989, trust_threshold=0.5000). Job: https://buildkite.com/redpanda/redpanda/builds/83415#019dacca-2c38-4c88-b133-bf409d16c7c2 Test history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all

test results on build#83764
  • FAIL: AvailabilityTests.test_recovery_after_catastrophic_failure (integration, passed 0/1). Test is INCONCLUSIVE after retries. Inconclusive result before max retries (baseline=0.0000, p0=1.0000, reject_threshold=0.0100; adj_baseline=0.1000, p1=1.0000, trust_threshold=0.5000). Job: https://buildkite.com/redpanda/redpanda/builds/83764#019dd517-b954-438f-b7d0-6249b1cd4a8e Test history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AvailabilityTests&test_method=test_recovery_after_catastrophic_failure
  • FLAKY(PASS): ShadowLinkingMetricsTests.test_link_metrics (integration, passed 10/11). Test PASSES after retries. No significant increase in flaky rate (baseline=0.0000, p0=1.0000, reject_threshold=0.0100; adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000). Job: https://buildkite.com/redpanda/redpanda/builds/83764#019dd517-b954-438f-b7d0-6249b1cd4a8e Test history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingMetricsTests&test_method=test_link_metrics

@wdberkeley wdberkeley force-pushed the datalake-schema-registry-context branch 2 times, most recently from d159f56 to 8c59496 Compare April 20, 2026 15:47
@wdberkeley wdberkeley requested a review from Copilot April 20, 2026 18:36
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 34 out of 34 changed files in this pull request and generated 2 comments.

Comment thread src/v/datalake/record_schema_resolver.cc
Comment thread tools/offline_log_viewer/controller.py
@wdberkeley wdberkeley force-pushed the datalake-schema-registry-context branch from 8c59496 to 81b9cf9 Compare April 20, 2026 21:01
@wdberkeley wdberkeley force-pushed the datalake-schema-registry-context branch 3 times, most recently from c22e833 to 7cd9274 Compare April 28, 2026 16:53
@vbotbuildovich
Collaborator

Retry command for Build#83764

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/availability_test.py::AvailabilityTests.test_recovery_after_catastrophic_failure

@pgellert pgellert self-requested a review May 6, 2026 12:28
Contributor

@nvartolomei left a comment

lgtm at high level - will take a break and review carefully

#include "model/fundamental.h"
#include "model/metadata.h"
#include "model/namespace.h"
#include "pandaproxy/schema_registry/types.h"
Contributor

that's a relatively unrelated dependency with quite a few additional transitive dependencies to pull in

not worth imho

Contributor

split validation into a separate file?

Contributor

also string in - optional error out/std::expected<void, string>?

also return type should be context_invalid and not subject_invalid - it does exist already

Contributor

the throwing variant can be built on top if SR needs it

Contributor

not duplicating sounds like nice idea but you are duplicating the rules anyway albeit in text only version

return fmt::format(
"redpanda.schema.registry.context `{}' is invalid: must start "
"with '.', must not contain ':', and must not be the reserved "
"'.__GLOBAL' context",

Contributor Author

Refactored as suggested.

.error_code,
kafka::error_code::none);

// Changing schema_registry_context while translation is enabled must fail.
Contributor

what happens when a user sets both during i.e. topic creation? do we sequence them correctly?

Contributor

also, by claude

alter_configs full-replace can silently strip the context while iceberg is enabled.

  • src/v/kafka/server/handlers/alter_configs.cc:454-477 rejects an explicit set of redpanda.schema.registry.context while iceberg_mode != disabled. But alter_configs is full-replace: at
    line 98, std::apply(apply_op(op_t::remove), update.properties.serde_fields()) initializes every property's op to remove. If the user sends an alter request that omits the
    schema_registry_context key, the property is removed (reset to default), and no branch in the loop fires the iceberg-state check. The coordinator will then resolve schema ids against
    the default context for in-flight entries committed under a non-default context — the exact poisoning scenario the author tries to prevent. Fix: after the loop, if
    update.properties.schema_registry_context.op == remove AND it would actually change the value AND iceberg is currently enabled, return invalid_config. Or model with op_t::none default
    (precedent at lines 125-127 for remote_*/iceberg_mode).
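The failure mode described above can be reproduced with a toy model of full-replace AlterConfigs semantics. This is purely illustrative Python, not the Redpanda implementation; the property names come from the PR, the function is hypothetical.

```python
def full_replace_alter(defaults: dict, request: dict) -> dict:
    """Full-replace AlterConfigs semantics: any key absent from the
    request is implicitly reset to its default value."""
    merged = dict(defaults)
    merged.update(request)
    return merged

defaults = {"redpanda.schema.registry.context": ".",
            "redpanda.iceberg.mode": "disabled"}
current = {"redpanda.schema.registry.context": ".my_ctx",
           "redpanda.iceberg.mode": "key_value"}
# The user re-sends only the iceberg mode and omits the context key:
after = full_replace_alter(defaults, {"redpanda.iceberg.mode": "key_value"})
# after["redpanda.schema.registry.context"] is back to "." even though
# iceberg translation stayed enabled, which is the silent-strip hazard.
```

Modeling the property with an op_t::none default, as the comment suggests, avoids treating an omitted key as a removal.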

Contributor Author

what happens when a user sets both during i.e. topic creation? do we sequence them correctly?

Added test applying both to show it works.

Contributor Author

model with op_t::none default
(precedent at lines 125-127 for remote_*/iceberg_mode)

Done, with test.

Comment thread src/v/kafka/server/handlers/alter_configs.cc
Comment on lines +454 to +460
auto topic_md = topic_table_.get_topic_metadata_ref(
model::topic_namespace_view{model::kafka_namespace, topic});
auto sr_ctx = topic_md ? topic_md->get()
.get_configuration()
.properties.schema_registry_context.value_or(
pandaproxy::schema_registry::default_context)
: pandaproxy::schema_registry::default_context;
Contributor

Did you consider having the translators pass the context to the coordinator in the RPC? It's a bit surprising to me that that isn't the case, particularly because the RPC request contains other topic + schema information already. I guess we would expect them to always be the same...

Contributor Author

Yeah, but the coordinator already has access to the topic config so that it can read the context info. There's an invariant that context doesn't change while translation is active, so there can't be a mismatch between what the coordinator sees as the context and what the translator did.

@@ -30,6 +30,24 @@ struct schema_identifier
bool operator==(const schema_identifier&) const = default;
Contributor

Based on the commit message, there might be a misunderstanding about what is persisted -- at least, I'm under the impression that the schema_identifer is not persisted and that it's only serialized over RPC.

Contributor Author

Yeah, that is confusing. "Persisted" there is referring to the wire-compatibility implications of changing schema_identifier... not that it's persisted :). Will update the description.

Comment on lines +91 to +99
def _make_confluent_record(self, schema_id, schema_dict, record):
"""Build a Confluent wire-format payload: magic byte + 4-byte
schema ID + Avro binary-encoded record."""
parsed = avro.schema.parse(json.dumps(schema_dict))
buf = io.BytesIO()
buf.write(struct.pack(">bI", 0, schema_id))
encoder = avro.io.BinaryEncoder(buf)
writer = avro.io.DatumWriter(parsed)
writer.write(record, encoder)
Contributor

This seems off to me, but I'm not an expert in confluent kafka python. Is this actually the correct way to write Avro with a schema in a given context? I would have expected this is all handled by the library. If this isn't supported by the library or something, please add a comment explaining why we need to create the bytes manually

Contributor Author

I'll have the clanker rewrite it to form the messages legit.
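For reference, the header the test helper builds is just the Confluent wire format: one zero magic byte followed by a big-endian 4-byte schema ID. A minimal, standalone round-trip sketch (function names are illustrative):

```python
import struct

def make_header(schema_id: int) -> bytes:
    """Confluent wire format: magic byte 0, then a 4-byte big-endian schema ID."""
    return struct.pack(">bI", 0, schema_id)

def parse_header(payload: bytes) -> int:
    """Recover the schema ID from the 5-byte wire-format prefix."""
    magic, schema_id = struct.unpack(">bI", payload[:5])
    if magic != 0:
        raise ValueError("not Confluent wire format")
    return schema_id
```

The SR context is not encoded in the payload itself; it only changes where the broker looks the schema ID up, which is why the test can reuse the same bytes against different contexts.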

The incremental topic update reader stopped at reader_version=8,
silently dropping fields added in later serde versions:

  v9: message_timestamp_before_max_ms, message_timestamp_after_max_ms
      (added in 98fc4e2)
  v10: remote_label, storage_mode
      (added in 03678e1)

Bump the reader to version=10 and decode the missing fields.

Add a new topic property `redpanda.schema.registry.context` that binds
a topic to a specific Schema Registry context for schema id resolution.
This lets the in-broker Iceberg translator (and future schema id
validation) look up schemas in the correct SR context namespace.

The property is stored as `std::optional<context>`; nullopt means the
default SR context ("."). Validation rejects values that don't start
with '.', contain ':', or match the reserved '.__GLOBAL' context name.
Validation logic lives in a shared `validate_context()` helper in
pandaproxy/schema_registry/types.h.

Pure plumbing: the property is visible and settable via create-topic,
alter-configs, incremental-alter-configs, and describe-configs, but has
no runtime effect yet (wired to the datalake resolver in the next
commit). Also plumbed through cluster-link property propagation and the
offline log viewer.

Wire the new `schema_registry_context` topic property into the datalake
translator's schema resolution path. Both `record_schema_resolver` and
`latest_subject_schema_resolver` now accept a context parameter and use
it instead of the hardcoded `default_context` when calling the Schema
Registry.

Extend the shared schema and resolved-type caches with context-aware
keys (`context_schema_cache_key` and `context_schema_identifier`) so
that topics bound to different SR contexts on the same shard don't
poison each other's cache entries.

A topic's context can't be changed while translation is enabled. This
prevents races in translation and commit.
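The context-aware cache keying can be pictured with a small Python analogue of `context_schema_cache_key` (illustrative only; the real keys are C++ types):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContextSchemaCacheKey:
    """The SR context is part of the key, so the same numeric schema ID
    resolved in two different contexts maps to two distinct cache entries."""
    context: str
    schema_id: int

cache: dict[ContextSchemaCacheKey, str] = {}
cache[ContextSchemaCacheKey(".ctx_a", 7)] = "schema-from-ctx-a"
cache[ContextSchemaCacheKey(".ctx_b", 7)] = "schema-from-ctx-b"
```

Keying on the ID alone would let one topic's resolution poison the cache for every topic sharing the shard.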

Add a ducktape integration test verifying the full end-to-end path for
the `redpanda.schema.registry.context` topic property: SR schema
registration in contexts, topic property configuration, translator
schema resolution, and typed Iceberg columns.

test_context_isolation: two topics bound to different SR contexts resolve
different schemas from the same numeric schema ID, producing different
Iceberg table column layouts.

test_wrong_context_dlq: schema ID not present in the configured context
sends records to the dead-letter-queue table.
@wdberkeley wdberkeley force-pushed the datalake-schema-registry-context branch from 7cd9274 to 4e64c71 Compare May 12, 2026 21:47


5 participants