Datalake schema registry context #30132

Open

wdberkeley wants to merge 4 commits into dev from datalake-schema-registry-context

Conversation

@wdberkeley
Contributor

Add per-topic Schema Registry context support for datalake/Iceberg translation.

  • New topic property redpanda.schema.registry.context: binds a topic to a specific SR context namespace (e.g. .my_context) for schema ID resolution. Validated on set — must start with ., cannot contain :, cannot be the reserved .__GLOBAL name. Defaults to the default context (.).
  • Datalake translator wiring: the translator and coordinator now resolve schema IDs in the topic's configured context instead of always using the default. In-memory schema and resolved-type caches are keyed by (context, schema_id) to prevent cross-context cache poisoning.
  • E2E tests: ducktape tests verify context isolation (same schema ID in different contexts produces different Iceberg columns) and that lookups in the wrong context route records to the DLQ.

Design note

The SR context is not persisted alongside the schema_identifier in the coordinator STM. The coordinator reads the context from the topic's current configuration at resolution time. To safely change a topic's context mid-stream: disable translation, let the coordinator commit pending entries, change the context, then re-enable.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

Features

  • Iceberg translation now supports Schema Registry contexts. To have a topic resolve schemas in a particular context, set the redpanda.schema.registry.context topic property to the context name.

Copilot AI review requested due to automatic review settings April 10, 2026 21:02
@wdberkeley wdberkeley force-pushed the datalake-schema-registry-context branch from fb6fc42 to 3661d1d Compare April 10, 2026 21:02
Contributor

Copilot AI left a comment

Pull request overview

Adds per-topic Schema Registry context support for datalake/Iceberg translation, ensuring schema resolution and caching are isolated by SR context.

Changes:

  • Introduces a new topic property redpanda.schema.registry.context, including validation, alter-config handling, and config reporting.
  • Wires the datalake translator/coordinator to resolve schema IDs within the topic’s configured SR context and isolates in-memory caches by (context, schema_id).
  • Adds unit + ducktape E2E tests covering non-default context resolution, strict no-fallback behavior, and cache isolation.

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 2 comments.

Summary per file:

  • tools/offline_log_viewer/controller.py: Extends topic-properties decoding to include schema_registry_context for newer serde versions.
  • tests/rptest/tests/datalake/schema_registry_context_test.py: New ducktape E2E coverage for context isolation and DLQ behavior when resolving in the wrong context.
  • tests/rptest/clients/types.py: Adds TopicSpec constant for the new topic property name.
  • src/v/pandaproxy/schema_registry/types.h / types.cc: Adds validate_context() helper for context format validation.
  • src/v/kafka/server/handlers/topics/{types.h,types.cc,validators.h}: Declares the topic property and validates it on CreateTopics.
  • src/v/kafka/server/handlers/{alter_configs.cc,incremental_alter_configs.cc}: Supports altering the new property via (incremental) AlterConfigs.
  • src/v/kafka/server/handlers/configs/{config_utils.h,config_response_utils.cc,storage_mode_properties.h}: Adds validation and DescribeConfigs/reporting support for the new context type/property.
  • src/v/datalake/{schema_identifier.h,record_schema_resolver.h,record_schema_resolver.cc,datalake_manager.cc}: Makes schema/type caches context-aware and threads context through resolvers used by translators.
  • src/v/datalake/coordinator/coordinator.cc: Resolves identifiers using the topic’s current configured context at resolution time.
  • src/v/datalake/tests/{test_utils.cc,record_schema_resolver_test.cc}: Updates and adds unit tests for context-aware resolution and cache isolation.
  • src/v/cluster/{topic_properties.h,topic_properties.cc,types.h,types.cc,topic_table.cc}: Persists and propagates the new topic property through cluster topic configuration/update plumbing.
  • src/v/cluster/tests/topic_properties_generator.h: Generates randomized topic properties including non-default contexts for tests.
  • src/v/cluster_link/utils/topic_properties_utils.cc: Propagates the new property through cluster-link update parsing.
  • src/v/{kafka/server/BUILD,cluster/BUILD,cluster_link/utils/BUILD}: Adds build deps for schema registry types where needed.

Comment thread src/v/kafka/server/handlers/configs/config_utils.h
Comment thread tests/rptest/tests/datalake/schema_registry_context_test.py Outdated
@wdberkeley wdberkeley force-pushed the datalake-schema-registry-context branch from 3661d1d to d32b1bf Compare April 10, 2026 21:58
@wdberkeley wdberkeley requested review from a team, kbatuigas and r-vasquez as code owners April 10, 2026 21:58
@wdberkeley wdberkeley force-pushed the datalake-schema-registry-context branch from d32b1bf to 864f6f8 Compare April 10, 2026 21:59
@github-actions

The latest Buf updates on your PR. Results from workflow Buf CI / validate (pull_request).

Build: ✅ passed · Format: ⏩ skipped · Lint: ✅ passed · Breaking: ✅ passed · Updated (UTC): Apr 10, 2026, 9:59 PM

@wdberkeley wdberkeley removed request for a team, kbatuigas and r-vasquez April 10, 2026 21:59
@vbotbuildovich
Collaborator

vbotbuildovich commented Apr 10, 2026

Retry command for Build#83041

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/describe_topics_test.py::DescribeTopicsTest.test_describe_topics_with_documentation_and_types

@vbotbuildovich
Collaborator

vbotbuildovich commented Apr 10, 2026

CI test results

test results on build#83041
  • FAIL: DescribeTopicsTest.test_describe_topics_with_documentation_and_types (integration, passed 0/11). Test FAILS after retries. Significant increase in flaky rate (baseline=0.0000, p0=0.0000, reject_threshold=0.0100). Job: https://buildkite.com/redpanda/redpanda/builds/83041#019d7970-8009-4e62-858b-4f1b87af817b Test history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DescribeTopicsTest&test_method=test_describe_topics_with_documentation_and_types
  • FAIL: DescribeTopicsTest.test_describe_topics_with_documentation_and_types (integration, passed 0/11). Test FAILS after retries. Significant increase in flaky rate (baseline=0.0000, p0=0.0000, reject_threshold=0.0100). Job: https://buildkite.com/redpanda/redpanda/builds/83041#019d7971-b2d6-47cb-a6e6-af5c89a6e55d Test history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DescribeTopicsTest&test_method=test_describe_topics_with_documentation_and_types

test results on build#83415
  • FLAKY(PASS): WriteCachingFailureInjectionE2ETest.test_crash_all {"use_transactions": false} (integration, passed 9/11). Test PASSES after retries. No significant increase in flaky rate (baseline=0.0688, p0=0.5096, reject_threshold=0.0100; adj_baseline=0.1925, p1=0.3989, trust_threshold=0.5000). Job: https://buildkite.com/redpanda/redpanda/builds/83415#019dacca-2c38-4c88-b133-bf409d16c7c2 Test history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all

test results on build#83764
  • FAIL: AvailabilityTests.test_recovery_after_catastrophic_failure (integration, passed 0/1). Test is INCONCLUSIVE after retries. Inconclusive result before max retries (baseline=0.0000, p0=1.0000, reject_threshold=0.0100; adj_baseline=0.1000, p1=1.0000, trust_threshold=0.5000). Job: https://buildkite.com/redpanda/redpanda/builds/83764#019dd517-b954-438f-b7d0-6249b1cd4a8e Test history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AvailabilityTests&test_method=test_recovery_after_catastrophic_failure
  • FLAKY(PASS): ShadowLinkingMetricsTests.test_link_metrics (integration, passed 10/11). Test PASSES after retries. No significant increase in flaky rate (baseline=0.0000, p0=1.0000, reject_threshold=0.0100; adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000). Job: https://buildkite.com/redpanda/redpanda/builds/83764#019dd517-b954-438f-b7d0-6249b1cd4a8e Test history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingMetricsTests&test_method=test_link_metrics

@wdberkeley wdberkeley force-pushed the datalake-schema-registry-context branch 2 times, most recently from d159f56 to 8c59496 Compare April 20, 2026 15:47
@wdberkeley wdberkeley requested a review from Copilot April 20, 2026 18:36
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 34 out of 34 changed files in this pull request and generated 2 comments.

Comment thread src/v/datalake/record_schema_resolver.cc
Comment thread tools/offline_log_viewer/controller.py
@wdberkeley wdberkeley force-pushed the datalake-schema-registry-context branch from 8c59496 to 81b9cf9 Compare April 20, 2026 21:01
@wdberkeley wdberkeley force-pushed the datalake-schema-registry-context branch 3 times, most recently from c22e833 to 7cd9274 Compare April 28, 2026 16:53
@vbotbuildovich
Collaborator

Retry command for Build#83764

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/availability_test.py::AvailabilityTests.test_recovery_after_catastrophic_failure

@pgellert pgellert self-requested a review May 6, 2026 12:28
Contributor

@nvartolomei left a comment

lgtm at high level - will take a break and review carefully

#include "model/fundamental.h"
#include "model/metadata.h"
#include "model/namespace.h"
#include "pandaproxy/schema_registry/types.h"
Contributor

that's a relatively unrelated dependency with quite a few additional transitive dependencies to pull in

not worth imho

Contributor

split validation into a separate file?

Contributor

also string in - optional error out/std::expected<void, string>?

also return type should be context_invalid and not subject_invalid - it does exist already

Contributor

the throwing variant can be built on top if SR needs it

Contributor

not duplicating sounds like nice idea but you are duplicating the rules anyway albeit in text only version

return fmt::format(
"redpanda.schema.registry.context `{}' is invalid: must start "
"with '.', must not contain ':', and must not be the reserved "
"'.__GLOBAL' context",

Contributor Author

Refactored as suggested.

.error_code,
kafka::error_code::none);

// Changing schema_registry_context while translation is enabled must fail.
Contributor

what happens when a user sets both during i.e. topic creation? do we sequence them correctly?

Contributor

also, by claude

alter_configs full-replace can silently strip the context while iceberg is enabled.

  • src/v/kafka/server/handlers/alter_configs.cc:454-477 rejects an explicit set of redpanda.schema.registry.context while iceberg_mode != disabled. But alter_configs is full-replace: at
    line 98, std::apply(apply_op(op_t::remove), update.properties.serde_fields()) initializes every property's op to remove. If the user sends an alter request that omits the
    schema_registry_context key, the property is removed (reset to default), and no branch in the loop fires the iceberg-state check. The coordinator will then resolve schema ids against
    the default context for in-flight entries committed under a non-default context — the exact poisoning scenario the author tries to prevent. Fix: after the loop, if
    update.properties.schema_registry_context.op == remove AND it would actually change the value AND iceberg is currently enabled, return invalid_config. Or model with op_t::none default
    (precedent at lines 125-127 for remote_*/iceberg_mode).
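The failure mode described above can be reproduced with a toy model of full-replace AlterConfigs semantics. This is purely illustrative Python, not the Redpanda implementation; the property names come from the PR, the function is hypothetical.

```python
def full_replace_alter(defaults: dict, request: dict) -> dict:
    """Full-replace AlterConfigs semantics: any key absent from the
    request is implicitly reset to its default value."""
    merged = dict(defaults)
    merged.update(request)
    return merged

defaults = {"redpanda.schema.registry.context": ".",
            "redpanda.iceberg.mode": "disabled"}
current = {"redpanda.schema.registry.context": ".my_ctx",
           "redpanda.iceberg.mode": "key_value"}
# The user re-sends only the iceberg mode and omits the context key:
after = full_replace_alter(defaults, {"redpanda.iceberg.mode": "key_value"})
# after["redpanda.schema.registry.context"] is back to "." even though
# iceberg translation stayed enabled, which is the silent-strip hazard.
```

Modeling the property with an op_t::none default, as the comment suggests, avoids treating an omitted key as a removal.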

Contributor Author

what happens when a user sets both during i.e. topic creation? do we sequence them correctly?

Added test applying both to show it works.

Contributor Author

model with op_t::none default
(precedent at lines 125-127 for remote_*/iceberg_mode)

Done, with test.

Comment thread src/v/kafka/server/handlers/alter_configs.cc
Comment on lines +454 to +460
auto topic_md = topic_table_.get_topic_metadata_ref(
model::topic_namespace_view{model::kafka_namespace, topic});
auto sr_ctx = topic_md ? topic_md->get()
.get_configuration()
.properties.schema_registry_context.value_or(
pandaproxy::schema_registry::default_context)
: pandaproxy::schema_registry::default_context;
Contributor

Did you consider having the translators pass the context to the coordinator in the RPC? It's a bit surprising to me that that isn't the case, particularly because the RPC request contains other topic + schema information already. I guess we would expect them to always be the same...

Contributor Author

Yeah, but the coordinator already has access to the topic config so that it can read the context info. There's an invariant that context doesn't change while translation is active, so there can't be a mismatch between what the coordinator sees as the context and what the translator did.

@@ -30,6 +30,24 @@ struct schema_identifier
bool operator==(const schema_identifier&) const = default;
Contributor

Based on the commit message, there might be a misunderstanding about what is persisted -- at least, I'm under the impression that the schema_identifer is not persisted and that it's only serialized over RPC.

Contributor Author

Yeah, that is confusing. "Persisted" there is referring to the wire-compatibility implications of changing schema_identifier... not that it's persisted :). Will update the description.

Comment on lines +91 to +99
def _make_confluent_record(self, schema_id, schema_dict, record):
"""Build a Confluent wire-format payload: magic byte + 4-byte
schema ID + Avro binary-encoded record."""
parsed = avro.schema.parse(json.dumps(schema_dict))
buf = io.BytesIO()
buf.write(struct.pack(">bI", 0, schema_id))
encoder = avro.io.BinaryEncoder(buf)
writer = avro.io.DatumWriter(parsed)
writer.write(record, encoder)
Contributor

This seems off to me, but I'm not an expert in confluent kafka python. Is this actually the correct way to write Avro with a schema in a given context? I would have expected this is all handled by the library. If this isn't supported by the library or something, please add a comment explaining why we need to create the bytes manually

Contributor Author

I'll have the clanker rewrite it to form the messages legit.
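For reference, the header the test helper builds is just the Confluent wire format: one zero magic byte followed by a big-endian 4-byte schema ID. A minimal, standalone round-trip sketch (function names are illustrative):

```python
import struct

def make_header(schema_id: int) -> bytes:
    """Confluent wire format: magic byte 0, then a 4-byte big-endian schema ID."""
    return struct.pack(">bI", 0, schema_id)

def parse_header(payload: bytes) -> int:
    """Recover the schema ID from the 5-byte wire-format prefix."""
    magic, schema_id = struct.unpack(">bI", payload[:5])
    if magic != 0:
        raise ValueError("not Confluent wire format")
    return schema_id
```

The SR context is not encoded in the payload itself; it only changes where the broker looks the schema ID up, which is why the test can reuse the same bytes against different contexts.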

The incremental topic update reader stopped at reader_version=8,
silently dropping fields added in later serde versions:

  v9: message_timestamp_before_max_ms, message_timestamp_after_max_ms
      (added in 98fc4e2)
  v10: remote_label, storage_mode
      (added in 03678e1)

Bump the reader to version=10 and decode the missing fields.

Add a new topic property `redpanda.schema.registry.context` that binds
a topic to a specific Schema Registry context for schema id resolution.
This lets the in-broker Iceberg translator (and future schema id
validation) look up schemas in the correct SR context namespace.

The property is stored as `std::optional<context>`; nullopt means the
default SR context ("."). Validation rejects values that don't start
with '.', contain ':', or match the reserved '.__GLOBAL' context name.
Validation logic lives in a shared `validate_context()` helper in
pandaproxy/schema_registry/types.h.

Pure plumbing: the property is visible and settable via create-topic,
alter-configs, incremental-alter-configs, and describe-configs, but has
no runtime effect yet (wired to the datalake resolver in the next
commit). Also plumbed through cluster-link property propagation and the
offline log viewer.

Wire the new `schema_registry_context` topic property into the datalake
translator's schema resolution path. Both `record_schema_resolver` and
`latest_subject_schema_resolver` now accept a context parameter and use
it instead of the hardcoded `default_context` when calling the Schema
Registry.

Extend the shared schema and resolved-type caches with context-aware
keys (`context_schema_cache_key` and `context_schema_identifier`) so
that topics bound to different SR contexts on the same shard don't
poison each other's cache entries.

A topic's context can't be changed while translation is enabled. This
prevents races in translation and commit.
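The context-aware cache keying can be pictured with a small Python analogue of `context_schema_cache_key` (illustrative only; the real keys are C++ types):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContextSchemaCacheKey:
    """The SR context is part of the key, so the same numeric schema ID
    resolved in two different contexts maps to two distinct cache entries."""
    context: str
    schema_id: int

cache: dict[ContextSchemaCacheKey, str] = {}
cache[ContextSchemaCacheKey(".ctx_a", 7)] = "schema-from-ctx-a"
cache[ContextSchemaCacheKey(".ctx_b", 7)] = "schema-from-ctx-b"
```

Keying on the ID alone would let one topic's resolution poison the cache for every topic sharing the shard.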

Add a ducktape integration test verifying the full end-to-end path for
the `redpanda.schema.registry.context` topic property: SR schema
registration in contexts, topic property configuration, translator
schema resolution, and typed Iceberg columns.

test_context_isolation: two topics bound to different SR contexts resolve
different schemas from the same numeric schema ID, producing different
Iceberg table column layouts.

test_wrong_context_dlq: schema ID not present in the configured context
sends records to the dead-letter-queue table.
@wdberkeley wdberkeley force-pushed the datalake-schema-registry-context branch from 7cd9274 to 4e64c71 Compare May 12, 2026 21:47


5 participants