cl/test: rnot: add cloud topic workloads by Lazin · Pull Request #30435 · redpanda-data/redpanda

Lazin · 2026-05-11T16:58:08Z

Adds cloud-topic and tiered-cloud-topic workloads to the shadow linking random node ops test so we exercise plain cloud and tiered_cloud storage modes alongside the existing si, compacted, and transactional workloads. Enables the explicit-only tiered_cloud_topics feature on both clusters and CLOUD_TOPICS_CONFIG_STR cluster-wide; allow-lists the expected cloud-topics shutdown warnings.

Fixes a bug in the write-at-offset code path in the cloud topics frontend. The frontend was converting batches of every type into placeholders, which stalled the target cluster. The second commit in the PR fixes this.

Finally, the test adds a new workload that constantly flips between cloud and tiered_cloud modes. The goal is to have a mix of raft_data and ct_placeholder batches in the partition being shadowed.

Backports Required

Release Notes

none

Copilot

Pull request overview

Extends the shadow-linking random node operations test to also exercise cloud-topics and tiered-cloud-topics storage modes during node operations, including enabling the necessary cluster config/feature flags and allow-listing expected cloud-topics shutdown/retry log messages.

Changes:

Enable cloud_topics_enabled on both clusters and activate the explicit-only tiered_cloud_topics feature before topic creation.
Increase preallocated client nodes and ducktape cluster node count to support two additional concurrent workloads.
Add two new workloads (cloud-topic, tiered-cloud-topic) using redpanda.storage.mode topic config and allow-list expected cloud-topics shadow-link logs.

                extra_rp_conf={
                    "group_new_member_join_timeout": 3000,
+                    CLOUD_TOPICS_CONFIG_STR: True,
                },


@@ -276,6 +280,7 @@ def __init__(self, test_ctx: TestContext):
                "retention_local_trim_interval": 5000,
                "partition_autobalancing_tick_interval_ms": 2000,
                "group_new_member_join_timeout": 3000,
+                CLOUD_TOPICS_CONFIG_STR: True,
            },


+        # enabled on both clusters before any tiered_cloud topic can be
+        # created.
+        self.source_cluster.service.set_feature_active(
+            "tiered_cloud_topics", True, timeout_sec=30
+        )


vbotbuildovich · 2026-05-11T18:15:00Z

CI test results

test results on build#84286

test_status	test_class	test_method	test_arguments	test_kind	job_url	passed	reason	test_history
FLAKY(PASS)	ShadowLinkBasicTests	test_link_creation_checks	{"source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}}	integration	https://buildkite.com/redpanda/redpanda/builds/84286#019e1802-6e95-4755-8941-7102b14e57e6	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0121, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkBasicTests&test_method=test_link_creation_checks
FLAKY(PASS)	Datalake3rdPartyMaintenanceTest	test_e2e_basic	{"catalog_type": "rest_hadoop", "cloud_storage_type": 1, "query_engine": "trino"}	integration	https://buildkite.com/redpanda/redpanda/builds/84286#019e1801-339b-4231-8b6b-79f43ff986f4	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=Datalake3rdPartyMaintenanceTest&test_method=test_e2e_basic
FLAKY(PASS)	WriteCachingFailureInjectionE2ETest	test_crash_all	{"use_transactions": false}	integration	https://buildkite.com/redpanda/redpanda/builds/84286#019e1801-33a0-4837-90cb-d329d573bfe5	9/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0975, p0=0.6415, reject_threshold=0.0100. adj_baseline=0.2649, p1=0.2121, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all

test results on build#84317

test_status	test_class	test_method	test_arguments	test_kind	job_url	passed	reason	test_history
FLAKY(PASS)	ShadowLinkingReplicationTests	test_with_restart	{"storage_mode": "cloud"}	integration	https://buildkite.com/redpanda/redpanda/builds/84317#019e1bb4-c904-4687-a455-376509e29248	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0305, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_with_restart
FAIL	ShadowLinkingRandomOpsTest	test_node_operations	{"failures": false, "workload_set": "cloud_combos"}	integration	https://buildkite.com/redpanda/redpanda/builds/84317#019e1bb4-c904-4687-a455-376509e29248	0/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations
FAIL	ShadowLinkingRandomOpsTest	test_node_operations	{"failures": false, "workload_set": "cloud_combos"}	integration	https://buildkite.com/redpanda/redpanda/builds/84317#019e1bb6-7282-4ecf-a3d6-f8c8957b188f	0/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations
FAIL	ShadowLinkingRandomOpsTest	test_node_operations	{"failures": true, "workload_set": "cloud_combos"}	integration	https://buildkite.com/redpanda/redpanda/builds/84317#019e1bb4-c906-4a15-981b-e4328ad8375c	0/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations
FAIL	ShadowLinkingRandomOpsTest	test_node_operations	{"failures": true, "workload_set": "cloud_combos"}	integration	https://buildkite.com/redpanda/redpanda/builds/84317#019e1bb6-7284-431b-ad2e-457808edc9a3	0/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations

test results on build#84334

test_status	test_class	test_method	test_arguments	test_kind	job_url	passed	reason	test_history
FLAKY(PASS)	ShadowLinkBasicTests	test_link_creation_checks	{"source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}}	integration	https://buildkite.com/redpanda/redpanda/builds/84334#019e1cca-dd8e-4c47-96c7-7b6c2f245bba	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0225, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkBasicTests&test_method=test_link_creation_checks
FLAKY(PASS)	ShadowLinkingRandomOpsTest	test_node_operations	{"failures": true, "workload_set": "basic"}	integration	https://buildkite.com/redpanda/redpanda/builds/84334#019e1ccb-528e-440b-9643-891f40a27ca9	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations

vbotbuildovich · 2026-05-12T11:33:14Z

Retry command for Build#84317

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/shadow_linking_rnot_test.py::ShadowLinkingRandomOpsTest.test_node_operations@{"failures":false,"workload_set":"cloud_combos"}
tests/rptest/tests/shadow_linking_rnot_test.py::ShadowLinkingRandomOpsTest.test_node_operations@{"failures":true,"workload_set":"cloud_combos"}

pgellert

Looks good to me, but I'll let someone from the cloud topics team approve

pgellert · 2026-05-13T13:02:40Z

+        for (auto&& b : passthrough_batches) {
+            final_batches.push_back(std::move(b));
+        }


I think this would be simpler here:

Suggested change

-        for (auto&& b : passthrough_batches) {
-            final_batches.push_back(std::move(b));
-        }
+        final_batches = std::move(passthrough_batches);

this was simplified quite a bit

andrwng · 2026-05-15T19:13:25Z

+    // (raft_configuration, tx_fence, control batches like transaction
+    // commit/abort markers, etc.) carry their payload in the record


I'm kind of confused by this -- I was under the impression that the only batches we expect here are data batches (which may include tx control batches, but not raft configuration). Is that the case? What non-raft_data batches do we expect here? If we're just being conservative here, could we update the comment to indicate that?

I added the assertion and the comment below.

andrwng · 2026-05-15T19:13:27Z

+    // Per-input-position slots: true means the slot will be filled with a
+    // generated placeholder, false means it carries a pass-through batch
+    // already stored in `passthrough_batches`.
+    chunked_vector<bool> is_data_slot;


Maybe it makes sense to make this a set of non-data indexes. At least, I imagine we're more likely to have zero non-data batches in most cases.
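To make the suggestion above concrete, here is a minimal Python model of interleaving driven by a set of non-data indexes instead of a per-slot boolean vector. It is only a sketch: the function name and the string stand-ins for batches are hypothetical, and the PR itself is C++.

```python
from typing import List, Sequence, Set


def interleave_by_index_set(
    placeholders: Sequence[str],
    passthrough: Sequence[str],
    non_data_indexes: Set[int],
) -> List[str]:
    """Rebuild the input order from two lists plus the non-data positions."""
    # Every non-data position must have exactly one pass-through batch.
    assert len(passthrough) == len(non_data_indexes)
    total = len(placeholders) + len(passthrough)
    ph_it, pt_it = iter(placeholders), iter(passthrough)
    out = []
    for i in range(total):
        # Positions recorded in the set take the next pass-through batch,
        # all other positions take the next generated placeholder.
        out.append(next(pt_it) if i in non_data_indexes else next(ph_it))
    return out
```

In the common case with no non-data batches the set is empty and nothing is allocated per input position.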

andrwng · 2026-05-15T19:13:28Z

+        // Interleave generated placeholders with pass-through batches to
+        // restore the original input order.
+        auto ph_it = placeholders.batches.begin();
+        auto pt_it = passthrough_batches.begin();
+        for (bool is_data : is_data_slot) {
+            if (is_data) {
+                vassert(
+                  ph_it != placeholders.batches.end(),
+                  "placeholder count mismatch for {}",
+                  ntp());
+                final_batches.push_back(std::move(*ph_it++));
+            } else {
+                vassert(
+                  pt_it != passthrough_batches.end(),
+                  "passthrough count mismatch for {}",
+                  ntp());
+                final_batches.push_back(std::move(*pt_it++));
+            }
+        }


I'm wondering if we need the bitmap at all. Do these batches have offsets assigned already? If so, could we merge them by offset?
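If the batches do carry assigned offsets, the merge proposed above needs no bitmap. The following Python sketch models that idea under two assumptions that the thread does not confirm: both lists are already sorted by base offset, and base offsets are unique. Batches are modelled as hypothetical `(base_offset, label)` tuples.

```python
import heapq
from typing import List, Tuple

# Hypothetical stand-in for a batch: (base_offset, label).
Batch = Tuple[int, str]


def merge_by_offset(placeholders: List[Batch], passthrough: List[Batch]) -> List[Batch]:
    """Merge two offset-sorted lists into one offset-sorted list."""
    # heapq.merge is a lazy k-way merge; key selects the base offset.
    return list(heapq.merge(placeholders, passthrough, key=lambda b: b[0]))
```

For example, `merge_by_offset([(0, "ph"), (5, "ph")], [(3, "ctl")])` yields the batches in offset order 0, 3, 5.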

andrwng · 2026-05-18T17:30:16Z

-    headers.reserve(batches.size());
-    for (const auto& batch : batches) {
-        headers.push_back(batch.header());
+    // Only user data batches (raft_data with !is_control()) are uploaded


Shower thought, write_at_offset might be a natural place to write directly as L1. In the context of shadow linking, we know the data is stable and has assigned offsets.

I thought about this. The problem is that L1 objects are bounded by the last stable offset, but shadowing is bounded only by the high watermark. So write_at_offset has to replicate batches that belong to transactions that are not yet committed, plus control batches.
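A small worked example may help with the two bounds in this reply. The Python sketch below is an illustration, not code from the PR: `classify_offsets` is a hypothetical helper, and it assumes both the last stable offset and the high watermark are exclusive upper bounds.

```python
from typing import Iterable, List, Tuple


def classify_offsets(
    batch_offsets: Iterable[int],
    last_stable_offset: int,
    high_watermark: int,
) -> Tuple[List[int], List[int]]:
    """Split offsets into those below the LSO and those only below the HWM."""
    offsets = list(batch_offsets)
    # Below the LSO: stable, so these could in principle live in an L1 object.
    below_lso = [o for o in offsets if o < last_stable_offset]
    # Between LSO and HWM: visible to shadowing, may belong to open transactions.
    shadow_only = [o for o in offsets if last_stable_offset <= o < high_watermark]
    return below_lso, shadow_only
```

With offsets 0 through 9, a last stable offset of 4, and a high watermark of 8, offsets 0 to 3 fall under the L1 bound while offsets 4 to 7 still have to be shadowed.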

Adds cloud-topic and tiered-cloud-topic workloads to the shadow linking random node ops test so we exercise plain cloud and tiered_cloud storage modes alongside the existing si, compacted, and transactional workloads. Enables the explicit-only tiered_cloud_topics feature on both clusters and CLOUD_TOPICS_CONFIG_STR cluster-wide; allow-lists the expected cloud-topics shutdown warnings.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

For storage.mode=cloud topics, replicate_at_offset previously sent every input batch through stage_write/execute_write and wrapped each one as a ctp_placeholder. The placeholder encoding drops the record key, so for control records (e.g. transaction commit/abort markers) the original key bytes are lost and rm_stm's parse_control_batch throws std::out_of_range on the empty iobuf, halting state machine apply at the marker offset.

Split the input list into user data batches (raft_data with !is_control()) and pass-through batches (tx_fence, control batches, etc.). Only data batches are uploaded to L0 and wrapped as ctp_placeholders; the rest are forwarded to the write_at_offset_stm unchanged. The original input ordering is preserved by interleaving the generated placeholders with the pass-through batches.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
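The split-and-interleave step described in this commit message can be modelled in a few lines. The sketch below is a hypothetical Python model rather than the C++ from the PR: `Batch`, `make_placeholder`, and `replicate_at_offset_model` are illustrative names, and the upload to L0 is replaced by a stub.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Batch:
    # Hypothetical stand-in for a record batch header plus payload.
    base_offset: int
    type: str         # e.g. "raft_data"
    is_control: bool  # True for transaction commit/abort markers
    payload: str


def make_placeholder(batch: Batch) -> Batch:
    # Stub for "upload to L0 and wrap as a placeholder"; the real encoding differs.
    return Batch(batch.base_offset, "ctp_placeholder", False, f"ref:{batch.base_offset}")


def replicate_at_offset_model(batches: List[Batch]) -> List[Batch]:
    is_data_slot: List[bool] = []
    data: List[Batch] = []
    passthrough: List[Batch] = []
    for b in batches:
        # Only user data (raft_data and not a control batch) becomes a placeholder.
        is_data = b.type == "raft_data" and not b.is_control
        is_data_slot.append(is_data)
        (data if is_data else passthrough).append(b)

    placeholders = [make_placeholder(b) for b in data]

    # Walk the slots in input order, drawing from the matching list each time.
    ph_it, pt_it = iter(placeholders), iter(passthrough)
    return [next(ph_it) if is_data else next(pt_it) for is_data in is_data_slot]
```

In this model a commit marker placed between two data batches comes out untouched and in the same position, which is the property the fix relies on.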

Adds a new "flipping" workload_set matrix variant. A single workload runs against flipping-storage-topic while a background daemon thread toggles redpanda.storage.mode between cloud and tiered_cloud every 3 seconds on the source. Transient alter-config failures (leader changes, partition movement) are logged and retried on the next tick; the target config is not separately verified.

Wired through ClusterLinkingWorkloadSpec via optional flip_storage_modes / flip_interval_seconds fields so other workloads can opt in if needed.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
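The background toggling described in this commit message follows a common stop-event pattern. The sketch below shows that pattern in isolation, assuming a caller-supplied `alter` callback in place of the real rpk call; `flip_loop`, `run_flipper_demo`, and `fake_alter` are hypothetical names, and the short interval is for demonstration only.

```python
import threading
from typing import Callable, List, Sequence


def flip_loop(
    stop_event: threading.Event,
    modes: Sequence[str],
    interval_seconds: float,
    alter: Callable[[str], None],
) -> None:
    if not modes:
        return  # nothing to flip between
    idx = 0
    # Event.wait returns False on timeout (keep flipping) and True once set (exit).
    while not stop_event.wait(interval_seconds):
        mode = modes[idx % len(modes)]
        idx += 1
        try:
            alter(mode)
        except Exception as e:
            # Treat failures as transient and try the next mode on the next tick.
            print(f"flip to {mode} failed, retrying on next tick: {e}")


def run_flipper_demo(ticks: int = 4, interval_seconds: float = 0.01) -> List[str]:
    applied: List[str] = []
    enough = threading.Event()
    stop = threading.Event()

    def fake_alter(mode: str) -> None:
        # Stand-in for the alter-config call; records what was requested.
        applied.append(mode)
        if len(applied) >= ticks:
            enough.set()

    t = threading.Thread(
        target=flip_loop,
        args=(stop, ["cloud", "tiered_cloud"], interval_seconds, fake_alter),
        daemon=True,
    )
    t.start()
    enough.wait(timeout=5)  # let the loop run for the requested number of ticks
    stop.set()              # wakes the loop immediately, no need to wait a full interval
    t.join(timeout=5)
    return applied
```

Using `stop_event.wait` as the sleep means shutdown does not have to wait out the flip interval.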

andrwng · 2026-05-20T20:22:47Z

+        const auto& hdr = batch.header();
+        const bool is_data = hdr.type == model::record_batch_type::raft_data
+                             && !hdr.attrs.is_control();
+        const bool is_txn_control = hdr.type
+                                      == model::record_batch_type::raft_data
+                                    && hdr.attrs.is_control();
+        vassert(
+          is_data || is_txn_control,
+          "Unexpected batch type {} (control={}) for {} in "
+          "replicate_at_offset; only raft_data and transactional control "
+          "batches are supported",
+          hdr.type,
+          hdr.attrs.is_control(),
+          ntp());


nit: simpler to

vassert(hdr.type == model::record_batch_type::raft_data, "...");
auto is_data = !hdr.attrs.is_control();

andrwng · 2026-05-20T20:22:48Z

+    auto ph_it = placeholder_batches.begin();
+    auto pt_it = passthrough_batches.begin();


nit: it's a little scary how similar the names are, since it makes it easy to mix them up. Maybe call passthrough_batches control_batches instead? This also avoids introducing new batch concepts (even if the scope of "passthrough" was already quite limited)

andrwng · 2026-05-20T20:22:49Z

+    def _flip_storage_mode_loop(self, stop_event: threading.Event) -> None:
+        modes = self.spec.flip_storage_modes or []
+        idx = 0
+        while not stop_event.wait(self.spec.flip_interval_seconds):
+            mode = modes[idx % len(modes)]
+            idx += 1
+            try:
+                self.source_rpk.alter_topic_config(
+                    self.spec.topic,
+                    "redpanda.storage.mode",
+                    mode,
+                )
+                self.logger.debug(
+                    f"Flipped storage mode of {self.spec.topic} to {mode}"
+                )
+            except Exception as e:
+                # Transient errors (leadership changes, partition movement,
+                # etc.) are expected; keep retrying on the next tick.
+                self.logger.warning(
+                    f"Failed to flip storage mode of {self.spec.topic} to {mode}: {e}"
+                )
+


Wondering if this makes sense for regular RNOT too. If so, maybe it belongs in a shared class?

andrwng · 2026-05-20T20:22:50Z

+    )
+    @matrix(
+        failures=[False, True],
+        workload_set=["basic", "cloud_combos", "flipping"],


Curious what the rationale is for adding this as another matrix, vs adding the workload to cloud_combos.

Copilot AI review requested due to automatic review settings May 11, 2026 16:58

Copilot started reviewing on behalf of Lazin May 11, 2026 16:59

Copilot AI reviewed May 11, 2026

Lazin requested a review from pgellert May 11, 2026 18:58

Lazin force-pushed the ct/shadown-linking-rnot-test branch from 745e0db to c0ffaf5 Compare May 12, 2026 10:13

github-actions Bot added the area/redpanda label May 12, 2026

Lazin force-pushed the ct/shadown-linking-rnot-test branch from 6c0c15f to b0d9665 Compare May 12, 2026 15:18

Lazin requested review from WillemKauf and dotnwat May 12, 2026 15:56

pgellert reviewed May 13, 2026

dotnwat requested review from andrwng and nvartolomei May 13, 2026 20:37

andrwng reviewed May 15, 2026

andrwng reviewed May 18, 2026

Lazin added 3 commits May 20, 2026 11:30

Lazin force-pushed the ct/shadown-linking-rnot-test branch from b0d9665 to 2a0a5ea Compare May 20, 2026 16:34

Lazin requested review from andrwng and pgellert May 20, 2026 18:08

andrwng reviewed May 20, 2026
