cl/test: rnot: add cloud topic workloads #30435

Open
Lazin wants to merge 3 commits into redpanda-data:dev from Lazin:ct/shadown-linking-rnot-test

@Lazin Lazin commented May 11, 2026

Adds cloud-topic and tiered-cloud-topic workloads to the shadow linking random node ops test so we exercise plain cloud and tiered_cloud storage modes alongside the existing si, compacted, and transactional workloads. Enables the explicit-only tiered_cloud_topics feature on both clusters and CLOUD_TOPICS_CONFIG_STR cluster-wide; allow-lists the expected cloud-topics shutdown warnings.

Also fixes a bug in the write-at-offset code path of the cloud topics frontend. The frontend was converting batches of every type into placeholders, which caused a stall on the target cluster. The second commit in this PR contains the fix.

Finally, the test adds a new workload that repeatedly flips the topic between cloud and tiered_cloud modes. The goal is to have a mix of raft_data and ct_placeholder batches in the partition that is being shadowed.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

Copilot AI review requested due to automatic review settings May 11, 2026 16:58

Copilot AI left a comment

Pull request overview

Extends the shadow-linking random node operations test to also exercise cloud-topics and tiered-cloud-topics storage modes during node operations, including enabling the necessary cluster config/feature flags and allow-listing expected cloud-topics shutdown/retry log messages.

Changes:

  • Enable cloud_topics_enabled on both clusters and activate the explicit-only tiered_cloud_topics feature before topic creation.
  • Increase preallocated client nodes and ducktape cluster node count to support two additional concurrent workloads.
  • Add two new workloads (cloud-topic, tiered-cloud-topic) using redpanda.storage.mode topic config and allow-list expected cloud-topics shadow-link logs.

Comment on lines 264 to 267
extra_rp_conf={
"group_new_member_join_timeout": 3000,
CLOUD_TOPICS_CONFIG_STR: True,
},
Comment on lines 269 to 284
@@ -276,6 +280,7 @@ def __init__(self, test_ctx: TestContext):
"retention_local_trim_interval": 5000,
"partition_autobalancing_tick_interval_ms": 2000,
"group_new_member_join_timeout": 3000,
CLOUD_TOPICS_CONFIG_STR: True,
},
Comment on lines +329 to +333
# enabled on both clusters before any tiered_cloud topic can be
# created.
self.source_cluster.service.set_feature_active(
"tiered_cloud_topics", True, timeout_sec=30
)

vbotbuildovich commented May 11, 2026

CI test results

test results on build#84286
| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FLAKY(PASS) | ShadowLinkBasicTests | test_link_creation_checks | {"source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}} | integration | https://buildkite.com/redpanda/redpanda/builds/84286#019e1802-6e95-4755-8941-7102b14e57e6 | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0121, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkBasicTests&test_method=test_link_creation_checks |
| FLAKY(PASS) | Datalake3rdPartyMaintenanceTest | test_e2e_basic | {"catalog_type": "rest_hadoop", "cloud_storage_type": 1, "query_engine": "trino"} | integration | https://buildkite.com/redpanda/redpanda/builds/84286#019e1801-339b-4231-8b6b-79f43ff986f4 | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=Datalake3rdPartyMaintenanceTest&test_method=test_e2e_basic |
| FLAKY(PASS) | WriteCachingFailureInjectionE2ETest | test_crash_all | {"use_transactions": false} | integration | https://buildkite.com/redpanda/redpanda/builds/84286#019e1801-33a0-4837-90cb-d329d573bfe5 | 9/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0975, p0=0.6415, reject_threshold=0.0100. adj_baseline=0.2649, p1=0.2121, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all |
test results on build#84317
| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FLAKY(PASS) | ShadowLinkingReplicationTests | test_with_restart | {"storage_mode": "cloud"} | integration | https://buildkite.com/redpanda/redpanda/builds/84317#019e1bb4-c904-4687-a455-376509e29248 | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0305, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_with_restart |
| FAIL | ShadowLinkingRandomOpsTest | test_node_operations | {"failures": false, "workload_set": "cloud_combos"} | integration | https://buildkite.com/redpanda/redpanda/builds/84317#019e1bb4-c904-4687-a455-376509e29248 | 0/11 | Test FAILS after retries. Significant increase in flaky rate (baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations |
| FAIL | ShadowLinkingRandomOpsTest | test_node_operations | {"failures": false, "workload_set": "cloud_combos"} | integration | https://buildkite.com/redpanda/redpanda/builds/84317#019e1bb6-7282-4ecf-a3d6-f8c8957b188f | 0/11 | Test FAILS after retries. Significant increase in flaky rate (baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations |
| FAIL | ShadowLinkingRandomOpsTest | test_node_operations | {"failures": true, "workload_set": "cloud_combos"} | integration | https://buildkite.com/redpanda/redpanda/builds/84317#019e1bb4-c906-4a15-981b-e4328ad8375c | 0/11 | Test FAILS after retries. Significant increase in flaky rate (baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations |
| FAIL | ShadowLinkingRandomOpsTest | test_node_operations | {"failures": true, "workload_set": "cloud_combos"} | integration | https://buildkite.com/redpanda/redpanda/builds/84317#019e1bb6-7284-431b-ad2e-457808edc9a3 | 0/11 | Test FAILS after retries. Significant increase in flaky rate (baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations |
test results on build#84334
| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FLAKY(PASS) | ShadowLinkBasicTests | test_link_creation_checks | {"source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}} | integration | https://buildkite.com/redpanda/redpanda/builds/84334#019e1cca-dd8e-4c47-96c7-7b6c2f245bba | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0225, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkBasicTests&test_method=test_link_creation_checks |
| FLAKY(PASS) | ShadowLinkingRandomOpsTest | test_node_operations | {"failures": true, "workload_set": "basic"} | integration | https://buildkite.com/redpanda/redpanda/builds/84334#019e1ccb-528e-440b-9643-891f40a27ca9 | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations |
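
For readers trying to interpret the p0/p1 figures in the reason column: the bot's method is not documented in this thread, but the reported numbers are consistent with binomial tail probabilities over 10 retries, where the first failing run is not counted. The sketch below is an inference from the figures above, not the bot's actual code; the function names and the "10 retries" reading are assumptions, and `adj_baseline` is taken as an input because the thread does not say how it is derived.

```python
from math import comb

RETRIES = 10  # assumption: 11 runs = 1 initial failing run + 10 retries


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k failures in n independent runs."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def p0(retry_failures: int, baseline: float) -> float:
    """P(at least this many failures) if the true flaky rate were `baseline`."""
    return sum(binom_pmf(i, RETRIES, baseline)
               for i in range(retry_failures, RETRIES + 1))


def p1(retry_failures: int, adj_baseline: float) -> float:
    """P(at most this many failures) if the rate were `adj_baseline`."""
    return sum(binom_pmf(i, RETRIES, adj_baseline)
               for i in range(0, retry_failures + 1))


def retry_failures(passed: int, total: int = 11) -> int:
    """Failures among the retries, excluding the initial failing run."""
    return (total - passed) - 1


# 9/11 row (WriteCachingFailureInjectionE2ETest): one failure among the retries.
k = retry_failures(9)
print(round(p0(k, 0.0975), 4), round(p1(k, 0.2649), 4))
```

Under this reading, a row is flagged as a significant increase when p0 falls below reject_threshold, which matches the 0/11 rows where baseline=0.0000 gives p0=0.0000.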

@Lazin Lazin requested a review from pgellert May 11, 2026 18:58
@Lazin Lazin force-pushed the ct/shadown-linking-rnot-test branch from 745e0db to c0ffaf5 Compare May 12, 2026 10:13

vbotbuildovich commented May 12, 2026

Retry command for Build#84317

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/shadow_linking_rnot_test.py::ShadowLinkingRandomOpsTest.test_node_operations@{"failures":false,"workload_set":"cloud_combos"}
tests/rptest/tests/shadow_linking_rnot_test.py::ShadowLinkingRandomOpsTest.test_node_operations@{"failures":true,"workload_set":"cloud_combos"}

Lazin added 3 commits May 12, 2026 11:17
Adds cloud-topic and tiered-cloud-topic workloads to the shadow
linking random node ops test so we exercise plain cloud and
tiered_cloud storage modes alongside the existing si, compacted, and
transactional workloads. Enables the explicit-only tiered_cloud_topics
feature on both clusters and CLOUD_TOPICS_CONFIG_STR cluster-wide;
allow-lists the expected cloud-topics shutdown warnings.

Signed-off-by: Evgeny Lazin <[email protected]>

For storage.mode=cloud topics, replicate_at_offset previously sent
every input batch through stage_write/execute_write and wrapped each
one as a ctp_placeholder. The placeholder encoding drops the record
key, so for control records (e.g. transaction commit/abort markers)
the original key bytes are lost and rm_stm's parse_control_batch
throws std::out_of_range on the empty iobuf, halting state machine
apply at the marker offset.

Split the input list into user data batches (raft_data with
!is_control()) and pass-through batches (raft_configuration, tx_fence,
control batches, etc.). Only data batches are uploaded to L0 and
wrapped as ctp_placeholders; the rest are forwarded to the
write_at_offset_stm unchanged. The original input ordering is
preserved by interleaving the generated placeholders with the
pass-through batches.

Signed-off-by: Evgeny Lazin <[email protected]>

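
The split-and-interleave step described in the commit message above can be modeled in a few lines. This is a simplified Python model for illustration only; the real change is in the C++ frontend, and `Batch`, `fake_upload_to_l0`, and `split_and_interleave` are hypothetical names that do not appear in the PR.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Batch:
    batch_type: str          # e.g. "raft_data", "tx_fence", "raft_configuration"
    is_control: bool = False
    payload: bytes = b""


def is_user_data(b: Batch) -> bool:
    # Only non-control raft_data batches count as user data.
    return b.batch_type == "raft_data" and not b.is_control


def fake_upload_to_l0(b: Batch) -> Batch:
    # Hypothetical stand-in for the L0 upload plus placeholder wrapping.
    # The payload is dropped here to mimic the key being lost in the encoding.
    return Batch("ctp_placeholder", False, b"")


def split_and_interleave(
    batches: List[Batch],
    upload: Callable[[Batch], Batch] = fake_upload_to_l0,
) -> List[Batch]:
    # Split, remembering each batch's original position.
    data = [(i, b) for i, b in enumerate(batches) if is_user_data(b)]
    passthrough = [(i, b) for i, b in enumerate(batches) if not is_user_data(b)]
    # Only user data is wrapped; everything else is forwarded unchanged.
    placeholders = [(i, upload(b)) for i, b in data]
    # Interleave by original index to preserve input ordering.
    merged = sorted(placeholders + passthrough, key=lambda pair: pair[0])
    return [b for _, b in merged]


batches = [
    Batch("raft_data", False, b"k1"),
    Batch("raft_data", True, b"commit-marker"),  # control batch, must keep its bytes
    Batch("tx_fence", False, b"fence"),
    Batch("raft_data", False, b"k2"),
]
result = split_and_interleave(batches)
print([b.batch_type for b in result])
```

In this model the control batch keeps its original bytes, which is the property the fix relies on so that the marker can still be parsed on the target.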
Adds a new "flipping" workload_set matrix variant. A single workload
runs against flipping-storage-topic while a background daemon thread
toggles redpanda.storage.mode between cloud and tiered_cloud every 3
seconds on the source. Transient alter-config failures (leader
changes, partition movement) are logged and retried on the next tick;
the target config is not separately verified.

Wired through ClusterLinkingWorkloadSpec via optional
flip_storage_modes / flip_interval_seconds fields so other workloads
can opt in if needed.

Signed-off-by: Evgeny Lazin <[email protected]>
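
As a rough illustration of the flipping workload described above, here is a minimal sketch of a background daemon thread that toggles a topic config on a fixed interval and tolerates transient failures. `WorkloadSpec` is a hypothetical stand-in for ClusterLinkingWorkloadSpec, and `alter_config` is an assumed callable supplied by the caller, not a real rptest API. The sketch also assumes that "retried on the next tick" means the same mode is attempted again.

```python
import logging
import threading
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

logger = logging.getLogger("storage-mode-flipper")


@dataclass
class WorkloadSpec:  # hypothetical stand-in for ClusterLinkingWorkloadSpec
    topic: str
    flip_storage_modes: Optional[Sequence[str]] = None  # opt-in; None disables flipping
    flip_interval_seconds: float = 3.0


class StorageModeFlipper:
    def __init__(self, spec: WorkloadSpec,
                 alter_config: Callable[[str, str, str], None]):
        self._spec = spec
        self._alter_config = alter_config  # assumed signature: (topic, key, value)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.applied: List[str] = []  # modes that were set successfully

    def start(self) -> None:
        if self._spec.flip_storage_modes:
            self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()

    def _run(self) -> None:
        modes = list(self._spec.flip_storage_modes or [])
        idx = 0
        # Event.wait returns True once stop() is called, ending the loop.
        while not self._stop.wait(self._spec.flip_interval_seconds):
            mode = modes[idx % len(modes)]
            try:
                self._alter_config(self._spec.topic, "redpanda.storage.mode", mode)
            except Exception as e:
                # Transient failure: log it and try the same mode on the next tick.
                logger.warning("alter-config to %s failed: %s", mode, e)
                continue
            self.applied.append(mode)
            idx += 1
```

A caller would construct it with something like `WorkloadSpec("flipping-storage-topic", ["cloud", "tiered_cloud"])`, call `start()` before the workload runs, and call `stop()` during teardown.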
@Lazin Lazin force-pushed the ct/shadown-linking-rnot-test branch from 6c0c15f to b0d9665 Compare May 12, 2026 15:18
@Lazin Lazin requested review from WillemKauf and dotnwat May 12, 2026 15:56

@pgellert pgellert left a comment

Looks good to me, but I'll let someone from the cloud topics team approve.

Comment on lines +1283 to +1285
for (auto&& b : passthrough_batches) {
final_batches.push_back(std::move(b));
}

I think this would be simpler here:

Suggested change
- for (auto&& b : passthrough_batches) {
- final_batches.push_back(std::move(b));
- }
+ final_batches = std::move(passthrough_batches);

@dotnwat dotnwat requested review from andrwng and nvartolomei May 13, 2026 20:37