Fix race condition when calculating the LSO #28360
Conversation
/ci-repeat 1
CI test results
test results on build #75633
test results on build #75732
test results on build #75814
test results on build #76098
test results on build #76132
/ci-repeat 1
Pull Request Overview
This PR re-enables previously disabled transactional control batch removal functionality during log compaction. The changes activate feature flags and tests that were temporarily disabled while the coordinated compaction feature was being developed.
Key changes:
- Activates the `coordinated_compaction` feature flag to enable transactional batch removal during compaction
- Re-enables previously disabled tests for transaction control batch removal in both unit and integration test suites
- Adds a new test `test_lso_bound_by_open_tx` to validate LSO (Last Stable Offset) calculation with concurrent snapshots and transactions
- Introduces an `_lso_lock` rwlock to prevent race conditions when calculating LSO during transaction operations
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/rptest/transactions/tx_upgrade_test.py | Re-enabled TxUpgradeCompactionTest class and converted assertion-based batch checking to wait-based validation |
| tests/rptest/tests/log_compaction_test.py | Re-enabled LogCompactionTxRemovalTest and LogCompactionTxRemovalUpgradeTest, converted batch checking to polling-based approach |
| src/v/storage/tests/compaction_e2e_test.cc | Removed early return statements from AbortTransactions and CommitTransactions tests |
| src/v/storage/segment_utils.cc | Enabled unset_transactional_bit_enabled by checking coordinated_compaction feature flag |
| src/v/storage/segment_deduplication_utils.cc | Enabled unset_transactional_bit_enabled by checking coordinated_compaction feature flag |
| src/v/kafka/server/tests/group_tx_compaction_test.cc | Re-enabled batch validation logic for transaction control batches |
| src/v/compaction/utils.cc | Enabled remove_user_tx_fence_enabled by checking coordinated_compaction feature flag and config |
| src/v/cluster/tests/tx_compaction_utils.h | Re-enabled fence batch counting and validation in compaction verification |
| src/v/cluster/tests/rm_stm_tests.cc | Replaced hardcoded max timeout values with named constant, added new test for LSO race condition, fixed typo in comment |
| src/v/cluster/tests/rm_stm_test_fixture.h | Changed producer expiration from max() to named large_timeout constant |
| src/v/cluster/rm_stm.h | Fixed typo "exipration" to "expiration" |
| src/v/cluster/rm_stm.cc | Added _lso_lock to prevent LSO calculation races, improved LSO logic with detailed warning comment about edge cases |
/ci-repeat 5
bharathv left a comment
Nice idea, couple of questions.
| "lso update in progress, last_known_lso: {}, last_applied: {}", | ||
| _last_known_lso, | ||
| last_applied); | ||
| return _last_known_lso; |
One minor optimization is to advance the LSO right after obtaining _lso_lock.write_lock(). Doing so ensures that the LSO reflects the last_visible_index at that time, instead of depending on the previous lso() call, which might be outdated.
I am not sure I understand that part.
Consider this situation: when last_stable_offset() is called at time t, the _last_known_lso is set to 100.
Now, suppose several transactions have occurred, and the actual LSO is now at 900. When a new begin (at offset 901) and last_stable_offset() are executed concurrently at, say, t + 1hr, a race condition occurs: the call runs into this check and the function conservatively returns 100, since that is the last recorded known LSO.
Ideally, the function could return 900 instead. One possible solution would be to recompute the _last_known_offset after acquiring the _lso_lock in begin, but before performing replication, I think.
While the stm is shutting down, the state the function is looking at may be partial, and there is a chance the LSO is overestimated such that it includes open transactions.
When a reset is in progress (e.g. a raft snapshot), the stm may clear all the in-flight transactions and reset the next apply offset. Due to scheduling points, if an LSO request comes in racily, it may not see a consistent state of the stm. This commit returns the last known LSO as a conservative approach until the reset is finished.
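For illustration only, here is a minimal sketch of the fallback behaviour discussed in this thread, using std::shared_mutex and plain integers rather than the Seastar rwlock and model::offset types used by rm_stm; all names here are hypothetical. Transactional operations and resets take the lock for write, while the LSO query takes it for read and falls back to the last known value if the lock is unavailable.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <shared_mutex>

// Hypothetical, simplified model of the _lso_lock idea; not the actual rm_stm code.
class lso_tracker {
public:
    // begin_tx / end_tx / reset style operations: hold the write lock while
    // mutating state that the LSO calculation depends on.
    template <typename MutatingOp>
    void apply_tx_op(MutatingOp&& op) {
        std::unique_lock write_lock(_lso_lock);
        // Per the review suggestion above, the cached LSO could additionally be
        // refreshed here, right after acquiring the write lock, so that readers
        // falling back to it never see a value older than this point in time.
        op(_state);
    }

    // LSO query: if a transactional operation or reset is in flight, the
    // in-progress state cannot be trusted, so conservatively return the last
    // known LSO instead of computing a fresh one.
    int64_t last_stable_offset() {
        std::shared_lock read_lock(_lso_lock, std::try_to_lock);
        if (!read_lock.owns_lock()) {
            return _last_known_lso; // update in progress, fall back
        }
        _last_known_lso = compute_lso(_state); // state is consistent here
        return _last_known_lso;
    }

private:
    struct state {
        int64_t first_open_tx_offset{-1}; // -1 means no open transaction
        int64_t last_visible_offset{-1};
    };

    static int64_t compute_lso(const state& s) {
        // Assumption for this sketch: the LSO is bounded by the earliest open
        // transaction, otherwise it sits just past the last visible offset.
        return s.first_open_tx_offset >= 0 ? s.first_open_tx_offset
                                           : s.last_visible_offset + 1;
    }

    std::shared_mutex _lso_lock;
    state _state;
    std::atomic<int64_t> _last_known_lso{-1};
};
```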
Move it onto the stack. A deeper fix is to pass it by copy, but that's a bigger change and can be done later.
4672f89 to 1f7ad08
/ci-repeat 1
Using long_max milliseconds will result in overflow during internal duration casts. Instead, use a large-ish timeout equivalent to no timeout.
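As a standalone illustration (not part of the PR): converting milliseconds::max() to a finer-grained duration cannot be represented in 64 bits, while a large but finite timeout survives the same cast.

```cpp
#include <chrono>
#include <iostream>

int main() {
    using namespace std::chrono;

    // milliseconds::max() ("no timeout") cannot be converted to nanoseconds:
    // the multiplication by 1'000'000 exceeds the 64-bit representation, so
    // internal duration casts overflow.
    std::cout << std::boolalpha
              << (milliseconds::max().count()
                  > nanoseconds::max().count() / 1'000'000)
              << '\n'; // prints true: the cast would overflow

    // A large-ish finite timeout behaves like "no timeout" in practice and
    // converts safely.
    constexpr auto large_timeout = hours(24 * 365 * 10); // roughly 10 years
    std::cout << duration_cast<nanoseconds>(large_timeout).count() << " ns\n";
    return 0;
}
```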
This commit introduces a new read/write lock that protects the LSO updates. The lock is held for write during any operation that, when applied, may influence the LSO calculation, i.e. begin_tx and ending a transaction. The lock is acquired for read when `rm_stm` is asked for the LSO. If the read lock cannot be acquired because a transaction operation is in progress, we fall back to the previous LSO; otherwise the calculation is based on state that is guaranteed to be up to date. Signed-off-by: Michał Maślanka <michal@redpanda.com>
…hes" This reverts commit 94c2b01.
1f7ad08 to 879bbfe
WillemKauf left a comment
Looks good. See the thread here: https://redpandadata.slack.com/archives/C07FJGU5AKV/p1763062534096009 for discussion on the release of the feature (i.e., are we backporting to v25.3.x for release in a future minor?)
Backports Required
Release Notes