Fix race condition when calculating the LSO #28360

Merged
mmaslankaprv merged 9 commits into redpanda-data:dev from mmaslankaprv:lso-fix
Nov 13, 2025

Conversation

@mmaslankaprv
Member

@mmaslankaprv mmaslankaprv commented Nov 5, 2025

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

  • none

@mmaslankaprv
Member Author

/ci-repeat 1

@vbotbuildovich

This comment was marked as outdated.

@vbotbuildovich
Collaborator

vbotbuildovich commented Nov 5, 2025

CI test results

test results on build#75633
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
DatalakeBatchingTest test_batching {"catalog_type": "rest_jdbc", "cloud_storage_type": 1, "expect_large_files": true, "query_engine": "spark"} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a5380-ea75-43d0-8382-4c4fb8b6caeb FLAKY 2/21 upstream reliability is '100.0'. current run reliability is '9.523809523809524'. drift is 90.47619 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeBatchingTest&test_method=test_batching
DatalakeClusterRestoreTest test_basic {"catalog_type": "nessie", "cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-398c-4fc0-a8df-a573f6d6447e FLAKY 4/21 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeClusterRestoreTest&test_method=test_basic
DatalakeClusterRestoreTest test_basic {"catalog_type": "rest_hadoop", "cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-398d-4381-b5a7-b38dbbeb198a FLAKY 5/21 upstream reliability is '100.0'. current run reliability is '23.809523809523807'. drift is 76.19048 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeClusterRestoreTest&test_method=test_basic
DatalakeClusterRestoreTest test_basic {"catalog_type": "rest_jdbc", "cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-398e-44b5-84a7-b60f2d295c2d FLAKY 11/21 upstream reliability is '100.0'. current run reliability is '52.38095238095239'. drift is 47.61905 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeClusterRestoreTest&test_method=test_basic
DatalakeClusterRestoreTest test_basic {"catalog_type": "rest_jdbc", "cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a5380-ea77-4f96-8901-87be56c64e3b FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeClusterRestoreTest&test_method=test_basic
DatalakeClusterRestoreTest test_restore_partition_spec {"catalog_type": "nessie", "cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-3984-4dfc-8417-c128e7e1fd05 FLAKY 4/21 upstream reliability is '100.0'. current run reliability is '19.047619047619047'. drift is 80.95238 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeClusterRestoreTest&test_method=test_restore_partition_spec
DatalakeClusterRestoreTest test_restore_partition_spec {"catalog_type": "rest_hadoop", "cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-3986-4c78-bd52-008b47d8f269 FLAKY 8/21 upstream reliability is '100.0'. current run reliability is '38.095238095238095'. drift is 61.90476 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeClusterRestoreTest&test_method=test_restore_partition_spec
DatalakeClusterRestoreTest test_restore_partition_spec {"catalog_type": "rest_jdbc", "cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-3987-441c-9366-014909806170 FLAKY 4/21 upstream reliability is '100.0'. current run reliability is '19.047619047619047'. drift is 80.95238 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeClusterRestoreTest&test_method=test_restore_partition_spec
DatalakeDelayedTranslationTest test_basic {"catalog_type": "nessie", "cloud_storage_type": 1, "query_engine": "spark"} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-398c-4fc0-a8df-a573f6d6447e FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeDelayedTranslationTest&test_method=test_basic
DatalakeDelayedTranslationTest test_basic {"catalog_type": "nessie", "cloud_storage_type": 1, "query_engine": "spark"} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a5380-ea72-4f2d-babb-04e498f6fb45 FLAKY 7/21 upstream reliability is '100.0'. current run reliability is '33.33333333333333'. drift is 66.66667 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeDelayedTranslationTest&test_method=test_basic
DatalakeDelayedTranslationTest test_basic {"catalog_type": "rest_hadoop", "cloud_storage_type": 1, "query_engine": "spark"} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-398d-4381-b5a7-b38dbbeb198a FAIL 0/21 The test has failed across all retries https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeDelayedTranslationTest&test_method=test_basic
DatalakeDelayedTranslationTest test_basic {"catalog_type": "rest_jdbc", "cloud_storage_type": 1, "query_engine": "spark"} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-398e-44b5-84a7-b60f2d295c2d FLAKY 2/21 upstream reliability is '100.0'. current run reliability is '9.523809523809524'. drift is 90.47619 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeDelayedTranslationTest&test_method=test_basic
DatalakeDelayedTranslationTest test_basic {"catalog_type": "rest_jdbc", "cloud_storage_type": 1, "query_engine": "spark"} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a5380-ea77-4f96-8901-87be56c64e3b FAIL 0/21 The test has failed across all retries https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeDelayedTranslationTest&test_method=test_basic
DatalakeDelayedTranslationTest test_basic {"catalog_type": "nessie", "cloud_storage_type": 1, "query_engine": "trino"} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-3984-4dfc-8417-c128e7e1fd05 FLAKY 1/21 upstream reliability is '100.0'. current run reliability is '4.761904761904762'. drift is 95.2381 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeDelayedTranslationTest&test_method=test_basic
DatalakeDelayedTranslationTest test_basic {"catalog_type": "nessie", "cloud_storage_type": 1, "query_engine": "trino"} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a5380-ea69-46a9-a79f-b47e8e13ff8d FLAKY 7/21 upstream reliability is '100.0'. current run reliability is '33.33333333333333'. drift is 66.66667 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeDelayedTranslationTest&test_method=test_basic
DatalakeDelayedTranslationTest test_basic {"catalog_type": "rest_hadoop", "cloud_storage_type": 1, "query_engine": "trino"} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-3986-4c78-bd52-008b47d8f269 FLAKY 2/21 upstream reliability is '100.0'. current run reliability is '9.523809523809524'. drift is 90.47619 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeDelayedTranslationTest&test_method=test_basic
DatalakeDelayedTranslationTest test_basic {"catalog_type": "rest_hadoop", "cloud_storage_type": 1, "query_engine": "trino"} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a5380-ea6a-4ce3-aa47-2f924c57f591 FLAKY 2/21 upstream reliability is '100.0'. current run reliability is '9.523809523809524'. drift is 90.47619 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeDelayedTranslationTest&test_method=test_basic
DatalakeDelayedTranslationTest test_basic {"catalog_type": "rest_jdbc", "cloud_storage_type": 1, "query_engine": "trino"} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-3987-441c-9366-014909806170 FAIL 0/21 The test has failed across all retries https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeDelayedTranslationTest&test_method=test_basic
DatalakeDelayedTranslationTest test_basic {"catalog_type": "rest_jdbc", "cloud_storage_type": 1, "query_engine": "trino"} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a5380-ea6c-4656-a5d8-83804b1b674d FLAKY 3/21 upstream reliability is '100.0'. current run reliability is '14.285714285714285'. drift is 85.71429 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeDelayedTranslationTest&test_method=test_basic
DeleteRecordsTest test_delete_records_concurrent_truncations {"cloud_storage_enabled": false, "truncate_point": "at_high_watermark"} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-3988-430c-baa9-e25d72f9dc19 FLAKY 13/21 upstream reliability is '100.0'. current run reliability is '61.904761904761905'. drift is 38.09524 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DeleteRecordsTest&test_method=test_delete_records_concurrent_truncations
AWSRoleFetchTests test_write null integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-398c-4fc0-a8df-a573f6d6447e FLAKY 4/21 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AWSRoleFetchTests&test_method=test_write
STSRoleFetchTests test_write null integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-398d-4381-b5a7-b38dbbeb198a FLAKY 8/21 upstream reliability is '100.0'. current run reliability is '38.095238095238095'. drift is 61.90476 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=STSRoleFetchTests&test_method=test_write
EndToEndShadowIndexingTest test_write {"cloud_storage_type": 2} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-398e-44b5-84a7-b60f2d295c2d FLAKY 4/21 upstream reliability is '100.0'. current run reliability is '19.047619047619047'. drift is 80.95238 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=EndToEndShadowIndexingTest&test_method=test_write
EndToEndShadowIndexingTestWithDisruptions test_write_with_node_failures {"cloud_storage_type": 2} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-398a-4573-8efc-1c6de4ed0f72 FLAKY 11/21 upstream reliability is '100.0'. current run reliability is '52.38095238095239'. drift is 47.61905 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=EndToEndShadowIndexingTestWithDisruptions&test_method=test_write_with_node_failures
FollowerFetchingTest test_with_leadership_transfers {"fetch_from": "fetch-from-tiered-storage"} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-398c-4fc0-a8df-a573f6d6447e FAIL 0/2 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=FollowerFetchingTest&test_method=test_with_leadership_transfers
NodeWiseRecoveryTest test_node_wise_recovery {"dead_node_count": 1} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-398b-4ea4-bd34-37f2f5e10458 FAIL 0/21 The test has failed across all retries https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodeWiseRecoveryTest&test_method=test_node_wise_recovery
NodeWiseRecoveryTest test_node_wise_recovery {"dead_node_count": 2} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-398c-4fc0-a8df-a573f6d6447e FAIL 0/20 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodeWiseRecoveryTest&test_method=test_node_wise_recovery
RedpandaNodeOperationsSmokeTest test_node_ops_smoke_test {"cloud_storage_type": 1, "mixed_versions": false} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a5380-ea6d-44e7-86ed-d0a11263ff14 FAIL 0/21 The test has failed across all retries https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test
RedpandaNodeOperationsSmokeTest test_node_ops_smoke_test {"cloud_storage_type": 1, "mixed_versions": true} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-3989-46c1-971f-b44aeab39816 FLAKY 7/21 upstream reliability is '95.17045454545455'. current run reliability is '33.33333333333333'. drift is 61.83712 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test
ShadowIndexingCompactedTopicTest test_upload {"cloud_storage_type": 2} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-3989-46c1-971f-b44aeab39816 FAIL 0/21 The test has failed across all retries https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowIndexingCompactedTopicTest&test_method=test_upload
ShadowIndexingCompactedTopicTest test_upload {"cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a537c-398a-4573-8efc-1c6de4ed0f72 FAIL 0/21 The test has failed across all retries https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowIndexingCompactedTopicTest&test_method=test_upload
ShadowLinkingRandomOpsTest test_node_operations {"failures": true} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a5380-ea6f-4140-a6d1-97514827e386 FLAKY 14/21 upstream reliability is '100.0'. current run reliability is '66.66666666666666'. drift is 33.33333 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations
TestTieredStoragePause test_safe_pause_resume {"allow_gaps_cluster_level": true, "allow_gaps_topic_level": true} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a5380-ea6d-44e7-86ed-d0a11263ff14 FLAKY 16/21 upstream reliability is '100.0'. current run reliability is '76.19047619047619'. drift is 23.80952 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TestTieredStoragePause&test_method=test_safe_pause_resume
WriteCachingFailureInjectionE2ETest test_crash_all {"use_transactions": false} integration https://buildkite.com/redpanda/redpanda/builds/75633#019a5380-ea6d-44e7-86ed-d0a11263ff14 FLAKY 17/21 upstream reliability is '89.08098271155596'. current run reliability is '80.95238095238095'. drift is 8.1286 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all
test results on build#75732
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
ShadowLinkConsumeGroupsMirroringTest test_continuous_group_sync {"source_cluster_spec": {"cluster_type": "redpanda"}, "with_failures": true} integration https://buildkite.com/redpanda/redpanda/builds/75732#019a5889-4b7c-463c-a6ab-5950714c0b25 FLAKY 20/21 upstream reliability is '99.11190053285968'. current run reliability is '95.23809523809523'. drift is 3.87381 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkConsumeGroupsMirroringTest&test_method=test_continuous_group_sync
MountUnmountIcebergTest test_simple_remount {"cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/75732#019a5889-4b7a-490a-a28d-1b7992fd9cf6 FLAKY 15/21 upstream reliability is '95.33777354900094'. current run reliability is '71.42857142857143'. drift is 23.9092 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MountUnmountIcebergTest&test_method=test_simple_remount
LogCompactionTxRemovalTest test_tx_control_batch_removal null integration https://buildkite.com/redpanda/redpanda/builds/75732#019a587f-c929-4084-989c-d6107c5398b1 FLAKY 15/21 upstream reliability is '85.87987355110643'. current run reliability is '71.42857142857143'. drift is 14.4513 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=LogCompactionTxRemovalTest&test_method=test_tx_control_batch_removal
WriteCachingFailureInjectionTest test_unavoidable_data_loss null integration https://buildkite.com/redpanda/redpanda/builds/75732#019a5889-4b7e-4a79-937c-d4e7b0eeecac FLAKY 20/21 upstream reliability is '94.70672389127324'. current run reliability is '95.23809523809523'. drift is -0.53137 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionTest&test_method=test_unavoidable_data_loss
TxUpgradeCompactionTest upgrade_with_compaction_test null integration https://buildkite.com/redpanda/redpanda/builds/75732#019a5889-4b78-4e4d-bbfd-cf399583670f FLAKY 18/21 upstream reliability is '98.75776397515527'. current run reliability is '85.71428571428571'. drift is 13.04348 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TxUpgradeCompactionTest&test_method=upgrade_with_compaction_test
test results on build#75814
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
LogCompactionTxRemovalTest test_tx_control_batch_removal null integration https://buildkite.com/redpanda/redpanda/builds/75814#019a5d59-08e3-4216-915e-58548e574743 FLAKY 36/40 upstream reliability is '99.51159951159951'. current run reliability is '83.33333333333334'. drift is 16.17827 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=LogCompactionTxRemovalTest&test_method=test_tx_control_batch_removal
LogCompactionTxRemovalTest test_tx_control_batch_removal null integration https://buildkite.com/redpanda/redpanda/builds/75814#019a5d59-08e4-4970-99d9-70a73f0827ae FLAKY 39/40 upstream reliability is '99.51159951159951'. current run reliability is '95.23809523809523'. drift is 4.2735 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=LogCompactionTxRemovalTest&test_method=test_tx_control_batch_removal
LogCompactionTxRemovalTest test_tx_control_batch_removal null integration https://buildkite.com/redpanda/redpanda/builds/75814#019a5d59-08e5-4863-938c-5e48676ceeba FLAKY 34/40 upstream reliability is '99.51159951159951'. current run reliability is '75.0'. drift is 24.5116 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=LogCompactionTxRemovalTest&test_method=test_tx_control_batch_removal
LogCompactionTxRemovalTest test_tx_control_batch_removal null integration https://buildkite.com/redpanda/redpanda/builds/75814#019a5d59-08e6-463e-aa79-bcc7058c24bd FLAKY 36/40 upstream reliability is '99.51159951159951'. current run reliability is '81.81818181818183'. drift is 17.69342 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=LogCompactionTxRemovalTest&test_method=test_tx_control_batch_removal
test results on build#76098
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
ShadowLinkingReplicationTests test_replication_with_failures null integration https://buildkite.com/redpanda/redpanda/builds/76098#019a7779-5b17-4a60-afbf-e5c8fcc0ad31 FLAKY 19/21 upstream reliability is '96.31490787269682'. current run reliability is '90.47619047619048'. drift is 5.83872 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_with_failures
LogCompactionTxRemovalUpgradeTest test_tx_control_batch_removal_with_upgrade {"test_case_name": "Mixed aborts and commits"} integration https://buildkite.com/redpanda/redpanda/builds/76098#019a7774-1c73-4975-b3f2-e1fdfae76c4a FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=LogCompactionTxRemovalUpgradeTest&test_method=test_tx_control_batch_removal_with_upgrade
test results on build#76132
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
ShadowLinkingReplicationTests test_replication_basic {"shuffle_leadership": true, "source_cluster_spec": {"cluster_type": "redpanda"}} integration https://buildkite.com/redpanda/redpanda/builds/76132#019a78f6-5ffe-427d-989e-73806037ffbd FLAKY 16/21 upstream reliability is '88.78748370273793'. current run reliability is '76.19047619047619'. drift is 12.59701 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_basic
ControllerLogLimitMirrorMakerTests test_mirror_maker_with_limits null integration https://buildkite.com/redpanda/redpanda/builds/76132#019a78f4-f6c5-439c-becf-7ce92f97c3fc FLAKY 20/21 upstream reliability is '98.95833333333334'. current run reliability is '95.23809523809523'. drift is 3.72024 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ControllerLogLimitMirrorMakerTests&test_method=test_mirror_maker_with_limits

@mmaslankaprv
Member Author

/ci-repeat 1

@mmaslankaprv mmaslankaprv changed the title from "Lso fix" to "Fix race condition when calculating the LSO" Nov 6, 2025
@mmaslankaprv mmaslankaprv marked this pull request as ready for review November 6, 2025 17:04
Copilot AI review requested due to automatic review settings November 6, 2025 17:04
Contributor

Copilot AI left a comment

Pull Request Overview

This PR re-enables previously disabled transactional control batch removal functionality during log compaction. The changes activate feature flags and tests that were temporarily disabled while the coordinated compaction feature was being developed.

Key changes:

  • Activates the coordinated_compaction feature flag to enable transactional batch removal during compaction
  • Re-enables previously disabled tests for transaction control batch removal in both unit and integration test suites
  • Adds a new test test_lso_bound_by_open_tx to validate LSO (Last Stable Offset) calculation with concurrent snapshots and transactions
  • Introduces an _lso_lock rwlock to prevent race conditions when calculating LSO during transaction operations

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.

Show a summary per file
| File | Description |
| --- | --- |
| tests/rptest/transactions/tx_upgrade_test.py | Re-enabled TxUpgradeCompactionTest class and converted assertion-based batch checking to wait-based validation |
| tests/rptest/tests/log_compaction_test.py | Re-enabled LogCompactionTxRemovalTest and LogCompactionTxRemovalUpgradeTest, converted batch checking to polling-based approach |
| src/v/storage/tests/compaction_e2e_test.cc | Removed early return statements from AbortTransactions and CommitTransactions tests |
| src/v/storage/segment_utils.cc | Enabled unset_transactional_bit_enabled by checking coordinated_compaction feature flag |
| src/v/storage/segment_deduplication_utils.cc | Enabled unset_transactional_bit_enabled by checking coordinated_compaction feature flag |
| src/v/kafka/server/tests/group_tx_compaction_test.cc | Re-enabled batch validation logic for transaction control batches |
| src/v/compaction/utils.cc | Enabled remove_user_tx_fence_enabled by checking coordinated_compaction feature flag and config |
| src/v/cluster/tests/tx_compaction_utils.h | Re-enabled fence batch counting and validation in compaction verification |
| src/v/cluster/tests/rm_stm_tests.cc | Replaced hardcoded max timeout values with named constant, added new test for LSO race condition, fixed typo in comment |
| src/v/cluster/tests/rm_stm_test_fixture.h | Changed producer expiration from max() to named large_timeout constant |
| src/v/cluster/rm_stm.h | Fixed typo "exipration" to "expiration" |
| src/v/cluster/rm_stm.cc | Added _lso_lock to prevent LSO calculation races, improved LSO logic with detailed warning comment about edge cases |

@mmaslankaprv
Member Author

/ci-repeat 5
skip-redpanda-build
dt-repeat=20
tests/rptest/tests/log_compaction_test.py::LogCompactionTxRemovalTest

WillemKauf previously approved these changes Nov 7, 2025
Contributor

@WillemKauf WillemKauf left a comment

Awesome 🥳

Contributor

@bharathv bharathv left a comment

nice idea, couple of questions.

Comment thread src/v/cluster/rm_stm.h
Comment thread src/v/cluster/rm_stm.cc Outdated
Comment thread src/v/cluster/rm_stm.cc
"lso update in progress, last_known_lso: {}, last_applied: {}",
_last_known_lso,
last_applied);
return _last_known_lso;
Contributor

One minor optimization is to advance lso right after obtaining _lso_lock.write_lock(). Doing so ensures that lso reflects the last_visible_index at that time, instead of depending on the previous lso() call, which might be outdated.

Member Author

I am not sure I understand that part

Contributor

Consider this situation: last_stable_offset() is called at time t, and _last_known_lso is set to 100.
Now suppose several transactions have completed since then, and the actual LSO is at 900. When a new begin (at offset 901) and last_stable_offset() run concurrently at, say, t + 1hr, the race hits this check and the function conservatively returns 100, since that's the last known LSO that was recorded.
Ideally, the function would return 900 instead. One possible solution would be to recompute _last_known_lso after acquiring the _lso_lock in begin, but before performing replication, I think.
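
To make the scenario above concrete, here is a minimal sketch of the stale fallback and of the suggested refresh. It is a toy model under stated assumptions, not the rm_stm code: `lso_cache`, `current_lso`, `begin_with_refresh` and the boolean standing in for the write lock are all hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the cached-LSO fallback; not the rm_stm code.
struct lso_cache {
    std::int64_t last_known_lso = 0;        // cached by the last successful call
    std::int64_t current_lso = 0;           // what a full recomputation would yield
    bool tx_operation_in_progress = false;  // stands in for the held write lock

    std::int64_t last_stable_offset() {
        if (tx_operation_in_progress) {
            // Conservative fallback: state may be changing, use the cache.
            return last_known_lso;
        }
        last_known_lso = current_lso;
        return last_known_lso;
    }

    // Takes the "lock" without touching the cache.
    void begin() { tx_operation_in_progress = true; }

    // Suggested variant: refresh the cache once the "lock" is held,
    // before replication starts changing the state.
    void begin_with_refresh() {
        tx_operation_in_progress = true;
        last_known_lso = current_lso;
    }
};

inline void run_refresh_demo() {
    lso_cache plain;
    plain.last_known_lso = 100;  // cached at time t
    plain.current_lso = 900;     // transactions completed since then
    plain.begin();
    assert(plain.last_stable_offset() == 100);  // stale but safe

    lso_cache refreshed;
    refreshed.last_known_lso = 100;
    refreshed.current_lso = 900;
    refreshed.begin_with_refresh();
    assert(refreshed.last_stable_offset() == 900);  // fresh and still safe
}
```

Both variants are conservative; the refresh only narrows how far the fallback value can lag behind.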

Comment thread src/v/cluster/rm_stm.cc
As the stm is shutting down, the state the function is looking at may
be partial, and there is a chance the LSO is overestimated and includes
open transactions.

When a reset is in progress (e.g. raft snapshot), the stm may clear all
the inflight transactions and reset the next apply offset. Due to
scheduling points, if an LSO request comes in racily, it may not see a
consistent state of the stm.

This commit returns the last known LSO as a conservative approach until
the reset is finished.

Move it on to the stack. A deeper fix is to pass it by copy, but that's a
bigger change and can be done later.
@mmaslankaprv
Member Author

/ci-repeat 1

bashtanov previously approved these changes Nov 12, 2025
Contributor

@bashtanov bashtanov left a comment

LGTM, nits only

Comment thread src/v/cluster/rm_stm.cc
Comment thread src/v/cluster/tests/rm_stm_tests.cc Outdated
Comment thread src/v/cluster/tests/rm_stm_tests.cc Outdated
Comment thread src/v/cluster/tests/rm_stm_tests.cc Outdated
Comment thread src/v/cluster/rm_stm.cc Outdated
Comment thread src/v/cluster/rm_stm.cc Outdated
bharathv and others added 5 commits November 12, 2025 17:09
Using long_max milliseconds will result in overflow during internal
duration casts. Instead, use a large-ish timeout equivalent to no
timeout.
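
A small sketch of why a maximal millisecond timeout is risky, assuming the value is later converted to a finer unit such as `std::chrono::nanoseconds` (the exact internal cast is not shown in this conversation). The helper `overflows_as_nanoseconds` and the one-year value chosen for `large_timeout` are assumptions for illustration only.

```cpp
#include <cassert>
#include <chrono>
#include <limits>

// Returns true when `ms` cannot be represented as std::chrono::nanoseconds,
// i.e. when a duration_cast to nanoseconds would overflow the signed rep.
constexpr bool overflows_as_nanoseconds(std::chrono::milliseconds ms) {
    using ns_rep = std::chrono::nanoseconds::rep;
    constexpr ns_rep ns_per_ms = 1'000'000;
    constexpr ns_rep max_ms = std::numeric_limits<ns_rep>::max() / ns_per_ms;
    constexpr ns_rep min_ms = std::numeric_limits<ns_rep>::min() / ns_per_ms;
    return ms.count() > max_ms || ms.count() < min_ms;
}

// Hypothetical "effectively no timeout" value (one year); the value used in
// the PR is not shown in this conversation.
constexpr std::chrono::milliseconds large_timeout = std::chrono::hours(24 * 365);

// Usage check: max() overflows once scaled, the bounded constant does not.
inline void check_timeouts() {
    assert(overflows_as_nanoseconds(std::chrono::milliseconds::max()));
    assert(!overflows_as_nanoseconds(large_timeout));
}
```

A bounded constant keeps every later multiplication inside the range of the 64-bit representation, which is presumably the point of replacing max().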

This commit introduces a new read/write lock that protects LSO updates.
The lock is held for write during any operation that, when applied,
may influence the calculation of the LSO, i.e. begin_tx and end transaction.

The lock is acquired for read when `rm_stm` is asked about the LSO. If
the lock cannot be acquired, a transaction operation is in progress and
we fall back to the previous LSO; otherwise the calculation is based
on state that is guaranteed to be up to date.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
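
The locking scheme in the commit message above can be sketched roughly as follows. This is a simplified, single-threaded model, not the rm_stm implementation: `toy_rwlock` is a hypothetical stand-in for the real rwlock, and `lso_tracker` with all of its members is invented for illustration. It assumes the LSO is the first offset of the earliest open transaction, or the offset after the last applied one when no transaction is open.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>

// Toy cooperative reader/writer lock for a single-threaded event loop.
// Hypothetical stand-in for the real rwlock; deliberately not thread-safe.
class toy_rwlock {
public:
    bool try_read_lock() {
        if (_writer) return false;
        ++_readers;
        return true;
    }
    void read_unlock() { --_readers; }
    bool try_write_lock() {
        if (_writer || _readers > 0) return false;
        _writer = true;
        return true;
    }
    void write_unlock() { _writer = false; }

private:
    int _readers = 0;
    bool _writer = false;
};

// Hypothetical model of the LSO bookkeeping described in the commit message.
class lso_tracker {
public:
    using offset = std::int64_t;

    // Writer side: held from before a begin/end batch is replicated
    // until it has been applied.
    bool start_tx_operation() { return _lso_lock.try_write_lock(); }
    void finish_tx_operation() { _lso_lock.write_unlock(); }

    void advance_last_applied(offset o) { _last_applied = std::max(_last_applied, o); }
    void begin_tx(offset first) { _open_tx.insert(first); advance_last_applied(first); }
    void end_tx(offset first) { _open_tx.erase(first); }

    offset last_stable_offset() {
        if (!_lso_lock.try_read_lock()) {
            // A transaction operation is in flight: fall back to the cache.
            return _last_known_lso;
        }
        const offset lso = _open_tx.empty() ? _last_applied + 1 : *_open_tx.begin();
        _last_known_lso = std::max(_last_known_lso, lso);
        _lso_lock.read_unlock();
        return _last_known_lso;
    }

private:
    toy_rwlock _lso_lock;
    std::set<offset> _open_tx;  // first offsets of open transactions
    offset _last_applied = -1;
    offset _last_known_lso = 0;
};

inline void run_lso_demo() {
    lso_tracker t;
    t.advance_last_applied(99);
    assert(t.last_stable_offset() == 100);
    assert(t.start_tx_operation());
    t.advance_last_applied(149);
    assert(t.last_stable_offset() == 100);  // fallback while the op is in flight
    t.begin_tx(150);
    t.finish_tx_operation();
    assert(t.last_stable_offset() == 150);  // bound by the open transaction
}
```

The reader never waits: a failed `try_read_lock` returns a value that may lag but cannot run past an open transaction, which is the conservative behaviour the commit message describes.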
Contributor

@WillemKauf WillemKauf left a comment

Looks good. See the thread here: https://redpandadata.slack.com/archives/C07FJGU5AKV/p1763062534096009 for discussion on the release of the feature (i.e. are we backporting to v25.3.x for release in a future minor?)

@mmaslankaprv mmaslankaprv merged commit 083fdd6 into redpanda-data:dev Nov 13, 2025
19 checks passed

7 participants