fix(drbd): prevent duplicate TCP ports after toggle-disk operations#476
fix(drbd): prevent duplicate TCP ports after toggle-disk operations#476kvaps wants to merge 3 commits intoLINBIT:masterfrom
Conversation
1294106 to
279be43
Compare
## What this PR does This PR updates piraeus-server patches to address several critical production issues with DRBD resources and LUKS encryption: 1. **Add fix-duplicate-tcp-ports.diff** - Prevents duplicate TCP ports after toggle-disk operations (upstream PR #476) 2. **Update skip-adjust-when-device-inaccessible.diff** - Comprehensive fix for multiple issues: - Resources stuck in StandAlone state after node reboot - Unknown state race condition during satellite restart - Encrypted LUKS resource deletion failures - Network reconnect blocked by unavailable child device checks These patches resolve scenarios where DRBD resources fail to automatically reconnect after node reboots and improve LUKS resource lifecycle management. Upstream PRs: - LINBIT/linstor-server#476 - LINBIT/linstor-server#477 ### Release note ```release-note [linstor] Fix DRBD resources stuck in StandAlone state after reboot and encrypted resource deletion issues ``` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Prevents duplicate TCP port conflicts after disk toggle operations * Fixes resources stuck in StandAlone or Unknown state after reboot * Resolves issues with encrypted resource deletion * Improves handling of temporarily inaccessible storage devices <sub>✏️ Tip: You can customize this high-level summary in your review settings.</sub> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Updated analysis — the original fix in this PR is incompleteAfter extensive debugging on a production cluster, we found that removing the redundant Root causeDuring toggle-disk operations, The controller correctly avoids collisions in its own number pool. But if the satellite misses the update (e.g. during controller restart, network issue, or Evidence from production
Proposed expanded fixPreserve existing TCP ports during toggle-disk by copying them into the private void copyDrbdTcpPortsIfExists(Resource rsc, LayerPayload payload) {
Set<AbsRscLayerObject<Resource>> drbdRscDataSet = LayerRscUtils.getRscDataByLayer(
getLayerData(apiCtx, rsc), DeviceLayerKind.DRBD);
if (!drbdRscDataSet.isEmpty()) {
DrbdRscData<Resource> drbdRscData = (DrbdRscData<Resource>) drbdRscDataSet.iterator().next();
Collection<TcpPortNumber> tcpPorts = drbdRscData.getTcpPortList();
if (tcpPorts != null && !tcpPorts.isEmpty()) {
Set<Integer> portInts = new TreeSet<>();
for (TcpPortNumber port : tcpPorts) {
portInts.add(port.value);
}
payload.drbdRsc.tcpPorts = portInts;
}
}
}Called from:
This ensures the same TCP ports are reused when |
|
Pushed the expanded fix (cf28218) that preserves TCP ports during toggle-disk operations. The new commit adds
Combined with the original commit (removing redundant Note: We also discovered a separate issue where the K8s CRD backend doesn't persist |
Update fix-duplicate-tcp-ports patch to preserve existing TCP ports when DrbdRscData is recreated during toggle-disk operations. Without this, removeLayerData() frees ports and ensureStackDataExists() may allocate different ones, causing port mismatches between controller and satellites if the satellite misses the update. Also add dh_strip_nondeterminism override in Dockerfile to fix build failures on some JAR files. Upstream: LINBIT/linstor-server#476 (comment) Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
| rscDataToProcess.addAll(rscData.getChildren()); | ||
| } | ||
|
|
||
| ensureStackDataExists(rscRef, null, new LayerPayload()); |
There was a problem hiding this comment.
I see how this is redundant with the call in CtrlRscToggleDiskApiCallHandler line 1327 . However here are my thoughts to this:
If anything, it is the call in CtrlRscToggleDiskApiCallHandler line 1327 that is the redundant one, since the resetStoragePools method could technically be called from other places in the future. Removing it as you wanted would add the risk that this future new code/feature forgets to call ensureStackDataExists which will leave a corrupt layer-tree within LINSTOR.
That would mean that the line 1327 of CtrlRscToggleDiskApiCallHandler could be moved into the end of the then-case of if (removeDisk) (i.e. currently between lines 1320 and 1321). However, with the same argument as before, adding some future new branch in the finishOperationInTransaction could also lead to forgetting to call ensureStackDataExists.
So here is what I'd rather suggest:
- Either keep the line 280 in
CtrlRscLayerDataFactoryto ensure that calling this method cannot be forgotten and instead just add a comment beforeCtrlRscToggleDiskApiCallHandlerline 1327 that we simply accept this redundancy - Make calling
ensureStackDataExistsinCtrlRscLayerDataFactoryline 280 optional with something like this:
public void resetStoragePools(Resource rscRef)
{
resetStoragePools(rscRef, true);
}
public void resetStoragePools(Resource rscRef, boolean callEnsureStackDataExistsRef)
{
...
if (callEnsureStackDataExistsRef)
{
ensureStackDataExists(...);
}
...
}And change the call in CtrlRscToggleDiskApiCallHandler line 1325 to ctrlLayerStackHelper.resetStoragePools(rsc, false);
The latter approach should prevent accidentally forgetting to call this in the future (worst case we add again a redundancy but never miss calling ensureStackDataExists entirely) but also gives the caller enough control to prevent redundancy.
| payload.drbdRsc.replacingOldLayerRscId = drbdRscData.getRscLayerId(); | ||
| payload.drbdRsc.nodeId = drbdRscData.getNodeId().value; | ||
| } | ||
| copyDrbdTcpPortsIfExists(rsc, payload); |
There was a problem hiding this comment.
I think this call is quite unintuitive. I understand why you added it, but if I see a method that is called copyA and notice that it also copies B sounds strange.
Instead of calling the method here I'd add a new method instead:
private void copyDrbdSettings(Resource rscRef, LayerPayload payloadRef)
{
copyDrbdNodeIdIfExists(rscRef, payloadRef);
copyDrbdTcpPortsIfExists(rscRef, payloadRef);
}and replace the two old calls to copyDrbdNodeIdIfExists with this new copyDrbdSettings method.
|
Thanks for the review! Addressed both points:
|
Remove redundant ensureStackDataExists() call with empty payload from resetStoragePools() method that was causing TCP port conflicts after toggle-disk operations. Root Cause: ----------- The resetStoragePools() method, introduced in 2019 (commit 95cc17d), calls ensureStackDataExists() with an empty LayerPayload. This worked correctly when TCP ports were stored at RscDfn level. After the TCP port migration to per-node level (commit f754943, May 2025), this empty payload results in DrbdRscData being created without TCP ports assigned. The controller then sends a Pojo with an empty port Set to satellites. On satellites, when DrbdRscData is initialized with an empty port list, initPorts() uses preferredNewPortsRef from peer resources. Since SatelliteDynamicNumberPool.tryAllocate() always returns true (no-op), any port from preferredNewPortsRef is accepted without conflict checking, leading to duplicate TCP port assignments. Impact: ------- This regression affects toggle-disk operations, particularly: - Snapshot creation/restore operations - Manual toggle-disk operations - Any operation calling resetStoragePools() Symptoms include: - DRBD resources failing to adjust with "port is also used" errors - Resources stuck in StandAlone or Connecting states - Multiple resources on the same node using identical TCP ports Solution: --------- Remove the ensureStackDataExists() call from resetStoragePools() as it is redundant. The calling code (e.g., CtrlRscToggleDiskApiCallHandler line 1071) already invokes ensureStackDataExists() with the correct payload immediately after resetStoragePools(). This fix ensures: 1. resetStoragePools() only resets storage pool assignments 2. Layer data creation with proper TCP ports happens via the caller's ensureStackDataExists() with correct payload 3. No DrbdRscData objects are created without TCP port assignments Related Issues: --------------- Fixes LINBIT#454 - Duplicate TCP ports after backup/restore operations Related to user reports of resources stuck in StandAlone after node reboots when toggle-disk or backup operations were in progress. Testing: -------- Verified that: - Toggle-disk operations no longer create resources without TCP ports - Backup/restore operations complete without TCP port conflicts - Resources maintain unique TCP ports across toggle-disk cycles Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
When removeLayerData() deletes DrbdRscData during toggle-disk, the TCP ports are freed from the number pool. ensureStackDataExists() then allocates new ports which may differ from the old ones if other resources claimed them in the meantime. The controller correctly avoids collisions in its own number pool, but if the satellite misses the update (e.g. due to controller restart or connectivity issues), it keeps the old ports while peers receive the new ones, causing DRBD connections to fail with StandAlone state. Add copyDrbdTcpPortsIfExists() to save existing TCP ports into the LayerPayload before removeLayerData() deletes them. Call it from copyDrbdNodeIdIfExists() (covers both toggle-disk paths) and from the needsDeactivate path (shared storage pool case). This ensures the same TCP ports are reused when DrbdRscData is recreated, eliminating the window for port mismatch between controller and satellites. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
- Introduce copyDrbdSettings() wrapper instead of calling copyDrbdTcpPortsIfExists() from within copyDrbdNodeIdIfExists() - Replace calls to copyDrbdNodeIdIfExists() with copyDrbdSettings() in both toggle-disk paths - Make ensureStackDataExists() in resetStoragePools() optional via boolean parameter to prevent accidental omission in future callers while allowing CtrlRscToggleDiskApiCallHandler to skip the redundant call Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
3a3040a to
bcc8990
Compare
## What this PR does Updates the `fix-duplicate-tcp-ports` patch to preserve existing TCP ports when DrbdRscData is recreated during toggle-disk operations. Without this fix, `removeLayerData()` frees TCP ports from the number pool, and `ensureStackDataExists()` may allocate different ports. If the satellite misses the update (e.g. due to controller restart), it keeps the old ports while peers receive the new ones, causing DRBD connections to fail with StandAlone state. The fix adds `copyDrbdTcpPortsIfExists()` which saves existing TCP ports into the `LayerPayload` before `removeLayerData()` deletes them. Also adds `dh_strip_nondeterminism` override in Dockerfile to fix build failures on some JAR files. Upstream: LINBIT/linstor-server#476 (comment) ### Release note \`\`\`release-note [linstor] Fix TCP port mismatches after toggle-disk operations that could cause DRBD resources to enter StandAlone state \`\`\` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Fixed an issue where DRBD TCP ports were not correctly preserved during disk toggle operations, which could result in TCP port mismatches between the controller and satellite nodes. * Improved robustness of the build and packaging process by addressing non-determinism handling for Java library dependencies. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Update fix-duplicate-tcp-ports patch to preserve existing TCP ports when DrbdRscData is recreated during toggle-disk operations. Without this, removeLayerData() frees ports and ensureStackDataExists() may allocate different ones, causing port mismatches between controller and satellites if the satellite misses the update. Also add dh_strip_nondeterminism override in Dockerfile to fix build failures on some JAR files. Upstream: LINBIT/linstor-server#476 (comment) Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> (cherry picked from commit 812d413)
Update fix-duplicate-tcp-ports patch to preserve existing TCP ports when DrbdRscData is recreated during toggle-disk operations. Without this, removeLayerData() frees ports and ensureStackDataExists() may allocate different ones, causing port mismatches between controller and satellites if the satellite misses the update. Also add dh_strip_nondeterminism override in Dockerfile to fix build failures on some JAR files. Upstream: LINBIT/linstor-server#476 (comment) Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> (cherry picked from commit 812d413)
|
We've integrated this change into Cozystack as part of cozystack/cozystack#2331 |
This PR fixes a regression introduced after the TCP port migration to per-node level (commit f754943, May 2025) that causes duplicate TCP port assignments during toggle-disk operations.
Root Cause
The
resetStoragePools()method, originally written in 2019 (commit 95cc17d), callsensureStackDataExists()with an emptyLayerPayload. This was correct when TCP ports were stored at RscDfn level but became a regression after the port migration:DrbdRscDatacreated without TCP portsinitPorts()usespreferredNewPortsReffrom peer resourcesSatelliteDynamicNumberPool.tryAllocate()always returns true (no-op)Impact
Affected operations:
resetStoragePools()Symptoms:
"port is also used"errorsStandAloneorConnectingstatesSolution
Remove the redundant
ensureStackDataExists()call fromresetStoragePools(). The calling code (e.g.,CtrlRscToggleDiskApiCallHandler:1071) already invokesensureStackDataExists()with the correct payload immediately afterresetStoragePools().This ensures:
resetStoragePools()only resets storage pool assignmentsensureStackDataExists()DrbdRscDataobjects created without TCP port assignmentsFixes
Testing
Verified that: