fix(drbd): prevent duplicate TCP ports after toggle-disk operations by kvaps · Pull Request #476 · LINBIT/linstor-server

kvaps · 2026-01-12T12:43:10Z

This PR fixes a regression introduced after the TCP port migration to per-node level (commit f754943, May 2025) that causes duplicate TCP port assignments during toggle-disk operations.

Root Cause

The resetStoragePools() method, originally written in 2019 (commit 95cc17d), calls ensureStackDataExists() with an empty LayerPayload. This was correct when TCP ports were stored at RscDfn level but became a regression after the port migration:

1. Empty payload → DrbdRscData created without TCP ports
2. Controller sends Pojo with empty port Set to satellites
3. Satellite's initPorts() uses preferredNewPortsRef from peer resources
4. SatelliteDynamicNumberPool.tryAllocate() always returns true (no-op)
5. Random ports from peer resources get assigned → duplicate TCP ports
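
To illustrate why a no-op tryAllocate() cannot detect conflicts, here is a minimal, self-contained sketch. It is not LINSTOR code: PortPool, TrackingPortPool, NoOpPortPool and pickPort() are hypothetical names invented for this example, and the port numbers are arbitrary.

```java
import java.util.HashSet;
import java.util.Set;

class DuplicatePortDemo {
    /** Picks the first preferred port the pool accepts, or -1 if none is accepted. */
    static int pickPort(PortPool pool, int[] preferredPorts) {
        for (int port : preferredPorts) {
            if (pool.tryAllocate(port)) {
                return port;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] preferred = {7000, 7001};

        // Two resources ask the no-op pool: both end up with the same port.
        PortPool noOp = new NoOpPortPool();
        System.out.println(pickPort(noOp, preferred) + " " + pickPort(noOp, preferred));

        // With real bookkeeping the second resource is pushed to the next port.
        PortPool tracking = new TrackingPortPool();
        System.out.println(pickPort(tracking, preferred) + " " + pickPort(tracking, preferred));
    }
}

/** Hypothetical stand-in for a number pool; not the LINSTOR interface. */
interface PortPool {
    /** Returns true if the port was accepted for the caller. */
    boolean tryAllocate(int port);
}

/** Tracks reservations, so a second request for the same port is rejected. */
class TrackingPortPool implements PortPool {
    private final Set<Integer> used = new HashSet<>();

    @Override
    public boolean tryAllocate(int port) {
        return used.add(port); // false when the port is already reserved
    }
}

/** Mimics the no-op behaviour described above: every request succeeds. */
class NoOpPortPool implements PortPool {
    @Override
    public boolean tryAllocate(int port) {
        return true; // no bookkeeping, so no conflict detection
    }
}
```

With the no-op pool, any port taken from a peer's preferred list is accepted, even if another resource on the same node already uses it.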

Impact

Affected operations:

- Snapshot creation/restore operations
- Manual toggle-disk operations
- Any operation calling resetStoragePools()

Symptoms:

- DRBD resources fail to adjust: "port is also used" errors
- Resources stuck in StandAlone or Connecting states
- Multiple resources on same node using identical TCP ports

Solution

Remove the redundant ensureStackDataExists() call from resetStoragePools(). The calling code (e.g., CtrlRscToggleDiskApiCallHandler:1071) already invokes ensureStackDataExists() with the correct payload immediately after resetStoragePools().

This ensures:

resetStoragePools() only resets storage pool assignments
Layer data creation with proper TCP ports happens via caller's ensureStackDataExists()
No DrbdRscData objects created without TCP port assignments

Fixes

Closes #454 (Error creating rollback entry - Resource SecObjectProtection not found): duplicate TCP ports after backup/restore operations
Fixes user reports of resources stuck in StandAlone after node reboots when toggle-disk/backup operations were in progress

Testing

Verified that:

Toggle-disk operations no longer create resources without TCP ports
Backup/restore operations complete without TCP port conflicts
Resources maintain unique TCP ports across toggle-disk cycles

## What this PR does

This PR updates piraeus-server patches to address several critical production issues with DRBD resources and LUKS encryption:

1. **Add fix-duplicate-tcp-ports.diff** - Prevents duplicate TCP ports after toggle-disk operations (upstream PR #476)
2. **Update skip-adjust-when-device-inaccessible.diff** - Comprehensive fix for multiple issues:
   - Resources stuck in StandAlone state after node reboot
   - Unknown state race condition during satellite restart
   - Encrypted LUKS resource deletion failures
   - Network reconnect blocked by unavailable child device checks

These patches resolve scenarios where DRBD resources fail to automatically reconnect after node reboots and improve LUKS resource lifecycle management.

Upstream PRs:

- LINBIT/linstor-server#476
- LINBIT/linstor-server#477

### Release note

```release-note
[linstor] Fix DRBD resources stuck in StandAlone state after reboot and encrypted resource deletion issues
```

## Summary by CodeRabbit

* **Bug Fixes**
  * Prevents duplicate TCP port conflicts after disk toggle operations
  * Fixes resources stuck in StandAlone or Unknown state after reboot
  * Resolves issues with encrypted resource deletion
  * Improves handling of temporarily inaccessible storage devices

kvaps · 2026-03-28T04:35:14Z

Updated analysis — the original fix in this PR is incomplete

After extensive debugging on a production cluster, we found that removing the redundant ensureStackDataExists() from resetStoragePools() is correct but doesn't address the root cause of TCP port mismatches.

Root cause

During toggle-disk operations, removeLayerData() deletes DrbdRscData (freeing TCP ports from the number pool), then ensureStackDataExists() creates new DrbdRscData with an empty LayerPayload (no tcpPorts). The controller's initPorts() allocates new ports from the pool — which may differ from the old ports if other resources claimed them in the meantime.

The controller correctly avoids collisions in its own number pool. But if the satellite misses the update (e.g. during controller restart, network issue, or drbdadm adjust failure), it keeps the old ports while peers receive the new ones, causing DRBD connection failures (StandAlone/Connecting).
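
The free-then-reallocate window can be shown with a small simulation. This is a sketch under assumptions, not the controller's allocator: SimplePortPool, allocateLowestFree() and the 7000-7999 range are hypothetical, and the real pool may choose ports differently.

```java
import java.util.TreeSet;

class PortDriftDemo {
    /** Returns {port before removal, port taken by another resource, port after recreation}. */
    static int[] run() {
        SimplePortPool pool = new SimplePortPool(7000, 7999);

        int oldPort = pool.allocateLowestFree();   // port the satellite already knows
        pool.free(oldPort);                        // corresponds to removeLayerData() freeing it
        int otherPort = pool.allocateLowestFree(); // another resource claims the freed port
        int newPort = pool.allocateLowestFree();   // recreated DrbdRscData gets a different port

        return new int[] {oldPort, otherPort, newPort};
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println("old=" + r[0] + " other=" + r[1] + " new=" + r[2]);
    }
}

/** Hypothetical controller-side pool that hands out the lowest free port in a range. */
class SimplePortPool {
    private final TreeSet<Integer> used = new TreeSet<>();
    private final int min;
    private final int max;

    SimplePortPool(int min, int max) {
        this.min = min;
        this.max = max;
    }

    int allocateLowestFree() {
        for (int port = min; port <= max; port++) {
            if (used.add(port)) {
                return port;
            }
        }
        throw new IllegalStateException("port pool exhausted");
    }

    void free(int port) {
        used.remove(port);
    }
}
```

The pool never hands out a duplicate, yet the resource ends up with a port that differs from the one a satellite may still have on disk.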

Evidence from production

- Controller had zero port collisions: all ports were unique
- Satellite .res files had different ports from what the controller assigned
- Example: pvc-d2c5ade3 on plo-csxhk-004: controller=7187, satellite=7188
- Peer node (gld-csxhk-006) correctly had 7187 in its .res for plo-csxhk-004
- Same pattern confirmed on multiple nodes with 3+ resources affected
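
A comparison like the one above can be scripted. The sketch below extracts the ports from `address` statements in a DRBD resource file and checks them against the port the controller reports. ResPortCheck and its methods are hypothetical helpers, and the sample file text in main() is made up for illustration; it is not copied from the affected cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResPortCheck {
    // Matches "address [family] <host>:<port>;" and captures the port.
    private static final Pattern ADDRESS_PORT =
        Pattern.compile("address\\s+(?:\\w+\\s+)?\\S+?:(\\d+)\\s*;");

    /** Returns every port found in address statements, in file order. */
    static List<Integer> portsIn(String resFileText) {
        List<Integer> ports = new ArrayList<>();
        Matcher matcher = ADDRESS_PORT.matcher(resFileText);
        while (matcher.find()) {
            ports.add(Integer.parseInt(matcher.group(1)));
        }
        return ports;
    }

    /** True if the port assigned by the controller appears in the file. */
    static boolean containsPort(String resFileText, int controllerPort) {
        return portsIn(resFileText).contains(controllerPort);
    }

    public static void main(String[] args) {
        // Made-up sample content; real files are generated by the satellite.
        String sample =
            "on node-a {\n" +
            "    address 10.0.0.1:7188;\n" +
            "}\n";
        int controllerPort = 7187; // value the controller would report for this resource
        System.out.println("ports in file: " + portsIn(sample));
        System.out.println("matches controller: " + containsPort(sample, controllerPort));
    }
}
```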

Proposed expanded fix

Preserve existing TCP ports during toggle-disk by copying them into the LayerPayload before removeLayerData() deletes them — similar to how copyDrbdNodeIdIfExists() already preserves the node-id.

private void copyDrbdTcpPortsIfExists(Resource rsc, LayerPayload payload) {
    Set<AbsRscLayerObject<Resource>> drbdRscDataSet = LayerRscUtils.getRscDataByLayer(
        getLayerData(apiCtx, rsc), DeviceLayerKind.DRBD);
    if (!drbdRscDataSet.isEmpty()) {
        DrbdRscData<Resource> drbdRscData = (DrbdRscData<Resource>) drbdRscDataSet.iterator().next();
        Collection<TcpPortNumber> tcpPorts = drbdRscData.getTcpPortList();
        if (tcpPorts != null && !tcpPorts.isEmpty()) {
            Set<Integer> portInts = new TreeSet<>();
            for (TcpPortNumber port : tcpPorts) {
                portInts.add(port.value);
            }
            payload.drbdRsc.tcpPorts = portInts;
        }
    }
}

Called from:

copyDrbdNodeIdIfExists() — covers both toggle-disk paths (normal and finishOperationInTransaction)
needsDeactivate path — shared storage pool case where node-id changes but ports should be preserved

This ensures the same TCP ports are reused when DrbdRscData is recreated, eliminating the window for port mismatch between controller and satellites.
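
The effect of carrying the ports across the recreate step can be sketched as follows. This is an illustration under assumptions, not the patch itself: PreferringPortPool, allocate() and recreate() are hypothetical, and the real code passes the ports through LayerPayload rather than a plain Set.

```java
import java.util.Set;
import java.util.TreeSet;

class PreservePortsDemo {
    /** Frees the current ports, then re-requests exactly the saved ones. */
    static Set<Integer> recreate(PreferringPortPool pool, Set<Integer> currentPorts) {
        Set<Integer> saved = new TreeSet<>(currentPorts); // copy before anything is freed
        for (int port : saved) {
            pool.free(port);
        }
        Set<Integer> recreated = new TreeSet<>();
        for (int port : saved) {
            recreated.add(pool.allocate(port)); // preferred port is tried first
        }
        return recreated;
    }

    public static void main(String[] args) {
        PreferringPortPool pool = new PreferringPortPool(7000, 7010);
        pool.allocate(null);               // some other resource
        int mine = pool.allocate(null);    // the resource being toggled
        System.out.println("before=" + mine + " after=" + recreate(pool, Set.of(mine)));
    }
}

/** Hypothetical pool: honours a preferred port if free, else falls back to the lowest free one. */
class PreferringPortPool {
    private final TreeSet<Integer> used = new TreeSet<>();
    private final int min;
    private final int max;

    PreferringPortPool(int min, int max) {
        this.min = min;
        this.max = max;
    }

    int allocate(Integer preferred) {
        if (preferred != null && used.add(preferred)) {
            return preferred;
        }
        for (int port = min; port <= max; port++) {
            if (used.add(port)) {
                return port;
            }
        }
        throw new IllegalStateException("port pool exhausted");
    }

    void free(int port) {
        used.remove(port);
    }
}
```

In this sketch the preference only holds if nothing else claims the port between free() and allocate(); the fallback branch shows what happens otherwise.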

kvaps · 2026-03-28T07:30:33Z

Pushed the expanded fix (cf28218) that preserves TCP ports during toggle-disk operations.

The new commit adds copyDrbdTcpPortsIfExists() which saves existing TCP ports into the LayerPayload before removeLayerData() deletes them. This is called from:

copyDrbdNodeIdIfExists() — covers both normal toggle-disk paths
needsDeactivate path — shared storage pool case

Combined with the original commit (removing redundant ensureStackDataExists from resetStoragePools), this prevents TCP port reassignment during toggle-disk operations.

Note: We also discovered a separate issue where the K8s CRD backend doesn't persist tcp_port_list for DrbdRscData entries created after the v1.31.1 migration (e.g. via toggle-disk). The migration correctly populates existing entries, but new entries created afterward are missing the field. This causes the controller to re-allocate ports on restart for those entries. This issue is not addressed in this PR and should be tracked separately.

Update fix-duplicate-tcp-ports patch to preserve existing TCP ports when DrbdRscData is recreated during toggle-disk operations. Without this, removeLayerData() frees ports and ensureStackDataExists() may allocate different ones, causing port mismatches between controller and satellites if the satellite misses the update.

Also add dh_strip_nondeterminism override in Dockerfile to fix build failures on some JAR files.

Upstream: LINBIT/linstor-server#476 (comment)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

ghernadi · 2026-03-30T05:58:07Z

controller/src/main/java/com/linbit/linstor/layer/resource/CtrlRscLayerDataFactory.java

                rscDataToProcess.addAll(rscData.getChildren());
            }
-
-            ensureStackDataExists(rscRef, null, new LayerPayload());


I see how this is redundant with the call in CtrlRscToggleDiskApiCallHandler line 1327. However, here are my thoughts on this:

If anything, it is the call in CtrlRscToggleDiskApiCallHandler line 1327 that is the redundant one, since the resetStoragePools method could technically be called from other places in the future. Removing it as you wanted would add the risk that this future new code/feature forgets to call ensureStackDataExists which will leave a corrupt layer-tree within LINSTOR.

That would mean that the line 1327 of CtrlRscToggleDiskApiCallHandler could be moved into the end of the then-case of if (removeDisk) (i.e. currently between lines 1320 and 1321). However, with the same argument as before, adding some future new branch in the finishOperationInTransaction could also lead to forgetting to call ensureStackDataExists.

So here is what I'd rather suggest:

1. Either keep line 280 in CtrlRscLayerDataFactory, so that calling this method cannot be forgotten, and just add a comment before CtrlRscToggleDiskApiCallHandler line 1327 stating that we simply accept this redundancy.

2. Or make the call to ensureStackDataExists in CtrlRscLayerDataFactory line 280 optional with something like this:

public void resetStoragePools(Resource rscRef) {
    resetStoragePools(rscRef, true);
}

public void resetStoragePools(Resource rscRef, boolean callEnsureStackDataExistsRef) {
    ...
    if (callEnsureStackDataExistsRef) {
        ensureStackDataExists(...);
    }
    ...
}

And change the call in CtrlRscToggleDiskApiCallHandler line 1325 to ctrlLayerStackHelper.resetStoragePools(rsc, false);

The latter approach should prevent accidentally forgetting to call this in the future (worst case we add again a redundancy but never miss calling ensureStackDataExists entirely) but also gives the caller enough control to prevent redundancy.
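
A self-contained sketch of this overload pattern, for illustration only: the class name, the String placeholders standing in for Resource and LayerPayload, and the call counters are all hypothetical and exist only to make the behaviour observable.

```java
class ResetStoragePoolsSketch {
    int resetCalls = 0;
    int ensureCalls = 0;

    /** Default entry point: keeps the safe behaviour for any future caller. */
    void resetStoragePools(String rsc) {
        resetStoragePools(rsc, true);
    }

    /** Callers that create the layer data themselves can opt out of the extra call. */
    void resetStoragePools(String rsc, boolean callEnsureStackDataExists) {
        resetCalls++; // stands in for the actual storage pool reset
        if (callEnsureStackDataExists) {
            ensureStackDataExists(rsc, "empty payload");
        }
    }

    void ensureStackDataExists(String rsc, String payload) {
        ensureCalls++; // stands in for layer data creation
    }

    public static void main(String[] args) {
        ResetStoragePoolsSketch helper = new ResetStoragePoolsSketch();

        helper.resetStoragePools("rsc");        // default path still ensures layer data
        helper.resetStoragePools("rsc", false); // toggle-disk path skips the redundant call
        helper.ensureStackDataExists("rsc", "payload with preserved ports");

        System.out.println("resets=" + helper.resetCalls + " ensures=" + helper.ensureCalls);
    }
}
```

The one-argument overload delegates with true, so forgetting the flag can at worst reintroduce a redundant call and never skips layer data creation.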

ghernadi · 2026-03-30T06:03:22Z

.../java/com/linbit/linstor/core/apicallhandler/controller/CtrlRscToggleDiskApiCallHandler.java

            payload.drbdRsc.replacingOldLayerRscId = drbdRscData.getRscLayerId();
            payload.drbdRsc.nodeId = drbdRscData.getNodeId().value;
        }
+        copyDrbdTcpPortsIfExists(rsc, payload);


I think this call is quite unintuitive. I understand why you added it, but a method called copyA that also copies B reads strangely.

Instead of calling the method here I'd add a new method instead:

private void copyDrbdSettings(Resource rscRef, LayerPayload payloadRef) {
    copyDrbdNodeIdIfExists(rscRef, payloadRef);
    copyDrbdTcpPortsIfExists(rscRef, payloadRef);
}

and replace the two old calls to copyDrbdNodeIdIfExists with this new copyDrbdSettings method.

kvaps · 2026-03-30T12:54:00Z

Thanks for the review! Addressed both points:

resetStoragePools now has an optional callEnsureStackDataExistsRef parameter (defaults to true), and CtrlRscToggleDiskApiCallHandler passes false to skip the redundant call.
Replaced the unintuitive copyDrbdTcpPortsIfExists call inside copyDrbdNodeIdIfExists with a new copyDrbdSettings wrapper that calls both methods. All call sites now use copyDrbdSettings except the needsDeactivate path which only needs copyDrbdTcpPortsIfExists (node-id changes for shared storage).

Remove redundant ensureStackDataExists() call with empty payload from resetStoragePools() method that was causing TCP port conflicts after toggle-disk operations.

Root Cause:
-----------
The resetStoragePools() method, introduced in 2019 (commit 95cc17d), calls ensureStackDataExists() with an empty LayerPayload. This worked correctly when TCP ports were stored at RscDfn level.

After the TCP port migration to per-node level (commit f754943, May 2025), this empty payload results in DrbdRscData being created without TCP ports assigned. The controller then sends a Pojo with an empty port Set to satellites.

On satellites, when DrbdRscData is initialized with an empty port list, initPorts() uses preferredNewPortsRef from peer resources. Since SatelliteDynamicNumberPool.tryAllocate() always returns true (no-op), any port from preferredNewPortsRef is accepted without conflict checking, leading to duplicate TCP port assignments.

Impact:
-------
This regression affects toggle-disk operations, particularly:
- Snapshot creation/restore operations
- Manual toggle-disk operations
- Any operation calling resetStoragePools()

Symptoms include:
- DRBD resources failing to adjust with "port is also used" errors
- Resources stuck in StandAlone or Connecting states
- Multiple resources on the same node using identical TCP ports

Solution:
---------
Remove the ensureStackDataExists() call from resetStoragePools() as it is redundant. The calling code (e.g., CtrlRscToggleDiskApiCallHandler line 1071) already invokes ensureStackDataExists() with the correct payload immediately after resetStoragePools().

This fix ensures:
1. resetStoragePools() only resets storage pool assignments
2. Layer data creation with proper TCP ports happens via the caller's ensureStackDataExists() with correct payload
3. No DrbdRscData objects are created without TCP port assignments

Related Issues:
---------------
Fixes LINBIT#454 - Duplicate TCP ports after backup/restore operations

Related to user reports of resources stuck in StandAlone after node reboots when toggle-disk or backup operations were in progress.

Testing:
--------
Verified that:
- Toggle-disk operations no longer create resources without TCP ports
- Backup/restore operations complete without TCP port conflicts
- Resources maintain unique TCP ports across toggle-disk cycles

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

When removeLayerData() deletes DrbdRscData during toggle-disk, the TCP ports are freed from the number pool. ensureStackDataExists() then allocates new ports which may differ from the old ones if other resources claimed them in the meantime.

The controller correctly avoids collisions in its own number pool, but if the satellite misses the update (e.g. due to controller restart or connectivity issues), it keeps the old ports while peers receive the new ones, causing DRBD connections to fail with StandAlone state.

Add copyDrbdTcpPortsIfExists() to save existing TCP ports into the LayerPayload before removeLayerData() deletes them. Call it from copyDrbdNodeIdIfExists() (covers both toggle-disk paths) and from the needsDeactivate path (shared storage pool case).

This ensures the same TCP ports are reused when DrbdRscData is recreated, eliminating the window for port mismatch between controller and satellites.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

- Introduce copyDrbdSettings() wrapper instead of calling copyDrbdTcpPortsIfExists() from within copyDrbdNodeIdIfExists()
- Replace calls to copyDrbdNodeIdIfExists() with copyDrbdSettings() in both toggle-disk paths
- Make ensureStackDataExists() in resetStoragePools() optional via boolean parameter to prevent accidental omission in future callers while allowing CtrlRscToggleDiskApiCallHandler to skip the redundant call

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

## What this PR does

Updates the `fix-duplicate-tcp-ports` patch to preserve existing TCP ports when DrbdRscData is recreated during toggle-disk operations.

Without this fix, `removeLayerData()` frees TCP ports from the number pool, and `ensureStackDataExists()` may allocate different ports. If the satellite misses the update (e.g. due to controller restart), it keeps the old ports while peers receive the new ones, causing DRBD connections to fail with StandAlone state.

The fix adds `copyDrbdTcpPortsIfExists()` which saves existing TCP ports into the `LayerPayload` before `removeLayerData()` deletes them.

Also adds `dh_strip_nondeterminism` override in Dockerfile to fix build failures on some JAR files.

Upstream: LINBIT/linstor-server#476 (comment)

### Release note

```release-note
[linstor] Fix TCP port mismatches after toggle-disk operations that could cause DRBD resources to enter StandAlone state
```

## Summary by CodeRabbit

* **Bug Fixes**
  * Fixed an issue where DRBD TCP ports were not correctly preserved during disk toggle operations, which could result in TCP port mismatches between the controller and satellite nodes.
  * Improved robustness of the build and packaging process by addressing non-determinism handling for Java library dependencies.

Update fix-duplicate-tcp-ports patch to preserve existing TCP ports when DrbdRscData is recreated during toggle-disk operations. Without this, removeLayerData() frees ports and ensureStackDataExists() may allocate different ones, causing port mismatches between controller and satellites if the satellite misses the update.

Also add dh_strip_nondeterminism override in Dockerfile to fix build failures on some JAR files.

Upstream: LINBIT/linstor-server#476 (comment)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
(cherry picked from commit 812d413)

kvaps · 2026-04-03T20:58:17Z

We've integrated this change into Cozystack as part of cozystack/cozystack#2331

kvaps force-pushed the fix/drbd-duplicate-tcp-ports-after-toggle-disk branch from 1294106 to 279be43 Compare January 12, 2026 12:44

kvaps mentioned this pull request Jan 12, 2026

[linstor] Update piraeus-server patches with critical fixes cozystack/cozystack#1850

Merged

kvaps mentioned this pull request Jan 14, 2026

fix(drbd): use actual device path in res file during toggle-disk #473

Closed

4 tasks

kvaps mentioned this pull request Mar 28, 2026

K8s CRD backend: tcp_port_list not persisted in LayerDrbdResources #489

Closed

kvaps mentioned this pull request Mar 28, 2026

[linstor] Preserve TCP ports during toggle-disk operations cozystack/cozystack#2292

Merged

ghernadi requested changes Mar 30, 2026


kvaps and others added 3 commits March 30, 2026 15:10

kvaps force-pushed the fix/drbd-duplicate-tcp-ports-after-toggle-disk branch from 3a3040a to bcc8990 Compare March 30, 2026 13:12

kvaps requested a review from ghernadi March 30, 2026 18:06

kvaps wants to merge 3 commits into LINBIT:master from kvaps:fix/drbd-duplicate-tcp-ports-after-toggle-disk
