Skip to content

fix(drbd): prevent duplicate TCP ports after toggle-disk operations#476

Open
kvaps wants to merge 3 commits intoLINBIT:masterfrom
kvaps:fix/drbd-duplicate-tcp-ports-after-toggle-disk
Open

fix(drbd): prevent duplicate TCP ports after toggle-disk operations#476
kvaps wants to merge 3 commits intoLINBIT:masterfrom
kvaps:fix/drbd-duplicate-tcp-ports-after-toggle-disk

Conversation

@kvaps
Copy link
Copy Markdown

@kvaps kvaps commented Jan 12, 2026

This PR fixes a regression introduced after the TCP port migration to per-node level (commit f754943, May 2025) that causes duplicate TCP port assignments during toggle-disk operations.

Root Cause

The resetStoragePools() method, originally written in 2019 (commit 95cc17d), calls ensureStackDataExists() with an empty LayerPayload. This was correct when TCP ports were stored at RscDfn level but became a regression after the port migration:

  1. Empty payload → DrbdRscData created without TCP ports
  2. Controller sends Pojo with empty port Set to satellites
  3. Satellite's initPorts() uses preferredNewPortsRef from peer resources
  4. SatelliteDynamicNumberPool.tryAllocate() always returns true (no-op)
  5. Random ports from peer resources get assigned → duplicate TCP ports

Impact

Affected operations:

  • Snapshot creation/restore operations
  • Manual toggle-disk operations
  • Any operation calling resetStoragePools()

Symptoms:

  • DRBD resources fail to adjust: "port is also used" errors
  • Resources stuck in StandAlone or Connecting states
  • Multiple resources on same node using identical TCP ports

Solution

Remove the redundant ensureStackDataExists() call from resetStoragePools(). The calling code (e.g., CtrlRscToggleDiskApiCallHandler:1071) already invokes ensureStackDataExists() with the correct payload immediately after resetStoragePools().

This ensures:

  • resetStoragePools() only resets storage pool assignments
  • Layer data creation with proper TCP ports happens via caller's ensureStackDataExists()
  • No DrbdRscData objects created without TCP port assignments

Fixes

Testing

Verified that:

  • Toggle-disk operations no longer create resources without TCP ports
  • Backup/restore operations complete without TCP port conflicts
  • Resources maintain unique TCP ports across toggle-disk cycles

@kvaps kvaps force-pushed the fix/drbd-duplicate-tcp-ports-after-toggle-disk branch from 1294106 to 279be43 Compare January 12, 2026 12:44
kvaps added a commit to cozystack/cozystack that referenced this pull request Jan 12, 2026
## What this PR does

This PR updates piraeus-server patches to address several critical
production issues with DRBD resources and LUKS encryption:

1. **Add fix-duplicate-tcp-ports.diff** - Prevents duplicate TCP ports
after toggle-disk operations (upstream PR #476)

2. **Update skip-adjust-when-device-inaccessible.diff** - Comprehensive
fix for multiple issues:
   - Resources stuck in StandAlone state after node reboot
   - Unknown state race condition during satellite restart
   - Encrypted LUKS resource deletion failures
   - Network reconnect blocked by unavailable child device checks

These patches resolve scenarios where DRBD resources fail to
automatically reconnect after node reboots and improve LUKS resource
lifecycle management.

Upstream PRs:
- LINBIT/linstor-server#476
- LINBIT/linstor-server#477

### Release note

```release-note
[linstor] Fix DRBD resources stuck in StandAlone state after reboot and encrypted resource deletion issues
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
  * Prevents duplicate TCP port conflicts after disk toggle operations
  * Fixes resources stuck in StandAlone or Unknown state after reboot
  * Resolves issues with encrypted resource deletion
  * Improves handling of temporarily inaccessible storage devices

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@kvaps
Copy link
Copy Markdown
Author

kvaps commented Mar 28, 2026

Updated analysis — the original fix in this PR is incomplete

After extensive debugging on a production cluster, we found that removing the redundant ensureStackDataExists() from resetStoragePools() is correct but doesn't address the root cause of TCP port mismatches.

Root cause

During toggle-disk operations, removeLayerData() deletes DrbdRscData (freeing TCP ports from the number pool), then ensureStackDataExists() creates new DrbdRscData with an empty LayerPayload (no tcpPorts). The controller's initPorts() allocates new ports from the pool — which may differ from the old ports if other resources claimed them in the meantime.

The controller correctly avoids collisions in its own number pool. But if the satellite misses the update (e.g. during controller restart, network issue, or drbdadm adjust failure), it keeps the old ports while peers receive the new ones, causing DRBD connection failures (StandAlone/Connecting).

Evidence from production

  • Controller had zero port collisions — all ports were unique
  • Satellite .res files had different ports from what the controller assigned
  • Example: pvc-d2c5ade3 on plo-csxhk-004 — controller=7187, satellite=7188
  • Peer node (gld-csxhk-006) correctly had 7187 in its .res for plo-csxhk-004
  • Same pattern confirmed on multiple nodes with 3+ resources affected

Proposed expanded fix

Preserve existing TCP ports during toggle-disk by copying them into the LayerPayload before removeLayerData() deletes them — similar to how copyDrbdNodeIdIfExists() already preserves the node-id.

private void copyDrbdTcpPortsIfExists(Resource rsc, LayerPayload payload) {
    Set<AbsRscLayerObject<Resource>> drbdRscDataSet = LayerRscUtils.getRscDataByLayer(
        getLayerData(apiCtx, rsc), DeviceLayerKind.DRBD);
    if (!drbdRscDataSet.isEmpty()) {
        DrbdRscData<Resource> drbdRscData = (DrbdRscData<Resource>) drbdRscDataSet.iterator().next();
        Collection<TcpPortNumber> tcpPorts = drbdRscData.getTcpPortList();
        if (tcpPorts != null && !tcpPorts.isEmpty()) {
            Set<Integer> portInts = new TreeSet<>();
            for (TcpPortNumber port : tcpPorts) {
                portInts.add(port.value);
            }
            payload.drbdRsc.tcpPorts = portInts;
        }
    }
}

Called from:

  1. copyDrbdNodeIdIfExists() — covers both toggle-disk paths (normal and finishOperationInTransaction)
  2. needsDeactivate path — shared storage pool case where node-id changes but ports should be preserved

This ensures the same TCP ports are reused when DrbdRscData is recreated, eliminating the window for port mismatch between controller and satellites.

@kvaps
Copy link
Copy Markdown
Author

kvaps commented Mar 28, 2026

Pushed the expanded fix (cf28218) that preserves TCP ports during toggle-disk operations.

The new commit adds copyDrbdTcpPortsIfExists() which saves existing TCP ports into the LayerPayload before removeLayerData() deletes them. This is called from:

  1. copyDrbdNodeIdIfExists() — covers both normal toggle-disk paths
  2. needsDeactivate path — shared storage pool case

Combined with the original commit (removing redundant ensureStackDataExists from resetStoragePools), this prevents TCP port reassignment during toggle-disk operations.

Note: We also discovered a separate issue where the K8s CRD backend doesn't persist tcp_port_list for DrbdRscData entries created after the v1.31.1 migration (e.g. via toggle-disk). The migration correctly populates existing entries, but new entries created afterward are missing the field. This causes the controller to re-allocate ports on restart for those entries. This issue is not addressed in this PR and should be tracked separately.

kvaps added a commit to kvaps/cozystack that referenced this pull request Mar 28, 2026
Update fix-duplicate-tcp-ports patch to preserve existing TCP ports when
DrbdRscData is recreated during toggle-disk operations. Without this,
removeLayerData() frees ports and ensureStackDataExists() may allocate
different ones, causing port mismatches between controller and satellites
if the satellite misses the update.

Also add dh_strip_nondeterminism override in Dockerfile to fix build
failures on some JAR files.

Upstream: LINBIT/linstor-server#476 (comment)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
rscDataToProcess.addAll(rscData.getChildren());
}

ensureStackDataExists(rscRef, null, new LayerPayload());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see how this is redundant with the call in CtrlRscToggleDiskApiCallHandler line 1327 . However here are my thoughts to this:

If anything, it is the call in CtrlRscToggleDiskApiCallHandler line 1327 that is the redundant one, since the resetStoragePools method could technically be called from other places in the future. Removing it as you wanted would add the risk that this future new code/feature forgets to call ensureStackDataExists which will leave a corrupt layer-tree within LINSTOR.

That would mean that the line 1327 of CtrlRscToggleDiskApiCallHandler could be moved into the end of the then-case of if (removeDisk) (i.e. currently between lines 1320 and 1321). However, with the same argument as before, adding some future new branch in the finishOperationInTransaction could also lead to forgetting to call ensureStackDataExists.

So here is what I'd rather suggest:

  • Either keep the line 280 in CtrlRscLayerDataFactory to ensure that calling this method cannot be forgotten and instead just add a comment before CtrlRscToggleDiskApiCallHandler line 1327 that we simply accept this redundancy
  • Make calling ensureStackDataExists in CtrlRscLayerDataFactory line 280 optional with something like this:
public void resetStoragePools(Resource rscRef)
{
    resetStoragePools(rscRef, true);
}

public void resetStoragePools(Resource rscRef, boolean callEnsureStackDataExistsRef)
{
    ... 
    if (callEnsureStackDataExistsRef)
    {
        ensureStackDataExists(...);
    }
    ...
}

And change the call in CtrlRscToggleDiskApiCallHandler line 1325 to ctrlLayerStackHelper.resetStoragePools(rsc, false);

The latter approach should prevent accidentally forgetting to call this in the future (worst case we add again a redundancy but never miss calling ensureStackDataExists entirely) but also gives the caller enough control to prevent redundancy.

payload.drbdRsc.replacingOldLayerRscId = drbdRscData.getRscLayerId();
payload.drbdRsc.nodeId = drbdRscData.getNodeId().value;
}
copyDrbdTcpPortsIfExists(rsc, payload);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this call is quite unintuitive. I understand why you added it, but if I see a method that is called copyA and notice that it also copies B sounds strange.

Instead of calling the method here I'd add a new method instead:

private void copyDrbdSettings(Resource rscRef, LayerPayload payloadRef)
{
    copyDrbdNodeIdIfExists(rscRef, payloadRef);
    copyDrbdTcpPortsIfExists(rscRef, payloadRef);
}

and replace the two old calls to copyDrbdNodeIdIfExists with this new copyDrbdSettings method.

@kvaps
Copy link
Copy Markdown
Author

kvaps commented Mar 30, 2026

Thanks for the review! Addressed both points:

  1. resetStoragePools now has an optional callEnsureStackDataExistsRef parameter (defaults to true), and CtrlRscToggleDiskApiCallHandler passes false to skip the redundant call.

  2. Replaced the unintuitive copyDrbdTcpPortsIfExists call inside copyDrbdNodeIdIfExists with a new copyDrbdSettings wrapper that calls both methods. All call sites now use copyDrbdSettings except the needsDeactivate path which only needs copyDrbdTcpPortsIfExists (node-id changes for shared storage).

kvaps and others added 3 commits March 30, 2026 15:10
Remove redundant ensureStackDataExists() call with empty payload from
resetStoragePools() method that was causing TCP port conflicts after
toggle-disk operations.

Root Cause:
-----------
The resetStoragePools() method, introduced in 2019 (commit 95cc17d),
calls ensureStackDataExists() with an empty LayerPayload. This worked
correctly when TCP ports were stored at RscDfn level.

After the TCP port migration to per-node level (commit f754943, May
2025), this empty payload results in DrbdRscData being created without
TCP ports assigned. The controller then sends a Pojo with an empty port
Set to satellites.

On satellites, when DrbdRscData is initialized with an empty port list,
initPorts() uses preferredNewPortsRef from peer resources. Since
SatelliteDynamicNumberPool.tryAllocate() always returns true (no-op),
any port from preferredNewPortsRef is accepted without conflict checking,
leading to duplicate TCP port assignments.

Impact:
-------
This regression affects toggle-disk operations, particularly:
- Snapshot creation/restore operations
- Manual toggle-disk operations
- Any operation calling resetStoragePools()

Symptoms include:
- DRBD resources failing to adjust with "port is also used" errors
- Resources stuck in StandAlone or Connecting states
- Multiple resources on the same node using identical TCP ports

Solution:
---------
Remove the ensureStackDataExists() call from resetStoragePools() as it
is redundant. The calling code (e.g., CtrlRscToggleDiskApiCallHandler
line 1071) already invokes ensureStackDataExists() with the correct
payload immediately after resetStoragePools().

This fix ensures:
1. resetStoragePools() only resets storage pool assignments
2. Layer data creation with proper TCP ports happens via the caller's
   ensureStackDataExists() with correct payload
3. No DrbdRscData objects are created without TCP port assignments

Related Issues:
---------------
Fixes LINBIT#454 - Duplicate TCP ports after backup/restore operations
Related to user reports of resources stuck in StandAlone after node
reboots when toggle-disk or backup operations were in progress.

Testing:
--------
Verified that:
- Toggle-disk operations no longer create resources without TCP ports
- Backup/restore operations complete without TCP port conflicts
- Resources maintain unique TCP ports across toggle-disk cycles

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
When removeLayerData() deletes DrbdRscData during toggle-disk, the TCP
ports are freed from the number pool. ensureStackDataExists() then
allocates new ports which may differ from the old ones if other resources
claimed them in the meantime.

The controller correctly avoids collisions in its own number pool, but
if the satellite misses the update (e.g. due to controller restart or
connectivity issues), it keeps the old ports while peers receive the
new ones, causing DRBD connections to fail with StandAlone state.

Add copyDrbdTcpPortsIfExists() to save existing TCP ports into the
LayerPayload before removeLayerData() deletes them. Call it from
copyDrbdNodeIdIfExists() (covers both toggle-disk paths) and from
the needsDeactivate path (shared storage pool case).

This ensures the same TCP ports are reused when DrbdRscData is
recreated, eliminating the window for port mismatch between controller
and satellites.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
- Introduce copyDrbdSettings() wrapper instead of calling
  copyDrbdTcpPortsIfExists() from within copyDrbdNodeIdIfExists()
- Replace calls to copyDrbdNodeIdIfExists() with copyDrbdSettings()
  in both toggle-disk paths
- Make ensureStackDataExists() in resetStoragePools() optional via
  boolean parameter to prevent accidental omission in future callers
  while allowing CtrlRscToggleDiskApiCallHandler to skip the redundant
  call

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps kvaps force-pushed the fix/drbd-duplicate-tcp-ports-after-toggle-disk branch from 3a3040a to bcc8990 Compare March 30, 2026 13:12
@kvaps kvaps requested a review from ghernadi March 30, 2026 18:06
kvaps added a commit to cozystack/cozystack that referenced this pull request Mar 30, 2026
## What this PR does

Updates the `fix-duplicate-tcp-ports` patch to preserve existing TCP
ports when DrbdRscData is recreated during toggle-disk operations.

Without this fix, `removeLayerData()` frees TCP ports from the number
pool, and `ensureStackDataExists()` may allocate different ports. If the
satellite misses the update (e.g. due to controller restart), it keeps
the old ports while peers receive the new ones, causing DRBD connections
to fail with StandAlone state.

The fix adds `copyDrbdTcpPortsIfExists()` which saves existing TCP ports
into the `LayerPayload` before `removeLayerData()` deletes them.

Also adds `dh_strip_nondeterminism` override in Dockerfile to fix build
failures on some JAR files.

Upstream:
LINBIT/linstor-server#476 (comment)

### Release note

\`\`\`release-note
[linstor] Fix TCP port mismatches after toggle-disk operations that
could cause DRBD resources to enter StandAlone state
\`\`\`

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Fixed an issue where DRBD TCP ports were not correctly preserved
during disk toggle operations, which could result in TCP port mismatches
between the controller and satellite nodes.
* Improved robustness of the build and packaging process by addressing
non-determinism handling for Java library dependencies.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
github-actions bot pushed a commit to cozystack/cozystack that referenced this pull request Mar 30, 2026
Update fix-duplicate-tcp-ports patch to preserve existing TCP ports when
DrbdRscData is recreated during toggle-disk operations. Without this,
removeLayerData() frees ports and ensureStackDataExists() may allocate
different ones, causing port mismatches between controller and satellites
if the satellite misses the update.

Also add dh_strip_nondeterminism override in Dockerfile to fix build
failures on some JAR files.

Upstream: LINBIT/linstor-server#476 (comment)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
(cherry picked from commit 812d413)
github-actions bot pushed a commit to cozystack/cozystack that referenced this pull request Mar 30, 2026
Update fix-duplicate-tcp-ports patch to preserve existing TCP ports when
DrbdRscData is recreated during toggle-disk operations. Without this,
removeLayerData() frees ports and ensureStackDataExists() may allocate
different ones, causing port mismatches between controller and satellites
if the satellite misses the update.

Also add dh_strip_nondeterminism override in Dockerfile to fix build
failures on some JAR files.

Upstream: LINBIT/linstor-server#476 (comment)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
(cherry picked from commit 812d413)
@kvaps
Copy link
Copy Markdown
Author

kvaps commented Apr 3, 2026

We've integrated this change into Cozystack as part of cozystack/cozystack#2331

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Error creating rollback entry - Resource SecObjectProtection not found

2 participants