fix(linstor): verify resource deletion completes; warn if stuck in DELETING #13076

jmsperu wants to merge 1 commit into apache:main from …stuck

Conversation
The LINSTOR plugin treats a successful HTTP response from
resourceDefinitionDelete as proof the resource is gone, then drops
the volume from CloudStack's accounting. In practice LINSTOR can
return success while the resource lingers in DELETING state — for
example when a DRBD peer is unreachable, quorum was lost, or a
satellite is down. The plugin had no retry, no verification, and no
sweeper. Operators have been finding hundreds of stuck DELETING
resources accumulated over weeks because nothing surfaced the
divergence between the CS view and the LINSTOR view.
This change adds two helpers to LinstorUtil:
isResourceDefinitionGone(api, rscName)
- quick existence check via resourceDefinitionList
waitForResourceDefinitionDeleted(api, rscName, timeoutMillis)
- polls every second until the resource is gone OR timeout elapses
- returns true on confirmed-gone, false on timeout
and calls waitForResourceDefinitionDeleted from both delete sites
(driver: LinstorPrimaryDataStoreDriverImpl.deleteResourceDefinition;
adaptor: LinstorStorageAdaptor.deRefOrDeleteResource) with a 30s
default timeout. On timeout the plugin logs a WARN with the resource
name and a hint pointing at `linstor resource list`. We deliberately
do NOT throw on timeout: the CS-side accounting has already moved
on, and throwing would create a different inconsistency.
This is the minimal Tier-1 fix that surfaces the problem in the
operator's view. A follow-up could add a periodic sweeper that
attempts force-delete on long-stuck DELETING resources.
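The verify-then-warn call-site shape described in this commit message can be sketched as a small standalone program. This is an illustrative sketch, not the PR's actual code: a `BooleanSupplier` stands in for the LINSTOR existence check, `System.err` stands in for the plugin's LOGGER, and the class and method names here are hypothetical.

```java
import java.util.function.BooleanSupplier;

// Sketch of the pattern: after a successful API-level delete, poll for the
// resource definition to actually disappear; on timeout, WARN but do not throw.
public class DeleteVerifySketch {
    static final long DEFAULT_TIMEOUT_MILLIS = 30_000L; // 30s, as in the PR

    // Polls the existence check until gone or the deadline passes.
    static boolean waitForGone(BooleanSupplier isGone, long timeoutMillis, long pollMillis) {
        final long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (isGone.getAsBoolean()) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    // Call-site shape: verify after delete; deliberately no exception on timeout,
    // because CloudStack-side accounting has already moved on.
    static void afterApiDelete(String rscName, BooleanSupplier isGone) {
        if (!waitForGone(isGone, DEFAULT_TIMEOUT_MILLIS, 1_000L)) {
            System.err.printf(
                "Resource definition %s still present after delete; check 'linstor resource list'%n",
                rscName);
        }
    }
}
```

A caller that can already see the resource is gone returns on the first poll without sleeping, so the happy path adds essentially no latency.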
Pull request overview
This PR adds post-delete verification for LINSTOR resource definitions to surface cases where LINSTOR returns success but the resource definition remains stuck in DELETING, improving operational visibility while keeping CloudStack-side deletion semantics unchanged.
Changes:
- Add `isResourceDefinitionGone` and `waitForResourceDefinitionDeleted` helpers (with a 30s default timeout) to poll for confirmed deletion.
- Invoke the delete-verification helper from both management-server and KVM agent delete paths.
- Emit WARN logs when the resource definition still exists after the verification timeout.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| plugins/storage/volume/linstor/src/main/java/org/apache/cloudstack/storage/datastore/util/LinstorUtil.java | Adds delete-verification helpers and a default timeout constant. |
| plugins/storage/volume/linstor/src/main/java/org/apache/cloudstack/storage/datastore/driver/LinstorPrimaryDataStoreDriverImpl.java | Verifies RD deletion after successful API delete; warns if still present. |
| plugins/storage/volume/linstor/src/main/java/com/cloud/hypervisor/kvm/storage/LinstorStorageAdaptor.java | Verifies RD deletion after successful API delete in agent path; warns if still present. |
```java
public static boolean waitForResourceDefinitionDeleted(DevelopersApi api, String rscName, long timeoutMillis) {
    final long deadline = System.currentTimeMillis() + timeoutMillis;
    while (true) {
        try {
            if (isResourceDefinitionGone(api, rscName)) {
                return true;
            }
        } catch (ApiException e) {
            LOGGER.debug("LINSTOR delete-verify poll failed for {}: {}", rscName, e.getMessage());
            // Keep polling — controller may be transiently unavailable.
        }
        if (System.currentTimeMillis() >= deadline) {
            return false;
        }
        try {
            Thread.sleep(1_000L);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```
New delete-verification behavior in waitForResourceDefinitionDeleted / isResourceDefinitionGone isn’t covered by the existing LinstorUtilTest suite. Please add unit tests for (a) confirmed-gone case, and (b) timeout/exception paths (ideally without introducing real sleeps) so regressions in the polling logic are caught in CI.
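One way to make the polling logic testable without real sleeps, as this comment requests, is to inject the clock and the sleep. The following is a minimal sketch under that assumption; the PR itself calls `System.currentTimeMillis()` and `Thread.sleep` directly, and the names here are illustrative, not from the PR.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

// Sketch: same polling semantics as waitForResourceDefinitionDeleted, but with
// the time source and the sleep behind injectable seams so tests run instantly.
public class PollSketch {
    interface Sleeper {
        void sleep(long millis) throws InterruptedException;
    }

    static boolean waitUntilGone(BooleanSupplier isGone, long timeoutMillis,
                                 LongSupplier clock, Sleeper sleeper) {
        final long deadline = clock.getAsLong() + timeoutMillis;
        while (true) {
            if (isGone.getAsBoolean()) {
                return true;  // confirmed-gone path
            }
            if (clock.getAsLong() >= deadline) {
                return false; // timeout path
            }
            try {
                sleeper.sleep(1_000L);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

In a unit test the fake sleeper simply advances a fake clock, so both the confirmed-gone and the timeout paths execute in microseconds and CI never blocks on a real one-second poll interval.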
```java
List<ResourceDefinition> all = api.resourceDefinitionList(null, false, null, null, null);
if (all == null) {
    return true;
}
return all.stream().noneMatch(rd -> rscName.equalsIgnoreCase(rd.getName()));
```
isResourceDefinitionGone currently calls resourceDefinitionList(null, ...), which fetches all resource definitions and then streams to find a match. This can be expensive (and will be repeated every second in waitForResourceDefinitionDeleted). Use the API's name filter instead (e.g., resourceDefinitionList(Collections.singletonList(rscName), ...)) and treat an empty result as "gone" to reduce controller load and payload size.
Suggested change:

```diff
-List<ResourceDefinition> all = api.resourceDefinitionList(null, false, null, null, null);
-if (all == null) {
-    return true;
-}
-return all.stream().noneMatch(rd -> rscName.equalsIgnoreCase(rd.getName()));
+List<ResourceDefinition> matches = api.resourceDefinitionList(Collections.singletonList(rscName), false, null, null, null);
+if (matches == null || matches.isEmpty()) {
+    return true;
+}
+return matches.stream().noneMatch(rd -> rscName.equalsIgnoreCase(rd.getName()));
```
```java
/**
 * Default per-call timeout for {@link #waitForResourceDefinitionDeleted}. Long enough for a
 * healthy LINSTOR controller to finish a normal delete; short enough not to block the calling
 * agent thread for too long if the delete is genuinely stuck.
```
DEFAULT_RD_DELETE_VERIFY_TIMEOUT_MILLIS Javadoc mentions "the calling agent thread", but this constant is also used on the management-server side (e.g., LinstorPrimaryDataStoreDriverImpl). Consider rewording to "calling thread" (or mention both contexts) to keep the comment accurate.
Suggested change:

```diff
- * agent thread for too long if the delete is genuinely stuck.
+ * thread for too long if the delete is genuinely stuck.
```
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@             Coverage Diff              @@
##               main   #13076      +/-  ##
============================================
- Coverage     18.02%   18.02%   -0.01%
+ Complexity    16621    16620       -1
============================================
  Files          6029     6029
  Lines        542184   542213      +29
  Branches      66451    66457       +6
============================================
- Hits          97740    97737       -3
- Misses       433428   433460      +32
  Partials      11016    11016
```
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Summary

The LINSTOR plugin treats a successful HTTP response from `resourceDefinitionDelete` as proof the resource is gone and immediately drops the volume from CloudStack's accounting. In practice LINSTOR can return success while the resource lingers in DELETING state — for example when a DRBD peer is unreachable, quorum was lost, or a satellite is down. The plugin has no retry, no verification, and no sweeper. We've found hundreds of stuck DELETING resources accumulated over weeks because nothing surfaces the divergence between the CS view and the LINSTOR view.

What this PR does

Adds two helpers to `LinstorUtil`:

- `isResourceDefinitionGone(api, rscName)` — existence check via `resourceDefinitionList`
- `waitForResourceDefinitionDeleted(api, rscName, timeoutMillis)` — polls every 1s until the resource is gone OR the timeout elapses; returns `true` on confirmed-gone, `false` on timeout

Calls `waitForResourceDefinitionDeleted` from both delete paths:

- `LinstorPrimaryDataStoreDriverImpl.deleteResourceDefinition` (driver / management-server side)
- `LinstorStorageAdaptor.deRefOrDeleteResource` (host / KVM agent side)

Default timeout is 30 seconds (`DEFAULT_RD_DELETE_VERIFY_TIMEOUT_MILLIS`).

Behaviour

On timeout the plugin logs a WARN with the resource name and a hint pointing at `linstor resource list`, and returns success to the caller (the CS-side accounting has already moved on; throwing here would create a different inconsistency).

What this does NOT do (deferred)

- No automatic retry, force-delete, or sweeper; a follow-up could add a periodic sweeper that attempts force-delete on long-stuck DELETING resources.
Test plan

- … `wait` would otherwise be (immediate return after first poll confirms gone)
- … `linstor resource list` until manually cleared