fix(linstor): verify resource deletion completes; warn if stuck in DELETING #13076

jmsperu wants to merge 1 commit into apache:main from …stuck

Conversation
The LINSTOR plugin treats a successful HTTP response from
resourceDefinitionDelete as proof the resource is gone, then drops
the volume from CloudStack's accounting. In practice LINSTOR can
return success while the resource lingers in DELETING state — for
example when a DRBD peer is unreachable, quorum was lost, or a
satellite is down. The plugin had no retry, no verification, and no
sweeper. Operators have been finding hundreds of stuck DELETING
resources accumulated over weeks because nothing surfaced the
divergence between the CS view and the LINSTOR view.
This change adds two helpers to LinstorUtil:
isResourceDefinitionGone(api, rscName)
- quick existence check via resourceDefinitionList
waitForResourceDefinitionDeleted(api, rscName, timeoutMillis)
- polls every second until the resource is gone OR timeout elapses
- returns true on confirmed-gone, false on timeout
and calls waitForResourceDefinitionDeleted from both delete sites
(driver: LinstorPrimaryDataStoreDriverImpl.deleteResourceDefinition;
adaptor: LinstorStorageAdaptor.deRefOrDeleteResource) with a 30s
default timeout. On timeout the plugin logs a WARN with the resource
name and a hint pointing at `linstor resource list`. We deliberately
do NOT throw on timeout: the CS-side accounting has already moved
on, and throwing would create a different inconsistency.
This is the minimal Tier-1 fix that surfaces the problem in the
operator's view. A follow-up could add a periodic sweeper that
attempts force-delete on long-stuck DELETING resources.
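The verify-then-warn call-site shape described in this commit message can be sketched as a small standalone program. This is an illustrative sketch, not the PR's actual code: a `BooleanSupplier` stands in for the LINSTOR existence check, `System.err` stands in for the plugin's LOGGER, and the class and method names here are hypothetical.

```java
import java.util.function.BooleanSupplier;

// Sketch of the pattern: after a successful API-level delete, poll for the
// resource definition to actually disappear; on timeout, WARN but do not throw.
public class DeleteVerifySketch {
    static final long DEFAULT_TIMEOUT_MILLIS = 30_000L; // 30s, as in the PR

    // Polls the existence check until gone or the deadline passes.
    static boolean waitForGone(BooleanSupplier isGone, long timeoutMillis, long pollMillis) {
        final long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (isGone.getAsBoolean()) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    // Call-site shape: verify after delete; deliberately no exception on timeout,
    // because CloudStack-side accounting has already moved on.
    static void afterApiDelete(String rscName, BooleanSupplier isGone) {
        if (!waitForGone(isGone, DEFAULT_TIMEOUT_MILLIS, 1_000L)) {
            System.err.printf(
                "Resource definition %s still present after delete; check 'linstor resource list'%n",
                rscName);
        }
    }
}
```

A caller that can already see the resource is gone returns on the first poll without sleeping, so the happy path adds essentially no latency.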
Pull request overview
This PR adds post-delete verification for LINSTOR resource definitions to surface cases where LINSTOR returns success but the resource definition remains stuck in DELETING, improving operational visibility while keeping CloudStack-side deletion semantics unchanged.
Changes:
- Add `isResourceDefinitionGone` and `waitForResourceDefinitionDeleted` helpers (with a 30s default timeout) to poll for confirmed deletion.
- Invoke the delete-verification helper from both management-server and KVM agent delete paths.
- Emit WARN logs when the resource definition still exists after the verification timeout.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| plugins/storage/volume/linstor/src/main/java/org/apache/cloudstack/storage/datastore/util/LinstorUtil.java | Adds delete-verification helpers and a default timeout constant. |
| plugins/storage/volume/linstor/src/main/java/org/apache/cloudstack/storage/datastore/driver/LinstorPrimaryDataStoreDriverImpl.java | Verifies RD deletion after successful API delete; warns if still present. |
| plugins/storage/volume/linstor/src/main/java/com/cloud/hypervisor/kvm/storage/LinstorStorageAdaptor.java | Verifies RD deletion after successful API delete in agent path; warns if still present. |
```java
public static boolean waitForResourceDefinitionDeleted(DevelopersApi api, String rscName, long timeoutMillis) {
    final long deadline = System.currentTimeMillis() + timeoutMillis;
    while (true) {
        try {
            if (isResourceDefinitionGone(api, rscName)) {
                return true;
            }
        } catch (ApiException e) {
            LOGGER.debug("LINSTOR delete-verify poll failed for {}: {}", rscName, e.getMessage());
            // Keep polling — controller may be transiently unavailable.
        }
        if (System.currentTimeMillis() >= deadline) {
            return false;
        }
        try {
            Thread.sleep(1_000L);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```
New delete-verification behavior in waitForResourceDefinitionDeleted / isResourceDefinitionGone isn’t covered by the existing LinstorUtilTest suite. Please add unit tests for (a) confirmed-gone case, and (b) timeout/exception paths (ideally without introducing real sleeps) so regressions in the polling logic are caught in CI.
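One way to make the polling logic testable without real sleeps, as this comment requests, is to inject the clock and the sleep. The following is a minimal sketch under that assumption; the PR itself calls `System.currentTimeMillis()` and `Thread.sleep` directly, and the names here are illustrative, not from the PR.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

// Sketch: same polling semantics as waitForResourceDefinitionDeleted, but with
// the time source and the sleep behind injectable seams so tests run instantly.
public class PollSketch {
    interface Sleeper {
        void sleep(long millis) throws InterruptedException;
    }

    static boolean waitUntilGone(BooleanSupplier isGone, long timeoutMillis,
                                 LongSupplier clock, Sleeper sleeper) {
        final long deadline = clock.getAsLong() + timeoutMillis;
        while (true) {
            if (isGone.getAsBoolean()) {
                return true;  // confirmed-gone path
            }
            if (clock.getAsLong() >= deadline) {
                return false; // timeout path
            }
            try {
                sleeper.sleep(1_000L);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

In a unit test the fake sleeper simply advances a fake clock, so both the confirmed-gone and the timeout paths execute in microseconds and CI never blocks on a real one-second poll interval.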
```java
List<ResourceDefinition> all = api.resourceDefinitionList(null, false, null, null, null);
if (all == null) {
    return true;
}
return all.stream().noneMatch(rd -> rscName.equalsIgnoreCase(rd.getName()));
```
isResourceDefinitionGone currently calls resourceDefinitionList(null, ...), which fetches all resource definitions and then streams to find a match. This can be expensive (and will be repeated every second in waitForResourceDefinitionDeleted). Use the API's name filter instead (e.g., resourceDefinitionList(Collections.singletonList(rscName), ...)) and treat an empty result as "gone" to reduce controller load and payload size.
Suggested change:

```diff
-List<ResourceDefinition> all = api.resourceDefinitionList(null, false, null, null, null);
-if (all == null) {
-    return true;
-}
-return all.stream().noneMatch(rd -> rscName.equalsIgnoreCase(rd.getName()));
+List<ResourceDefinition> matches = api.resourceDefinitionList(Collections.singletonList(rscName), false, null, null, null);
+if (matches == null || matches.isEmpty()) {
+    return true;
+}
+return matches.stream().noneMatch(rd -> rscName.equalsIgnoreCase(rd.getName()));
```
```java
/**
 * Default per-call timeout for {@link #waitForResourceDefinitionDeleted}. Long enough for a
 * healthy LINSTOR controller to finish a normal delete; short enough not to block the calling
 * agent thread for too long if the delete is genuinely stuck.
```
DEFAULT_RD_DELETE_VERIFY_TIMEOUT_MILLIS Javadoc mentions "the calling agent thread", but this constant is also used on the management-server side (e.g., LinstorPrimaryDataStoreDriverImpl). Consider rewording to "calling thread" (or mention both contexts) to keep the comment accurate.
Suggested change:

```diff
- * agent thread for too long if the delete is genuinely stuck.
+ * thread for too long if the delete is genuinely stuck.
```
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@             Coverage Diff              @@
##               main   #13076      +/-  ##
============================================
- Coverage     18.02%   18.02%   -0.01%
+ Complexity    16621    16620       -1
============================================
  Files          6029     6029
  Lines        542184   542213      +29
  Branches      66451    66457       +6
============================================
- Hits          97740    97737       -3
- Misses       433428   433460      +32
  Partials      11016    11016
```
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Summary

The LINSTOR plugin treats a successful HTTP response from `resourceDefinitionDelete` as proof the resource is gone and immediately drops the volume from CloudStack's accounting. In practice LINSTOR can return success while the resource lingers in DELETING state — for example when a DRBD peer is unreachable, quorum was lost, or a satellite is down. The plugin has no retry, no verification, and no sweeper. We've found hundreds of stuck DELETING resources accumulated over weeks because nothing surfaces the divergence between the CS view and the LINSTOR view.

What this PR does

Adds two helpers to `LinstorUtil`:

- `isResourceDefinitionGone(api, rscName)` — existence check via `resourceDefinitionList`
- `waitForResourceDefinitionDeleted(api, rscName, timeoutMillis)` — polls every 1s until the resource is gone OR the timeout elapses; returns `true` on confirmed-gone, `false` on timeout

Calls `waitForResourceDefinitionDeleted` from both delete paths:

- `LinstorPrimaryDataStoreDriverImpl.deleteResourceDefinition` (driver / management-server side)
- `LinstorStorageAdaptor.deRefOrDeleteResource` (host / KVM agent side)

Default timeout is 30 seconds (`DEFAULT_RD_DELETE_VERIFY_TIMEOUT_MILLIS`).

Behaviour

On timeout the plugin logs a WARN with the resource name and a hint pointing at `linstor resource list`, and returns success to the caller (the CS-side accounting has already moved on; throwing here would create a different inconsistency).

What this does NOT do (deferred)

- No automatic retry, force-delete, or sweeper; a follow-up could add a periodic sweeper that attempts force-delete on long-stuck DELETING resources.
Test plan

- … `wait` would otherwise be (immediate return after first poll confirms gone)
- … `linstor resource list` until manually cleared